Robert Schaefer Foundations of Global Genetic Optimization
Studies in Computational Intelligence, Volume 74 Editor-in-chief Prof. Janusz Kacprzyk Systems Research Institute Polish Academy of Sciences ul. Newelska 6 01-447 Warsaw Poland E-mail:
[email protected] Further volumes of this series can be found on our homepage: springer.com
Robert Schaefer
Foundations of Global Genetic Optimization Chapter 6 written by Henryk Telega
With 44 Figures
Prof. Robert Schaefer Department of Computer Science AGH University of Science and Technology Mickiewicza 30 30-059 Krak´ow Poland E-mail:
[email protected]
Chapter 6 written by: Henryk Telega Institute of Computer Science Jagiellonian University Nawojki 11 30-072 Krak´ow Poland E-mail:
[email protected]
Library of Congress Control Number: 2007929548 ISSN print edition: 1860-949X ISSN electronic edition: 1860-9503 ISBN 978-3-540-73191-7 Springer Berlin Heidelberg New York This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable to prosecution under the German Copyright Law. Springer is a part of Springer Science+Business Media springer.com c Springer-Verlag Berlin Heidelberg 2007 ° The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. Cover design: deblik, Berlin Typesetting by the SPi using a Springer LATEX macro package Printed on acid-free paper SPIN: 11733416 89/SPi
Contents
1 Introduction . . . 1

2 Global optimization problems . . . 7
  2.1 Definitions of global optimization problems . . . 7
  2.2 General schema of a stochastic search . . . 13
  2.3 Basic features of stochastic algorithms of global optimization . . . 19
  2.4 Genetic algorithms in action – solution of inverse problems in the mechanics of continua . . . 21

3 Basic models of genetic computations . . . 31
  3.1 Encoding and inverse encoding . . . 31
    3.1.1 Binary affine encoding . . . 34
    3.1.2 Gray encoding . . . 37
    3.1.3 Phenotypic encoding . . . 38
  3.2 Objective and fitness . . . 38
  3.3 The individual and population models . . . 39
  3.4 Selection . . . 40
    3.4.1 Proportional (roulette) selection . . . 41
    3.4.2 Tournament selection . . . 42
    3.4.3 Elitist selection . . . 42
    3.4.4 Rank selection . . . 42
  3.5 Binary genetic operations . . . 43
    3.5.1 Multi-point mutation . . . 44
    3.5.2 Binary crossover . . . 44
    3.5.3 Features of binary genetic operations, mixing . . . 46
  3.6 Definition of the Simple Genetic Algorithm (SGA) . . . 47
  3.7 Phenotypic genetic operations . . . 48
    3.7.1 Phenotypic mutation . . . 48
    3.7.2 Phenotypic crossover . . . 49
    3.7.3 Phenotypic operations in constrained domains . . . 50
  3.8 Schemes for creating a new generation . . . 51
  3.9 µ, λ – taxonomy of single- and multi-deme strategies . . . 52

4 Asymptotic behavior of the artificial genetic systems . . . 55
  4.1 Markov theory of genetic algorithms . . . 55
    4.1.1 Markov chains in genetic algorithm asymptotic analysis . . . 57
    4.1.2 Markov theory of the Simple Genetic Algorithm . . . 61
    4.1.3 The results of the Markov theory for Evolutionary Algorithm . . . 87
  4.2 Asymptotic results for very small populations . . . 96
    4.2.1 The rate of convergence of the single individual population with hard succession . . . 96
    4.2.2 The dynamics of double individual populations with proportional selection . . . 98
  4.3 The increment of the schemata cardinality in the single evolution epoch . . . 105
  4.4 Summary of practicals coming from asymptotic theory . . . 111

5 Adaptation in genetic search . . . 115
  5.1 Adaptation and self-adaptation in genetic search . . . 115
  5.2 The taxonomy of adaptive genetic strategies . . . 117
  5.3 Single- and twin-population strategies (α) . . . 122
    5.3.1 Adaptation of genetic operation parameters (α.1) . . . 122
    5.3.2 Strategies with a variable life time of individuals (α.2) . . . 127
    5.3.3 Selection of the operation from the operation set (α.3) . . . 130
    5.3.4 Introducing local optimization methods to the evolution (α.4) . . . 133
    5.3.5 Fitness modification (α.5) . . . 135
    5.3.6 Additional replacement of individuals (α.6) . . . 139
    5.3.7 Speciation (α.7) . . . 141
    5.3.8 Variable accuracy searches (α.8) . . . 142
  5.4 Multi-deme strategies (β) . . . 144
    5.4.1 Metaevolution (β.1) . . . 145
    5.4.2 Island models (β.2) . . . 146
    5.4.3 Hierarchic Genetic Strategies (β.3) . . . 147
    5.4.4 Inductive Genetic Programming (iGP) (β.4) . . . 151

6 Two-phase stochastic global optimization strategies . . . 153
  6.1 Overview of two-phase stochastic global strategies . . . 153
    6.1.1 Global phase . . . 153
    6.1.2 Global phase - why stochastic methods? . . . 154
    6.1.3 Local phase . . . 155
    6.1.4 Pure Random Search (PRS), Single-Start, Multistart . . . 156
    6.1.5 Properties of PRS, Single-Start and Multistart . . . 157
    6.1.6 Clustering methods in continuous global optimization . . . 158
    6.1.7 Analysis of the reduction phase . . . 160
    6.1.8 Density Clustering . . . 162
    6.1.9 Single Linkage . . . 166
    6.1.10 Mode Analysis . . . 168
    6.1.11 Multi Level Single Linkage and Multi Level Mode Analysis . . . 170
    6.1.12 Topographic methods (TGO, TMSL) . . . 173
  6.2 Stopping Rules . . . 174
    6.2.1 Non-sequential rules . . . 176
    6.2.2 Sequential rules - optimal and suboptimal Bayesian stopping rules . . . 177
    6.2.3 Stopping rules that use values of the objective function . . . 179
  6.3 Two-phase genetic methods . . . 179
    6.3.1 The idea of Clustered Genetic Search (CGS) . . . 179
    6.3.2 Description of the algorithm . . . 181
    6.3.3 Stopping rules and asymptotic properties . . . 183
    6.3.4 Illustration of performance of Clustered Genetic Search . . . 185

7 Summary and perspectives of genetic algorithms in continuous global optimization . . . 199

References . . . 207

Index . . . 219
List of Figures
2.1 Level sets Lx+(ȳ), L̃x+(ȳ) and the basin of attraction Bx+ of the local isolated minimizer x+ for the function Φ̃ : R → R+ . . . 12
2.2 General schema of the population-oriented, stochastic global optimization search. . . . 14
2.3 Initial shape of the truss. . . . 22
2.4 Final shape of the truss after topological optimization. . . . 23
2.5 Loading of the plate at the defect identification problem. . . . 24
2.6 Fitness of the best individual in consecutive epochs by defect identification in plate. . . . 25
2.7 The defects identification by the best individual phenotype in consecutive epochs: 1 - 1st, 2 - 10th, 3 - 50th, 4 - 100th. . . . 25
2.8 Loading and the sensor location on the defected rod. . . . 26
2.9 Location of defects identified in the rod versus real life ones after 200 epochs. . . . 26
2.10 The results of Clustered Genetic Search for the strategy tuned for two noticeable local minimizers. . . . 28
2.11 The results of Clustered Genetic Search for the strategy tuned for less noticeable local minimizers. . . . 29

3.1 Binary affine encoding in two dimensions . . . 35
3.2 Producing new population by reproduction and succession . . . 52

4.1 Markov sampling scheme for the classical genetic algorithms. . . . 58
4.2 One-dimensional Rastrigin function (see formula 4.31). . . . 81
4.3 The limit sampling measure associated with the one-dimensional Rastrigin function. . . . 81
4.4 Twin Gauss peaks with different mean values. . . . 82
4.5 The limit sampling measure associated with twin Gauss peaks with different mean values. . . . 82
4.6 Twin Gauss peaks with different mean and different standard deviation values. . . . 83
4.7 The limit sampling measure associated with twin Gauss peaks with different mean and different standard deviation values. . . . 83
4.8 The scheme of the evolutionary algorithm with elitist selection on the population level and deterministic passage of the best fitted individual. . . . 90
4.9 Expected behavior of the two-individual population. . . . 101
4.10 Trajectories for the expected two-individual population in case of the unimodal fitness f(x) = exp(−5x²). White circles mark the starting positions of populations, black dots the positions after the first epoch while crosses mark the positions after 20 epochs. . . . 103
4.11 Trajectories for the expected two-individual population in case of the bimodal fitness f(x) = exp(−5x²) − 2 exp(−5(x − 1)²). White circles mark the starting positions of populations, black dots the positions after the first epoch while crosses mark the positions after 20 epochs. . . . 104

5.1 Sampling scheme for the adaptive genetic algorithms. . . . 116
5.2 The tree of adaptation techniques. Part 1. . . . 119
5.3 The tree of adaptation techniques. Part 2. . . . 120
5.4 The tree of adaptation techniques. Part 3. . . . 121
5.5 The fitness flattening resulting from the local method application by individual evaluation (Baldwin effect). . . . 134
5.6 One dimensional nested meshes (D ⊂ R) for Hierarchical Genetic Strategy in the case s1 = 2, s2 = 3, s3 = 5. . . . 149
5.7 Genetic universa in the HGS-RN strategy . . . 150

6.1 Modification of fitness function in recognized clusters. . . . 182
6.2 Joining of subclusters. . . . 183
6.3 Rastrigin function for −0.5 ≤ x ≤ 0.5, −0.5 ≤ y ≤ 0.5. . . . 187
6.4 Rastrigin function for −10 ≤ x ≤ 10, −10 ≤ y ≤ 10. . . . 188
6.5 Rastrigin function, graphic presentation of the results of CGS. . . . 188
6.6 Results of CGS after fine tuning. . . . 189
6.7 Rastrigin function for −10 ≤ x ≤ 10, −10 ≤ y ≤ 10. . . . 191
6.8 Results of CGS after fine tuning. . . . 191
6.9 Rosenbrock function. . . . 192
6.10 Sine of a product. . . . 194
6.11 A test function with large plateau. . . . 194
6.12 Results of CGS. . . . 195
6.13 Results of CGS for different parameters. . . . 196
List of Algorithms
1 Draft of the ESSS-DOF strategy . . . 138
2 Pure Random Search . . . 157
3 Density Clustering . . . 163
4 Single Linkage . . . 167
5 Mode Analysis . . . 170
6 Multi Level Single Linkage version 1 . . . 171
7 Multi Level Single Linkage version 2 . . . 172
8 Parallel version of CGS, Master . . . 181
9 Parallel version of CGS, slaves . . . 182
1 Introduction
Genetic algorithms today constitute a family of effective global optimization methods used to solve difficult real-life problems arising in science and technology. Despite their computational complexity, they are able to explore huge data sets and to handle exceptionally problematic cases in which the objective functions are irregular and multimodal, and where information about the location of extrema cannot be obtained in any other way. They belong to the class of iterative stochastic optimization strategies that, at each step, produce and evaluate a set of admissible points from the search domain, called the random sample or population. As opposed to Monte Carlo strategies, in which the population is sampled according to the uniform probability distribution over the search domain, genetic algorithms modify the sampling probability distribution at each step. The mechanisms that adapt the sampling probability distribution are transposed from biology. They are based mainly on genetic code mutation and crossover, as well as on selection among living individuals. Such mechanisms have been tested by nature on multimodal problems, as confirmed in particular by the many species of animals and plants that are well fitted to different ecological niches. They direct the search process, making it more effective than a completely random one (search with a uniform sampling distribution). Moreover, well-tuned genetic operations do not decrease the ability to explore the whole admissible set, which is vital in global optimization. The features described above allow us to regard genetic algorithms as a new class of artificial intelligence methods which introduce heuristics, well tested in other fields, into the classical scheme of stochastic global search.
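The contrast drawn above, between uniform Monte Carlo sampling and the adaptive sampling of a genetic algorithm, can be made concrete in a few lines of code. This is a minimal hypothetical sketch: the function names and the particular choices of tournament selection, averaging crossover and Gaussian mutation are illustrative and are not the specific operations defined later in this book.

```python
import random

def pure_monte_carlo(objective, bounds, n_samples=1000, seed=0):
    """Uniform sampling: the sampling distribution never changes between steps."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_samples):
        x = [rng.uniform(lo, hi) for lo, hi in bounds]
        if best is None or objective(x) < objective(best):
            best = x
    return best

def genetic_search(objective, bounds, pop_size=20, epochs=50, seed=0):
    """GA-style search: selection, crossover and mutation reshape the
    sampling distribution at every epoch (genetic step)."""
    rng = random.Random(seed)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    for _ in range(epochs):
        def select():
            # tournament selection: the fitter of two random individuals
            a, b = rng.choice(pop), rng.choice(pop)
            return a if objective(a) < objective(b) else b
        new_pop = []
        for _ in range(pop_size):
            p, q = select(), select()
            # arithmetic crossover followed by Gaussian mutation
            child = [(u + v) / 2 + rng.gauss(0.0, 0.05) for u, v in zip(p, q)]
            # clamp back into the admissible box
            child = [min(max(c, lo), hi) for c, (lo, hi) in zip(child, bounds)]
            new_pop.append(child)
        pop = new_pop
    return min(pop, key=objective)
```

Both routines draw the same kind of random samples; the difference is that the second one conditions each new sample on the previous population, which is the defining feature of the class of methods studied in this book.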
Right from the beginning, genetic algorithms have aroused the interest of users and scientists, who try to apply them to engineering problems (optimal design, deposit investigation, defectoscopy, etc.) as well as to explain the essence of their complex behavior. At least ten large international conferences are devoted to the theory and applications of genetic optimization. The most important and well established
seem to be FOGA (Foundations of Genetic Algorithms), GECCO (Genetic and Evolutionary Computation Conference), PPSN (Parallel Problem Solving from Nature) and CEC (IEEE Congress on Evolutionary Computation). Almost all conferences on artificial intelligence, optimization, distributed processing, CAD/CAE and various branches of technology contain sessions or workshops that gather contributions showing specialized applications of genetic computing and their theoretical motivations. Many important events have been organized by national associations; in particular, the National Conference on Genetic Algorithms and Global Optimization (KAEiOG) has taken place in Poland annually since 1996. There are also several scientific journals devoted solely to genetic algorithm theory and applications. It is worth highlighting IEEE Transactions on Evolutionary Computation and Evolutionary Computation among them. Besides the many research papers cited in this book, we would like to draw the reader's attention to monographs that try to synthesize results from various branches of genetic computation and trace new directions in research and applications. The pioneering book in this area, entitled Adaptation in Natural and Artificial Systems, was written by Holland [85] in 1975. The author defined binary genetic operations and related them to the real modifications and inheritance of genetic code. He also tried to deliver a formal description and quantitative evaluation of the artificial genetic process by formulating the popular schemata theorem. Important bibliographical items that show the number and variety of evolutionary optimization techniques, as well as more formal descriptions of algorithms, are the books of Goldberg 1989, [74], Michalewicz 1992, [110] and Koza 1992, [99]. Due to its intentions and large scope, the monograph of Bäck, Fogel and Michalewicz from 1997, [15], as well as its compressed and improved version [10, 11], is impressive.
An exceptional book which discusses parallel models and implementations of genetic computation was written by Cantú-Paz 2000, [45]. Another title of this type, published by Osyczka 2002, [124], summarizes genetic algorithm applications to multi-criteria design and optimization. One well-known book, written in 1999 by Vose, [193], delivers the most important results concerning formal analysis of the so-called Simple Genetic Algorithm (SGA). His approach is based on modeling the SGA as a Markov chain whose trajectories are located in the space of states common to the class of algorithms with different population cardinalities. The main results characterize the SGA asymptotic behavior as the number of iterations (genetic epochs) tends to infinity, as well as for an infinitely growing population size.
Conference and journal web pages:
1. http://www.sigevo.org/foga-2007/
2. http://www.sigevo.org/gecco-2006/
3. http://ls11-www.cs.uni-dortmund.de/PPSN/
4. http://www.cec2007.org/
5. http://kaeiog.elka.pw.edu.pl/
6. http://ieee-cis.org/pubs/tec/
7. http://www.mitpressjournals.org/loi/evco?cookieSet=1
The next important contributions in this area are the chapters written by Rudolph [143, 144, 145] and by Rudolph and Beyer [27], which are integral parts of three books edited by Bäck, Fogel and Michalewicz [15, 10, 11]. They analyze a particular type of convergence of genetic algorithms with real-number, phenotypic encoding. Many distinguished books dealing with genetic algorithm theory have been printed recently. Spears 2000, [178] discusses in detail the role of mutation and recombination in algorithms with a binary genetic universum. Beyer 2001, [26] presents an exhaustive analysis of the progress of evolution strategies under strong regularity assumptions with respect to the fitness. Langdon and Poli 2002, [102] extend some theoretical results which come from binary schemata theory to the case of genetic programming, with the genetic universum being a space of graphs. Reeves and Rowe 2003, [134] present a critical view of the various approaches to studying genetic algorithm theory and discuss the perspectives in this area. Finally, it is worth mentioning two Polish books. The first one, written by Arabas [5], delivers the author's original approach and comments on selected genetic techniques, preceded by a broad mathematical description of single- and multi-criteria optimization problems. The second one [149] marks the beginnings of this book. This work delivers a new approach to studying genetic algorithms by modeling them as dynamic systems which transform the probabilistic sampling measures (probability distributions on the admissible set of solutions) in a regular way. This approach allows us to show that genetic algorithms may effectively find subsets of the search domain rather than isolated points, e.g. the central parts of the basins of attraction of the local minima rather than isolated minimizers.
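The distinction between locating an isolated minimizer and recognizing its basin of attraction can be illustrated by a toy one-dimensional example. This is a hypothetical sketch, not taken from the book: the test function and all names are illustrative.

```python
def gradient_descent(df, x, lr=0.01, steps=2000):
    """Follow the negative gradient df from start point x."""
    for _ in range(steps):
        x -= lr * df(x)
    return x

# f(x) = (x^2 - 1)^2 has two isolated minimizers, -1 and +1.
# Every start point belongs to the basin of the minimizer to which
# its local descent trajectory converges; the basin is a set, not a point.
df = lambda x: 4.0 * x * (x * x - 1.0)

basin_of_plus1 = [x0 / 10 for x0 in range(-20, 21)
                  if gradient_descent(df, x0 / 10) > 0]
```

Here the grid of start points in [-2, 2] is partitioned into two basins; strategies that recognize such sets, rather than single minimizers, are the subject of Chapters 4 and 6.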
This feature reflects the character of the elementary evolutionary mechanisms implemented here, which favor the survival of the whole flock (population) through fast exploration of new feeding regions over the care of a single flock member (individual). The reader's attention will be drawn to those kinds of genetic algorithm instances (e.g. two-phase methods, genetic sample clustering, and sensitivity analysis) for which the above features may guarantee that all solutions are found and the stopping rule is verified. It will also be shown that the traditional use of a genetic algorithm to solve local optimization problems may meet obstacles which arise from the inherent features of this group of methods. We will focus on the ability of genetic algorithms to solve global, continuous optimization problems in which the admissible solutions form a regular subset (with a Lipschitz boundary) of positive Lebesgue measure in a dense, finite dimensional linear-metric space. We do not consider genetic algorithm instances which can only solve discrete optimization problems. We will also omit such features of the common algorithm instances that are valid only for the discrete search domain. Detailed definitions of standard continuous global optimization problems are given at the beginning of this book. Problems which involve finding all
global extremes as well as the predefined class of local extremes are discussed in Section 2.1. New optimization problems, leading to recognizing and approximating sets which are the central parts of the basins of attraction of local minimizers, are also introduced and discussed in this section. In Sections 2.2 and 2.3, a general, abstract scheme of the stochastic, population search and its basic qualitative, asymptotic features are specified. In the light of these formulations, the basic mechanisms of genetic computation, presented in Chapter 3, exhibit their real nature and the directions in which they work. Such an approach is perhaps much less mysterious than the traditional one based on biological analogy. It also allows the synthetic presentation of many details common to quite different algorithm classes (e.g. genetic algorithms with a finite set of codes and evolutionary strategies using phenotypic, real-number encoding). All standard genetic operations presented in Chapter 3 lead to the stationary rule of sampling adaptation, i.e. the rule does not depend on the genetic epoch in which it is applied. In other words, the probability distribution utilized for sampling is obtained in the same way in every epoch, taking only the current population into account. The class of genetic algorithms that permit only stationary adaptation rules will be called self-adaptive genetic algorithms, because the transition rules of the sampling probability distribution are not modified by any external control or by any feedback signals coming from monitoring previous populations. The taxonomy and a short description of adaptive genetic algorithms that break the principle of stationary sampling probability transition are given in Chapter 5. The core of this book, located in Chapter 4, synthesizes the mathematical models of genetic algorithm dynamics and their asymptotic features.
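The stationary rule of sampling adaptation described above can be made concrete with a small sketch: the sampling distribution is computed by the same function in every epoch, from the current population alone, with no epoch counter and no external control. This is a hypothetical illustration using proportional selection; it assumes non-negative fitness values, and all names are illustrative.

```python
import random

def proportional_distribution(population, fitness):
    """Stationary rule: the sampling distribution is a fixed function of
    the current population only (assumes non-negative fitness)."""
    weights = [fitness(ind) for ind in population]
    total = sum(weights)
    return [w / total for w in weights]

def next_epoch(population, fitness, mutate, rng):
    """One state transition: the same rule applies in every epoch,
    which is what makes the process a uniform Markov chain."""
    probs = proportional_distribution(population, fitness)
    return [mutate(rng.choices(population, weights=probs)[0], rng)
            for _ in population]
```

Because `next_epoch` depends only on its arguments and not on the epoch number, the induced state transition is Markovian and time-homogeneous, which is the setting of the asymptotic theory in Chapter 4.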
The main results presented in this chapter are based on the stochastic model that operates on the space of states whose elements are populations or their unambiguous representations. Genetic algorithms are assumed to be self-adaptive, which implies the uniform Markovian rule of state transition. Asymptotic results obtained for the Simple Genetic Algorithm (see Section 4.1.2) are based on features of the so-called genetic operator, introduced by Vose and his co-workers, sometimes called the SGA heuristics (see e.g. [193]). In the same section, theorems concerning the transformation of sampling measures, and their transport from the space of states of the genetic algorithm to the search domain, are considered. These results motivate the application of genetic algorithms in searching and approximating the central parts of the basins of attraction of the local, isolated minimizers. Such results are also helpful in the analysis of two-phase global optimization strategies which utilize genetic algorithms during the first, exploration phase (see Chapter 6). Section 4.1.3 contains results of the Markov theory of (µ + λ)-type evolutionary algorithms with elitist selection. Finally, Section 4.2 summarizes asymptotic results obtained for genetic algorithms with very small populations, and Section 4.3 delivers some comments which lead to the precise formulation and verification of the schemata theorem for SGA.
An important part of this book, written by Henryk Telega, is located in Chapter 6. It delivers a survey of two-phase stochastic global optimization methods. Such strategies consist of finding approximations of the extrema attractors in the first phase, called the global phase, and passing to a detailed, local search in each attractor in the second phase, called the local phase. Probabilistic asymptotic correctness as well as stopping rules of two-phase strategies are also discussed. A new global phase strategy, called Clustered Genetic Search (CGS), which utilizes the genetic sample to recognize the central parts of attractors, is introduced. The advantageous features of genetic algorithms, which regularly transform sampling measures into ones that become denser close to the local extrema, guarantee the proper definition of such a strategy. In particular, using theorems formulated in Chapter 4, the probabilistic asymptotic correctness and the stopping rule of CGS are verified. We have omitted detailed, technical proofs of some cited theorems, remarks and formulas due to the necessary limitation of volume and the assumed engineering profile of this book. Readers are extensively referred to sources in each particular case. Several important computational examples are placed in Section 2.4 in order to demonstrate the skill of genetic algorithms in solving optimal design problems formulated as continuous global optimization ones. The second group of tests exhibits characteristic features of Clustered Genetic Search (CGS) running on a small set of classical multimodal benchmarks (see Chapter 6). Readers require only basic mathematical preparation and maturity at a level typical of MS courses in science, especially in the area of real-valued function analysis, linear algebra, probability theory and stochastic processes.
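The overall shape of a two-phase strategy can be sketched as follows. This is a minimal hypothetical sketch: the global phase simply reduces a uniform sample to its best fraction, standing in for the clustering procedures surveyed in Chapter 6, and the local phase applies a crude coordinate descent; all names and parameters are illustrative.

```python
import random

def two_phase_search(objective, bounds, n_global=300, keep_frac=0.1,
                     local_steps=100, seed=0):
    """Global phase: uniform sample, reduced to its best fraction.
    Local phase: a simple local descent started from each retained point."""
    rng = random.Random(seed)
    sample = [tuple(rng.uniform(lo, hi) for lo, hi in bounds)
              for _ in range(n_global)]
    sample.sort(key=objective)
    seeds = sample[: max(1, int(keep_frac * n_global))]

    def local_descent(x):
        # derivative-free coordinate search with step halving
        x, step = list(x), 0.1
        for _ in range(local_steps):
            improved = False
            for i in range(len(x)):
                for d in (+step, -step):
                    y = list(x)
                    y[i] += d
                    if objective(y) < objective(x):
                        x, improved = y, True
            if not improved:
                step /= 2
        return x

    minima = [local_descent(s) for s in seeds]
    return min(minima, key=objective)
```

In the real strategies of Chapter 6, the reduction step is replaced by clustering of the sample, so that each recognized cluster approximates one attractor and spawns a single local search.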
The book may be recommended in particular to readers who have a basic insight into genetic-based computational algorithms and are looking for an explanation of their quantitative features, as well as for their advanced applications and further development. It may be helpful to engineers solving difficult global optimization problems in technology, economics and the natural sciences. This is especially true in cases of multimodality, weak regularity of the objective function and large volumes of the search domain. It may also inspire researchers studying stochastic optimization and artificial intelligence.
I would like to thank everyone who has played a part in the preparation of this book. Special thanks are directed to Professors Tadeusz Burczyński, Iwona Karcz-Dulęba and Katarzyna Adamska, who have permitted me to include valuable results and computational examples published in their research papers. I am grateful to Professors Krzysztof Malinowski, Roman Galar, Mariusz Flasiński, Zdzisław Denkowski, Marian Jabłoński, Jarosław Arabas and Kazimierz Grygiel for their critical view of the ideas presented in the manuscript and their many useful and detailed suggestions. Finally, I would like to thank my wife Irena for her constant support and considerable help.
Robert Schaefer
2 Global optimization problems
This chapter introduces readers to the world of continuous global optimization problems. We start with detailed definitions of the search space, the admissible domain and the objective function. The most conventional problem, which consists of finding all admissible points at which the objective function attains its global extreme, is the basis of further considerations. The next problems concern finding local extremes. We also consider approximate problems that allow a finite accuracy of data representation. Much space is devoted to the definition of the basins of attraction of local isolated extremes and to the problem of their approximate recognition. Next we introduce the scheme of a population-oriented stochastic global optimization search. Two important instances, the random walk and Pure Random Search, are defined. We have focused on a more formal definition of populations (random samples) and the mathematical operations on them. Moreover, definitions that classify search possibilities and some kinds of convergence are formulated and commented on. The chapter contains several computational examples which show the potential skill of various genetic global optimization strategies in solving difficult continuous engineering problems.
2.1 Definitions of global optimization problems

R. Schaefer: Foundation of Global Genetic Optimization, Studies in Computational Intelligence (SCI) 74, 7–30 (2007). © Springer-Verlag Berlin Heidelberg 2007, www.springerlink.com

Let us denote by V the space of solutions, which is a complete metric space with a distance function d : V × V → R+ and the topology top(V) induced by this metric (see e.g. [162]). Here R+ = {x ∈ R; x ≥ 0} and top(V) is the family of open sets in V. The space V is dense in itself, i.e. each point x ∈ V is a concentration point of V. We also impose that V is the space of points of a finite dimensional affine structure. More precisely, there exists a “space of directions” V̂, which is a finite dimensional Hilbert space (dim(V̂) = N < +∞), together with two mappings such that:

• the first of them, V × V ∋ (x, y) → y − x ∈ V̂, assigns the “joining” vector to an ordered pair of points,

• the second one, V × V̂ ∋ (x, v) → x + v ∈ V, translates the point by the vector.
We assume, moreover, that the distance function induced by the Euclidean norm ‖·‖2 in V̂ is topologically equivalent to the original one in V:

∃ α1, α2 > 0; ∀x, y ∈ V  α1‖y − x‖2 ≤ d(y, x) ≤ α2‖y − x‖2,  where ‖v‖2 = √(v, v), v ∈ V̂   (2.1)

and (·, ·) stands for the scalar product in V̂. A detailed definition of the affine structure may be found in many books on linear algebra and geometry, particularly in Spivak [179]. Such an extended structure of the search space provides us with all the necessary tools for analyzing optimization problems and methods. In particular, we are able to consider the neighborhood of an extreme x* ∈ V as a ball with respect to the metric d, as well as study the convergence of a minimization sequence {xi} ⊂ V with respect to the topology top(V). Moreover, we may precisely define the meaning of searching in the direction v ∈ V̂ starting from x0 ∈ V, which results in the new point x0 + v ∈ V, while the step length is ‖v‖2. In almost all the cases discussed later the space of solutions and the direction space will be supported by the same set R^N (both V and V̂ are R^N), so we will refer all elements of the affine structure to V for the sake of simplicity. Detailed comments will be delivered in the other cases.

The subset D ⊂ V will denote the set of admissible solutions of each optimization problem defined later. We will frequently assume that D is compact, which in the finite dimensional case implies that D is bounded (diam(D) < +∞) and closed. In several cases a Lipschitz boundary of the admissible set will be required, i.e. ∂D will be a finite composition of (N − 1)-dimensional, C¹-regular hypersurface pieces, continuously glued, without infinitely narrow edges (blades) (see e.g. Zeidler [207] for details).

Definition 2.1. The objective function is a well defined, bounded mapping

Φ : D → R+;  0 ≤ Φ(x) ≤ M < +∞, ∀x ∈ D   (2.2)

that evaluates the admissible points from D.

A function well defined on D is a function which is computable for every x ∈ D (see e.g. Cormen, Leiserson, Rivest [51], Manna [108]). In some situations the classical continuous differentiability of the objective function up to the second order will be necessary, which will be denoted as Φ ∈ C^l(D), l = 0, 1, 2. Because D may be closed, in such cases we assume Φ ∈ C^l(A), l = 0, 1, 2 for some open set A ∈ top(V) such that D ⊂ A. We will consider four basic global optimization problems:
Problem Π1: Find all points x* ∈ D such that

Φ(x) ≤ Φ(x*), ∀x ∈ D   (2.3)

i.e. the global maximizers of the objective function Φ on D.

Problem Π2: Find all points x+ ∈ D such that

∃A ∈ top(V); x+ ∈ A, Φ(x) ≤ Φ(x+), ∀x ∈ A ∩ D   (2.4)

i.e. the local maximizers of the objective function Φ on D.

Problem Π3: Find all points x+ ∈ D such that

∃A ∈ top(V); x+ ∈ A, Φ(x) < Φ(x+), ∀x ∈ A ∩ D \ {x+}   (2.5)

i.e. the local isolated maximizers of the objective function Φ on D.
Taking into consideration the restricted accuracy of real computations, the following alternative, approximate global optimization problems may be defined:

Problem Π1^a1 (Π2^a1, Π3^a1): For each point x* being a solution to Π1 (for each point x+ being a solution to Π2, Π3) find at least one point x ∈ Aε(x*) (x ∈ Aε(x+)), where

Aε(y) = {x ∈ V; ‖x − y‖ < ε} ∩ D   (2.6)

for an arbitrary ε > 0.
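A minimal sketch of the neighborhood test behind Π1^a1 may clarify the definition. The one-dimensional domain D = [0, 1] and the objective Φ(x) = 4x(1 − x) below are hypothetical choices for illustration only (the global maximizer of this Φ is x* = 0.5):

```python
def Phi(x):
    # Illustrative objective on D = [0, 1]; its unique global maximizer is 0.5.
    return 4.0 * x * (1.0 - x)

def in_A_eps(x, y, eps):
    # Membership in A_eps(y) = {x in V; ||x - y|| < eps} ∩ D, formula (2.6),
    # with the absolute value playing the role of the norm in V = R.
    return 0.0 <= x <= 1.0 and abs(x - y) < eps

# Solving the approximate problem means returning any admissible point
# inside A_eps(x*), e.g. the point 0.49 for eps = 0.05:
x_star = 0.5
```

Note that this predicate is the only one of the neighborhood definitions in this section that needs a metric on V; formulas 2.7 and 2.8 replace the norm test by objective-value and measure-based tests.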
The main disadvantage of the above definition is its dependency on the norm ‖·‖ in the space V. We may alternatively define

Aε(y) = {x ∈ D; Φ(x) ≥ Φ(y) − ε}   (2.7)

without using a norm in V. The problems obtained in this way, denoted as Π1^a2, Π2^a2, Π3^a2, have quite a different structure than the previous ones ({Πi^a1}, i = 1, 2, 3). Following the idea presented by Rinnooy Kan, Timmer [137] and Betró [22], another definition of the extrema neighboring sets may be suggested:

Aε(y) = {x ∈ D; φ(Φ(x)) ≤ ε},  φ(z) = meas({ξ ∈ D; Φ(ξ) ≥ z}) / meas(D)   (2.8)

where meas denotes the Lebesgue measure on V. In this way we obtain a new group of approximate problems denoted by Π1^a3, Π2^a3, Π3^a3. The last two
propositions {Πi^aj}, i = 1, 2, 3; j = 2, 3 may exhibit substantial inaccuracy, because for y = x* (or y = x+) the set Aε(y) defined by formula 2.7 (or 2.8) may fail to be a connected set.

All the global optimization problems {Πi, Πi^aj}, i = 1, 2, 3; j = 1, 2, 3 defined so far lead to finding global or local maximizers or their sufficiently close neighborhoods. One can also define the analogous problems {Π̃i, Π̃i^aj}, i = 1, 2, 3; j = 1, 2, 3 that consist of finding global or local minimizers of the objective function Φ on the set D. The meaning of the symbols x*, x+, Aε, φ will depend on the current context.

Remark 2.2. Let Φ be the objective function associated with one arbitrary global maximization problem from the class {Πi, Πi^aj}, i = 1, 2, 3; j = 1, 2, 3 discussed above. We may establish the equivalent global minimization problem from the class {Π̃i, Π̃i^aj}, i = 1, 2, 3; j = 1, 2, 3 by setting a new objective function Φ̃ = M − Φ, where M is the upper bound for Φ on the set D. Moreover, the formulas 2.7, 2.8 have to be rewritten in the form

Aε(y) = {x ∈ D; Φ̃(x) ≤ Φ̃(y) + ε}   (2.9)

Aε(y) = {x ∈ D; φ(Φ̃(x)) ≤ ε},  φ(z) = meas({ξ ∈ D; Φ̃(ξ) ≤ z}) / meas(D)   (2.10)

originally introduced by Rinnooy Kan, Timmer [137] and Betró [22]. The formula 2.6 is also valid for the minimization problems.

Similarly, for Φ̃, the objective function of a minimization problem that satisfies Definition 2.1, the analogous maximization problem may be obtained by setting Φ = M − Φ̃, where M stands for the upper bound of Φ̃ in this case. It should be underlined that the constant M must be well known in both cases. Of course, the way of obtaining equivalent minimization and maximization problems presented above is not the unique one, but for the sake of simplicity we will restrict ourselves to problems which satisfy the assumptions of Remark 2.2.

Remark 2.3. All considerations contained in Section 2.1 may be extended to the class of well defined functions Ψ : D → R which are bilaterally bounded, m ≤ Ψ(x) ≤ M, ∀x ∈ D, for some finite m, M ∈ R. It is enough to set Φ = Ψ + max{0, −m} in order to satisfy the conditions of Definition 2.1. Similarly, as in Remark 2.2, we have to know the values of both constants m, M, the lower and upper bounds of Ψ on the set D.

In the remaining part of this section we will study minimization problems for which the equivalent maximization ones may be established. All problems
from the class {Π̃i, Π̃i^aj}, i = 1, 2, 3; j = 1, 2, 3 which satisfy the assumptions of Remarks 2.2, 2.3 are good candidates. The function to be minimized that constitutes the objective of these problems will be denoted by Φ̃ : D → R+.

Further considerations assume that a proper local optimization method loc exists. It can be started from an arbitrary point x0 ∈ D and then generates a sequence of points in D which always converges to some loc(x0) ∈ D, the local minimizer attainable from the starting point x0. The local method may thus be interpreted as the mapping loc : D ∋ x0 → loc(x0) ∈ D. Next we distinguish an important group of local methods (see Rinnooy Kan, Timmer [137], Dixon, Gomulka, Szegö [61]).

Definition 2.4. The local method loc will be called strictly descent on D if for each starting point x0 ∈ D and an arbitrary norm ‖·‖ in V it generates a sequence {xi}, i = 0, 1, 2, . . . ⊂ D so that

xi+1 = xi + αi pi, ‖pi‖ = 1, αi ≥ 0, ∀i = 0, 1, 2, . . .   (2.11)

Moreover, the sequence converges to x+ = loc(x0), a local minimizer of the objective function Φ̃, and satisfies

∀i = 0, 1, 2, . . . ∀α, β; 0 ≤ α < β ≤ αi ⇒ Φ̃(xi + βpi) ≤ Φ̃(xi + αpi)   (2.12)

Definition 2.5. Let loc be a strictly descent method on D and x+ an isolated local minimizer of Φ̃ in D. The set

Rx+^loc = {x ∈ D; loc(x) = x+}   (2.13)

will be called the set of attraction of x+ with respect to the local method loc. We will frequently simplify its notation to Rx+.

The above definition follows the idea given by Zieliński [208]. Next we introduce four important quantities:

• L(y) = {x ∈ D; Φ̃(x) ≤ y}, the level set of the objective function Φ̃;

• Lx(y), the connected part of L(y) that contains x;

• the cutting level ȳx+ associated with an arbitrary, isolated local minimizer x+, given by ȳx+ = inf{y : ∃z+ a local minimizer of Φ̃ such that z+ ≠ x+ and z+ ∈ Lx+(y)} if such a z+ exists, and by ȳx+ = max{Φ̃(x); x ∈ D} otherwise; this notation will be further simplified to ȳ if the context is clear;

• L̃x+(ȳ) = {x ∈ Lx+(ȳ); Φ̃(x) < ȳ}.
Definition 2.6. The basin of attraction Bx+ ⊂ D of a local isolated minimizer x+ of the function Φ̃ : D → R+ is the connected part of L̃x+(ȳ) that contains x+. The basin of attraction Bx+ ⊂ D of a local isolated maximizer x+ of the function Φ : D → [0, M] ⊂ R+; M < +∞ is the basin of attraction of the local isolated minimizer x+ of the function Φ̃ = M − Φ.

Fig. 2.1. Level sets Lx+(ȳ), L̃x+(ȳ) and the basin of attraction Bx+ of the local isolated minimizer x+ for a function Φ̃ : R → R+.
Remark 2.7. (see Rinnooy Kan, Timmer [137], Dixon, Gomulka, Szegö [61]) Every strictly descent method loc has the following features:

1. Let x ∈ D be an arbitrary admissible point and y ≥ Φ̃(x); then loc(x0) ∈ Lx(y) for all starting points x0 ∈ Lx(y).
2. loc(x0) = x+ for all starting points x0 ∈ Bx+.
3. Bx+ ⊂ Rx+^loc.

Figure 2.1 shows the differences between L(ȳ), L̃x+(ȳ) and Bx+ in the one-dimensional case D ⊂ R. We are now ready to define one more global optimization problem, which consists of finding approximations to the central parts of the basins of attraction.

Problem Π4: For each isolated local maximizer x+ ∈ D with meas(Bx+) > 0, being a solution of problem Π3, find a closed set C ⊂ Bx+ so that meas(C) > 0 and x+ ∈ C.
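The notions of a strictly descent method and its sets of attraction can be made concrete with a small numerical sketch. The function Φ̃(x) = (x² − 1)², the fixed-step descent routine and the start points below are all hypothetical choices for illustration, not constructions taken from the cited works; the two isolated local minimizers of this Φ̃ are x+ = ±1.

```python
def phi_tilde(x):
    # Hypothetical bimodal objective with isolated minimizers at -1 and +1.
    return (x * x - 1.0) ** 2

def loc(x0, step=1e-3, iters=20000):
    """A crude strictly descent local method: repeatedly take a step of
    fixed length in a unit direction that decreases phi_tilde, and stop
    when no such direction exists (illustration only)."""
    x = x0
    for _ in range(iters):
        for p in (-1.0, 1.0):              # the two unit directions in R
            if phi_tilde(x + step * p) < phi_tilde(x):
                x += step * p
                break
        else:
            return x                       # no descent direction left
    return x

# Approximating the sets of attraction R_{x+} (Definition 2.5) amounts to
# classifying starting points by the minimizer that loc reaches from them:
attractors = {x0: round(loc(x0)) for x0 in (-1.7, -0.4, 0.4, 1.7)}
```

Remark 2.7 predicts that every start inside the basin Bx+ is driven to x+, which the classification above reproduces on this toy function.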
Similarly, as in Remark 2.2, the problem Π̃4 associated with Π4, which consists of finding approximations to the basins of attraction of local minimizers, may be defined. The importance of the problems defined above may be motivated twofold:

1. Stochastic, population based global optimization methods are well suited for solving such kinds of problems. This hypothesis may be roughly motivated by the analysis of sampling measure dynamics. Sampling measures defined on the admissible set D tend toward ones that become denser close to the local minimizers. Such behavior is observed for some genetic algorithms and for Monte Carlo algorithms equipped with some additional mechanisms. The detailed motivation will be delivered in Chapters 4 and 6.

2. The solution of problems Π4, Π̃4 allows for the effective solution of problems {Π̃i, Π̃i^aj}, i = 1, 2, 3; j = 1, 2, 3 if the objective function is sufficiently regular (usually C² regularity is required) in the neighborhood of the local minimizers. After the central parts of the basins of attraction have been recognized, a single, accurate local optimization method may be run in each basin. The local methods may be processed in parallel. This approach will be discussed in Chapter 6.
2.2 General schema of a stochastic search

If the optimization problems Π1 or Π̃1 additionally satisfy conditions that guarantee the existence and uniqueness of the solution, as well as conditions that imply the convergence and verify the stopping rule of a local method (usually based on the evaluation of the norms of the gradient and the Hessian matrix of the objective function Φ, see e.g. [66]), there is no need to use stochastic methods, such as genetic algorithms, which are characterized by a far worse ratio between accuracy and computational cost than the local ones. Real life optimization problems are usually much less regular. In particular they may exhibit the following irregularities:

1. The objective function Φ is not globally convex. More than one point at which the objective function attains its local or global extreme exists in the admissible domain D.

2. The objective function Φ is not globally continuously differentiable up to the second order (Φ ∉ C²(D)). Sometimes it preserves the regularity C^i(A), i = 0, 1, 2 only for A being the neighborhood of some local extremes.

3. The gradient vector and the Hessian matrix are difficult or impossible to compute explicitly. Complex approximation procedures have to be applied.
4. We do not have enough information about the “geography” of extreme locations in the admissible domain D to run several local methods, approximately one for each extreme (the multistart approach).
Fig. 2.2. General schema of the population-oriented, stochastic global optimization search.
The situations described above usually make stochastic, population oriented strategies much more efficient than the local, convex methods. The “skeleton” of such a strategy is shown in Figure 2.2. We will denote by {Pt}, t = 0, 1, 2, . . . the consecutive random samples, called populations. Populations are multisets of “clones” from the admissible domain D, which roughly means that a single point x ∈ D may have more than one “clone” in a particular population. There are several ways to formalize the definition of multisets.

Definition 2.8. (see e.g. [168]) Let Z be the set of patterns and η : Z → Z+ the occurrence function. The multiset A of elements (clones) from Z is the pair A = (Z, η) and may be understood as the set to which belong η(x) clones of the pattern x ∈ Z. The cardinality of the multiset is given by #A = Σ_{x∈Z} η(x).

The definition above allows us to introduce simple two-argument operations on multisets, analogous to the union, intersection, subtraction and Cartesian product of sets.

Definition 2.9. Let us consider two multisets A1 = (Z, η1) and A2 = (Z, η2).
• The union of multisets is defined as A1 ∪ A2 = (Z, η) so that η(x) = η1(x) + η2(x), ∀x ∈ Z.

• The intersection of multisets is defined as A1 ∩ A2 = (Z, η) so that η(x) = min{η1(x), η2(x)}, ∀x ∈ Z.

• If η1(x) ≥ η2(x), ∀x ∈ Z, then the subtraction of A2 from A1 may be defined as A1 \ A2 = (Z, η) so that η(x) = η1(x) − η2(x), ∀x ∈ Z. The opposite subtraction A2 \ A1 does not make sense in this case.

• The Cartesian product of multisets is defined as A1 × A2 = (Z², η) so that η(x, y) = η1(x)η2(y), ∀(x, y) ∈ Z².

Moreover, the inclusion relation may be extended to the multiset case.
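The operations of Definition 2.9 translate directly into code when the occurrence function η is stored as a mapping from patterns to counts. A small sketch using Python's Counter follows; the patterns 'a', 'b' are arbitrary placeholders, not data from the book:

```python
from collections import Counter

# Occurrence functions of two multisets over the same pattern set Z:
# A1 contains two clones of 'a' and one of 'b'; A2 one clone of each.
eta1 = Counter({'a': 2, 'b': 1})
eta2 = Counter({'a': 1, 'b': 1})

union = eta1 + eta2            # eta(x) = eta1(x) + eta2(x)
intersection = eta1 & eta2     # eta(x) = min(eta1(x), eta2(x))

def subtract(e1, e2):
    # A1 \ A2 is defined only when eta1(x) >= eta2(x) for every x.
    if any(e2[x] > e1[x] for x in e2):
        raise ValueError("subtraction undefined: eta2 exceeds eta1 somewhere")
    return Counter({x: e1[x] - e2[x] for x in e1 if e1[x] > e2[x]})

def cartesian(e1, e2):
    # eta(x, y) = eta1(x) * eta2(y) on Z x Z.
    return Counter({(x, y): e1[x] * e2[y] for x in e1 for y in e2})

cardinality = sum(union.values())   # #A = sum of eta(x) over x in Z
```

The inclusion tests of Definition 2.10 reduce, in the same representation, to coordinate-wise comparisons of the two counters.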
Definition 2.10. Let us consider two multisets A1 = (Z, η1) and A2 = (Z, η2).

• The inclusion of multisets A1 ⊆ A2 holds if ∀x ∈ Z η1(x) ≤ η2(x).
• The strict inclusion of multisets A1 ⊂ A2 holds if ∀x ∈ Z η1(x) ≤ η2(x) and ∃y ∈ Z so that η1(y) < η2(y).

Remark 2.11. If the cardinality of the multiset A = (Z, η) is finite (#A < +∞) then the representation given by Definition 2.8 is equivalent to the following: A = (Z, η̃), where η̃ : Z → [0, 1] returns the occurrence frequency of a particular pattern, i.e. η̃(x) = η(x)/#A, ∀x ∈ Z. Obviously Σ_{x∈Z} η̃(x) = 1. Please note that the assumption #Z < +∞ is unnecessary in this case.

Remark 2.12. The multiset representation A = (Z, η̃) allows us to study multisets with infinite cardinality (#A = +∞) by setting

η̃(x) = lim_{#A→+∞} η(x)/#A

Unfortunately, such a representation is ambiguous, because an arbitrary finite number of clones of x ∈ Z in A gives the frequency η̃(x) = 0 in the limit case #A = +∞.

Although the multiset representation given by Definition 2.8 is well suited for multiset calculations, this approach has some disadvantages. In particular, it makes it impossible to distinguish two clones of the same pattern that belong to A. Sometimes we need to handle multisets like regular sets in which two clones of a single pattern constitute separate elements. The concept presented below is based on the permutational power of a set, used in topology and introduced in the early papers of Smith [173] and Richardson [136]. More rigorous definitions and some interesting features of the permutational power were delivered by Kwietniak [101]. Assuming the finite cardinality #A = µ < +∞, an alternative way of defining multisets (see e.g. [93, 149]) is based on the equivalence relation eqp ⊂ (Z^µ)² so that:
∀(z1, . . . , zµ), (y1, . . . , yµ) ∈ Z^µ:  (z1, . . . , zµ) eqp (y1, . . . , yµ) ⇔ ∃σ ∈ Sµ; zi = yσ(i), i = 1, . . . , µ   (2.14)

where Sµ denotes the group of permutations of a µ-element set. The simple proof that eqp is indeed an equivalence relation is left as an exercise for the reader.

Definition 2.13. The multiset A of elements (clones) from Z of finite cardinality µ < +∞ is a member of the quotient set, A ∈ Z^µ/eqp.

The above Definition 2.13 informs us that the ordering of elements in A is insignificant, and that we should take the abstraction class of the µ-dimensional vector from Z^µ with respect to eqp as the multiset model. This second multiset definition is rather inconvenient for multiset operations like those mentioned in Definitions 2.9, 2.10, but it will be helpful for introducing a useful multiset notation.

Remark 2.14. Both Definitions 2.8 and 2.13 are equivalent for finite multisets.
Proof. We have to show that it is possible to introduce a one-to-one mapping that assigns each multiset A represented as the pair (Z, η) to an equivalence class from Z^µ/eqp. The occurrence function satisfies χ = #supp(η) ≤ µ < +∞, where supp(η) = {y ∈ Z; η(y) > 0}. The support of η may then be represented as {z1, . . . , zχ}. Let us introduce w = (w1, . . . , wµ) ∈ Z^µ such that

wi = z1 for i = 1, . . . , η(z1),
wi = z2 for i = η(z1) + 1, . . . , η(z1) + η(z2),
. . .
wi = zχ for i = η(z1) + · · · + η(zχ−1) + 1, . . . , µ.   (2.15)

Then we may assign [w]eqp to (Z, η), so the mapping (Z, η) → [w]eqp is well defined.

We may easily check that this mapping is onto (surjective). Let us take an arbitrary ω ∈ Z^µ/eqp and w ∈ ω, for which we may uniquely define the function η : Z ∋ x → #{i; wi = x} ∈ Z+. Such a definition is invariant with respect to the coordinate permutations of w, so η is also uniquely assigned to the whole class ω.

Now let us assume that there is an ω ∈ Z^µ/eqp assigned to two pairs (Z, η1), (Z, η2). Immediately from the construction of ω (see 2.15), supp(η1) = supp(η2) = ZA, and for each x ∈ ZA, η1(x) = η2(x), so η1 = η2, which proves the injectivity of the mapping.

Remark 2.15. The membership relation has to be commented upon for both multiset representations given by Definitions 2.8, 2.13. If x ∈ Z is a member of the multiset A = (Z, η), then of course x ∈ supp(η), which constitutes the necessary condition for the multiset membership relation that can be drawn from Definition 2.8. For multisets of finite cardinality #A = µ < +∞, Definition 2.13 allows the representation [w]eqp of the multiset A, where w is given by formula 2.15.
The relation x ∈ A then means that x is one of the values of the finite sequence w which is an instance of A ∈ Z^µ/eqp. We may introduce an intuitive notation for the multiset A that will support many further considerations:

⟨z1, . . . , z1, z2, . . . , z2, . . . , zχ, . . . , zχ⟩, where zk occurs η(zk) times, k = 1, . . . , χ   (2.16)

and supp(η) = {z1, . . . , zχ}, χ ≤ µ < +∞. The angle brackets ⟨ ⟩ play a similar role in the multiset notation as the curly brackets { } in the set notation: they enclose the list of elements, while the order of the elements is insignificant. This notation allows us to distinguish two members of A that represent the same pattern; however, its mathematical correctness is far from ideal.

After this necessary, broad digression about multiset formalisms we turn back to stochastic strategies of global optimization. If #Pt = 1, t = 0, 1, 2, . . . (each sample is a singleton) then we get perhaps the simplest strategy, called the random walk. This algorithm produces the maximization (minimization) sequence x0, x1, x2, . . . such that

xt ∈ D, {xt} = Pt, t = 0, 1, 2, . . .   (2.17)
the features of which completely differ from the features characterizing sequences obtained by local convex optimization methods. In particular, the sequence is not necessarily monotone. Examples of random walk algorithms will be discussed in Section 4.2.1.

Let us now comment more extensively on the steps of the stochastic strategy depicted in Figure 2.2. Creation of the initial sample P0 consists of sampling, usually by a multiple sampling procedure, some points from the set D. We will denote by M(D) the space of probabilistic measures on the admissible domain D. The probability distribution h0 ∈ M(D) is usually uniform on D, or close to the uniform one:

h0(A) = meas(A)/meas(D), where A ⊂ D is a measurable set   (2.18)

Such sampling reflects the typical situation at the start of global optimization searches, when no sampling regions of D are preferred because no information about the extremes is available a priori. When information about the extreme locations is accessible at the start of the computation, h0 may be concentrated in some regions of D. In computational practice the initial sample P0 may be obtained by using computer generators which deliver pseudo-random numbers with a uniform distribution on [0, 1]. Another possibility is to use the Halton, Sobol or Faure algorithms that generate sequences of low discrepancy. More information concerning initial sample generation may be found in the papers of Arabas and Słomka [8] and Arabas [5].
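A sketch of the initial-sampling step for a box-shaped admissible domain follows. The box assumption and all parameter values are illustrative choices, not the book's; the h0 of formula 2.18 is uniform on an arbitrary measurable D.

```python
import random

def initial_sample(mu, bounds, seed=None):
    """mu-fold multiple sampling with the uniform distribution h0 over a
    box D = [lo1, hi1] x ... x [loN, hiN], a special case of formula 2.18.
    Returns the sample P0 as a list of points."""
    rng = random.Random(seed)
    return [tuple(rng.uniform(lo, hi) for lo, hi in bounds)
            for _ in range(mu)]

P0 = initial_sample(mu=100, bounds=[(0.0, 1.0), (-2.0, 2.0)], seed=42)
```

Replacing rng.uniform by a low-discrepancy generator (Halton, Sobol, Faure) changes only this routine and leaves the rest of the strategy of Figure 2.2 untouched.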
Evaluation of the random sample Pt consists in effectively computing the values of the objective function for all sample members, i.e. {Φ(x)}, x ∈ Pt. If the sample is a multiset represented by Pt = (D, ηt) (see Definition 2.8) then it is enough to compute {Φ(x)}, x ∈ supp(ηt).

Modification of the random sample Pt appears only in some stochastic strategies of global optimization, e.g. Clustered Genetic Search (see Chapter 6) and Lamarckian evolution (see Section 5.3.4). The modification may consist of a sample reduction that removes poorly evaluated members (i.e. members x ∈ Pt with a low objective value Φ(x)) or removes members that lie too far from the other members of the same sample Pt (with respect to the distance function in V). In the extreme case of sample reduction only the single, best evaluated member may remain. The following mapping will be useful for the best element selection:

b(Pt) = x; Φ(x) = max{Φ(ξ), ξ ∈ supp(ηt)}, Pt = (D, ηt)   (2.19)

Another possibility of sample modification is replacing each element x ∈ Pt by the result loc(x) of some simple (low-complexity) local method. The transformed sample may be used for the rough location of the basins of attraction in the first phase of a two-phase strategy (see Chapter 6). The modification of the sample Pt may also consist of retaining some points from the previous samples {Pi}, i = 0, 1, . . . , k − 1 that are located in the central parts of the basins of attraction (in the sense of the problem Π4 definition). The modified sample may replace or complement the current one in the next computation step of the stochastic strategy (see e.g. Section 5.3.4), or it may be utilized only by the stopping rule and the post-processing of the final results.

The stopping rule is one of the most difficult parts of a stochastic global optimization strategy to define and formally verify. Roughly speaking, we have to decide whether the main goal of the strategy has been reached to a satisfactory degree. The simplest possibility consists of tracing the mean objective value

Φ̄t = (1/#Pt) Σ_{x∈Pt} Φ(x) = (1/#Pt) Σ_{y∈supp(ηt)} ηt(y) Φ(y)   (2.20)

for the consecutive samples P0, P1, P2, . . .. The strategy is stopped if Φ̄t exceeds an assumed threshold. Another possibility is to use the Bayesian stopping rules that estimate the expected number of local extremes which remain to be detected (see e.g. Zieliński [208], Zieliński, Neuman [209], Betró [22], Hulin [87]).

Making the next epoch sample Pt+1 may be performed by some random operations on members of the previous sample Pt (e.g. genetic operations described in Sections 3.5, 3.7) or by sampling from Pt according to an explicitly known probability distribution. Some information coming from earlier epochs
Pt−1, Pt−2, . . . may also be involved, but the range of such a retrospective is strictly constrained in the case of particular strategies. No matter how the new sample is obtained, we will denote by ht+1 ∈ M(D) the probability distribution that characterizes the appearance of clones of points from D in the multiset Pt+1. As mentioned before, ht+1 is computed explicitly for some algorithms (e.g. simple Monte Carlo strategies), while in other cases only the sampling rule can be established (e.g. genetic algorithms). Sometimes we are also able to compute ht+1 explicitly for genetic algorithms (e.g. the Simple Genetic Algorithm); however, its complexity greatly exceeds the complexity of heuristic, stochastically equivalent sampling rules. The features of the probability distributions {ht}, t = 0, 1, 2, . . . determine the behavior of the whole strategy and, in particular, its efficiency. In the case of adaptive stochastic strategies the next epoch probability distribution ht+1 may depend on the control parameter vector u(t) that comprises some feedback information (e.g. standard deviations of previous samples) or external controls set by the user.

We will finish this section with the presentation of the Pure Random Search (PRS) strategy, which is perhaps one of the simplest stochastic global searches that may be used for solving problems Π1 or Π̃1. It will be used further as the reference strategy due to its simplicity and good asymptotic features. This strategy works according to the scheme shown in Figure 2.2, while:

• The initial sample P0 is obtained by µ-fold (µ ∈ N) multiple sampling according to the uniform probability distribution over D. Moreover, the variable x̂ is initialized with an arbitrary point from D.

• Sample evaluation consists of computing {Φ(x)}, x ∈ supp(ηt), where ηt is the occurrence function of the sample Pt.

• Modification of the random sample consists of finding x̂t = b(Pt), the best evaluated sample member at the epoch t. The variable x̂ is updated: x̂ = x̂t if Φ(x̂t) ≥ Φ(x̂) in the case of solving Π1, or if Φ̃(x̂t) ≤ Φ̃(x̂) when solving Π̃1.

• The stopping rule takes into account only the current values of x̂ and Φ(x̂).

• Making the next epoch sample Pt+1 is independent of the previous samples Pt−1, Pt−2, . . .. It consists of µ-fold multiple sampling according to the uniform probability distribution over D, as at the initial step.
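The whole PRS loop fits in a few lines. The sketch below assumes a box-shaped D, the maximization problem Π1 and an illustrative objective, and replaces the unspecified stopping rule by a fixed epoch budget; all of these are assumptions for the example, not the book's prescriptions.

```python
import random

def b(P, Phi):
    # Best element selection of formula (2.19), on a sample stored as a list.
    return max(P, key=Phi)

def pure_random_search(Phi, bounds, mu=50, epochs=200, seed=0):
    """Pure Random Search for maximization over a box D: every epoch is an
    independent mu-fold uniform sample; x_hat keeps the best point seen."""
    rng = random.Random(seed)

    def uniform_point():
        return tuple(rng.uniform(lo, hi) for lo, hi in bounds)

    x_hat = uniform_point()                       # arbitrary initialization
    for _ in range(epochs):
        P = [uniform_point() for _ in range(mu)]  # next epoch sample P_t
        x_t = b(P, Phi)                           # modification step
        if Phi(x_t) >= Phi(x_hat):                # update of x_hat
            x_hat = x_t
    return x_hat

# Illustrative 2-D objective with maximum value 2 at (0.5, -1):
Phi = lambda p: 2.0 - (p[0] - 0.5) ** 2 - (p[1] + 1.0) ** 2
found = pure_random_search(Phi, bounds=[(0.0, 1.0), (-2.0, 0.0)])
```

The mean objective Φ̄t of formula 2.20 could be traced inside the loop to provide a threshold-based stopping rule in place of the fixed budget.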
2.3 Basic features of stochastic algorithms of global optimization

We start with two definitions that try to formalize the ideas of asymptotic correctness and asymptotic guarantee of success mentioned in the literature
(see e.g. [83], [137]). Both features are related to solving global optimization ˜ i , Πaj , Π ˜ aj }, i = 1, 2, 3; j = 1, 2, 3 by using stochastic strateproblems {Πi , Π i i gies. Definition 2.16. We can say that the global optimization stochastic strategy is asymptotically correct in the probabilistic sense if it finds the global maximizer (minimizer) with the probability 1 after the infinite number of epochs. Definition 2.17. We can say that the global optimization stochastic strategy has the asymptotic guarantee of success if it finds all local maximizers (minimizers) with the probability 1 after the infinite number of epochs. Infinite working time which is mentioned in the above definitions 2.16, 2.17 may result in the arbitrarily large number of sample points that have to be created and evaluated. In fact, both definitions characterize not only the stochastic global optimization strategy but also the particular optimization problem to be solved. More correctly, the stochastic global optimization strategy satisfies the asymptotic correctness condition or has the asymptotic guarantee of success if the proper conditions are satisfied for the particular admissible set D and the objective function Φ. Both conditions defined above do not deliver the useful criterion for stopping the stochastic strategy. The global/local minimizer may appear in the population Pt , but we have no information about its appearance. The asymptotic guarantee of success and asymptotic correctness are the only necessary conditions that have to be satisfied. In other words, the strategy that does not satisfy the asymptotic correctness never samples the global minimizers/maximizers (they lie in its “tabu” region with the constant zero sampling ˜ 1 . The probability), so it is completely useless for solving problems Π1 , Π same comment may be assigned to the asymptotic guarantee of success in the ˜ i }, i = 2, 3 problems. 
context of solving {Π_i, Π̃_i}, i = 2, 3 problems.

The next feature of stochastic global optimization strategies that will be considered is global convergence (see Rudolph [143], Beyer, Rudolph [27]). Let x* ∈ D be the solution to the problem Π_1 (x* is one of the global maximizers of Φ), and Φ(x*) the maximum objective value on D. Let us consider the random sequence

{Y_t = Φ(b(P_t))}, t = 0, 1, 2, . . .     (2.21)

where b is the mapping defined by formula 2.19 which selects the best evaluated member of the sample.

Definition 2.18. We say that a stochastic global optimization strategy is globally convergent if the random sequence {Y_t}, t = 0, 1, 2, . . . defined by formula 2.21 converges completely to Φ(x*), which means that
∀ε > 0:   lim_{t→+∞} Σ_{j=0}^{t} Pr{(Φ(x*) − Y_j) > ε} < +∞
The condition defined above will be discussed extensively in Section 4.1.3. Features of stochastic global optimization strategies that would be helpful in the analysis of problem Π_4 are difficult to specify at this point. Some technical quantities and notions, like sampling measures and strategy heuristics, have to be prepared first. They will be discussed in Section 4.1.2.
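As a minimal illustration of Definition 2.16, pure random search with a sampling measure that is strictly positive on the whole admissible set is asymptotically correct in the probabilistic sense: the best sampled value approaches the global maximum as the number of epochs grows. The sketch below is illustrative only; the objective function and all names are hypothetical.

```python
import math
import random

def pure_random_search(phi, lo, hi, epochs, seed=0):
    """Sample uniformly on [lo, hi] and keep the best evaluated point.

    Uniform sampling gives every subset of positive measure a positive
    sampling probability, which yields asymptotic correctness in the
    probabilistic sense (Definition 2.16)."""
    rng = random.Random(seed)
    best_x, best_y = None, -math.inf
    for _ in range(epochs):
        x = rng.uniform(lo, hi)
        y = phi(x)
        if y > best_y:
            best_x, best_y = x, y
    return best_x, best_y

# Hypothetical bimodal objective with its global maximum near x = 0.7.
def phi(x):
    return math.exp(-(x - 0.7) ** 2 / 0.01) + 0.8 * math.exp(-(x - 0.2) ** 2 / 0.01)

x_best, y_best = pure_random_search(phi, 0.0, 1.0, epochs=10000)
```

Conversely, a strategy whose sampling probability is zero over a region containing x* would never report the maximizer, no matter how many epochs are run.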
2.4 Genetic algorithms in action – solution of inverse problems in the mechanics of continua

This section presents three complex computational examples that show the capability of genetic strategies in solving continuous global optimization problems. The problems under consideration are related to the optimal design and defectoscopy of mechanical structures. These examples were selected in particular for their clear definitions and the possibility of an illustrative, graphical presentation of results. All the presented examples show solutions of inverse problems (topological optimization, optimization of physical parameters of structures, etc.) where each objective evaluation needs the solution of the direct problem, which is usually given as a boundary-value problem for partial differential equations or its discrete, algebraic version obtained by a particular numerical method (e.g. the finite element method or the boundary element method, see [210], [38] for details). Various genetic optimization techniques (classical evolutionary algorithm, Lamarckian evolution, Clustered Genetic Search, etc.) suitable for continuous problem solution will be applied.

Example 2.19. Topological optimization of a crane-arm truss.

This example is taken from the research paper of Burczyński, Beluch, Długosz, Kuś and Orantek [39]. The optimization problem under consideration may be classified as Π̃_1. It consists of finding the truss geometry, i.e. the number of truss members, the member cross sections and the connection topology, so that the resulting structure is the lightest one. We assume the structure to be linearly elastic, so it has to satisfy the linear state equations

Ku = F
(2.22)
where K is the stiffness matrix of the truss, u and F stand for the displacement vector and the external force vector at the truss joints respectively (we refer to Zienkiewicz, Taylor [210] for details). Moreover, the following constraints have to be satisfied:
Fig. 2.3. Initial shape of the truss.
• stresses in the truss members and displacements at each truss joint have to be less than the admissible ones,
• normal forces have to be less than the buckling ones.
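A single objective evaluation for such a design problem can be sketched as solving the state equation 2.22 and penalizing constraint violations. The 2-degree-of-freedom stiffness matrix, the load vector, the limits and all names below are hypothetical, not taken from [39]; this is a sketch of the idea, not the actual procedure.

```python
def solve2(K, F):
    """Cramer's rule for a 2 x 2 linear system K u = F."""
    det = K[0][0] * K[1][1] - K[0][1] * K[1][0]
    return [(F[0] * K[1][1] - K[0][1] * F[1]) / det,
            (K[0][0] * F[1] - F[0] * K[1][0]) / det]

def evaluate_truss(K, F, areas, lengths, u_adm, density=7850.0):
    """Penalized truss weight: solve the state equation K u = F (2.22),
    then add a large penalty when a joint displacement exceeds the
    admissible one."""
    u = solve2(K, F)
    weight = density * sum(a * L for a, L in zip(areas, lengths))
    violation = sum(max(abs(ui) - u_adm, 0.0) for ui in u)
    return weight + 1.0e6 * violation

# Hypothetical 2-DOF stiffness matrix [N/m], load vector [N] and limits.
K = [[2.0e6, -1.0e6], [-1.0e6, 2.0e6]]
F = [1000.0, 0.0]
value = evaluate_truss(K, F, areas=[1.0e-3, 1.0e-3],
                       lengths=[2.0, 2.0], u_adm=0.01)
```

The penalty approach lets an unconstrained genetic search handle the stress, displacement and buckling constraints implicitly.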
Decision variables are composed of three groups that characterize:
1. information about existing truss members,
2. information about areas and cross sections of truss members,
3. coordinates of free truss joints.
All three groups of decision variables encoded as a real-valued vector constitute the genotype of the individual (see Section 3.1). A constant size population (random sample) of 5000 individuals was processed. Rank selection (see Section 3.4.4), classical phenotypic mutation (see Section 3.7.1) and arithmetic crossover (see Section 3.7.2) were utilized in order to produce new populations in each epoch. After about 1500 epochs the population stabilized (no significant improvement of the best fitted individual was observed), so further processing up to 2000 epochs did not bring an essential improvement. The total weight of the crane-arm truss decreased from the initial 5497.44 kg to the final 3406.07 kg. The transformation of the truss topology is shown in Figures 2.3, 2.4.

Example 2.20. Identification of internal defects in lumped elastic structures.

The genetic computations presented below are applied to the detection of undesirable voids (cracks and lumped voids) inside massive (lumped), elastic structures. The problem is formulated as a global optimization one, where the control variables determine the number, shapes and locations of voids. Decision variables may be transformed unambiguously to the physical parameter distribution inside the structure body. The structure was subjected to cyclic loading, then external strains and eigenfrequencies were measured.
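Both this example and the previous one rely on phenotypic mutation and arithmetic crossover applied to real-valued genotypes (see Sections 3.7.1, 3.7.2). A minimal sketch, with illustrative parent vectors and mutation strength:

```python
import random

def arithmetic_crossover(parent_a, parent_b, rng):
    """Offspring as a random convex combination of two real-valued parents."""
    a = rng.random()
    return [a * x + (1.0 - a) * y for x, y in zip(parent_a, parent_b)]

def phenotypic_mutation(genotype, sigma, rng):
    """Perturb each coordinate with an independent Gaussian sample of
    standard deviation sigma (sigma = 0.1 below is illustrative)."""
    return [x + rng.gauss(0.0, sigma) for x in genotype]

rng = random.Random(42)
child = arithmetic_crossover([0.0, 0.0, 0.0], [1.0, 1.0, 1.0], rng)
mutant = phenotypic_mutation(child, sigma=0.1, rng=rng)
```

The crossover keeps offspring inside the convex hull of the parents, while the Gaussian mutation allows the search to leave it.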
Fig. 2.4. Final shape of the truss after topological optimization.
Genetic optimization allows us to find decision variables for which the simulated response of the structure is closest to the real, measured one. The example description follows the research paper written by Burczyński, Kuś, Nowakowski and Orantek [40]. Let us assume that, for all the structures analyzed further, the elastodynamic partial differential equations are satisfied:

div σ − ρü = 0 in S × [0, T],   σ = Cε,   ε = ½(∇u + ∇uᵀ)     (2.23)

where u is the displacement field and ü its second partial derivative with respect to the time variable, σ and ε are the second order tensor fields of stress and strain respectively, C stands for the 4th order tensor field of the elastic constitutive law, and finally ρ is the material density field (see e.g. Derski [58]). In the above formulas S denotes the structure domain and [0, T] the time interval in which we are looking for the solution. The symbols div and ∇ denote the divergence computed from the tensor field and the gradient operator computed separately for each coordinate of the vector-valued function, respectively. The proper boundary conditions were satisfied on the external boundary of the structure Γ and on the internal boundary of the defects. The adequate initial condition is also satisfied inside the structure S at t = 0. If the body S undergoes free vibration, the governing equation 2.23 takes the form:

div σ − ω²ρu = 0 in S     (2.24)

where ω denotes the circular eigenfrequency of the structure. The objective function Φ̃ will penalize discrepancies between the observed displacement û and consecutive circular eigenfrequencies ω̂_i and the simulated values u, ω_i of these quantities. The objective is given as the linear combination:
Fig. 2.5. Loading of the plate at the defect identification problem.
Φ̃ = w_1 Φ̃_1 + w_2 Φ̃_2,
Φ̃_1 = 0.5 Σ_i (ω̂_i − ω_i)²,
Φ̃_2 = 0.5 ∫_0^T ∫_Γ (û − u)² dσ dt     (2.25)
where w_1, w_2 are the proper numeric weights.

The global optimization problem of type Π̃_1 was solved. It consists of finding the global minimizers of Φ̃ with respect to u, ω_i. However, u, ω_i are not the target decision variables of the identification problem. The target ones are the number of internal defects and their shape parameters. The dependency between the target decision variables and the variables u, ω_i, as well as the objective function Φ̃ and its first and second derivatives DΦ̃, D²Φ̃, is established by solving the system 2.23, 2.24 by the boundary element method (see Burczyński [38]). As usual in the case of identification problems, the objective value is zero in each global minimizer. Genotypes are real-valued vectors that encode defect shape parameters. Parameter vectors take their values from some brick D ⊂ R^N which is the admissible set of solutions. The evolutionary algorithm utilized for the global optimization problem applies the tournament and rank selections (see Sections 3.4.3 and 3.4.4), phenotypic mutation and arithmetic crossover (see Sections 3.7.1 and 3.7.2) and the special genetic operation called "gradient mutation". A detailed description of the last operation may be found in Section 5.3.4. The first computations were performed for the plate (see Figure 2.5). The defectoscopy problem consists of the identification of up to 3 defects (0, 1, 2 or 3 defects) of elliptical shape. Each defect is parametrized by five real numbers: two center coordinates, two half-axis lengths and the measure of the angle between the axis 0x_1 and the large half-axis. Only the first three eigenfrequencies
Fig. 2.6. Fitness of the best individual in consecutive epochs by defect identification in plate.
ω_1, ω_2, ω_3 were taken into account. The plate was loaded by a vibratory, periodic boundary loading p(t) = p_0 sin(ωt) with the frequency ω = 15708 rad/s and the amplitude p_0 = 40 kN/m. The vibration impulse was activated in the time period t ∈ [0, 600 µs]. The plate material was homogeneous, isotropic and linearly elastic with a Young modulus of 0.2 × 10¹² Pa, a Poisson ratio of 0.3 and a density ρ = 7800 kg/m³. The left and lower sides of the plate were instrumented with 64 sensors that measured û and ω̂_i, i = 1, 2, 3.
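The frequency part Φ̃_1 of the objective 2.25 reduces to half the sum of squared eigenfrequency discrepancies. A minimal sketch; the measured and simulated values below are hypothetical, not data from [40]:

```python
def frequency_discrepancy(measured, simulated):
    """Phi_1-type term of (2.25): 0.5 times the sum of squared
    differences between measured and simulated circular
    eigenfrequencies."""
    return 0.5 * sum((m - s) ** 2 for m, s in zip(measured, simulated))

# Hypothetical measured vs simulated first three eigenfrequencies [rad/s].
value = frequency_discrepancy([15708.0, 31416.0, 47124.0],
                              [15700.0, 31400.0, 47100.0])
```

The term vanishes exactly when the simulated eigenfrequencies match the measured ones, which is why the objective value is zero at each global minimizer of the identification problem.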
Fig. 2.7. The defect identification by the best individual phenotype in consecutive epochs: 1 - 1st, 2 - 10th, 3 - 50th, 4 - 100th.
The genotype of each individual was composed of 15 coordinates (5 coordinates for each of the three defects). The population contained 3000 individuals in each epoch. Figure 2.6 shows the objective value of the best fitted individual in consecutive genetic epochs. Figure 2.7 delivers snapshots of the identification process at four selected phases.
Fig. 2.8. Loading and the sensor location on the defected rod.
Fig. 2.9. Location of defects identified in the rod versus the real life ones after 2000 epochs.
The next computations were performed for a cubic rod with an edge length of 0.2 m (see Figure 2.8). The rod is stiffly fastened at its back wall, and periodically loaded at its front wall. The loading has its norm uniformly distributed
in the whole wall area and the direction different in each quarter. The norm of the loading depends on time according to the formula p(t) = p_0 sin(ωt) with the frequency ω = 31 rad/s and the amplitude p_0 = 15000 kN/m. The loading causes a periodic twisting deformation of the rod. The mass density of the rod is 100 kg/m³, the shear modulus 10⁶ Pa and the Poisson ratio 0.25. The rod is instrumented with 64 sensors uniformly located on the four back walls. The structure contains two spherical defects, each described by four real-valued parameters (three center coordinates and the radius length). As a consequence, the genotype is composed of 8 coordinates (four coordinates for a single defect). The shape and the location of the voids identified after 2000 epochs are shown in Figure 2.9. The results mentioned were obtained for exact values of û and ω̂_i, i = 1, 2, 3. Additional tests that assumed randomly perturbed values of the measured displacements and eigenfrequencies did not bring a significant decrease in the void identification accuracy. The strategy using "gradient mutation" is considerably less computationally complex than the simple evolutionary one and makes the results more accurate.

Example 2.21. Optimal pretraction design in the simple cable structure.

The global optimization problem of type Π̃_4 is considered. It consists of finding the central parts of the basins of attraction of the objective Φ̃, which mainly expresses the energy of internal strains of the simple cable structure (hanging roof). The recognition of the basins of attraction was the first phase of the two-phase stochastic global optimization strategy (see Chapter 6). In the second phase, fast local optimization methods were started separately in each basin, finding accurate approximations of the local minimizers. Additionally, parameters such as the basin diameter and "depth" (the objective variation inside the basin) may be helpful in the sensitivity analysis of the obtained minimizers.
The presented results were taken from the research papers written by Telega [184] and Telega, Schaefer [186]. The structure under consideration is composed of unconnected cables stretched in two perpendicular directions lying in a single horizontal plane. The cables are fastened at their ends to the stiff square frame of the area S = [0, 1]² ⊂ R² in such a way that the frame sides are parallel or perpendicular to each cable. The resulting network structure is loaded by forces perpendicular to the frame plane. The loading is characterized by its surface intensity q. The linearized and homogenized state equation for such a structure is the following (see Cabib, Davini, Chong-Quing Ru [42]):

−div(σ Du) = q in S,   u = 0 on ∂S     (2.26)

where ∂S stands for the frame contour and u for the cables' perpendicular displacement field. The pretraction tensor σ is given by the formula:
Fig. 2.10. The results of Clustered Genetic Search for the strategy tuned for two noticeable local minimizers.
σ(x) = diag(σ1 (x2 ), σ2 (x1 ))
(2.27)
where σ_i(x_j) denotes the pretraction of the cables parallel to the axis 0x_i, which depends only on the variable x_j, for i, j = 1, 2. The objective function is given by:

Φ̃ = ∫_S σ |u_σ|² dx + P(σ)     (2.28)

where u_σ denotes the cable displacement obtained for the pretractions σ_1, σ_2. The function P denotes the middle penalty function that penalizes deviations from the typical cable parameters and returns the zero value for parameters offered by manufacturers. This is one of the reasons why the objective function is multimodal. The pretractions σ_1, σ_2 and the loading intensity q satisfy the constraints:
Fig. 2.11. The results of Clustered Genetic Search for the strategy tuned for less noticeable local minimizers.
0 < λ < σ_1, σ_2 < Λ,   λ, Λ ∈ R
∫_S (σ_1 + σ_2) dx ∈ [2λ, 2Λ]
∫_S q dx ≠ 0     (2.29)
Under some additional assumptions (see [42]), the minimization problem for Φ̃ with the pretractions σ_1, σ_2 satisfying 2.29 has more than one global minimizer. The 10 × 10 cable network was assumed for computations. The loading q is placed centrally on the 6 × 6 segment of cables. The pretraction constraints were assumed as λ = 10² kN, Λ = 10³ kN. Computational results are presented only for the two cables closest to the frame ∂S, because of the double symmetry of the problem. The admissible set was D = [λ, Λ]² in this case. Figure 2.10 shows the graph of the function Φ̃ obtained by the multiple solving of the direct problem 2.26. The objective function has two noticeable "deep" minima with large basins of attraction and more "shallow" local minimizers (with a small objective variation inside the basin of attraction).
The basins of attraction were recognized by the raster based strategy described in Chapter 6. A raster composed of 400 cells and a small population of 40 individuals were utilized. The Simple Genetic Algorithm (SGA) with a small mutation rate p_m = 0.05 was applied in the global phase. The genetic algorithm, well tuned (see Definition 4.63) to the noticeable "deep" minima, allowed us to recognize two basins of attraction by the analysis of the individual density in raster cells. Simultaneously, the strategy filters out the basins of attraction of the "shallow", less noticeable minimizers (see Figure 2.10). A less rigorous density analysis also allowed us to recognize some "shallow" basins of attraction with much greater values of the local minima (see Figure 2.11). The possibility of local extrema filtering is one of the unique features that distinguishes Clustered Genetic Search (CGS) among other two-phase stochastic global optimization strategies.
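The density analysis can be sketched as counting individuals per raster cell and keeping the cells whose occupancy reaches a threshold. The cell geometry, the threshold and the sample below are illustrative only, not the actual CGS procedure of Chapter 6:

```python
from collections import Counter

def dense_cells(population, n, threshold):
    """Histogram a population on an n x n raster over [0, 1]^2 and
    return the cells whose individual count reaches the threshold: a
    crude approximation of basin-of-attraction recognition."""
    counts = Counter()
    for x, y in population:
        cell = (min(int(x * n), n - 1), min(int(y * n), n - 1))
        counts[cell] += 1
    return {cell for cell, k in counts.items() if k >= threshold}

# Illustrative sample: three individuals cluster near (0.12, 0.12),
# one stray individual sits near (0.85, 0.90).
sample = [(0.11, 0.12), (0.13, 0.14), (0.12, 0.11), (0.85, 0.90)]
basins = dense_cells(sample, n=10, threshold=3)
```

Raising the threshold filters out the sparsely populated cells, which mimics the filtering of "shallow" basins described above.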
3 Basic models of genetic computations
In this chapter we would like to present the basic objects and operations that are used to build the genetic optimization model. We start with the fundamental encoding operation that transforms the original global optimization problem to the domain of genetic codes, called the genetic universum. We will enrich the relation between the admissible domain and the genetic universum with the decoding operation, which will be especially helpful in the case of searching in continuous domains. Next, the basic group of operations (selections, mutations and crossovers) will be discussed as methods of random transformation of the population to the next epoch. Special emphasis is put on the features of binary genetic operations (operations dealing with the binary genetic code) that will be intensively used in the next chapter. The detailed model of the Simple Genetic Algorithm (SGA) is also defined. Schwefel's taxonomy of evolutionary single- and multi-population (multi-deme) computation models is presented at the end of this chapter.
3.1 Encoding and inverse encoding

One of the basic mechanisms used in genetic computations is the special method of representation of random sample members, which are called here individuals. This representation is obtained by the proper encoding of some points from the admissible domain D. There are at least two goals of individual encoding in genetic computations:
1. Selection of an arbitrary subset of admissible points from D which can be effectively checked during the optimization process.
2. The transformation of design variables that enables the special stochastic operations (genetic operations), which can replace the sampling procedure, to be performed.
Let us introduce the necessary notation:

R. Schaefer: Foundation of Global Genetic Optimization, Studies in Computational Intelligence (SCI) 74, 31–53 (2007) © Springer-Verlag Berlin Heidelberg 2007, www.springerlink.com
Dr ⊆ D: the set of points in the admissible domain which constitute the grid that will be effectively checked. Elements of Dr will be called phenotypes.

r = #Dr: the finite or infinite cardinality of the set of phenotypes.

U: the genetic universum, the whole set of genetic codes used for marking phenotypes. Members of the genetic universum will be called genotypes. We sometimes assume that the genetic universum is closed with respect to the set of genetic operations.
The necessary coherency condition among the cardinalities of the above sets has to be satisfied:

#U = r = #Dr     (3.1)

Please note that the above condition also holds for infinite sets of phenotypes or an infinite genetic universum; the variable r will be a proper cardinal number in this case.

Definition 3.1. Encoding is a one-to-one mapping (bijection) code : U → Dr.

We can try to define a new mapping which assigns to the genotype x ∈ U admissible points from D \ Dr lying close to the point code(x) ∈ D.

Definition 3.2. The inverse encoding (decoding) associated with the encoding function code : U → Dr is the partial mapping, not necessarily defined for all points in D (Dom(dcode) ⊂ D, where the operator Dom returns the function domain):

dcode : D −→ U

that satisfies the following conditions:
1. dcode(code(x)) = x, ∀x ∈ U
2. ∀x ∈ U, ∀y ∈ D \ Dr, ∀z ∈ U \ {x}: d(y, code(x)) < d(y, code(z)) ⇒ dcode(y) = x, where d is the metric in the space V.

Remark 3.3. The partial function dcode satisfies the following conditions:
1. It is not defined for the admissible points that lie in the middle between two or more neighboring phenotypes, in particular for x ∈ D \ Dr such that ∃χ, λ ∈ Dr: d(x, χ) = d(x, λ) and there is no other phenotype γ ∈ Dr such that γ ≠ χ, γ ≠ λ and d(x, γ) < d(x, χ), d(x, γ) < d(x, λ).
2. It is always onto (surjective).
3. The restriction dcode|Dr : Dr → U is a one-to-one function, being the inverse of the encoding function code.
4. It can be arbitrarily extended to a function defined on the whole admissible domain D → U. One possible solution is to follow the lexicographic total order "≻" which may be introduced in Dr (∀χ, λ ∈ Dr: χ ≻ λ ⇔ χ_i ≥ λ_i, i = 1, . . . , N, ∃k ∈ {1, . . . , N}: χ_k > λ_k). If x ∈ D \ Dr and χ, λ ∈ Dr are such as in the first item, then dcode(x) may be set to dcode(λ) if χ ≻ λ, or to dcode(χ) otherwise. Such an extension is also onto.

Remark 3.4. The most interesting case for further consideration will be the one for which #Dr < +∞ and

meas(Dom(dcode)) = meas(D),   meas(dcode⁻¹(x)) > 0 ∀x ∈ U     (3.2)
which also implies meas(D) = Σ_{x∈U} meas(dcode⁻¹(x)). The above conditions may be satisfied if D is convex and sufficiently regular, so that the Voronoi tessellation can be successfully performed for the finite set of phenotypes Dr in D (see e.g. [131]).

Remark 3.5. In some special cases U will be equipped with a metric function dU : U × U → R₊. Arabas (see [5], Section 4.2) calls for another condition for well-conditioned encoding functions:

∀x_1, x_2, x_3 ∈ Dr: d(x_1, x_2) ≥ d(x_1, x_3) ⇒ dU(dcode(x_1), dcode(x_2)) ≥ dU(dcode(x_1), dcode(x_3))     (3.3)
This condition prevents the creation of artificial local extrema in the genotype domain due to the poor topological conditioning of the encoding mapping. The probabilistic measures θ ∈ M(Dr) play a crucial role in the process of genetic sampling, discussed intensively in the following sections. Each measure of this type induces another measure θ′ ∈ M(D) which is concentrated on the set Dr ⊂ D. Such simple extensions have some serious disadvantages. In particular, if the phenotype set is discrete (#Dr < +∞), then it is possible to get a measurable set A ⊂ D such that A ∩ Dr = ∅ and then θ′(A) = 0, while its Lebesgue measure will be strictly positive (meas(A) > 0) and arbitrarily close to meas(D). In order to avoid this problem we will introduce a new measure unambiguously induced by the discrete measure θ. If the condition 3.2 holds, then perhaps the simplest option is to take the measure with the density defined by using the dcode mapping:
ρ_θ(y) = θ(code(z)) / meas(dcode⁻¹(z))   if y ∈ dcode⁻¹(z)     (3.4)
It is easy to see that ρ_θ is a piecewise constant function defined almost everywhere in D. "Almost everywhere in D" means for all points in D possibly excluding points that belong to some subset of zero Lebesgue measure. Of course, assuming any strictly positive discrete measure on the phenotypes θ ∈ M(Dr), θ(y) > 0 ∀y ∈ Dr, for all sets A ⊂ D the condition meas(A) > 0 always implies ∫_A ρ_θ dx > 0. Moreover, meas(A) = 0 holds only if ∫_A ρ_θ dx = 0.

3.1.1 Binary affine encoding

This is perhaps the oldest and most traditional encoding technique, introduced by Holland in his first well known work [85]. Its application to the global search in continuous multidimensional domains D needs a much more formal description than currently found in the literature. The genetic universum is traditionally denoted by Ω in this case and it is composed of all the binary strings of the finite, prescribed constant length l ∈ N:

Ω = {(a_0, a_1, . . . , a_{l−1}); a_i ∈ {0, 1}, i = 0, 1, . . . , l − 1}     (3.5)

Although binary genotypes from Ω are usually associated with chromosomes and are often used to label various discrete structures, it is easy to identify them with the integers from the range [0, 1, . . . , r − 1]. Each binary string i ∈ Ω may be treated as the integer that equals its numerical value and belongs to the above range. The value r − 1 = 2^l − 1 constitutes the upper value of this range, thus we have:

#Ω = r = 2^l < +∞
(3.6)
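The identification of binary strings with integers can be sketched directly, taking a_0 as the most significant bit; the helper names are ours:

```python
def genotype_to_int(bits):
    """Value of a binary genotype (a_0, ..., a_{l-1}), a_0 taken as the
    most significant bit."""
    j = 0
    for b in bits:
        j = (j << 1) | b
    return j

def int_to_genotype(j, l):
    """Inverse: the standard binary representation of j on l bits."""
    return [(j >> (l - 1 - k)) & 1 for k in range(l)]
```

For example, genotype_to_int([1, 0, 1]) gives 5, and int_to_genotype recovers the string, so the identification is one-to-one on [0, 2^l − 1].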
The binary genetic universum Ω may also be identified with the l-times Cartesian product

Z_2 × Z_2 × · · · × Z_2  (l times)     (3.7)

where Z_2 is the group with modulo 2 addition defined on the set {0, 1}. The object given by 3.7 also makes a group with the component-wise modulo 2 addition, denoted by ⊕, defined on the set {0, 1}^l. Let us restrict ourselves to the case in which V = V̂ = R^N. We assume, moreover, that the admissible set D is convex and contains a brick

∏_{i=1}^{N} [u_i, v_i] ⊂ D     (3.8)

being the multiple Cartesian product of the intervals [u_1, v_1], [u_2, v_2], . . . , [u_N, v_N].
Fig. 3.1. Binary affine encoding in two dimensions
Let û_i, v̂_i ∈ R; û_i < v̂_i < +∞, i = 1, . . . , N be the lower and upper bounds for the coordinates of admissible points, i.e. ∀y = (y_1, . . . , y_N) ∈ D: û_i ≤ y_i ≤ v̂_i, i = 1, . . . , N. The affine encoding mapping code_a will be defined in the following steps:

1. We select numbers l_1, l_2, . . . , l_N ∈ N so that Σ_{i=1}^{N} l_i = l.
2. We define the phenotype set in the form

Dr = {(y_1^{j_1}, . . . , y_N^{j_N}); y_i^0 = u_i, y_i^{2^{l_i}−1} = v_i, y_i^{j_i} < y_i^{j_i+1}, j_i ∈ {0, . . . , 2^{l_i} − 2}, i = 1, . . . , N}.     (3.9)

3. Each genotype x ∈ Ω may be represented in two equivalent forms

x = (a_{1,1}, . . . , a_{1,l_1}, a_{2,1}, . . . , a_{2,l_2}, . . . , a_{N,1}, . . . , a_{N,l_N}), a_{i,j} ∈ {0, 1},
x = (j_1, . . . , j_N), j_i ∈ [0, . . . , 2^{l_i} − 1]     (3.10)

where (a_{i,1}, . . . , a_{i,l_i}) stands for the binary code of the integer j_i.
4. Using both of the above representations of the genotype x ∈ Ω we may write

code_a(x) = code_a(a_{1,1}, . . . , a_{1,l_1}, a_{2,1}, . . . , a_{2,l_2}, . . . , a_{N,1}, . . . , a_{N,l_N}) = (y_1^{j_1}, . . . , y_N^{j_N})     (3.11)

where

j_i = Σ_{k=0}^{l_i−1} a_{i,(l_i−k)} 2^k.     (3.12)
The inverse binary affine encoding dcode_a may be obtained as follows:

1. We create the set of numbers {b_i^s}, s = 0, . . . , 2^{l_i}, i = 1, . . . , N:

b_i^0 = û_i,   b_i^j = ½(y_i^j + y_i^{j−1}), j = 1, . . . , 2^{l_i} − 1,   b_i^{2^{l_i}} = v̂_i.     (3.13)

2. Next we define the family of open bricks

ϑ_{(j_1,...,j_N)}, (j_1, . . . , j_N) ∈ Ω;   ϑ_{(j_1,...,j_N)} = ∏_{i=1}^{N} (b_i^{j_i}, b_i^{j_i+1}).     (3.14)
3. dcode_a(y) = (j_1, . . . , j_N) ∈ Ω ⇔ y ∈ D ∩ ϑ_{(j_1,...,j_N)}.

It is easy to see that the above mapping is not defined on the hyperplanes separating neighboring bricks 3.14. The above definitions of the code_a and dcode_a operations follow the paper Schaefer, Jabłoński [154].

Remark 3.6. The construction of both operations code_a and dcode_a presented above satisfies the definitions 3.1, 3.2. In particular:
1. Dr ⊂ D and code_a is one-to-one, which is the effect of the phenotype set construction (see 3.9) and the unambiguity of the genotype representations 3.10.
2. ∀x ∈ Ω: dcode_a(code_a(x)) = x, because code_a(x) ∈ ϑ_x.
3. The set ϑ_x ∩ D constitutes the Voronoi neighborhood of code_a(x) in D (see Preparata, Shamos [131]).
4. meas(∪_{x∈Ω}(ϑ_x ∩ D)) = meas(D) and meas(dcode_a⁻¹(x)) = meas(ϑ_x ∩ D) > 0 ∀x ∈ Ω, which prove the conditions 3.2 postulated in Remark 3.4.

The above conditions also enable us to effectively define the density function ρ_θ for an arbitrary discrete measure θ ∈ M(Dr), according to the formula 3.4. This construction will be discussed and applied in Section 4.1.2. The construction of the affine binary encoding may be partially extended by using curvilinear coordinates to parametrize the admissible domain. The second condition postulated in definition 3.2 may be dropped in the case of non convex domains.
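A sketch of code_a for a uniform grid on the brick (uniform spacing is only one admissible choice of the increasing grid values y_i^{j_i}; the construction above allows any): split the genotype into groups of l_i bits, read each group as an integer j_i, and map it affinely onto [u_i, v_i]. The function and parameter names are ours.

```python
def code_a(bits, lengths, bounds):
    """Affine binary encoding onto a uniform grid of the brick
    [u_1, v_1] x ... x [u_N, v_N] (uniform spacing is an assumption)."""
    point, pos = [], 0
    for l_i, (u, v) in zip(lengths, bounds):
        group = bits[pos:pos + l_i]
        pos += l_i
        j = 0
        for b in group:            # integer value j_i of the i-th group
            j = (j << 1) | b
        point.append(u + j * (v - u) / (2 ** l_i - 1))
    return point

# l = 3 + 3 = 6 bits encode N = 2 coordinates on [0, 1] x [-1, 1].
p = code_a([1, 1, 1, 0, 0, 0], lengths=[3, 3], bounds=[(0.0, 1.0), (-1.0, 1.0)])
```

Here the all-ones first group selects the upper grid point v_1 = 1.0 of the first interval, and the all-zeros second group selects u_2 = −1.0.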
3.1.2 Gray encoding

Gray encoding may be understood as another case of the affine encoding described in Section 3.1.1. Let us define the mapping v : Ω → Ω so that:

v(a) = b, a = (a_0, a_1, . . . , a_{l−1}), b = (b_0, b_1, . . . , b_{l−1}) ⇔ b_i = ⊕_{j=0}^{i} a_j, i ∈ {0, 1, . . . , l − 1}     (3.15)

It is easy to check that v is a one-to-one (bijective) mapping, and its inverse v⁻¹ is given by:

v⁻¹(b) = a, a = (a_0, a_1, . . . , a_{l−1}), b = (b_0, b_1, . . . , b_{l−1}) ⇔ a_i = b_i if i = 0, a_i = b_{i−1} ⊕ b_i if i > 0     (3.16)
where ⊕ denotes the modulo 2 addition (see e.g. [10]). Let us now introduce the Hamming metric in the space of binary genotypes.

Definition 3.7. The Hamming metric on Ω is given by the mapping dH : Ω × Ω → {0, 1, . . . , l} such that for arbitrary genotypes a = (a_0, a_1, . . . , a_{l−1}), b = (b_0, b_1, . . . , b_{l−1}) we have:

dH(a, b) = Σ_{i=0}^{l−1} a_i ⊕ b_i
It will also be easy to check that:

Remark 3.8. Let j_1, j_2 ∈ {0, 1, . . . , 2^l − 1} be two integers and a_1, a_2 ∈ Ω their standard binary representations; then the following implication holds: |j_1 − j_2| = 1 ⇒ dH(v⁻¹(a_1), v⁻¹(a_2)) = 1.

The Gray encoding mapping will be given by the composition:

code_G : Ω → D;   code_G = code_a ∘ v.     (3.17)

The inverse Gray encoding is a partial function which will also be given by a composition:

dcode_G : D → Ω;   dcode_G = v⁻¹ ∘ dcode_a.     (3.18)
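The mappings v, v⁻¹ and the Hamming metric of Definition 3.7 can be sketched and the one-bit-neighbor property checked numerically; the helper `bits`, producing standard binary representations, is ours:

```python
def v(a):
    """Prefix-XOR map: b_i = a_0 xor ... xor a_i."""
    b, acc = [], 0
    for ai in a:
        acc ^= ai
        b.append(acc)
    return b

def v_inv(b):
    """Inverse map: a_0 = b_0, a_i = b_{i-1} xor b_i for i > 0."""
    return [b[0]] + [b[i - 1] ^ b[i] for i in range(1, len(b))]

def hamming(a, b):
    """Hamming distance of two binary strings (Definition 3.7)."""
    return sum(x ^ y for x, y in zip(a, b))

def bits(j, l):
    """Standard binary representation of j on l bits, most significant first."""
    return [(j >> (l - 1 - k)) & 1 for k in range(l)]

# Images of adjacent integers under v_inv differ in exactly one bit.
ok = all(hamming(v_inv(bits(j, 4)), v_inv(bits(j + 1, 4))) == 1
         for j in range(15))
```

The check confirms that v⁻¹ sends the binary representations of adjacent integers to strings at Hamming distance 1, which is exactly the smoothing of "Hamming cliffs" that makes Gray encoding attractive for continuous problems.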
Both mappings code_G and dcode_G satisfy all the conditions imposed on the binary affine encoding and the inverse binary affine encoding in Remark 3.6. The Gray encoding may be helpful in "dulling" the discrepancies between the Hamming distance of two genotypes and the distance of the corresponding phenotypes (frequently called "Hamming cliffs") that appear in the affine binary encoding. This feature makes the Gray encoding satisfy the condition 3.3 more closely, which is helpful when solving continuous problems.

3.1.3 Phenotypic encoding

This is the trivial case of individual representation applied in many instances of evolutionary algorithms (EA). In order to define this type of coding we distinguish an affine isometry between R^N and the search space, I : R^N → V. If V = R^N then I will be the identity mapping. The genetic universum will be given by:

U = I⁻¹(D)     (3.19)

The role of the encoding operator will be played by I|U and of the inverse encoding by I⁻¹|D. Please note that I is a bijection (a one-to-one mapping between R^N and V) and I|U is a bijection between U and D. The above mappings satisfy all the conditions demanded by the definitions 3.1, 3.2. In most instances of evolutionary algorithms studied later we will identify U with D. The important difference with respect to the previous, binary case is that the genetic universum (the set of genetic codes) is infinite and uncountable (its cardinality is that of the continuum) (see e.g. Schwartz [162]).
3.2 Objective and fitness

During the genetic optimization process crucial operations are performed on the individual codes, the elements of the genetic universum U (or Ω in the case of binary encoding). As a natural consequence, the objective function that evaluates potential solutions has to be transported in some way to the new domain U. The new function f : U → [0, M], traditionally called the fitness function, that evaluates genotypes has to be strongly correlated with the objective Φ : D → [0, M]. Perhaps the simplest possibility is to define:

f(x) = Φ(code(x)) ∀x ∈ U
(3.20)
Sometimes we would like to change the selection pressure with respect to the worst fitted individuals. Such a feature may be obtained by the linear or non-linear scaling of fitness with respect to the original objective value by using a strongly monotone "scaling" function (see e.g. Goldberg [74], Bäck, Fogel, Michalewicz [15], [10]):

Scale : [0, M] → [0, M̄]     (3.21)
where M̄ < +∞ is the new upper bound for the individual evaluation. Using the scaling function we obtain:

f(x) = Scale(Φ(code(x))) ∀x ∈ U
(3.22)
In the next parts of this book we do not distinguish between the constants M and M̄, treating them as a single, generic constant M. Other, more sophisticated ways to define the fitness mapping will be described in Sections 5.3.4, 5.3.5.
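A minimal sketch of the composition 3.22 with a linear scaling function; the particular choices of Scale, Φ, code and all names below are illustrative only:

```python
def make_fitness(phi, code, scale):
    """Compose the fitness as f(x) = Scale(Phi(code(x)))."""
    return lambda x: scale(phi(code(x)))

# Illustrative choices: Phi maps into [0, M] with M = 100, and a
# strictly monotone linear Scale maps [0, 100] onto [0, 1].
phi = lambda y: y ** 2           # objective on phenotypes
code = lambda x: x + 1           # a toy encoding U -> Dr
scale = lambda t: t / 100.0      # Scale: [0, 100] -> [0, 1]

f = make_fitness(phi, code, scale)
```

Because Scale is strictly monotone, the ordering of individuals by fitness coincides with their ordering by objective value; only the selection pressure between them changes.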
3.3 The individual and population models

Genetic algorithms generally fall into the schema of stochastic population search described in Section 2.2 and roughly presented in Figure 2.2. Because genetic algorithms always represent the design variables from the admissible set D (exactly, from its subset Dr) as genetic codes from U by using the encoding function (see Section 3.1), we will consider the populations {P_t}, t = 0, 1, . . . as multisets of clones from the genetic universum U rather than from the set of phenotypes Dr. Population members in populations processed by genetic algorithms are called individuals. The consecutive iterations of the genetic algorithm, indexed by t = 0, 1, 2, . . ., will be called genetic epochs or simply epochs. According to the definition 2.8 we will represent P_t as a pair (U, η_t), t = 0, 1, . . . where η_t stands for the occurrence function in the t-th genetic epoch. If the population P_t is finite and a distinction among clones of the same genotype is necessary, we will represent it in the set-like form postulated in Remark 2.15. Summing up, the individual x, being a member of the population P_t in an arbitrary genetic epoch t, is characterized by the triple:

x ∈ supp(η_t) ⊂ U : the individual's genotype
code(x) ∈ Dr ⊂ D : the individual's phenotype     (3.23)
f(code(x)) ∈ [0, M] ⊂ R₊ : the individual's fitness

The set supp(η_t) stands for the genetic material of the population P_t. In the case of the more extended genetic mechanisms (see e.g. Beyer [26]) discussed in Chapter 5, the basic representation of the individual x is enriched by a vector s containing a finite number of parameters, which controls the genetic operations transforming the individual x to the next genetic epoch (see Section 5.3.1). Because the vectors s may be different for two individuals of identical genotype, the set-like representation of P_t delivered by the formula 2.16 has to be used in this case.
3 Basic models of genetic computations
3.4 Selection

The first operation that starts to transform the population Pt in the tth epoch, according to the scheme of stochastic search 2.2, is the selection operation. We briefly describe four kinds of selection: proportional (roulette) selection, tournament selection, elitist selection and rank selection. Selection is mainly a random operation (except the pure elitist one) and may be modeled as multiple sampling from the population Pt, although its implementation may take quite different forms. Roughly speaking, the sampling probability distribution of selection self will be an element of the space of probabilistic measures on Pt:

self ∈ M(Pt).    (3.24)
The lower index f informs us that it always depends on the fitness function f in some way. The detailed form of self depends on the population cardinality and representation. We will present three forms of the selection sampling measure, denoted by self i, i = 1, 2, 3. In future considerations we will use the same generic description self for all representations; the detailed meaning of this symbol will depend on the particular context. In the case of the finite population #Pt = µ < +∞ we may use the set-like representation (see Remark 2.15)

( z1, . . . , z1, z2, . . . , z2, . . . , zχ, . . . , zχ )    (3.25)
  [η(z1) times]  [η(z2) times]       [η(zχ) times]

where supp(ηt) = {z1, . . . , zχ} ⊂ U, χ ≤ µ < +∞. The finite distribution

self1 = {p1, . . . , pµ}    (3.26)

may distinguish between two individuals with the same genotype, but this possibility is rarely applied in practice. The probability pi is assigned to the sampling of the ith element from the list 3.25; note that the order of elements in 3.25 is crucial in this case. In the same case #Pt = µ < +∞, if the distinction among clones of the same genotype is not necessary, we may process selection as multiple sampling from the set supp(ηt). The selection sampling measure will also be finite in this case:

self2 = {p1, . . . , pχ} ∈ M(supp(ηt))    (3.27)

where pi denotes the probability of sampling the genotype zi ∈ supp(ηt). The probability of selecting a single individual from Pt with the genotype zi ∈ supp(ηt) equals

pi / ηt(zi)    (3.28)
and is identical for all ηt(zi) such individuals of the same genotype zi. Neither of the above representations self1, self2 depends on the cardinality of the genetic universum U. If now #U = r < +∞, we may easily extend the measure given by formula 3.27 to the discrete measure defined on the whole genetic space U by the formula

self3 = {p1, . . . , pr} ∈ M(U);  self3({x}) = { self2({x})  for x ∈ supp(ηt)
                                              { 0           otherwise    (3.29)

Selection may be understood formally as multiple sampling from U with the sampling measure 3.29 in this case. If now #Pt = +∞ but #U = r < +∞, we may use the population representation postulated by Remark 2.12. Both forms of the selection sampling measures self2, self3 given by formulas 3.27, 3.29 remain valid, except formula 3.28, which evaluates the probability of sampling a single individual and does not make sense in this case.

3.4.1 Proportional (roulette) selection

Proportional selection may be explained as multiple sampling from supp(ηt) with the probability distribution self ∈ M(supp(ηt)) given by the formula

self({x}) = f(x) ηt(x) / Σ_{y∈supp(ηt)} f(y) ηt(y),  x ∈ supp(ηt)    (3.30)

The denominator Σ_{y∈supp(ηt)} f(y) ηt(y) on the right hand side of formula 3.30 stands for the total fitness of the population Pt. The meaning of the attribute "proportional" is immediately seen from formula 3.30: the probability of sampling x ∈ supp(ηt) is proportional to its impact on the total population fitness, expressed by the product of its fitness f(x) and the number ηt(x) of individuals of the genotype x in the population Pt. The meaning of the attribute "roulette" comes from the standard implementation, which consists of sampling a single point from the range [0, 1] ⊂ R divided into #supp(ηt) parts according to the distribution 3.30. The location of the sampled point may be identified with the final position of a roulette wheel with unit perimeter. Roulette (proportional) selection is one of the few cases in which the sampling probability distribution self is available explicitly. It is also easy to see that if f|supp(ηt) > 0, then each individual from the population Pt has a chance to pass to the next step of a genetic algorithm.
3.4.2 Tournament selection

Single sampling in this kind of selection is performed in two steps:

1. Selecting in some way a finite sub-multiset (see Definition 2.10) of the population Pt. The selected multiset of cardinality k ≤ #Pt, k < +∞ is called the tournament mate.
2. Selecting one of the best fitted individuals from the tournament mate already drawn.

The typical size of the tournament mate is k = 2 (a couple of individuals that will take part in the tournament), and the usual methods of mate selection are simple k-time sampling without replacement according to the uniform probability distribution on the whole Pt, or multiple sampling from supp(ηt) according to the distribution 3.30 (see e.g. Goldberg [74]). The explicit evaluation of the selection sampling probability distribution is rather complicated in this case. Such an evaluation is delivered by Vose [193], under some rather restrictive assumptions, for the case of binary genotypes (binary encoding).

3.4.3 Elitist selection

The main idea of elitist selection is to ensure the passage of the best fitted individuals from the current population Pt to the next steps of the algorithm. The sampling procedure may be formalized as a two-phase one:

1. We select in a deterministic way the sub-multiset Elite ⊂ Pt which gathers the best fitted individuals from the current population. Elite is passed to the next epoch population with probability 1.
2. Other individuals are selected according to other, probabilistic rules, e.g. the proportional selection rule described in Section 3.4.1.

The most popular case of the elite set is the singleton Elite = {ẑ}, where ẑ ∈ Pt; f(ẑ) ≥ f(x) ∀x ∈ Pt. Because of the partially deterministic way of sampling, it is difficult to express elitist selection in terms of multiple sampling with a prescribed probability distribution.
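Both schemes above admit very short sketches: the two-step tournament sampling, and the deterministic elite phase of elitist selection. The population is a plain list of genotypes (the set-like representation 3.25); the helper names are illustrative.

```python
import random

# Sketches of tournament selection (two-step sampling) and of the
# deterministic phase of elitist selection.

def tournament_select(population, f, k=2, rng=random):
    mate = rng.sample(population, k)   # step 1: mate drawn without replacement
    return max(mate, key=f)            # step 2: a best-fitted mate member wins

def elite(population, f, size=1):
    """The `size` best-fitted individuals pass with probability 1."""
    return sorted(population, key=f, reverse=True)[:size]
```

With `k` equal to the whole population size, the tournament degenerates to picking a globally best individual.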
3.4.4 Rank selection

Rank selection establishes an arbitrary way to designate the sampling probabilities sel({z}), z ∈ Pt, by defining their ranks R(z), z ∈ Pt. The set-like population representation (see formula 3.25) allows the most general rank assignment in the case of finite populations #Pt = µ < +∞. Perhaps the simplest rule of rank assignment is based on an arbitrary total order ⪰ in Pt, represented as the list 3.25 of individuals, which satisfies the condition:

∀y, z ∈ Pt  y ⪰ z ⇒ f(y) ≥ f(z)    (3.31)

The rank assignment is performed in the following steps:

1. R(z) = R0 for the lowest element z in Pt with respect to the order (∀y ∈ Pt, y ≠ z: y ⪰ z).
2. R(x) = R(y) + 1 if y is the immediate predecessor of x in Pt with respect to the order (x ⪰ y and ¬{∃w ∈ Pt; x ⪰ w ⪰ y, w ≠ x, w ≠ y}).

In this case the rank assignment mapping

R : Pt → [R0, R0 + µ] ∩ Z+    (3.32)

is one-to-one (a bijection). Another case of rank assignment, which may lead to rank ambiguity, can be obtained by replacing the second step of the above algorithm with the following one:

2. If y is the immediate predecessor of x in Pt with respect to the order, then R(x) = R(y) + 1 if f(x) > f(y), and R(x) = R(y) if f(x) = f(y).

We may find various formulas that allow us to compute the selection sampling probabilities from the given rank distribution 3.32. Two sample formulas of this type are quoted after the monograph [5]:

self({z}) = a + k (1 − R(z)/Rmax)
self({z}) = a + k (Rmax − R(z))^b    (3.33)

where Rmax = max_{z∈Pt} R(z) and the parameters a, k, b ∈ R are chosen so that the probability distribution self is well defined:

Σ_{z∈Pt} self({z}) = 1,  0 ≤ self({z}) ≤ 1  ∀z ∈ Pt.    (3.34)
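The linear variant of formula 3.33 can be sketched as follows. In this sketch ranks are assigned along a fitness-sorted list (the bijective variant, with R0 = 1 for the worst individual); the concrete values of a and k used in the usage example are hand-picked assumptions that make the probabilities sum to 1, with k < 0 so that better-fitted individuals receive larger probabilities.

```python
# A sketch of rank selection probabilities: bijective rank assignment along an
# ascending-fitness order, then the linear formula
# sel_f({z}) = a + k * (1 - R(z)/Rmax)  (cf. formula 3.33).
# The normalization of a and k is left to the caller (condition 3.34).

def rank_probabilities(population, f, a, k):
    ordered = sorted(population, key=f)       # worst first: rank R0 = 1
    Rmax = len(ordered)
    probs = {}
    for rank, z in enumerate(ordered, start=1):
        probs[z] = a + k * (1.0 - rank / Rmax)
    return probs
```

For a population of four individuals, a = 0.325 and k = -0.2 satisfy condition 3.34 exactly.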
3.5 Binary genetic operations

Genetic operations are used to produce genotypes of new individuals from one or more genotypes of individuals which come from the current populations. Binary genetic operations are especially designed to operate on binary genotypes, the elements of Ω (see formula 3.5). Their implementations may be formalized as mappings Ω^p → Ω, p ∈ N. Genetic operations have a stochastic character, so their values are random variables and a more accurate description may be expressed as follows:

Ω^p → M(Ω), p ∈ N.    (3.35)

Our goal will be to present the probability distributions from M(Ω) associated with the above variables, as well as selected implementation issues. We will present a detailed description of various types of binary mutation and crossover, for which p = 1, 2.
3.5.1 Multi-point mutation

Binary multi-point mutation consists of producing the binary code x′ ∈ Ω from the code x ∈ Ω by using the mutation mask i ∈ Ω according to the formula

x′ = x ⊕ i.    (3.36)

The point-wise, modulo 2 addition of the mask to the parental code results in bit inversion in x on the positions that coincide with the positions of ones in the mask code i. The mask i = (a0, a1, . . . , al−1) is sampled independently for the mutation of each particular parental code x and independently of the value of this code. Each bit aj, j = 0, . . . , l − 1 of the mask string is sampled independently of the other bits during the single mask sampling procedure. The probability of sampling 1 in the single bit sampling is assumed to be constant and equals pm. Taking the above assumptions into account, the probability ξi of sampling the particular mask i ∈ Ω can be computed using the classical Bernoulli model of independent trials with a binary result (see e.g. Billingsley [28]):

ξi = (pm)^(1,i) (1 − pm)^(l−(1,i))    (3.37)

where 1 = (1, . . . , 1) (l times) denotes the binary vector of l units and (1, i) the Euclidean scalar product of the binary vectors 1 and i. Multi-point binary mutation then has the single parameter pm ∈ [0, 1], called the mutation rate. Assuming that the parental code x ∈ Ω is changed by binary multi-point mutation, the probability distribution of its new value mutx ∈ M(Ω) is given by the formula

mutx({x′}) = Σ_{i∈Ω} ξi [x ⊕ i = x′]    (3.38)

where [·] denotes the binary evaluation of the logical expression, so that

[w] = { 1  if w is true
      { 0  if w is false    (3.39)
3.5.2 Binary crossover

Binary crossover is the operation that produces the binary code z ∈ Ω from two parental codes x, y ∈ Ω by using the binary string i ∈ Ω called the crossover mask. The code z is sampled from the pair of two "children"

(y ⊗ i) ⊕ (î ⊗ x),  (x ⊗ i) ⊕ (î ⊗ y)    (3.40)

with the uniform probability distribution (1/2, 1/2). The binary vector î = 1 ⊕ i ∈ Ω stands for the inversion of the mask i. The action of the mask in formula 3.40 results in exchanging bits among the parental codes on the positions on which 1 appears in the mask code. The crossover mask i is sampled from Ω according to the probability distribution (ζ0, ζ1, . . . , ζr−1) ∈ M(Ω) given by the formula

ζi = { pc typei            i > 0
     { 1 − pc + pc type0   i = 0    (3.41)

where the vector type = (type0, type1, . . . , typer−1), Σ_{i∈Ω} typei = 1, typei ≥ 0 ∀i ∈ Ω, constitutes another probability distribution in the space M(Ω), called the crossover type. The number pc ∈ [0, 1] will be called the crossover rate. For the most classical one-point crossover introduced by Holland [85] the crossover type takes the form

typej = { 1/(l − 1)   if ∃k ∈ (0, l) ∩ Z; j = 2^k − 1
        { 0           otherwise    (3.42)
It is easy to see that the above distribution assigns the non-zero, equal probability 1/(l − 1) to each string of the form

j = 2^k − 1 = (0, . . . , 0, 1, . . . , 1)   [l − k zeros followed by k ones],  k = 1, 2, . . . , l − 1.    (3.43)

Other strings are assigned zero probability. Coupling formulas 3.41 and 3.42 we obtain the mask probability distribution for one-point crossover

ζi = { 1 − pc       i = 0
     { pc/(l − 1)   i = 2^k − 1, k = 1, 2, . . . , l − 1    (3.44)
     { 0            otherwise

The probability 1 − pc is assigned to the single mask i = 0, which performs trivial crossover (simply no crossover: the resulting string z is sampled from the parents x, y only). Note that the mask i = 1, which also trivially transforms parents to children, has zero sampling probability. All masks of the type 3.43 have the same non-zero probability pc/(l − 1) and refer to the non-trivial crossover operation. This operation results in tearing the parent strings between the k and k + 1 positions and then exchanging the obtained parts between them. The formal description presented above was introduced by Vose (see e.g. [193]) and allows us to flexibly handle other types of binary crossover met in
practice. One important example is the uniform crossover, for which the type distribution is given by

typei = 2^(−l), ∀i ∈ Ω    (3.45)

The trivial crossover masks i = 0 and i = 1 now have the total sampling probability 1 − (1 − 1/2^(l−1)) pc, while each other mask i ∈ Ω, i ≠ 0, i ≠ 1, has the equal probability pc/2^l. The binary crossover operation then has two parameters: the crossover rate pc ∈ [0, 1], which controls the overall crossover intensity, and type ∈ M(Ω), which specifies the kind of bit-exchanging operation among the parents. The probability distribution crossx,y ∈ M(Ω) that determines the stochastic result of crossing the strings x, y ∈ Ω is given by the formula

crossx,y({z}) = Σ_{k∈Ω} ((ζk + ζk̂)/2) [(x ⊗ k) ⊕ (k̂ ⊗ y) = z]    (3.46)
3.5.3 Features of binary genetic operations, mixing

In this section we will concentrate on the crucial features of binary genetic operations, which allow us to formulate rigorous mathematical models of genetic algorithms that handle binary encoding and may facilitate their implementation. All results are taken from, or strictly based on, Vose's monograph [193].

Definition 3.9. The binary mutation described in Section 3.5.1 will be called independent if the probability ξj of sampling an arbitrary mutation mask j ∈ Ω satisfies the condition

ξj = ( Σ_{k⊗i=0} ξ_{i⊕j} ) ( Σ_{k̂⊗i=0} ξ_{i⊕j} )  ∀k ∈ Ω    (3.47)

Theorem 3.10. (see Vose [193], Theorem 4.1) If the mutation mask is sampled with the probability distribution given by formula 3.37, then the mutation is independent.

Mutation independence affects the coupling of crossover and mutation in the following way.

Theorem 3.11. (see Vose [193], Theorem 4.2) If the binary mutation is independent, then, for an arbitrary binary crossover, the two following events have the same probability:

• obtaining the string z ∈ Ω by crossing x′ and y′, where x′, y′ result from the mutation of the codes x, y respectively;

• obtaining the string z ∈ Ω by mutating the result of the crossover of the strings x, y.
The next two theorems show the common probabilistic effect of the mutation and crossover composition without any additional assumptions concerning mutation.

Theorem 3.12. (see Vose [193], Theorem 4.3) If we mutate two strings and next cross their offspring, then the common probability distribution mx,y ∈ M(Ω) of obtaining a new string z from the parental strings x, y is given by the formula

mx,y({z}) = Σ_{i,j,k∈Ω} ξi ξj ((ζk + ζk̂)/2) [((x ⊕ i) ⊗ k) ⊕ (k̂ ⊗ (y ⊕ j)) = z]    (3.48)

If mutation is applied to x, y which are already crossed, then

mx,y({z}) = Σ_{j,k∈Ω} ξj ((ζk + ζk̂)/2) [((x ⊗ k) ⊕ (k̂ ⊗ y)) ⊕ j = z]    (3.49)

Theorem 3.13. (see Vose [193], Theorem 4.3) Both probability distributions, given by formulas 3.48 and 3.49, satisfy the conditions

mx,y({z}) = my,x({z}) = mx⊕z,y⊕z({0})    (3.50)

where 0 ∈ Ω stands for the binary vector of l zeros.

Remark 3.14. If the binary mutation is independent, then both probability distributions given by formulas 3.48, 3.49 are equivalent, which is the result of Theorem 3.11. It holds, in particular, for multi-point mutation with the mask sampling according to formula 3.37 (see Theorem 3.10) and binary crossover, for arbitrary operation parameters pm, pc and type.

Definition 3.15. If the mutation is independent, then we may define the operation called mixing as the composition of mutation and crossover. The order of the component operations does not affect its probabilistic result, so the probability distribution of mixing may be computed using either of the formulas 3.48, 3.49.
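A sketch of the mixing operation of Definition 3.15, composing multi-point mutation with one-point crossover (mutation first, as in formula 3.48); when mutation is independent the opposite order would yield the same distribution. The helper names are illustrative and the operators are inlined for self-containment.

```python
import random

# A sketch of mixing: mutate both parents, then cross the mutated offspring
# with a one-point crossover mask (cf. formula 3.48).

def random_mask(l, p, rng):
    """l-bit mask, each bit set independently with probability p."""
    m = 0
    for bit in range(l):
        if rng.random() < p:
            m |= 1 << bit
    return m

def mix(x, y, l, pm, pc, rng=random):
    x1 = x ^ random_mask(l, pm, rng)   # mutate parent x
    y1 = y ^ random_mask(l, pm, rng)   # mutate parent y
    if rng.random() < pc:              # non-trivial one-point crossover mask
        k = rng.randrange(1, l)
        cmask = (1 << k) - 1
    else:
        cmask = 0
    inv = ((1 << l) - 1) ^ cmask
    child_a = (y1 & cmask) | (x1 & inv)
    child_b = (x1 & cmask) | (y1 & inv)
    return child_a if rng.random() < 0.5 else child_b
```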
3.6 Definition of the Simple Genetic Algorithm (SGA)

The Simple Genetic Algorithm (SGA) is a method of transforming a binary-represented population Pt into the next epoch population Pt+1. Both samples are multisets of binary strings (genotypes) from the binary genetic universum Ω (see formula 3.5 in Section 3.1.1). We have to fix five parameters of SGA: the population size µ ∈ N, the fitness function f, the mutation and crossover rates pm, pc ∈ [0, 1], and the crossover type type ∈ M(Ω), which is a probabilistic vector of length r. All parameters remain unchanged during the whole computation. The algorithm consists of running, in a finite loop, the three random operations described in items 2–4 until the condition contained in item 5 is satisfied.

1. Create an empty population Pt+1.
2. Select two individuals x, y from the current population Pt by multiple sampling according to the probability distribution 3.30 (proportional, roulette selection).
3. Produce the binary code z ∈ Ω from the selected x, y ∈ Ω using mixing with the probability distribution 3.48 (or 3.49).
4. Put z into the next epoch population Pt+1.
5. If Pt+1 contains fewer than µ individuals, go to 2.

SGA is one of the few instances of genetic computations for which the probability distribution of sampling the next epoch population can be delivered explicitly. This problem will be discussed in Chapter 4.
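The five-step loop above can be condensed into a short sketch of a single SGA epoch; the concrete operator implementations (roulette selection, one-point crossover, multi-point mutation) follow the earlier formulas, and all names are illustrative.

```python
import random

# A compact sketch of one SGA epoch for l-bit codes held as ints; all
# parameters (f, l, pm, pc) stay fixed for the whole run, as SGA requires.

def sga_epoch(pop, f, l, pm, pc, rng=random):
    total = sum(f(x) for x in pop)
    def roulette():                                # distribution 3.30
        u = rng.random() * total
        acc = 0.0
        for x in pop:
            acc += f(x)
            if u < acc:
                return x
        return pop[-1]
    def mask(p):                                   # multi-point mutation mask
        return sum(1 << b for b in range(l) if rng.random() < p)
    nxt = []
    while len(nxt) < len(pop):                     # steps 1-5 of the loop
        x, y = roulette(), roulette()              # proportional selection
        if rng.random() < pc:                      # one-point crossover
            k = rng.randrange(1, l)
            cm = (1 << k) - 1
            x, y = (y & cm) | (x & ~cm), (x & cm) | (y & ~cm)
        z = x if rng.random() < 0.5 else y
        nxt.append(z ^ mask(pm))                   # multi-point mutation
    return nxt
```

Note that the fitness must be strictly positive, so that the roulette sampling of formula 3.30 is well defined.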
3.7 Phenotypic genetic operations

Phenotypic genetic operations are used in genetic computation models in which the genetic universum U is simply the admissible set D, a subset of the finite dimensional space V usually identified with R^N (see Sections 2.1, 3.1.3). Basic versions of most phenotypic operations have the form of simple geometric formulas which utilize the multidimensional normal probability distribution (see e.g. [28]). Binary encoded genetic algorithms search only in D, by checking the discrete subset Dr ⊂ D of phenotypes; they meet the constraints imposed by the global optimization problem, which restrict its solution to the set D. For genetic algorithms with phenotypic encoding (frequently called evolutionary algorithms) the problem of restricted search in the bounded domain D is usually non-trivial (see e.g. Arabas [5], Bäck, Fogel, Michalewicz [15]). Such a problem may be solved by modifying the objective function (adding a penalty) or by modifying the genetic operations in such a way that they return only individuals representing admissible points in D. Some remarks concerning constrained genetic search with phenotypic encoding will be given in Section 3.7.3. Readers interested in this matter are referred to the books [5, 15, 11, 110].

3.7.1 Phenotypic mutation

Mutation stands for the basic, most important genetic operation in evolutionary algorithms. One of the simplest and most frequently used kinds of phenotypic mutation is called normal phenotypic mutation (see e.g. Schwefel [163]). It creates the offspring individual x′ from the parental one x ∈ U = D by the formula

x′ = x + N(0, C)    (3.51)
where N(0, C) is the realization (result of sampling) of the N-dimensional random variable that has the multidimensional normal distribution with mean 0 ∈ V and the N × N covariance matrix C, which constitute the parameters of this operation. Please note that, because the normal perturbation N(0, C) may take an arbitrary value in V with positive probability, x′ may also lie anywhere in V, regardless of whether the parent x is in D or not. Many other mutation models are based on formula 3.51 of genotype modification, changing only the probability distribution of the perturbation, which is here N(0, C). A review and detailed discussion of such solutions may be found in the broadly cited monographs [5, 15, 11, 110]. One new and interesting way of improving perturbation distributions in many dimensions was shown by Obuchowicz [119], who tries to eliminate the "surrounding effect" that appears with normal mutation. This effect consists of the concentration of offspring x′ at a distance from x that approximately equals the mean eigenvalue of C. A completely different type of mutation that may be applied in evolutionary algorithms falls into the group of Lamarckian operations, which use a local optimization method for the individual perturbation. Such an approach will be described and roughly discussed in Section 5.3.4.

3.7.2 Phenotypic crossover

The simple phenotypic crossover rule, frequently called arithmetic crossover, may be given after Wright [203]:

x′ = x1 + U[0, 1] (x2 − x1)    (3.52)

where x1 = (x11, . . . , x1N), x2 = (x21, . . . , x2N) ∈ D are the parental individuals and U[0, 1] is the realization (result of sampling) of the one-dimensional random variable with the uniform distribution over the interval [0, 1]. Using the above formula we obtain the child x′ located on the segment joining x1 and x2. This formula ensures the coherency of crossover (x′ ∈ D for all parents x1, x2 ∈ D) if D is convex.
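The two operations just described can be sketched as follows; for simplicity the sketch assumes a diagonal covariance C = σ²I (so each coordinate is perturbed independently), and the function names are illustrative.

```python
import random

# Sketches of normal phenotypic mutation x' = x + N(0, C) (formula 3.51, with
# the simplifying assumption C = sigma^2 * I) and of arithmetic crossover
# (formula 3.52) with a single U[0,1] coefficient shared by all coordinates.

def normal_mutation(x, sigma, rng=random):
    return [xi + rng.gauss(0.0, sigma) for xi in x]

def arithmetic_crossover(x1, x2, rng=random):
    u = rng.random()                  # one U[0,1] sample for every coordinate
    return [a + u * (b - a) for a, b in zip(x1, x2)]
```

Because a single coefficient u is shared by all coordinates, the child of `arithmetic_crossover` always lies on the segment joining the parents.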
Arabas [5] delivers an alternative to formula 3.52:

x′i = x1i + U[0, 1] (x2i − x1i),  i = 1, . . . , N    (3.53)

in which a separate realization of the random variable U[0, 1] is used in each direction. The resulting child individual x′ will be located in the N-dimensional interval (box) spanned by the parents x1, x2, so it is not necessarily included in D even if the admissible set is convex. Another possibility, proposed by Michalewicz [111], uses geometric means of the parental coordinates in order to obtain the child coordinates:

x′i = √(x1i x2i),  i = 1, . . . , N    (3.54)
Note that this operation is deterministic with respect to the parents already selected. The above operation is called geometric crossover. Both arithmetic and geometric crossover may easily be extended to operations that cross k > 2 parents x1, x2, . . . , xk ∈ D (see e.g. [15]). In particular, the barycentric combination of parents

x′ = Σ_{i=1}^{k} αi xi,  αi ≥ 0, i = 1, . . . , k,  Σ_{i=1}^{k} αi = 1    (3.55)

extends the arithmetic crossover 3.52, while

x′i = Π_{j=1}^{k} (xji)^(αj),  i = 1, . . . , N    (3.56)

extends the geometric crossover 3.54. The generic coefficients α1, . . . , αk may be set in a deterministic or random way. Another possibility for defining multi-parent crossover is the simplex crossover introduced by Renders and Bersini [135]. The worst fitted individual x̌ and the best fitted individual x̂ among the parent mate x1, x2, . . . , xk ∈ D are selected. Next, the centroid c of the parent mate without x̌ is computed. The offspring is the reflection of x̂ with respect to the centroid:

x′ = c + (c − x̂)    (3.57)
Fitness-based multi-parent crossover was also utilized by Eiben et al. [63].

3.7.3 Phenotypic operations in constrained domains

As we mentioned at the beginning of Section 3.7, phenotypic genetic operations do not generally satisfy the coherency condition Pr{x′ ∈ D} = 1 for parents taken from D. The following ways of handling the constraints imposed by the global optimization problem are possible:

1. Standard modification of the constrained global optimization problem to an unconstrained one, by transformation of coordinates or by introducing a penalty function. This approach is described in many monographs, e.g. [66, 5].
2. A special kind of encoding for which the assumed genetic operations are coherent. Such a condition is always satisfied if U = R^N. This method mainly falls into the case of coordinate transformation mentioned previously.
3. Special kinds of genetic operations that do not exceed the admissible domain D. Examples of such operations (boundary mutation) may be found in the books [110, 5].
4. Additional genetic operations, called repairing operations, which are functions of the type R : V → D. They affect each offspring x′ obtained from the parent(s) in the following way:

R(x′) = { x′        if x′ ∈ D
        { x′′ ∈ D   if x′ ∉ D    (3.58)

where x′′ is the "repaired", admissible individual. The action of the operation consists mainly of projecting x′ onto the boundary ∂D according to prescribed projection rules. Such a rule usually results in an enormous growth of the number of boundary individuals, which may decrease the exploration skill of the evolutionary algorithm under consideration. Another possibility of individual repairing is suggested in the next item.

5. One possible way to create repairing operations that does not increase the number of boundary individuals is to utilize the "internal reflection rule". Let us consider the vector x′ − x1 that joins the offspring with the first of its parents x1. If x′ ∉ D, then there is a point (x′)1 ∈ ∂D which is the first boundary point in the direction x′ − x1. We compute a new point (x′′)1 so that the vector (x′′)1 − (x′)1 lies in the reflection direction, satisfying the condition

⟨ ((x′′)1 − (x′)1) / ‖(x′′)1 − (x′)1‖ , n ⟩ = ⟨ ((x′)1 − x′) / ‖(x′)1 − x′‖ , n ⟩

where n is the internal normal versor of the boundary ∂D at the point (x′)1. Moreover,

‖(x′′)1 − (x′)1‖ + ‖(x′)1 − x1‖ = ‖x′ − x1‖.

If (x′′)1 ∈ D then x′′ = (x′′)1. If not, then the procedure is repeated until the consecutive jth point satisfies (x′′)j ∈ D. The determination of the reflection points (x′′)i ∈ ∂D, i = 2, . . . , j − 1 and of the normal vectors at these points is necessary. If for some (x′′)i the normal versor is not unambiguously determined, then we set the reflection direction as opposite to the incoming direction. The repairing procedure should act properly in the case of admissible domains with a Lipschitz boundary ∂D (see Section 2.1). A broad review of the methods of genetic constrained optimization may be found in the monographs [5, 15, 11, 110].
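For the simplest case of a box-shaped admissible domain, the internal reflection rule reduces to coordinate-wise folding of each coordinate back into its interval. The sketch below covers this special case only (general Lipschitz boundaries require the boundary intersection and normal computations described above), and the function name is illustrative.

```python
# A sketch of a repairing operation (formula 3.58) using the internal
# reflection rule for a box domain D = [low_1, up_1] x ... x [low_N, up_N]:
# in a box the reflection acts coordinate-wise, preserving path length.

def reflect_repair(x, low, up):
    repaired = []
    for xi, lo, hi in zip(x, low, up):
        width = hi - lo
        # fold the excess path back and forth between the two walls
        t = (xi - lo) % (2.0 * width)
        repaired.append(lo + t if t <= width else lo + 2.0 * width - t)
    return repaired
```

An admissible point is left unchanged, while an exterior point is bounced off the nearest wall back into the interior.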
3.8 Schemes for creating a new generation

Genetic algorithms may offer a more general routine for creating the next epoch population Pt+1 than SGA. We may distinguish two phases of such a routine: reproduction and succession (see [5]). The relations among the phases are shown in Figure 3.2.

Fig. 3.2. Producing a new population by reproduction and succession

Reproduction consists of creating the intermediate sample P′t by multiple sampling from the current population Pt. The individuals that belong to P′t will be called parental individuals. Reproduction is performed by using one of the selection techniques described in Section 3.4. The cardinality of the intermediate sample #P′t is one of the important reproduction parameters. Genetic operations process all parental individuals from P′t into offspring individuals that form a new multiset Ot, called simply the offspring. Succession creates the next epoch population Pt+1 by sampling only from the offspring Ot in the case of so-called (µ, λ)-type genetic algorithms, or from the multiset Pt ∪ Ot in the case of so-called (µ + λ)-type algorithms. According to the widely accepted Schwefel notation discussed in the next section, the parameters µ and λ stand for the cardinality of the current and next epoch populations, µ = #Pt = #Pt+1, and the offspring cardinality, λ = #Ot, respectively. Such a rule may be broken in the case of algorithms with a variable life time of individuals, as mentioned in Section 5.3.2. Succession is performed mainly by a proper selection technique, but usually of a different type than in the case of reproduction. The most popular kind of succession is the pure elitist one, in which µ = #Elite; then we select the µ best individuals from Pt ∪ Ot, or only from Ot, with probability 1.
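The succession phase for the pure elitist case can be sketched as follows; the sampling pool is the offspring alone in the (µ, λ) scheme, or parents together with offspring in the (µ + λ) scheme. The function name and the boolean flag are illustrative.

```python
# A sketch of pure elitist succession: the mu best-fitted individuals pass to
# the next epoch population with probability 1; `plus=True` selects the
# (mu + lambda) pool, `plus=False` the (mu, lambda) pool.

def succession(parents, offspring, f, plus=False):
    mu = len(parents)
    pool = parents + offspring if plus else offspring
    return sorted(pool, key=f, reverse=True)[:mu]
```

Note that in the (µ, λ) variant the best parent may be lost, while the (µ + λ) variant never loses it.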
3.9 µ, λ – taxonomy of single- and multi-deme strategies

We try to comprehend here the main idea of the useful notation introduced by Schwefel [163] and then extended by Rechenberg [133] and Rudolph [145]. This notation allows us to distinguish the various types of succession that may appear in the classical scheme of stochastic searches, in which a single sample called the population is processed, as well as in the case in which a structure of demes, sometimes called subpopulations, is processed.

(µ + λ)  Algorithms of this type produce λ offspring individuals and then sample µ individuals from Pt ∪ Ot in order to create Pt+1 in each epoch. The notation probably comes from the cardinality of the sampling domain µ + λ = #(Pt ∪ Ot). Particularly important cases are the (1 + 1) genetic random walk with only the mutation operation, and the (µ + 1) algorithm that reproduces only one individual in each epoch.

(µ, λ)  Algorithms of this type produce λ offspring individuals and then sample µ individuals from Ot in order to create Pt+1 in each epoch. We usually assume that λ ≥ µ in order to prevent degeneration of the genetic material. The case µ = λ = 1 is utilized only by the classical random walk search.

(µ +, λ)  This notation comprehends both groups of algorithms mentioned before (exactly, their union). The kind of succession is arbitrary; only the population size µ and the offspring number λ are crucial.

(µ, κ, λ)  This group gathers strategies of the (µ +, λ) type for which selection is performed by comparing the individual's life time with its maximum value κ. Such strategies will be described in Section 5.3.2.

(µ′ +, λ′ (µ +, λ)^γ)^γ′  This is the group of multi-deme, nested genetic algorithms. The random sample is composed of µ′ demes, each of cardinality µ. They are used to produce λ′ new demes, each of cardinality µ. For each of these new demes, the (µ +, λ) genetic algorithm is processed for γ epochs. The overall multi-deme scheme is processed for γ′ epochs. Multi-deme genetic algorithms will be discussed in Section 5.4.

According to the above taxonomy and notation, the Simple Genetic Algorithm SGA is of the (µ, λ) type. Moreover, µ = λ and reproduction is performed by using proportional selection. Succession is performed by the elitist selection with #Elite = µ (the whole offspring Ot passes to the next population Pt+1).
4 Asymptotic behavior of the artificial genetic systems
This chapter discusses the possibility of a formal analysis of stochastic global genetic searches that utilize genetic-like mechanisms to obtain the consecutive population from the current one. Their asymptotic behavior is of interest, as is usual in the case of iterative algorithms. Most definitions have been introduced, and features proved, for genetic algorithms that can be modeled as Markov chains whose space of states represents all possible populations. Selected results of the Markov approach to the Simple Genetic Algorithm and to special instances of evolutionary algorithms, obtained in the groups led by Michael Vose and Günter Rudolph, are reported in the first, dominating Section 4.1. The study of genetic algorithm sampling measures will be proposed as a new tool for analysis in the case of continuous global optimization problems (see Section 4.1.2). The Markov approach to very small population dynamics is reported in Section 4.2. Moreover, the classical schemata approach to the analysis of the single-step transition of the Simple Genetic Algorithm is revisited in Section 4.3.
4.1 Markov theory of genetic algorithms

The application of the Markov theory of stochastic processes with discrete time to the analysis of genetic algorithms has to be preceded by the definition of the space of states that can fully characterize the progress of the genetic computation. This space will be constructed on the basis of the genetic universum U, which may contain a finite or an infinite, continuous set of codes. One important case is the binary genetic universum Ω, #Ω = r < +∞ (see formula 3.5).

Definition 4.1. The space of states E of the genetic algorithm will be the set to which all populations (or their unambiguous representations) that may be produced by this algorithm belong.

R. Schaefer: Foundations of Global Genetic Optimization, Studies in Computational Intelligence (SCI) 74, 55–113 (2007) © Springer-Verlag Berlin Heidelberg 2007, www.springerlink.com
If the individual is fully characterized by its genotype, which is true for almost all the algorithms described in Chapter 3, then the space of states satisfies the inclusion

E ⊂ U^µ / eqp, 0 < µ < +∞  (4.1)
where µ denotes the finite, common cardinality of all the populations created by the genetic algorithm, and eqp the equivalence relation defined by formula 2.14, in which Z is substituted by U. If the genetic universum is finite, #U = r < +∞ (which in particular holds for the binary universum Ω), then we may identify it with the finite set of integer indices

U ∼ {0, 1, . . . , r − 1}  (4.2)

In this case, each population P ∈ E may be identified with its frequency vector

P ∼ (x_0, x_1, . . . , x_{r−1}); x_i ∈ [0, 1], Σ_{i=0}^{r−1} x_i = 1, x_i = η(i)/µ  (4.3)
where η is the occurrence function of the population, namely P = (U, η) according to the multiset definition 2.8. The above relation shows that, assuming a finite genetic universum, the space of states E may be identified with the finite set X_µ, which is a subset of the (r − 1)-dimensional simplex Λ^{r−1} in R^r.

E ∼ X_µ ⊂ Λ^{r−1} = { x = (x_0, x_1, . . . , x_{r−1}); 0 ≤ x_i ≤ 1, i = 0, . . . , r − 1, Σ_{i=0}^{r−1} x_i = 1 }  (4.4)
Remark 4.2. (see Vose [193]) If µ < +∞ and #U = r < +∞ (the case of a finite population and a finite genetic universum) then the number of possible populations is also finite and equals

n = #X_µ = C(r + µ − 1, µ) < +∞

where C(·, ·) denotes Newton's binomial symbol (C(a, b) = a!/(b!(a − b)!)).

If the population cardinality µ grows, then the number of possible states n = #X_µ increases even if the number of genetic codes remains unchanged, #U = r = const. Because the elements of X_µ are evenly distributed in Λ^{r−1}, the following remark may be drawn.
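The count in Remark 4.2 is easy to verify for small parameters. The sketch below (plain Python; the helper names are illustrative, not from the book) enumerates the states of X_µ as multisets of genotype indices and compares their number with the binomial formula.

```python
from math import comb
from itertools import combinations_with_replacement

def num_states(r, mu):
    # Remark 4.2: number of size-mu multisets over r genotypes
    return comb(r + mu - 1, mu)

def states(r, mu):
    # frequency vectors x in X_mu with x_i = eta(i)/mu (formula 4.3)
    return [tuple(pop.count(i) / mu for i in range(r))
            for pop in combinations_with_replacement(range(r), mu)]

# e.g. r = 4 genotypes (code length l = 2) and populations of size mu = 3
X = states(4, 3)
assert len(X) == num_states(4, 3) == 20
assert all(abs(sum(x) - 1.0) < 1e-12 for x in X)  # every state lies in the simplex
```

Each returned tuple is a point of X_µ ⊂ Λ^{r−1}: its coordinates are multiples of 1/µ summing to one.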
Remark 4.3. (see Nix and Vose [116], Vose [191, 193])

lim_{µ→+∞} X_µ = Λ^{r−1}
The above remark allows us to study abstract genetic algorithms that utilize a finite genetic universum and work with infinite, yet countable populations (#U = r < +∞, #P = #Z). The space of states of such algorithms, E = Λ^{r−1}, is infinite, but bounded and compact in R^r. Let us turn our attention to several characteristic states of such algorithms:

Remark 4.4.
• Vertices x^{(j)} = (0, . . . , 0, 1, 0, . . . , 0) (with 1 at position j), j = 1, . . . , r, of the simplex Λ^{r−1} represent populations (P, η) such that η(j) > 0 and η(k) = 0 for k ≠ j. We sometimes call such populations monochromatic ones.
• The center x^{(S)} = (1/r, . . . , 1/r) of the simplex Λ^{r−1} represents the population (P, η) in which all genotypes are uniformly represented (η(i) = η(j) > 0, ∀i, j ∈ U).

4.1.1 Markov chains in genetic algorithm asymptotic analysis

Each genetic algorithm may be interpreted as a system that transforms the population P_t ∈ E into another one P_{t+1} ∈ E in each genetic epoch t = 0, 1, 2, . . ., thus producing the sequence {P_t}, t = 0, 1, 2, . . . . Because this transformation has a stochastic character, the populations {P_t}, t = 0, 1, 2, . . . may be handled as a family of random variables defined on the common probabilistic space (Ω_P, Σ, Pr) and taking their values in the space of states E. The random sequence is associated with a family of probability distributions {π^t}, t = 0, 1, 2, . . ., which are measures from the space M(E) of probabilistic measures defined over the space of states E. The action of the genetic algorithm in the single genetic epoch t may be explained as one-time sampling from E according to the probability distribution π^t ∈ M(E). More precisely,

∀t = 0, 1, 2, . . . , ∀A ⊂ E, A measurable in E: Pr{P_t ∈ A} = π^t(A).  (4.5)
The sequence of populations produced by a genetic algorithm may be modeled as a stochastic process with discrete time t = 0, 1, 2, . . . and the space of states E.

Remark 4.5. The passage from the population P_t to the next epoch population P_{t+1} is implemented by multiple applications of selection and the proper genetic operations rather than by one-time sampling according to the probability distribution π^t. The effective determination of π^t is rarely possible; even when it is possible, this way is computationally much more expensive.
Remark 4.6. The probability distribution π^t ∈ M(E) can generally depend on all populations P_0, P_1, P_2, . . . , P_{t−1} and on the vector u(t) of parameters that control the genetic operations in the epoch t. This general case will appear for the adaptive strategies described in Chapter 5. For the classical genetic algorithms described in Chapter 3, genetic and selection operations have no memory of previous epochs, i.e. they do not utilize any information from the previous populations P_0, . . . , P_{t−1} when producing P_{t+1}. This feature allows the sequence {P_t}, t = 0, 1, 2, . . . to satisfy the Markov condition

Pr{P_{t+1} ∈ A | P_0, P_1, P_2, . . . , P_t} = Pr{P_{t+1} ∈ A | P_t}, t = 0, 1, 2, . . .  (4.6)
where A is an arbitrary measurable set in E. If the parameters of selection and of all genetic operations are constant (do not change with respect to the epoch number), then the following shift condition is additionally satisfied:

Pr{P_{t+k} ∈ A | P_{s+k}} = Pr{P_t ∈ A | P_s}, ∀k, s ∈ N ∪ {0}, ∀t ∈ N, t > s  (4.7)

The above considerations allow us to observe:

Remark 4.7. If the parameters of the genetic operations (mutation and crossover operation parameters) and the selection parameters are constant, then the classical genetic algorithms described in Sections 3.5, 3.6 and 3.7 can be modeled by a uniform Markov chain with the space of states E, the Markov transition function τ : E → M(E); τ(P_t) = π^{t+1}, t = 0, 1, 2, . . ., and the initial distribution π^0 ∈ M(E) (Pr{P_0 ∈ A} = π^0(A), A ⊂ E a measurable set, see Billingsley [28]). Moreover, both conditions 4.6 and 4.7 hold.
Fig. 4.1. Markov sampling scheme for the classical genetic algorithms.
The scheme of state evolution for the classical genetic algorithms is shown in Figure 4.1. The dotted arrow in this figure represents the passage between two consecutive populations implemented by selection and the stochastic genetic operations.
Remark 4.8. The mapping τ^{(t)} : E → M(E) such that for each measurable set A ⊂ E and P ∈ E

τ^{(t)}(P)(A) = τ(P)(A) for t = 1,
τ^{(t)}(P)(A) = ∫_E τ^{(t−1)}(y)(A) τ(P)(dy) for t > 1

will be called the iterate of degree t of the Markov transition function τ.

Remark 4.9. If the genetic algorithm can be modeled by a uniform Markov chain then for each measurable set A ⊂ E

Pr{P_t ∈ A} = π^0(A) for t = 0,
Pr{P_t ∈ A} = ∫_E τ^{(t)}(y)(A) π^0(dy) for t ≥ 1

where π^0 ∈ M(E) is the initial probability distribution.
Remark 4.10. Let the conditions assumed for the genetic algorithm in Remark 4.7 hold. Moreover, if the transition between consecutive populations P_t, P_{t+1}, t = 0, 1, 2, . . . is performed by the sequential application of k ∈ N selection and genetic operations, then the Markov transition function τ may be composed from the transition functions τ_1, . . . , τ_k associated with the particular operators:

τ(P)(A) = (τ_1 ∘ ⋯ ∘ τ_k)(P)(A) = ∫_E ⋯ ∫_E ( ∏_{j=1}^{k−2} τ_j(y_j)(dy_{j+1}) ) τ_{k−1}(y_{k−1})(dy_k) τ_k(y_k)(A)

where y_1 = P, and y_j, j = 2, . . . , k are the intermediate populations produced by the consecutive operations.

If the space of states of the genetic algorithm is finite, #E = n < +∞, then the Markov transition function τ may be described by a finite number of transition probabilities, which are the entries of the matrix Q of dimension n × n:

(Q)_{P,P'} = τ(P)({P'}), P, P' ∈ E  (4.8)

The probability distributions {π^t}, t = 0, 1, 2, . . . are also discrete and finite in this case (probability vectors of dimension n = #E). The well-known discrete Markov transition rule holds:

τ^{(t)}(P)({P'}) = (Q^t)_{P,P'}, t = 0, 1, 2, . . .  (4.9)

where Q^t denotes the t-th power of the matrix Q.
If a genetic algorithm with a finite number of states produces a new population by the sequential application of k ∈ N operations then, similarly as in Remark 4.10, the probability transition matrix Q can be computed as the product

Q = ∏_{i=1}^{k} Q^{(i)}  (4.10)

where the matrices Q^{(i)}, i = 1, . . . , k are associated with the consecutive operations. If the genetic algorithm utilizes a finite genetic universum #U = r < +∞ and processes finite populations #P = µ < +∞, then the entries of the probability transition matrix Q may be indexed by pairs of states from the set X_µ ⊂ Λ^{r−1}, so

Q = {(Q)_{x,x'}}; x, x' ∈ X_µ ⊂ Λ^{r−1}  (4.11)
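Formulas 4.10 and 4.11, together with the transition rule used later in Section 4.1.2, can be illustrated on a toy chain. The matrices below are hypothetical, column-stochastic examples, chosen only to show that composing per-operation matrices yields a one-epoch transition matrix that preserves probability.

```python
def mat_vec(Q, v):
    # (Qv)_x = sum_y (Q)_{x,y} v_y
    return [sum(Q[x][y] * v[y] for y in range(len(v))) for x in range(len(Q))]

def mat_mul(A, B):
    # standard matrix product: (AB)_{ij} = sum_k A_{ik} B_{kj}
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# hypothetical column-stochastic matrices of two consecutive operations
Q1 = [[0.9, 0.2],
      [0.1, 0.8]]
Q2 = [[0.7, 0.4],
      [0.3, 0.6]]
Q = mat_mul(Q1, Q2)      # composed one-epoch matrix, as in formula 4.10
pi = [1.0, 0.0]          # initial distribution pi^0
for _ in range(50):
    pi = mat_vec(Q, pi)  # evolve the state distribution epoch by epoch
assert abs(sum(pi) - 1.0) < 1e-9   # pi stays a probability vector
```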
The formal approaches presented in this section are cited mainly from the series of papers published by Michael Vose and his collaborators [191, 193, 116], Rudolph [143, 144], Beyer and Rudolph [27], Grygiel [80, 82] and mathematical monographs Feller [65], Chow and Teicher [49], Billingsley [28]. The above results constitute the basis for further research in the following directions: • Determining the Markov transition function τ or the transition probability matrix Q for particular genetic operations and algorithms. • Studying the asymptotic behavior of genetic iterations by studying features of probability transitions in the single step of a genetic algorithm (the passage from one genetic epoch to the consecutive one). • The particular case of the previous item may consist of studying the ergodicity of the Markov chain which models the genetic algorithm behavior. This feature usually implies the asymptotic correctness in the probabilistic sense or the asymptotic guarantee of success (see definitions 2.16, 2.17) as well as the global convergence (see definition 2.18). • Finding the mapping called heuristics for the particular class of genetic algorithms. This mapping is the transition rule of the idealized instance of the algorithm, which processes the infinite populations. Its features deliver important information about the potential searching ability of this class of algorithms. In particular, the existence and stability of the fixed points of heuristics are of great importance (see Section 4.1.2). • Studying the existence and features of the invariant measures on the space of states E, i.e. measures which are limits of the sequences {π t }, t → +∞ of measures that determine the population sampling in consecutive genetic epochs (see Remark 4.7).
• Examining the dynamics of the sampling measures defined on the search domain D, induced by populations (see formula 3.4). The results of such considerations will be helpful in justifying that these algorithms find the central parts of the basins of attraction of the global and local extrema of the objective function (see problem Π_4, Section 2.1).
4.1.2 Markov theory of the Simple Genetic Algorithm

The Simple Genetic Algorithm (SGA) defined in Section 3.6 works with binary genotypes of the form (a_0, . . . , a_{l−1}), a_i ∈ {0, 1}, where the code length l ∈ N is a fixed parameter of each particular SGA instance. The binary codes constitute the genetic universum Ω, so that #Ω = r = 2^l (see Section 3.1.1). Such algorithms, processing populations of the finite, constant cardinality µ, 1 < µ < +∞, have the finite space of states E = X_µ, #X_µ = n < +∞, whose cardinality may be computed using the formula contained in Remark 4.2. States of the SGA may be identified with the population frequency vectors x = (x_0, . . . , x_{r−1}) (see formula 4.3), which unambiguously represent the particular SGA populations if the parameter µ is fixed. The frequency vector entry x_i expresses the fraction of the population x occupied by individuals whose binary code equals i ∈ Ω. The set of states X_µ is contained in the unit (r − 1)-dimensional simplex Λ^{r−1} ⊂ R^r (see formula 4.4). We will also study the idealized, limit case of the SGA in which the genetic universum Ω is finite (r < +∞) but the populations are infinite (µ = #Z), for which the space of states is also infinite, E = Λ^{r−1} (see Remark 4.3).

If the mutation and crossover parameters p_m, p_c and the crossover type vector type (see Section 3.5) are constant, then the Simple Genetic Algorithm that processes finite populations may be modeled by a uniform Markov chain with the space of states X_µ. The Markov kernel (transition function τ) may be characterized by the n × n transition probability matrix Q. Let us denote by π_µ^0 ∈ M(X_µ) the n-dimensional probabilistic vector that characterizes the sampling of the initial population P_0, which is associated with the frequency vector x^0 ∈ X_µ. The Markov transition rule implies that

π_µ^{t+1} = Q π_µ^t, t = 0, 1, 2, . . .  (4.12)

where π_µ^t ∈ M(X_µ), t = 1, 2, . . . is the n-dimensional probabilistic vector whose entries (π_µ^t)_x define the probability of the occurrence of the state x ∈ X_µ (as well as of the population unambiguously assigned to the state x) in the epoch t.

Genetic operator

In the case of the binary genetic universum Ω of the finite cardinality r < +∞, the fitness function f : Ω → [0, M], M < +∞ is represented by the vector of its values

f ∼ (f_0, f_1, . . . , f_{r−1}) ∈ R^r; f_i = f(i), i ∈ Ω.  (4.13)
We will denote by diag(f) the r × r diagonal matrix whose diagonal consists of the entries of f.

Definition 4.11. The proportional selection operator is the mapping

F : Λ^{r−1} → Λ^{r−1}; F(x) = diag(f) x / (f, x).
The scalar product (f, x) appearing in the above formula represents the mean fitness of the population represented by the state x ∈ Λ^{r−1}.

Remark 4.12. Let P = (Ω, η) be an arbitrary SGA population represented by the state x ∈ X_µ; then ∀i ∈ Ω: self3({i}) = (F(x))_i, where self3 is the proportional selection distribution defined by formulas 3.29, 3.30.

The above observation shows that the probability of selecting an individual with the genotype i ∈ Ω from the population represented by the state x ∈ X_µ is equal to the i-th coordinate of the value of the proportional selection operator F(x). It is enough to divide both the numerator and the denominator on the right hand side of formula 3.30 by the population cardinality µ in order to justify this observation. It also justifies the following simple remark.

Remark 4.13. Each value of the proportional selection operator F(x), x ∈ Λ^{r−1}, is a probability distribution on the genetic universum Ω (∀i ∈ Ω: 0 ≤ (F(x))_i ≤ 1 and Σ_{i∈Ω} (F(x))_i = 1, ∀x ∈ Λ^{r−1}), so this operator may also be handled as the mapping F : Λ^{r−1} → M(Ω).

Next, we define the mixing operator that comprehends the results of the mutation and crossover genetic operations.

Definition 4.14. The symmetric matrix M = {m_{i,j}({0})}, i, j ∈ Ω, of dimension r × r, whose entries are defined by formulas 3.48, 3.49, will be called the mixing matrix.

Definition 4.15. The mixing operator is the mapping M : Λ^{r−1} → Λ^{r−1} given by the formula

(M(x))_i = (σ_i x)^T M σ_i x, x ∈ Λ^{r−1}, i ∈ Ω

where σ_i stands for the r × r permutation matrix with the entries (σ_i)_{j,k} = [j ⊕ k = i], i, j, k ∈ Ω.
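Definitions 4.11 and 4.15 translate directly into code. The sketch below is a minimal Python illustration with a hypothetical fitness vector and a uniform mixing matrix (entries 1/r, i.e. every pair of parents produces each child genotype with equal probability); both operators map the simplex into itself, in line with Remark 4.13.

```python
def select(f, x):
    # proportional selection F (Definition 4.11): (F(x))_i = f_i x_i / (f, x)
    mean_fit = sum(fi * xi for fi, xi in zip(f, x))
    return [fi * xi / mean_fit for fi, xi in zip(f, x)]

def mix(M, x):
    # mixing operator (Definition 4.15): (M(x))_i = (sigma_i x)^T M (sigma_i x),
    # where (sigma_i x)_j = x_{i XOR j} by the definition of sigma_i
    r = len(x)
    return [sum(x[i ^ j] * M[j][k] * x[i ^ k] for j in range(r) for k in range(r))
            for i in range(r)]

# hypothetical example: r = 4 genotypes, injective fitness, uniform mixing matrix
f = [1.0, 2.0, 3.0, 4.0]
M = [[0.25] * 4 for _ in range(4)]
x = [0.25, 0.25, 0.25, 0.25]
Fx = select(f, x)                   # -> [0.1, 0.2, 0.3, 0.4]
Gx = mix(M, Fx)                     # composition G = M o F (Definition 4.16 below)
assert abs(sum(Fx) - 1.0) < 1e-12   # F(x) is a distribution on Omega (Remark 4.13)
assert abs(sum(Gx) - 1.0) < 1e-12   # and so is M(F(x))
```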
Definition 4.16. The genetic operator is the mapping G : Λ^{r−1} → Λ^{r−1} which is the composition of the proportional selection and mixing operators: G = M ∘ F.

Remark 4.17. The proportional selection, mixing and genetic operators are continuously differentiable on the unit simplex Λ^{r−1}, i.e. F, M, G ∈ C^1(Λ^{r−1}). This means that they belong to the class C^1(A), where A is an open set in R^r such that Λ^{r−1} ⊂ A and 0 ∉ A.

Let us study the extensions of the proportional selection, mixing and genetic operators to the whole of R^r in order to justify the above remark.

Remark 4.18. The extension of the mixing operator M is continuously differentiable in the whole of R^r because it is simply a quadratic form of its argument. The extension of the proportional selection operator F is continuously differentiable in R^r except at zero, so this same range of differentiability is preserved for the composition G = M ∘ F. Because the distance between Λ^{r−1} and 0 is strictly positive, we may easily select an open set A that contains Λ^{r−1} and does not contain 0, which is necessary to justify the thesis of Remark 4.17.

Theorem 4.19. Let x^t ∈ Λ^{r−1} be the frequency vector of the SGA population in the genetic epoch t ≥ 0; then the i-th coordinate (G(x^t))_i of the genetic operator value stands for the sampling probability of an individual with the genotype i ∈ Ω for the next epoch population P_{t+1}.

The above thesis may be drawn from the construction of the genetic operator, whose coordinates are based on the probability selection and mixing distributions 3.29, 3.30, 3.48, 3.49 and the symmetry condition delivered by Theorem 3.13. A detailed justification of the thesis of Theorem 4.19 can be found in the Vose monograph [193]. Let us consider two SGA populations in consecutive genetic epochs P_t, P_{t+1} and the corresponding frequency vectors x^t, x^{t+1} ∈ Λ^{r−1}.
The state vector x^{t+1} produced by the algorithm may be interpreted as a random variable with the probability distribution π^{t+1}, which depends only on x^t, in accordance with the Markov condition 4.6 for the family {P_t}, t = 0, 1, 2, . . . .

Theorem 4.20. (see Vose [193], Theorem 3.3) Let x^t ∈ Λ^{r−1} be the frequency vector of the SGA population in the genetic epoch t ≥ 0; then the expected frequency vector in the next epoch t + 1 equals G(x^t) (in short form, E(x^{t+1}) = G(x^t)).

The next theorem determines the important relation between the genetic operator value and the Markov transition function (represented by the probability transition matrix) associated with the Simple Genetic Algorithm.
Theorem 4.21. (see Nix and Vose [116]) If the parameters of the SGA genetic operations, i.e. the mutation rate, crossover rate and crossover type p_m, p_c, type (see Sections 3.5.1, 3.5.2), are constant (do not depend on the epoch number t), then the n × n transition probability matrix Q may be computed by the formula

Q = {(Q)_{x,y}}_{x,y∈X_µ}, (Q)_{x,y} = µ! ∏_{j=0}^{r−1} ((G(x))_j)^{µ y_j} / (µ y_j)!
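The multinomial formula of Theorem 4.21 can be evaluated directly. The sketch below uses hypothetical numbers (r = 2 genotypes, µ = 4, and an assumed genetic operator value G(x)); it computes one row of Q and checks that it is a probability distribution over all reachable target states y.

```python
from math import factorial

def transition_prob(Gx, y, mu):
    # (Q)_{x,y} = mu! * prod_j ((G(x))_j)^{mu y_j} / (mu y_j)!  (Theorem 4.21)
    p = factorial(mu)
    for gj, yj in zip(Gx, y):
        k = round(mu * yj)          # occurrence count eta(j) = mu * y_j
        p *= gj ** k / factorial(k)
    return p

Gx = (0.3, 0.7)                     # assumed value of the genetic operator at x
mu = 4
# target states y = (k/mu, (mu-k)/mu), k = 0, ..., mu
row = [transition_prob(Gx, (k / mu, (mu - k) / mu), mu) for k in range(mu + 1)]
assert abs(sum(row) - 1.0) < 1e-12  # the row of Q sums to one
```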
The above theorem allows the effective computation of the transition probability matrix entries {(Q)_{x,y}}, x, y ∈ X_µ, if the fitness vector f = (f_0, . . . , f_{r−1}), the population cardinality µ, the binary code length l and the genetic operation parameters p_m, p_c, type are given. Additional information and formulas necessary for the effective evaluation of the genetic operator G are contained in definitions 4.11–4.16. The genetic operator plays a crucial role in the analysis of the Simple Genetic Algorithm, and the above theorems are the first examples of its application. Further results will be presented in the following sections.

Remark 4.22. Since the genetic operator G is completely determined by the parameters of the particular SGA instance (binary code length l, fitness vector f = (f_0, . . . , f_{r−1}), mutation rate p_m, crossover rate and type p_c, type), it corresponds to an infinite, but countable class of SGA instances, which differ only in the population size µ ∈ N or in the starting population P_0. Such a class of SGA instances will be called associated with, or spanned by, the genetic operator G.

The Simple Genetic Algorithm as a dynamic semi-system

Definition 4.23. (see Pelczar [125]) A dynamic semi-system is a triple (T, B, φ), where T is a topological space, (B, +) is a topological commutative semi-group (there are no opposite elements) and φ : T × B → T is a mapping that satisfies the following conditions:
1. φ(·, e) = I(·), where e ∈ B is the neutral element in B and I stands for the identity mapping on T,
2. ∀p, t ∈ B, ∀x ∈ T: φ(φ(x, p), t) = φ(x, p + t),
3. the mapping φ is continuous with respect to both variables.
Let us first discuss the SGA with finite populations {P_t}, t = 0, 1, 2, . . ., of cardinality µ < +∞. The frequency vector of P_t will be denoted by x^t ∈ X_µ ⊂ Λ^{r−1} for t = 0, 1, 2, . . . . The Markov transition function turns back the probabilistic measure τ(x^t) = π_µ^{t+1} ∈ M(X_µ), as in the formula contained in Remark 4.7. Because X_µ is finite, the π_µ^t are also vectors of the finite dimension n = #X_µ (see Remark 4.2). The vector π_µ^t may also be handled as a discrete measure, an element of the space M(Λ^{r−1}), concentrated on the discrete set of points X_µ ⊂ Λ^{r−1}. Using the formula π_µ^{t+1} = Q π_µ^t (see 4.12) recursively we obtain

∀t, p ∈ Z_+  π_µ^{t+p} = Q^p π_µ^t = Q^{t+p} π_µ^0.  (4.14)
Moreover, we have

Q^{t+p} = Q^t Q^p = Q^p Q^t, ∀t, p ∈ Z_+.  (4.15)
Let us now set T := M(X_µ), which is the set of n-dimensional probabilistic vectors with the topology induced from R^n. We may equivalently set T to be the space of discrete probabilistic measures on Λ^{r−1} concentrated in the points of X_µ. Let B be the semi-group of transformations T → T spanned by the consecutive iterates of the matrix Q, i.e. {Q^p}, p = 0, 1, 2, . . ., with the composition of mappings as the group operation "+". Each iterate Q^p is a continuous mapping of T into itself, being a linear mapping of a finite dimensional vector-topological space. Now we can define φ : T × B ∋ (x, Q^p) → Q^p x ∈ T, where Q^p, p ∈ Z_+ represents an arbitrary element of the semi-group B. The continuity condition for φ with respect to the first variable has already been proved. Continuity with respect to the second variable is trivial because B is a discrete set. Moreover, φ satisfies condition 2 of Definition 4.23 because

φ(φ(x, Q^p), Q^t) = Q^t(Q^p x) = Q^{t+p} x = φ(x, Q^{t+p}).  (4.16)
We have justified then the following remark: Remark 4.24. The Simple Genetic Algorithm that transforms finite populations of the cardinality µ < +∞ may be modeled as a dynamic semi-system whose states belong to the space of discrete probabilistic measures on Λr−1 concentrated in points from Xµ . If the Simple Genetic Algorithm processes finite populations (µ < +∞), then the transition from the current state x ∈ Xµ to the state y ∈ Xµ in the next genetic epoch is performed with the probability (Q)x,y . If µ tends to infinity then Xµ becomes dense in Λr−1 (see Remark 4.3) and each frequency vector x ∈ Λr−1 may represent the infinite population. The next epoch population represented by xt+1 ∈ Λr−1 that follows xt ∈ Λr−1 in the infinite population SGA is obtained by infinite sampling with the probability distribution depending only on xt . What does the Markov transition rule τ (xt ) look like in this case? The answer may be found in the following theorem.
Theorem 4.25. (see Vose [193], Theorem 13.2) ∀K > 0, ∀ε > 0, ∀ν < 1, ∃N > 0 independent of x_µ^0 ∈ Λ^{r−1} such that ∀ 0 ≤ t ≤ K

µ > N ⇒ Pr{ ||x_µ^t − G^t(x_µ^0)|| < ε } > ν

where x_µ^t ∈ Λ^{r−1} is the frequency vector of the population P_t of cardinality µ produced by the SGA and x_µ^0 ∈ Λ^{r−1} stands for the frequency vector of the initial population P_0 of this same cardinality.

The above thesis may be interpreted as follows. When the SGA population size is sufficiently large, the population x^t will be followed by x^{t+1} = G(x^t) with a probability arbitrarily close to one. The transition rule becomes deterministic, i.e. it will be simply a mapping Λ^{r−1} → Λ^{r−1}.

Let us now set T := Λ^{r−1} ⊂ R^r with the induced topology and B := {G^p}, p = 0, 1, 2, . . ., where G^0 = I is the identity mapping on Λ^{r−1}. Each element of B of the form G^p, p ∈ Z_+ is a continuous mapping on T (see Remark 4.17), so B constitutes a semi-group of continuous mappings with the composition of mappings as the group operation "+". Because G^p G^k = G^k G^p = G^{p+k} for arbitrary integers p, k ≥ 0, B is a commutative semi-group of mappings. Setting φ(x, G^p) := G^p(x), x ∈ T, G^p ∈ B, we obtain a mapping which is continuous in both variables. The continuity with respect to the first variable is a simple consequence of the continuity of G (see once more Remark 4.17). The continuity with respect to the second variable is trivial in the discrete topology on B. Moreover, we have:

φ(φ(x, G^p), G^k) = G^k(G^p(x)) = G^{k+p}(x) = φ(x, G^{k+p})  (4.17)
which completes the proof of both conditions appearing in Definition 4.23, and the following remark may be drawn:

Remark 4.26. (see also Grygiel [80]) The Simple Genetic Algorithm with the infinite population (µ = +∞) may be modeled as a dynamic semi-system whose states belong to Λ^{r−1}.

The above discussion as well as Theorem 4.25 also leads to another observation:

Remark 4.27. The initial range of the trajectory {x^t}, t = 0, . . . , K, K < +∞, of the finite population (µ < +∞) Simple Genetic Algorithm is located arbitrarily close to the trajectory of the infinite population SGA with an arbitrarily large probability if the population cardinality µ is sufficiently large.

The considerations presented in this section show what kind of objects are transformed regularly by the Simple Genetic Algorithm. In the case of finite populations of size µ they are probabilistic measures from the space M(X_µ); in the case of infinite populations they are their representations from the set Λ^{r−1}. The transformation rule is static in both cases (it does not depend on the genetic epoch counter t) and deterministic if µ = +∞. It is represented by the constant transition probability matrix Q in the first case and by the genetic operator G in the second one.

Asymptotic results

Lemma 4.28. (see Vose [193], Theorem 4.7) If the mutation rate is strictly positive (p_m > 0) then the genetic operator is strictly positive, i.e. (G(x))_i > 0, ∀i ∈ Ω, ∀x ∈ Λ^{r−1}.

It is enough to observe that if p_m > 0 then the mixing matrix M has strictly positive entries m_{i,j}({0}) > 0, i, j ∈ Ω in order to justify the above lemma. It also means that, starting from an arbitrary population x ∈ Λ^{r−1}, any other population y ∈ Λ^{r−1} may be reached in a single iteration step of the Simple Genetic Algorithm with positive probability. In other words, this feature is caused by mutation, which ensures the passage between two arbitrary states x, y ∈ Λ^{r−1} with positive probability when p_m > 0. Lemma 4.28 leads to another important observation:

Remark 4.29. The probability of producing an individual with an arbitrary genotype i ∈ Ω is strictly positive if the mutation rate is strictly positive (p_m > 0). This feature does not depend on the number of the genetic epoch.

Moreover, combining Lemma 4.28 with the formula contained in Theorem 4.21 we obtain:

Remark 4.30. If the mutation rate is strictly positive (p_m > 0) then the transition probability matrix Q is also strictly positive, i.e. (Q)_{x,y} > 0, ∀x, y ∈ X_µ.

Now we are able to discuss the first result concerning the SGA asymptotic behavior.

Theorem 4.31. If the mutation rate is strictly positive (p_m > 0) then the Markov chain describing the finite population SGA (µ < +∞) is ergodic, and a weak limit π_µ ∈ M(X_µ) exists, so that

lim_{t→+∞} π_µ^t = lim_{t→+∞} Q^t π_µ^0 = π_µ

for the arbitrary initial measure π_µ^0 ∈ M(X_µ).
The first part of the above thesis (the ergodicity) is the immediate conclusion of Remark 4.30 and the second may be drawn from the ergodic Theorem (see Feller [65]).
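The ergodic limit of Theorem 4.31 can be observed numerically: for a strictly positive, column-stochastic matrix Q (cf. Remark 4.30) the iterates Q^t π^0 approach the same vector for every initial measure π^0. The 3-state matrix below is a hypothetical stand-in for a genuine SGA transition matrix.

```python
def mat_vec(Q, v):
    return [sum(Q[i][j] * v[j] for j in range(len(v))) for i in range(len(Q))]

# strictly positive entries, columns summing to one
Q = [[0.6, 0.3, 0.2],
     [0.3, 0.5, 0.3],
     [0.1, 0.2, 0.5]]

def limit_measure(pi, steps=200):
    for _ in range(steps):
        pi = mat_vec(Q, pi)     # pi^{t+1} = Q pi^t
    return pi

a = limit_measure([1.0, 0.0, 0.0])
b = limit_measure([0.0, 0.0, 1.0])
# the weak limit does not depend on the initial measure pi^0
assert all(abs(x - y) < 1e-9 for x, y in zip(a, b))
```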
One important outcome of the Markov chain ergodicity is that the SGA will visit all states x ∈ Λ^{r−1}, and as a result will search the whole set of phenotypes D_r regardless of the initial population P_0. Therefore ergodicity guarantees that the SGA becomes a well-defined global optimization algorithm if p_m > 0. In particular we have:

Remark 4.32. If the mutation rate is strictly positive (p_m > 0) then the Simple Genetic Algorithm is asymptotically correct in the probabilistic sense (see Definition 2.16). Moreover, it has the asymptotic guarantee of success (see Definition 2.17). In other words, each local and global extremum will appear in at least one population produced by the SGA if the mutation rate is strictly positive and the number of genetic epochs is sufficiently large.

The next two theorems deliver information about the behavior of the SGA state probability distributions when the number of individuals in the population grows to infinity.

Theorem 4.33. (see Vose [193], Nix and Vose [116]) If the population cardinality grows to infinity (µ → +∞) then the sequence of limit measures {π_µ} ⊂ M(X_µ) defined by the thesis of Theorem 4.31 contains a subsequence {π_µk} ⊂ {π_µ} that converges weakly to some measure π* ∈ M(Λ^{r−1}).

The proof of this important feature is based on the Prokhorov Theorem (see Feller [65]).

Definition 4.34. The genetic operator G : Λ^{r−1} → Λ^{r−1} will be called focusing if for all x ∈ Λ^{r−1} the sequence x, G(x), G^2(x), G^3(x), . . . converges in Λ^{r−1}.
Let w ∈ Λ^{r−1} be the limit of the G iterates for some starting point x ∈ Λ^{r−1}. The continuity of G guarantees that

G(w) = G(lim_{t→+∞} G^t(x)) = lim_{t→+∞} G^{t+1}(x) = w  (4.18)

so w is also a fixed point of G. Let us denote by

K = {w ∈ Λ^{r−1}; G(w) = w}  (4.19)
the set of all fixed points of the genetic operator G. If G is focusing, then K ≠ ∅ and K constitutes the attractor of the dynamic semi-system associated with the infinite population SGA. The final important theorem presented in this section determines the value of the limit measure π* on the fixed point set K.
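Whether a given operator is focusing can be probed by following the trajectory of Definition 4.34 numerically. The sketch below uses a hypothetical r = 2 case with proportional selection alone playing the role of G; its iterates converge to the monochromatic vertex of the fitter genotype (cf. Remark 4.40).

```python
def iterate_to_fixed_point(G, x, tol=1e-12, max_iter=10_000):
    # follow x, G(x), G^2(x), ...; for a focusing operator this converges to some w in K
    for _ in range(max_iter):
        gx = G(x)
        if max(abs(a - b) for a, b in zip(gx, x)) < tol:
            return gx
        x = gx
    return x

f = (1.0, 3.0)   # hypothetical injective, non-trivial fitness

def G(x):
    # proportional selection only: a simple focusing operator on the 1-simplex
    mean_fit = f[0] * x[0] + f[1] * x[1]
    return (f[0] * x[0] / mean_fit, f[1] * x[1] / mean_fit)

w = iterate_to_fixed_point(G, (0.5, 0.5))
assert abs(w[1] - 1.0) < 1e-6   # fixed point: vertex of the fittest genotype
```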
Theorem 4.35. (see Nix and Vose [116], Theorem 3) If the genetic operator G : Λ^{r−1} → Λ^{r−1} associated with the class of Simple Genetic Algorithms is focusing and K ⊂ Λ^{r−1} is the set of its fixed points, then π*(K) = 1.

The above thesis may be commented upon as follows: if the genetic operator is focusing, then the infinite population SGA will oscillate among the fixed points of its genetic operator after a sufficiently large number of genetic epochs. Other states are achieved with probability zero.

Fixed points of the genetic operator and their stability

Let us start with the most convenient definition of fixed point stability for the iterating system on the SGA space of states Λ^{r−1}.

Definition 4.36.
1. The fixed point w ∈ Λ^{r−1} of the mapping g : Λ^{r−1} → Λ^{r−1} is stable (in the Lyapunov sense) if, and only if, for each neighborhood U_1 of w there exists another neighborhood U_2 of w such that for each starting point y ∈ U_2 the trajectory {y, g(y), g^2(y), . . .} lies in U_1.
2. The fixed point w ∈ Λ^{r−1} of the mapping g : Λ^{r−1} → Λ^{r−1} is asymptotically stable if there is a neighborhood U of w such that for each y ∈ U the trajectory {y, g(y), g^2(y), . . .} converges to w.

The following characterization may be drawn for fixed points of focusing, continuously differentiable mappings.

Theorem 4.37. Let w ∈ Λ^{r−1} be a fixed point of the continuously differentiable mapping g ∈ C^1(Λ^{r−1} → Λ^{r−1}). If the spectral radius (the maximum modulus of the eigenvalues) of the differential Dg|_w is greater than 1, then w is unstable. If the spectral radius of the differential Dg|_w is less than 1, then w is asymptotically stable.

The fixed point w is called hyperbolic if no eigenvalue of the differential Dg|_w has modulus equal to 1.

Let us recall that the genetic operator G : Λ^{r−1} → Λ^{r−1} (see Definition 4.16) is the composition of the proportional selection operator F : Λ^{r−1} → Λ^{r−1} (see Definition 4.11) and the mixing operator M : Λ^{r−1} → Λ^{r−1} (see Definition 4.15).
Moreover, M can be extended to the continuously differentiable operator Rr → Rr while G and F can be extended to continuously differentiable operators from Rr \ {0} to Rr (see Remark 4.18). In order to study the stability of the genetic operator it is reasonable to study the stability of its components. Lemma 4.38. (see Vose [193], Theorem 7.1) The differential DF |y ∈ L(Rr → Rr ) of the extension of the proportional selection operator F : Rr \ {0} → Rr computed at y ∈ Rr \ {0} is given by the formula
4 Asymptotic behavior of the artificial genetic systems
DF|y = [ (1/(f, x)) ( I − (f · x)/(f, x) ) diag(f) ]_{x=y}
where f · x denotes the matrix {fi xj} of dimension r × r, which is the tensor product of the vectors f and x.

Let us assume, for a while, that the fitness function f : Ω → [0, M] is injective, i.e. fi ≠ fj for i ≠ j, i, j ∈ Ω. In this case we have

∃! θ ∈ Ω;  fθ = f(θ) = max{fj ; j ∈ Ω}.   (4.20)
If we assume, moreover, that the fitness is non-trivial, i.e. fθ > 0, then two simple observations are valid (see Vose [193]):

Remark 4.39. (see Vose [193], Theorem 10.3) The spectrum (set of eigenvalues) of the differential operator DF|xθ, where xθ = (0, . . . , 0, 1, 0, . . . , 0) with the single unit at the position θ, equals

spec( DF|xθ ) = { fi/fθ ; 0 ≤ i ≤ r − 1, i ≠ θ }.

Remark 4.40. (see Vose [193], Theorem 10.4) If the fitness function f : Ω → [0, M] is injective and non-trivial, then xθ is the only stable fixed point of the proportional selection operator F. The other fixed points of F are hyperbolic.

The next part of this section will be devoted to the fixed points of the mixing operator M : Λr−1 → Λr−1.

Theorem 4.41. (see Vose [193], Theorem 6.13, also Kołodziej [94]) The differential of the extension of the mixing operator DM|y ∈ L(Rr → Rr) computed at y ∈ Rr is given by the formula

DM|y = [ 2 Σ_{k=0}^{r−1} σk^{−1} M∗ σk xk ]_{x=y}

where σk is the r × r permutation matrix with entries (σi)j,k = [j ⊕ k = i], i, j, k ∈ Ω, M∗ is the r × r matrix with entries (M∗)i,j = (M)i⊕j,j, i, j ∈ Ω, and (M)i,j, i, j ∈ Ω are the entries of the mixing matrix M (see formula 4.14).

Remark 4.42. The derivative of the extension of the mixing operator

Rr ∋ x ↦ 2 Σ_{k=0}^{r−1} σk^{−1} M∗ σk xk ∈ L(Rr → Rr)

is a linear mapping with respect to x.
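Remark 4.39 can be checked numerically. The sketch below, assuming numpy and an arbitrary positive placeholder fitness vector (not taken from the text), computes a finite-difference Jacobian of the extension of F at the vertex xθ and compares its eigenvalues with {fi/fθ ; i ≠ θ}; in the full space Rr one additional zero eigenvalue appears, corresponding to the direction transversal to the simplex.

```python
import numpy as np

def F(x, f):
    """Proportional selection operator F(x) = diag(f) x / (f, x)."""
    return f * x / np.dot(f, x)

def jacobian_fd(fun, x, h=1e-6):
    """Finite-difference Jacobian J[i, j] = d fun_i / d x_j at x."""
    r = len(x)
    J = np.empty((r, r))
    for j in range(r):
        e = np.zeros(r)
        e[j] = h
        J[:, j] = (fun(x + e) - fun(x - e)) / (2 * h)
    return J

r = 8
rng = np.random.default_rng(0)
f = rng.uniform(0.5, 2.0, size=r)      # injective, non-trivial fitness
theta = int(np.argmax(f))
x_theta = np.zeros(r)
x_theta[theta] = 1.0                   # the vertex x_theta of the simplex

J = jacobian_fd(lambda x: F(x, f), x_theta)
eigs = np.sort(np.linalg.eigvals(J).real)

# Remark 4.39 plus the extra zero eigenvalue of the extension to R^r:
expected = np.sort(np.append(f[np.arange(r) != theta] / f[theta], 0.0))
print(np.allclose(eigs, expected, atol=1e-4))   # True
```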
4.1 Markov theory of genetic algorithms
Let us study the features of the matrix operator M∗ as an introductory step to studying the fixed points of the operator M.

Theorem 4.43. (see Vose [193], Theorem 6.3)
1. The maximum eigenvalue of the matrix M∗ equals 1 and corresponds to the left eigenvector 1, which is perpendicular to the unit simplex Λr−1 in Rr.
2. If the mutation rate is strictly positive (pm > 0) then the other eigenvalues of the matrix M∗ have their moduli less than 0.5.

The above theorem partially motivates the next theses.

Theorem 4.44. (see Vose [193], Theorem 6.13)
1. spec( DM|x ) = 2(1, x) spec(M∗) ∀x ∈ Rr.
2. The maximum eigenvalue of the differential DM|x equals 2(1, x) and corresponds to the eigenvector 1 for all x ∈ Rr. If we restrict the mapping M to the unit simplex Λr−1, then this eigenvalue has to be ignored in the stability analysis, because 1 is perpendicular to Λr−1 and does not affect the behavior of the operator M on Λr−1.
3. If the mixing operator M is strictly understood, according to the definition 4.15, as the mapping Λr−1 → Λr−1, then spec( DM|x ) does not depend on x ∈ Λr−1.

Studying the stability of the fixed points of the mixing operator M is much more complicated than in the case of the proportional selection operator F. We restrict ourselves only to features derived by Vose (see Vose [193], Theorem 10.8) which are important for future considerations.

Theorem 4.45. If the mutation rate is strictly positive (pm > 0), then the centroid of the simplex (1/r) 1 ∈ Λr−1 is a fixed point of the mixing operator M.

Using the well-known chain rule for differentiating the composition of functions we obtain:

DG|x = DM|F(x) ◦ DF|x  ∀x ∈ Rr \ {0}.   (4.21)

A more detailed formula for DG is also delivered by Proposition 2.3 in [195].

The vertices e0, . . . , er−1, ei = (0, . . . , 0, 1, 0, . . . , 0) with the single unit at the i-th position, i = 0, . . . , r − 1, of the simplex Λr−1 are analyzed as potential fixed points of the genetic operator G by Vose and Wright [195]. Their results may be summarized as follows:

Remark 4.46. If the mutation vanishes (pm = 0) then the vertices e0, . . . , er−1 of the simplex Λr−1 are fixed points of the genetic operator G.
Each vertex ei, i ∈ Ω represents the monochromatic population P that contains only individuals of the single genotype i ∈ Ω (see Remark 4.4). Neither selection nor crossover can modify the genotype of the individuals from P, so the expected population that follows ei is also G(ei) = ei. The stability of the Λr−1 vertices may be analyzed in the above case by studying the spectral radius of DG|ek.

Theorem 4.47. (see Vose and Wright [195], Theorem 3.4) If there is no mutation (pm = 0) then

spec( DG|ek ) = { (fi⊕k/fk) Σ_{j=0}^{r−1} (ηj + ηĵ) [j ⊕ i = 0] ; i = 1, . . . , r − 1 } ∪ {0}.

Moreover, Vose and Wright suggested in [195] that the Λr−1 vertices are the only candidates for stable fixed points of G if the mutation vanishes in the SGA (pm = 0) (see [195], Conjecture 4.4).

Next, we will study the behavior of the genetic operator G in the case in which there is no crossover (pc = 0) and the mutation is positive (pm > 0). This case has been studied by Grygiel [80]. The mixing component M of the genetic operator is reduced to the linear operator Rr → Rr with the matrix {m̃i,j} of coefficients

m̃i,j = pm^{(1,i⊕j)} (1 − pm)^{l−(1,i⊕j)}.   (4.22)

The genetic operator can then be expressed by the formula

G(x) = M(F(x)) = (1/(f, x)) H x,  x ∈ Λr−1   (4.23)
where H = {m̃i,j fj} is an r × r matrix.

Theorem 4.48. (see Grygiel [80], Proposition 2) If the SGA genetic operations are restricted to mutation only (pc = 0, pm > 0) then there is exactly one fixed point w of the genetic operator G. The fixed point w has strictly positive coordinates (w ∈ int(Λr−1)). Moreover, w is the eigenvector which corresponds to the maximum eigenvalue λmax of the matrix H, i.e. H w = λmax w. The fixed point w is asymptotically stable; moreover, its basin of attraction is the whole simplex Λr−1, i.e.

∀x ∈ Λr−1  lim_{t→+∞} Gt(x) = w.
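Theorem 4.48 makes the mutation-only fixed point directly computable: w is the Perron eigenvector of H. A minimal numpy sketch, with l, pm and the fitness vector chosen as illustrative placeholders:

```python
import numpy as np

def mutation_matrix(l, pm):
    """m~_{ij} = pm^(1, i xor j) * (1 - pm)^(l - (1, i xor j)) (formula 4.22);
    (1, i xor j) is the Hamming distance between genotypes i and j."""
    r = 2 ** l
    ham = np.array([[bin(i ^ j).count("1") for j in range(r)] for i in range(r)])
    return pm ** ham * (1.0 - pm) ** (l - ham)

l, pm = 4, 0.05
r = 2 ** l
f = np.linspace(0.1, 1.0, r)             # placeholder fitness, distinct values

H = mutation_matrix(l, pm) * f[np.newaxis, :]   # H = {m~_{ij} f_j}

# Fixed point w: the Perron eigenvector of H scaled onto the simplex.
vals, vecs = np.linalg.eig(H)
w = np.abs(vecs[:, np.argmax(vals.real)].real)
w /= w.sum()

G = lambda x: H @ x / np.dot(f, x)       # formula 4.23

# w is fixed and attracts the whole simplex: iterate G from the centroid.
x = np.full(r, 1.0 / r)
for _ in range(1500):
    x = G(x)
```

The final loop illustrates the global attraction property of Theorem 4.48: the trajectory started at the simplex centroid converges to the same w that the eigenvalue problem delivers.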
The last problem discussed in this section will be the characterization of hyperbolic fixed points of the genetic operator G.

Definition 4.49. The genetic operator G : Λr−1 → Λr−1 will be called regular if for each set C ⊂ Λr−1 with a zero measure in Λr−1 the set G−1(C) also has a zero measure in Λr−1. We will understand the measure in Λr−1 as the measure on the (r − 1)-dimensional hyperplane that contains Λr−1.
The two following theorems define more precisely the conditions under which the genetic operator G is regular and invertible, and inform us about the number of fixed points of such operators.

Theorem 4.50. (see Vose [193], Theorems 9.3 and 13.6) If the crossover rate is strictly less than one (pc < 1) and the mutation rate satisfies 0 < pm < 1/2, then the genetic operator G is invertible on Λr−1 and G−1 ∈ C1(int(Λr−1)). Moreover, G is regular on Λr−1.

Theorem 4.51. (see Vose [193], Theorem 12.1) If the genetic operator G has only hyperbolic fixed points, then it has a finite number of fixed points in Λr−1.

An attempt to evaluate the rate of convergence – logarithmic convergence

Definition 4.52. The genetic operator G : Λr−1 → Λr−1 will be called logarithmically convergent if ∀ρ ∈ M(Λr−1), ∀ε > 0 ∃A ⊂ Λr−1; ρ(A) = 1 − ε so that ∀x ∈ A, ∀δ ∈ (0, 1) ∃k ≥ 1; k = O(− log(δ)) (k grows with the same order as − log(δ)), and moreover

‖Gk(x) − ω(x)‖ < δ,  where ω(x) = lim_{t→+∞} Gt(x).
If the particular genetic operator G is logarithmically convergent, then we can evaluate the rate of convergence of the infinite population SGA which starts from the set A and tends to the fixed point ω(x), x ∈ A. The set A may have a measure arbitrarily close to the measure of the whole simplex Λr−1. The logarithmic convergence of the genetic operator G holds under the following conditions.

Theorem 4.53. (see Vose [193], Theorem 13.10) If the genetic operator G : Λr−1 → Λr−1 is focusing and regular (see definitions 4.34, 4.49) and all the fixed points of G are hyperbolic, then G is logarithmically convergent.

The approximation of the fixed points of the genetic operator

Let us consider the class of instances of the Simple Genetic Algorithm spanned by the single genetic operator G (see Remark 4.22). In this short section we try to answer how close finite populations, generated by the above algorithms
during a finite number of genetic epochs, can approximate the fixed points of their spanning genetic operator. We restrict ourselves to the case in which G is focusing and the set K of fixed points is finite, so all fixed points are isolated. Sufficient conditions for #K < +∞ are delivered by Theorem 4.51.

Theorem 4.54. (see Telega [184], Cabib, Schaefer, Telega [43]) Let us assume that the genetic operator G : Λr−1 → Λr−1 is focusing and its set of fixed points is finite (#K < +∞). Let us define

Kε = {x ∈ Λr−1 ; ∃y ∈ K; d(x, y) < ε}

being the open ε-envelope of K in the (r − 1)-dimensional hyperplane that contains Λr−1, where d(·, ·) stands for the Euclidean distance in this hyperplane. We assume, moreover, that mutation is strictly positive (pm > 0). Then

∀ε > 0, ∀η > 0, ∃N ∈ N, ∃W(N) ∈ N; ∀µ > N, ∀k > W(N)  πµk(Kε) > 1 − η

where πµk are the measures associated with the infinite sub-class of Simple Genetic Algorithms spanned by G. In other words, if G is focusing, then sufficiently large SGA populations will concentrate close to the set K with an arbitrarily large probability 1 − η after a sufficiently large number of genetic epochs. The above theorem will be intensively used when studying the asymptotic features of sampling measures presented in the next section. The proof of Theorem 4.54 will be preceded by two lemmas and one technical remark concerning the relation between the support of the measures πµ, πµk and the set Kε.

Lemma 4.55. There is an infinite sub-class of Simple Genetic Algorithms spanned by G so that under the assumptions of Theorem 4.54

∀ε > 0, ∀η > 0 ∃N ∈ N; ∀µ > N  πµ(Kε) > 1 − η.

Proof. Let us consider the limit measures πµ associated with the finite population SGA (see Theorem 4.31). According to Theorem 4.33 the sequence {πµ} contains an infinite sub-sequence {πµξ} that converges to π∗.
Let us select the sub-class of Simple Genetic Algorithms spanned by G whose measures {πµkξ } converge to the elements πµξ of this sub-sequence if the number of genetic epochs tends to infinity (k → +∞). For the sake of simplicity we will denote by πµ the elements of the sub-sequence πµξ in the remaining part of the proof.
Because the measure π∗ is concentrated on K (see Theorem 4.35), all the sets Kε are π∗-continuous, which means that π∗(∂Kε) = 0 (see Billingsley [28], Section 29). The weak convergence πµ → π∗ implies πµ(Kε) → π∗(Kε) (see Billingsley [28], Theorem 29.1), so for arbitrary ε > 0 we have

∀η > 0 ∃N; ∀µ > N  |π∗(Kε) − πµ(Kε)| < η

and then also 1 − πµ(Kε) < η, because π∗(Kε) = π∗(K) = 1 and 0 ≤ πµ(Kε) ≤ 1, which completes the proof.

Remark 4.56. Let us assume that µ ∈ N is arbitrary and such that the space of states Xµ ⊂ Λr−1 for the finite population SGA exists. Then ∀ε > 0 ∃ε̂ ∈ (0, ε] such that:
1. Kε̂ is πµ-continuous and πµ(Kε̂) = πµ(Kε),
2. ∀k ∈ N Kε̂ is πµk-continuous and πµk(Kε̂) = πµk(Kε).
Proof. Let us recall that the measures πµ and πµk are concentrated on the finite set Xµ ⊂ Λr−1 for all k ∈ N. Let us define: A = Kε ∩ Xµ, B = K̄ε ∩ Xµ, C = B \ A = ∂Kε ∩ Xµ. Observe that all the above sets A, B, C and K are finite. We may consider two separate cases:
1. C = ∅, which implies πµ(∂Kε) = 0, and then the set Kε is πµ-continuous. We may set ε̂ = ε.
2. #C > 0; then we may define

c = max_{y ∈ Kε ∩ Xµ} min_{x ∈ K} {d(x, y)}.

The constant c always exists, because both sets over which the maximum and minimum are computed are finite. Obviously, from the definition of Kε we get c < ε. Now it is enough to set ε̂ ∈ (c, ε), because then Kε̂ ⊂ int(Kε) and ∂Kε̂ ∩ ∂Kε = ∅, which implies C ∩ ∂Kε̂ = ∅. Moreover, ∂Kε̂ ∩ A = ∅ because A ⊂ int(Kε̂). Finally, ∂Kε̂ ∩ Xµ = ∅, so πµ(∂Kε̂) = 0, which completes the proof of the first thesis of the remark. The second thesis of the remark can be proved identically to the first one.

Lemma 4.57. If the assumptions of Theorem 4.54 hold, then

∀ε > 0 ∀η > 0 ∀µ ∈ N ∃K(µ); ∀k > K(µ)  |πµ(Kε) − πµk(Kε)| < η.
Proof. Similarly to the previous instance, πµ and πµk may be handled as probabilistic measures defined on the (r − 1)-dimensional hyperplane containing Λr−1 that vanish outside Λr−1. Let us select η > 0, µ ∈ N and ε̂, 0 < ε̂ ≤ ε, so that Kε̂ is πµ-continuous and πµ(Kε̂) = πµ(Kε) (see remark 4.56). By Theorem 4.31 the sequence πµk → πµ weakly as k → +∞. Then there is K(µ) such that ∀k > K(µ) we have

|πµ(Kε̂) − πµk(Kε̂)| < η.

Using remark 4.56 we directly obtain the thesis of Lemma 4.57.

Proof of Theorem 4.54. Let us select ε, η > 0 and set ς = η/2. From Lemma 4.55 we may derive

∃N; ∀µ > N  πµ(Kε) > 1 − ς = 1 − η/2.

For an arbitrary, proper µ ∈ N we have from Lemma 4.57

∃K(µ); ∀k > K(µ)  |πµ(Kε) − πµk(Kε)| < η/2

which also implies that

πµk(Kε) − πµ(Kε) > −η/2

which added to the previous inequality yields πµk(Kε) > 1 − η.

Asymptotic features of sampling measures

The considerations presented in some earlier sections (especially in The Simple Genetic Algorithm as a dynamic semi-system and Asymptotic results) show that the Simple Genetic Algorithm regularly transforms measures on its space of states Λr−1. Theorems 4.31, 4.33, 4.35 and 4.54 also show their asymptotic features. Let us try to determine when the Simple Genetic Algorithm can generate a sequence of sampling measures over the admissible set D which is convergent and whose limit can be helpful in solving the global optimization problem Π4 (see Section 2.1). We recall that problem Π4 consists of finding sets which are the central parts of the basins of attraction of the objective function Φ. Moreover, we try to formulate the condition that forces the SGA to satisfy the above need. The requested sequence of sampling measures belonging to M(D) can be constructed in two steps.

Step I.
We will intensively use the following features of the genetic operator, called SGA heuristics (see Vose [193]), originally defined as G : Λr−1 → Λr−1 (see definition 4.16, Theorems 4.19, 4.20, 4.25 and remark 4.26):
1. G(x) is the expected population in the epoch that immediately follows the epoch in which the population vector x ∈ Λr−1 appears,
2. G is the evolutionary law of the abstract, infinite population SGA (µ = +∞),
3. Each coordinate (G(x))i, i = 0, 1, . . . , r − 1 stands for the sampling probability of the individual with the genotype i ∈ Ω in the epoch that immediately follows the epoch in which the population vector x ∈ Λr−1 appears.

Each point of the simplex x ∈ Λr−1 defines the unique measure θ(x) ∈ M(Dr) (see Section 3.1), so the following one-to-one mapping may be formally established:

θ : Λr−1 → M(Dr); ∀x ∈ Λr−1, ∀y ∈ Dr, ∀i ∈ Ω, y = code(i) ⇒ θ(x)({y}) = xi   (4.24)

The third feature of the genetic operator G allows us to define the mapping that returns the sampling measure over the set of phenotypes:

θ̂ : Λr−1 → M(Dr); ∀x ∈ Λr−1, ∀y ∈ Dr, ∀i ∈ Ω, y = code(i) ⇒ θ̂(x)({y}) = (G(x))i   (4.25)

Remark 4.58. Taking the mapping θ into account, the space M(Dr) may be handled as the new space of states of the Simple Genetic Algorithm, in the same way as the simplex Λr−1. The role of heuristics will be played by the mapping:

M(Dr) ∋ σ ↦ θ̂(θ−1(σ)) ∈ M(Dr).

The measures θ and θ̂ can also be identified with the measures θ′ and θ̂′ that belong to the space M(D). Both θ′ and θ̂′ are defined over the whole admissible set D and concentrated at the discrete phenotype points Dr ⊂ D. They will satisfy:
θ′(x)(A) = θ(x)(A ∩ Dr),  θ̂′(x)(A) = θ̂(x)(A ∩ Dr)  ∀x ∈ Λr−1   (4.26)
where A ⊂ D is an arbitrary set measurable in the Lebesgue sense.

Remark 4.59. Let us assume the finite population SGA (µ < +∞) and let x ∈ Xµ be the vector of the population P = (Ω, η) in the particular genetic epoch. Then θ′(x) is the counting measure, which means that for all A ⊂ D measurable in the Lebesgue sense

θ′(x)(A) = (1/µ) Σ_{y∈supp(η)} η(y) [code(y) ∈ A].
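Remark 4.59 is straightforward to implement. A small sketch, with a toy population and encoding chosen purely for illustration:

```python
from collections import Counter

def counting_measure(population, code, A_contains):
    """theta'(x)(A) for a finite population: the fraction of individuals
    whose phenotype code(y) falls into the measurable set A (Remark 4.59)."""
    mu = len(population)
    eta = Counter(population)                 # occurrence function eta(y)
    return sum(n for y, n in eta.items() if A_contains(code(y))) / mu

# Toy setting: genotypes 0..15 affinely encoded into D = [-1, 1].
l, left, right = 4, -1.0, 1.0
code_a = lambda i: left + i * (right - left) / 2 ** l

population = [0, 3, 3, 7, 12, 12, 12, 15]     # mu = 8 individuals
A = lambda t: 0.0 <= t < 1.0                   # A = [0, 1), a subset of D

print(counting_measure(population, code_a, A))   # prints 0.5
```

Here four of the eight individuals (the three copies of genotype 12 and the single genotype 15) have phenotypes in A, so the counting measure of A equals 4/8.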
Step II. Now we introduce the mapping Ψ : M(Dr) → M(D) that returns a special kind of measures which possess Lp(D), D ⊂ RN density functions, based on the probabilistic measure concentrated on the set of phenotypes Dr (see Schaefer, Jabłoński [154]). Let ϑ(j) be the domain of the inverse binary affine encoding defined by the formula 3.14 associated with the genotype j = (j1, . . . , jN) ∈ Ω. We recall that the substrings {ji}, i = 1, . . . , N encode consecutive coordinates of the phenotype that belongs to Dr ⊂ RN. In the case of Gray encoding we assign ϑ(v(k)) to the genotype k ∈ Ω, where v : Ω → Ω denotes the Gray encoding function (see 3.15). If D has the Lipschitz boundary (see e.g. Zeidler [207]) then ϑ(j), j ∈ Ω are measurable sets in the Lebesgue sense. Now we are prepared to set, for the arbitrary measure ω ∈ M(Dr) and the Lebesgue measurable set A ⊂ D,

Ψ(ω)(A) = ∫_A ρω dx   (4.27)

where the right-hand-side integral is computed according to the Lebesgue measure. The density function is given by

ρω(x) = Σ_{j∈Ω} [ω({codea(j)}) / meas(ϑ(j) ∩ D)] χϑ(j)(x)   (4.28)

where χϑ(j) denotes the characteristic function of the open brick ϑ(j) that contains the phenotype codea(j). The above construction follows the more general one presented in Section 3.1 (see formula 3.4). In the case of Gray encoding we can set

ρω(x) = Σ_{k∈Ω} [ω({codeG(k)}) / meas(ϑ(v(k)) ∩ D)] χϑ(v(k))(x).   (4.29)
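The construction of Ψ in formulas 4.27–4.28 can be sketched in one dimension as follows (numpy assumed; the measure ω is a random placeholder, and the bricks are taken as equal subintervals of D, so that meas(ϑ(j) ∩ D) = h):

```python
import numpy as np

l, left, right = 4, -1.0, 1.0
r = 2 ** l
h = (right - left) / r                    # length of each brick theta^(j)

# A placeholder probabilistic measure omega on the r phenotype points.
omega = np.random.default_rng(2).dirichlet(np.ones(r))

def rho(x):
    """Density of Psi(omega) at x (formula 4.28, one-dimensional case):
    omega({code_a(j)}) / meas(theta^(j)) on the brick containing x."""
    j = min(int((x - left) / h), r - 1)
    return omega[j] / h

# The piecewise constant density integrates to omega(D_r) = 1;
# the midpoint rule below is exact for such a function.
midpoints = left + (np.arange(r) + 0.5) * h
integral = sum(rho(x) for x in midpoints) * h
print(round(integral, 12))   # 1.0
```

Since ρω is constant on each brick and bounded, it belongs to every Lp(D), in line with Remark 4.60 below; the exactness of the midpoint rule here is just a restatement of that piecewise constancy.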
Remark 4.60. If D has the Lipschitz boundary then ρω ∈ Lp(D), p ∈ [1, +∞).

Proof. The density function ρω is piecewise constant and bounded in D. In particular it is constant on the sets {ϑ(j)}, j ∈ Ω, which form the regular partitioning of D such that ϑ(i) ∩ ϑ(j) = ∅, i ≠ j (this was included in the partitioning construction, see Section 3.1). The partitioning covers the whole measure support of this function (meas(∪_{i∈Ω} ϑ(i)) = meas(D)). Such conditions guarantee the existence and boundedness of the Lebesgue integrals ∫_D ρω^p dx for all p ∈ [1, +∞), which motivates the above thesis.

Moreover, the following remark may be drawn from formulas 4.28, 4.29.

Remark 4.61. The mapping Ψ : M(Dr) → M(D) is injective; in particular the density ρω is uniquely determined almost everywhere in D by the measure ω ∈ M(Dr).

One simple but very important observation may also be made for the introduced measures associated with the fixed points of the genetic operator.

Remark 4.62. If x ∈ K is an arbitrary fixed point of the genetic operator G, then θ(x) = θ̂(x), θ′(x) = θ̂′(x) and Ψ(θ(x)) = Ψ(θ̂(x)).

Let us assume that the fixed points of the genetic operator G represent populations that contain the maximum information about the global optimization problem to be solved which can be gathered by any SGA algorithm spanned by G. Now we are going to define the condition that may be necessary for successfully solving problem Π4 by using the SGA. Let us denote by W the finite set of local maximizers of the objective function Φ, and by {Bx+}, x+ ∈ W the family of their basins of attraction.

Definition 4.63. We say that the class of SGA spanned by the genetic operator G is well tuned to the set of local maximizers W if:
1. G is focusing and the set of its fixed points K is finite,
2. ∀x+ ∈ W there exists a closed set C(x+) in D so that x+ ∈ C(x+) ⊂ Bx+, meas(C(x+)) > 0 and

ρθ(z)(x) ≥ threshold  for x ∈ C(x+), x+ ∈ W,
ρθ(z)(x) < threshold  for x ∈ D \ ∪_{x+∈W} C(x+)

where z ∈ K is an arbitrary fixed point of G and θ(z) is the discrete measure associated with the fixed point z according to the formula 4.24. The positive constant threshold stands for the definition's parameter.
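Given a density such as ρθ(z) evaluated on a grid, the sets C(x+) postulated by definition 4.63 correspond to connected components of a level set. A hypothetical one-dimensional sketch (the bimodal density below is synthetic, standing in for a density with two basins of attraction):

```python
import numpy as np

def level_set_components(density, threshold):
    """Split {x : density(x) >= threshold} on a 1-D grid into maximal runs
    of consecutive cells; each run approximates the central part C(x+)
    of one basin of attraction (definition 4.63)."""
    above = density >= threshold
    comps, start = [], None
    for i, flag in enumerate(above):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            comps.append((start, i - 1))
            start = None
    if start is not None:
        comps.append((start, len(above) - 1))
    return comps

# Toy bimodal density: two bumps standing for two basins of attraction.
x = np.linspace(-4, 4, 200)
density = np.exp(-(x + 2) ** 2) + 0.6 * np.exp(-(x - 2) ** 2)

print(len(level_set_components(density, 0.3)))   # prints 2
```

Raising the threshold above the lower peak (e.g. to 0.7 here) leaves only one component, which mirrors the threshold sensitivity discussed for the Rastrigin example below.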
In other words, the above definition postulates that the Lp(D)-regular measure densities ρθ(z) associated with all fixed points z of G dominate almost everywhere on some central parts C(x+) of the basins of attraction of the local maximizers x+ which belong to the group of our special interest (x+ ∈ W), if the class of the SGA spanned by G is well tuned to W.

Remark 4.64. The well tuning condition depends on the main objects of the optimization problem, such as the objective function Φ and the admissible domain D, on the encoding quantities, such as the phenotype mesh Dr and the encoding function code : Ω → Dr, and on the SGA parameters f, pm, pc, type (see Section 3.5).

The above remark only provides an overall characterization of the SGA's well tuning. In fact, the question of whether the Simple Genetic Algorithm is well tuned to a prescribed set of local maximizers of an arbitrary objective function is still open. Adamska has studied the well tuning condition experimentally for some well-known benchmarks of global optimization. Her results were published in the paper [153]. The computational tests performed by Adamska consist of finding the fixed point of the genetic operator and then comparing it with the objective function, in order to find out whether the particular SGA instance is well tuned, what the threshold parameter should be, and how this parameter is related to the number of local maximizers involved in the well tuning condition.

In order to obtain the fixed point of the genetic operator effectively, the SGA with mutation only is selected. In this case there exists the unique, stable fixed point w of the genetic operator, which constitutes the eigenvector corresponding to the maximum eigenvalue of the r × r matrix H = {m̃i,j fj}, where m̃i,j = pm^{(1,i⊕j)} (1 − pm)^{l−(1,i⊕j)} (see formulas 4.22, 4.23 and Theorem 4.48). The length of the binary code was set at l = 8 while the mutation rate was pm = 0.05 for all the cases presented below.

The binary genetic universum Ω may be identified with the subset {0, 1, . . . , 255} ⊂ Z+ of non-negative integers. We will study one-dimensional global optimization problems for which the objective function Φ : D → [0, M] is defined on the closed interval D = [left, right] ⊂ R. The fitness entries {fi}, i ∈ Ω were computed immediately from the objective by using the affine encoding function codea : Ω → D, i.e.

fi = Φ(codea(i)) = Φ(left + i (right − left)/2^l),  i ∈ Ω.   (4.30)

Eigenvalue/eigenvector linear algebra problems for the matrix H were solved by the symbolic processor MAPLE. The first example was associated with the one-dimensional Rastrigin benchmark function

ΦR(x) = −x² + cos(2π x) + 110,  left = −4.12, right = 4.12   (4.31)
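Formula 4.30 can be reproduced directly. The sketch below (numpy assumed) builds the fitness vector for the Rastrigin objective 4.31 with l = 8 and checks that the genotype encoding the phenotype 0, the global maximizer, receives the largest fitness:

```python
import numpy as np

# Fitness entries computed from the objective through the affine
# encoding (formula 4.30), one-dimensional Rastrigin benchmark (4.31).
l, left, right = 8, -4.12, 4.12
r = 2 ** l
Phi = lambda x: -x ** 2 + np.cos(2 * np.pi * x) + 110.0

i = np.arange(r)
x_i = left + i * (right - left) / 2 ** l   # phenotypes code_a(i)
f = Phi(x_i)                                # fitness vector {f_i}

# Genotype 128 encodes the phenotype 0, the global maximizer of Phi_R;
# it is the unique genotype with maximal fitness Phi(0) = 111.
best = int(np.argmax(f))
```

With this f the matrix H = {m̃i,j fj} of the mutation-only model can be assembled and its dominant eigenvector computed numerically instead of symbolically, reproducing the limit sampling measure discussed below.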
Fig. 4.2. One-dimensional Rastrigin function (see formula 4.31).
which is shown in Figure 4.2. The next figure, Figure 4.3, presents the chart of the measure associated with the infinite population vector w defined on Ω ≅ {0, 1, . . . , 255}, which is the unique fixed point of the genetic operator in this example. Although the measure is discrete, the chart in Figure 4.3 has been smoothed to express its character more precisely.
Fig. 4.3. The limit sampling measure associated with the one-dimensional Rastrigin function.
If we normalize the chart of the measure defined on Ω by the total length of D, which equals 8.24, we obtain the values of the density ρθ(w) (see formula 4.28). Figure 4.3 then allows us to discuss the well tuning of the current instance of the SGA to the local maximizers of the objective 4.31.
If threshold = t1 then the algorithm is well tuned to all the maximizers excluding the two outermost ones. For threshold = t2 the algorithm is well tuned to all the maximizers in D. Finally, if threshold = t3 the algorithm is only well tuned to the four local maximizers corresponding to the lower local maxima. The basins of attraction of the five central, isolated local maximizers were not distinguished by the level sets of the density function ρθ(w).
Fig. 4.4. Twin Gauss peaks with different mean values.
Fig. 4.5. The limit sampling measure associated with twin Gauss peaks with different mean values.
In the next example we will use the objective which is the sum of two Gauss functions which represent the probability distributions with the same standard deviation and different means (see Figure 4.4 for the chart of this
function):

ΦG1(x) = exp(−x²) + exp(−(x − 5)²),  left = −3.81, right = 3.81   (4.32)
The graph of the limit sampling measure associated with the objective function 4.32 is presented in Figure 4.5. We can easily see that this instance of the Simple Genetic Algorithm is well tuned to both local maximizers for a wide range of threshold parameter values.
Fig. 4.6. Twin Gauss peaks with different mean and different standard deviation values.
Fig. 4.7. The limit sampling measure associated with twin Gauss peaks with different mean and different standard deviation values.
The last example also deals with twin Gauss peaks, which differ in both mean and standard deviation (see Figure 4.6):

ΦG2(x) = exp(−x²) + exp(−(1/8)(x − 5)²),  left = −3.81, right = 3.81   (4.33)

The left peak is slender while the right one is wider. The chart of the limit measure is presented in Figure 4.7. As in the previous case we may conclude that the Simple Genetic Algorithm under consideration is well tuned to both local maximizers for threshold parameter values belonging to a wide range included in the range of the measure density variation. Comparing the level sets obtained for threshold = t1 and for threshold = t2 we may conclude that the sets obtained for t2 deliver more information about the shape of the basins of attraction than those obtained for t1, so the lower threshold setting seems to be more suitable for approximating the basins of attraction than the larger one.

The examples presented above prove that well tuning is not an unrealistic postulate, but can be met in many practically motivated algorithm instances. Of course, instances that do not satisfy the well tuning condition can also be easily constructed. Special SGA instances in which different fixed points generate limit measures concentrated around different local maximizers (bistability) have recently been presented (see e.g. [205]). The asymptotic behavior of evolutionary algorithm sampling measures with real number encoding was also studied by Arabas [6].

Let us now report, after Schaefer and Jabłoński [154], the two main theoretical results of this section.

Theorem 4.65. Let us assume that the genetic operator G : Λr−1 → Λr−1 is focusing, its set of fixed points is finite (#K < +∞) and the mutation probability is strictly positive (pm > 0). Then ∀ε > 0, ∀η > 0, ∃N ∈ N, ∃W(N) ∈ N, ∃z ∈ K so that ∀µ > N, ∀k > W(N), ∀A ⊂ D, A measurable in the Lebesgue sense, we have

Pr{ |θ′(xkµ)(A) − θ′(z)(A)| < ε } > 1 − η

and

Pr{ |Ψ(θ(xkµ))(A) − Ψ(θ(z))(A)| < ε } > 1 − η

where xkµ ∈ Λr−1 denotes the frequency vector of the SGA population of size µ after k genetic epochs.
In other words, the counting measure θ′ (see formula 4.26 and remark 4.59) of the set A, associated with the population of size µ < +∞ after k genetic epochs, approaches with arbitrarily high probability the measure associated with the infinite population, which is the fixed point of the genetic operator G, if k and µ are sufficiently large.

Proof. Let us fix ε > 0 and η > 0, and set ε̃ = c ε, where c stands for the constant coming from the equivalence of norms in Rr which bounds the norm "sum of coordinate moduli", scaled by c, from above by the Euclidean norm. All the assumptions of Theorem 4.54 are satisfied, so Pr{xkµ ∈ Kε̃} > 1 − η for µ > N and k > W(N) for the proper sub-class of the SGA spanned by the genetic operator G. It implies, according to the definition of the set Kε̃, that there is z ∈ K, a fixed point of G, so that Pr{d(xkµ, z) < ε̃} > 1 − η. From the equivalence of norms in Rr we get

c Σ_{i=0}^{r−1} |(xkµ)i − zi| ≤ d(xkµ, z)

so

Pr{ c Σ_{i=0}^{r−1} |(xkµ)i − zi| < c ε } > 1 − η

and finally

Pr{ Σ_{i=0}^{r−1} |(xkµ)i − zi| < ε } > 1 − η.   (4.34)

The indices i = 0, . . . , r − 1 in the above sum run through the whole genotype set Ω and codea(i) ∈ Dr ⊂ D. Let us now take A ⊂ D and denote by J ⊂ Ω the set of genotypes such that codea(i) ∈ A if, and only if, i ∈ J. Then

θ(z)(A) = Σ_{i∈J} zi,  θ(xkµ)(A) = Σ_{i∈J} (xkµ)i.

Now we may evaluate

|θ(z)(A) − θ(xkµ)(A)| = |Σ_{i∈J} (zi − (xkµ)i)| ≤ Σ_{i∈J} |zi − (xkµ)i| ≤ Σ_{i=0}^{r−1} |zi − (xkµ)i|.

Taking into account 4.34 we have

Pr{ |θ(z)(A) − θ(xkµ)(A)| < ε } ≥ Pr{ Σ_{i=0}^{r−1} |zi − (xkµ)i| < ε } > 1 − η
which completes the proof of the first thesis of the theorem.

Let us now denote by J ⊂ Ω the set of genotypes such that i ∈ J if, and only if, meas(ϑ(i) ∩ A) > 0, where ϑ(i) stands for the open brick that contains the phenotype codea(i) (see formula 3.14). Because

Ψ(θ(z))(A) = Σ_{i∈J} zi meas(ϑ(i) ∩ A)/meas(ϑ(i)),  Ψ(θ(xkµ))(A) = Σ_{i∈J} (xkµ)i meas(ϑ(i) ∩ A)/meas(ϑ(i)),

then

|Ψ(θ(z))(A) − Ψ(θ(xkµ))(A)| = |Σ_{i∈J} (zi − (xkµ)i) meas(ϑ(i) ∩ A)/meas(ϑ(i))| ≤ Σ_{i∈J} |zi − (xkµ)i| ≤ Σ_{i=0}^{r−1} |zi − (xkµ)i|.

Recalling once more formula 4.34 we have

Pr{ |Ψ(θ(z))(A) − Ψ(θ(xkµ))(A)| < ε } ≥ Pr{ Σ_{i=0}^{r−1} |zi − (xkµ)i| < ε } > 1 − η

which completes the proof of the second thesis and then the whole theorem.

Theorem 4.66. Let us assume that the genetic operator G : Λr−1 → Λr−1 is focusing, its set of fixed points is finite (#K < +∞) and the mutation probability is strictly positive (pm > 0). Then ∀ε > 0, ∀η > 0, ∃N ∈ N, ∃W(N) ∈ N, ∃z ∈ K so that

∀µ > N, ∀k > W(N)  Pr{ ‖ρθ(xkµ) − ρθ(z)‖Lp(D) < c ε } > 1 − η

where

c = meas(D)^{1/p} / min_{i∈Ω}{meas(ϑ(i))}

and p ∈ [1, +∞).
Proof. Let us select an arbitrary genotype i ∈ Ω. The thesis of the previous Theorem 4.65 yields

Pr{ |Ψ(θ(z))(ϑ(i)) − Ψ(θ(xkµ))(ϑ(i))| < ε } > 1 − η   (4.35)

for µ > N and k > W(N). Selecting an arbitrary point ξ ∈ int(ϑ(i)) we have

Ψ(θ(z))(ϑ(i)) = ρθ(z)(ξ) meas(ϑ(i)),  Ψ(θ(xkµ))(ϑ(i)) = ρθ(xkµ)(ξ) meas(ϑ(i)),

then from 4.35 we obtain

∀i ∈ Ω  Pr{ meas(ϑ(i)) |ρθ(z)(ξ) − ρθ(xkµ)(ξ)| < ε } > 1 − η.

Because meas(D \ ∪_{i∈Ω} ϑ(i)) = 0 while meas(D) = meas(D ∩ ∪_{i∈Ω} ϑ(i)), then for p ∈ [1, +∞)

Pr{ ‖ρθ(xkµ) − ρθ(z)‖Lp(D) < c ε } > 1 − η

where the constant c may be computed as

c = meas(D)^{1/p} / min_{i∈Ω}{meas(ϑ(i))}

which completes the proof.
Finally, we may draw a concluding remark concerning the possibility of solving problem Π4 by the class of SGA spanned by a single genetic operator.

Remark 4.67. If the class of SGA spanned by the genetic operator G is well tuned to the set of local maximizers W (see definition 4.63) and the assumptions of Theorem 4.66 hold, then the central parts of the basins of attraction C(x+), x+ ∈ W may be approximated by the level sets of the density ρθ(xkµ) if µ and k are sufficiently large.
4.1.3 The results of the Markov theory for the Evolutionary Algorithm

The asymptotic behavior of evolutionary algorithms of the type (µ + λ) (see 3.9) will be analyzed in this section. We will utilize problem Π1 (find any global maximizer only, see Section 2.1) without constraints, i.e. D = V. We will apply the phenotypic encoding for which the genetic universum

U = RN   (4.36)

may also be identified with the search space V. The space of states for this group of algorithms is

E = U^µ/eqp = (RN)^µ/eqp   (4.37)
4 Asymptotic behavior of the artificial genetic systems
where the equivalence eqp is described by formula 2.14 and µ stands for the constant population cardinality (it does not depend on the number of the genetic epoch). We will use the elitist selection (see Section 3.4.3), which transports the best fitted individual from the current population to the next epoch with probability 1, and mutation described by the simplified version of formula 3.51. We will also apply other phenotypic genetic operations (e.g. crossover described by formula 3.52) which do not memorize any population characteristics. Because no genetic operation applied to create the next epoch population P_(t+1) depends on the number of the genetic epoch t or on the previous populations P0, P1, . . . , Pt, the evolutionary algorithm under consideration may be modeled as a uniform Markov chain with the space of states E (see formula 4.37) and the Markov transition function

τ : E → M(E).    (4.38)
Please note that the space of states is uncountable (#E > #ℕ) in this case, even if the population cardinality is finite (µ < +∞). Almost all the results presented in this section are cited from papers written by Rudolph [140], [141], [143], [144], Beyer and Rudolph [27] and Grygiel [82].

The model of the (µ + λ) algorithm with elitist selection

We start with the analysis of the very simple (1 + 1) evolutionary algorithm with a single-individual population Pt = {xt}, t ≥ 0. The single individual is only mutated in each genetic epoch by adding the random vector

zt = N(0, σI)    (4.39)

where N(0, σI) denotes the N-dimensional random variable with a normal probability distribution, σ > 0 stands for the standard deviation and I denotes the N × N identity matrix. Moreover, we assume that zk is independent of zl for k ≠ l. The mutated individual yt = xt + zt is selected for the next epoch population P_(t+1) if f(xt) < f(yt). If not, P_(t+1) is set as Pt. Such a selection is called a hard elitist one (see Section 3.4.3). The space of states will be E = U = D = R^N in this case. The Markovian kernel of such mutation will be expressed by the formula

τm(x)(A) = ∫_A ρ_N(0,σI)(z − x) dz    (4.40)

where x ∈ E, A ⊂ E is a measurable set in the Lebesgue sense, and ρ_N(0,σI) stands for the density function of the probability distribution of N(0, σI).
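The (1 + 1) rule just described can be sketched in a few lines; the quadratic fitness, the starting point and all parameter values below are illustrative assumptions, not taken from the text.

```python
import random

def one_plus_one(f, x0, sigma=0.3, epochs=2000, seed=0):
    """Hard elitist (1+1) evolution: y_t = x_t + z_t with z_t ~ N(0, sigma*I);
    the mutant replaces x_t only if it is strictly better fitted."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(epochs):
        y = [xi + rng.gauss(0.0, sigma) for xi in x]  # sample from the mutation kernel
        if f(y) > f(x):                               # hard elitist selection
            x = y
    return x

# illustrative unimodal fitness with its maximizer at the origin
f = lambda v: -sum(c * c for c in v)
best = one_plus_one(f, [3.0, -2.0])
```

By construction the best fitness f(b(Pt)) is non-decreasing along the run, which is exactly the monotonicity exploited later in this section.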
Let us define now

W(x) = {y ∈ E; f(y) > f(x)}    (4.41)
which is the set of admissible solutions not worse fitted than x ∈ E. Because the result of selection depends on the previous state x ∈ E, the Markovian kernel of selection depends on this quantity:

τs(y, x)(A) = χ_W(x)(y) χ_A(y) + χ_(W(x))⁻(y) χ_A(x)    (4.42)
where (W(x))⁻ = E \ W(x) denotes the complement of W(x) and χ_A is the characteristic function of the set A. Let us note that:

• If y ∈ E is better or equally fitted than x ∈ E (i.e. y ∈ W(x)), then the passage from x to y ∈ A (or, more correctly, to y ∈ A ∩ W(x)) occurs with probability 1.

• If y ∈ E is worse fitted than x (i.e. y ∈ (W(x))⁻), then y is not accepted; τs(y, x)(A) = 1 only if x ∈ A in this case.

• All other cases have occurrence probability 0.
The selection kernel is deterministic, as is usual for hard selection. In order to obtain the Markovian kernel of the discussed algorithm both kernels τm and τs should be composed (see remark 4.10):

τ(x)(A) = ∫_E τm(x)(dy) · τs(y, x)(A)
        = ∫_E τm(x)(dy) χ_(A∩W(x))(y) + χ_A(x) ∫_E τm(x)(dy) χ_(W(x))⁻(y)
        = ∫_(A∩W(x)) τm(x)(dy) + χ_A(x) ∫_(W(x))⁻ τm(x)(dy)
        = τm(x)(A ∩ W(x)) + χ_A(x) τm(x)((W(x))⁻)    (4.43)

Now we try to extend the above modeling result to the case of populations containing more than one individual (µ > 1). We start from two very simple observations:

Remark 4.68. Both formulas 4.42, 4.43, which were introduced for the (1 + 1) algorithm for which E = U = R^N, can be extended to an arbitrary space of states E if the rule of passing a "better" fitted population to the next epoch is reconstructed.

Remark 4.69. The kernel operator τm defined by formula 4.40 for the (1 + 1) algorithm may be replaced by a system of mixing operations that modify the state x ∈ E in a stochastic manner.
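Decomposition 4.43 can be checked numerically for a one-dimensional hard elitist step: a Monte Carlo estimate of τ(x)(A) should match the mutation mass of A ∩ W(x) plus, when x ∈ A, the mass of the rejection region (W(x))⁻. The fitness, the state x and the test set A below are illustrative assumptions.

```python
import random

def transition_prob(f, x, A, sigma=1.0, trials=200_000, seed=1):
    """Monte Carlo estimate of tau(x)(A) for the hard elitist (1+1) step on R."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        y = x + rng.gauss(0.0, sigma)     # sample from the mutation kernel tau_m(x)
        nxt = y if f(y) > f(x) else x     # hard elitist selection tau_s
        hits += 1 if A(nxt) else 0
    return hits / trials

f = lambda t: -t * t                      # unimodal fitness with maximizer at 0
x = 1.0                                   # current state; note x is not in A
A = lambda t: -0.5 <= t <= 0.5            # a measurable test set
p = transition_prob(f, x, A)
```

Here W(x) = (−1, 1), so A ∩ W(x) = [−0.5, 0.5] and, since x ∉ A, formula 4.43 reduces to the N(x, σ) mass of [−0.5, 0.5], roughly 0.242.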
Let us consider now the genetic universum U = D and the space of states E = U^µ/eqp, µ < +∞. We may introduce the mapping

b : E ∋ x → b(x) ∈ U    (4.44)

which selects the genotype of the best fitted individual b(x) in the population x ∈ E. Similarly to the previous rule we may select the set of states which are "not worse" than the particular state x ∈ E:

W(x) = {y ∈ E; f(b(y)) ≥ f(b(x))}    (4.45)
Fig. 4.8. The scheme of the evolutionary algorithm with elitist selection on the population level and deterministic passage of the best fitted individual.
Now we will utilize a selection rule applied immediately to the whole population y ∈ E obtained from the previous epoch population x ∈ E:
1. If y ∈ W(x) ∩ A, then the algorithm passes to the set of states A and the next population is y ∈ E.
2. If y ∉ W(x), then the best fitted individual b(y) is "worse" than b(x) and the whole population y is rejected. The population x ∈ E passes to the next epoch.

If the above selection rule is applied, then formula 4.40, which determines τm, remains valid. Moreover, we need one new mapping

e : E × E ∋ (x, y) → y′ = e(x, y) ∈ E    (4.46)

where the population y′ is obtained from the population y by placing the individual b(x) in y in some way. The scheme of such algorithms of the type (µ + λ) is presented in Figure 4.8. Now the selection may be characterized by the following Markovian kernel:

τs(x, y)(A) = χ_(A∩W(x))(y) + χ_(W(x))⁻(y) χ_A(x) χ_A(e(x, y)).    (4.47)
Composing now τs with the mixing kernel τm we obtain

τ(x)(A) = τm(x)(W(x) ∩ A) + χ_A(x) ∫_(W(x))⁻ τm(x)(dy) χ_A(e(x, y))    (4.48)

where the above integral has to be computed according to the proper measure on the space of states E.

The convergence of the evolutionary algorithm to the global maximizer

Let us consider again the global optimization problem Π1 (see Section 2.1) which consists of finding at least one global maximizer of the objective function Φ on the admissible set D. The evolutionary algorithm of the (µ + λ) type with the genetic universum U = D and the fitness function f = Φ will be utilized for solving Π1. If x∗ ∈ D is a solution to Π1, then f∗ = f(code(x∗)) will denote the maximum fitness value that may appear in this instance of evolutionary computation. Let P1, P2, . . . be the stochastic sequence of populations produced by this algorithm. We define the new stochastic sequence (stochastic process)

{Yt = f(b(Pt))}, t ≥ 0    (4.49)

which is instantiated by the best fitted individuals in the consecutive populations. Such a sequence may be considered as a sequence of estimators of f∗. It is interesting to know when the sequence {Yt}, t ≥ 0 converges to f∗ and what sort of convergence this is. Let us introduce the family of level sets

Aε = {x ∈ E; (f∗ − f(b(x))) ≤ ε}, ε > 0    (4.50)
The answer to the above question is delivered by the following theorem.
Theorem 4.70. (see Rudolph [141]) Assume that the fitness function f : U → R+ satisfies

∃M, 0 < M < +∞;  f(x) ≤ M  ∀x ∈ U.

Let us consider the evolutionary algorithm with the space of states E = U^µ/eqp whose dynamics may be described by the Markovian kernel τ that satisfies the following conditions:

∀ε > 0 ∃δ > 0 such that τ(x)(Aε) ≥ δ ∀x ∈ (Aε)⁻,
τ(x)(Aε) = 1 ∀x ∈ Aε.

Then the sequence {Yt}, t ≥ 0 converges completely to f∗, which means that

∀ε > 0   lim_(t→+∞) Σ_(i=0)^t Pr{(f∗ − Yi) > ε} < +∞.
Remark 4.71. If the sequence {Yt}, t ≥ 0 converges completely to f∗ then it also converges almost surely to f∗, i.e.

Pr{ lim_(t→+∞) (f∗ − Yt) = 0 } = 1,

and converges in probability to f∗, i.e.

∀ε > 0   lim_(t→+∞) Pr{(f∗ − Yt) > ε} = 0.
The theses of the above remark follow from the well-known dependencies between the various sorts of stochastic convergence that may be found in many monographs (see e.g. Lucas [106], Chow and Teicher [49], Iosifescu [88], Billingsley [28]).

Example 4.72. Let us check when the group of algorithms described in the previous sections satisfies the assumptions of Theorem 4.70. The Markovian kernel of such algorithms is described by formula 4.48. If x ∉ Aε, then Aε ⊂ W(x) and Aε ∩ W(x) = Aε, where W(x) = {y ∈ E; f(b(y)) ≥ f(b(x))}, and then τ(x)(Aε) = τm(x)(Aε). If now x ∈ Aε, then W(x) ⊆ Aε and W(x) ∩ Aε = W(x), so we have
τ(x)(Aε) = τm(x)(W(x)) + ∫_(W(x))⁻ τm(x)(dy) χ_Aε(e(x, y)) = τm(x)(W(x)) + τm(x)((W(x))⁻) = 1

because e(x, y) ∈ W(x) ⊆ Aε. The second assumption of Theorem 4.70 is then satisfied. The probability of the passage to the set Aε can be computed as

τ(x)(Aε) = τm(x)(Aε) χ_(Aε)⁻(x) + χ_Aε(x).

The first assumption of Theorem 4.70 will be satisfied if τm(x)(Aε) ≥ δ, ∀x ∈ (Aε)⁻ for some δ > 0.

Now we intend to formulate sufficient conditions for the convergence of the discussed class of algorithms. Let us assume that the mixing operation is composed of mutation and other operations (e.g. crossover). The mixing kernel τm will be the composition of the mutation kernel τmut and the kernel τc that models the remaining mixing operations. Moreover, we assume that the space of states of the considered class of algorithms is finite, i.e. #E < +∞. The crucial, practical assumption will be stressed as follows: mutation allows us to obtain an arbitrary state x′ ∈ E in a single genetic epoch starting from any other state x ∈ E. In other words,

∃σm > 0;  τmut(x)({x′}) ≥ σm,  ∀x, x′ ∈ E.    (4.51)
Then it implies

τm(x)({x′}) = Σ_(y∈E) τc(x)({y}) τmut(y)({x′}) ≥ σm Σ_(y∈E) τc(x)({y}) = σm τc(x)(E) = σm > 0    (4.52)
and also τm(x)(Aε) ≥ σm > 0 ∀x ∈ (Aε)⁻; moreover, Aε ≠ ∅. The above result may be compared to Theorem 4.31 and remark 4.32, which is the consequence of applying the ergodic theorem to the Markov chain that models the Simple Genetic Algorithm.

The convergence of supermartingales in evolutionary algorithm convergence analysis

Definition 4.73. (see Neveu [114]) Let the triple (ΩP, F, Pr) be the probabilistic space where ΩP stands for the set of elementary events, F is the σ-algebra and Pr the probability measure. We will analyze the increasing sequence of sub-σ-algebras F0 ⊆ F1 ⊆ F2 ⊆ · · · ⊆ F that satisfies the condition

F∞ = σ( ∪_(t≥0) Ft ) ⊆ F
where σ(A) denotes the minimal σ-algebra that contains the class A of subsets of the elementary event set ΩP (see Billingsley [28]). The stochastic process {Xt}, t ≥ 0 for which the random variables Xt are Ft-measurable will be called a supermartingale if

∀t ≥ 0   E(|Xt|) < +∞  and  E(Xt+1 | Ft) ≤ Xt almost surely,

where E(· | F) denotes the expected value operator computed with respect to the σ-algebra F (see e.g. Billingsley [28]). Moreover, if ∀t ≥ 0 Pr{Xt ≥ 0} = 1 then the supermartingale {Xt}, t ≥ 0 will be called non-negative.
Theorem 4.74. (see Rudolph [140]) Let {Xt}, t ≥ 0 be a non-negative supermartingale that satisfies E(Xt+1 | Ft) ≤ ct Xt with probability 1, where ct ≥ 0 for t ≥ 0 and

Σ_(t=1)^(+∞) Π_(k=0)^(t−1) ck < +∞;

then

lim_(t→+∞) E(Xt) = 0,

which means that {Xt}, t ≥ 0 converges in mean, and it converges completely:

∀ε > 0   lim_(t→+∞) Σ_(i=0)^t Pr{Xi > ε} < +∞.
Remark 4.75. The thesis of Theorem 4.74 also implies the almost sure convergence of the supermartingale {Xt}, t ≥ 0 to zero,

Pr{ lim_(t→+∞) |Xt| = 0 } = 1,

and convergence in probability:

∀ε > 0   lim_(t→+∞) Pr{|Xt| > ε} = 0.
The above remark follows simply from the well-known dependencies among various modes of stochastic convergence (see e.g. Lucas [106], Chow and Teicher [49]). We will continue with two almost trivial observations.
Remark 4.76. The assumptions of the above Theorem 4.74 are satisfied if

lim sup_(t→+∞) {ct} < 1

or, in particular, if ct ≡ c < 1.
Remark 4.77. If ct ≡ c < 1 then for t ≥ 0 the inequality E(Xt+1 | Ft) ≤ c Xt holds almost surely; then E(Xt+1) ≤ c E(Xt) and E(Xt) ≤ c^t E(X0), so the rate of convergence of the supermartingale {Xt}, t ≥ 0 to zero is geometric.

Taking the above theorem and remarks into account, we try to formulate further conditions for evolutionary algorithm convergence. These conditions will handle the expected increment of the maximum fitness f(b(Pt)) that occurs in the population in a particular evolution step. Let us assign to the evolutionary algorithm that processes populations Pt ∈ E = U^µ/eqp a new random sequence

{ωt} = {f∗ − f(b(Pt))}, t ≥ 0    (4.53)
where f : U → R+ stands for the fitness function and f∗ for its maximum value.

Remark 4.78. If f < M < +∞ on the whole U and the sequence of populations {Pt}, t ≥ 0 was produced by the evolutionary algorithm described in example 4.72, then the random sequence {ωt}, t ≥ 0 is a non-negative supermartingale.

Proof. The expectation E(|ωt|) = E(|f∗ − f(b(Pt))|) takes finite values when the fitness function f is bounded on U. The random variable ωt is non-negative almost surely for t ≥ 0 because f∗ ≥ f(b(Pt)), t ≥ 0. Moreover, E(ωt+1 | Ft) ≤ ωt holds almost surely because f(b(Pt+1)) ≥ f(b(Pt)), t ≥ 0.

Remark 4.79. Let the evolutionary algorithm described in example 4.72 satisfy the assumptions of remark 4.78 and let there be a constant c ∈ (0, 1) so that

E(f∗ − f(b(Pt+1)) | Ft) ≤ c (f∗ − f(b(Pt)));

then the sequence {ft} = {f(b(Pt))}, t ≥ 0 converges in mean to f∗, i.e.

lim_(t→+∞) E(|f∗ − f(b(Pt))|) = 0

with the geometric rate c^t, which precisely means that E(|f∗ − f(b(Pt))|) has the order O(c^t).
Proof. If the assumptions of remarks 4.78 and 4.79 are satisfied, then the random sequence {ωt}, t ≥ 0 is a non-negative supermartingale which satisfies all the assumptions of Theorem 4.74, so according to its thesis the convergence in mean lim_(t→+∞) E(ωt) = 0 may be drawn, and then

lim_(t→+∞) E(f∗ − f(b(Pt))) = f∗ − lim_(t→+∞) E(f(b(Pt))) = 0.

The geometric rate of this convergence may be drawn from remark 4.77.
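The geometric decay of Remark 4.77 can be watched on a toy non-negative supermartingale; the multiplicative noise model, c = 0.8 and the other constants are illustrative assumptions.

```python
import random

def mean_trajectory(c=0.8, x0=1.0, steps=40, runs=5000, seed=2):
    """Average X_t over many runs of X_{t+1} = U_t * X_t, where U_t >= 0 and
    E(U_t) = c < 1, so that E(X_{t+1} | F_t) = c * X_t almost surely."""
    rng = random.Random(seed)
    means = [0.0] * (steps + 1)
    for _ in range(runs):
        x = x0
        means[0] += x
        for t in range(1, steps + 1):
            x *= rng.uniform(0.0, 2.0 * c)  # non-negative factor with mean c
            means[t] += x
    return [m / runs for m in means]

m = mean_trajectory()
# E(X_t) should track c**t * x0, i.e. decay geometrically to zero
```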
4.2 Asymptotic results for very small populations

Some interesting asymptotic results have been obtained for genetic algorithms which process very small (1–2 individual) populations. Although they do not deliver tools for the immediate analysis of computation instances applied in engineering practice, they allow us to establish the intuition necessary for a better understanding of the activity and synergy of particular genetic operations. They are also helpful when trying to understand the meaning of various convergence modes associated with genetic algorithm dynamics. The first group of results, described in Section 4.2.1, was formulated and proved by Kazimierz Grygiel and published in his paper [81]. The second one, stressed in Section 4.2.2, was delivered mainly by Iwona Karcz-Dulęba and published in the papers [70], [62] and [91].

4.2.1 The rate of convergence of the single individual population with hard succession

We intend to evaluate the rate of convergence of the genetic algorithm that solves the one-dimensional optimization problem of type Π1 (see Section 2.1)

max_(x∈S) {Φ(x)},  S = [0, 1] ⊂ R    (4.54)

for a unimodal, continuous objective function Φ : S → R+. Besides the simplicity of the problem to be solved, the genetic computation model was also drastically simplified in order to obtain a strong mathematical result. We will apply a particular type of affine binary encoding in which binary strings of the length l serve as genotypes. The binary universum Ω = {a = (a0, . . . , a_(l−1)); aj ∈ {0, 1}, 0 ≤ j ≤ l − 1} will be mapped onto the left ends of sub-intervals of the length 2^(−l), so the encoding and the set of phenotypes will be defined by the formula

code : Ω ∋ a → code(a) = Σ_(j=0)^(l−1) aj 2^(−j−1) ∈ Dl,  Dl = {code(a), a ∈ Ω}.    (4.55)

Please note that, in contrast to the traditional binary affine encoding, the right end of the interval S does not belong to the phenotypes: 1 ∉ Dl. Because
the numerical value of each genotype a ∈ Ω precisely fits the phenotype code(a), we will not distinguish between a genotype and the associated phenotype if it does not lead to ambiguity. We will use the single genetic operation called AB-mutation, which is given by the stochastic mapping

Ω ∋ a → a ⊕ (±i) ∈ Ω.    (4.56)
The code i ∈ Ω stands for the AB-mutation mask, obtained by sampling according to the same probability distribution as in the case of the binary multi-point mutation (see Section 3.5.1, formula 3.37):

Pr({i}) = (pm)^(1,i) (1 − pm)^(l−(1,i))    (4.57)
where 1 = (1, . . . , 1) (l times) denotes the binary vector of l units and (1, i) the Euclidean scalar product of the binary vectors 1 and i; moreover, pm ∈ [0, 1] stands for the mutation rate parameter. The sign assigned to the mask is sampled from the set {+, −} with the uniform probability distribution {1/2, 1/2}, independently of the mask code i. The suggested type of mutation exhibits a much stronger ability to explore the interval S than the standard multi-point binary mutation, which replaces the individual a ∈ Ω by a ⊕ i. This may be formalized as the following lemma.

Lemma 4.80. (see Grygiel [81], Lemma 1) Let x, y ∈ Dl, x < y. Then there are two masks i′, i′′ ∈ Ω with (1, i′) = (1, i′′) = 1 so that

y − x′ < (y − x)/2   and   x′′ − x < (y − x)/2

where x′ = x + i′ and x′′ = x + i′′.
In other words, for each pair of nodes x, y there is a one-point mutation mask with a large sampling probability (see formula 4.57) which produces an individual that is less distant from the parent than from the second node. The genetic algorithm under consideration transforms single-individual populations, so E = Dl stands for its space of states. The transition to the next state is described by the formula

x_(t+1) = ξt if Φ(ξt) > Φ(xt), otherwise x_(t+1) = xt,  t ≥ 0    (4.58)

where ξt is the AB-mutant of xt. The above algorithm may be understood as an instance of the random walk (see Section 2.2) in which AB-mutation plays
the role of the sampling procedure. Let us denote by Φ̂ = Φ|Dl the restriction of the objective function to the phenotype set. The main result of this section will be formulated as follows.

Theorem 4.81. (see Grygiel [81], Theorem 1) For an arbitrary unimodal, bounded function Φ̂ : Dl → R+ the expected time Tl to find the best approximation of the global maximizer satisfies the inequality

Tl ≤ 4l / (pm (1 − pm)^(l−1))
Remark 4.82. The denominator on the right hand side of the main formula in Theorem 4.81 reaches its maximum value if pm = 1/l, which is the best hint for choosing the algorithm parameters (l, pm = 1/l). Moreover, for such a parameter setting Tl = O(l²), because (1 − 1/l)^(l−1) → e^(−1) for l → +∞.

The above remark may be generalized to the case of multidimensional, bounded, unimodal functions Φ̂ : (Dl)^N → R+ which are separable, i.e. there exists a representation of Φ̂ so that

Φ̂(x1, . . . , xN) = Σ_(i=1)^N αi Φ̂i(xi),  αi ≠ 0,  Φ̂i : Dl → R+,  i = 1, . . . , N.    (4.59)

In this case, the optimal parameter selection is (l, pm = 1/(N l)) and gives the evaluation Tl = O(N l²).
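The random walk 4.58 with AB-mutation is easy to simulate once genotypes are encoded as integers k with phenotype k/2^l, which reproduces formula 4.55. The rejection of mutants falling outside Dl, the target function and the stopping criterion are illustrative assumptions; a single run can only hint at the expected-time bound of Theorem 4.81.

```python
import random

def ab_hillclimb(phi, l, pm, seed=3, max_epochs=100_000):
    """Single-individual random walk with AB-mutation on D_l = {k / 2**l};
    returns the epoch at which the best node of D_l is first reached."""
    rng = random.Random(seed)
    n = 1 << l
    best = max(range(n), key=lambda k: phi(k / n))
    x = 0                                         # start at the left end of S
    for t in range(max_epochs):
        if x == best:
            return t
        mask = sum(1 << j for j in range(l) if rng.random() < pm)
        sign = 1 if rng.random() < 0.5 else -1    # the +/- sign of the mask
        cand = x + sign * mask
        if 0 <= cand < n and phi(cand / n) > phi(x / n):  # hard succession, 4.58
            x = cand
    return max_epochs

l = 10
pm = 1.0 / l                                      # Remark 4.82: pm = 1/l
phi = lambda u: -(u - 0.75) ** 2                  # illustrative unimodal objective
t_hit = ab_hillclimb(phi, l, pm)
bound = 4 * l / (pm * (1 - pm) ** (l - 1))        # Theorem 4.81: E(T_l) <= bound
```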
4.2.2 The dynamics of double individual populations with proportional selection

This section presents results related to the dynamics of populations composed of only two individuals whose genotypes are real-valued vectors. Similarly to the previous Section 4.2.1, mutation is the only genetic operation used in the creation of new populations. However, the asymptotic results will be of a different sort and will be presented quite differently. The section contents are based on the papers [70], [62] and [91]. The presentation will start with a detailed description of the slightly more general case of evolutionary algorithms for which each population is of the constant size µ ∈ ℕ, not only µ = 2. The genetic universum will be U = R^N (see phenotypic encoding, Section 3.1.3). The space of states will then be E = (R^N)^µ/eqp, where eqp is the equivalence defined in Section 2.2 (see formula 2.14) which identifies µ-element strings of N-dimensional vectors that may be obtained by permutation of their vector elements. The global optimization problem under consideration consists of finding the global maximizer of a real-valued, bounded function of N real variables
in the unbounded domain, i.e. the unbounded domain version of problem Π1 defined in Section 2.1:

max_(x∈R^N) {f(x)},  f(x) < M < +∞, ∀x ∈ R^N    (4.60)
The stochastic transition rule τ : E → M(E) may be defined in the following steps:

1. Select an individual x from the population P according to the proportional selection rule, i.e. x will be obtained by multiple sampling according to the probability distribution given by formula 3.30 (see Section 3.4.1).
2. Modify the individual x → x′ according to the phenotypic, normal mutation rule (see Section 3.7.1, formula 3.51), i.e. x′ = x + ξ where ξ is the result of sampling according to the N-dimensional normal distribution with a zero mean and the diagonal, isotropic covariance matrix C = σI. The coefficient σ stands for the standard deviation of this distribution, which is valid for each dimension of the genetic universum.
3. Place the offspring x′ in the next epoch population P′.
4. If the next epoch population P′ contains less than µ individuals, then go to step 1.

The probability of sampling the individual x′ ∈ U = R^N to the next epoch population P′ is given by the formula

Pr({x′}|P) = Σ_(y∈P) α(y) ρ_N(y,σI)(x′)    (4.61)

where

α(y) = f(y) / Σ_(z∈P) f(z)
is the probability of selecting the individual with the genotype y from the current population P, and ρ_N(y,σI) is the density of the N-dimensional normal distribution with the mean y ∈ U and the covariance matrix C = σI. The probability distribution τ(P) ∈ M(E) has a density function. Let P′ = {x1, . . . , xµ} ∈ E be the population that immediately follows P; then

τ(P)({P′}) = µ! Π_(j=1)^µ Pr({xj}|P) = µ! Π_(j=1)^µ Σ_(y∈P) α(y) ρ_N(y,σI)(xj).    (4.62)
If µ = 2, N = 1 (two-individual population, scalar genotypes) then we may represent each population as P = {y1, y2}, yi ∈ R, i = 1, 2. The probability of sampling the individual with the genotype x to the epoch that follows P is given by the formula

Pr({x}|P) = α(y1) ρ_N(y1,σ)(x) + α(y2) ρ_N(y2,σ)(x)    (4.63)
where ρ_N(y,σ)(x) stands now for the density function of the one-dimensional normal probability distribution with the mean value y and the standard deviation σ. For the next epoch population P′ = {x1, x2} we have

τ(P)({P′}) = 2 Pr({x1}|P) Pr({x2}|P)
           = 2 (α(y1)ρ_N(y1,σ)(x1) + α(y2)ρ_N(y2,σ)(x1)) (α(y1)ρ_N(y1,σ)(x2) + α(y2)ρ_N(y2,σ)(x2)).    (4.64)

The space of states of the evolutionary algorithm may now be interpreted as the half-plane, according to the following rule:

E ∋ (x1, x2) → (x1, x2) if x1 ≥ x2;  (x2, x1) if x1 < x2.    (4.65)
This is possible because the state of the algorithm does not depend on the order of the individuals in the population. The next formal step that makes the analysis more convenient is the π/4 turn of the system of coordinates (x1, x2) in the space of states. The new system of coordinates (w, z) will be given by the formula

w = (x1 − x2)/√2,  z = (x1 + x2)/√2,  w ≥ 0.    (4.66)

In this system of coordinates the expected location (w′, z′) of the next epoch population P′ is computed with respect to the location (w, z) of the current population P:

E(w′|P) = σ √(2/π) + (1 − γ²) σ ( φ(w/σ) + (w/σ) Ξ(w/σ) )

E(z′|P) = z + γw,  γ = (q1 − q2)/(q1 + q2)

q1 = f(x1) = f((w + z)/√2),  q2 = f(x2) = f((z − w)/√2)

φ(ζ) = (1/√(2π)) (exp(−ζ²/2) − 1),  Ξ(ζ) = (1/√(2π)) ∫_0^ζ exp(−t²/2) dt    (4.67)
Subsequent computations are performed for two fitness functions

f(x) = exp(−ax²),  f(x) = exp(−a(x + d)²) + exp(−a(x − d)²)    (4.68)
where a and d are positive numerical parameters. Let us now compute the expected locations for populations at some characteristic arguments (previous states, previous step populations), which allow us to formulate substantial qualitative conclusions about the dynamic behavior of double individual populations. Such expected locations have a similar meaning to the values of the genetic operator (see definition 4.16 and Theorem 4.20) for the Simple Genetic Algorithm.

Fig. 4.9. Expected behavior of the two-individual population.

1. If the population P lies on the 0z axis (w = 0), i.e. x1 = x2, q1 = q2, then

E(w′|P) = σ √(2/π),  E(z′|P) = z.    (4.69)

The population is pushed from the axis 0z to the straight line w = σ √(2/π).

2. If the population P lies on the axis 0w (on the axis of symmetry z = 0), i.e. x1 = −x2, q1 = q2, then

E(w′|P) = σ √(2/π) + σ ( φ(w/σ) + (w/σ) Ξ(w/σ) ),  E(z′|P) = 0    (4.70)

and for large w we have

E(w′|P) = (1/2) ( σ √(2/π) + w ).    (4.71)
3. If the population P lies on the straight line w = z, then x2 is optimal, i.e. q2 = fmax, and

E(w′|P) = σ √(2/π) + (1 − γ²) σ ( φ(w/σ) + (w/σ) Ξ(w/σ) ),  E(z′|P) = (1 + γ)z.    (4.72)

Because γ < 0 in this case, the expected value of z decreases. If q2 is much larger than q1 or q1 → 0, then γ → −1 and

E(w′|P) = σ √(2/π),  E(z′|P) = 0.    (4.73)

The population P "jumps" to the location (σ √(2/π), 0) on the axis 0w. The symmetric behavior may be observed for a population P from the line w = −z (q1 = fmax).

4. For populations located far from both axes 0w and 0z, if q2 is much larger than q1, we have

E(w′|P) = σ √(2/π),  E(z′|P) = z − w.    (4.74)

All the behavior described above is illustrated in Figure 4.9. As can be seen, the straight line w = σ √(2/π) plays the role of an attractor, sometimes called the evolutionary channel. If the individuals in the population P are distant from each other (w is large) or |q1 − q2| takes large values, then the expected behavior is the population "jump" to the evolutionary channel, where the population individuals differ by σ √(2/π). Next, the population slowly moves along the evolutionary channel with the step size ∆z = ±σ √(2/π) until it comes sufficiently close to the symmetry axis z = 0. The behavior described above exhibits the influence of the fitness function form. Because |γ| ≤ 1, the coordinate z of the population state changes very slowly. The shape of the regions in which the fitness strongly affects the expected value of the population state in the next genetic epoch also strongly depends on the fitness form. In such regions

E(w′|P) > σ √(2/π),  E(∆z|P) < σ √(2/π).    (4.75)

The formal analysis and discussion of the two-individual population's expected behavior is confirmed by tests performed for two fitness functions

f(x) = exp(−5x²)    (4.76)
Fig. 4.10. Trajectories for the expected two-individual population in case of the unimodal fitness f (x) = exp(−5x2 ). White circles mark the starting positions of populations, black dots the positions after the first epoch while crosses mark the positions after 20 epochs.
f(x) = exp(−5x²) − 2 exp(−5(x − 1)²)    (4.77)
In both cases we set σ = 0.1. The computation results for function 4.76 are presented in Figure 4.10, while those for function 4.77 in Figure 4.11. We may observe that the evolution went according to the scheme of the expected position transformations described before. We may distinguish two evolution phases:

• The "jump" close to the identity axis w = 0, which is practically independent of the initial position. It is the effect of the reproduction of only a single individual.

• The slow "drift" towards the maximizer of the fitness f. In the case of the unimodal function 4.76 this drift is immediate. If the fitness f is bimodal (see e.g. formula 4.77) such "drift" is much slower and may pass through the local extrema of f.
Another interesting subject is the study of the fixed points of the expected population operator, which correspond in some way to the fixed points of the genetic operator for the Simple Genetic Algorithm. Let us turn back to more general fitness definitions 4.68. Such points may be obtained in this case by solving the system
Fig. 4.11. Trajectories for the expected two-individual population in case of the bimodal fitness f (x) = exp(−5x2 ) − 2 exp(−5(x − 1)2 ). White circles mark the starting positions of populations, black dots the positions after the first epoch while crosses mark the positions after 20 epochs.
ws = E(w′ | P = (ws, zs)),  zs = E(z′ | P = (ws, zs))    (4.78)

The second equation is satisfied if γ = 0, i.e. if q1 = q2, which gives

f((z + ws)/√2) = f((z − ws)/√2)    (4.79)

The approximate solution of the above equation equals ws = 0.97σ and does not depend on the fitness function, but only on the standard deviation of the mutation operation. However, the second coordinate zs depends on the fitness. If the fitness f is symmetric, then zs = 0, which may be obtained from equation 4.79. If the fitness is symmetric and unimodal, then the point (0.97σ, 0) is the only equilibrium point of the expected population (fixed point). In this equilibrium state both individuals are located at the same distance ws/√2 from the global maximizer of f, so (x1)s = −0.7σ, (x2)s = 0.7σ.

For the function f(x) = exp(−ax²) the point (0.97σ, 0) is asymptotically stable if a ∈ (0, a0) where a0 = 2/(ws)² = 2/(0.97σ)².
For multimodal functions we may obtain more fixed points of the expected population operator. In particular, for the function f(x) = exp(−a(x + d)²) + exp(−a(x − d)²) we may have one or three fixed points. If σ > 1.46d, then the saddle point (0.97σ, 0) is the only equilibrium of the expected two-individual population. If σ < 1.46d, then two more fixed points appear close to the local maximizers of the fitness, symmetrically with respect to the axis 0w. When σ decreases (the mutation becomes less intensive), such points approach the local maximizers. Generally, the process of two-individual population evolution may lead to finding symmetric maximizers if the standard deviation of the mutation σ is small in comparison to the distance between the two maximizers.
4.3 The increment of the schemata cardinality in the single evolution epoch

The schemata theory was the first formal approach which tried to explain the asymptotic behavior of genetic algorithms with binary encoding. It was introduced by Holland in 1975 [85] and was devoted rather to the modeling of artificial life than to the analysis of stochastic global optimization algorithms. This approach has been criticized many times (see e.g. Grefenstette and Bayer [79], Grefenstette [78], Podsiadło [130]). A significant improvement in the understanding of the main schemata theorem idea was delivered by Whitley in his Genetic Algorithm Tutorial [196]. A detailed explanation and exact formulation was given by Wright [204]. Recently, comments have been published by Kieś and Michalewicz [93] as well as by Reeves and Rowe [134] and Schaefer [151]. We intend to deliver quite a detailed and precise formulation and proof of these historical results in order to explain their true, very constrained meaning in genetic algorithm analysis. We will partially follow the way taken by the last group of cited authors and Vose's definition of the Simple Genetic Algorithm (see Vose [193], also Section 3.6).

Let us recall the binary genetic universum Ω = {(a0, a1, . . . , a_(l−1)); ai ∈ {0, 1}, i = 0, 1, . . . , l − 1}, which contains binary codes of the length l (see formula 3.5). The set S = {0, 1, ∗}^l will be called the space of schemata.

Definition 4.83. The sequence (h0, h1, . . . , h_(l−1)) ∈ S depicts the schemata H which is the subset of the binary genetic universum (H ⊂ Ω) given by the formula

H = { (a0, a1, . . . , a_(l−1)); ai = hi for hi ∈ {0, 1}, ai = 0 or 1 for hi = ∗, i = 0, 1, . . . , l − 1 }.

Moreover, we assume that each schemata is nontrivial, i.e. ∃i ∈ {0, 1, . . . , l − 1}; hi ∈ {0, 1}.
4 Asymptotic behavior of the artificial genetic systems
Each schemata may be characterized by some important parameters. The first of them is called the schemata length and the second the degree of schemata.

Definition 4.84. The length ∆(H) of the schemata H ⊂ Ω is the maximum distance between the well-defined digits

∆(H) = max_{i,j=0,1,...,l−1} { |i − j|; h_i, h_j ∈ {0, 1} }
while the degree ℵ(H) of the schemata H ⊂ Ω is equal to the number of well-defined digits in H. It is easy to observe that ∆(H) ∈ {0, 1, ..., l−1} while ℵ(H) ∈ {1, ..., l}.

We will model the finite population of individuals which are clones of elements from the genetic universum Ω as pairs P = (Ω, η), where the function η : Ω → R_+ returns the number of clones of the particular genotype from Ω (see Definition 2.8). The fitness function will be represented by the 2^l-dimensional vector f = {f_i}, i ∈ Ω. For convenience we denote by f̄_P the mean fitness of the population P = (Ω, η), i.e.

f̄_P = (1/µ) Σ_{i∈Ω} f_i η(i)    (4.80)

where µ stands for the population cardinality µ = #P < +∞.

Definition 4.85. Let us consider the schemata H ⊂ Ω. The schemata representation in the population P = (Ω, η) will be the multiset rep(H, P) = (Ω, η_rep(H,P)) where

η_rep(H,P)(i) = { η(i) if i ∈ H; 0 otherwise }.
We now introduce the next two parameters of the schemata H related to the particular population P. The first of them will be the mean fitness of the schemata representation

f̄_rep(H,P) = (1/#rep(H, P)) Σ_{i∈Ω} f_i η_rep(H,P)(i)    (4.81)

and the fitness ratio of the schemata representation

rat(H, P) = f̄_rep(H,P) / f̄_P.    (4.82)
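The quantities of Definitions 4.84–4.85 and formulas (4.80)–(4.82) can be computed directly for a toy population. The following sketch is ours (names and the example data are illustrative assumptions, not taken from the book); a population P = (Ω, η) is stored as a multiset mapping each genotype to its number of clones η(i).

```python
from collections import Counter

def length(schema):                      # Delta(H): max distance between fixed digits
    fixed = [i for i, h in enumerate(schema) if h != '*']
    return max(fixed) - min(fixed)

def degree(schema):                      # aleph(H): number of fixed digits
    return sum(h != '*' for h in schema)

def rep(schema, pop):                    # rep(H, P): sub-multiset of P matching H
    return Counter({g: n for g, n in pop.items()
                    if all(h == '*' or h == a for h, a in zip(schema, g))})

def mean_fitness(pop, f):                # formula (4.80); mu = #P
    mu = sum(pop.values())
    return sum(f[g] * n for g, n in pop.items()) / mu

def rat(schema, pop, f):                 # fitness ratio (4.82)
    r = rep(schema, pop)
    f_rep = sum(f[g] * n for g, n in r.items()) / sum(r.values())   # (4.81)
    return f_rep / mean_fitness(pop, f)

# toy population: genotype -> number of clones eta(i), with its fitness vector
P = Counter({"110": 2, "100": 1, "011": 1})
f = {"110": 4.0, "100": 2.0, "011": 2.0}
```

For the schema "1**" the representation contains three of the four individuals, and its fitness ratio rat("1**", P) = (10/3)/3 exceeds 1, i.e. the schema is better fitted than the population average.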
Let us now consider the infinite sequence of populations P_0, P_1, ..., generated by the Simple Genetic Algorithm, so that P_t = (Ω, η_t) and #P_t = µ < +∞ for all t = 0, 1, .... We assume the multi-point mutation described by the formula 3.36, where the mutation mask is given by the probability distribution 3.37. We will restrict ourselves to the one-point crossover determined by formulas 3.40, 3.41, while the crossover type is given by formula 3.42. The proportional selection distribution utilized in the SGA, described by formula 3.30, may be rewritten in the form

self({i}) = f_i η_t(i) / (µ f̄_{P_t}).    (4.83)
Let us start with three lemmas that correspond to Lemmas 3.1, 3.2 and 3.3 in Reeves and Rowe [134].

Lemma 4.86. The probability of selecting a single individual that represents the schemata H at the epoch t equals

(1/µ) rat(H, P_t) #rep(H, P_t).

Proof. The probability under consideration may be computed as follows

Σ_{i∈H} self({i}) = Σ_{i∈H} f_i η_t(i) / (µ f̄_{P_t}) = Σ_{i∈Ω} f_i η_rep(H,P_t)(i) / (µ f̄_{P_t}) = #rep(H, P_t) f̄_rep(H,P_t) / (µ f̄_{P_t}) = #rep(H, P_t) rat(H, P_t) / µ.

It is obvious that the probability of selecting an individual that does not represent H is

1 − (1/µ) rat(H, P_t) #rep(H, P_t).    (4.84)

Lemma 4.87. Let x ∈ H and y ∈ Ω \ H (or x ∈ Ω \ H and y ∈ H) be parent genotypes. The probability that the offspring z ∈ Ω obtained by the crossover of x and y belongs to H is larger than or equal to

(1/2)(1 − p_c ∆(H)/(l−1)).

Proof. The probability that x will be crossed non-trivially with y equals p_c, because it is the probability that the crossover mask differs from the string composed of zeros or from the string composed of ones (see formula 3.41). Because the selection of the cutting position is uniform (see formula 3.42),
then the conditional probability of destroying the schemata H in one child is less than or equal to p_c ∆(H)/(l−1). It may be substantially less, because the part of the string y that is exchanged may also fit the schemata, so this child may also belong to H. Anyway, the lower bound of the probability that this child belongs to H is 1 − p_c ∆(H)/(l−1). From the Bayes rule applied to the crossover we may obtain the following lower bound of the probability that the crossover result belongs to the schemata H

Pr{z ∈ H} ≥ (1/2)(1 − p_c ∆(H)/(l−1)) + (1/2) δ

where δ stands for the probability that the second child belongs to H. The thesis of Lemma 4.87 can be obtained by dropping the second term on the right-hand side of the above inequality.

Lemma 4.88. Assuming z ∈ H, the probability that z′, obtained from z by uniform mutation, stays in the schemata H is (1 − p_m)^ℵ(H).

Proof. The digits predefined in the schemata H that occur in the string z ∈ H will be passed to z′ if the mutation mask has zeros on the loci predefined in the schemata H. The probability of selecting such a mask due to the rule described in Section 3.5.1 equals the probability of sampling zeros on the ℵ(H) positions and arbitrary values on the remaining l − ℵ(H) positions independently. Then we have Pr{z′ ∈ H} = (1 − p_m)^ℵ(H) · 1^{l−ℵ(H)}.

The next lemma will evaluate the probability of the schemata surviving in one-time sampling in the single step of the SGA.

Lemma 4.89. Assuming the current population P_t, the probability of one-time sampling the individual z′ ∈ H according to the SGA procreation rule is greater than or equal to

(rat(H, P_t) #rep(H, P_t) / µ) (1 − p_c (∆(H)/(l−1)) (1 − rat(H, P_t) #rep(H, P_t)/µ)) (1 − p_m)^ℵ(H).
Proof. By using the Bayes rule we can evaluate

Pr{z′ ∈ H} = Pr{z′ ∈ H | z ∈ H} Pr{z ∈ H} + Pr{z′ ∈ H | z ∈ Ω \ H} Pr{z ∈ Ω \ H}

where z′ is obtained by mutation from the individual z. Dropping the second term, which gives the probability of obtaining by mutation a string that belongs to H from a string that does not belong to H, and using the thesis of Lemma 4.88, we have

Pr{z′ ∈ H} ≥ (1 − p_m)^ℵ(H) Pr{z ∈ H}.    (4.85)
Let us now assume that the individual z was obtained from two parental strings x and y by one-point crossover. Using the Bayes rule once more we have

Pr{z ∈ H} = Pr{z ∈ H | x, y ∈ H} Pr{x, y ∈ H} + Pr{z ∈ H | x ∈ H and y ∈ Ω \ H} Pr{x ∈ H and y ∈ Ω \ H} + Pr{z ∈ H | x ∈ Ω \ H and y ∈ H} Pr{x ∈ Ω \ H and y ∈ H} + Pr{z ∈ H | x, y ∈ Ω \ H} Pr{x, y ∈ Ω \ H}.

Dropping the last term, which expresses the probability of obtaining a string that belongs to the schemata H from two parents that do not belong to H, and next using the theses of Lemmas 4.86 and 4.87, we obtain (denoting for brevity q = rat(H, P_t) #rep(H, P_t)/µ)

Pr{z ∈ H} ≥ 1 · q² + (1/2)(1 − p_c ∆(H)/(l−1)) q (1 − q) + (1/2)(1 − p_c ∆(H)/(l−1)) (1 − q) q = q (1 − p_c (∆(H)/(l−1)) (1 − q))
and now, substituting into the inequality 4.85, we can complete the proof

Pr{z′ ∈ H} ≥ (rat(H, P_t) #rep(H, P_t)/µ) (1 − p_c (∆(H)/(l−1)) (1 − rat(H, P_t) #rep(H, P_t)/µ)) (1 − p_m)^ℵ(H).

Because the process of obtaining the next step population P_{t+1} is random, the schemata representation cardinality #rep(H, P_{t+1}) may also be handled as a random variable. Taking the above definitions and notations into account, the main result of this section may be formulated.

Theorem 4.90. If all the assumptions made in Lemmas 4.86, 4.87, 4.88 are satisfied then we have

E(#rep(H, P_{t+1})) ≥ rat(H, P_t) #rep(H, P_t) (1 − p_c (∆(H)/(l−1)) (1 − rat(H, P_t) #rep(H, P_t)/µ)) (1 − p_m)^ℵ(H).
Proof. The total procreation step in the SGA may be treated as µ-time independent sampling of the single individual z′ according to the Bernoulli scheme, in which the result z′ ∈ H is interpreted as a success. The number of successes in such sampling is the cardinality of rep(H, P_{t+1}), i.e. the number of individuals in P_{t+1} that belong to the schemata H in the next epoch. Using the well-known formula for the expectation in the Bernoulli scheme we obtain E(#rep(H, P_{t+1})) = µ Pr{z′ ∈ H}. It is sufficient to use Lemma 4.89 to complete the proof.

The formula derived in the thesis of Theorem 4.90 is really nothing new and is almost the same as that given by Whitley [196]. The advantage of the above considerations seems to be the rigorous formulation and proof of all the steps, which exposes all their imperfections and simplifications, leading in particular to the underestimation of the expectation of the rep(H, P_{t+1}) cardinality. The thesis of Theorem 4.90 inherits, of course, all the imperfections of the schemata theorem mentioned by many authors (see e.g. Reeves and Rowe [134], Whitley [196], Vose [193]). In particular:

• It is inaccurate, which means that it is possible to get better evaluations for E(#rep(H, P_{t+1})) which do not involve the simplifications and underestimations that are visible in the proofs of Lemmas 4.87 and 4.89.

• It does not allow us to study the asymptotic behavior of the SGA. We need to be sure that P_t is the population in the t-th epoch in order to evaluate the statistics of #rep(H, P_{t+1}) in the next step, so the formula cannot be iterated along t → +∞. In order to study the asymptotic behavior it is necessary to evaluate the transition of the states or of the sampling probability distributions between consecutive epochs, as is done in the Markov theory of the SGA introduced by Vose.
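The lower bounds of Lemmas 4.87 and 4.88 are easy to probe numerically. The following Monte Carlo sketch is entirely ours (the schema, rates and sample sizes are arbitrary assumptions): it simulates one-point crossover with a random second parent and uniform bit-flip mutation, and compares the observed frequencies with (1/2)(1 − p_c ∆(H)/(l−1)) and (1 − p_m)^ℵ(H).

```python
import random

random.seed(0)
l, pc, pm = 8, 0.8, 0.05
H = "1*01****"                                  # Delta(H) = 3, aleph(H) = 3
fixed = [i for i, h in enumerate(H) if h != '*']
in_H = lambda z: all(z[i] == H[i] for i in fixed)

def cross(x, y):
    """One-point crossover: with probability pc cut and swap tails,
    then return one of the two children with equal probability."""
    if random.random() < pc:
        cut = random.randrange(1, l)
        return random.choice((x[:cut] + y[cut:], y[:cut] + x[cut:]))
    return random.choice((x, y))

def mutate(z):
    """Uniform bit-flip mutation with rate pm per locus."""
    return "".join(b if random.random() >= pm else "10"[int(b)] for b in z)

trials = 200_000
x = "11011111"                                  # a parent belonging to H
rand_y = lambda: "".join(random.choice("01") for _ in range(l))
cross_freq = sum(in_H(cross(x, rand_y())) for _ in range(trials)) / trials
mut_freq = sum(in_H(mutate(x)) for _ in range(trials)) / trials

lemma_487_bound = 0.5 * (1 - pc * 3 / (l - 1))  # Lemma 4.87 lower bound
lemma_488_exact = (1 - pm) ** 3                 # Lemma 4.88 survival probability
```

With these settings the crossover frequency stays well above the Lemma 4.87 bound, since the bound drops one child and the chance that the exchanged part of y also fits H, while the mutation frequency matches (1 − p_m)^ℵ(H) up to sampling noise.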
The usual form of the schemata theorem, introduced by Holland [85] and reformulated by Goldberg [74], Michalewicz [110], Reeves and Rowe [134], brings the following formula

E(#rep(H, P_{t+1})) ≥ rat(H, P_t) #rep(H, P_t) (1 − p_c ∆(H)/(l−1) − ℵ(H) p_m).

The above evaluation is weaker than the one presented in Theorem 4.90. Really, taking into account the inequality (1 − p_m)^ℵ(H) ≥ 1 − ℵ(H) p_m we obtain

1 − p_c (∆(H)/(l−1)) (1 − rat(H, P_t) #rep(H, P_t)/µ) = 1 − p_c ∆(H)/(l−1) + p_c (∆(H)/(l−1)) rat(H, P_t) #rep(H, P_t)/µ ≥ 1 − p_c ∆(H)/(l−1)

so, then

rat(H, P_t) #rep(H, P_t) (1 − p_c (∆(H)/(l−1)) (1 − rat(H, P_t) #rep(H, P_t)/µ)) (1 − p_m)^ℵ(H)
≥ rat(H, P_t) #rep(H, P_t) (1 − p_c ∆(H)/(l−1)) (1 − ℵ(H) p_m)
= rat(H, P_t) #rep(H, P_t) (1 − p_c ∆(H)/(l−1) − ℵ(H) p_m + p_c (∆(H)/(l−1)) ℵ(H) p_m)
≥ rat(H, P_t) #rep(H, P_t) (1 − p_c ∆(H)/(l−1) − ℵ(H) p_m).
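The inequality chain above can also be checked numerically. The sketch below is ours (the parameter values are arbitrary assumptions); it evaluates both bounds over a small grid of settings and confirms that the Theorem 4.90 bound is never weaker than the classical Holland form.

```python
def bound_thm_4_90(r, c, mu, pc, pm, delta, aleph, l):
    """Right-hand side of Theorem 4.90; r = rat(H,Pt), c = #rep(H,Pt)."""
    q = r * c / mu
    return r * c * (1 - pc * delta / (l - 1) * (1 - q)) * (1 - pm) ** aleph

def bound_holland(r, c, pc, pm, delta, aleph, l):
    """Classical schemata theorem bound (Holland's form)."""
    return r * c * (1 - pc * delta / (l - 1) - aleph * pm)

ok = all(
    bound_thm_4_90(1.2, 10, mu, 0.7, pm, 4, 3, 12)
    >= bound_holland(1.2, 10, 0.7, pm, 4, 3, 12)
    for mu in (20, 50, 200) for pm in (0.001, 0.01, 0.05)
)
```

The dominance reflects exactly the two dropped nonnegative terms of the derivation: p_c (∆/(l−1)) rat·#rep/µ and p_c (∆/(l−1)) ℵ p_m.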
4.4 Summary of practicals coming from asymptotic theory

Significant asymptotic theory results have been obtained mainly for genetic algorithms whose behavior may be described by a trajectory in the space of states composed of populations (e.g. E = U^µ/eqp for constant, finite population size µ < +∞) or their unambiguous representations (e.g. E = Λ^{r−1} for the SGA). In such cases, under some additional conditions (see formula 4.6), the stochastic dynamics of such algorithms may be modeled by Markov chains. The effective analysis of the asymptotic behavior as t → +∞ is easier if the probability transition rule τ : E → M(E) is stationary, i.e. does not depend on the epoch number. This holds if selection and genetic operations do not depend explicitly or implicitly on the epoch number t, which is the case in all the genetic algorithm instances effectively analyzed in this chapter.

The main feature which was studied is the ergodicity of the Markov processes that can model particular groups of genetic algorithms. Ergodicity roughly means the lack of absorbing states that may be occupied by the genetic algorithm infinitely, together with the guarantee of visiting all the states in E. Positive results for the SGA (see Theorem 4.31) and for the evolutionary algorithm of the (λ + µ)-type with elitist selection (see formula 4.52) were obtained assuming the rather "brutal" condition that forces strictly positive mutation in every genetic epoch (the mutation rate is positive, p_m > 0, in the case of the SGA, and the mutation operation kernel is strictly positive, τ_mut(x)({x′}) > 0, ∀x, x′ ∈ U, in the case of the EA of the (λ + µ)-type with elitist selection). In the case of the EA we have to additionally assume that the space of states is finite (#E < +∞). Such an assumption guarantees a much stronger feature: the algorithm can pass between two arbitrary states from E in a single genetic epoch with a strictly positive probability.
Summing up this part of the results we may say that: if the genetic algorithm can be modeled by an ergodic Markov chain with states in E = U^µ/eqp, then it passes through all possible populations, in particular those that contain individuals whose phenotypes correspond to global or local
extrema of the objective function (or their best approximations in the set of phenotypes D_r). Such behavior is precisely called the asymptotic correctness in the probabilistic sense and the asymptotic guarantee of success (see definitions 2.16, 2.17). Ergodicity of the Markov chain that models the genetic algorithm also forces two kinds of convergence for two important classes of algorithms:

The Simple Genetic Algorithm with finite population (µ < +∞) case. The weak convergence of the sequence of measures {π_µ^t}, π_µ^t ∈ M(E), t = 0, 1, ..., to the limit, invariant measure π_µ for t → +∞ was proved (see Theorem 4.31). Moreover, the limit measure π_µ is strictly positive and does not depend on the starting measure π_µ^0. Let us recall that the measure π_µ^t determines the probability of selecting a new population from the space E in the t-th genetic epoch. The progress in the {π_µ^t} convergence illustrates the process of SGA learning. The algorithm gathers information about the optimization problem, which is memorized in the measure π_µ^t. The ergodicity guarantees the stable convergence of the learning process, but does not guarantee that the total, maximum level of knowledge will be gathered.

Please note that the convergence of the sequence {π_µ^t} does not necessarily imply the stabilization of the population state as t → +∞. This feature is surprising even for many researchers who apply genetic algorithms. It may be stressed in the form of the so-called genetic paradox: if the Simple Genetic Algorithm is convergent (the sequence {π_µ^t} converges) then it is not "convergent", i.e. not all individuals tend to the global maximizer. However, if the SGA populations converge deterministically to a monochromatic one, then the related Markov chain has an absorbing state and thus is not ergodic. The global search ability in D_r is lost (the algorithm possesses neither the asymptotic correctness in the probabilistic sense nor the asymptotic guarantee of success).
The case of the evolutionary algorithm (µ + λ) with elitist selection. The convergence of the maximum fitness appearing in the consecutive populations to the global fitness maximum has been proved under rather restrictive assumptions (see Theorem 4.70, example 4.72).

The next results of practical meaning have been obtained for the Simple Genetic Algorithm instances for which the genetic operator G is focusing (see definition 4.34) and which are "well tuned" with respect to the group of local extrema of the objective function to be searched (see definition 4.63). For such algorithms the sampling measure becomes dense in the central parts of the basins of attraction of local extrema (see Theorem 4.66 and Remark 4.67), if populations are sufficiently large and a sufficiently large number of genetic epochs have passed. In such cases the information about the evolutionary
landscape may be drawn by an a posteriori analysis of the counting measure obtained from the final population (see Remark 4.59). The simplest possibility of such a process, which consists of finding the level sets of a specially defined density of the above-mentioned measure (Clustered Genetic Search), will be presented in Section 6.3.1. Well tuned genetic algorithms may also constitute an effective tool for producing the random sample in the first phase of two-phase stochastic global optimization strategies (see Section 6.3).

The rate of convergence of genetic algorithms is generally difficult to evaluate analytically. The logarithmic rate of convergence, with respect to the number of genetic epochs, of the idealized, infinite-population SGA to the fixed points of the genetic operator G was delivered by Theorem 4.53. The evaluation of the SGA efficiency, restricted to the single evolution step that consists of passing between two consecutive genetic epochs, may be drawn from the schemata Theorem 4.90. In particular, the one-step increment of the population sub-multisets which contain solutions to the global optimization problem Π1 may be statistically estimated. The next important results in this area (see Remark 4.79) prove the geometrical rate of convergence of the maximum fitness value evaluation with respect to the number of genetic epochs of the (µ + λ) evolutionary algorithm. Very strong assumptions with respect to the fitness increment between two consecutive genetic epochs were accepted. The mean hitting time of the global maximizer for a special kind of random walk (a genetic algorithm with a single-individual population, AB mutation and hard selection), assuming an arbitrary unimodal fitness, can also be evaluated (see Theorem 4.81 and Remark 4.82).

The next important feature of genetic algorithms, for which mathematically-verified results are rather rare, is the stopping rule.
It is worth mentioning the results of Hulin [87], who tried to construct a stopping rule of the Bayes type, typical for Monte Carlo algorithms. Another type of effective, mathematically-verified stop criterion for the two-phase global optimization strategy, which utilizes the well tuned Simple Genetic Algorithm and the fitness landscape erosion technique (hill crunching, described in Section 5.3.5), will be delivered in Section 6.3.2.
5 Adaptation in genetic search
The previous Chapters 3, 4 presented perhaps the simplest instances of genetic global optimization algorithms, which only exploit the basic mechanisms of genetic computation: mutation, crossover and selection. None of these mechanisms change with respect to the genetic epoch, and they are "blind" to the optimization problem to be solved as well as to the knowledge about it currently gathered by the algorithm. Such simplicity allows us to construct mathematical models and perform a deep formal analysis of the asymptotic behavior, which is helpful in understanding the real nature of genetic global optimization. However, the efficiency of the basic genetic search mechanisms is frequently criticized. This chapter discusses the adaptation techniques and strategies that aim at a better efficiency of global genetic searches in continuous domains. The first group of them consists of modifying genetic mechanisms on-line during the genetic epoch progress, according to an assumed plan or according to the feedback coming from one or more previous steps (see Section 5.3). The second group introduces more sophisticated multi-deme searches that allow concurrent checking of the whole admissible domain (see Section 5.4).
5.1 Adaptation and self-adaptation in genetic search

The most basic reference algorithm that is worth mentioning in the taxonomy of adaptive stochastic global optimization searches is called Pure Random Search (PRS) (see Section 2.2). It involves the generation of all populations {P_t}, t = 0, 1, ..., according to the same, uniform probability distribution over the admissible domain D. Such an algorithm, in which the sampling probability distribution is fixed or varies only according to a deterministic, a priori assumed plan, may be called devoid of adaptation mechanisms.

The classical genetic algorithms described in Chapters 3, 4 perform a sampling distribution modification in each genetic epoch. All these algorithms

R. Schaefer: Foundation of Global Genetic Optimization, Studies in Computational Intelligence (SCI) 74, 115–152 (2007). © Springer-Verlag Berlin Heidelberg 2007, www.springerlink.com
can be modeled by uniform Markov chains with the space of states E, which represents all the possible populations, and with a stationary transition rule τ : E → M(E), which depends neither on the genetic epoch t = 0, 1, ... nor on earlier states. More precisely, the form of the mapping τ in the t-th epoch depends neither on t nor on the states x^0, x^1, ..., x^{t−1} ∈ E (see Section 4.1.1). However, the probability distribution π^{t+1} = τ(x^t), which is used for the next population sampling (see Figure 4.1), varies in time because, in general, τ(x^{t+1}) ≠ τ(x^t), t = 0, 1, 2, .... This has an obvious influence on the probability distributions which are used when sampling individuals (and then their phenotypes) from the genetic universum U (from the admissible domain D). For the Simple Genetic Algorithm the consecutive samples from D were selected according to the probability distributions θ̂(x^t) ∈ M(D) (see formula 4.26 and Remark 4.59). The form of the mapping θ̂ depends neither on the epoch counter t nor on the sampling results in previous epochs. A similar scheme was accepted in the case of classical evolutionary algorithms, as mentioned in Sections 3.7 and 4.1.3.

Summing up, classical genetic algorithms involve the adaptation mechanism that modifies the probability distribution utilized for sampling the consecutive multiset from the admissible domain D (the phenotypes assigned to the next step population). This mechanism depends only on the assumed genetic operations and the selection operation, as well as on their constant parameters (e.g. the mutation rate p_m, the crossover rate p_c and the crossover type type for the Simple Genetic Algorithm described in Sections 3.5, 3.6). This adaptation does not need any interference during the algorithm iteration, so we will call such an algorithm a self-adaptive genetic algorithm.
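For contrast with the adaptive strategies discussed next, the non-adaptive PRS baseline from Section 5.1 can be sketched in a few lines (the box-shaped domain, function names and parameter values are our assumptions):

```python
import random

def pure_random_search(f, bounds, mu, epochs, seed=0):
    """Pure Random Search: every epoch samples the whole population from the
    same uniform distribution over D, so the sampling measure never adapts."""
    rng = random.Random(seed)
    best = None
    for _ in range(epochs):
        population = [tuple(rng.uniform(lo, hi) for lo, hi in bounds)
                      for _ in range(mu)]
        for x in population:
            if best is None or f(x) > f(best):
                best = x
    return best

# maximize f(x, y) = -(x^2 + y^2) over D = [-1, 1]^2; the maximizer is (0, 0)
best = pure_random_search(lambda p: -(p[0] ** 2 + p[1] ** 2),
                          [(-1.0, 1.0), (-1.0, 1.0)], mu=20, epochs=200)
```

With 4000 uniform samples the best point lands close to the maximizer, but only at the slow rate typical of a strategy whose sampling distribution carries no memory of earlier epochs.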
In contrast to the self-adaptive, classical genetic algorithms, the adaptive stochastic search and the adaptive genetic algorithm perform single-step sampling from D by using mechanisms that are intentionally modified during the computation. In particular, the genetic operations, the selection and their parameter settings may change in the consecutive genetic epochs. Moreover, the number of individuals processed in the particular epoch, as well as other features of the algorithms, may be modified.
Fig. 5.1. Sampling scheme for the adaptive genetic algorithms.
In the case of adaptive stochastic searches, the possibility of modeling the dynamic behavior by a uniform Markov chain (i.e. a Markov chain with a stationary probability transition rule τ) is lost. If the adaptive genetic strategy allows the precise definition of the space of states E, then the single-step progress may be expressed by the diagram 5.1, similarly to the diagram 4.1 which describes the sampling rule in the case of classical genetic algorithms. The transition function τ_t that determines the probability distribution π^{t+1}, used for sampling the next epoch population, may now depend on the current genetic epoch counter t, the current population P_t, the earlier populations P_0, P_1, ..., P_{t−1} produced by the algorithm, as well as on some parameter vector u(t) that controls the evolutionary process. Entries of the vector u(t) may have an influence on the selection and genetic operations, as well as on other parameters, such as the population cardinality. The relaxation of the sampling rule may result not only in the loss of its stationarity but also in the loss of the Markovian principles in general (see formula 4.6).
5.2 The taxonomy of adaptive genetic strategies

The main goal of the adaptation introduced in the sampling rule of genetic searches is to increase the total efficiency of such strategies. Such efficiency improvement may be reached by the elimination of basic disadvantageous behaviors observed during the virtual evolution. The disadvantages under consideration may fall into three groups:

1. Slow concentration of individuals in the neighborhood of the global maximizer, as well as slow convergence of the best fitted individual toward this maximizer.

2. Small range of penetration of the individual phenotypes in the domain D, which only reach the basins of attraction of several local maximizers after a large number of genetic epochs.

3. Evolutionary cliffs (or evolutionary traps) which result in the long-term occupation of the basin of attraction of a single local maximizer by almost all individuals.

There are numerous adaptive genetic strategies described in monographs (see e.g. Goldberg [74], Michalewicz [110], Michalewicz, Bäck, Fogel [15], [10], [11]) and in a large number of research papers. It is difficult to introduce an exhaustive taxonomy of such solutions. Some very interesting, and perhaps the most comprehensive, approaches to the classification of adaptive genetic strategies may be found in the Michalewicz, Bäck, Fogel monographs [15], [10], [11] and also in the book written by Arabas [5]. A taxonomy restricted to strategies dedicated to solving continuous global optimization problems with a multimodal objective function was presented by Kołodziej [96]. The reasons that make this venture so difficult are:
• the complex nature of many adaptive strategies,

• the similar final effect obtained by some strategies constructed in order to satisfy quite different principles.
Both of the above, together with the great number of solutions already published by researchers, make it difficult to establish a single, uniform criterion of adaptive strategy classification. We decided to introduce two criteria and then two partially-dependent taxonomies. The first and simplest of them takes the goals of adaptation into account, while the second one differentiates adaptation strategies with respect to the elementary techniques to be utilized. The presented taxonomies do not aspire to be general or exhaustive ones. According to the restrictions accepted in this book, these taxonomies omit adaptation techniques that can only be applied in the case of solving discrete optimization problems. We do not include strategies that allow us to fit the virtual evolution to a dynamically changing fitness either.

Concerning the first established criterion, the main goals that may be distinguished in genetic adaptive searches applied to continuous global optimization problems are the following:

A. Strengthen and speed up the local search ability. The activities performed result in changing the genetic algorithm behavior in the attraction set of a single, but arbitrary, local maximizer of the objective function Φ. They generally lead to an increase in the sampling measure density in the central part of this attraction set and, as a consequence, quicken the sampling of an individual with a phenotype close to this local maximizer. In the case of multi-deme searches it is possible to attain this goal concurrently in the attraction sets of more than one local maximizer, which corresponds to the goal C.

B. Increase the population discrepancy and mobility. The particular targets which such techniques try to attain are directed or undirected changes in the sampling measure density, leading to an intensive, chaotic penetration of the whole admissible domain D, or to the migration of the population to the attraction set of another, neighboring local maximizer.
The benefits of the periodical application of such strategies are the easier passing of evolutionary cliffs, i.e. capturing the attraction set of a local maximizer with a higher value of the objective function Φ.

C. Total concurrent search. These strategies force a concurrent increase of the sampling measure density in the central parts of the attraction sets of many local maximizers of Φ. These measures are utilized to sample the single population or many demes by genetic algorithms in the consecutive steps of evolution.

The second taxonomy has a tree form (see Figures 5.2, 5.3, 5.4). Its root corresponds to the classical genetic algorithms that utilize only selection, mutation and crossover operations which are fixed in time. The population
Fig. 5.2. The tree of adaptation techniques. Part 1.
Fig. 5.3. The tree of adaptation techniques. Part 2.
cardinality as well as the succession scheme are also fixed (µ, λ = const., see 3.9). The leaves of the classification tree constitute classes of elementary adaptation techniques. In engineering practice, we observe complex strategies that involve elementary techniques from several groups in order to reach one of the goals A., B. and C., or a weighted combination of more than a single goal. The basic criterion that allows us to split the root into two boughs is the structure of the random sample. It distinguishes single- and twin-population techniques from multi-deme techniques with an extended relation structure between the respective populations. The following sections contain a short description of more than twenty selected elementary adaptation techniques and important examples of their application.
Fig. 5.4. The tree of adaptation techniques. Part 3.
5.3 Single- and twin-population strategies (α)

Single-population adaptive strategies are historically the earliest ones. They intend to modify the population dynamics in order to reach one or more of the goals A., B., or C. together. In the case of multiple goals, single-population strategies try to reach them consecutively. Twin-population strategies usually differentiate the roles of the two populations evolving concurrently. One of them is the typical population encoding candidate solutions, i.e. their phenotypes belong to the admissible domain D. The second population plays a control role, e.g. it dynamically profiles the genetic operations and/or the selection process.

5.3.1 Adaptation of genetic operation parameters (α.1)

Adaptation of genetic operation parameters is perhaps the simplest and most frequently utilized class of adaptive genetic strategies. Its formal description is based on extending the individual model to the pair (x, s), where x ∈ U stands for the individual's genotype and s ∈ Q ⊂ R^k, k ∈ N is the vector of genetic operation parameters (see Bäck [13]). The set of all possible parameter values Q is usually bounded and regular (e.g. has the Lipschitz boundary). Let us denote by {g_i} the genetic operations that can affect population genotypes. Each of them may be formalized as the random function

g_i : U^p × Q → M(U), p ∈ N    (5.1)

where p equals 1 or 2 for the typical operations (mutation and crossover). Only the operations taken at the parameter value s, i.e. {g_i(·, s)}, may act on the individual represented by the pair (x, s). If the algorithm does not allow adaptation, then s = const., which means it is the same for all genotypes and does not change during the evolution. For example, in the case of the Simple Genetic Algorithm the parameter set Q ⊂ R^{r+2} equals Q = [0, 1] × [0, 1] × Λ^{r−1}, while the parameter vector s = (p_m, p_c, type) is composed of the mutation and crossover rates and the crossover type vector (see Sections 3.5.1, 3.5.2 and 4.1 for details). Because the SGA does not allow any adaptation, each of these parameters is valid for all genotypes and does not change with respect to the genetic epoch counter.

Parameters modification according to the deterministic and stochastic rules (α.1.1)

In these techniques, the parameter vector s usually has the same value for all individuals in the single epoch population, but may be modified when passing between two consecutive genetic epochs.
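The pair model (x, s) above can be sketched as follows (a minimal illustration with our own names; the mutation here plays the role of g(·, s) from (5.1) with p = 1, each individual being mutated through the parameter vector s it carries itself):

```python
import random
from dataclasses import dataclass

@dataclass
class Individual:
    x: list   # genotype: a point of the admissible domain D, a subset of R^N
    s: list   # genetic operation parameters, e.g. per-coordinate ranges sigma_i

def mutate(ind: Individual, rng: random.Random) -> Individual:
    """g(., s): perturb each coordinate with the sigma carried by the individual."""
    x_new = [xi + si * rng.gauss(0.0, 1.0) for xi, si in zip(ind.x, ind.s)]
    return Individual(x_new, list(ind.s))

rng = random.Random(1)
child = mutate(Individual([0.0, 0.0], [0.1, 0.5]), rng)
```

In a non-adaptive algorithm s is identical for every individual and constant over the epochs; the adaptive strategies below let s vary, either by a global rule or per individual.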
The parameter which is most frequently adapted is the one that controls the mutation intensity. It may be, in particular, the rate p_m of the single bit mutation in the case of genetic algorithms using binary encoding (see e.g. formulas 3.36, 3.37 for SGA), or the standard deviation σ or the covariance matrix C for evolutionary algorithms with normal mutation (see formula 3.51). Deterministic formulas that define such modification were given by Fogel [68], Lis [103], [104], Michalewicz [110], Bäck [12] and Arabas [5]. The simplest, widely-cited formula of this type, which may adapt the mutation rate when passing to the next genetic epoch in the case of binary mutation, has the form

  p_m' = α p_m,  α ∈ [0, 1]     (5.2)

where α stands for its constant parameter. Bäck [14] delivers another formula which determines the mutation rate at the t-th genetic epoch:

  p_m(t) = 0.11375 / (1 + 240 · 2^t)     (5.3)
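The geometric rule (5.2) and a decreasing schedule in the spirit of (5.3) can be sketched as follows; this is a minimal illustration, the function names are ours, and `decaying_schedule` only mimics the monotone decay reconstructed above:

```python
def geometric_schedule(pm0, alpha, t):
    """Mutation rate after t epochs under the geometric rule p_m' = alpha * p_m."""
    return pm0 * alpha ** t

def decaying_schedule(t, p0=0.11375, c=240.0):
    """Illustrative decreasing schedule in the spirit of formula (5.3):
    the rate decays monotonically with the epoch counter t."""
    return p0 / (1.0 + c * 2.0 ** t)

# Usage: the mutation rate shrinks epoch by epoch.
rates = [geometric_schedule(0.05, 0.9, t) for t in range(4)]
```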
Such formulas decrease the mutation intensity in consecutive genetic epochs. This results in increasing the selection pressure and in strengthening the local search ability in the later genetic epochs, which follow a more chaotic search in the introductory period of evolution. Bäck and Schütz [16] suggested the following stochastic rule of binary mutation rate adaptation:

  p_m' = ( 1 + ((1 − p_m) / p_m) exp(−0.2 N(0, 1)) )^{−1}     (5.4)
where N(e, σ) denotes the real-valued random variable with the normal distribution, mean e and standard deviation σ. This formula is applicable in the case of maximizing simple objective functions.

Let us consider the (µ + λ) evolutionary algorithm with phenotypic encoding and normal mutation. If the admissible domain D ⊂ R^N, then we may denote x = {x_i}, s = {σ_i}, i = 1, ..., N, and C = diag{σ_i} is the covariance matrix used by the mutation operation. Schwefel [163], Schwefel and Rudolph [165], Bäck [14] and Arabas [5] reported the following rule that adapts the "range of mutation":

  x_i' = x_i + σ_i (N(0, 1))_i,  i = 1, ..., N
  σ_i' = σ_i exp( α N(0, 1) + β (N(0, 1))_i )
  α = K / √(2N),  β = K / √(2√N)     (5.5)

where the parameter K is called the normalized rate of convergence. Its value can be set to 1 at the start of evolution. A much more extended formula including
the correlations among the individual coordinates (the fully filled covariance matrix C = {c_ij}) was given by Schwefel [163]:

  x_i' = x_i + (N(0, C))_i,  i = 1, ..., N
  c_ij = (1/2) tan(2 a_k) ((σ_i)² − (σ_j)²),  i ≠ j,  k = (1/2)(2N − i)(i + 1) − 2N − j
  c_ii = σ_i
  σ_i' = σ_i exp( α N(0, 1) + β (N(0, 1))_i )
  a_i' = a_i + 0.0873 (N(0, 1))_i
  α = K / √(2N),  β = K / √(2√N)     (5.6)

In the above formulas, the random variables (N(0, 1))_i are independent instances of the normal random variable N(0, 1) (independently sampled), while N(0, C) stands for the N-dimensional random variable with the normal probability distribution, zero mean and the covariance matrix C. Its density is given by the formula

  ρ_{N(0,C)}(x) = exp( −(1/2) xᵀ C⁻¹ x ) / ( (2π)^N det(C) )^{1/2}     (5.7)

The parameters σ_i, a_i have to be initialized at the start of evolution. The coefficients α, β, called "learning rates", may also be computed by using other formulas recommended by Schwefel [163] and Beyer [25]. In this case, the parameter vector assigned to each individual with the genotype x equals s = ({σ_i}, {a_k}), i = 1, ..., N, k = 1, ..., N(N−1)/2. Please note that the parameter vector value s may vary for different individuals in the population. All techniques adapting the mutation parameters described above in this section try to satisfy goal A.

Methods that try to improve the evolution by crossover parameter modification generally go in two directions:
• They prevent individuals whose phenotypes are located too close to each other in the admissible domain D from crossing. The phenotype distance is measured using the distance function d in the solution space V and should be not less than an assumed real, positive constant c. This constant can be modified proportionally to the inverse of the genetic epoch counter t (i.e. proportionally to 1/t).
•
They prevent incest. Several levels of the genealogical tree of each individual are memorized. Crossover is prevented if parents have a common
ancestor in the last k levels of their trees. The integer k stands for the parameter of such a strategy.

The methods of adaptation of the crossover rate p_c in the binary crossover and, more generally, of the modification of the probability distribution of the crossover mask were given by Schaffer and Morishima [159]. For the (µ + λ) strategy, crossover parameter modification was studied by Rzewuski, Szreter and Arabas [146]. Now we assume the same parameter vector s = {σ_i}, i = 1, ..., N for the individual x = {x_i} as used for mutation (see formulas 5.5, 5.6). They suggest modifying the arithmetic crossover of two parental individuals x¹, x² ∈ D ⊂ R^N to the form:

  x_i' = x_i¹ + γ N(0, 1) (x_i² − x_i¹),  i = 1, ..., N
  γ = ρ_{N(x¹,C₂)}(x²) / ρ_{N(x²,C₁)}(x¹)
  C_j = {(σ_i)_j δ_ik},  j = 1, 2     (5.8)

where the vector x' = {x_i'} is the child of x¹, x². The density functions ρ_{N(x^i,C_j)}, i, j = 1, 2 are given by formula 5.7. The parameters s^j = {(σ_i)_j}, j = 1, 2, i = 1, ..., N are modified in the same way as in formula 5.5. Another formula for adaptive crossover was delivered by Arabas [5]:

  (x')¹ = α x¹ + (1 − α) x²,  (x')² = α x² + (1 − α) x¹
  (σ')¹ = α σ¹ + (1 − α) σ²,  (σ')² = α σ² + (1 − α) σ¹
  α = N(0, 1)     (5.9)

where σ^j = {(σ_i)_j}, j = 1, 2. Both adaptive crossover rules described above belong to class B, because they increase the discrepancy in consecutive genetic epochs.

Parameter modification using monitoring of the evolutionary process (α.1.2)

In order to modify genetic operation parameters while respecting the monitoring results of the evolutionary process, the proper state parameters and state dynamic parameters have to be distinguished, and some characteristic behavior has to be indicated and classified. Let us consider the evolutionary strategy with phenotypic encoding and the admissible domain of search D ⊂ R^N, as mentioned in the previous section.
For its simplest instance (1 + 1) (random walk) we may use the adaptation strategy called the 1/5 success rule, introduced by Rechenberg (see Arabas [5]). Now we
consider only the single parameter s = σ, which is the standard deviation that controls the mutation operation:

  x_i' = x_i + (N(0, σ'))_i,  i = 1, ..., N
  σ' = c_I σ  if f(x') < f(x) holds at least k/5 times in k consecutive epochs
  σ' = c_D σ  otherwise     (5.10)

The real numbers c_I > 1, c_D < 1 and the positive integer k stand for the parameters of this strategy.

The next two examples of evolutionary strategies which introduce feedback based on evolution monitoring are called Evolutionary Search with Soft Selection (ESSS). Both of them are based on the classical evolutionary (µ + λ) mechanism and phenotypic encoding, as in the previous case. The adaptation mechanism is activated if the evolving population satisfies the "trap test" at some evolutionary epoch. The two following trap tests were defined by Obuchowicz and Patan [118], [122]:
1. The mean fitness growth in the population is less than p% in the last n_trap epochs.
2. The displacement norm of the expected value of population phenotypes (the norm of the population "mass" center displacement) is less than σ in the last n_trap epochs.
The first strategy, called Evolutionary Search with Soft Selection - Simple Variation Adaptation (ESSS-SVA), increases the standard deviation of the mutation operation in each epoch that follows the trap test fulfillment, i.e.

  σ' = α σ,  α > 1     (5.11)

If the trap test is not satisfied, the standard deviation σ is set to its initial value. The constant α stands for the ESSS-SVA parameter. The second strategy of this group, called Evolutionary Search with Soft Selection - Forced Direction of Mutation (ESSS-FDM), changes the mutation operation after the trap test is satisfied at the genetic epoch t. The temporary mutation operation has the form:

  x_i' = x_i + (N(m_i, σ))_i,  i = 1, ..., N
  m_i = ζ σ (E_i(P_t) − E_i(P_{t−1})) / ‖E(P_t) − E(P_{t−1})‖
  E_i(P_t) = (1 / #P_t) Σ_{x ∈ P_t} η_t(x) x_i     (5.12)
where P_t = (D, η_t) is the population representation. The real number ζ stands for the strategy parameter.

Both ESSS strategies described above mobilize the population if it occupies the basin of attraction of a single local extreme for a long time (the trap test is positively verified). ESSS-SVA temporarily increases the chaotic search component, which enlarges the region effectively checked, while ESSS-FDM results in the stochastic drift of the whole population in the direction m ∈ R^N determined by the current movement of the population centroid. The main goal of both strategies is to force the population to cross the saddle in the evolutionary landscape between the local extreme currently occupied and the basin of attraction of a better-fitted extreme. Such a policy finally leads to finding at least one of the global extremes. A detailed description of these strategies together with the test results may be found in the papers [118], [122].

Parameter adaptation using other genetic operations (α.1.3)

We recall once more the individual representation (x, s), x ∈ U, s ∈ Q introduced at the start of Section 5.3.1. These strategies also possess genetic operations that can process the parameter vectors s ∈ Q, so we define a second genetic algorithm with the genetic universum Q. Let us consider the execution of the single genetic operation g_i for the fixed index i (see formula 5.1) for the parental individuals (x¹, s¹), ..., (x^p, s^p). We will perform it in two steps:
1. First we apply the proper genetic operation to the sequence of parameters (s¹, ..., s^p), producing the new parameter vector s'.
2. We execute the operation g_i on the string (x¹, ..., x^p, s'), producing the offspring x'.
A detailed description of these strategies together with the test results may be found in the papers written by Bäck [12], [13], Grefenstette [77] and in the Arabas monograph [5].
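The two-step execution can be sketched as follows; as an illustration, the parameter-level operation is the lognormal σ update of formula (5.5), and the genotype-level operation g_i is a Gaussian mutation. The function name and the single-parent setting are our assumptions:

```python
import math
import random

def self_adaptive_mutation(x, sigma, K=1.0, rng=random):
    """Two-step operation: (1) mutate the parameter vector s = {sigma_i},
    (2) mutate the genotype x using the new parameters."""
    n = len(x)
    alpha = K / math.sqrt(2.0 * n)               # common learning rate
    beta = K / math.sqrt(2.0 * math.sqrt(n))     # coordinate-wise learning rate
    common = rng.gauss(0.0, 1.0)                 # one sample shared by all coordinates
    # Step 1: produce the new parameter vector s'.
    new_sigma = [s * math.exp(alpha * common + beta * rng.gauss(0.0, 1.0))
                 for s in sigma]
    # Step 2: execute the genotype-level operation with s'.
    new_x = [xi + si * rng.gauss(0.0, 1.0) for xi, si in zip(x, new_sigma)]
    return new_x, new_sigma
```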
5.3.2 Strategies with a variable life time of individuals (α.2)

Adaptation strategies that explicitly determine the life time of individuals lead to variation of the population's cardinality. This quantity becomes a new, independent parameter that controls the evolutionary search dynamics. There are two different opinions that explain the influence of this parameter on the genetic search ability:

Michalewicz [110] is of the opinion that a large population performs a better global search because it better fills the admissible domain D (in particular, the set of phenotypes D_r). Strategies proposed by this author lengthen the
individual's life at the initial phase of evolution in order to fill the admissible domain well and to generate an individual with a phenotype close to at least one global extreme of the objective function, or individuals located in the basin of attraction of such an extreme. The individual's life time is shortened in the second phase of evolution in order to strengthen the landscape exploitation (e.g. in order to better concentrate population phenotypes close to the global extreme). Such an idea seems to be effective in the case of bounded, moderately-large admissible domains.

Galar [71] underlines the greater mobility of small populations, which results in faster checking of the whole admissible domain. The computational cost of a single evolution step is much lower because of much fewer fitness evaluations. Large populations are worse suited to passing evolutionary barriers, staying for a long time in the basin of attraction of a single local extreme. Strategies based on this idea decrease the population size by shortening the individual's life if the evolutionary process stagnates (e.g. the trap test is satisfied). They seem to be effective in the case of infinite or huge admissible domains that cannot be filled satisfactorily by population phenotypes.

Evolutionary strategies with a variable life time of individuals introduce an additional (aside from µ and λ) defining parameter κ which determines the maximum life time of individuals. A specific kind of selection controlled by the imposed individual life time is also utilized (see e.g. [5]):
1. Compute the life time L(x) of the individual with the genotype x ∈ U.
2. If L(x) > κ then the individual is removed from the population.
3. The usual selection is performed among surviving individuals (e.g. proportional selection, see Section 3.4).
The following formulas that allow us to compute the individual life time were quoted after Michalewicz [110], Bäck, Fogel and Michalewicz [15], [10], [11] and Arabas [5]. Initially, we introduce three parameters: l_min, l_max, l_av = (1/2)(l_min + l_max), which denote the minimum, maximum and mean individual life time assumed for the whole evolutionary process. We denote moreover:

  f_min(t) = min_{τ = t_s, ..., t} { min_{x ∈ P_τ} { f(x) } }
  f_max(t) = max_{τ = t_s, ..., t} { max_{x ∈ P_τ} { f(x) } }
  f_av(t) = (1 / (t − t_s)) Σ_{τ = t_s, ..., t} (1 / λ(τ)) Σ_{x ∈ P_τ} f(x)
  t_s = max{0, t − κ + 1}     (5.13)

where κ is the strategy parameter and λ(t) stands for the offspring cardinality in the genetic epoch t.
The first formula suggests the individual life time proportional to its fitness value:

  L(x) = l_min + ((l_max − l_min) / (f_max(t) − f_min(t))) f(x)     (5.14)

The next two formulas are only slight modifications of the previous one:

  L(x) = min{ l_max, l_min + ((l_max − l_min) / (f_max(t) − f_min(t))) f(x) }     (5.15)

  L(x) = l_min + ((l_av − l_min) / (f_av(t) − f_min(t))) f(x)   if f(x) < f_av(t)
  L(x) = l_av + ((l_max − l_av) / (f_max(t) − f_av(t))) f(x)   if f(x) ≥ f_av(t)     (5.16)

In the strategies discussed in this section both the parameters λ and µ become dependent on the genetic epoch counter. Two rules that determine the dependency between the number of parental and offspring individuals can be applied:

a. λ(t) = const. and µ(t) ≤ µ_max(t), where µ_max(t) corresponds to the situation in which L(x) = l_max ∀x ∈ P_t.
b. λ(t) / µ(t) = const.
Detailed descriptions of the strategies mentioned above, completed by test results, are contained in the papers [4], [5], [164], [7]. These strategies generally increase the intensity of local search in the late phase of evolution, therefore they may be located in class A.

Obuchowicz and Korbicz [118], [120] introduced the strategy Evolutionary Search with Soft Selection - Varying Population Size (ESSS-VSP). Each individual x has the initial life time L_0 assigned at its creation. Its life time L(x) varies according to its fitness history:

  L(x) = ⌈ l_m f(x) / f_m(t) ⌉     (5.17)

The individual's life time is modified by the ratio between its fitness and the maximum fitness f_m(t) of the whole population P_t at the t-th epoch. The quantity l_m stands for the strategy parameter. The population's cardinality grows if the mean population fitness grows, which enforces the local search in the basin of attraction of a single extreme. If stagnation of the fitness increment is observed, i.e. the mean fitness in the population approaches the maximum one, then the number of individuals drops, which increases the population mobility and facilitates the passage to the basin of attraction of a better-fitted local extreme. This strategy may be classified into group B.
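Formulas (5.14)-(5.16) translate directly into code; the window statistics of (5.13) are assumed to be precomputed and passed in, and the function names are ours:

```python
def life_proportional(f_x, f_min, f_max, l_min, l_max):
    """Formula (5.14): life time proportional to the fitness value."""
    return l_min + (l_max - l_min) / (f_max - f_min) * f_x

def life_capped(f_x, f_min, f_max, l_min, l_max):
    """Formula (5.15): the same assignment capped at l_max."""
    return min(l_max, life_proportional(f_x, f_min, f_max, l_min, l_max))

def life_piecewise(f_x, f_min, f_max, f_av, l_min, l_max):
    """Formula (5.16): separate slopes below and above the mean fitness f_av."""
    l_av = 0.5 * (l_min + l_max)
    if f_x < f_av:
        return l_min + (l_av - l_min) / (f_av - f_min) * f_x
    return l_av + (l_max - l_av) / (f_max - f_av) * f_x
```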
5.3.3 Selection of the operation from the operation set (α.3)

The main idea of this strategy is the individual selection of a genetic operation for each particular operation execution on an individual. The selection is performed from the prescribed set of operations GO. This set may be invariant with respect to the genetic epoch counter or may vary when passing between two consecutive epochs. Practically, we will consider the sequence of operation sets {GO^t}, t = 0, 1, 2, ... in the second case. The sequence element GO^t will be available for the operation selection at the epoch t. The set GO or each set GO^t may possess an internal structure which reflects the appearance of different types of operations. In particular we may have

  GO = ∪_{i ∈ IOT} GO_i,   GO^t = ∪_{i ∈ IOT} GO_i^t     (5.18)
where IOT denotes the finite set of operation types.

Selection of the operation according to a fixed probability distribution (α.3.1)

Each genetic operation performed at the t-th genetic epoch is preceded by a single sampling from the set GO according to the invariant probability distribution p ∈ M(GO). The sampling is performed with replacement, so the set GO is not reduced by consecutive operations in the same epoch. In the case of more than one type of genetic operation, we first sample the operation type i from the set IOT and then the operation, using the probability distribution p_i ∈ M(GO_i). Both samplings are performed with replacement. More information about these adaptation strategy instances may be found in the papers of Davis [52], [53], Julstrom [90], Arabas [5] and Stańczak [181].

Selection of the operation according to the self-adaptive probability distribution (α.3.2)

The first strategy from this group consists of sampling only one genetic operation for the single genetic epoch t. This sampling is performed according to the probability distribution p^t ∈ M(GO), which is adapted during the evolution process. In the case of more than one type of genetic operation performed consecutively (e.g. first mutation and next crossover), a single operation for each group is selected at the epoch t. The probability distributions p_i^t ∈ M(GO_i^t), i ∈ IOT used for each operation type selection are also adapted when passing between two neighboring genetic epochs. Each sampling is performed with replacement, as previously. Let us denote by x̂(t) one of the best fitted individuals in the epoch t of the strategy:

  x̂(t) = arg max_{y ∈ P_t} { f(y) }     (5.19)
We introduce the monitoring function

  ψ(x, g, t) = 0   if f(x) ≤ f(x̂(t))
  ψ(x, g, t) = f(x) − f(x̂(t))   otherwise     (5.20)

where x is the genotype of the individual from the population P_t which was created by the operation g in the epoch t. Then we determine the prize of the amount α ψ(x, g, t), which is next distributed among the operations that led to the creation of the individual x during the T last epochs. The positive numbers α and T are the parameters of this strategy. The final prize for the particular operation g in the genetic epoch t is computed according to the formula

  ψ_g(t) = max{ ψ_min, Σ_{τ = t−T, ..., t} Σ_{x ∈ P_t} ψ(x, g, τ) }     (5.21)
where ψ_min is some positive constant. We choose some strictly positive probability distribution p^0 ∈ M(GO), p^0({g}) > 0 ∀g ∈ GO at the start of the strategy. The modification of the probability distribution is performed by using the formula

  p^{t+1}({g}) = p^t({g}) + ψ_g(t) / Σ_{h ∈ GO} ψ_h(t)     (5.22)

The quantities {p^{t+1}({g})}, g ∈ GO are then normalized so that the following conditions are satisfied:

  Σ_{g ∈ GO} p^{t+1}({g}) = 1,   p^{t+1}({g}) > p_min ∀g ∈ GO     (5.23)

where p_min is also a constant that characterizes this strategy. The description of the above strategy follows the papers of Davis [52], [53], Julstrom [90], Stańczak [181] and the Arabas monograph [5].

Another kind of modification of the operation sampling probability was proposed by Stańczak [182]. He first introduced the quality coefficients
where pmin is also the constant that characterizes this strategy. The description of the above strategy follows the papers of Davis [52], [53], Julstrom [90], Stańczak [181] and the Arabas monograph [5]. Another kind of modification of the operation sampling probability was proposed by Stańczak [182]. He first introduced the quality coefficients. {q t (x, g)}, x ∈ Pt , g ∈ GOt
(5.24)
The probability distribution belonging to M(GO^t) that will be utilized for the genetic operation sampling is constructed separately for each genotype x ∈ U. This kind of strategy makes sense only in the case of finite genetic universa, #U < +∞. Let p^t(x, ·) ∈ M(GO^t) be the probability distribution used for sampling the genetic operation which will further act on the genotype x in the genetic epoch t. The suggested modification procedure is the following.
  p^t(x, {g}) = q^t(x, g) / Σ_{h ∈ GO^t} q^t(x, h)
  q^0(x, g) = q_0,  g ∈ GO^0
  q^{t+1}(x, g) = q_0 + ∆^{t+1}(x, g) / f_max(t) + α q^t(x, g)   if g = actual
  q^{t+1}(x, g) = q_new   if g = new
  q^{t+1}(x, g) = q^t(x, g)   for other operations g     (5.25)

where ∆^{t+1}(x, g) denotes the positive fitness increment of the individual x resulting from the action of the operation g; actual is the genetic operation currently selected in the t + 1 epoch; new is the newly selected operation in the t + 1 epoch; q_0 is a small number which guarantees a non-zero sampling probability; α ∈ (0, 1) is the "forgetting" coefficient; and q_new is an arbitrary value assigned to the new operation.

This strategy has been tested for various TSP problem instances (see Stańczak [182]). All the methods described above prefer genetic operations which force a fast increment of the fitness value. Such strategies, belonging to class A, may be successfully applied at the initial phase of evolution. They are particularly advantageous if specialized operations are included beside the classical ones: crossover and mutation. In such situations the ontogenetic operation selection, as in formula 5.25, would be a costly but effective solution.

Davis [52], [53] and Julstrom [90] also described another strategy of adaptive genetic operation selection. Each genetic operation obtains its own time period (number of genetic epochs) during which it can affect the population individuals. The overall fitness increment which appears in this period may be assigned to this genetic operation. Depending on this fitness increment, the probability distribution of selecting among the current operation and the operations which were active before is established.

Concurrent breeding of the genetic operation population (α.3.3)

The application of evolutionary methods for genetic operation selection was first suggested by Grefenstette [77], Raptis and Tzefastas [132]. A quite similar strategy of concurrently processing individuals that represent potential solutions to the target global optimization problem and a population of genetic operations acting on these individuals was introduced by Stańczak [182].
The single genetic epoch for the operation’s population corresponds to T > 1 epochs for the solution’s population. The evolutionary
algorithm of the (µ + 1) type that modifies the operation's population utilizes the following fitness function f_Q : ∪_{t=0,1,...} GO^t → R_+, which is changed after each T-length period of genetic epochs of the algorithm that solves the target problem:

  f_Q^{t+T}(g) = q_0 + Σ_{k=1,...,T} Σ_{x ∈ P^{t+k}} ( ∆^{t+k}(x, g) / f_max(t + k) + α q^{t+k+1}(x, g) )     (5.26)

In the above formula the operation g is taken from the set GO^{t+T}. Other notations are identical to those used in formula 5.25. The type of the evolutionary algorithm (µ + 1) which deals with operations informs us that only one new operation supplements the parental pool before selection is performed. The evolutionary algorithm that solves the target global optimization problem may be an arbitrary one of the type (µ +, λ).

5.3.4 Introducing local optimization methods to the evolution (α.4)

Some evolutionary strategies introduce local optimization methods, acting in the admissible set D ⊂ R^N, into the stochastic search scheme. Their application is restricted, of course, by the proper regularity of the objective function. We will use the encoding operator code : U → D and the inverse coding dcode : D → U. The mapping code is injective, while dcode is surjective and takes a constant value on some neighborhood of each phenotype from D_r ⊂ D. Moreover, the coherency condition ∀x ∈ U dcode(code(x)) = x holds (see definition 3.2). We assume that the dcode mapping can be extended in some way to the whole admissible domain D so that it is a well-defined function dcode : D → U (see the fourth item in Remark 3.3). The result of the local optimization method which is activated at the point y ∈ D will be denoted, as in Section 2.1, by loc(y) ∈ D. We assume that the local method is "non-deteriorating" on the set of phenotypes D_r, which means that

  Φ(loc(code(x))) ≥ Φ(code(x)),  ∀x ∈ U.     (5.27)

The simplest possibility of including the local method in evolutionary computation is monitoring the set of pairs

  { dcode(loc(code(x))), Φ(loc(code(x))) },  ∀x ∈ P_t
(5.28)
and then exploiting the monitoring results in various adaptation strategies. Other possibilities will be presented in the next two sections.

Selection by the local method (Baldwin effect) (α.4.1)

This complex computing technique utilizes the local optimization method for fitness evaluation. Basically, the fitness of the individual with the genotype
x ∈ U is computed as Φ(loc(code(x))) or as the value of a simple, monotonic transformation of this quantity, e.g. scaling or non-linear "scaling" Scale(Φ(loc(code(x)))) (see Section 3.2). If the local optimization method is sufficiently accurate, then

  Φ(loc(code(x))) ≈ Φ(y⁺)     (5.29)

where y⁺ is the local extreme of Φ such that code(x) lies in its set of attraction R^{loc}_{y⁺}, defined similarly as in Section 2.1 (see definition 2.5). The procedure described above leads to the fitness flattening for all individuals whose phenotypes belong to the set of attraction of a particular local extreme of the function Φ (see Figure 5.5).
Fig. 5.5. The fitness flattening resulting from applying the local method in individual evaluation (Baldwin effect).
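Baldwinian evaluation can be sketched as follows; this is a minimal illustration assuming identity encoding and a crude fixed-step coordinate search in place of loc(). Only the fitness is taken at the locally improved point; the genotype itself stays unchanged:

```python
def local_search(objective, y, step=0.1, iters=100):
    """A crude stand-in for loc(): greedy coordinate moves that never
    deteriorate the objective (maximization)."""
    y = list(y)
    for _ in range(iters):
        improved = False
        for i in range(len(y)):
            for d in (step, -step):
                trial = y[:]
                trial[i] += d
                if objective(trial) > objective(y):
                    y, improved = trial, True
        if not improved:
            break
    return y

def baldwin_fitness(objective, code, x):
    """Baldwin effect: evaluate the individual at loc(code(x)),
    but keep its genotype x unchanged."""
    return objective(local_search(objective, code(x)))

# Usage: a concave objective with maximum at the origin.
f = lambda y: -sum(v * v for v in y)
fit = baldwin_fitness(f, lambda x: x, [0.35, -0.2])
```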
The procedure described before may be related to the Baldwin effect observed in biology, which consists of the influence of acquired features; acquiring features is compared here to the activity of local optimization methods, which can improve the individual fitness. This strategy was discussed in the papers of Anderson [2], Whitley, Gordon and Mathias [197] and Arabas [5]. The fitness flattening for individuals whose phenotypes belong to the set of attraction of a particular local extreme of the objective function leads to the uniform distribution of individuals among the sets

  U^{loc}_{y⁺} = dcode(R^{loc}_{y⁺} ∩ D_r)     (5.30)

where y⁺ are the local extremes of the objective Φ. As a consequence, the number of individuals whose phenotypes are located close to the border of the set of attraction ∂R^{loc}_{y⁺} is much bigger than in the case of the traditional individual evaluation by the function Φ(code(x)) (or by the Scale(Φ(code(x))) function).
If the population phenotypes are mainly enclosed in the set of attraction R^{loc}_{y⁺} of a single local extreme, then the larger number of individuals which are close to the attractor's border ∂R^{loc}_{y⁺} may increase the probability of the population passing to the neighboring attractor R^{loc}_{z⁺} (more precisely, of the population passage from the set U^{loc}_{y⁺} to U^{loc}_{z⁺}). The passage would happen by the mutation of such individuals, moving their phenotypes outside the border, to the set R^{loc}_{z⁺}, where they are automatically evaluated better than at the previous location if Φ(z⁺) > Φ(y⁺) (see formula 5.29). Such an effect appears stronger if the encoding satisfies the Arabas condition 3.3 (see [5], Section 4.2; also see Remark 3.5), because the closeness of the attraction sets R^{loc}_{y⁺} and R^{loc}_{z⁺} implies the closeness of the genotype sets U^{loc}_{y⁺}, U^{loc}_{z⁺}. The strategies that utilize local optimization methods in individual evaluation help the population to pass evolutionary barriers, so they may be classified into group B.

Gradient mutation and other genetic operations using local methods (Lamarck effect) (α.4.2)

This strategy consists of introducing new genetic operations inspired by the local optimization method started from the phenotype of its argument. Perhaps the simplest case of such an operation is the "gradient mutation"

  U ∋ x → x' = dcode(loc(code(x))) ∈ U
(5.31)
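The Lamarckian operation (5.31) writes the improvement back into the genotype. A minimal sketch, assuming identity encoding/decoding and a single gradient-ascent step as loc(); the step length and the user-supplied gradient are our illustrative choices:

```python
def gradient_mutation(x, grad, step=0.05):
    """Lamarck effect: replace the genotype by its locally improved version,
    x' = dcode(loc(code(x))); here code/dcode are identities and loc is one
    gradient-ascent step (maximization)."""
    g = grad(x)
    return [xi + step * gi for xi, gi in zip(x, g)]

# Usage on f(x) = -|x|^2, whose gradient is -2x: the mutant moves toward 0.
grad = lambda x: [-2.0 * v for v in x]
child = gradient_mutation([1.0, -1.0], grad)
```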
This operation was introduced by Littman and Ackley [105]. Burczyński and Orantek [41] utilized the strategy in which only the best fitted individual is transformed by this operation in each genetic epoch. They implemented loc as the steepest descent, conjugate gradient and variable metric methods. Burczyński and Orantek [41] also give the test results for the Rastrigin benchmark and for weight optimization in neural network learning. They obtained about a 150% speedup in comparison to the simple genetic algorithm when finding the global optimizer. Smith [174] and Yen, Liao, Lee and Rendolph [206] suggest the crossover supported by a single step of the Nelder-Mead crawling simplex method. Introducing genetic operations that are based on local optimization methods quickens the concentration of individual phenotypes close to the local extremes of the objective function Φ, so such strategies may be classified into group A. This group of operations has to be applied carefully (e.g. only in some regions of the admissible set or for a small fraction of population individuals) because they may significantly weaken the global search ability.

5.3.5 Fitness modification (α.5)

This idea allows us to design perhaps the most effective single-population genetic algorithms that perform the global search in the whole admissible domain D.
Random fitness modification (α.5.1)

The fitness is modified randomly, according to the formula

  f'(x) = f(x) + s N(0, σ),  ∀x ∈ U     (5.32)

where s ∈ R_+ stands for the scaling parameter. The normally distributed random variable N(0, σ) is sampled independently at each fitness evaluation. The strategy then has two parameters, s and σ. A more extended description of this strategy is given by Arabas [5]. Its application results in smoothing the fitness function graph, so that the less distinct local extremes (those whose basin of attraction has little volume in comparison to the whole volume of the admissible domain, and over whose basin the fitness variation is small) do not attract a significant fraction of the population. As a result, this strategy accelerates the concentration of individuals around the global extremes.

Niching (α.5.2)

This technique, which appears in many variants, consists of forcing the population to spread into several groups of individuals. Separate groups concentrate near separate local extremes of the objective function Φ. This kind of strategy may be classified into group C. The coercion is performed by the proper modification of the fitness function during the evolution process. Basic information about this technique may be found in the Goldberg book [74], in Arabas [5] and in the chapter written by Mahfoud [107].

Fitness sharing (α.5.2.1)

This method consists of decreasing the evaluation (fitness value) of each individual with the genotype x ∈ U if its phenotype code(x) is closely surrounded by phenotypes of other individuals. The fitness modification may take the following form:

  f'(x) = f(x) / Σ_{y ∈ P_t} f_s(d(code(x), code(y))),  ∀x ∈ P_t     (5.33)

where d is some metric in the space of phenotypes V, and the function f_s : R_+ → [0, 1], called the "sharing function", satisfies f_s(0) = 1 and lim_{ζ→+∞} f_s(ζ) = 0. Sharing functions are usually taken as non-increasing on their whole domain. The typical proposition is the function that restricts the range of the fitness sharing effect:

  f_s(d) = 1 − (d / σ_share)^α   if d < σ_share
  f_s(d) = 0   otherwise     (5.34)

where the real numbers σ_share and α stand for the parameters of this function, the sharing range and the sharing degree respectively.
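A direct transcription of (5.33)-(5.34), assuming Euclidean distance for d and identity encoding (both assumptions, as is the function naming):

```python
import math

def sharing_function(d, sigma_share, alpha=1.0):
    """Formula (5.34): sharing kernel that vanishes beyond sigma_share."""
    return 1.0 - (d / sigma_share) ** alpha if d < sigma_share else 0.0

def shared_fitness(population, f, sigma_share, alpha=1.0):
    """Formula (5.33): divide each fitness by its niche count."""
    def dist(a, b):
        return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))
    result = []
    for x in population:
        niche = sum(sharing_function(dist(x, y), sigma_share, alpha)
                    for y in population)
        result.append(f(x) / niche)
    return result
```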
Such a fitness adaptation technique, well tuned to the global optimization problem to be solved, forms groups of individuals called “niches” whose phenotypes are concentrated near distinct local extremes of the objective function Φ. Fitness sharing strategies can be assigned to group C. because of their global search ability. The reader is referred to Goldberg and Richardson [73], Goldberg [74], De Jong [54] and Spears [177] for details. Sequential niching (α.5.2.2) The fitness modification performed here leads to the leveling of the objective function graph in the central part of the basin of attraction of local extremes already encountered. Selection removes individuals from such areas forcing them to find regions of larger fitness e.g. the basins of attraction of other extremes not yet found. Sequential niching was introduced by Beasley, Bull and Martin [19]. Obuchowicz and Patan [122] verified the fitness modification by using the formula , + N ! code(x)i − (ycenter (t))i "2 (5.35) (f (x)) = fm (t) exp − σi i=1 where (ycenter (t))i =
1 code(x)i η(x) , i = 1, . . . N µ
(5.36)
x∈Pt
are coordinates of the expected centroid of the phenotypes that correspond to the population Pt and σi2 =
1 (code(x)i − mi )2 , i = 1, . . . N µ
(5.37)
x∈Pt
stands for the variance of the individuals' locations in the i-th direction, while f_m(t) is the maximum individual fitness appearing in the population P_t. Arabas [5] delivered another formula for fitness modification that leads to population niching:

    (f(x))′ = f(x) Deter(x, x*)
    Deter(x, x*) = min{ (d(code(x), code(x*)) / r)^α , 1 }   (5.38)

where x* is the genotype of one of the best fitted individuals in the population P_t, the real number α is a strategy parameter, while r stands for the range of the modification. It can be evaluated using the formula

    r = √N / (2 p^{1/N})   (5.39)

where p is the predicted number of local extremes.
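The fitness deterioration of Arabas can be sketched as follows; one-dimensional phenotypes and the helper names are assumptions made for the illustration, with the niche radius taken from formula (5.39).

```python
def niche_radius(N, p):
    """Formula (5.39): r = sqrt(N) / (2 * p**(1/N)), p = predicted number of extremes."""
    return N ** 0.5 / (2.0 * p ** (1.0 / N))

def deter(x_pheno, best_pheno, r, alpha=2.0):
    """Deterioration factor of formula (5.38) for one-dimensional phenotypes."""
    d = abs(x_pheno - best_pheno)
    return min((d / r) ** alpha, 1.0)

def modified_fitness(f_x, x_pheno, best_pheno, r, alpha=2.0):
    """Formula (5.38): the fitness is damped near the best individual found."""
    return f_x * deter(x_pheno, best_pheno, r, alpha)

r = niche_radius(N=1, p=4)                    # r = 0.125 for one dimension, four extremes
print(modified_fitness(10.0, 0.0, 0.0, r))    # at the best individual: fitness 0.0
print(modified_fitness(10.0, 1.0, 0.0, r))    # far away: Deter saturates at 1, fitness 10.0
```

Near the already-found extreme the factor Deter vanishes, so selection drives the population out of that basin, while fitness farther than r away is left untouched.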
5 Adaptation in genetic search

 1: t ← 0;
 2: Initialize P0;
 3: f ← f0;
 4: repeat
 5:   Evaluate Pt;
 6:   Distinguish the best fitted individual x* from Pt;
 7:   if (trap_test) then
 8:     Memorize x*;
 9:     f ← f′;
10:  end if
11:  Perform selection with the fitness f;
12:  Perform genetic operations;
13:  t ← t + 1;
14: until (stop_condition)
Algorithm 1: Draft of the ESSS-DOF strategy

Obuchowicz [117] exploits the fitness modifications defined above (see formulas 5.35–5.39) in his strategy called Evolutionary Search with Soft Selection – Deterioration of the Objective Function (ESSS-DOF). A draft of this strategy is presented as Algorithm 1. Test results of this effective approach to the global search were presented in the papers of Obuchowicz [118], Obuchowicz and Patan [122], and Obuchowicz and Korbicz [120]. The trap tests applied here are the same as those described in Section 5.3.1. The logical variable trap_test is true if
1. the mean fitness in the population increases by less than p% during the last n_trap genetic epochs, or
2. the displacement norm of the centroid of the set of phenotypes associated with the population is less than the mutation range σ measured in the space V (precisely, the standard deviation of the individual phenotype displacement resulting from mutation) during the last n_trap genetic epochs.
The logical variable stop_condition is true if the proper stopping rule for the whole strategy is satisfied. The simplest possible stopping rule may be a limit on the number of genetic epochs, after the last fitness modification, during which no trap was found. Telega and Schaefer [184], [185], [186] introduced a sequential niching strategy that utilizes the following fitness modification formula

    (f(x))′ = f(x)       if code(x) ∈ D \ CL(t)
    (f(x))′ = f_min(t)   if code(x) ∈ CL(t)        (5.40)
where f_min(t) is the minimum fitness value encountered in the genetic epochs 0, 1, ..., t. If a minimization problem is solved, then f_min(t) corresponds to the maximum value of the objective Φ_max(t), so that

    Φ_max(t) = max{ Φ(y) | y = code(x), x ∈ ⋃_{τ=0}^{t} P_τ }   (5.41)
The set CL(t) ⊂ D stands for the union of the central parts of the basins of attraction of the local extremes recognized up to the t-th genetic epoch. A detailed description and analysis of this strategy will be given in Section 6.3.

Temporary fitness perturbations – impatience strategy (α.5.3)

This group of strategies allows us to modify the fitness along the evolution process (along the progress of genetic epochs), but they do not lead to population niching. One interesting case of such a strategy, called the "impatience strategy", was introduced by Galar and Kopciuch [71]. It utilizes a scheme similar to the ESSS-DOF presented in Algorithm 1. The fitness modification is performed according to the formula

    (f(x))′ = α(x) f(x)
    α(x) = β |code(x) − y_center(t)| / diam(code(P_t))                  (5.42)
    diam(code(P_t)) = max_{x,y∈P_t} { |code(x) − code(y)| }

where y_center(t) is the centroid of the phenotypes of the population P_t and β stands for the strategy parameter. If the trap test, defined similarly as in the case of ESSS-DOF, returns trap_test = false, then genetic computations are performed without the fitness modification. If trap_test = true, then the fitness modification given by formula 5.42 draws apart the individuals' phenotypes that are concentrated close to the particular local extreme of Φ. The population's phenotypes usually polarize into two groups located on opposite sides of the currently occupied local extreme. These groups go around the extreme, preserving their opposite locations mainly due to the crossover symmetry. If at least one of these groups finds a saddle in the evolutionary landscape, then approximately half of the population falls into the set of attraction of another extreme (if such a set is located sufficiently close to the saddle). This strategy may help populations to cross the saddle in order to attain the basin of attraction of a better fitted local extreme. It may then be classified to group B.
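The impatience modification (5.42) can be sketched in a few lines; one-dimensional phenotypes and the function name are assumptions made for the illustration.

```python
def impatience_fitness(phenotypes, raw_fitness, beta=1.0):
    """Fitness modification (5.42): scale each fitness by the normalized distance
    of the phenotype from the population centroid, so individuals sitting on the
    occupied extreme lose fitness and the population is pushed apart."""
    mu = len(phenotypes)
    y_center = sum(phenotypes) / mu
    diam = max(abs(a - b) for a in phenotypes for b in phenotypes)  # diam(code(P_t))
    return [beta * abs(v - y_center) / diam * f
            for v, f in zip(phenotypes, raw_fitness)]

# The individual at the centroid (1.0) gets zero fitness; the outliers keep half.
print(impatience_fitness([0.0, 1.0, 2.0], [5.0, 5.0, 5.0]))  # [2.5, 0.0, 2.5]
```

The factor α(x) vanishes exactly at the centroid, which reproduces the polarization of the population into two groups described above.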
5.3.6 Additional replacement of individuals (α.6) These strategies generally mobilize the population for the periodical or permanent checking of the whole admissible domain D, so they may be classified to group B.
Introducing random individuals (α.6.1)

The offspring O_t (see the scheme in Figure 3.2) is enriched by several individuals produced in the same way as the individuals of the initial population P0, i.e. by sampling with the same probability distribution as was used for the creation of P0. Such a procedure may be performed permanently (for all t = 1, 2, ...) or periodically, in selected genetic epochs. This strategy is discussed in [5].

Periodic population refreshment (α.6.2)

This group of strategies postulates the almost total replacement of the population's individuals by new ones, sampled according to a probability distribution which may depend on the state and the history of the genetic search. This kind of strategy was initially described by Goldberg [75]. The search performance, either in the genotype or the phenotype representation, is monitored. If stagnation in evolution is observed, the population is reinitialized by using the cataclysmic mutation (see Eshelman [64]) with the best fitted individual as the pattern string. Krishnakumar [100] proposed the strategy called Micro-Genetic Algorithm (µGA) that processes small populations of several individuals (typically 5). If the population diversity decreases sufficiently, then the best fitted individual is kept and the remaining part is re-sampled using a uniform distribution. The µGA strategy does not use mutation. A similar idea was described in [5] and may be applied to an arbitrary (µ +, λ) evolutionary algorithm. If the trap test of type 1 is satisfied (see Section 5.3.1), then the algorithm is restarted, utilizing the best fitted individual x̂ ∈ U that has appeared in the history. A new population is created by using a large-range mutation, e.g. normal mutation with the mean value x̂ and a sufficiently large standard deviation.

Pre-selection and crowding (α.6.3)

Individuals are replaced in the population when a new individual is produced by the genetic operations.
The individual to be removed is selected according to one of the following rules (see De Jong [54], Goldberg [74]):
1. The worst fitted individual among the parents and the child is removed.
2. The individual whose genotype is nearest to the genotype of the newly-produced one, using some metric in U, is removed.
The removed individual may be selected from the whole population or from a well-specified sub-multiset of it. This strategy is called crowding (see De Jong [54]).
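Rule 2 can be sketched as follows, in the De Jong style in which the victim is the nearest of a small random sample of size cf (the crowding factor). The function and parameter names are illustrative assumptions, and the default metric is the Hamming distance on equal-length binary strings.

```python
import random

def crowding_replace(population, fitness, child, child_fitness, cf=3, distance=None):
    """Crowding sketch: the child replaces the member of a random sample of size
    cf whose genotype is nearest to it. `population` is a list of genotypes,
    `fitness` the parallel list of fitness values."""
    if distance is None:
        # Hamming distance for equal-length binary strings
        distance = lambda a, b: sum(ai != bi for ai, bi in zip(a, b))
    sample = random.sample(range(len(population)), cf)
    victim = min(sample, key=lambda i: distance(population[i], child))
    population[victim] = child
    fitness[victim] = child_fitness
    return victim

pop = ["000", "011", "111"]
fit = [1.0, 2.0, 3.0]
victim = crowding_replace(pop, fit, "110", 2.5, cf=3)
print(victim)  # 2 — "111" is the genotype nearest to the child "110"
```

With cf equal to the population size the sampling step becomes the whole-population variant mentioned in the text; smaller cf corresponds to replacement from a random sub-multiset.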
5.3.7 Speciation (α.7)

Speciation strategies enforce concurrent local searches performed by groups of individuals belonging to a single population, which may lead to finding multiple local extremes of the objective function. These strategies may then be classified to group C.

Arbitrary speciation (α.7.1)

This is perhaps the simplest speciation strategy, introduced by Goldberg [74]. It may be formalized as four steps that are repeated consecutively after the initial population P0 is created.
1. Partition the population Pt into several sub-multisets Pt^1, ..., Pt^Spec.
2. Perform genetic operations on each of the sub-multisets Pt^i, i = 1, ..., Spec separately.
3. Merge the sub-multisets and perform common selection.
4. Go to step 1 if the stop criterion is not satisfied, otherwise stop.

Restricted range of crossover – mating (α.7.2)

Deb and Goldberg [57] introduced the genetic algorithm called mating, in which crossover is restricted to individuals whose phenotypes are sufficiently close to one another with respect to the metric in V. Their paper also contains advantageous test results for multimodal objective optimization. The first parental individual for crossover x1 is selected from the whole parental population Pt (see Figure 3.2) with the uniform probability distribution. The second parent x2 is then sought so that the distance between both parental phenotypes satisfies d(code(x1), code(x2)) < σ_mate, where σ_mate stands for the parameter of this strategy. If the operation fails (there are no individuals with a phenotype inside the open ball with the center code(x1) and the radius σ_mate), then a new selection of the first parent x1 is performed. If this procedure fails after several samplings of the first parental individual, e.g. if the phenotype density is less than (σ_mate)^−1, then both parents are chosen from Pt with the uniform probability distribution.
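The mating partner selection just described can be sketched as follows; one-dimensional phenotypes, the number of retries, and the helper names are assumptions made for the illustration.

```python
import random

def select_mating_pair(population, code, d, sigma_mate, max_tries=10):
    """Sketch of the restricted mating of Deb and Goldberg: the second parent
    must lie within sigma_mate of the first one in phenotype space; after
    max_tries failed samplings both parents are drawn uniformly."""
    for _ in range(max_tries):
        x1 = random.choice(population)
        candidates = [x for x in population
                      if x is not x1 and d(code(x), code(x1)) < sigma_mate]
        if candidates:
            return x1, random.choice(candidates)
    # fallback: the phenotype density is too low, draw both parents uniformly
    return random.choice(population), random.choice(population)

pair = select_mating_pair([0.0, 0.1, 0.2], code=lambda x: x,
                          d=lambda a, b: abs(a - b), sigma_mate=0.15)
print(abs(pair[0] - pair[1]) < 0.15)  # True — parents are phenotypically close
```

Restricting crossover to close phenotypes keeps recombination inside a single niche, so distinct groups can specialize on distinct extremes.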
Cellular genetic algorithm (α.7.3)

The strategies described earlier in Section 5.3.7 postulate the selection of the second parental individual x2 for the crossover operation from the neighborhood of the first parental individual x1. The neighborhood was taken according to a topology which changes during evolution (from one genetic epoch to the next, consecutive epoch). In the cellular genetic algorithm, individuals have fixed topological connections, e.g. they are located in the nodes of
the p-dimensional lattice. These connections are usually not imposed by the topology in the genetic universum U. For each individual x ∈ Pt its neighborhood Nr(x) is established. Usually Nr(x) overlaps the neighborhoods of other individuals in the population, forming the cells of this strategy. The parameter r denotes the diameter of the neighborhood, expressed in units that characterize the particular individual structure (e.g. it may be the length of the path in the connection graph of the structure). Genetic operations may be performed only on arguments coming from a single cell, so the individual x can be crossed only with other individuals from its neighborhood Nr(x). The child-individual competes for a place in its cell. Operations in cells can be performed asynchronously, concurrently for non-overlapping cells. The strategy can be easily implemented in an environment with many processors and local memory, e.g. the distributed environment of a computer network. Details of the cellular genetic strategy may be found in the papers of De Jong and Sarma [55], and Seredyński [166].

5.3.8 Variable accuracy searches (α.8)

One of the main disadvantages of classical genetic algorithms applied to solving continuous global optimization problems is their low accuracy, measured in the space of solutions V (i.e. the space which contains the population's phenotypes). In the case of discrete encoding (e.g. binary affine encoding) the accuracy is restricted by the discretization error imposed by the density and the topology of the selected phenotype mesh Dr. In both the discrete and the continuous encoding cases, the low accuracy may result from the low efficiency of the genetic search, caused by the small number of individuals in comparison to the volume of the admissible set D, and from the low progress of evolution slowed down by the necessary diversity improvements.
The natural way to avoid such problems is to modify the classical genetic search techniques toward concentrating the search in the regions of D in which the probability of the solution's appearance grows along the evolution. Two single-population strategies of this type will be presented in the following sections. Two multi-deme strategies which can also modify the search accuracy (HGS and iGP) will be mentioned in Sections 5.4.3 and 5.4.4.

Dynamic Parameter Encoding (DPE) (α.8.1)

This strategy, introduced by Schraudolph and Belew [161], is based on binary, regular affine encoding (see Section 3.1.1) which produces the regular N-dimensional mesh Dr ⊂ D ⊂ R^N in the admissible domain. This mesh limits the maximum accuracy which may be obtained in the DPE search. We denote by s the length of the binary strings used to encode each coordinate of points from Dr, so the total length of strings is l = N · s. Such strings form the basic genetic universum Ω.
However, the searching DPE population operates on strings from the set Ω_DPE of the total length l_DPE = N · s_DPE, which is significantly shorter than l. At the start of the DPE adaptation process, strings from Ω_DPE represent the s_DPE most significant bits of genotypes from Ω; they encode parameter intervals containing 2^(s − s_DPE) points in each dimension, so each genotype from Ω_DPE corresponds to a brick in D that contains 2^{N(s − s_DPE)} points from Dr. Each string from Ω_DPE is completed by a randomly generated (s − s_DPE)-length suffix (sampled with respect to the uniform probability distribution) in order to evaluate its fitness, which is originally defined on Ω. Suffix bits are not involved in genetic operations on Ω_DPE genotypes, but they are rewritten to the child's suffix. After stagnation in the evolutionary process is observed, the most promising bricks, associated with a fixed s_step prefix, are distinguished. They are called target bricks. The search process is focused on target bricks only in the next DPE adaptation steps. In the next step Ω_DPE genotypes represent the Dr points lying in the target bricks. They are now completed by the fixed s_step prefix and a randomly selected (s − s_DPE − s_step)-length suffix in order to evaluate their fitness. Individuals whose phenotypes are contained in target bricks remain, while those outside are transformed in some way in order to fall again into target bricks. The adaptation procedure is repeated until the satisfactory accuracy, measured in the phenotype space, is obtained, or (s − s_DPE − n · s_step) becomes less than zero in the n-th adaptation step (the intervals searched by the population then represent the less significant bits of genotypes from Ω).
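The DPE evaluation step (short genotype + fixed target-brick prefix + random suffix) can be sketched as follows for one dimension (N = 1); the helper names and the integer reading of genotypes are assumptions made for the illustration.

```python
import random

def dpe_fitness(prefix_bits, fixed_prefix, s, f):
    """Evaluate a short DPE genotype: complete it with the fixed target-brick
    prefix and a random suffix up to the full length s, then apply the fitness
    f defined on full-length genotypes (here read as s-bit integers)."""
    done = len(fixed_prefix) + len(prefix_bits)
    suffix = [random.randint(0, 1) for _ in range(s - done)]
    full = fixed_prefix + prefix_bits + suffix
    return f(int("".join(map(str, full)), 2))

# 8-bit full universe, 3-bit DPE genotypes, no target brick fixed yet:
# the genotype [1, 0, 1] always evaluates somewhere inside its brick [160, 191].
val = dpe_fitness([1, 0, 1], [], 8, lambda g: g)
print(160 <= val <= 191)  # True
```

Each adaptation step moves bits from the random suffix into the fixed prefix, which is exactly the growing-accuracy cascade described in the text.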
The strategy may significantly improve the efficiency of finding the final, global extreme with the maximum accuracy imposed by the mesh Dr, in comparison with the traditional genetic search performed by a population of individuals with genotypes from Ω, because a cascade of short-time searches with growing accuracy has a much lower computational cost than a single search with a large population performed during many more genetic epochs. However, the strategy may fail to find the global extreme if bad target bricks are selected in the initial steps.

Delta Coding (α.8.2)

Ideas similar to the consecutive increase of the search accuracy in Dynamic Parameter Encoding are represented in the Delta Coding strategy, invented and studied by Whitley, Mathias and Fitzhorn [199]. The strategy is also related to the discrete, affine, binary encoding code : Ω → Dr which maps genotypes of the length l = N · s from the genetic universum Ω to the adequate set of phenotypes Dr ⊂ D ⊂ R^N (see Section 3.1.1). The whole strategy is composed of two phases. The first one is the conventional genetic computation (the authors of the paper [199] used the effective GENITOR algorithm with elitist selection) in which Ω and Dr play the roles of the genetic universum and the phenotype set respectively. This phase is finished if
the population is sufficiently concentrated (i.e. the population diameter in the Hamming metric is sufficiently small), which indicates stagnation in the evolution process. Additionally, the best fitted individual x̂ ∈ Ω is distinguished and memorized at the end of this phase. Next, the strategy passes to the second phase, called the delta phase. Now the search is performed in a restricted area neighboring the best fitted individual x̂. In order to perform such a delta genetic search, the new genetic universum Ωδ, which gathers binary strings of the length lδ = N(1 + sδ), is defined. A string ∆ ∈ Ωδ encodes each coordinate increment on sδ bits, plus one bit encoding the sign of the coordinate increment. Of course sδ has to be much less than s. In order to evaluate individuals from Ωδ, the new fitness fδ : Ωδ ∋ ∆ → f(x̂ + ∆) ∈ R+ is established. The sum (x̂ + ∆) ∈ Ω is interpreted as the concatenation of the N sums evaluated independently for each s-length substring of x̂ and the adequate (1 + sδ)-length substring of ∆, respecting its sign coded by the first bit. No mutation is utilized in the delta phase. The delta phase is repeated iteratively. The delta population is completely reinitialized at the start of each step. The evolution in each delta step is finished when the diversity in the population has sufficiently decreased, i.e. when the Hamming distance between the best and the worst fitted individuals is less than or equal to one. The best individual x̂ is updated and the range of the delta search is decreased by reducing sδ after each iteration step. Iterations are finished when the satisfactory accuracy is obtained or sδ is reduced to zero. For reasons similar to those for DPE, the delta coding strategy may improve the efficiency of finding a global extreme with the maximum accuracy imposed by the mesh Dr, in comparison with the traditional genetic search performed by a population of individuals with genotypes from Ω.
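The decoding of a delta-phase genotype (sign bit plus sδ magnitude bits per coordinate, added to x̂) can be sketched as follows; the function name, the integer coordinate representation, and the clamping into the s-bit range are assumptions made for the illustration.

```python
def apply_delta(x_hat, delta_bits, s, s_delta, N):
    """Decode a delta-phase genotype: delta_bits is a binary string of length
    N*(1+s_delta), one sign bit plus an s_delta-bit magnitude per coordinate;
    x_hat is a list of N coordinates, each an s-bit non-negative integer."""
    result = []
    for i in range(N):
        chunk = delta_bits[i * (1 + s_delta):(i + 1) * (1 + s_delta)]
        sign = -1 if chunk[0] == "1" else 1
        magnitude = int(chunk[1:], 2)
        # clamp the sum into the range representable on s bits
        result.append(max(0, min(2 ** s - 1, x_hat[i] + sign * magnitude)))
    return result

# x_hat = (100, 40); increments +3 ("0011") and -5 ("1101") with s_delta = 3
print(apply_delta([100, 40], "00111101", s=8, s_delta=3, N=2))  # [103, 35]
```

Shrinking s_delta between delta steps shrinks the searched hypercube around x̂, which realizes the decreasing search range described above.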
A special procedure, performed in order to preserve the global search ability, decreases the risk of the delta population getting bogged down in the basin of attraction of a local extreme.
5.4 Multi-deme strategies (β)

The genetic strategies mentioned in the section's title constitute serious competition for single-population ones, especially when solving difficult multimodal global optimization problems. Multi-deme genetic strategies process many equivalent or hierarchically-linked populations called demes. They may be compared to a single colony of many species that develop in a common environment. It is worth mentioning that multi-deme strategies, in spite of their formal complexity, are significantly less computationally complex (they need far fewer evaluations of the fitness function) than single-population ones when solving the hardest multimodal problems, in
which the basins of attraction of local extremes are separated by huge areas on which the fitness exhibits almost "flat" behavior (plateaus). Multi-deme strategies are also best suited for implementations dedicated to multi-processor environments: it is possible to distinguish coarse- or medium-grain computational tasks which can be effectively processed in parallel in a distributed environment.

5.4.1 Metaevolution (β.1)

The metaevolution strategy is the parallel process of refining the target genetic algorithm that solves the target global optimization problem in the admissible domain D. The main idea is to use a so-called genetic meta-algorithm to transform a population of the genetic algorithms that solve the target problem. According to the convention introduced in Section 3.9, they are strategies of the type (µ +, λ)((µ′ +, λ′))^γ. These strategies operate on two levels:

Meta level: the population P̃ = (Ũ, η̃) represents the multiset of genetic algorithms. In particular, each genotype s̃ ∈ Ũ encodes the genetic algorithm g_s̃ which operates on the target level. Meta-genetic operations transform P̃ into the new population of algorithms P̃′ in each genetic epoch on the meta level. It is convenient to denote the meta level population in the set-like form P̃ = {s̃1, ..., s̃µ} imposed by Remark 2.15. The fitness function value f̃(s̃) on this level is computed as a statistic of the behavior of g_s̃ on the target level. An arbitrary strategy may be used on this level, but strategies preserving the population size µ = const. are preferred for technical reasons.

Target level: composed of µ genetic algorithms {g_s̃}, s̃ ∈ P̃, which operate on separate target populations P^{s̃1}, ..., P^{s̃µ}. The phenotypes of individuals from all the target populations are encoded points from the admissible domain D of the target global optimization problem.
In each genetic epoch of the meta level, γ > 1 genetic epochs are performed for all target algorithms {g_s̃}, s̃ ∈ P̃ in parallel. After running these γ epochs, the special non-negative statistic f̃(s̃) is evaluated for each algorithm g_s̃, s̃ ∈ P̃. This statistic may be the best fitness that occurs in P^s̃ or the mean fitness increment in P^s̃ during the last γ target epochs. The value f̃(s̃) becomes the fitness of the individual s̃ on the meta level. Metaevolution may be regarded as a generalization of the adaptive strategy described in Section 5.3.1, (α.1.3). The main differences between these two strategies may be reported as follows:
• In the adaptation technique described in (α.1.3) only various versions of the same genetic operation may compete. They may differ in parameter values (e.g. mutation rate). The selection of the proper version of the operation is made with respect to the results obtained on single individuals.
• In the metaevolution strategy, whole genetic algorithms (various strategies, various sets of operations) may compete. They are evaluated based on the statistics computed for whole, separate target populations.
A more exhaustive description of metaevolution strategies may be found in Freisleben [69]. An interesting strategy, close to the metaevolution idea, is presented by Stańczak [182]. Metaevolution strategies may accomplish the various goals A., B. or C., according to the assumed statistics f̃(s̃) and the particular meta and target strategies.

5.4.2 Island models (β.2)

The island genetic strategy consists of concurrently processing a fixed number of populations P_t^1, ..., P_t^µ, t = 0, 1, ..., called "islands". This processing is not necessarily synchronous. All islands are engaged in solving the same global optimization problem in the admissible domain D. The genetic universum, encoding and fitness have to be the same for all island genetic algorithms, while these algorithms are not necessarily identical. Additionally, the starting states of the island populations P_0^1, ..., P_0^µ may differ from one another. The island model of genetic computations may then be classified as (µ +, λ)((µ′ +, λ′))^γ. However, the island model is not a simple redoubling of one or several types of single-population genetic searches. The genetic material is exchanged among islands by the migration of small groups of individuals. This is possible because of the unified form of individuals (the same genetic universum and encoding are assumed). Most frequently, clones of better fitted individuals are sent to another island by the home island process. The migration may be performed synchronously, according to a fixed topology of islands (e.g. the ring topology), or asynchronously, with a random selection of the destination island. Whitley and Gordon [198] applied the Simple Genetic Algorithm on each island and the synchronous migration of a single clone of the best fitted individual, following the ring topology of islands. Each island is initialized independently, according to the uniform probability distribution on the genotype universum Ω.
Obviously this does not imply the equality of all the starting islands. Whitley, Soraya and Heckendorn [200] explained the mechanism of the island genetic search based on the Markov model applied to each island. The papers [198] and [200] are also rich in examples that illustrate the possibility of occupying the basins of attraction of various local maxima of a linearly separable objective function by separate genetic islands. Moreover, they presented another genetic island model in which the more effective GENITOR algorithm is implemented on each island. New papers studying the island model efficiency were published by Skolicki and De Jong (see e.g. [169], [170]). A review of genetic island models may also be found in the papers of Seredyński [166] and Martin, Lienig and Cohoon [109].
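The synchronous ring migration described above can be sketched as follows; the function names, the pair representation of individuals, and the append-only receiving policy are assumptions made for the illustration (a real implementation would also decide which residents the migrants replace).

```python
def migrate_ring(islands, k=1, key=lambda ind: ind[1]):
    """Ring-topology migration sketch: each island sends clones of its k best
    fitted individuals (pairs (genotype, fitness)) to the next island in the
    ring; the receiving island simply appends the clones."""
    best = [sorted(isl, key=key, reverse=True)[:k] for isl in islands]
    for i, isl in enumerate(islands):
        isl.extend(best[(i - 1) % len(islands)])  # receive from the previous island
    return islands

a = [("00", 1.0), ("01", 3.0)]
b = [("10", 2.0), ("11", 5.0)]
migrate_ring([a, b])
print(a[-1], b[-1])  # ('11', 5.0) ('01', 3.0)
```

Because each island only exchanges a few clones per migration event, the computational grain stays coarse, matching the distributed-implementation argument below.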
Almost all island models may be classified to group C because of their ability to search the domain D concurrently and, as a result, to quickly find many local extremes of the objective function. Because of the coarse computational grain (a single island defines a computational task that does not need to communicate intensively with other islands), island strategies fit the parallel, distributed computation paradigm perfectly.

5.4.3 Hierarchic Genetic Strategies (β.3)

Hierarchic genetic strategies were introduced by Kołodziej and Schaefer [157]. The first implementation, called HGS, using SGA instances as the genetic engine, was described in detail in [96]. The next implementation, called HGS-RN, in which the SGA was replaced by a simple evolutionary mechanism with phenotypic encoding, was delivered by Wierzba, Semczuk and Schaefer [201]. The brief description of both implementations presented below mainly follows the last cited paper. The main idea of the Hierarchical Genetic Strategy is running a set of dependent evolutionary processes in parallel. The dependency relation has a tree structure with a restricted number of levels m. The processes of lower order (close to the root of the structure) represent a chaotic search with low accuracy. They detect the promising regions of the optimization landscape, in which more accurate processes of higher order are activated. Populations evolving in different processes can contain individuals which represent the solution (the phenotype) with different precision. This precision can be achieved by binary genotypes of different lengths in the case of the binary implementation, or by different phenotype scaling in the case of HGS-RN. The strategy starts with the process of the lowest order 1, called the root. After a fixed number of evolution epochs the best adapted individual is selected. We call this procedure a metaepoch of the fixed period. After every metaepoch a new process of order 2 can be activated.
This procedure is called the sprouting operation. Sprouting can be generalized in some way to branches of the population tree of a higher order, up to m − 1. Sprouting is performed conditionally, according to the outcome of the branch comparison operation. Details of both operations depend strongly upon the strategy implementation.

Binary implementation HGS

Let us concentrate first on the binary implementation HGS. An HGS genetic process is of the order j ∈ {1, ..., m} if the individuals of the evolving population have genotypes of the length s_j ∈ N. The lengths of binary strings used in the processes of various orders satisfy the inequality 1 < s_1 < ... < s_m < +∞. The initial population for a newly sprouted branch of the order j + 1 ≤ m contains individuals with prefixes identical to the genotype of the
best adapted individual in the process of the order j. Suffixes of the length s_{j+1} − s_j of these individuals are initialized randomly (according to the uniform distribution). The branch comparison operation in HGS is based on prefix comparison. The operator acts on populations evolving in processes of two consecutive orders j and j + 1. Let us assume that we distinguish the best fitted individual x from a branch of the j-th order after some metaepochs. If there is at least one individual with a prefix of the length s_j identical to the genotype of x among the branches of the order j + 1, then a new process of the order j + 1 is not activated. A special kind of hierarchical nested encoding is used in order to obtain search coherency for branches of various degrees. Let us denote by Ω_s the genetic universum composed of binary codes of the length s > 0, so Ω_{s_1}, ..., Ω_{s_m} stand for the binary genetic universa of branches of degrees 1, ..., m. Each universum is linearly ordered by the relation induced by the natural order among the integers represented by binary strings. Moreover, for j = 2, ..., m we can represent the genetic spaces Ω_{s_j} in the following way:

    Ω_{s_j} = { (ω, ξ) : ω ∈ Ω_{s_{j−1}}, ξ ∈ Ω_{s_j − s_{j−1}} }.   (5.43)

First, we define the hierarchical nested encoding for D ⊂ R and then generalize the construction to D ⊂ R^N, N > 1. We intend to define a sequence of meshes D_{r_1}, ..., D_{r_m} ⊂ D ⊂ R so that #Ω_{s_j} = #D_{r_j}, j = 1, ..., m, and a sequence of one-to-one encoding mappings code_j : Ω_{s_j} → D_{r_j}. We will do it recursively. First, we arbitrarily define the densest mesh D_{r_m} in D ⊂ R and the encoding code_m : Ω_{s_m} → D_{r_m} as a strictly increasing function. Next, we arbitrarily define the set of selections φ_j : Ω_{s_j} → Ω_{s_{j+1} − s_j}, j = 1, ..., m − 1, which play a fundamental role in the construction of the meshes D_{r_j}, j = 1, ..., m − 1. Finally, we put

    D_{r_j} = { code_{j+1}((ω, φ_j(ω))) : ω ∈ Ω_{s_j} }.   (5.44)
Figure 5.6 below shows the sample meshes D_{r_1}, D_{r_2} and D_{r_3} ⊂ R in the case of s_1 = 2, s_2 = 3, s_3 = 5, φ_1(00) = φ_1(01) = 1, φ_1(10) = φ_1(11) = 0, φ_2 ≡ 01. In the multidimensional case D ⊂ R^N, N > 1, we assume the sequence of sub-string lengths 1 < ŝ_1 < ... < ŝ_m < +∞ used for encoding a single coordinate of phenotypes in the branch of each order. We have then s_j = N ŝ_j, j = 1, ..., m. Next, we define the arbitrary sets D^1_{r_m}, ..., D^N_{r_m} so that D_{r_m} = D^1_{r_m} × ... × D^N_{r_m} ⊂ D, and then the strictly increasing mappings code^i_m : Ω_{ŝ_m} → D^i_{r_m}, i = 1, ..., N. As in the one-dimensional case, the construction of the coarse meshes is based on the selections φ_j : Ω_{ŝ_j} → Ω_{ŝ_{j+1} − ŝ_j}, j = 1, ..., m − 1. Finally, we can set

    D^i_{r_j} = { code^i_{j+1}((ω, φ_j(ω))) : ω ∈ Ω_{ŝ_j} },  i = 1, ..., N,  j = 1, ..., m − 1   (5.45)
Fig. 5.6. One-dimensional nested meshes (D ⊂ R) for the Hierarchical Genetic Strategy in the case s_1 = 2, s_2 = 3, s_3 = 5.
and then

    D_{r_j} = D^1_{r_j} × ... × D^N_{r_j},  j = 1, ..., m − 1.   (5.46)
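The nested-mesh construction of formulas (5.43)–(5.44) can be illustrated with the one-dimensional example of Figure 5.6. The uniform placement of the densest mesh D_{r3} on [0, 1] is an assumption made for the sketch; the selections φ_1 and φ_2 are those listed above.

```python
def code3(omega):
    """Strictly increasing encoding of 5-bit strings into a uniform mesh on [0, 1]."""
    return int(omega, 2) / 31.0

# Selections phi_1 and phi_2 from the example of Figure 5.6
phi1 = {"00": "1", "01": "1", "10": "0", "11": "0"}
def phi2(omega):
    return "01"                      # phi_2 is the constant string 01

# Formula (5.44): D_{r2} = { code3(omega + phi2(omega)) : omega a 3-bit string }
D_r2 = sorted(code3(w + phi2(w)) for w in (format(k, "03b") for k in range(8)))

# D_{r1}: complete the 2-bit strings by phi1, then the result by phi2
D_r1 = sorted(code3(w + phi1[w] + phi2(w + phi1[w])) for w in ["00", "01", "10", "11"])

print(len(D_r2), len(D_r1), set(D_r1) <= set(D_r2))  # 8 4 True — meshes are nested
```

The inclusion check at the end shows the point of the construction: every coarse-mesh node is also a node of the finer meshes, so a sprouted branch refines exactly the region its parent found.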
The encoding mappings code_j : Ω_{s_j} → D_{r_j}, j = 1, ..., m are defined as the composition of the mappings code^1_j, ..., code^N_j, where each code^i_j is applied to the i-th substring of the length ŝ_j in the argument string x ∈ Ω_{s_j}. The Simple Genetic Algorithm defined in Section 3.6 is used for processing each branch during metaepochs. Usually, a larger population cardinality is set for lower order branches (close to the root) and a much smaller cardinality is set for higher order branches and leaves. Additionally, the mutation rate p_m is set higher for the root and the main branches in order to strengthen the wide exploration of the admissible domain. Good asymptotic properties of HGS can be mathematically proved (see [96] and [157]). Its high accuracy and low computational cost, in comparison with island genetic strategies and with the specialized ESSS-SVA algorithms (see Sections 5.4.2, 5.3.1), were experimentally shown for multimodal benchmarks (see [96], [156]). Moreover, its applicability to the difficult engineering problem of minimizing the geometrical errors of a coordinate measuring machine was presented in [98].

Real encoding implementation HGS-RN

In HGS-RN both the genotypes and the phenotypes appearing in each branch of arbitrary order are N-dimensional vectors of real entries. The genotype universa for branches of different orders are obtained by proper scaling. Let us assume that the admissible domain is now D = [a, b]^N ⊂ R^N, N > 1. We introduce the sequence of scaling coefficients +∞ > ξ_1 > ... > ξ_m = 1. The genetic universa of the branches of various orders are defined as (see Figure 5.7)

    U_j = [0, (b − a)/ξ_j]^N,  j = 1, ..., m   (5.47)

thus, the encoding functions are given by

    code_j : U_j ∋ {x_i} → {ξ_j x_i + a} ∈ D,  j = 1, ..., m.   (5.48)
Fig. 5.7. Genetic universa in the HGS-RN strategy
Moreover, we define the re-scaling function

    scale_{i,j} : U_i ∋ {x_k} → {(ξ_i/ξ_j) x_k} ∈ U_j,  i, j = 1, ..., m.   (5.49)
The initial population for the newly sprouted branch of degree j + 1 is randomly chosen according to the N-dimensional probability distribution

N((scale_{j,j+1}(y))_1, σ_j), …, N((scale_{j,j+1}(y))_N, σ_j)    (5.50)
where y is the best adapted individual in the parental process of the order j = 1, …, m − 1, and σ_j is the standard deviation specific for the branch order j. The branch comparison operation is based on the arithmetic average of genotypes in a population. Let P be a population of the order j = 1, …, m − 1 and y its best fitted individual, currently distinguished after a metaepoch. Sprouting is not activated if there exists a population P of the level j + 1 which satisfies the following condition

d(ȳ, scale_{j,j+1}(y)) < c_{j+1}    (5.51)
where ȳ is the average of genotypes in the population P, c_{j+1} is the branch comparison constant for j + 1 order branches and d stands for the Euclidean distance in R^N. The simple evolutionary techniques based on normal phenotypic mutation and arithmetic crossover described in Section 3.7 are utilized in all branches. Moreover, proper repairing operations are used in the case of individuals generated outside the genotype universum. The searches performed by branches of the lower order are less accurate and broader because of the restriction of the genotype set imposed by scaling (lower order branches search in smaller genotype sets). The search in higher order branches may be additionally narrowed by reducing the standard deviations in the mutation operation. As in the case of the HGS strategy, the population cardinality decreases with the growing degree of branches. Basic implementations of both HGS and HGS-RN assume the synchronization of all branches after each metaepoch, while the branches are processed in parallel between these checkpoints. At each checkpoint, branch comparison and
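The re-scaling (5.49), sprouting distribution (5.50) and branch comparison condition (5.51) can be sketched as follows. This is a hypothetical illustration, not the tested HGS-RN code; all function and parameter names are assumptions.

```python
import math
import random

def scale(x, xi_i, xi_j):
    """Re-scaling (5.49): maps a genotype from U_i to U_j."""
    return [xi_i / xi_j * c for c in x]

def sprout_population(y, xi_parent, xi_child, sigma, size):
    """Initial population (5.50): coordinate-wise normal sampling
    around the re-scaled best parental individual y."""
    seed = scale(y, xi_parent, xi_child)
    return [[random.gauss(m, sigma) for m in seed] for _ in range(size)]

def sprouting_allowed(populations_next, y_scaled, c_next):
    """Condition (5.51): sprouting is blocked if some population of the
    next order has a genotype average within c_next of y_scaled."""
    for pop in populations_next:
        avg = [sum(col) / len(pop) for col in zip(*pop)]
        if math.dist(avg, y_scaled) < c_next:
            return False
    return True
```

The `sprouting_allowed` check is what prevents two branches of the same order from redundantly exploring the same region.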
5.4 Multi-deme strategies (β)
sprouting is performed. This partially synchronous approach was described in detail in [156]. A more relaxed approach, in which HGS-RN was run as a dynamic collection of intelligent agents, was proposed and tested in [113]. There is no total synchronization among the computing branches. Instead of the branch comparison and the conditional sprouting mechanisms, agents of a special type visit populations of the same order, reducing the search redundancy. The HGS-RN strategy has been intensively tested and exhibits exceptional accuracy, much higher than HGS, especially in the case of difficult multimodal benchmarks (see [201]).

5.4.4 Inductive Genetic Programming (iGP) (β.4)

This strategy, introduced by Slavov and Nikolaev [171], [172], is based on the decomposition of the evolutionary landscape suggested by Stadler [180]. The fitness function f of complicated behavior is presented as the composition of a set {f_n} of simpler mappings. Each f_n exposes one of the characteristics of the variability of f. One of the effective methods of obtaining the set {f_n} is to perform the Fourier transformation of the fitness f in the standard orthogonal basis {e_i} which is proper for the set of genotypes U:

f = Σ_{i=1}^{+∞} a_i e_i,  f_i = a_i e_i,  i = 1, …, +∞.    (5.52)
The strategy is composed of the superior population that explores the landscape imposed by the function f and a finite set of secondary populations, which explore landscapes associated with the functions {f_i}, i = 1, …, m. Individuals from secondary populations may migrate to the superior one. The general rule of such migration is the following:

1. The best fitted individual x̂ from the secondary population that evolves with the fitness f_i is selected after some genetic epochs.
2. The individual x̂ is transformed using the strictly ascent random walk (see Section 2.2), using mutation as the operation for obtaining the next step sample and the superior fitness f for the sample evaluation.
3. If the random walk described above stagnates in the neighborhood of a local extreme of f, then the process is stopped and its final result is passed to the superior population.

The idea of using secondary populations is to more easily detect the local extremes that are better exposed by the particular secondary fitnesses f_i. In particular, if f_i = a_i e_i is the Fourier component of f, then it better exposes local extremes that appear with the specific frequency imposed by the basic function e_i. The rough location of one such extreme, represented by x̂, is then refined by the random walk (see step 2 of this strategy) with respect to the superior fitness f.
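Step 2 of the migration rule can be sketched as a strictly ascent random walk. This is an illustrative assumption of one possible realization (Gaussian mutation, stagnation-based stopping); the parameter names are hypothetical.

```python
import random

# Sketch of step 2 of the iGP migration rule: a strictly ascent random walk
# refining x_hat with respect to the superior fitness f (maximization).
def ascent_random_walk(x_hat, f, sigma=0.1, max_stagnation=30):
    x, fx = list(x_hat), f(x_hat)
    stagnation = 0
    while stagnation < max_stagnation:  # step 3: stop when the walk stagnates
        # mutation proposes the next step sample
        cand = [c + random.gauss(0.0, sigma) for c in x]
        fc = f(cand)
        if fc > fx:                     # strictly ascending: accept improvements only
            x, fx = cand, fc
            stagnation = 0
        else:
            stagnation += 1
    return x                            # final result passed to the superior population
```

Because only strict improvements are accepted, the returned point is never worse than x̂ with respect to f.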
The iGP strategy was successfully applied to problems with complicated fitness functions in the already mentioned papers of Slavov and Nikolaev [171], [172], as well as in the paper of Nikolaev and Hitoshi [115].
6 Two-phase stochastic global optimization strategies

Written by Henryk Telega
6.1 Overview of two-phase stochastic global strategies

Many stochastic strategies in global optimization consist of two phases: the global phase and the local phase. During the global phase, random points are drawn from the domain of searches D according to a certain, often uniform, distribution. Then, the objective function Φ (or Φ̃, see Section 2.1, Definition 2.1 and Remark 2.2) is evaluated at these points. During the local phase the set of drawn points (the sample) is transformed by means of local optimization methods. The goal of this phase is to obtain an approximation of the global extreme or of local extremes. Two-phase methods can be described as methods for searching for maxima or for minima (maxima are considered in the monograph [86], minima are considered for instance by Rinnooy Kan and Timmer in [137] and [138]). In Section 6.1 only minimization problems are considered (see Chapter 2, problems {Π̃_i, Π̃_i^{a_j}}, i = 1, 2, 3, 4; j = 1, 2, 3). We limit our considerations to cases in which one can find equivalent maximization problems with non-negative objective functions (see also Chapter 2, Remarks 2.2 and 2.3).

6.1.1 Global phase

Two desirable properties of the global phase are asymptotic probabilistic correctness and the asymptotic probabilistic guarantee of finding all local minimizers or maximizers (see Definitions 2.16, 2.17). In this section we assume that random points are accumulated and remembered. The sample stands here for the set of all points generated in subsequent steps of an algorithm, from its beginning to a certain point in time. In order to differentiate this sample from the sample defined in Chapter 2 (see the schema in Figure 2.2) we could call the former the cumulated sample. However, for the sake of simplicity, we will call it just the sample.
R. Schaefer: Foundation of Global Genetic Optimization, Studies in Computational Intelligence (SCI) 74, 153–197 (2007) © Springer-Verlag Berlin Heidelberg 2007, www.springerlink.com
Basic probabilistic space

Let the drawing (generating) of a point from the domain D be the elementary event. Let Ω denote the set of elementary events, let Σ_Ω stand for the σ-algebra of subsets of Ω, and let Σ denote the σ-algebra of subsets of R^N. In this chapter random variables and random vectors will be underlined (e.g. Θ), and their realizations will be denoted without the underline (e.g. Θ). Let x_i : Ω → R^N for i = 1, 2, … be the random vector which maps the random event of choosing a point x ∈ D in the ith sampling to this point. In many methods the random points are generated according to the uniform distribution. For such methods, for all ω ∈ Ω and all measurable sets A ⊂ D, the probability evaluates to (see [28], Chapter 1)

Pr{x_i(ω) ∈ A} = meas(A)/meas(D).    (6.1)
Probability of drawing a point from the set A ∈ Σ in the global phase

When we assume the uniform distribution, the probability of drawing a point from any measurable set A ⊂ D after m independent drawings is, according to the Bernoulli schema, equal to

1 − (1 − meas(A)/meas(D))^m.    (6.2)

Any assumption about the objective function Φ or Φ̃ which guarantees that the measure of the appropriate sets A_ε (see Chapter 2, Section 2.1) for maximization and minimization problems is positive implies that the probability of finding a point from any of these sets becomes close to 1 as m increases (see [83] page 831, and also the papers cited there: Brooks [37], Devroye [59]). An important question that should be answered is how to generate random points. This problem will not be covered in detail in this book. Apart from using standard generators, for instance those included in various libraries or attached to compilers, other methods can also be used, such as Halton sequences (see Shaw [167]) or using pre-samples and discarding points that are located near points generated and accepted earlier (see Törn, Viitanen [190]).
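Formula (6.2) is easy to evaluate numerically; a minimal sketch (helper names are hypothetical) also inverts it to get the smallest sample size guaranteeing a target hitting probability:

```python
import math

def hit_probability(frac, m):
    """Formula (6.2): probability of hitting A in m uniform draws,
    where frac = meas(A)/meas(D)."""
    return 1.0 - (1.0 - frac) ** m

def draws_needed(frac, target):
    """Smallest m with hit_probability(frac, m) >= target."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - frac))
```

For example, hitting a set occupying 1% of D with probability at least 0.99 requires a few hundred uniform draws.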
some criteria which take into account the distance from the global minimizer to the nearest point of the grid, and to the nearest point from the random sample) than the points of a regular grid, at least if the dimension of the problem is not too low (see also Rinnooy Kan, Timmer [137]). Sobol in [175] noticed that the projection of points from a uniformly distributed random sample onto any subspace gives the uniform distribution. In the case of a regular grid, such a projection can give groups of points which do not cover the admissible set well (at least for certain global optimization problems). This indicates that, for some problems, stochastic random samples are better than regular grids. Another argument which works to the advantage of stochastic methods is the fact that in many optimization methods there is no need to know, or assume, in advance the number of points that have to be generated. Sampling can be finished after an optimal or a suboptimal stopping rule (see Section 6.2) is satisfied. Such rules can be defined for stochastic methods. In many methods, after a stopping rule is satisfied, the distribution of points in D is uniform and the whole set is covered by these points. Such a property does not hold for methods based on regular grids. There are stopping rules which determine in advance the number of points m that have to be drawn (see Section 6.2). However, for certain admissible sets D it is not easy to define what a regular grid is for every m. We should note here that there are also deterministic strategies that uniformly cover finite closed sets. Sobol in [175] describes the use of infinite sequences (called LPτ sequences), for which finite subsequences have a so-called measure of nonuniformity smaller than that of regular grids. However, these issues will not be discussed in this book.

6.1.3 Local phase

Most stochastic methods of global optimization differ in the way in which starting points for local methods are chosen.
This part of the local phase will be described separately for each optimization method discussed in this chapter.

Remarks on local methods used in the local phase

In theoretical considerations about the properties of local methods, a frequent assumption is that local procedures are, in the case of minimization, strictly descent (see Chapter 2, Definition 2.4 and also Rinnooy Kan, Timmer [137]). In practice, it is difficult to check whether an optimization procedure is strictly descent. Instead, ε-descent procedures can be considered.

Definition 6.1. A local method loc will be called ε-descent on D if, for each starting point x_0 ∈ D and an arbitrary norm ‖·‖ in V, it generates a sequence {x_i}, i = 1, 2, … ⊂ D so that
x_{i+1} = x_i + α_i p_i,  ∀i = 0, 1, 2, …,  p_i ∈ V, ‖p_i‖ = 1, α_i ≥ 0.    (6.3)

Moreover, the sequence converges to x^+ ∈ D being a local minimizer, and satisfies, for all i = 0, 1, 2, …, all ε > 0 and all j = 1, 2, …, Int(α_i/ε),

Φ̃(x_i + j ε p_i) ≤ Φ̃(x_i + (j − 1) ε p_i).    (6.4)

Let us introduce new terms (see Rinnooy Kan, Timmer [137]). For x^+ being a local minimizer of the function Φ̃ on D:

• B^ε_{x^+} = {x ∈ B_{x^+} : ∃ x_1 ∈ D \ B_{x^+} and d(x, x_1) ≤ ε},
• y_ε = inf_{x ∈ B^ε_{x^+}} Φ̃(x) if B^ε_{x^+} is nonempty, otherwise y_ε = max(Φ̃(x)),
• B̃^ε_{x^+} = {x ∈ B_{x^+} : Φ̃(x) < y_ε}.

Remark 6.2. (see Rinnooy Kan, Timmer [137]) Every ε-descent method loc has the following features:

1. Let x ∈ D be an arbitrary admissible point and y ≤ Φ̃(x). If there is no point x_1 ∈ L(y), x_1 ∉ L_x(y), such that the distance from x_1 to an element of L_x(y) is less than or equal to ε, then loc(x_0) ∈ L_x(y) for all starting points x_0 ∈ L_x(y).
2. loc(x_0) = x^+ for all starting points x_0 ∈ B̃^ε_{x^+}.
The assumption that procedures are ε-descent simplifies theoretical considerations but does not significantly influence their main results. In practice, local methods may be neither strictly descent nor even ε-descent. Moreover, even ε-descent methods may fail to converge for some objective functions and large values of ε. In the case of non-convergence one can try to change the parameters of the local method or choose another method.

6.1.4 Pure Random Search (PRS), Single-Start, Multistart

Pure Random Search (PRS) (see Chapter 2, Section 2.2) is a simple stochastic method used to find global minimizers or maximizers. It does not contain the local phase and it does not contain any mechanism which could be used to find local minima. PRS can be used for solving the problems Π_1 or Π̃_1. In the version presented below, each population contains only one point.
1: t ← 1; y_0 ← ∞
2: repeat
3:   generate x ∈ D (use the uniform distribution)
4:   if Φ̃(x) < y_{t−1} then
5:     y_t ← Φ̃(x); x_t ← x
6:   else
7:     y_t ← y_{t−1}; x_t ← x_{t−1}
8:   end if
9:   t ← t + 1
10: until the stopping rule is satisfied
Algorithm 2: Pure Random Search (PRS) algorithm (version for minimization problems)

A modification of the above algorithm which involves a local method started from the best point (after a certain number of generations) is called Single-Start. Another modification, which consists of starting local methods from each sample point, is called Multistart. Multistart detects minimizers, so it can be used to solve the problems {Π̃_i^{a_j}}, i = 1, 2, 3; j = 1, 2, 3. This method gave rise to the clustering methods presented in Section 6.1.6.

6.1.5 Properties of PRS, Single-Start and Multistart

Let us define the new probability space (Ω^∞, Σ_Ω^∞, P^∞), where Ω^∞ = Ω × Ω × …, and by Σ_Ω^∞ we denote the smallest σ-algebra that contains the algebra of cylindric sets. Here the elementary event is an infinite sequence of drawings of points from D. The existence and construction of P^∞ follows from the Kolmogorov extension theorem (see [28]). Let us define the following random variables for k = 1, …:
y_k : Ω^∞ → R,
y_k((ω_1, ω_2, …, ω_k)) = min{Φ̃(x_1(ω_1)), Φ̃(x_2(ω_2)), …, Φ̃(x_k(ω_k))},    (6.5)

ω_i ∈ Ω, i = 1, 2, …, k. The random vectors x_i have been defined in Section 6.1.1. PRS has the following property (see [59], [137], see also Proposition 1 in [83]):

Theorem 6.3. For Φ̃ ∈ C(D)

P^∞( lim_{k→∞} y_k = Φ̃* ) = 1    (6.6)

where Φ̃* denotes the global minimum of Φ̃ on D.
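The PRS loop of Algorithm 2 can be sketched in Python. This is an illustrative version for D = [lo, hi]^N with hypothetical parameter names, not the book's code:

```python
import random

# Sketch of Algorithm 2 (PRS) for D = [lo, hi]^N; minimization.
def pure_random_search(phi, lo, hi, n_dim, n_iter, rng=random.Random(0)):
    best_x, best_y = None, float("inf")
    for _ in range(n_iter):
        # global phase: one uniform draw from D
        x = [rng.uniform(lo, hi) for _ in range(n_dim)]
        y = phi(x)
        if y < best_y:              # keep the record value y_t as in lines 4-5
            best_x, best_y = x, y
    return best_x, best_y
```

A fixed iteration count stands in for the stopping rule of line 10; optimal and sub-optimal stopping rules are discussed in Section 6.2.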
A similar property has been proved for Single-Start and Multistart. For Multistart the random variables y_k have to be redefined in the following way:

y_k : Ω^∞ → R,
y_k((ω_1, ω_2, …, ω_k)) = min{Φ̃(loc(x_1(ω_1))), Φ̃(loc(x_2(ω_2))), …, Φ̃(loc(x_k(ω_k)))}.    (6.7)

PRS, Single-Start and Multistart find the global extremum with probability equal to 1 as the size of the sample goes to infinity. These methods are asymptotically correct in the probabilistic sense, according to Definition 2.16. For PRS and Multistart optimal and sub-optimal stopping criteria are known (see [59, 208, 32, 30, 23]). These criteria will be discussed in Section 6.2. Multistart is an interesting method because of the good features mentioned above; however, it is slow. The obvious drawback of Multistart is that it finds the same local extremum many times. One class of methods which aim at diminishing this drawback and accelerating computations are the clustering methods. These methods are more efficient and they are also asymptotically correct.

6.1.6 Clustering methods in continuous global optimization

General description

Generally, clustering methods (see e.g. [89] and references therein) constitute a specific group of pattern recognition techniques which do not utilize any learning set. Clustering consists of the unsupervised exploration of a given data set, aimed at discovering groups in the data. To be more formal, clustering results in the construction of a partition of a discrete data set X = {x_1, …, x_m} into non-empty, exclusive subsets X_1, …, X_k; k ≤ m, called clusters. This means that clustering is governed by some equivalence relation R on X and all data items classified into the same cluster X_i are equivalent in the sense of R. Clustering algorithms have also been applied in continuous global optimization. A group of such methods is described in Guus, Boender, Romeijn [83].
The idea of applying clustering methods in global optimization is to determine groups of points from which local searches can be started. By clusters in global optimization we can understand sets of points that belong to approximations of the sets of attraction of local minimizers. Clustering methods can be used in order to detect all local minima or the global minimum (together with minimizers). Importantly, these methods also enable us to approximate the attraction sets of local minimizers. The idea of the clustering methods described in [137] and [138] is to employ a local method only once in the basin of attraction of each detected local minimizer (under the assumption that local methods are strictly descent or ε-descent, see Section 6.1.3). A similar idea can be seen in Multi Level Single Linkage (MLSL), which has been derived from clustering methods. However, in this method
clusters are not detected. Another path in the research on clustering methods in global optimization has been developed by Adamska (see [152]). In her approach, a clustering method known from data analysis (FMM) is applied to continuous global optimization. This method gives an estimation of the density of measures. Clusters correspond to level sets of the measure density and can be represented by ellipsoids. This gives the possibility to approximate the central parts of the basins of attraction of local extrema. The common schema for the clustering methods described in this chapter is as follows:

• generate m random points from the admissible set D according to the uniform distribution,
• transform the set of random points in order to facilitate cluster recognition,
• determine (recognize) clusters,
• in each cluster start a local method and store the result.

This schema can be repeated until a global stopping rule is satisfied. The good point of clustering methods in global optimization is that the number of local searches (which are time consuming) is diminished. This is especially significant when the function evaluation is expensive.

Transformations of the set of random points

Guus, Boender and Romeijn in the monograph [86] mention two ways in which random points can be transformed in global optimization clustering methods: reduction and concentration. Reduction consists of the rejection of those random points (generated previously) for which the objective value is greater than a certain threshold (see Section 2.2 and also Becker, Lago [20]). The set of remaining points, called a reduced sample, naturally approximates a level set of the objective function. Often this level set is not simply connected. Concentration consists of performing several steps of a simple local method (for instance the gradient method, see Törn [189], or the method of random directions, see Wit [202]). Telega proposed another concentration method, through the use of genetic algorithms (see Telega [184], Cabib, Schaefer, Telega [43]). This method, called Clustered Genetic Search (CGS), is described in Section 6.3.

Cluster determination

Cluster determination begins with a seed point. New points are attached to the cluster according to clustering rules. The seed can be the unclustered point (i.e. a point which does not belong to any cluster yet) which has the smallest objective value, or a point obtained as the result of a local method
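The reduction transformation described above can be sketched in one line of Python: keep the γ fraction of the cumulated sample with the smallest objective values. The helper name is illustrative.

```python
# Sketch of the reduction transformation: keep the gamma-fraction of the
# cumulated sample with the smallest objective values (the reduced sample
# approximates a level set of the objective function).
def reduce_sample(points, phi, gamma):
    return sorted(points, key=phi)[: int(gamma * len(points))]
```

The reduced sample returned here is exactly the input to the cluster determination step that follows.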
(often simplified, and started from the sample point with the smallest objective value). Popular clustering rules and clustering methods are Density Clustering (DC) (see Guus, Boender, Romeijn [83], see also Törn [189], Wit [202]) and Single Linkage (SL) (see Rinnooy Kan, Timmer [137], see also Guus, Boender, Romeijn [83]). Multi Level Single Linkage (MLSL) is not considered to be a clustering rule or a clustering method; however, it is derived from SL. The analysis of clustering methods and clustering rules given by Rinnooy Kan and Timmer [137, 138] has been based on the assumption that the uniform distribution is used and the sample is transformed by reduction. This approach enables us to use Bayesian methods in deriving stopping rules, analogously to Multistart (see Section 6.2). The next sections contain a description of the reduction phase and a description of the scalar versions of DC, SL and MLSL, together with a short analysis of them.

6.1.7 Analysis of the reduction phase

Terms
We assume the same terms as in Sections 6.1.1 and 6.1.3. Additionally, let y_k^{(i)} denote the ith value of the objective function from a sample of the size km (k drawings, m points in each drawing, no reduction), where the values of the objective are ordered ascendingly. One can expect that clustering methods should be quicker than Multistart. It would be desirable for clustering methods to have similar good properties, like the asymptotic guarantee of success or asymptotic correctness. Additionally, stopping rules should be theoretically justified. Such rules, based on Bayesian analysis, have been elaborated for Multistart. They will be described in Section 6.2. Analogous stopping rules have been proposed for clustering methods in global optimization. Such clustering methods can be treated as methods derived from Multistart, in which the sample is transformed, for instance by reduction, and local methods are started only from some (unlike in Multistart) points of the transformed sample. Considerations about stopping rules for clustering methods will be preceded by an analysis of the modifications that have been introduced in clustering methods in comparison to Multistart. This section contains a short analysis of the reduction phase. Most Bayesian stopping rules for Multistart (non-sequential, sequential optimal and suboptimal) are based on the knowledge of the size of the sample and the number of local minimizers detected. Such stopping rules can be applied to every method which, if it starts from the same set of points as Multistart, gives the same set of local minimizers (gives the same results). In particular, they can be applied to methods in which a local method starts exactly once in the attraction sets of local minimizers that contain points from the sample. The Bayesian analysis (see Section 6.2) could be applied almost without changes if the set L(y_γ) is considered instead of the admissible set D. For any
0 < γ < 1, y_γ is calculated from the equation:

meas({x ∈ D : Φ̃(x) ≤ y_γ}) / meas(D) = γ.    (6.8)
However, in practice it is difficult or even impossible to determine the set L(y_γ). Assume that after the reduction γkm points are left (actually ⌊γkm⌋ or ⌈γkm⌉; however, by omitting these symbols the analysis can be simplified and the main results are the same). So the set which is considered can be defined as L(y_k^{(γkm)}). This set changes with k, so a simple application of the Bayesian analysis (where the admissible set does not change) is not possible. However, Rinnooy Kan and Timmer show, after Bahadur, that the random variable y_k^{(γkm)} (defined similarly as in Section 6.1.5) converges with probability 1 to y_γ as k increases, so Bayesian stopping rules can be applied as if L(y_k^{(γkm)}) did not change with k, at least for a large number of iterations (see Rinnooy Kan, Timmer [137], Bahadur [18]). The problem of how to determine the points from which local methods can start will be described in subsequent sections together with the clustering rules. Ideally, a local method should be started exactly once in each attraction set of a local minimizer in which there is at least one point from the reduced sample. Then the method would give the same set of local minimizers as Multistart (on the set L(y_k^{(γkm)})). This would allow us to apply the same stopping rules as in Multistart. Of course, the total cost of local searches for such a method would be significantly reduced in comparison to Multistart.

Remark 6.4. The application of Multistart to the reduced sample (to the sets L(y_k^{(γkm)})) implies the asymptotic probabilistic guarantee of success (Definition 2.16), so the reduction phase does not affect this property of clustering methods.

Remark 6.5. However, none of the clustering methods that contain the reduction phase has the property of the asymptotic probabilistic guarantee of success in the sense of finding all local minimizers (Definition 2.17), regardless of the way in which the starting points for local methods (in the local phase) are determined.
The local phase may even worsen the effectiveness by omitting some local minimizers that could be found by Multistart with the reduction phase. It is a natural consequence of the rejection of the part of the domain of searches in which the values of the objective function are greater than a constant (approximately y_γ). It is possible that a cluster recognized with the use of the clustering method in fact contains more than one minimizer. The DC and SL clustering rules presented below are constructed in such a way that the probability that a local method will not be applied to a point that would lead to an undiscovered local minimizer diminishes to zero as the size of the sample grows.
6.1.8 Density Clustering

Let k be the number of drawings of m points, let loc denote the local method, x_s the seed of a cluster, C(x_s) the cluster with the seed x_s, X^+ the set of local minimizers found, and #X^+ the number of elements in X^+. The following algorithm requires the determination of the parameters r_i, i = 1, 2, …, according to Formula 6.17, which will be defined later in this section.

Algorithm
Construction of T_i sets, asymptotic properties of DC

In this method, clusters are built in a stepwise manner. First, a seed x_s is determined. It can be the "best" unclustered point x from the reduced sample, like in Törn [189] and Wit [202], or a point obtained from a local method loc started from x. Let us denote by T_0 the set that contains only the seed of a cluster. In subsequent steps, denoted by i, new points from the reduced sample are attached to the cluster. These points belong to the sets T_i, i = 1, 2, …, which are subsets of the admissible set D, T_i ⊂ T_{i+1}, i = 1, 2, …. In simple versions of DC the sets T_i, i = 1, 2, … are balls. The radius of the ball is increased in subsequent steps as long as the density of points from the reduced sample is greater than a certain constant, which is, for instance, equal to the average density of the sample points in the admissible set D without the reduction. Such a version has been proposed by Törn [189]. In the version proposed by Rinnooy Kan and Timmer (see [137]) the shape and the volume of T_i, i = 1, 2, … can be determined by the following reasoning. A cluster which is initiated by the seed x_s should be related to the set L_{x_s}(y_k^{(γkm)}). This suggests that subsequent T_i should be approximations of L_{x_s}(y) for y increased stepwise. Under the assumption that Φ̃ ∈ C²(D) we can approximate the level sets using the estimation Φ̃(x) ≈ Φ̃(x_s) + ½(x − x_s)^T H(x_s)(x − x_s), where H denotes the Hessian. Hence T_i can be defined as follows:

T_i = {x ∈ D : (x − x_s)^T H(x_s)(x − x_s) ≤ r_i²}    (6.9)
for r_i < r_{i+1}, i = 1, 2, …. In this approach the T_i are ellipsoids. An approximation of the Hessian can be easily obtained as a by-product of quasi-Newton local methods. The radii r_i are set in a way that guarantees the following asymptotic property: as k increases, the probability that the process of cluster recognition is stopped too early is less than or equal to the probability that there is no original sample (i.e. not reduced sample) point in ∆T_i = {x ∈ D : x ∈ T_i, x ∉ T_{i−1}}. One can say that the process is stopped too early if in the set ∆T_i there is no point of the reduced sample, but there is at least one such point in L_{x_s}(y_k^{(γkm)}). This situation has been called an error of type I by Boender (see [34]).
1: k ← 0; X^+ ← ∅;
2: repeat
3:   k ← k + 1
4:   /* Determination of the reduced sample */
5:   Draw m points x_{(k−1)m+1}, …, x_{km} according to the uniform distribution on D. Choose the γkm "best" points from the cumulated sample x_1, …, x_{km};
6:   j ← 1; /* Number of the cluster that is being recognized */
7:   while Not all points from the reduced sample have been assigned to clusters do
8:     /* Determination of the seed */
9:     if j ≤ #X^+ then
10:      Choose the jth local minimizer in X^+ as the seed x_s of the cluster C(x_s)
11:    else
12:      repeat
13:        x ← unclustered point from the reduced sample with the smallest objective value;
14:        Apply a local method loc to x;
15:        x^+ ← loc(x);
16:        if x^+ ∈ X^+ then
17:          C(x^+) ← C(x^+) ∪ {x};
18:        end if
19:      until x^+ ∉ X^+;
20:      X^+ ← X^+ ∪ {x^+};
21:      x_s ← x^+;
22:    end if
23:    i ← 0; T_i ← {x_s};
24:    repeat
25:      /* New points are added to the cluster that is being recognized */
26:      i ← i + 1;
27:      Add to C(x_s) all unclustered points that are located in the set T_i with parameter r_i(x_s) (see Formulas 6.9 and 6.17);
28:    until No new point has been added to C(x_s);
29:    j ← j + 1;
30:  end while
31: until The global stopping rule is satisfied;
Algorithm 3: Density Clustering

The probability of an error of type I in step i is less than or equal to α_k, where

α_k = (1 − meas(∆T_i)/meas(D))^{km}.    (6.10)

The set ∆T_i should contain at least one point of the not reduced sample with the probability (1 − α_k), under the assumption that the distribution is uniform. From Formula 6.10 we have

meas(∆T_i) = meas(D)(1 − α_k^{1/(km)})    (6.11)
so, when we assume that the probability of an error of type I is less than or equal to α_k, then we can expect that a set whose volume is given by Formula 6.11 contains at least one sample point. We assume that the T_i are ellipsoids. The volume of the ellipsoid (x − x_s)^T H(x_s)(x − x_s) ≤ r_i² is equal to

π^{N/2} r_i^N / (Γ(1 + N/2) det(H(x_s))^{1/2})    (6.12)

where Γ stands for Euler's Gamma function (see for instance [28], Chapter 18). Hence, the stopping rule can be as follows: no unclustered point x from the reduced sample has been found for which

π (x − x_s)^T H(x_s)(x − x_s) ≤ [ i Γ(1 + N/2) det(H(x_s))^{1/2} meas(D)(1 − α_k^{1/(km)}) ]^{2/N}.    (6.13)
Rinnooy Kan and Timmer have proposed in [137] such a ∆T_i that

meas(∆T_i) = meas(D) (σ log(km))/(km),  σ > 0.    (6.14)

Hence

α_k = (1 − (σ log(km))/(km))^{km}.    (6.15)
The probability α_k decreases polynomially with increasing k, because ∀k ∃ c₁, c₂ such that

$$c_1 k^{-\sigma} \le \left(1 - \frac{\sigma \log k}{k}\right)^{k} \le c_2 k^{-\sigma}. \qquad (6.16)$$

Thus, the parameter r_i can be evaluated from the formula:

$$r_i = \pi^{-\frac{1}{2}} \left(\, i\,\Gamma\!\left(1+\frac{N}{2}\right) \det(H(x_s))^{\frac{1}{2}}\, \mathrm{meas}(D)\, \frac{\sigma \log(km)}{km} \right)^{\frac{1}{N}}. \qquad (6.17)$$
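Formula 6.17 is straightforward to evaluate numerically. The sketch below is only an illustration, not part of the original strategy: the function name `critical_radius` and its signature are assumptions, and det(H(x_s)) is taken as precomputed.

```python
import math

def critical_radius(i, k, m, N, det_H, meas_D, sigma=4.0):
    """Radius r_i of the i-th ellipsoidal layer T_i (Formula 6.17).

    i      -- index of the layer being added to the cluster
    k, m   -- step number and per-step sample size (km cumulated points)
    N      -- dimension of the search domain D
    det_H  -- determinant of the Hessian approximation H(x_s)
    meas_D -- Lebesgue measure of D
    """
    base = (i * math.gamma(1.0 + N / 2.0) * math.sqrt(det_H)
            * meas_D * sigma * math.log(k * m) / (k * m))
    return math.pi ** -0.5 * base ** (1.0 / N)
```

As the formula predicts, the radius grows with the layer index i and shrinks as the cumulated sample km grows.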
Remark 6.6.
1. In DC, if the process of cluster recognition is stopped in step i, i.e. there is no point in T_i (the parameter r_i is calculated from Formula 6.17), then the probability that a cluster has been recognized improperly with an error of type I decreases polynomially while k increases. This follows directly from the discussion presented above.
2. The basic disadvantage of this version of DC is that the objective function is approximated by a quadratic function and the sets of attraction are approximated by ellipsoids. If the set $L_{x^+}(y_k^{(\gamma km)})$ differs significantly from an ellipsoid, then the criterion concerning when to stop the cluster recognition process is incorrect (other estimations given in this section are also incorrect). A good clustering method should approximate the sets $L_{x^+}(y_k^{(\gamma km)})$ more and more accurately with increasing k. One such method is Single Linkage, which is presented in the next section.
3. Similarly to PRS and Multistart, DC also has the property of asymptotic correctness (i.e. it finds the global minimum with the probability 1 as k increases to infinity). This follows from the fact that the global phase has the properties of PRS applied to the sets $L(y_k^{(\gamma km)})$.
4. We cannot expect all local minimizers to be found. The points of recognized clusters approximate, in a sense, some subsets of D which should be central parts of the sets of attraction of local minimizers. In reality, however, these subsets may contain more than one local minimum and more than one minimizer, so a cluster can cover more than one minimizer. Because the local method is started exactly once in each cluster, some local minimizers may remain undiscovered. The reduction phase itself can also be a reason why some local minimizers are omitted. DC does not have the property of the asymptotic probabilistic guarantee of finding all local minimizers.
5. The version presented in this section after Rinnooy Kan and Timmer [137] has the property of the asymptotic probabilistic guarantee of finding one local minimizer in each simply connected component of the set $\{x \in D : \tilde{\Phi}(x) \le y_\gamma\}$, where y_γ is calculated according to Formula 6.8 (when we assume that the local method is ε-descent, this is true if the components are not too close).

The last conclusion requires a comment. Boender proposed the use of an additional criterion when points are assigned to a cluster (see [29]). It can be applied in the case of differentiable objective functions. An approximated derivative of $\tilde{\Phi}$ at x in the direction of x_s, equal to

$$\frac{\tilde{\Phi}(x + h(x_s - x)) - \tilde{\Phi}(x)}{h\,\|x_s - x\|_2}, \qquad (6.18)$$
(‖·‖₂ denotes the Euclidean norm on V), is calculated for small h. If this value is positive, then x is not assigned to the cluster. It is easy to give examples which show that this criterion can fail, so that a point is not assigned to the cluster even though it belongs to the set of attraction of the local minimizer. For instance, in a "spiral valley" the value of the derivative of $\tilde{\Phi}$ at x in the direction of x_s can be positive, though any descent local method (e.g. a gradient method) converges to x_s. It is possible that such an error is fixed in subsequent steps and the unclustered point is assigned to the appropriate cluster. However, the criterion can also fail in the opposite way: a point may be assigned to one cluster although it belongs to the set of attraction of another local minimizer and should be assigned to another cluster.
One more comment concerns terminology. Zieliński (see [208]) defined the basins of attraction for local minima, not local minimizers. However, he considered only the case of isolated minimizers. Our definition of the basin of attraction (see Definition 2.6) refers to local minimizers (not minima) which are isolated. For isolated minimizers both definitions can be used interchangeably.

6.1.9 Single Linkage

This method differs from DC in the way in which the basins of attraction of local minimizers are approximated. Let the distance from a point x to a set A be defined as:

$$\rho(x, A) = \inf_{y \in A} \|x - y\|_2. \qquad (6.19)$$
P_k will stand for the set of all points from the reduced cumulated sample and C = ∪_{x_s} C(x_s) will denote the set of points that have already been assigned to clusters.

Algorithm

The analysis of the properties of Single Linkage can be found in [137]. It is much more complicated than the analysis of the properties of Density Clustering, so here only the final results will be recalled (see Theorem 8 and Theorem 12 in [137]).

Theorem 6.7. If the critical distance r_k is given by the formula

$$r_k = \pi^{-\frac{1}{2}} \left( \Gamma\!\left(1+\frac{N}{2}\right) \mathrm{meas}(D)\, \frac{\sigma \log(km)}{km} \right)^{\frac{1}{N}} \qquad (6.20)$$

then:
• for σ > 2 the probability that the local method will be started by SL in step k tends to zero with increasing k,
• for σ > 4, even if the sampling of points had been carried on endlessly, the total number of local searches started by SL would be finite with the probability equal to 1.
Theorem 6.8. If rk tends to zero with increasing k, then in each component of L(yγ ) that contains points from the sample, the local minimizer will be found with the probability equal to 1 in a finite number of steps. Remark 6.9. 1. The disadvantage of DC related to the way in which the objective function and the basins of attraction are approximated is eliminated in SL.
1: k ← 0; X⁺ ← ∅;
2: repeat
3:    k ← k + 1
4:    /* Determination of the reduced sample */
5:    Draw m points x_{(k−1)m+1}, ..., x_{km} according to the uniform distribution on D. Choose γkm "best" points from the cumulated sample x_1, ..., x_{km};
6:    j ← 1; /* Number of the cluster that is being recognized */
7:    while Not all points from the reduced sample have been assigned to clusters do
8:       /* Determination of the seed */
9:       if j ≤ #X⁺ then
10:         Choose the j-th local minimizer in X⁺ as the seed x_s of the cluster C(x_s)
11:      else
12:         repeat
13:            x ← unclustered point from the reduced sample with the smallest objective value;
14:            Apply a local method loc to x;
15:            x⁺ ← loc(x);
16:            if x⁺ ∈ X⁺ then
17:               C(x⁺) ← C(x⁺) ∪ {x};
18:            end if
19:         until x⁺ ∉ X⁺;
20:         X⁺ ← X⁺ ∪ {x⁺};
21:         x_s ← x⁺;
22:      end if
23:      l ← 0;
24:      repeat
25:         /* New points are added to the cluster that is being recognized */
26:         if P_k \ C ≠ ∅ /* There are unclustered points (P_k stands for the reduced sample in step k) */ then
27:            Find x_l^k ∈ P_k \ C such that ρ(x_l^k, C(x_s)) = min_{y ∈ P_k \ C} ρ(y, C(x_s));
28:            if ρ(x_l^k, C(x_s)) ≤ r_k then
29:               C(x_s) ← C(x_s) ∪ {x_l^k} /* x_l^k is assigned to the cluster C(x_s) */
30:            end if
31:         end if
32:         l ← l + 1
33:      until ρ(x_l^k, C(x_s)) > r_k OR P_k \ C = ∅;
34:      j ← j + 1;
35:   end while
36: until The global stopping rule is satisfied;
Algorithm 4: Single Linkage
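The cluster-growing loop of Algorithm 4 can be sketched in a few lines of Python. This is a naive illustration, not the authors' implementation: `grow_cluster`, the tuple-based point representation and the argument layout are assumptions.

```python
import math

def grow_cluster(seed, points, clustered, r_k):
    """Grow the cluster C(seed): repeatedly attach the unclustered sample
    point nearest to the cluster (distance as in Formula 6.19), as long as
    that distance does not exceed the critical distance r_k."""
    cluster = [seed]
    unclustered = [p for p in points if p not in clustered and p != seed]
    while unclustered:
        # distance from a candidate point to the cluster as a set
        d = lambda p: min(math.dist(p, c) for c in cluster)
        best = min(unclustered, key=d)
        if d(best) > r_k:
            break
        cluster.append(best)
        unclustered.remove(best)
    return cluster
```

With points (0, 0), (0.5, 0), (1, 0) and (5, 0) and r_k = 0.6, the cluster seeded at the origin absorbs the first three points and stops before the distant one.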
2. The number of local searches can be treated as an effectiveness measure of the method. Theorems 6.7 and 6.8 show that SL has good asymptotic properties. The number of local searches is finite even if the sampling of points is carried out endlessly.
3. Rinnooy Kan and Timmer in [137] mention experiments which show that SL approximates the sets $L_{x^*}(y_k^{(\gamma km)})$ better than DC does.
4. SL has the property of asymptotic correctness, i.e. it finds the global minimum with the probability equal to 1 when k increases to infinity. This follows from Theorem 6.8 when we assume that r_k is set according to Formula 6.20.
5. SL does not have the property of the probabilistic guarantee of success in the sense of finding all local minimizers.
6. Multi Level Single Linkage, which has been derived from the clustering methods presented above, has the property of the probabilistic guarantee of success: it can find all local minimizers. This follows from Theorem 6.8.

6.1.10 Mode Analysis

Mode Analysis (MA) has been derived from Single Linkage. The main difference between SL and MA is that whole subsets of the domain, rather than single points, are assigned to clusters. In the version described below these subsets are hypercubes (see Rinnooy Kan, Timmer [137]).

Terms and definitions

For the sake of simplicity let us assume that D ⊂ R^N is a hypercube that can be divided into κ hypercubes ($\sqrt[N]{\kappa}$ is an integer).

• Hypercubes that are results of the division of D will be called cells.
• A cell U will be called full if it contains more than

$$\frac{\mathrm{meas}(U)\,km}{2\,\mathrm{meas}(D)} \qquad (6.21)$$

points from the reduced sample.
• A cell which is not full will be called empty.
• Two cells U_a and U_b will be called neighboring if ∀ε > 0 ∃ x_a ∈ U_a and x_b ∈ U_b such that ‖x_a − x_b‖ < ε.
• Clusters will contain those points from the reduced sample that belong to neighboring full cells. The cluster recognition begins with a seed-cell, which contains a point from the reduced sample with the smallest objective value
(like in [137]). A seed-cell can also be the cell which contains the "best" minimizer found.
• By C(U_s) we denote the cluster for which the seed cell is U_s.
• In the original approach clusters are sets of points (after the reduction). However, one could also consider such a version of MA in which clusters are unions of whole hypercubes which are full cells. Depending on the version, by assigning the cell U to a cluster C we can understand:
1. assigning those points from the reduced sample that belong to U, or
2. attaching the whole cell U to the cluster C (union).
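The full/empty classification of Formula 6.21 can be sketched as follows. This is a toy illustration under stated assumptions: D is the unit hypercube divided into `n_side`^N equal cells (so meas(U)/meas(D) = n_side^{-N}), and `classify_cells` is a hypothetical helper name.

```python
def classify_cells(points, km, n_side, domain=(0.0, 1.0)):
    """Count reduced-sample points per cell of D divided into n_side**N
    equal cells; a cell is 'full' when it holds more than
    meas(U) * km / (2 * meas(D)) points (Formula 6.21)."""
    lo, hi = domain
    counts = {}
    for p in points:
        # index of the cell containing p along each coordinate
        cell = tuple(min(int((x - lo) / (hi - lo) * n_side), n_side - 1)
                     for x in p)
        counts[cell] = counts.get(cell, 0) + 1
    N = len(points[0])
    threshold = km / (2.0 * n_side ** N)  # meas(U)/meas(D) = n_side**(-N)
    full = {c for c, n in counts.items() if n > threshold}
    return full, counts
```

A sample concentrated in one corner of the unit square then yields exactly one full cell.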
Algorithm

Rinnooy Kan and Timmer compared Single Linkage and Mode Analysis (see [137]). The comparison shows that one cannot state which method is better: in some cases SL recognizes clusters incorrectly while MA recognizes them correctly, and vice versa. We recall here two theorems concerning the asymptotic properties of the Mode Analysis method (see Theorem 14 and Theorem 15 in [137]).

Theorem 6.10. If the number of cells in MA is equal to

$$\kappa = \frac{km}{\sigma \log(km)} \qquad (6.22)$$

then
• for σ > 10 the probability that the local method will be started by MA in step k tends to zero with increasing k,
• for σ > 20, even if the sampling continues forever, the total number of local searches started is finite with the probability equal to 1.

Theorem 6.11. If the number of cells in MA is equal to κ given by Formula 6.22 (for σ > 0), then in each component of L(y_γ) which contains points from the sample, a local minimizer will be found in a finite number of steps with the probability equal to 1.

Remark 6.12.
1. All remarks given for SL are also true for MA.
2. Because MA makes use of cells, it is more immune to errors related to the random irregularity of the sample (that means the local deviation from the uniform covering). On the other hand, it is possible that in MA points from the sample are incorrectly assigned to clusters because of too large cells.
3. MA enables us to approximate the sets of attraction of local minimizers by a union of hypercubes in a relatively easy way.
1: k ← 0; X⁺ ← ∅
2: repeat
3:    k ← k + 1
4:    /* Determination of the reduced sample */
5:    Draw m points x_{(k−1)m+1}, ..., x_{km} according to the uniform distribution on D. Choose γkm "best" points from the cumulated sample x_1, ..., x_{km};
6:    Divide D into κ cells.
7:    /* Determination of which cells are full and which are empty */
8:    For each cell determine how many points from the reduced sample are inside the cell. If this number is greater than the value given by Formula 6.21 then the cell is full, otherwise the cell is empty.
9:    j ← 1; /* Number of the cluster that is being recognized */
10:   while Not all full cells have been assigned to clusters do
11:      /* Determination of the seed of a cluster. The seed is a full cell. */
12:      if There is an unclustered full cell U_s which contains an element from X⁺ (i.e. it contains a local minimizer) then
13:         Choose U_s as the seed of the cluster C(U_s);
14:         while There is an unclustered full cell U_b which is a neighbor of any cell assigned to the cluster C(U_s) do
15:            Assign U_b to C(U_s) (assign points from U_b to C(U_s))
16:         end while
17:      else
18:         Determine the point x so that the objective value at this point is the minimum of the set of values at points which are in unclustered full cells.
19:         Apply the local method loc to x; x⁺ ← loc(x);
20:         X⁺ ← X⁺ ∪ {x⁺};
21:      end if
22:      j ← j + 1;
23:   end while
24: until The global stopping rule is satisfied;
Algorithm 5: Mode Analysis

6.1.11 Multi Level Single Linkage and Multi Level Mode Analysis

Multi Level Single Linkage (MLSL) has been derived from the clustering methods presented above (see [138, 83]). Unlike DC or SL, it does not contain the reduction phase; hence there are no disadvantages caused by this phase. Two versions of this method will be presented below. The first one contains a clustering process, the second one does not. The goal of both versions is to find all local minimizers. The second version cannot be applied when the sets of attraction of local minimizers are to be found. It has been proved that both versions have good asymptotic properties: asymptotic probabilistic correctness and the probabilistic guarantee of finding all local minimizers.
Algorithm version 1
1: k ← 0; X⁺ ← ∅
2: repeat
3:    k ← k + 1
4:    Draw m points x_{(k−1)m+1}, ..., x_{km} according to the uniform distribution on D;
5:    Sort all points drawn in k steps so that Φ̃(x_i) ≤ Φ̃(x_{i+1}), 1 ≤ i ≤ km − 1
6:    j ← 0; /* The number of clusters that are already recognized */
7:    i ← 1;
8:    while i ≤ km do
9:       Add x_i to each cluster which contains points within the distance r_k from x_i;
10:      if x_i is unclustered then
11:         Start a local method loc from x_i;
12:         x⁺ ← loc(x_i);
13:         if x⁺ ∈ X⁺ then
14:            C(x⁺) ← C(x⁺) ∪ {x_i}
15:         else
16:            /* A new cluster has been found */
17:            X⁺ ← X⁺ ∪ {x⁺};
18:            j ← j + 1;
19:            Set the seed x_s of the new cluster C(x_s) to x⁺;
20:         end if
21:      end if
22:      i ← i + 1
23:   end while
24: until The global stopping rule is satisfied;
Algorithm 6: Multi Level Single Linkage version 1

In this method one point from the sample can be assigned to more than one cluster. This reflects the fact that there may exist subsets of D such that different local methods (even strictly descent ones) started from these subsets can find different minimizers. Clusters initiated by x⁺ are related neither to $R^{loc}_{x^+}$ nor to the basins $B_{x^+}$. They are also not related to $L_{x^+}(y_k^{(\gamma km)})$. Clusters are sets that contain such x ∈ D for which there is an r_k-descent sequence of points from the sample x_1 = x, x_2, ..., x_ν = x⁺ for a certain ν. A sequence is called r_k-descent if the distance between two subsequent elements is not greater than r_k and $\tilde{\Phi}(x_1) \ge \ldots \ge \tilde{\Phi}(x_\nu)$. If two different methods that start from the same point find different local minimizers, there is a risk that some local minimizers remain undiscovered. However, it has been proved (see Rinnooy Kan, Timmer [138]) that this kind of error does not appear if local methods start from the interior of the basin $B_{x^+}$. If k increases and r_k is suitably small, this error does not appear with the probability 1 (see Theorems
6.13 and 6.14 below). The name of the method comes from the fact that it finds the same local minimizers as SL applied to the sets $L(y_k^{(i)})$ for each i = 1, 2, ..., km, i.e. for the subsequent cut-off levels determined by $y_k^{(i)}$.

Algorithm version 2
1: k ← 0; X⁺ ← ∅
2: repeat
3:    k ← k + 1
4:    Draw m points x_{(k−1)m+1}, ..., x_{km} according to the uniform distribution on D;
5:    i ← 1;
6:    while i ≤ km do
7:       if NOT (there is such a j that Φ̃(x_j) < Φ̃(x_i) and ||x_j − x_i||₂ < r_k) then
8:          Start a local method loc from x_i;
9:          x⁺ ← loc(x_i);
10:         X⁺ ← X⁺ ∪ {x⁺};
11:      end if
12:      i ← i + 1
13:   end while
14: until The global stopping rule is satisfied;
Algorithm 7: Multi Level Single Linkage version 2

The asymptotic properties of both versions are the same. They are expressed by two theorems (see Theorem 1 and Theorem 2 in [138]):

Theorem 6.13. If the critical distance r_k is given by the following formula:

$$r_k = \pi^{-\frac{1}{2}} \left( \Gamma\!\left(1+\frac{N}{2}\right) \mathrm{meas}(D)\, \frac{\sigma \log(km)}{km} \right)^{\frac{1}{N}} \qquad (6.23)$$

then:
• if σ > 0, then for any sample point x ∈ D the probability that the local method will start from this point in step k decreases to zero while k increases,
• if σ > 2, then the probability that the local method will start in step k decreases to zero while k increases,
• for σ > 4, even if sampling is performed infinitely, the total number of local searches ever started by MLSL is finite with the probability 1.
Theorem 6.14. If rk tends to zero with increasing k, then each isolated local minimizer x+ will be found by Multi Level Single Linkage with the probability 1 in a finite number of steps.
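The point-selection rule of MLSL version 2 (Algorithm 7) amounts to a single dominance test per sample point. The sketch below is only an illustration under assumed names (`mlsl_start_points`, a callable objective `phi`), not the authors' code.

```python
import math

def mlsl_start_points(sample, phi, r_k):
    """MLSL version 2 selection rule: start a local search from x_i unless
    some other sample point x_j with a lower objective value lies within
    the critical distance r_k of x_i."""
    starts = []
    for i, x in enumerate(sample):
        dominated = any(
            phi(y) < phi(x) and math.dist(x, y) < r_k
            for j, y in enumerate(sample) if j != i
        )
        if not dominated:
            starts.append(x)
    return starts
```

For a sample with two nearby points in the same valley, only the better of the two triggers a local search, while an isolated point far away still does.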
Remark 6.15.
1. Similarly to SL and DC, MLSL also has the property of asymptotic correctness (it finds the global minimum with the probability 1 as k increases to infinity). This follows from Theorem 6.14.
2. Similarly to SL, MLSL is effective in the sense that the total number of local searches started is finite with the probability equal to 1 (see Theorem 6.13).
3. Unlike DC and SL, MLSL has the property of the probabilistic guarantee of success (in the sense of finding all local minimizers).
4. Usually the second version of MLSL is used; information about clusters is not gathered.
5. MLSL can be ineffective when the objective function has a large number of local minimizers which are close to each other and have small basins of attraction. MLSL will tend to start local methods in each basin, which is expensive. Local methods may wander around in shallow basins or areas similar to a plateau. This can be disadvantageous when only "essential" minimizers are to be found.
6. The effectiveness of MLSL depends on the appropriate value of the critical distance. This distance changes during computations. It may take a long time to reach values which guarantee that all local minimizers are discovered.

Rinnooy Kan and Timmer also proposed another version of MLSL in which hypercube cells are considered instead of single points (like in MA, see [137]). The properties of this method are similar to the properties of the basic version of MLSL. Remarks analogous to items 2 and 3 of Remark 6.15 remain true. This version enables us to approximate the basins of attraction in a simple way.

6.1.12 Topographic methods (TGO, TMSL)

Topographical methods will be briefly described here for two reasons. First, their usefulness and superiority over MLSL for some global optimization problems has been shown (see Ali, Storey [1], Törn, Viitanen [190]). Second, it seems that these methods can be modified in such a way that they give information about the basins of attraction.
Detailed modifications will not be proposed here; only some possibilities will be indicated. The idea of the Topographical Global Optimization (TGO) method is to utilize the information contained in the so-called topograph. In the global phase, random points are generated from the uniform distribution. Sample points that are located too close to points already accepted are rejected. In order to diminish the time complexity, so-called pre-sampling has been proposed. It consists of storing the accepted points of the sample (for instance on a disk); this stored sample can then be used with possible modifications
(for instance, scaling to the domain of searches). The topograph is a directed graph whose nodes represent sample points and whose edges connect the g nearest neighbors. The arrows are directed to nodes with a greater objective value. The graph minima are those nodes which do not have neighbors with a lower objective value. These minima should approximate minimizers of $\tilde{\Phi}$, at least if g is suitably chosen. The problem of how to set the value of g has not been solved in a satisfactory way (see Törn, Viitanen [190]). The authors emphasize the high complexity of the method, in particular for multidimensional problems. It seems that the joined graphs of the g nearest neighbors could be a source of information about the basins of attraction of local minimizers. The second topographical method, Topographical Multilevel Single Linkage (TMSL), has been described by Ali and Storey in [1]. The authors propose the use of MLSL only for the minima of the topograph. This should accelerate the method significantly. The topograph here replaces the reduction phase (however, the reduction phase could also be employed in MLSL in order to accelerate computations). In the original version clusters are not recognized, so the proper determination of g is not as important as in TGO. When basins are to be found, the determination of g is important: for instance, if g is equal to the sample size, only the global minimum can be found.
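The construction of the topograph minima can be sketched as follows. This is a naive O(n²) illustration under assumed names (`topograph_minima`); a serious implementation would use a spatial index for the neighbor queries.

```python
import math

def topograph_minima(points, phi, g):
    """Return the graph minima of the topograph: for each node, edges go to
    its g nearest neighbours and point towards the greater objective value;
    a node is a graph minimum if no neighbour has a lower value."""
    minima = []
    for i, x in enumerate(points):
        # indices of all other points, sorted by distance to x
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: math.dist(x, points[j]))
        neighbours = others[:g]
        if all(phi(points[j]) >= phi(x) for j in neighbours):
            minima.append(x)
    return minima
```

For the double-well function (x² − 1)² sampled on a line, the two points nearest x = ±1 come out as the graph minima when g = 2.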
6.2 Stopping Rules

There are many papers and studies concerning the stopping rules of stochastic methods in global optimization. In this chapter the approach initiated by Zieliński (see [208]) will be presented. This approach has also been developed by Boender [29, 35], Rinnooy Kan [31, 32, 30], Betrò and Schoen [23, 24] and others. The following simple stopping rule could be employed in all stochastic methods in which random points are generated according to the uniform distribution on the admissible set D. We assume that a point from A_ε is to be found (see Chapter 2, Section 2.1). The probability that a point from this set is generated during m drawings is equal to 1 − (1 − μ(A_ε))^m, where μ(A) for any measurable A stands for the relative Lebesgue measure

$$\mu(A) = \frac{\mathrm{meas}(A)}{\mathrm{meas}(D)}. \qquad (6.24)$$

We can stop the algorithm if this probability is greater than 1 − δ for a certain δ > 0. Thus, the stopping rule can utilize the number of drawings m, which should comply with the following inequality:

$$m \ge \frac{\log \delta}{\log(1 - \mu(A_\varepsilon))}. \qquad (6.25)$$
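For illustration, the smallest m satisfying Inequality 6.25 can be computed directly; the helper name `min_sample_size` is an assumption.

```python
import math

def min_sample_size(mu_A_eps, delta):
    """Smallest m with 1 - (1 - mu(A_eps))**m >= 1 - delta (Formula 6.25)."""
    return math.ceil(math.log(delta) / math.log(1.0 - mu_A_eps))
```

For example, if A_ε occupies 1% of D and we want a 95% chance of hitting it, 299 uniform drawings suffice.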
Such an approach has drawbacks. The set Aε is not known (however, the measure µ(Aε ) could be approximated or could be set arbitrarily). Moreover,
no additional information about the problem which is being solved is utilized. Also, the local phase is not considered. Boender, Rinnooy Kan and Vercellis (see [33]) propose the following desirable properties of stopping rules:

• Sample dependency: either sample points or objective values at these points, and also information about each local minimizer that is found (for instance how many times it was found), should be taken into consideration.
• Problem dependency: information about the class of the objective function, the number of local minimizers (for instance, that it is less than a certain constant), and the relative volumes of the basins of attraction of local minimizers should be taken into consideration.
• Method dependency: specific features of the algorithm should be taken into consideration.
• Loss dependency: the costs of stopping too early (before the global minimum is found or before all "essential" minimizers are found) should be taken into consideration.
• Resource dependency: computational costs (time or memory) should be as small as possible.
In this section the Bayesian approach to the stopping problem will be described. The probabilistic model is estimated from information gathered during the run of the optimization procedure. This model can be used to construct so-called non-sequential stopping rules. Information about the costs of further searches is not used in this approach; however, such information is used in the so-called sequential rules, which will also be described in this chapter. The stopping rules presented below have been developed for Multistart. They can also be applied to some other clustering methods in global optimization. This chapter can be treated only as a basic survey of stopping rules in stochastic global optimization. Because local methods are used, the probability of finding a local minimizer x⁺ ∈ D is equal to the relative volume of its set of attraction. In the Bayesian approach the number of local minimizers is estimated together with the relative volumes of their sets of attraction. Let $\bar{w}$ be a random variable equal to the number of different local minimizers and let $\Theta_1, \Theta_2, \ldots, \Theta_{\bar{w}}$ be random variables that determine the relative measures of the sets of attraction of these minimizers (see [32]). For the above random variables a priori distributions are assumed; then (based on the results of Multistart for m points, where m stands for the sample size) a posteriori distributions are determined according to the Bayes rule. We assume that for the value of $\bar{w}$ every positive integer is equally probable and, moreover, that $\Theta_1, \Theta_2, \ldots, \Theta_{\bar{w}}$ have the uniform distribution on the ($\bar{w}$ − 1)-dimensional unit simplex.
6.2.1 Non-sequential rules

Under the above assumptions the expected value (a posteriori) of the number of undiscovered minimizers is given by (see Boender [29], Guus, Boender, Romeijn [83]):

$$E(\bar{w} - w) = \frac{w(w+1)}{m - w - 2} \qquad (6.26)$$

(m stands for the number of points drawn; we assume that m > w + 2). The expected value (a posteriori) of the sum of the relative volumes of the sets of attraction of undiscovered minimizers is equal to:

$$E\left(1 - \sum_{l=1}^{w} \Theta_l\right) = \frac{w(w+1)}{m(m-1)}. \qquad (6.27)$$

The probability (a posteriori) that all local minimizers have been found (i.e. $\bar{w}$ = w) is equal to:

$$\Pr\{\bar{w} = w\} = \prod_{l=1}^{w} \frac{m-1-l}{m-1+l}. \qquad (6.28)$$

The following stopping rules can be derived from the above formulas.

• Stop if:

$$\mathrm{Int}\left(\frac{w(w+1)}{m-w-2}\right) = 0. \qquad (6.29)$$

• Stop if the expected value of the sum of the relative volumes of the sets of attraction of undiscovered local minimizers is less than an arbitrary constant η₁:

$$\frac{w(w+1)}{m(m-1)} < \eta_1. \qquad (6.30)$$

This kind of stopping rule can be especially interesting when the objective function has many local minimizers with very small regions of attraction. Continuing the search in order to find all local minimizers can be extremely expensive (long) in such cases.

• Stop if the probability of finding all local minimizers is greater than an arbitrary constant η₂:

$$\prod_{l=1}^{w} \frac{m-1-l}{m-1+l} > \eta_2. \qquad (6.31)$$
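The three non-sequential rules can be implemented directly from Formulas 6.26–6.31. A minimal sketch follows; the function names and the example thresholds η₁, η₂ are assumptions, not part of the original presentation.

```python
def expected_undiscovered(w, m):
    """Posterior expected number of undiscovered minimizers (Formula 6.26)."""
    assert m > w + 2
    return w * (w + 1) / (m - w - 2)

def expected_uncovered_volume(w, m):
    """Posterior expected relative volume of the basins of undiscovered
    minimizers (Formulas 6.27 and 6.30)."""
    return w * (w + 1) / (m * (m - 1))

def prob_all_found(w, m):
    """Posterior probability that all minimizers were found (Formula 6.28)."""
    p = 1.0
    for l in range(1, w + 1):
        p *= (m - 1 - l) / (m - 1 + l)
    return p

def stop_nonsequential(w, m, eta1=0.01, eta2=0.95):
    """Stop if any of the rules 6.29-6.31 fires."""
    return (int(expected_undiscovered(w, m)) == 0
            or expected_uncovered_volume(w, m) < eta1
            or prob_all_found(w, m) > eta2)
```

For example, w = 3 minimizers found from m = 100 Multistart runs already satisfies rule 6.29, since the expected number of undiscovered minimizers is 12/95 ≈ 0.13.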
6.2.2 Sequential rules - optimal and suboptimal Bayesian stopping rules

Sequential stopping rules take into consideration the cost of sampling. Let us consider the following two loss functions (see Guus, Boender, Romeijn [83]):

$$L_1 = c_t^1 (\bar{w} - w) + c_e^1 m \qquad (6.32)$$

and

$$L_2 = c_t^2 \left(1 - \sum_{l=1}^{w} \Theta_l\right) + c_e^2 m. \qquad (6.33)$$

The coefficients $c_t^1, c_t^2 > 0$ are related to the cost of stopping the algorithm before all local minimizers are found. This cost is called the termination loss. The coefficients $c_e^1, c_e^2 > 0$ are related to the cost of the continuation of the search, called the execution loss. Boender and Rinnooy Kan considered the above functions for $c_e^1 = 1$ and $c_e^2 = 1$; they also gave other loss functions (see [32]). The following results are given after Guus, Boender, Romeijn [83] and Boender, Rinnooy Kan [32]. Under the assumption that m points give w different minimizers, the expected values a posteriori of L₁ and L₂ (the so-called expected posterior loss or just the posterior loss) are equal to:

$$E(L_1 \mid (m, w)) = c_t^1 \frac{w(w+1)}{m - w - 2} + c_e^1 m \qquad (6.34)$$

$$E(L_2 \mid (m, w)) = c_t^2 \frac{w(w+1)}{m(m-1)} + c_e^2 m. \qquad (6.35)$$

The pair (m, w) denotes that after m steps of the algorithm (m drawings and m local searches in Multistart) w different minimizers have been found. The posterior loss after m′ > m observations is a random variable $E(L_i \mid (m', w'))$ (see [32]). A stopping rule is optimal if it minimizes the sum of $E(L_i \mid (m', w'))$ for m′ = m + 1, ..., ∞. For a pair (m, w) only two results of performing one more step are possible: either a new minimizer will be found (which is denoted by (m + 1, w + 1)) or the local method will converge to a known minimizer (which is denoted by (m + 1, w)). The probability that the next step will produce a new minimizer is equal to the expected value (a posteriori) of the sum of the relative volumes of the sets of attraction of undiscovered local minimizers (see Formula 6.30). The conditional expected value of the posterior loss after one more step of the algorithm can be estimated on the basis of 6.34 and 6.35 according to the recurrence formula (i = 1, 2):

$$E(E(L_i \mid (m+1, w)) \mid (m, w)) = \left(1 - \frac{w(w+1)}{m(m-1)}\right) E(L_i \mid (m+1, w)) + \frac{w(w+1)}{m(m-1)}\, E(L_i \mid (m+1, w+1)). \qquad (6.36)$$
One more step of the algorithm means that one more point is generated and all other operations related to this point are performed; in Multistart this is the run of the local search. The optimal rule stops the algorithm if the following criterion is satisfied

$$E(E(L_i \mid (m+1, w)) \mid (m, w)) > E(L_i \mid (m, w)) \qquad (6.37)$$

under the assumption that the optimal strategy would also be applied in subsequent steps. The estimation of the optimal rule begins with finding such an m* (if it exists) that Formula 6.37 holds for all pairs (m′, w) with m′ ≥ m* (we assume here that m′ ≥ w + 2). In other words, for all (m′, w), m′ ≥ m*, the value of the expected posterior loss never decreases after the execution of a subsequent step. For all pairs (m′, w) where m′ = m* the optimal decision is to stop the algorithm. By applying the rule 6.36 in reverse, from m′ = m* − 1 down to the first step, one can determine the optimal strategy for all the pairs (m′, w), m′ < m*. The results can be stored in a table and used later as optimal stopping rules which are independent of the objective function. A stopping rule is one-step-look-ahead suboptimal if it stops immediately after the inequality 6.37 becomes true. A suboptimal rule can be used when the optimal sequential stopping rule does not exist.

Example for L₁:

$$E(E(L_1 \mid (m+1, w)) \mid (m, w)) - E(L_1 \mid (m, w)) = c_e^1 - c_t^1 \frac{w(w+1)}{m(m-1)}. \qquad (6.38)$$

This result can be obtained by substituting Formula 6.34 into Formula 6.36 (Boender and Rinnooy Kan showed it for $c_e^1 = 1$, see [32]). In real problems often $c_t^1 > c_e^1$, so in many cases there is no m* such that for all pairs (m′, w) with m′ ≥ m* the difference 6.38 is positive (we assumed earlier that m′ ≥ w + 2). Hence, the optimal sequential stopping rules for L₁ do not exist, but the suboptimal rule can be applied. The algorithm should stop if

$$c_e^1 - c_t^1 \frac{w(w+1)}{m(m-1)} > 0. \qquad (6.39)$$

Example for L₂: Boender and Rinnooy Kan showed (see [32]) for L₂ that m* = $c_t^2/3$ when $c_e^2 = 1$. After an easy modification for the current case we obtain:

$$m^* = \frac{c_t^2}{3\, c_e^2}. \qquad (6.40)$$

By applying the rule 6.36 in reverse one can determine the optimal strategy for all the pairs (m′, w), m′ < m*.
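The one-step-look-ahead rule 6.39 reduces to a single comparison. A minimal sketch with assumed names follows.

```python
def stop_suboptimal_L1(w, m, c_t, c_e):
    """One-step-look-ahead suboptimal rule for the loss L1 (Formula 6.39):
    stop when the expected execution loss of one more step exceeds the
    expected reduction of the termination loss."""
    return c_e - c_t * w * (w + 1) / (m * (m - 1)) > 0
```

With w = 3 found minimizers, termination cost c_t = 100 and execution cost c_e = 1, the rule tells us to keep sampling at m = 10 but to stop by m = 100.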
6.2.3 Stopping rules that use values of the objective function

Piccioni and Ramponi (see [127] and also [83]) extended the above approach by a mechanism which takes into consideration the function values at local minimizers. The stopping rules consider undiscovered minimizers at which the objective value is less than the smallest value found so far. The following stopping rules are analogous to rules 6.29, 6.30 and 6.31.

• Stop if the expected number of undiscovered better local minimizers is equal to zero, i.e.:

$$\mathrm{Int}\left(\frac{w}{m - w - 2}\right) = 0. \qquad (6.41)$$

• Stop if the expected value of the sum of the relative volumes of the basins of attraction of better local minimizers is less than a constant η₁:

$$\frac{w}{m(m-1)} < \eta_1. \qquad (6.42)$$

• Stop if the probability that a better local minimizer does not exist is greater than a constant η₂:

$$\frac{m - w - 1}{m - 1} > \eta_2. \qquad (6.43)$$

A suboptimal stopping rule can also be estimated for L₁ in a similar way (see [83]):

• Stop if

$$c_e^1 - c_t^1 \frac{w}{m(m - w - 2)} > 0. \qquad (6.44)$$
6.3 Two-phase genetic methods

6.3.1 The idea of Clustered Genetic Search (CGS)

The idea of using genetic algorithms (GA) in clustering methods in global optimization follows from the observation that GA constitute systems that transform measures (see Section 4.2). This indicates that GA should be used in order to obtain information about certain sets of positive measure rather than about the precise location of minimizers (maximizers). Moreover, for some types of global optimization problems, a clustering algorithm that uses GA would give not only local minimizers (maximizers) but
6 Two-phase stochastic global optimization strategies
also approximations of the basins of attraction (or some level sets which are central parts of basins) of local minimizers (maximizers). Such approximations can be helpful in many cases, for instance in parameter inverse problems, when for technical reasons multi-criteria optimization methods have to be employed. One example is the problem of determining parameters of materials to be used in a construction, when parameters that correspond to exact local minimizers are not available and the choice must be limited to certain series. The essential question is then how far one can depart from the local extremes.

In this chapter we assume that the basins of attraction are approximated by unions of small hypercubes, i.e. a raster is defined over the domain of searches. The volume of the hypercubes should be a compromise between the accuracy of approximation and the capabilities of the hardware. By a cluster we now mean not a set of points that belong to the basin of attraction of a local minimizer (as in the clustering methods described in Section 6.1), but a set of hypercubes (raster cells) that contain sample points.

The Clustered Genetic Search (CGS) algorithm proposed in this chapter makes use of the fact that a population of a GA concentrates in the basins of attraction of local minimizers (maximizers). A rough sketch of CGS is as follows. Clusters (which are now unions of hypercubes) are recognized in a stepwise manner. Each step (a stage) results in the determination of some parts of clusters, which can be called subclusters. In each stage a genetic algorithm is started from the uniform initial population. After the stopping rule is satisfied, the final population concentrates on parts of the basins of attraction of some minimizers. The final population determines subclusters, which are recognized by means of density analysis. Clusters are built as unions of subclusters.
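For illustration, the assignment of sample points to raster cells can be sketched as follows (the helper name and data layout are hypothetical, not part of the original algorithm):

```python
# Hypothetical raster helper: map a point in a box domain to the integer
# coordinates of the raster cell (hypercube) containing it.
# lo, hi: per-dimension bounds of the search domain; n_cells: cells per axis.

def cell_index(point, lo, hi, n_cells):
    """Return the tuple of integer cell coordinates containing `point`."""
    idx = []
    for x, a, b in zip(point, lo, hi):
        k = int((x - a) / (b - a) * n_cells)
        idx.append(min(max(k, 0), n_cells - 1))  # clamp boundary points
    return tuple(idx)
```

A subcluster is then simply a set of such tuples, and its volume is the number of cells times the cell volume.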
In subsequent stages clusters can be enlarged by attaching new subclusters. The seed of a subcluster can be the raster cell that contains “the best” individual (point). Another possibility is the cell that contains the result of a local method started from the best point. During the subcluster recognition phase, neighboring cells in which the density of individuals is greater than a certain threshold are attached to the subcluster. A rough local method is started from each subcluster; it enables subclusters to join and create clusters.

The fitness function is modified on those hypercubes that are assigned to subclusters. The modification consists of setting the fitness value to the maximum found so far (or to an arbitrary value). The goal of this operation is to push individuals in new generations away from subclusters that are already recognized.

The proposed strategy utilizes the Simple Genetic Algorithm with the standard binary encoding, one-point cross-over and a non-zero mutation rate. This allows us to evaluate some properties of the whole strategy.

A simple parallel version of CGS has been tested. In this version the domain of searches is divided into subdomains. Each subdomain is divided into hypercubes of the same size. All computations are coordinated by the central coordinator called the Master process. Slave processes are distributed in a
computer network. The slaves recognize subclusters in parallel. Each slave is responsible for its own subdomain. After the global stopping rule is satisfied in each subdomain, the slaves send the following data to the coordinator: the minimum of the objective function and the minimizer for each recognized cluster, and the size of each cluster (the number of raster cells). Exact information about each subcluster is distributed throughout the nodes. The coordinator finally joins subclusters into clusters if the distance between the subclusters’ minimizers is less than the diagonal of the raster cell. The exact assignment of raster cells to clusters can be stored in a distributed manner or centrally in the coordinator (if this is possible and required).

6.3.2 Description of the algorithm

The parallel distributed version of the Clustered Genetic Search algorithm is presented in this section.

Master
1: Divide the domain of searches D into p subdomains.
2: Start p slave processes.
3: Wait for the results (local minima and minimizers, sizes of subclusters).
4: After all results are obtained, join those subclusters in which the distance between minimizers is less than the diagonal of the raster cell.
Algorithm 8: Parallel version of CGS, Master.

The number of subdomains p can be greater than the number of workstations. In such a case new slave processes can be started after one of the working slaves finishes its work.

Slaves

In the version of CGS that was tested on 2-6 dimensional problems the raster was implemented as a table. Each raster cell corresponded to a cell in the table. Each table cell contained either the subcluster’s number or zero (if the corresponding raster cell was not assigned). Additionally, a list of subclusters contained the minimizers and minima that were found. Figure 6.1 presents the idea of fitness modification. A single cluster is an approximation of a central part of the basin of attraction of a local minimizer. A cluster can be recognized in one or more iterations of the outer loop (line 3 in Algorithm 9). In the inner loop (line 8) a Simple Genetic Algorithm is performed. This loop stops when the criterion
1: Do in parallel (in subdomains):
2:   Determine the raster in the subdomain.
3:   repeat
4:     Generate the initial population according to the uniform distribution on the domain.
5:     Evaluate fitness function f = Φ̃ outside recognized subclusters.
6:     Determine MAX (the largest value of Φ̃).
7:     Modify the fitness function (f ← MAX) in raster cells that are assigned to subclusters.
8:     repeat
9:       Steps of Genetic Algorithm, evaluation of subsequent generations.
10:      Every certain number of generations check if subclusters can be recognized.
11:      if Subclusters can be recognized then
12:        Recognize subclusters using density analysis.
13:        Start a local search in each subcluster.
14:        Join subclusters using information from local searches (line 13).
15:      else
16:        Check if the distribution of individuals outside recognized subclusters is uniform.
17:      end if
18:    until Complex stopping rule has been satisfied:
19:      Subclusters can be recognized OR
20:      The distribution of individuals outside of recognized subclusters is uniform.
21:  until
22:    Criterion from line 20 has been satisfied OR
23:    All raster cells are assigned to clusters OR
24:    Satisfactory set of clusters has been found.
Algorithm 9: Parallel version of CGS, slaves.
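The uniformity check in lines 16 and 20 of Algorithm 9 is not specified in detail; one possible heuristic is a chi-square-style dispersion test over the unassigned raster cells (a sketch; the threshold is a hypothetical tuning parameter):

```python
# Heuristic uniformity check for individuals outside recognized subclusters
# (cf. Algorithm 9, lines 16 and 20). The statistic and threshold are
# hypothetical tuning choices, not taken from the original algorithm.

def looks_uniform(counts, threshold=2.0):
    """counts: number of individuals in each unassigned raster cell."""
    n_cells = len(counts)
    total = sum(counts)
    if total == 0 or n_cells == 0:
        return True
    expected = total / n_cells
    chi2 = sum((c - expected) ** 2 / expected for c in counts)
    # For a uniform sample chi2 stays close to n_cells - 1 on average.
    return chi2 < threshold * (n_cells - 1)
```

A perfectly even occupancy passes the test, while a strong concentration in one cell fails it, which is exactly the event that triggers subcluster recognition.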
Fig. 6.1. Modification of fitness function in recognized clusters.
from line 19 is satisfied. Then the subcluster recognition process is started, using the analysis of the density of individuals in raster cells. The cell that contains “the best” individual not yet assigned to a cluster becomes the seed of a subcluster. In an alternative tested version the seed is the raster cell that contains the minimizer found by a local optimization method started from the best individual. Neighboring cells that contain more individuals than a certain threshold are added to the subcluster (neighboring cells are those cells that are in contact with each other). A rough local method is started in each new subcluster. If the resulting minimizer is contained in an already recognized subcluster, the subclusters are joined. This can be seen in Figure 6.2.
Fig. 6.2. Joining of subclusters.
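The density-based growth of a subcluster from its seed cell can be sketched as a threshold flood fill over neighboring raster cells (a minimal two-dimensional illustration; the data structures are hypothetical):

```python
from collections import deque

# Sketch of the density-based subcluster recognition: starting from a seed
# cell, attach neighboring raster cells whose individual count exceeds a
# threshold. `density` maps 2-D cell coordinates to individual counts.

def grow_subcluster(seed, density, threshold):
    cluster = {seed}
    queue = deque([seed])
    while queue:
        cx, cy = queue.popleft()
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                nb = (cx + dx, cy + dy)
                if nb not in cluster and density.get(nb, 0) > threshold:
                    cluster.add(nb)
                    queue.append(nb)
    return cluster
```

Cells that touch the growing subcluster (including diagonally, in this sketch) are attached as long as their density exceeds the threshold; distant dense cells are left for later seeds.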
6.3.3 Stopping rules and asymptotic properties

Bayesian stopping rules for clustering methods in global optimization (see Section 6.2) cannot be simply transferred to Clustered Genetic Search. The reason is the assumption made for Bayesian rules that the sample should be uniformly distributed in the domain of searches, so that the sample can easily approximate the Lebesgue measure. In CGS the distribution of individuals is usually not uniform. Moreover, it is difficult to estimate this distribution in subsequent generations. The Clustered Genetic Search algorithm differs significantly from the clustering methods derived from Multistart; hence, different methods of analysis are required. The algorithm is closer to the methods proposed by Törn. However, unlike those methods, in which after the concentration phase one can hardly say anything about the sample, for CGS we can characterize the populations and some properties of the algorithm.
The stopping rule of the inner loop is complex and takes into consideration three kinds of algorithm behavior:

1. First, when the SGA finds subclusters and distinct concentrations of individuals can be determined.
2. Second, when individuals are evenly distributed outside the recognized clusters. This means that an area which resembles a plateau is recognized.
3. Third, when one cannot detect any stabilization of populations after an appropriately large number of generations. This indicates that the algorithm should be stopped and the parameters of the SGA should be modified. For the sake of simplicity this case is not described in Algorithm 9.

We assume that the algorithm is well tuned to the problem (see Definition 4.63), which means that the limit sampling measure given by the stationary point of the genetic operator G (see Definition 4.16) concentrates on the central parts of the basins of attraction. Before the process of cluster recognition starts it is important to verify the convergence of the sequence of counting measures (given by populations) to the limit sampling measure. Any sequence of counting measures converges when µ → ∞, k → ∞ (this results from Theorem 4.65). Theorem 4.66 implies that the sequence of estimators of the measure density converges. Hence, a heuristic criterion which detects stagnation of the sequence of density estimators is proposed. The criterion in line 10 of Algorithm 9 can be as follows:

•
Check the convergence of the sequence of estimators of measure density. This can be treated as the “convergence” of generations. If the convergence is detected, then check if an appropriate number (given as a parameter) of raster cells contain more individuals than a certain threshold. If yes, clusters can be recognized.
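One possible implementation of the stagnation test for the sequence of density estimators is to compare successive per-cell histograms by total variation distance (a heuristic sketch; eps is a hypothetical tuning parameter):

```python
# Heuristic stagnation test for the sequence of density estimators
# (the "convergence of generations" criterion). Successive per-cell
# frequency histograms are compared by total variation distance.

def total_variation(hist_a, hist_b):
    """hist_a, hist_b: dicts mapping raster cells to individual counts."""
    cells = set(hist_a) | set(hist_b)
    n_a, n_b = sum(hist_a.values()), sum(hist_b.values())
    return 0.5 * sum(abs(hist_a.get(c, 0) / n_a - hist_b.get(c, 0) / n_b)
                     for c in cells)

def generations_converged(hist_a, hist_b, eps=0.05):
    return total_variation(hist_a, hist_b) < eps
```

When two consecutive generations produce nearly identical cell frequencies the sequence is considered stagnant, and the threshold-count check for subcluster recognition can follow.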
Theorem 4.54 enables us to justify the stopping of the algorithm when populations of the SGA stabilize with the distribution close to uniform. This is related to the recognition of a plateau-like area (an area with a small variability of the fitness function) outside recognized clusters. If, in the subsequent generations, the SGA does not go away from a certain neighborhood of the central point of Λr−1 , this point is recognized as a stationary point of the genetic operator G. This stopping rule makes the algorithm especially useful for a certain class of objective functions, those with large plateau-like areas. Theorem 4.54 and the assumptions imply a certain kind of asymptotic probabilistic guarantee of success in the sense that all “essential” extremes will be found. This can be justified in the following way: •
The SGA is well tuned to the problem for the set of local minimizers that belong to the set W (see Definition 4.63, Remark 2.2, and Remark 2.3). Theorem 4.66 guarantees that sampling measures corresponding to
populations concentrate on all the basins of attraction of local minimizers. So we expect that all essential minimizers can be discovered by clustering.
• The construction of the global stopping rule and Theorem 4.54 imply that the process of cluster recognition is not stopped before all essential minimizers are found.
What does an “essential” minimizer mean? The algorithm cannot recognize extremes with basins of attraction smaller than the raster cell. Moreover, some basins which are too shallow can be omitted because of mutation. Larger mutation rates mean that individuals fill the basin more and more and can “overflow” it. Thus, some basins can “vanish”: they become indistinguishable from neighboring basins. The dependence between the mutation rate and the detectability of basins is not well recognized yet and requires further theoretical and experimental studies. In tests of the proposed algorithm the values were set after experiments. In practice, both for optimization problems and for parameter inverse problems, such a detectability condition can be justified: often only “essential” extremes are to be found, which means extremes with large and sufficiently deep basins. The algorithm has the property of asymptotic probabilistic correctness if the global extreme is essential in the above sense.

One can see an analogy between the way in which mutation and crossover influence the proposed algorithm and the way in which the reduction phase influences such clustering methods as DC or SL. Both mechanisms mean that some extremes may stay undetected. However, unlike DC and SL with the reduction phase, the SGA is a filter which eliminates extremes that are not “essential”. Moreover, the genetic algorithm can detect extremes regardless of their fitness value, whereas the reduction phase in classical clustering methods eliminates the possibility of detecting minima above a certain level.

The Clustered Genetic Search has certain good properties that distinguish it from other clustering methods in global optimization. The considerations presented here should be treated as the first step towards further research.
However, this step has its importance, because in most applications of the SGA (or evolutionary algorithms in general) even estimates of asymptotic properties do not exist and heuristic stopping rules are very common.

6.3.4 Illustration of the performance of Clustered Genetic Search

In this section the results of some tests of CGS for chosen functions are briefly described. The functions are: the Rastrigin function, the Rosenbrock function, a sine of products and a function with a large plateau-like area with two “essential” isolated local minimizers. The results for a parameter inverse problem have been presented in Chapter 2.
Plan and the goal of tests

The tests described in this chapter have been carried out on some known functions that are, for various reasons, difficult and are often used as test functions for global optimization methods. In spite of the fact that they are not representative of most problems in engineering, because the cost of local methods is relatively low for them, they can be used to exhibit some properties of the proposed algorithm. Such a purpose of the tests justifies the use of two-dimensional domains. Tests for more than two dimensions have been carried out by Telega (see [184]).

The chosen test functions are sources of different difficulties for global optimization algorithms: many local minimizers, large plateau-like areas and curved valleys. In the distributed versions the domain of searches has been divided in such a way that joining subclusters recognized in different subdomains was needed, which was a source of additional difficulty for the algorithm. Each function has been tested with different parameters of the SGA engine and CGS: the mutation rate, the size of the population, the threshold density of individuals (for cluster recognition) and the number of generations in one iteration of the algorithm.

The Rastrigin function (a cosine added to a square function) has been used in order to check the filter property. Tests of this function have been carried out in two domains (a smaller and a larger one), emphasizing the different influence of the two components of the sum: trigonometric and square. The sine of products test function has been tested with a variant of CGS without local methods started from each subcluster. This modification was necessary because the local minimizers were not isolated, which was a source of additional difficulty. The function with a large plateau-like area has been used in order to check the usefulness of the stopping rule. In the other tests the algorithm was stopped after a certain number of iterations of the outer loop.
The proposed stopping rule applied to the Rastrigin or Rosenbrock function would mean that every raster cell should be assigned to some cluster, which would be ineffective. All tests have been carried out with the use of the scalar version and the simple distributed version. The reference algorithm used for comparisons was a version of Multistart in which a local method is started from each raster cell. Cells for which local methods find minimizers located no further apart than the length of the diagonal of the cell are assigned to the same cluster. This algorithm has been called Raster Multistart (RMultistart). Tests made for known standard functions with different values of parameters, such as the population size or the stopping rule applied before clusters can be recognized, can bring some hints about how to use the algorithm for certain classes of problems. However, quantitative conclusions drawn from tests cannot be a basis for definitive statements.

Remark 6.16. In figures that graphically present the results of CGS, the function map (located in the left upper part of the figure) is oriented differently
than the part which shows the results of clustering. This is caused by the graphical library that was used. The same orientation can be obtained by exchanging the X and Y axes. This remark concerns all examples.

Rastrigin function

The function that has been tested has the following form:

f(x, y) = x² + y² − cos(18x) − cos(18y) + 2    (6.46)
Figure 6.3 presents the two-dimensional map and the three-dimensional graph of the function. Different function values are presented as different shades of gray. The domain is a square −0.5 ≤ x ≤ 0.5, −0.5 ≤ y ≤ 0.5.
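For reference, the tested variant of the Rastrigin function (Formula 6.46) can be written directly as:

```python
import math

# The tested Rastrigin variant (Eq. 6.46); the global minimum is 0 at (0, 0).

def rastrigin(x, y):
    return x * x + y * y - math.cos(18 * x) - math.cos(18 * y) + 2
```

The global minimum value 0 at (0, 0) is easily verified, and the function is strictly positive away from the local minimizers.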
Fig. 6.3. Rastrigin function for −0.5 ≤ x ≤ 0.5, −0.5 ≤ y ≤ 0.5.
In this domain there are nine local minimizers. The global minimum, equal to 0, is located at (0, 0). Figure 6.4 presents the graph and the map of the same function in the square domain −10 ≤ x ≤ 10, −10 ≤ y ≤ 10. The minima from Figure 6.3 cannot be seen because of the scale of the map.

Rastrigin function for −0.5 ≤ x ≤ 0.5, −0.5 ≤ y ≤ 0.5

The domain is divided along the x and y axes (y = 0, x = 0) into four subdomains. Each subdomain is divided into 225 raster cells (900 cells in the whole domain). Figure 6.5 graphically presents the results of CGS for the square domain −0.5 ≤ x ≤ 0.5, −0.5 ≤ y ≤ 0.5 and the following parameters: population size = 80, mutation coefficient = 0.001, number of generations in one iteration = 10, number of iterations = 20, threshold density of individuals in one raster cell = 12. The parameters were not optimized. Different subclusters found in different
Fig. 6.4. Rastrigin function for −10 ≤ x ≤ 10, −10 ≤ y ≤ 10.
Fig. 6.5. Rastrigin function, graphic presentation of the results of CGS.
subdomains (before the coordinator joins them) are presented by different shades of gray. All 9 local minimizers were found. The total number of objective evaluations, the number of local searches and the number of objective evaluations in each local search were counted. The number of function evaluations can serve as a criterion for comparison with other methods. CGS was proposed for time-expensive functions, so such a comparison reflects the differences in execution time well. For RMultistart the number of function evaluations can be estimated as the product of the number of raster cells and the average number of function evaluations in one local search. The number of local searches in CGS (with non-optimized parameters) was considerably smaller than in RMultistart (63 in comparison to 900), but the number of function evaluations was greater. The fitness function is very easy for local methods: the number of function evaluations in one local search was between 30 and 50, on average a little more than 40. The number of function evaluations in CGS was almost equal to 68000. In RMultistart it can be estimated at about 36000.

Figure 6.6 presents the results of CGS after fine tuning of the parameters. The goal of fine tuning is to obtain good minimization results with the smallest time cost of the process. The values of the parameters are as follows: population size = 50, mutation rate = 0.004, number of generations in one iteration = 20, threshold density of individuals in one raster cell = 6. The number of fitness evaluations was equal to 29044, the number of local searches = 83, the number of function evaluations in local searches = 3223.
Fig. 6.6. Results of CGS after fine tuning.
In each test CGS found all 9 local minimizers in the domain. The master process joined subclusters. Nine clusters were always found.
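The master's joining criterion, merging subclusters whose minimizers lie closer than the raster-cell diagonal, can be sketched as a union-find pass (a minimal illustration; the data layout is hypothetical):

```python
import math

# Sketch of the master's joining step (cf. Algorithm 8, line 4): subclusters
# whose minimizers are closer than the raster-cell diagonal are merged.

def join_subclusters(minimizers, cell_diagonal):
    """minimizers: list of minimizer points, one per subcluster.
    Returns a cluster label for each subcluster."""
    n = len(minimizers)
    labels = list(range(n))

    def find(i):  # union-find root with path halving
        while labels[i] != i:
            labels[i] = labels[labels[i]]
            i = labels[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(minimizers[i], minimizers[j]) < cell_diagonal:
                labels[find(i)] = find(j)
    return [find(i) for i in range(n)]
```

Subclusters recognized in different subdomains around the same minimizer thus receive one common label, while distant subclusters remain separate clusters.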
Because the complexity of CGS can now be comparable to the complexity of RMultistart for simple (“easy”) objective functions, one can expect it to be better for more “difficult” functions.

Rastrigin function for −10 ≤ x ≤ 10, −10 ≤ y ≤ 10

The Rastrigin function has many local minimizers in the square domain given by −10 ≤ x ≤ 10, −10 ≤ y ≤ 10. However, the dominant component of the sum (Formula 6.46) is the sum of the squares of x and y. When only the global minimizer is to be found, methods of smoothing the objective function can be applied (see Coleman, Zhijun Wu [50]). The SL and MLSL algorithms are aimed at finding all local minimizers, so they are not effective for such problems. The goal of the tests of CGS for the Rastrigin function in the considered domain is such a selection of parameters that causes the algorithm to “see” the objective function as the square function. Here subclusters should be related to level sets rather than to real basins of attraction of local minimizers. We want to check the following filter property: most local minimizers should remain undiscovered and the algorithm should only find the shape of the valley which contains the local minimizers. Such a property can be valuable when the objective function has “noisy” valleys with many irrelevant local minimizers with small basins of attraction and similar values. In many cases, only a rough recognition of the whole valley and an approximation of the global minimizer are important. In such cases CGS can be an interesting option, but one modification should be made. Local searches should not be used in order to join subclusters. Clusters should be built with the use of the density analysis only. Another possibility (implemented in the tests) is to join subclusters when their minimizers are located no farther apart than a certain threshold (for instance several cell diagonals). Below, some results of the tests are presented.
The initial parameters (not optimized) were as follows: 4 subdomains, raster consisted of 900 cells, population size = 50, mutation rate = 0.02, threshold density in one cell = 4, number of generations before subclusters are recognized = 10, number of iterations = 20. The obtained results: number of function evaluations about 44000, number of local searches = 98, number of function evaluations in local searches = 4491. The master process joined all the subclusters found in subdomains into one cluster. Figure 6.7 graphically presents the results. Figure 6.8 presents results of CGS with optimized parameters (parameters were optimized in order to obtain a lower time cost): population size = 60, mutation rate = 0.02, threshold density of individuals in one cell = 3, number of SGA generations before subclusters are recognized = 3, number of iterations = 20. The obtained results: number of function evaluations = 23797, number of local searches = 125, number of function evaluations in local searches = 5958. The master process joined all the subclusters found in subdomains into
Fig. 6.7. Rastrigin function for −10 ≤ x ≤ 10, −10 ≤ y ≤ 10.
one cluster. The version without local searches was quicker: number of function evaluations = 17839.
Fig. 6.8. Results of CGS after fine tuning.
Rosenbrock function

The form of the function is the following:

f(x, y) = 100(y − x²)² + (1 − x)²    (6.47)
The domain was a square −1 ≤ x ≤ 1, −1 ≤ y ≤ 1.
Fig. 6.9. Rosenbrock function.
This function is also known as the Rosenbrock curved valley. It has one isolated global minimizer (1, 1), where the minimum value is equal to 0. In a significant part of the domain the function graph is almost flat and the valley is curved, which causes difficulties for local methods. This feature is the reason why the Rosenbrock function is used as a standard test function for local methods. Figure 6.9 shows the graph of the Rosenbrock function and graphically presents the results of CGS after the parameters have been refined. The parameters of CGS were as follows: 4 subdomains, number of raster cells = 400 in the whole domain, population size = 35, mutation rate = 0.0015, threshold number of individuals in one cell = 6, number of SGA steps before
subclusters are recognized = 5, number of iterations = 20. The obtained results: number of function evaluations = 19500, number of local searches = 48, number of function evaluations in local searches = 6275. The master process joined all the subclusters found in the subdomains into one cluster. One global minimizer was found. In this case the average number of function evaluations in one local search was equal to 131 (in some tests even about 180). We can estimate that RMultistart would require 52400 (or 72000) function evaluations. The time cost of CGS is about 35% of the cost of RMultistart. Additionally, CGS gives more information about the basin of attraction.

Sine of a product

The form of the function is the following:

f(x, y) = sin(xy) + 1    (6.48)
The domain is a square −3 ≤ x ≤ 3, −3 ≤ y ≤ 3. This function has been tested in order to check the behavior and usefulness of CGS in cases when local minimizers are not isolated. Local methods were not used here. Subclusters were recognized only by means of the analysis of the density of individuals in raster cells. Figure 6.10 presents the function and the results of the modified CGS. The parameters were as follows: number of subdomains = 4, number of raster cells in the domain = 400, population size = 35, mutation rate = 0.01, threshold density of individuals in one raster cell = 7, number of SGA steps before subclusters are recognized = 4, number of iterations = 18. The number of function evaluations was equal to 11508. This function has been tested only with the scalar version of CGS. The conclusion of the tests is that when local methods can be dispensed with, better results can be obtained with the use of methods other than CGS.

A test function with a large plateau-like area

Let us introduce the following notation: (logical_condition1 AND logical_condition2) means 1 if both logical conditions are satisfied and 0 otherwise. The form of the function is the following:

f(x, y) = 12 + 0.01 sin(0.05x) + 0.01 sin(0.05y)
  − 0.0009(x² + y² − 10)(x > 60 AND x < 70 AND y > 60 AND y < 70)
  − 0.0009(x² + y² − 10)(x > −50 AND x < −20 AND y > −70 AND y < −40)    (6.49)

It is the sum of the function fI(x, y) = 12 + 0.01 sin(0.05x) + 0.01 sin(0.05y) and the square function fII = −0.0009(x² + y² − 10) in some squares. The
Fig. 6.10. Sine of a product.
whole domain is given by −100 ≤ x ≤ 100, −100 ≤ y ≤ 100. Figure 6.11 presents the graph of this function.
Fig. 6.11. A test function with large plateau.
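Formula 6.49 with its indicator terms can be transcribed directly; the two boxes activate the square component, so the function is an almost flat plateau near the value 12 elsewhere:

```python
import math

# Direct transcription of the plateau test function (Eq. 6.49). The two
# if-branches implement the indicator factors of the formula.

def plateau_f(x, y):
    f = 12 + 0.01 * math.sin(0.05 * x) + 0.01 * math.sin(0.05 * y)
    if 60 < x < 70 and 60 < y < 70:
        f -= 0.0009 * (x * x + y * y - 10)
    if -50 < x < -20 and -70 < y < -40:
        f -= 0.0009 * (x * x + y * y - 10)
    return f
```

At the center of the domain the function equals 12 exactly, while inside the boxes the subtracted square component produces the two isolated basins.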
“Folds” caused by the component fI cannot be seen because of the scale. Some tests have been carried out on the version with the stopping rule presented in Algorithm 9, some with an assumed number of iterations. Figure 6.12 presents the results of both versions (they were the same) for the following (not optimized) parameters: number of subdomains = 4, number of raster cells in the whole domain = 400, population size = 40, mutation rate = 0.001, threshold number of individuals in one cell = 24, number of SGA steps before clusters can be recognized = 5.
Fig. 6.12. Results of CGS.
Two clusters and two local minimizers have been found. For the version with the stopping rule given in Algorithm 9 the number of function evaluations was equal to 10306, the number of local searches = 4, and the number of function evaluations in local searches = 629. The results for the version with the simplified stopping rule were as follows: number of function calls = 19573, number of local searches = 4, number of function evaluations in local searches = 776. The number of local searches is small. This is a good prognosis for cases when the local search is much more expensive.
The stopping rule proposed in Algorithm 9 means that the number of function evaluations is almost half of that for the version with the simplified stopping strategy. In two subdomains only one iteration was performed and no local minimizer was found. The filter property can be seen here: in fact both subdomains contained a local minimizer, but the minima were omitted because their basins of attraction were too shallow to be recognized by CGS. After the tuning of the parameters the time cost was diminished, but the basins were approximated less accurately. Figures 6.13 a) and b) present the graphical results of CGS with the stopping rule from Algorithm 9 and the following parameters:

•
population size = 20, mutation rate = 0.071, threshold number of individuals in one raster cell = 6, number of SGA steps before subclusters are recognized = 3, number of function evaluations = 2873, number of local searches = 5, number of function evaluations in local methods = 759.
•
population size = 15, mutation rate = 0.09, threshold number of individuals in one cell = 6, number of SGA steps before subclusters are recognized = 3, number of function evaluations = 1967, number of local searches = 3, number of function evaluations in local searches = 652.
Fig. 6.13. Results of CGS for different parameters.
Summary of tests

Clustered Genetic Search can be used effectively to solve clustering problems in global optimization for the chosen test functions. Populations concentrate in the basins of attraction of local minimizers. When the parameters of the algorithm are refined, CGS can be much less expensive than the reference RMultistart. CGS can be especially effective for objective functions with
large plateau-like areas. However, so far there is no better answer to the question of how the parameters should be chosen than that they should be set via tests. This is the common way in which the parameters of genetic algorithms are set in most applications. On the other hand, there exist estimations of the asymptotic properties of CGS, and this singles out the method. Tests also showed that CGS has some good properties that distinguish it from standard clustering methods in global optimization. These properties can be advantageous in many practical problems. The CGS algorithm presented in this book should be treated as a first step toward using genetic algorithms in clustering global optimization methods. Further research and tests are required. Some improvements to CGS have been proposed by Podolak and Telega in [129]. The goal of the modifications is to store information about basins more effectively: ellipsoids can be used instead of the hypercubic raster or, in a more robust version, clusters can be remembered by a neural network.
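The ellipsoidal variant can be illustrated as follows. This is a rough sketch, not the code of [129]; it assumes that a recognized cluster is summarized by its sample mean and covariance, with membership tested via the Mahalanobis distance:

```python
import numpy as np

def fit_ellipsoid(cluster_points):
    """Summarize a recognized cluster by its sample mean and covariance."""
    pts = np.asarray(cluster_points, dtype=float)
    return pts.mean(axis=0), np.cov(pts, rowvar=False)

def in_ellipsoid(x, mean, cov, radius=3.0):
    """Membership test: squared Mahalanobis distance below radius ** 2."""
    d = np.asarray(x, dtype=float) - mean
    return bool(d @ np.linalg.inv(cov) @ d <= radius ** 2)

cluster = [(0.10, 0.10), (0.12, 0.09), (0.11, 0.12),
           (0.14, 0.10), (0.09, 0.11)]
mean, cov = fit_ellipsoid(cluster)
print(in_ellipsoid((0.11, 0.10), mean, cov))   # True: inside the cluster
print(in_ellipsoid((0.90, 0.90), mean, cov))   # False: distant point rejected
```

Compared with a raster of hypercubes, such an ellipsoid stores a basin approximation with a handful of numbers regardless of dimension, at the price of assuming a roughly convex cluster shape.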
7 Summary and perspectives of genetic algorithms in continuous global optimization
We would like to close this book with several short remarks that summarize the research into the design and analysis of genetic algorithms, as well as their application to solving continuous global optimization problems. Some conjectures concerning the directions of further research in these areas will also be made. Global optimization algorithms based on genetic mechanisms inherited from biology can be applied effectively to continuous problems only in difficult cases in which other, much faster computing strategies have failed. This is mainly due to their large computational cost, even when simple genetic techniques are run. Continuous global optimization problems with multimodal objective functions, in which the basins of attraction of local extremes are separated by large areas of moderate or low objective variability (plateaus), belong to this group of important difficult cases. Another type of difficulty is caused by the low regularity of the objective function, when gradient and Hessian computations require costly approximation routines or are generally meaningless. Genetic algorithms adapted to continuous problems are usually ineffective, or even ill defined, if a very accurate approximation of the extremes is needed. However, they may satisfactorily compete with Monte Carlo methods as global-phase algorithms in two-phase stochastic global optimization strategies. The review of research into genetic algorithms presented in this book exhibits two important directions:
• Studying the population dynamics modeled as a single point in the specially defined space of states E. The leading model in this group is the stochastic process treated as a dynamic system in the space of probabilistic measures M(E). In several cases (e.g. the Simple Genetic Algorithm, SGA) this model may be reduced to a uniform Markov chain. In this book, the results of this approach were mainly described in Sections 4.1.2 and 4.2.2. Moreover, the dynamics rule of the sampling measures on the admissible
R. Schaefer: Foundation of Global Genetic Optimization, Studies in Computational Intelligence (SCI) 74, 199–201 (2007) c Springer-Verlag Berlin Heidelberg 2007 www.springerlink.com
domain D may be derived from this model. The results of this group may be utilized for convergence verification and the construction of stopping rules. They are especially helpful in verifying the stopping rules of genetic algorithms used in the global phase of two-phase stochastic global optimization strategies (see Section 6.3.3).
• Studying the local behavior of individuals in an evolving population. The model of global convergence mentioned in Section 4.1.3 and the results presented in Section 4.2.1 fall into this group. In addition, important research directions such as the Building Block Theory for genetic algorithms with discrete encoding, as well as the evaluation of the first hitting time of the neighborhood of the global extreme, may be added to this group (see e.g. Garnier, Kaller, Schoenauer [72], Rudolph [142]). Theoretical results obtained in this way require strong assumptions with respect to both the genetic algorithm and the optimization problem.
An interesting new direction of research, which is not described in this book, is the study of the dynamics of spatial statistics of medium-sized populations (e.g. the dynamics of the mean phenotype or of the standard deviation of population phenotypes). Such results may be found in the papers of Chorążyczewski and Galar [48], [46], [47]. Their results show that medium-sized populations remain a compact group of points in the phenotype space during the whole evolution time. The global search is performed by the displacement of the whole population toward consecutive basins of attraction of local extremes. This movement proceeds periodically by repeating three phases:
• the climbing phase, in which the population moves toward the central part of the basin of attraction of a particular local extreme of the objective function,
• the phase of filling the central part of this basin of attraction,
• the phase of crossing the saddle in the evolutionary landscape toward the next basin.
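This three-phase behavior can be observed in a toy simulation of proportional (soft) selection with Gaussian mutation on a bimodal one-dimensional landscape. This is a minimal sketch; the fitness function, its two basins and all parameter values are invented for illustration and do not reproduce the experiments of [48]:

```python
import math
import random

def fitness(x):
    # Invented bimodal landscape: a lower peak near x = 0.5 and a higher
    # one near x = 1.5, separated by a saddle around x = 1.0.
    return (math.exp(-((x - 0.5) / 0.15) ** 2)
            + 1.5 * math.exp(-((x - 1.5) / 0.15) ** 2))

def evolve(population, epochs, sigma, rng):
    """Proportional (soft) selection plus Gaussian mutation; records the
    population mean after every epoch so that the drift of the compact
    group between basins can be observed."""
    means = []
    for _ in range(epochs):
        weights = [fitness(x) for x in population]
        population = [
            min(2.0, max(0.0, rng.choices(population, weights)[0]
                         + rng.gauss(0.0, sigma)))
            for _ in population
        ]
        means.append(sum(population) / len(population))
    return population, means

rng = random.Random(0)
population = [rng.gauss(0.5, 0.05) for _ in range(30)]  # start in lower basin
population, means = evolve(population, epochs=200, sigma=0.05, rng=rng)
```

With a larger mutation range or a longer run, the recorded means typically drift from the lower basin over the saddle toward the higher peak, reproducing qualitatively the climbing, filling and saddle-crossing phases.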
Moreover, the authors argue that it is almost impossible to design a single genetic algorithm that behaves effectively in each of the three phases. Some adaptation techniques that can relax this obstacle were mentioned in Chapter 5. The local character of single-population genetic search, as well as the difficulty of designing a universal algorithm that can be effectively utilized in each of the three phases described above, motivate the theoretical considerations and inventions concerning multi-deme parallel genetic searches. In such strategies, the deme components share search tasks of various accuracy, divide search regions, and can also specialize separately in climbing, filling or saddle crossing. The basic multi-deme model that incorporates the above requirements is the island model (see Section 5.4.2). It seems that the hierarchical strategies
described in Section 5.4.3 satisfy these needs more comprehensively. The results of multi-deme searches may be strengthened by proper post-processing (e.g. the Clustered Genetic Search described in Section 6.3) and by the next phase of solving the global optimization problem using accurate, low-cost local optimization methods. The high efficiency of multi-deme global searches and their post-processing is confirmed by mathematical derivations and numerical experiments (see e.g. Whitley, Soraya, Hackerdorn [200], Kołodziej [95], Cantú-Paz [44], [45] and Schaefer, Adamska, Telega [152]). Summing up, multi-deme genetic strategies supported by refined post-processing techniques constitute the most promising direction in global optimization search over large, multidimensional continuous domains.
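A minimal sketch of the island model with ring migration may look as follows. This is illustrative Python only; the toy objective, the tournament selection and all parameter values are assumptions, not the strategies analyzed in the cited works:

```python
import random

def fitness(x):
    # Toy unimodal objective (an assumption for illustration); maximum at x = 3.
    return -(x - 3.0) ** 2

def island_model(n_islands=4, deme_size=20, epochs=100, migration_every=10,
                 seed=0):
    rng = random.Random(seed)
    islands = [[rng.uniform(-10.0, 10.0) for _ in range(deme_size)]
               for _ in range(n_islands)]
    best = max((x for deme in islands for x in deme), key=fitness)
    for t in range(1, epochs + 1):
        for j, deme in enumerate(islands):
            # Binary tournament selection followed by Gaussian mutation.
            new_deme = []
            for _ in range(deme_size):
                a, b = rng.sample(deme, 2)
                parent = a if fitness(a) >= fitness(b) else b
                new_deme.append(parent + rng.gauss(0.0, 0.2))
            islands[j] = new_deme
        if t % migration_every == 0:
            # Ring migration: the best of each deme replaces the worst
            # individual of the next deme.
            migrants = [max(deme, key=fitness) for deme in islands]
            for j, deme in enumerate(islands):
                worst = min(range(len(deme)), key=lambda i: fitness(deme[i]))
                deme[worst] = migrants[j - 1]
        best = max([best] + [x for deme in islands for x in deme], key=fitness)
    return islands, best

islands, best = island_model()
```

In a hierarchical strategy the demes would additionally differ in search accuracy (e.g. encoding resolution); for brevity this sketch keeps all islands identical and varies only their random populations.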
List of Symbols
V – solution space for the global optimization problem
d(·, ·) – distance function in the space V
dH(·, ·) – Hamming distance function
‖·‖2 – Euclidean norm in the space V
‖·‖ – arbitrary norm in the space V
N – dimension of the global optimization problem (N = dim(V))
D – admissible set of solutions to the global optimization problem
Φ : D → R+ – objective function of a maximization problem
Φ̃ : D → R+ – objective function of a minimization problem
x∗ – global maximizer (or minimizer) of the global optimization problem
x+ – local maximizer (or minimizer) of the global optimization problem
meas(·) – Lebesgue measure function
Loc – set of local optimization methods
loc(·) – local optimization method
Rloc x+ – attractor of the local minimizer x+ with respect to the strictly decreasing local optimization method loc(·)
Bx+ – basin of attraction of the local minimizer x+
P – random sample, population
Pr(·) – probability measure
M(A) – space of probabilistic measures over the set A
b(·) – function that selects the best fitted individual in a population
Dr ⊂ D – grid of encoded points (set of phenotypes)
r = #Dr – cardinality of the phenotype set
U – genetic universum (set of genotypes)
code : U → Dr – bijective encoding function
dcode : D → U – partial function of inverse encoding (decoding partial function)
Ω – binary genetic universum in Chapters 3–5; space of elementary events in Section 6.1.5
l – binary code length
Z2 – group {0, 1} with addition modulo 2
⊕ – addition operator in Z2; coordinate-by-coordinate addition of binary vectors from Z2 × . . . × Z2
codea(·) – affine binary encoding
codeG(·) – Gray binary encoding
f : U → R+ – fitness function
Scale(·) – nonlinear function that modifies fitness
µ – number of individuals in the parental population
self(·) – probability distribution of selection
Elite – subset of individuals that pass to the next epoch with probability 1
1 – vector of l ones
î – inverse of the binary vector i (î = 1 ⊕ i)
pm – rate of binary mutation
pc – rate of binary crossover
[·] – evaluation function for boolean expressions
⊗ – coordinate-by-coordinate multiplication of binary vectors
type – type of binary crossover
mutx(·) – probability distribution of the binary mutation of an individual x
crossx,y(·) – probability distribution of the binary crossover of individuals x and y
N(e, C) – realization (result of sampling) of the N-dimensional Gauss random variable with mean vector e and covariance matrix C
N(e, σ) – realization (result of sampling) of the one-dimensional Gauss random variable with mean value e and standard deviation σ
U[0, 1] – realization (result of sampling) of the random variable with uniform probability distribution over the real interval [0, 1]
λ – number of offspring individuals in a single genetic epoch
κ – individual life time parameter of the evolutionary algorithm in Section 5.3.2; raster resolution in Section 6.1.10
S – space of schemata
H – schema from the space S
∆(H) – length of the schema H
ℵ(H) – degree of the schema H
E(·) – expected value operator
E – space of states of genetic algorithms
eqp – equivalence relation among vectors of genotypes
Sµ – group of permutations of the µ-element set
Λr−1 – unit r − 1 simplex in Rr
Xµ ⊂ Λr−1 – finite set of states of the genetic algorithm with finite population (µ < +∞)
#A – cardinality of the set or multiset A
n = #E – cardinality of the space of states
πt – genetic algorithm state probability distribution in the epoch t
τ – Markov transition function of genetic algorithm states
Q – probability transition matrix
F (·) – selection operator for the simple genetic algorithm
M (·) – mixing operator for the simple genetic algorithm
G(·) – genetic (heuristic) operator for the simple genetic algorithm
M – mixing matrix
K ⊂ Λr−1 – set of fixed points of the genetic operator G
dim(V ) – dimension of the vector space V
diam(A) – diameter of the subset A of a metric space
N – set of natural numbers
Z – set of integers
Z+ – set of nonnegative integers, Z+ = N ∪ {0}
R – set of real numbers
R+ – set of nonnegative real numbers
top(V ) – topology on V (the family of open sets in V)
A – topological closure of the set A in the proper topology
(A)− – complement of the set A, i.e. (A)− = V \ A if A is contained in the space V
χA – characteristic function of the set A; assuming A is contained in the space V, χA : V → {0, 1}, χA(x) = 1 if x ∈ A, χA(x) = 0 otherwise
⌈·⌉ – operator of the upper round ("ceiling")
⌊·⌋ – operator of the lower round ("floor")
Int : R → N – function returning the nearest integer to the argument
int(A) – interior of the set A in the proper topology
diag : RN → RN² – function returning the square diagonal matrix diag(v) with the diagonal equal to the vector v
supp(g) – support of the real valued function g : A → R, supp(g) = {x ∈ A; g(x) ≠ 0}
I – identity mapping
I – matrix of the linear identity of a finite dimensional vector space
Dom(f ) – domain of the function f
u(t) – vector of parameters that control genetic operations in the epoch t
L(Rr → Rr ) – space of linear mappings from Rr into itself
Γ – Euler's Gamma function
References
1. Ali M, Storey C (1994) Topographical Multilevel Single Linkage. Journal of Global Optimization 5:349–358. 2. Anderson RW (1997) The Baldwin Effect. Chapter C.3.4.1 in [15]. 3. Anderssen RS, Bloomfield P (1975) Properties of the random search in global optimization. Journal of Optimization Theory and Applications 16:383–389. 4. Arabas J (1995) Evolutionary Algorithms with the varying population cardinality and the varying crossover range. PhD Thesis, Warsaw University of Technology, Warsaw (in Polish). 5. Arabas J (2001) Lectures in Evolutionary Algorithms. WNT, Warsaw (in Polish). 6. Arabas J (2003) Sampling measure of an evolutionary algorithm. In: Proc. of 6-th KAEiOG Conf., Łagów Lubuski, 15–20. 7. Arabas J, Michalewicz Z, Mulawka J (1994) GAVaPS – a Genetic Algorithm with Varying Population Size. Proc. of ICEC’94, Orlando, Florida, IEEE Press, 73–76. 8. Arabas J, Słomka M (2000) Pseudo-random number generators in the initial population generation. In: Proc. of 4-th KAEiOG Conf., Lądek Zdrój, 7–12 (in Polish). 9. Archetti Betrò B (1978) On the Effectiveness of Uniform Random Sampling in Global Optimization Problems. Technical Report, University of Pisa. 10. Bäck T, Fogel DB, Michalewicz Z eds. (2000) Evolutionary Computation 1. Basic Algorithms and Operators. Institute of Physics Publishing, Bristol and Philadelphia. 11. Bäck T, Fogel DB, Michalewicz Z eds. (2000) Evolutionary Computation 2. Advanced Algorithms and Operators. Institute of Physics Publishing, Bristol and Philadelphia. 12. Bäck T (1996) Evolutionary Algorithms in Theory and Practice. Oxford Univ. Press. 13. Bäck T (1997) Self–adaptation. Chapter C.7.1 in [15]. 14. Bäck T (1997) Mutation Parameters. Chapter E.1.2 in [15]. 15. Bäck T, Fogel DB, Michalewicz Z (1997) Handbook of Evolutionary Computations. Oxford University Press. 16. Bäck T, Schütz M (1996) Intelligent Mutation Note Control in Canonical Genetic Algorithm. Foundation of Intelligent Systems. In: Ras ZW, Michalewicz Z eds. Proc. 
of 9-th Int. Symp. ISIM’96., LNCS 1079, Springer.
17. Bagley JD (1967) The Behavior of Adaptive Systems with Employ Genetic and Correlation Algorithms. PhD Thesis, University of Michigan, Dissertation Abstracts International 28(12), 5106B. (Univ. Microfilms No. 68–7556). 18. Bahadur RR (1966) A Note on Quantiles in Large Samples. Annals of Mathematical Statistics 37:577–580. 19. Beasley D, Bull DR, Martin RR (1993) A Sequential Niche for Multimodal Function Optimization. Evolutionary Computation Vol. 1, No. 2, 101–125. 20. Becker RW, Lago GV (1970) A Global Optimization Algorithm. In: Proc. of Allerton Conf. on Circuits and System Theory, Monticallo, Illinois, 3–15. 21. Bethke AD (1981) Genetic Algorithms as Function Optimizers. PhD Thesis, University of Michigan, Dissertation Abstracts International 41(9), 3503B, (Univ. Microfilms No. 8106101). 22. Betrò B (1981) Bayesian Testing of Nonparametric Hypotheses and its Application to Global Optimization. Technical Reports CNR-IAMI, Italy. 23. Betrò, B Schoen F(1987) Sequential Stopping Rules for the Multistart Algorithm in Global Optimization. Mathematical Programming 38:271–286. 24. Betrò, B Schoen F(1992) Sequential Stopping Rules for the Multistart Algorithm in Global Optimization. Mathematical Programming 52:445–458. 25. Beyer HG (1995) Toward a Theory of Evolution Strategies: Self-adoption. Evolutionary Computation 3:311–348. 26. Beyer HG (2001) The Theory of Evolution Strategies. Springer. 27. Beyer HG, Rudolph G (1997) Local Performance Measures. Chapter B.2.4. in [15]. 28. Billingsley P (1979) Probability and Measure. John Willey and Sons, New York, Chichester, Brisbane, Toronto. 29. Boender CGE (1984) The Generalized Multinominal Distribution: A Bayesian Analysis and Applications. PhD Thesis, Erasmus University, Rotterdam, Centrum voor Wiskunde en Informatica, Amsterdam. 30. Boender CGE, Rinnooy Kan AHG (1991) On when to Stop Sampling for the Maximum. Journal of Global Optimization 1:331–340. 31. 
Boender CGE, Rinnoy Kan AHG (1985) Bayesian Stopping Rules for a Class of Stochastic Global Optimization Methods. Technical Report, Econometric Institute, Erasmus University, Rotterdam. 32. Boender CGE, Rinnoy Kan AHG (1987) Bayesian Stopping Rules for Multistart Global Optimization Methods. Mathematical Programming 37:59–80. 33. Boender CGE, Rinnoy Kan AHG, Vercellis C (1987) Stochastic Optimization. In: Andreatta G, Mason F, Serafini P eds. Advanced School on Stochastics in Combinatorial Optimization. Word Scientific, Singapore. 34. Boender CGE, Rinnoy Kan AHG, Stougie L, Timmer GT (1982) A Stochastic Method for Global Optimization. Mathematical Programming 22:125–140. 35. Boender CGE, Zieliński R (1985) A Sequential Bayesian Approach to Estimating the Dimension of a Multinominal Distribution. In: Sequential Methods in Statistics. Banach Center Publications Vol. 16, PWN–Polish Scientific Publisher, Warsaw. 36. Borowkow AA (1972) A course in probabilistic. Nauka, Moscow (in Russian). 37. Brooks SH (1958) A Discussion of Random Methods for Seeking Maxima. Operational Research 6:244–251. 38. Burczyński T (1955) Boundary Element Method in Mechanics. WNT, Warsaw (in Polish).
39. Burczyński T, Długosz A, Kuś W, Orantek P (2000) Evolutionary Design in Computer Aided Engineering. In: Proc. of 5-th Int. Conf. on Computer Aided Engineering, Polanica Zdrój. 40. Burczyński T, Kuś W, Nowakowski M, Orantek P (2001) Evolutionary Algorithms in Nondestructive Identification of Internal Defects. In: Proc. of 5-th KEGiOG Conf., Jastrzębia Góra, 48–55. 41. Burczyński T, Orantek P (1999) The hybrid genetic–gradient algorithm. In: Proc. of 3-rd KEGiOG Conf., Potok Złoty, 47–54 (in Polish). 42. Cabib E, Davini C, Chong-Quing Ru (1990) A Problem in the Optimal Design of Networks Under Transverse Loading. Quaternary of Appl. Math. Vol. XLVIII, 252–263. 43. Cabib E, Schaefer R, Telega H (1998) A Parallel Genetic Clustering for Inverse Problems. LNCS 1541, Springer, 551–556. 44. Cantú Paz E (2000) Markov Chain of Parallel Genetic Algorithm. IEEE Transactions on Evolutionary Computation 4:216–226. 45. Cantú Paz E (2000) Efficient and accurate parallel genetic algorithms. Kluwer Academis Publishers. 46. Chorążyczewski A (2000) Some Restrictions of the Standard Deviation Modifications in Evolutionary Algorithms. In: Proc. of 4-th KEGiOG Conf., Lądek Zdrój, 45–50 (in Polish). 47. Chorążyczewski A (2001) The Analysis of the Adoption Skills of Evolving Population and their Applications in Global Optimization. PhD Thesis, Wrocław University of Technology, Wrocław (in Polish). 48. Chorążyczewski A, Galar R (2001) Evolutionary Dynamics in Space of States. In: Proc. of CEC’2001 Seoul, Korea, 1366–1373. 49. Chow YS, Teicher H (1978) Probability Theory. Springer. 50. Coleman TF, Zhijun Wu (1995) Parallel continuation-based global optimization for molecular conformation and protein folding. Journal of Global Optimization Vol. 8, No. 1, 49–65. 51. Cromen TH, Leiserson ChE, Rivest RL, Stein C (2001) Introduction to Algorithms. MIT Press. 52. Davis L (1989) Adapting Operator Probabilities in Genetic Search Algorithms. In: Proc. 
of ICGA’89 Conf., San Mateo, CA, Morgan Kaufman. 53. Davis L ed. (1991) Handbook of Genetic Algorithms. Van Nostrand Reinhold, New York. 54. De Jong KA (1975) An Analysis of the Behavior of a Class of Genetic Adaptive Systems. PhD Thesis, Univ. of Michigan. 55. De Jong KA, Sarma J (1995) On Decentralizing Selection Algorithm. In: Eselman LJ ed. Proc. of 6-th Conf. on Genetic Algorithms, Morgan Kaufman, 17–23. 56. Deb K (2000) Encoding and Decoding Functions. Chapter 2 in [11]. 57. Deb K, Goldberg D (1989) An investigation of Niche and Species Formation in Genetic Function Optimization. In: Schaffer JD ed. Proc. of Int. Conf. on Genetic Algorithms, Morgan Kaufman, 42–50. 58. Derski W (1975) An Introduction to the Mechanics of Continua. PWN, Warsaw (in Polish). 59. Devroye L (1978) Progressive Global Random Search of Continuous Function. Mathematical Programming 15:330–342. 60. Dixon LCW, Szegö GP eds. (1975) Toward Global Optimization. North Holland, Amsterdam.
61. Dixon LCW, Gomulka J, Szegö GP (1975) Toward Global Optimization. In [60]. 62. Dulęba I, Karcz–Dulęba I (1996) The Analysis of a Discrete Dynamic System Generated by an Evolutionary Process. In: Works of IX Symposium: Simulation of Dynamic Processes, Poland, 351–356 (in Polish). 63. Eiben AE, Raué PE, Ruttkay Z (1994) Genetic Algorithms with Multi-Parent Recombination. In: Davidor Y, Schwefel HP, Mäner R Proc. of PPSN III, Lecture Notes in Computer Science 866, Springer:68–77. 64. Eshelman L (1991) The CHC Adaptive Search Algorithm: How to have Safe Search when Engaging in Nontraditional Genetic Recombination. In: Rawlins G ed. Foundations of Genetic Algorithms, Morgan Kaufman, 265–283. 65. Feller W (1968) An Introduction to Probability. Theory and Applications. John Wiley & Sons Publishers. 66. Findeisen W, Szymanowski J, Wierzbicki A (1980) Theory and Methods of Optimization. PWM, Warsaw (in Polish). 67. Fogarty TC (1989) Varying the Probability of Mutation in the Genetic Algorithm. In: Schaffer JD ed. Proc. of 3-rd Int. Conf. on Genetic Algorithms, San Mateo CA, Morgan Kaufman, 104–109. 68. Fogel TC (1992) Evolving Artificial Intelligence. PhD Thesis, Univ. of California. 69. Freisleben B (1997) Metaevolutionary Approaches. Chapter C.7.2 in [15]. 70. Galar R, Karcz–Dulęba I (1994) The Evolution of Two. An Example of Space of States Approach. In: Sebald AV, Fogel LJ eds. Proc. of the Thrid Annual Conf. on Evolutionary Programming, San Diego CA, Word Scientific, 261–268. 71. Galar R, Kopcouch R (1999) Ipaciency and Polarization in Evolutionary Processes. In: Proc. of 3-rd KAEiOG Conf., Potok Złoty, 115–122 (in Polish). 72. Garnier J, Kaller L, Schoenauer M (1999) Rigorous Hitting Times for Binary Mutation. Evolutionary Computation Vol. 7, No. 2, 167–203. 73. Goldberg D, Richardson J (1987) Genetic Algorithms with Sharing for Multimodal Function Optimization. Genetic Algorithms and Their Applications. In: Proc. of 2-nd Int. Conf. 
on Genetic Algorithms, Lawrence Erlbaum Associates, Inc., 41–49. 74. Goldberg DE (1989) Genetic Algorithms and their Applications. Addison– Wesley. 75. Goldberg D (1989) Sizing Populations for Serial and Parallel Genetic Algorithms. Proc. Third Int. Conf. on Genetic Algorithms, Morgan Kaufman, 70–79. 76. Goldberg DE, Deb K, Clark J (1992) Genetic Algorithms, Noise and the Sizing of Populations. Complex Systems 6:333–362. 77. Greffenstette JJ (1986) Optimization of Control Parameters of Genetic Algorithms. IEEE Transactions on Systems, Man and Cybernetics Vol. SMC-16, No. 1. 78. Greffenstette JJ (1993) Deception Considered Harmful. In: Rawlins G ed. Foundations of Genetic Algorithms, Morgan Kaufman. 79. Greffenstette JJ, Bayer JE (1989) How Genetic Algorithms Work. A Critical Look at Implicit Parallelism. In: Schaffer A ed. Proc. of the 3-rd Int. Conf. on Genetic Algorithms, Morgan Kaufman. 80. Grygiel K (1996) On Asymptotic Properties of a Selection-with-Mutation Operator. In: Proc. of 1-th KAEiOG Conf., Murzasichle, Poland, 50–56.
81. Grygiel K (200) Genetic algorithms with AB-mutation. In: Proc. of 4-th KAEiOG Conf., Lądek Zdrój, Poland, 91–98 (in Polish). 82. Grygiel K (2001) Mathematical models on Evolutionary Algorithms. In: Schaefer R, Sędziwy S eds. Advances in Multi-Agent Systems. Jagiellonian University Press, Kraków, 139–148. 83. Guus C, Boender E, Romeijn EH (1995) Stochastic Methods. In [86]. 84. Harik G, Cantú-Paz E, Goldberg DE, Miller BL (1997) The Gambler’s Ruin Problem, Genetic Algorithms and Sizing Populations. In: Bäck T ed. Proc. of the IEEE Int. Conf. on Evolutionary Computation, Piscataway, NJ, USA:IEEE, 7–12. 85. Holland JH (1975) Adaptation in Natural and Artificial Systems. Univ. of Michigan Press, Ann. Arbor. 86. Horst R, Pardalos PM (1995) Handbook of Global Optimization. Kluwer. 87. Hulin M (1997) An Optimal Stop Criterion for Genetic Algorithm, a Bayesian Approach. In: Bäck T ed. Proc. of ICGA’97, 135–142. 88. Iosifescu M (1988) Finite Markov Processes and their Applications. PWN, Warsaw (in Polish). 89. Jain AK, Murty MN, Flynn PJ (1999) Data Clustering ACM Computing Surveys, Vol. 31, No 3, 264–323. 90. Julstrom BA (1995) What Have You Done for Me Lately? Adapting Operator Probabilities in a Steady-State Genetic Algorithm. In: Eshelman LJ ed. Proc. of ICGA’95, Pittsburg, Pennsylvania, Morgan Kaufman, 81–87. 91. Karcz-Dulęba I (2000) The Dynamics of Two-Element Population in the State Space. The Case of Symmetric Objective Functions. In: Proc. of 4-th KAEiOG Conf., Lądek Zdrój, 115–122 (in Polish). 92. Karcz-Dulęba I (2001) Evolution of Two-Element Population in the Space of Population States. Equilibrium States for Asymetrical Fitness Functions. In: Proc. of 5-th KAEiOG Conf., Jastrzębia Góra, 106–113. 93. Kieś P, Michalewicz Z (2000) Foundations of Genetic Algorithms. Matematyka Stosowana 1:68–91 (in Polish). 94. Kołodziej J (1999) Asymptotic behavior of Simple Genetic Algorithm. In: Proc. of 3-rd KAEiOG Conf., Potok Złoty, 167–174. 95. 
Kołodziej J (2001) Modeling Hierarchical Genetic Strategy as a Family of Markov Chains. In: Proc. of 4-th PPAM Conf., Nałęczów, Poland, LNCS 2328, Springer, 595–598. 96. Kołodziej J (2003) Hierarchical Strategies of the Genetic Global Optimization. PhD Thesis, Jagiellonian University, Faculty of Mathematics and Informatics, Kraków, Poland (in Polish). 97. Kołodziej J, Schaefer R, Paszyńska A (2004) Hierarchical Genetic Computation in Optimal Design. Journal of Theoretical and Applied Mechanics, Vol. 42, No. 3, 78–97. 98. Kołodziej J, Jakubiec W, Starczak M, Schaefer R (2004) Identification of the CMM Parametric Errors by Hierarchical Genetic Strategy Applied. In: Burczyński T, Osyczka A eds. Solid mechanics and its Applications, Vol. 117, Kluwer, 187–196. 99. Koza JR (1992) Genetic Programming. Part 1, 2. MIT Press, (1992–Part 1, 1994–Part 2). 100. Krishnakumar K (1989) Micro-Genetic Algorithms for Stationary and Nonstationary Function Optimization. In: SPIE’s Intelligent Control and Adaptive Systems Conf., Vol. 1196, Philadelphia, PA.
101. Kwietniak D (2006) Chaos in the Devaney sense, its variants and topological entropy. PhD Thesis, Jagiellonian University, Faculty of Mathematics and Informatics, Kraków, Poland (in Polish). 102. Langdon WB, Poli R (2002) Foundation of Genetic Programming. Springer. 103. Lis J (1994) Classification algorithms based on Artificial Neural Networks. PhD thesis, Inst. of Biocybernetics and Biomedical Eng., Polish Academy of Science, Warsaw (in Polish). 104. Lis J (1995) Genetic Algorithms with the Dynamic Probability of Mutation in the Classification Problem. Pattern Recognition Letters 16:1311–1321. 105. Littman M, Ackley D (1991) Adaptation in Constant Utility Nonstationary Environment. In: Belew R, Broker L eds. Proc. of ICGA’91 Conf., San Mateo CA, Morgan Kaufman. 106. Lucas E (1975) Stochastic Convergence. Academic Press, New York. 107. Mahfoud SW (1997) Niching Methods. Chapter C.6.1 in [15]. 108. Manna Z (1974) Mathematical Theory of Computation. Mc Graw Hill, New York, St. Luis, San Francisco. 109. Martin WN, Lieing J, Cohon JP (1997) Island (Migration Models: Evolutionary Algorithms Based on Punctuated Equilibria). Chapter C.6.3 in [15]. 110. Michalewicz Z (1996) Genetic Algorithms + Data Structures = Evolutionary Programs. Springer. 111. Michalewicz Z, Nazhiyath G, Michalewicz M (1966) A note of the usefulness of geometrical crossover for numerical optimization problems. In: eds. Fogel LJ, Angeline PJ, Bäck T Proc. 5-th Ann. Conf. on Evolutionary Programming MIT Press. 112. Miller BL (1997) Noise, Sampling and Efficient Genetic Algorithms. PhD Thesis, Univ. of Illinois at Urbana-Chamapaign. 113. Momot J, Kosacki K, Grochowski M, Uhruski P, Schaefer R (2004) MultiAgent System for Irregular Parallel Genetic Computations. LNCS 3038, Springer, 623–630. 114. Neveu J (1975) Discrete–Parameters Martingales. North Holland, Amsterdam. 115. Nikolaev N, Hitoshi I (2000) Inductive Genetic Programming of Polynominal Learning Networks. In: Yao X ed. Proc. 
of 1-th IEEE Symp. on Combinations of Evolutionary Computation and Neural Networks ECNN-2000, IEEE Press, 158–167. 116. Nix E, Vose D (1992) Modeling Genetic Algorithms with Markov Chains. Annals of Math. and Artificial Intelligence 5(1):79–88. 117. Obuchowicz A (1997) The Evolutionary Search with Soft Selection and Deterioration of the Objective Function. In: Proc. of 6-th Symp. Intelligent Information Systems and Artificial Intelligence IIS’97, Zakopane, Poland, 288–295. 118. Obuchowicz A (1999) Adoption of the Time–Varying Landscape Using an Evolutionary Search with Soft Selection Algorithm. In: Proc. of 3-rd KEGiOG Conf., Potok Złoty, Poland, 245–251. 119. Obuchowicz A (2003) Evolutionary Algorithms for Global Optimization and Dynamic System Diagnosis. University of Zielona Góra Press. 120. Obuchowicz A, Korbicz J (1999) Evolutionary Search with Soft Selection Algorithms in Parameter Optimization. In: Proc. of the PPAM’99 Conf., Kazimierz Dolny, Poland, 578–586. 121. Obuchowicz A, Patan K (1997) An Algorithm of Evolutionary Search with Soft Selection for Training Multi-layered Feed Forwarded Networks. In: Proc. of the
3-rd Conf. Neural Networks and their Applications KSN'97, Kule, Poland, 123–128. 122. Obuchowicz A, Patan K (1997) About some Evolutionary Algorithm Cases. In: Proc. of the 2-nd KAGiOG Conf., Rytro, Poland, 193–200 (in Polish). 123. Ombach J (1993) Introduction to Probability Theory. Jagiellonian University Press, Textbooks 686 (in Polish). 124. Osyczka A (2002) Evolutionary Algorithms for Single and Multicriteria Design Optimization. Springer. 125. Pelczar A (1989) Introduction to Theory of Differential Equations. Part II – Elements of the Quantitative Theory of Differential Equations. PWN, Warsaw, Poland (in Polish). 126. Petty CC (1997) Diffusion (Cellular) Models. Chapter C.6.4 in [15]. 127. Stopping Rules for the Multistart Method when Different Local Minima Have Different Function Values. Optimization 21. 128. Pintér JD (1996) Global Optimization in Action. Kluwer. 129. Podolak I, Telega H (2005) Hill crunching clustered genetic search and its improvements. Nowy Sącz Academic Review 2:9–69. 130. Podsiadło M (1996) Some Remarks About the Schemata Theorem. In: Proc. of 1-th KAEiOG Conf., Murzasichle, Poland, 119–126 (in Polish). 131. Preparata FP, Shamos MI (1985) Computational Geometry. Springer. 132. Raptis S, Tzefastas S (1998) A Blueprint for a Genetic Meta-Algorithm. In: Proc. 6-th European Conf. on Intelligent Techniques & Soft Computing, Verlag Maintz, Vol. 1, 429–433. 133. Rechenberg I (1978) Evolutionsstrategien. In: Schneider B, Ranft U eds. Simulationsmethoden in der Medizin und Biologie, Springer, 83–114 (in German). 134. Reeves CR, Rowe JE (2003) Genetic Algorithms: Principles and Perspectives. A Guide to the GA Theory. Kluwer Academic Publishers. 135. Renders JM, Bersini H (1994) Hybridizing genetic algorithms with hill-climbing methods for global optimization: two possible ways. In: Proc. of 1-st IEEE Conf. on Evolutionary Computation, IEEE Press, 312–317. 136. Richardson M (1935) On the homology characters of symmetric products. Duke Math. J. 1, No. 1, 50–69. 137. Rinnooy Kan AHG, Timmer GT (1987) Stochastic Global Optimization Methods. Part 1: Clustering Methods. Mathematical Programming 39:27–56. 138. Rinnooy Kan AHG, Timmer GT (1987) Stochastic Global Optimization Methods. Part 2: Multi Level Methods. Mathematical Programming 39:57–78. 139. Rosenberg RS (1967) Simulation of Genetic Populations with Biochemical Properties. PhD Thesis, Univ. of Michigan, Dissertation Abstracts International 28(7), 2732B, (Univ. Microfilms No. 67–17, 836). 140. Rudolph G (1994) Convergence of Non-Elitist Strategies. In: Proc. of 1-th IEEE Conf. on Computational Intelligence Vol. 1, (Piscataway, NJ:IEEE), 63–66. 141. Rudolph G (1996) Convergence of Evolutionary Algorithms in General Search Spaces. In: Proc. of 3-rd IEEE Conf. on Evolutionary Computations ICEC, IEEE Press, 50–54. 142. Rudolph G (1994) How Mutation and Selection Solve Long Path Problems in Polynomial Expected Time. Evolutionary Computation, Vol. 2, No. 2, 207–211. 143. Rudolph G (1997) Stochastic Processes. Chapter B.2.2 in [15].
References
144. Rudolph G (1997) Models of Stochastic Convergence. Chapter B.2.3 in [15].
145. Rudolph G (2000) Evolution Strategies. Chapter 9 in [10].
146. Rzewuski M, Szreter M, Arabas J (1997) Looking for the More Effective Operators for Evolution Strategies. In: Proc. of the 2-nd KAEiOG Conf., Rytro, Poland, 237–243 (in Polish).
147. Schaefer R (2000) Adaptability and Self-Adaptability in Genetic Global Optimization. In: Proc. of AIMETH'00, Gliwice, Poland, 291–298.
148. Schaefer R (2001) Simple Taxonomy of the Genetic Global Optimization. Computer Assisted Mechanics and Engineering Sciences CAMES 9:139–145.
149. Schaefer R (with the chapter 6 written by Telega H) (2002) An Introduction to the Global Genetic Optimization. Jagiellonian University Press (in Polish).
150. Schaefer R (2003) Essential Features of Genetic Strategies. In: Proc. of the CMM'03 Conf., Wisła, Poland, 41–42.
151. Schaefer R (2004) Detailed Evaluation of the Schemata Cardinality Modification at the Single Evolution Step. In: Proc. of the 7-th KAEiOG Conf., Kazimierz, Poland, 143–147.
152. Schaefer R, Adamska K, Telega H (2004) Genetic Clustering in Continuous Landscape Exploration. Engineering Applications of Artificial Intelligence EAAI, Elsevier, 17:407–416.
153. Schaefer R, Adamska K (2004) Well-Tuned Genetic Algorithm and its Advantage in Detecting Basins of Attraction. In: Proc. of the 7-th KAEiOG Conf., Kazimierz, Poland, 149–154.
154. Schaefer R, Jabłoński ZJ (2002) On the Convergence of Sampling Measures in Global Genetic Search. LNCS 2328, Springer, 593–600.
155. Schaefer R, Jabłoński ZJ (2002) How to Gain More Information from the Evolving Population? Chapter in: Arabas J ed. Evolutionary Computation and Global Optimization, Warsaw Technical University Press, 21–33.
156. Schaefer R, Kołodziej J, Gwizdała R, Wojtusiak J (2000) How Simpletons can Increase the Community Development – an Attempt to Hierarchical Genetic Computation. In: Proc. of 4-th KAEiOG Conf., Lądek Zdrój, 187–198.
157. Schaefer R, Kołodziej J (2003) Genetic Search Reinforced by the Population Hierarchy. In: De Jong KA, Poli R, Rowe JE eds. Foundations of Genetic Algorithms 7, Morgan Kaufman, 383–399.
158. Schaefer R, Telega H, Kołodziej J (1999) Genetic Algorithm as a Markov Dynamic System. In: Proc. of the Int. Conf. on Intelligent Techniques in Robotics, Control and Decision Making, Polish-Japanese Institute of Information Technology Press, Warsaw.
159. Schaffer JD, Morishima A (1987) An Adaptive Crossover Distribution Mechanism for Genetic Algorithms. In: Grefenstette JJ ed. Proc. of 2-nd Int. Conf. on Genetic Algorithms, Lawrence Erlbaum, Hillsdale, NJ, 36–40.
160. Schlierkamp-Voosen D, Mühlenbein H (1996) Adaptation of Population Sizes by Competing Subpopulations. In: Proc. of 1-st IEEE Conf. on Evolutionary Computation, 330–335.
161. Schraudolph NN, Belew RK (1992) Dynamic parameter encoding for genetic algorithms. Machine Learning Journal, Volume 9, Number 1, 9–21.
162. Schwartz L (1967) Analyse Mathématique. Hermann, Paris (in French).
163. Schwefel HP (1977) Numerische Optimierung von Computer-Modellen mittels der Evolutionsstrategie. Interdisciplinary Systems Research 26, Birkhäuser, Basel (in German).
164. Schwefel HP, Bäck T (1997) Artificial Evolution: How and Why? In: Proc. of EUROGEN'97, Wiley, 1–19.
165. Schwefel HP, Rudolph G (1995) Contemporary Evolution Strategies. In: Morán F ed. Advances in Artificial Life. Proc. of 3-rd Conf. on Artificial Life, LNCS 929, Springer, 893–907.
166. Seredyński F (1998) New Trends in Parallel and Distributed Evolutionary Computing. Fundamenta Informaticae 35:211–230.
167. Shaw JEH (1988) A Quasirandom Approach to Integration in Bayesian Statistics. The Annals of Statistics 16:895–914.
168. Skiena S (2003) Implementing Discrete Mathematics: Combinatorics and Graph Theory with Mathematica. Cambridge University Press.
169. Skolicki Z, De Jong K (2004) Improving Evolutionary Algorithms with Multi-representation Island Models. In: Proc. of 8-th Int. Conf. on Parallel Problem Solving from Nature – PPSN VIII, Birmingham, UK, LNCS, Vol. 3242, Springer, 420–429.
170. Skolicki Z, De Jong K (2005) The influence of migration sizes and intervals on island models. In: Proc. of Genetic and Evolutionary Computation Conference GECCO-2005, Washington DC, ACM Press, 1295–1302.
171. Slavov V, Nikolaev NI (1997) Inductive Genetic Programming and the Superposition of Fitness Landscapes. In: Bäck T ed. Proc. of 7-th Int. Conf. on Genetic Algorithms ICGA-97, Morgan Kaufman, 97–104.
172. Slavov V, Nikolaev NI (1999) Genetic Algorithms, Fitness Sublandscapes and Subpopulations. In: Banzhaf W, Reeves C eds. Foundations of Genetic Algorithms 5, Morgan Kaufman.
173. Smith PA (1933) The topology of involutions. In: Proc. Nat. Acad. Sci. 19, No. 6:612–618.
174. Smith S (1998) The Simplex Method and Evolutionary Algorithms. In: Proc. of ICEC'98 Conf., Alaska, USA, IEEE Press.
175. Sobol IM (1982) On the Estimate of the Accuracy of a Simple Multidimensional Search. Soviet. Math. Dokl. 26:398–401.
176. Sobol IM, Statnikov RB (1981) The Selection of Optimal Parameters in the Multiparameter Problems. Nauka, Moscow (in Russian).
177. Spears WM (1994) Simple Subpopulation Systems. In: Proc. of 3-rd Annual Conf. on Evolutionary Programming, World Scientific, 296–307.
178. Spears WM (2000) Evolutionary Algorithms. Springer.
179. Spivak M (1969) Calculus on Manifolds. W.A. Benjamin, Inc., New York, Amsterdam.
180. Stadler P (1995) Towards a Theory of Landscapes. In: López-Peña R, Capovilla R, García-Pelayo R, Waelbroeck H, Zertuche F eds. Complex Systems and Binary Networks, Springer, 77–173.
181. Stańczak J (1999) The Development of the Algorithm Concept for Self-Refinement of Evolutionary Systems. PhD Thesis, Warsaw Technical University, Warsaw, Poland.
182. Stańczak J (2000) Evolutionary Algorithms with the Population of "Intelligent" Individuals. In: Proc. of 4-th KAEiOG Conf., Lądek Zdrój, 207–218 (in Polish).
183. Sukharev AG (1971) Optimal Strategies of the Search for the Extremum. Computational Mathematics and Mathematical Physics 11:119–137.
184. Telega H (1999) Parallel Algorithms for Solving Selected Inverse Problems. PhD Thesis, AGH University of Science and Technology, Kraków, Poland.
185. Telega H, Schaefer R (1999) Advances and Drawbacks of a Genetic Clustering Strategy. In: Proc. of 3-rd KAEiOG Conf., Potok Złoty, Poland, 291–300.
186. Telega H, Schaefer R (2000) Testing the Genetic Clustering with SGA Evolutionary Engine. In: Proc. of 4-th KAEiOG Conf., Lądek Zdrój, Poland, 227–263.
187. Thierens D (1995) Analysis and Design of Genetic Algorithms. PhD Thesis, Katholieke Univ. Leuven, Leuven, Belgium.
188. Törn A (1975) A Clustering Approach to Global Optimization. In [60].
189. Törn A (1976) Cluster Analysis Using Seed Points and Density Determined Hyperspheres with an Application to Global Optimization. In: Proc. of the 3-rd Int. Conf. on Pattern Recognition, Coronado, California, 394–398.
190. Törn A, Viitanen S (1994) Topographical Global Optimization Using Pre-Sampled Points. Journal of Global Optimization 5:267–276.
191. Vose MD (1996) Modeling Simple Genetic Algorithms. Evolutionary Computation 3(3):453–472.
192. Vose MD (1997) Logarithmic Convergence of Random Heuristic Search. Evolutionary Computation 4(4):395–404.
193. Vose MD (1999) The Simple Genetic Algorithm. MIT Press.
194. Vose MD, Liepins GE (1991) Punctuated Equilibria in Genetic Search. Complex Systems 5:31–34.
195. Vose MD, Wright AD (1995) Stability of Vertex Fixed Points and Applications. Evolutionary Computation.
196. Whitley D (1994) A Genetic Algorithm Tutorial. Statistics and Computing 4:65–85.
197. Whitley D, Gordon VS, Mathias K (1994) Lamarckian Evolution, the Baldwin Effect and Function Optimization. In: Davidor Y, Schwefel HP, Männer R eds. Lecture Notes in Computer Science 866:6–15, Springer, Berlin.
198. Whitley D, Gordon VS (1993) Serial and Parallel Genetic Algorithms as Function Optimizers. In: Forrest S ed. Proc. of ICGA'93, Morgan Kaufman, San Mateo, CA, 177–183.
199. Whitley D, Mathias K, Fitzhorn P (1991) Delta Coding: An Iterative Search Strategy for Genetic Algorithms. In: Belew RK, Booker LB eds. Proc. of the 4-th Int. Conf. on Genetic Algorithms, Morgan Kaufman, San Mateo, CA, 77–84.
200. Whitley D, Rana S, Heckendorn RB (1997) Island Model Genetic Algorithms and Linearly Separable Problems. In: Proc. of AISB'97 Workshop on Evolutionary Computing, Manchester, 112–129.
201. Wierzba B, Semczuk A, Kołodziej J, Schaefer R (2003) Hierarchical Genetic Strategy with real number encoding. In: Proc. of 6-th KAEiOG Conf., Łagów Lubuski, Poland, 231–237.
202. Wit R (1986) Nonlinear Programming Methods. WNT, Warsaw (in Polish).
203. Wright AH (1994) Genetic Algorithms for Real Parameter Optimization. In: Davis L ed. Foundations of Genetic Algorithms, Morgan Kaufman, 205–218.
204. Wright AH (2000) The Exact Schema Theorem. Technical report, University of Montana, Missoula, MT 59812, USA, 1999. http://www.cs.umt.edu/u/wright/, http://citeseer.ist.psu.edu/article/wright99exact.html
205. Wright AH, Rowe J, Stephens Ch, Poli R (2003) Bistability in a Gene Pool GA with Mutation. In: De Jong KA, Poli R, Rowe JE eds. Foundations of Genetic Algorithms 7, Morgan Kaufman, 63–80.
206. Yen J, Liao J, Lee B, Randolph D (1998) A Hybrid Approach to Modeling Metabolic Systems Using a Genetic Algorithm and Simplex Method. IEEE Transactions on Systems, Man and Cybernetics, Part B: Cybernetics 28:173–191.
207. Zeidler E (1985) Nonlinear Functional Analysis and its Applications (II.1.A. Linear Monotone Operators, II.1.B. Nonlinear Monotone Operators, III. Variational Methods and Optimization). Springer.
208. Zieliński R (1981) A Stochastic Estimate of the Structure of Multi-Extremal Problems. Math. Programming 21:348–356.
209. Zieliński R, Neuman P (1986) Stochastic Methods of Searching Function Minima. WNT, Warsaw (in Polish).
210. Zienkiewicz OC, Taylor R (1991) The Finite Element Method, Vol. 1, 2 (fourth edition), McGraw-Hill, London.
Index
(r − 1)-dimensional simplex Λr−1 in Rr, 56, 57, 61, 63, 71–73, 77
ε-descent local optimization method, 155
"almost surely" convergence, 92, 94
AB-mutation, 97
adaptive genetic algorithm, 4, 116
affine space, 7
arithmetic crossover, 22, 24, 49, 150
asymptotic correctness in the probabilistic sense, 5, 20, 60, 68, 112, 153, 160, 161, 165, 170, 173, 185
asymptotic guarantee of success, 20, 60, 68, 112, 153, 160, 161, 170, 173, 184
basin of attraction, 180
basin of attraction Bx+, 4, 12, 27, 30, 84, 117, 127–129, 136, 137, 139, 144
Bayesian stopping rule, 18, 113, 160, 183
binary affine encoding codea, 34, 37, 38, 80, 96, 142
binary crossover, 44, 46, 47, 125
binary encoding Hierarchic Genetic Strategy HGS, 147
binary genetic universum Ω, 34, 43, 47, 55, 61, 62, 80, 96, 105, 142, 143, 146
binary schemata H, 105
bistability, 84
cardinality operator #, 14
cataclysmic mutation, 140
cellular genetic algorithm, 141
cluster analysis, 158
Clustered Genetic Search CGS, 5, 18, 21, 28–30, 113, 159, 179, 181
Clustered Genetic Search CGS, tests, 185
clustering methods in global optimization, 158, 160
clusters, 158, 180
complete convergence, 20, 92, 94
concentration of random points, 159
convergence in mean, 94
convergence in probability, 92, 94
covariance matrix C, 49, 99, 123, 124
crossover, 1, 24, 43, 46, 47, 58, 61, 62, 64, 72, 73, 88, 93, 107, 109, 110, 115, 116, 118, 122, 124, 125, 130, 132, 135, 139, 141
crossover mask, 44, 45, 107, 125
crossover probability distribution crossx,y, 46
crossover rate pc, 45, 116, 122
crossover type type, 45, 116, 122
crowding, 140
degree of schemata ℵ(H), 106
Delta Coding, 143
deme, 31, 52, 144
Density Clustering DC, 160, 162
density of the normal random variable ρN(m,C), 88, 99, 124
discrete probabilistic measure on the admissible set θ ∈ M(D), 33, 77
distance function d : V × V → R+, 7, 18, 124
domain operator Dom(·), 32
Dynamic Parameter Encoding DPE, 142
dynamic semi-system, 64
elitist selection, 40, 42, 88
encoding function code : U → Dr, 32, 133
equivalence eqp, 15, 56, 88, 98
ergodicity, 60, 67, 93, 111
evolutionary algorithm EA, 38, 48, 87
evolutionary channel, 102
fitness function f, 25, 38, 39, 61, 62, 64, 70, 80, 91, 92, 95, 100, 102, 103, 106, 112, 113, 132, 135, 139
fitness scaling Scale, 39, 133
Gamma Euler's function, 164
genetic algorithm heuristics, 60, 77
genetic epoch, 39
genetic material of population, 39
genetic operation, 18, 24, 39, 43, 48, 52, 117, 122, 130, 132
genetic operator G, 4, 61, 69
genetic universum (set of all genotypes) U, 32, 34, 55, 87, 116
genetic universum cardinality r, 32, 56
genotype, 22, 24, 26, 32, 47, 61
geometric crossover, 50
global convergence, 20
global extreme, 4
global maximizer, 9
global optimization method, 1
global optimization problem, 3, 7, 8, 55, 118, 133, 137
gradient mutation, 24, 27
Gray encoding, 37, 78
Hamming cliffs, 38
Hamming metrics, 37
Hilbert space, 7
identification problem, 24
independent mutation, 46
individual, 31, 39, 41
Inductive Genetic Programming iGP, 151
initial sample creation, 17
intermediate sample Pt, 52
internal reflection rule, 51
inverse binary affine encoding dcodea, 36, 78
inverse encoding partial function (decoding) dcode : D → U, 32, 133
island model, 146
iterate of the Markov transition function τ(t), 59
Lamarckian evolution, 18, 21
length of schemata ∆(H), 106
Lebesgue measure meas(·), 3, 9, 33, 77
level set, 11, 82, 84, 87, 91, 113
Lipschitz boundary, 8, 51, 122
local extreme, 4, 7, 13, 18, 30, 112, 127–129, 134–137, 139, 141, 144, 145, 147, 151
local extremes filtering, 30, 190
local isolated maximizer, 9
local maximizer, 9, 10, 20, 79–81, 83, 84, 87, 105, 117, 118
local optimization method loc, 11, 18, 133
logarithmic convergence, 73
Markov chain, 2, 55, 57–61, 67, 88, 93, 111, 112, 116
mating, 141
maximum individual life time κ, 127
metaevolution, 145
Micro-Genetic Algorithm µGA, 140
mixing, 46, 47, 89
mixing matrix M, 62
mixing operator M, 62, 69
mixing probability distribution mx,y, 47
Mode Analysis MA, 168
monochromatic population, 57, 72, 112
Monte Carlo, 1, 19
Multi Level Single Linkage MLSL, 158, 160, 170
multi-deme strategy, 52, 53, 115, 118, 142, 144
multi-point mutation, 44, 47, 97, 107
multimodal problem, 1
multiset, 14, 47
Multistart, 157
mutation, 1, 115, 116, 122, 126, 132
mutation mask, 44, 46, 97, 107, 108
mutation probability distribution mutx, 44
mutation rate pm, 44, 97, 116, 122, 123
niching, 136
non-sequential stopping rules, 175, 176
normal phenotypic mutation, 24, 48, 88, 99, 123, 150
normal random variable N(m, C), 49, 88, 99, 124
objective function of the maximization problem Φ, 8, 38, 91, 96, 118, 133, 153
objective function of the minimization problem Φ̃, 10, 23, 153
occurrence function η, 14
offspring Ot, 52, 140
offspring cardinality λ, 52, 87
one-point crossover, 45, 107, 109
one-step-look-ahead suboptimal stopping rule, 178
permutational power, 15
phenotype, 32, 135
phenotypic encoding, 3, 4, 38, 87, 98, 126
piecewise constant measure density ρθ(·), 34, 78
population, 1, 14, 22, 26, 39, 57, 106
population cardinality µ, 40, 42, 52, 56, 87
population frequency vector, 56, 61
preselection, 140
probabilistic measure on the set of phenotypes θ ∈ M(Dr), 33, 77
probability transition function τ, 58, 88, 116
proportional selection, 40, 41, 98
proportional selection operator F, 62
Pure Random Search PRS, 19, 154, 156
random variable with the uniform distribution U[0, 1], 49
random walk, 17, 98, 113, 151
rank selection, 22, 24, 40, 42
Raster Multistart RMultistart, 186
Rastrigin function, 187
Real Encoding Hierarchic Genetic Strategy HGS-RN, 149
reduction of random points, 159
regular genetic operator, 72
repairing operation, 51, 150
reproduction, 52
Rosenbrock function, 192
sample evaluation, 18
sample modification, 18
sample reduction, 18
search space, space of solutions V, 7
selection, 1, 40, 115
selection operator F, 69
selection probability distribution self, 40
self-adaptive genetic algorithm, 4, 116, 130
sequential niching, 137
sequential stopping rules, 175, 177
set of admissible solutions D, 8, 31
set of attraction, 158
set of attraction Rxloc+, 11, 134
set of non-negative real numbers R+ = {x ∈ R; x ≥ 0}, 7
set of phenotypes Dr, 32
SGA sampling measure θ̂ ∈ M(Dr), 77
sharing function, 136
Simple Genetic Algorithm SGA, 2, 19, 30, 47, 53, 61, 119, 146, 147, 180, 181
simplex crossover, 50
Single Linkage SL, 160, 166
Single-Start, 157
space of linear mappings L(Rr → Rr), 69
space of probabilistic measures on the admissible domain M(D), 17
space of schemata S, 105
space of states of a genetic algorithm E, 55, 57, 59–61, 69, 75–77, 88, 89, 91–93, 97, 98, 100, 111, 116
speciation, 141
standard deviation σ, 82, 88, 99, 100, 104, 123, 126, 138, 140
state probability distribution πµt, 57, 112, 116
stopping rule, 3, 5, 13, 18, 19, 138, 174
strictly descent local optimization method, 11, 155
subclusters, 180
succession, 52
supermartingale, 93
support of the function η, supp(η) = {y ∈ Dom(η); η(y) > 0}, 16
surrounding effect, 49
topograph, 173
Topographical Global Optimization TGO, 173
topographical methods, 173
Topographical Multilevel Single Linkage TMSL, 174
topological optimization, 21
topology in V, top(V), 7
total fitness of population, 41
tournament mate, 42
tournament selection, 24, 40, 42
transition probability matrix Q, 59, 61, 64
two-phase global optimization strategy, 4, 18, 27, 153
uniform crossover, 46
vector of parameters u(t), 19, 58, 117
well tuned SGA, 30, 79, 113, 184