Ajith Abraham, Crina Grosan and Witold Pedrycz (Eds.) Engineering Evolutionary Intelligent Systems
Studies in Computational Intelligence, Volume 82
Editor-in-chief: Prof. Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences, ul. Newelska 6, 01-447 Warsaw, Poland
Further volumes of this series can be found on our homepage: springer.com
Vol. 60. Vladimir G. Ivancevic and Tijana T. Ivancevic Computational Mind: A Complex Dynamics Perspective, 2007 ISBN 978-3-540-71465-1
Vol. 61. Jacques Teller, John R. Lee and Catherine Roussey (Eds.) Ontologies for Urban Development, 2007 ISBN 978-3-540-71975-5
Vol. 62. Lakhmi C. Jain, Raymond A. Tedman and Debra K. Tedman (Eds.) Evolution of Teaching and Learning Paradigms in Intelligent Environment, 2007 ISBN 978-3-540-71973-1
Vol. 63. Wlodzislaw Duch and Jacek Mańdziuk (Eds.) Challenges for Computational Intelligence, 2007 ISBN 978-3-540-71983-0
Vol. 64. Lorenzo Magnani and Ping Li (Eds.) Model-Based Reasoning in Science, Technology, and Medicine, 2007 ISBN 978-3-540-71985-4
Vol. 65. S. Vaidya, L.C. Jain and H. Yoshida (Eds.) Advanced Computational Intelligence Paradigms in Healthcare-2, 2007 ISBN 978-3-540-72374-5
Vol. 66. Lakhmi C. Jain, Vasile Palade and Dipti Srinivasan (Eds.) Advances in Evolutionary Computing for System Design, 2007 ISBN 978-3-540-72376-9
Vol. 67. Vassilis G. Kaburlasos and Gerhard X. Ritter (Eds.) Computational Intelligence Based on Lattice Theory, 2007 ISBN 978-3-540-72686-9
Vol. 68. Cipriano Galindo, Juan-Antonio Fernández-Madrigal and Javier Gonzalez A Multi-Hierarchical Symbolic Model of the Environment for Improving Mobile Robot Operation, 2007 ISBN 978-3-540-72688-3
Vol. 69. Falko Dressler and Iacopo Carreras (Eds.) Advances in Biologically Inspired Information Systems: Models, Methods, and Tools, 2007 ISBN 978-3-540-72692-0
Vol. 70. Javaan Singh Chahl, Lakhmi C. Jain, Akiko Mizutani and Mika Sato-Ilic (Eds.) Innovations in Intelligent Machines-1, 2007 ISBN 978-3-540-72695-1
Vol. 71. Norio Baba, Lakhmi C. Jain and Hisashi Handa (Eds.) Advanced Intelligent Paradigms in Computer Games, 2007 ISBN 978-3-540-72704-0
Vol. 72. Raymond S.T. Lee and Vincenzo Loia (Eds.) Computation Intelligence for Agent-based Systems, 2007 ISBN 978-3-540-73175-7
Vol. 73. Petra Perner (Ed.) Case-Based Reasoning on Images and Signals, 2008 ISBN 978-3-540-73178-8
Vol. 74. Robert Schaefer Foundation of Global Genetic Optimization, 2007 ISBN 978-3-540-73191-7
Vol. 75. Crina Grosan, Ajith Abraham and Hisao Ishibuchi (Eds.) Hybrid Evolutionary Algorithms, 2007 ISBN 978-3-540-73296-9
Vol. 76. Subhas Chandra Mukhopadhyay and Gourab Sen Gupta (Eds.) Autonomous Robots and Agents, 2007 ISBN 978-3-540-73423-9
Vol. 77. Barbara Hammer and Pascal Hitzler (Eds.) Perspectives of Neural-Symbolic Integration, 2007 ISBN 978-3-540-73953-1
Vol. 78. Costin Badica and Marcin Paprzycki (Eds.) Intelligent and Distributed Computing, 2008 ISBN 978-3-540-74929-5
Vol. 79. Xing Cai and T.-C. Jim Yeh (Eds.) Quantitative Information Fusion for Hydrological Sciences, 2008 ISBN 978-3-540-75383-4
Vol. 80. Joachim Diederich Rule Extraction from Support Vector Machines, 2008 ISBN 978-3-540-75389-6
Vol. 81. K. Sridharan Robotic Exploration and Landmark Determination, 2008 ISBN 978-3-540-75393-3
Vol. 82. Ajith Abraham, Crina Grosan and Witold Pedrycz (Eds.) Engineering Evolutionary Intelligent Systems, 2008 ISBN 978-3-540-75395-7
Ajith Abraham Crina Grosan Witold Pedrycz (Eds.)
Engineering Evolutionary Intelligent Systems With 191 Figures and 109 Tables
Ajith Abraham Centre for Quantifiable Quality of Service in Communication Systems (Q2S) Centre of Excellence Norwegian University of Science and Technology O.S. Bragstads plass 2E N-7491 Trondheim Norway
Crina Grosan Department of Computer Science Faculty of Mathematics and Computer Science Babes-Bolyai University Cluj-Napoca, Kogalniceanu 1 400084 Cluj - Napoca Romania
Witold Pedrycz Department of Electrical & Computer Engineering University of Alberta ECERF Bldg., 2nd floor Edmonton AB T6G 2V4 Canada
ISBN 978-3-540-75395-7
e-ISBN 978-3-540-75396-4
Studies in Computational Intelligence ISSN 1860-949X
Library of Congress Control Number: 2007938406
© 2008 Springer-Verlag Berlin Heidelberg
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.
The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
Cover Design: Deblik, Berlin, Germany
Printed on acid-free paper
springer.com
Contents
Engineering Evolutionary Intelligent Systems: Methodologies, Architectures and Reviews
Ajith Abraham and Crina Grosan
  1 Introduction
  2 Architectures of Evolutionary Intelligent Systems
  3 Evolutionary Artificial Neural Networks
    3.1 Evolutionary Search of Connection Weights
    3.2 Evolutionary Search of Architectures
    3.3 Evolutionary Search of Learning Rules
    3.4 Recent Applications of Evolutionary Neural Networks in Practice
  4 Evolutionary Fuzzy Systems
    4.1 Evolutionary Search of Fuzzy Membership Functions
    4.2 Evolutionary Search of Fuzzy Rule Base
    4.3 Recent Applications of Evolutionary Fuzzy Systems in Practice
  5 Evolutionary Clustering
  6 Recent Applications of Evolutionary Design of Complex Paradigms
  7 Multiobjective Evolutionary Design of Intelligent Paradigms
  8 Conclusions
  References

Genetically Optimized Hybrid Fuzzy Neural Networks: Analysis and Design of Rule-based Multi-layer Perceptron Architectures
Sung-Kwun Oh and Witold Pedrycz
  1 Introductory remarks
  2 The architecture of conventional Hybrid Fuzzy Neural Networks (HFNN)
  3 The architecture and development of genetically optimized HFNN (gHFNN)
    3.1 Fuzzy neural networks based on genetic optimization
    3.2 Genetically optimized PNN (gPNN)
    3.3 Optimization of gHFNN topologies
  4 The algorithms and design procedure of genetically optimized HFNN (gHFNN)
    4.1 The premise of gHFNN: in case of FS FNN
    4.2 The consequence of gHFNN: in case of gPNN combined with FS FNN
  5 Experimental studies
    5.1 Nonlinear function
    5.2 Gas furnace process
    5.3 NOx emission process of gas turbine power plant
  6 Concluding remarks
  7 Acknowledgement
  References

Genetically Optimized Self-organizing Neural Networks Based on Polynomial and Fuzzy Polynomial Neurons: Analysis and Design
Sung-Kwun Oh and Witold Pedrycz
  1 Introduction
  2 The architecture and development of the self-organizing neural networks (SONN)
    2.1 Polynomial Neuron (PN) based SONN and its topology
    2.2 Fuzzy Polynomial Neuron (FPN) based SONN and its topology
  3 Genetic optimization of SONN
  4 The algorithm and design procedure of genetically optimized SONN (gSONN)
  5 Experimental studies
    5.1 Gas furnace process
    5.2 Chaotic time series
  6 Concluding remarks
  7 Acknowledgements
  References

Evolution of Inductive Self-organizing Networks
Dongwon Kim and Gwi-Tae Park
  1 Introduction
  2 Design of EA-based SOPNN
    2.1 Representation of chromosome for appropriate information of each PD
    2.2 Fitness function for modelling
  3 Simulation Results
    3.1 Gas furnace process
    3.2 Three-input nonlinear function
  4 Conclusions
  5 Acknowledgement
  References

Recursive Pattern based Hybrid Supervised Training
Kiruthika Ramanathan and Sheng Uei Guan
  1 Introduction
    1.1 Motivation
    1.2 Organization of the chapter
  2 Some preliminaries
    2.1 Notation
    2.2 Simplified architecture
    2.3 Problem formulation
    2.4 Variable length genetic algorithm
    2.5 Pseudo global optima
  3 The RPHS training algorithm
    3.1 Hybrid recursive training
    3.2 Testing
  4 Summary of the RPHS algorithm
  5 The two spiral problem
  6 Heuristics for making the RPHS algorithm better
    6.1 Minimal coded genetic algorithms
    6.2 Separability
    6.3 Computation intensity and population size
    6.4 Validation data
    6.5 Early stopping
  7 Experiments and results
    7.1 Problems considered
    7.2 Experimental parameters and control algorithms implemented
    7.3 Experimental results
  8 Conclusions
  References

Enhancing Recursive Supervised Learning Using Clustering and Combinatorial Optimization (RSL-CC)
Kiruthika Ramanathan and Sheng Uei Guan
  1 Introduction
  2 Some preliminaries
    2.1 Notation
    2.2 Problem formulation for recursive learning
    2.3 Related work
  3 The RSL-CC algorithm
    3.1 Pre-training
    3.2 Training
    3.3 Simulation
  4 Details of the RSL-CC algorithm
    4.1 Illustration
    4.2 Termination criteria
  5 Heuristics for improving the performance of the RSL-CC algorithm
    5.1 Minimal coded genetic algorithms
    5.2 Population size
    5.3 Number of generations
    5.4 Duplication of chromosomes
    5.5 Problems Considered
    5.6 Experimental parameters and control algorithms implemented
    5.7 Results
  6 Conclusions and future directions
  References

Evolutionary Approaches to Rule Extraction from Neural Networks
Urszula Markowska-Kaczmar
  1 Introduction
  2 The basics of neural networks
  3 Rule extraction from neural networks
    3.1 Problem formulation
    3.2 The existing methods of rule extraction from neural networks
  4 Basic concepts of evolutionary algorithms
  5 Evolutionary methods in rule extraction from neural networks
    5.1 Local approach
    5.2 Evolutionary algorithms in a global approach to rule extraction from neural networks
  6 Conclusion
  References

Cluster-wise Design of Takagi and Sugeno Approach of Fuzzy Logic Controller
Tushar and Dilip Kumar Pratihar
  1 Introduction
  2 Takagi and Sugeno Approach of FLC
  3 Genetic Algorithm
  4 Clustering and Linear Regression Analysis Using the Clustered Data
    4.1 Entropy-based Fuzzy Clustering (EFC)
    4.2 Approach 1: Cluster-wise Linear Regression
  5 GA-based Tuning of Takagi and Sugeno Approach of FLC
    5.1 Genetic-Fuzzy System
  6 Results and Discussion
    6.1 Modeling of Abrasive Flow Machining (AFM) Process
    6.2 Modeling of Tungsten Inert Gas (TIG) Process
  7 Concluding Remarks
  8 Scope for Future Work
  References

Evolutionary Fuzzy Modelling for Drug Resistant HIV-1 Treatment Optimization
Mattia Prosperi and Giovanni Ulivi
  1 Introduction
    1.1 Artificial Intelligence in Medicine
    1.2 Road Map
  2 Background on HIV Treatment and Drug Resistance Onset
    2.1 HIV Replication and Treatment Design
    2.2 Experimental Settings and Data Collection
  3 Machine Learning for Drug Resistant HIV
  4 Fuzzy Modelling for HIV Drug Resistance Interpretation
    4.1 Fuzzy Medical Diagnosis
    4.2 Fuzzy Relational System for In-Vitro Cultures
    4.3 Models for In-Vivo Clinical Data
  5 Optimization Techniques
    5.1 Fuzzy Genetic Algorithms
    5.2 Random Searches
    5.3 Implementation
    5.4 Feature Selection
  6 Application
    6.1 Phenotype Prediction
    6.2 In-Vivo Prediction
    6.3 Conclusions
  7 Acknowledgements
  References

A New Genetic Approach for Neural Network Design
Antonia Azzini and Andrea G.B. Tettamanzi
  1 Introduction
  2 Evolving ANNs
  3 Neuro-Genetic Approach
    3.1 Evolutionary Algorithm
    3.2 Individual Encoding
    3.3 Fitness Function
    3.4 Selection
    3.5 Mutation
    3.6 Recombination
  4 Real-World Applications
    4.1 Fault Diagnosis Problem
    4.2 Brain Wave Analysis
    4.3 Financial Modeling
  5 Conclusion and Future Work
  References

A Grammatical Genetic Programming Representation for Radial Basis Function Networks
Ian Dempsey, Anthony Brabazon, and Michael O’Neill
  1 Introduction
  2 Grammatical Evolution
  3 Radial Basis Function Networks
  4 GE-RBFN Hybrid
    4.1 Grammar
    4.2 Example Individuals
  5 Experimental Setup & Results
  6 Conclusions & Future Work
  References

A Neural-Genetic Technique for Coastal Engineering: Determining Wave-induced Seabed Liquefaction Depth
Daeho Cha, Michael Blumenstein, Hong Zhang, and Dong-Sheng Jeng
  1 Introduction
    1.1 Artificial Neural Networks in Engineering
    1.2 Genetic Algorithms
    1.3 ANN models trained by GAs (Evolutionary Algorithms)
    1.4 Wave-induced seabed liquefaction
  2 A neural-genetic technique for wave-induced liquefaction
    2.1 Data preparation
  3 Results and discussion
    3.1 Neural-genetic model configuration for wave-induced liquefaction
    3.2 ANN model training using GAs for wave-induced liquefaction
    3.3 Results for determining wave-induced liquefaction
  4 Conclusions
  References

On the Design of Large-scale Cellular Mobile Networks Using Multi-population Memetic Algorithms
Alejandro Quintero and Samuel Pierre
  1 Introduction
  2 Background and related work
  3 Memetic Approach
    3.1 Basic Principles of Canonical Genetic Algorithms
    3.2 Basic Principles of Memetic Algorithms
    3.3 Multi-population Approach
  4 Implementation Details
    4.1 Memetic Algorithm Implementation
    4.2 Local Search Strategy
    4.3 Experimental Setting
  5 Performance Evaluation and Numerical Results
    5.1 Comparison with Other Heuristics
    5.2 Quality of the Solutions
  6 Conclusion
  References

A Hybrid Cellular Genetic Algorithm for the Capacitated Vehicle Routing Problem
Enrique Alba and Bernabé Dorronsoro
  1 Introduction
  2 The Vehicle Routing Problem
  3 The Proposed Algorithm
    3.1 Problem Representation
    3.2 Recombination
    3.3 Mutation
    3.4 Local Search
  4 Looking for a New Algorithm: the Way to JCell2o1i
    4.1 Cellular vs. Generational Genetic Algorithms
    4.2 On the Importance of the Mutation Operator
    4.3 Tuning the Local Search Step
  5 Solving CVRP with JCell2o1i
    5.1 Benchmark by Augerat et al.
    5.2 Benchmark by Van Breedam
    5.3 Benchmark by Christofides and Eilon
    5.4 Benchmark by Christofides, Mingozzi and Toth
    5.5 Benchmark by Fisher
    5.6 Benchmark by Golden et al.
    5.7 Benchmark by Taillard
    5.8 Benchmark of Translated Instances from TSP
  6 Conclusions and Further Work
  7 Acknowledgement
  References
  A Best Found Solutions
  B Results

Particle Swarm Optimization with Mutation for High Dimensional Problems
Jeff Achtnig
  1 Introduction
    1.1 Outline
    1.2 Particle Swarm Optimization
    1.3 Curse of Dimensionality
  2 PSO Modifications
    2.1 Mutation
    2.2 Random Constriction Coefficient
  3 Experimental Settings
    3.1 Standard Test Functions
    3.2 Neural Network Test Functions
  4 Results and Discussion
    4.1 Comparison with Differential Evolution
  5 Conclusions and Future Work
  References

Index
Preface
Evolutionary Computation (EC) has become an important and timely problem-solving methodology among researchers working in the area of computational intelligence. Population-based collective learning, self-adaptation and robustness are some of the key features of evolutionary algorithms when compared to other global optimization techniques. Evolutionary Computation has been widely accepted for solving important practical applications in engineering, business, commerce, etc. As the optimization problems to be tackled in the future will grow in complexity and volume of data, we can envision a rapidly growing role for EC over time. Evolutionary design of intelligent systems is gaining popularity due to its capabilities in handling real world problems involving optimization, complexity, noisy and non-stationary environments, imprecision, uncertainty and vagueness. This edited volume aims to present the latest state-of-the-art methodologies in ‘Engineering Evolutionary Intelligent Systems’. The book deals with theoretical and methodological aspects, as well as various EC applications to real world problems originating from science, technology, business or commerce. The volume comprises 15 chapters, including an introductory chapter which covers the fundamental definitions and outlines some important research challenges. These fifteen chapters are organized as follows. In the first Chapter, Abraham and Grosan elaborate on various schemes of evolutionary design of intelligent systems. Generic hybrid evolutionary intelligent system architectures are presented with a detailed review of some of the interesting hybrid frameworks already reported in the literature.
Oh and Pedrycz introduce an advanced architecture of genetically optimized Hybrid Fuzzy Neural Networks (gHFNN) resulting from a synergistic usage of the genetic optimization-driven hybrid system generated by combining Fuzzy Neural Networks (FNN) with Polynomial Neural Networks (PNN). FNNs support the formation of the premise part of the rule-based structure of the gHFNN. The consequence part of the gHFNN is designed using
PNNs. The optimization of the FNN is realized with the aid of a standard back-propagation learning algorithm and genetic optimization. In the third Chapter, Oh and Pedrycz introduce Self-Organizing Neural Networks (SONN) that are based on a genetically optimized multilayer perceptron with Polynomial Neurons (PNs) or Fuzzy Polynomial Neurons (FPNs). In the conventional SONN, an evolutionary algorithm is used to extend the main characteristics of the extended Group Method of Data Handling (GMDH) method, which utilizes the polynomial order as well as the number of node inputs fixed at the corresponding nodes (PNs or FPNs) located in each layer during the development of the network. The genetically optimized SONN (gSONN) results in a structurally optimized network and comes with a higher level of flexibility than the conventional SONN. Kim and Park discuss a new design methodology for the self-organizing technique, which builds upon the use of evolutionary algorithms. The self-organizing network builds on the idea of the group method of data handling. The order of the polynomial, the number of input variables, and the optimal number of input variables and their selection are encoded as a chromosome. The appropriate information of each node is evolved accordingly and tuned gradually using evolutionary algorithms. The evolved network is a sophisticated and versatile architecture, which can construct models from a limited data set as well as from poorly defined complex problems. In the fifth Chapter, Ramanathan and Guan propose the Recursive Pattern-based Hybrid Supervised (RPHS) learning algorithm, which makes use of the concept of pseudo global optimal solutions to evolve a set of neural networks, each of which can correctly solve a subset of patterns. The pattern-based algorithm uses the topology of training and validation data patterns to find a set of pseudo-optima, each learning a subset of patterns.
Ramanathan and Guan improve the RPHP algorithm (as discussed in Chapter 5) by using a combination of genetic algorithm, weak learner and pattern distributor. The global search component is achieved by a cluster-based combinatorial optimization, whereby patterns are clustered according to the output space of the problem. A combinatorial optimization problem is therefore formed, which is solved using evolutionary algorithms. An algorithm is also proposed to use the pattern distributor to determine the optimal number of recursions and hence the optimal number of weak learners suitable for the problem at hand. In the seventh Chapter, Markowska-Kaczmar proposes two methods of rule extraction referred to as REX and GEX. REX uses propositional fuzzy rules and is composed of two methods, REX Michigan and REX Pitt. GEX takes advantage of classical Boolean rules. The efficiency of REX and GEX was tested using different benchmark data sets from the UCI repository. Tushar and Pratihar deal with Takagi and Sugeno Fuzzy Logic Controllers (FLC) by focusing on their design process. This development by
clustering the data based on their similarity among themselves and then cluster-wise regression analysis is carried out, to determine the response equations for the consequent part of the rules. The performance of the developed cluster-wise linear regression approach, the cluster-wise Takagi and Sugeno model of FLC with linear membership functions and the cluster-wise Takagi and Sugeno model of FLC with nonlinear membership functions is illustrated using two practical problems. In the ninth Chapter, Prosperi and Ulivi propose fuzzy relational models for genotypic drug resistance analysis in Human Immunodeficiency Virus type 1 (HIV-1). Fuzzy logic is introduced to model high-level medical language, viral and pharmacological dynamics. Fuzzy evolutionary algorithms and fuzzy evaluation functions are proposed to mine resistance rules, to improve computational performance and to select relevant features. Azzini and Tettamanzi present an approach to the joint optimization of neural network structure and weights, using the backpropagation algorithm as a specialized decoder and defining a simultaneous evolution of the architecture and weights of neural networks. In the eleventh Chapter, Dempsey et al. present grammatical genetic programming to generate radial basis function networks. The authors tested the hybrid algorithm on several benchmark classification problems and report encouraging performance. Next, Cha et al. propose a neural-genetic model for wave-induced liquefaction, which provides a better prediction of liquefaction potential. The wave-induced seabed liquefaction problem is one of the most critical issues in analyzing and designing marine structures such as caissons, oil platforms and harbors. In the past, various investigations into wave-induced seabed liquefaction have been carried out, including numerical models, analytical solutions and some laboratory experiments.
However, most previous numerical studies are based on solving complicated partial differential equations. The neural-genetic simulation results illustrate the applicability of the hybrid technique for the accurate prediction of wave-induced liquefaction depth, which can also provide coastal engineers with alternative tools to analyze the stability of marine sediments. In the thirteenth Chapter, Quintero and Pierre propose a multi-population Memetic Algorithm (MA) with migration and elitism to solve the problem of assigning cells to switches as a design step of large-scale mobile networks. Because this task is well known in the literature to be an NP-hard combinatorial optimization problem, it calls for heuristic methods, which in practice lead to good feasible solutions that are not necessarily optimal; the objective is rather to reduce the convergence time toward these solutions. Computational results reported on an extensive suite of tests confirm the efficiency and effectiveness of the MA in providing good solutions in comparison with other well-known heuristics, especially for large-scale cellular mobile networks.
Alba and Dorronsoro solve the Capacitated Vehicle Routing Problem (CVRP) on 160 instances using a Cellular genetic algorithm (cGA) hybridized with a problem-customized recombination operator, an advanced mutation operator integrating three mutation methods, and two well-known local search algorithms formulated for routing problems. In the last Chapter, Achtnig investigates the use of Particle Swarm Optimization (PSO) in dealing with optimization problems of very high dimension. It has been found that augmenting PSO with concepts originating from evolutionary algorithms, such as a mutation operator, can in many cases significantly improve its performance. Further improvements have been reported with the addition of a random constriction coefficient. We are very grateful to all the authors of this volume for sharing their expertise and presenting their recent research findings. Our thanks go to the referees for their outstanding service and a wealth of critical yet highly constructive comments. The Editors would like to thank Dr. Thomas Ditzinger (Springer Engineering In-house Editor, Studies in Computational Intelligence Series), Professor Janusz Kacprzyk (Editor-in-Chief, Springer Studies in Computational Intelligence Series) and Ms. Heather King (Editorial Assistant, Springer Verlag, Heidelberg) for the editorial assistance and excellent collaboration during the development of this volume. We hope that the reader will share our excitement and find the volume ‘Engineering Evolutionary Intelligent Systems’ both useful and inspiring.

Ajith Abraham, Trondheim, Norway
Crina Grosan, Cluj-Napoca, Romania
Witold Pedrycz, Alberta, Canada
Contributors
Ajith Abraham
Center of Excellence for Quantifiable Quality of Service, Norwegian University of Science and Technology, Trondheim, Norway
[email protected]

Jeff Achtnig
Nalisys (Research Division)
[email protected]

Enrique Alba
Department of Computer Science, University of Málaga
[email protected]

Antonia Azzini
Università degli Studi di Milano, Dipartimento di Tecnologie dell’Informazione, via Bramante 65, 26013 Crema, Italy
[email protected]

Michael Blumenstein
School of Information and Communication Technology, Griffith University, Gold Coast Campus, QLD 4215, Australia
[email protected]

Anthony Brabazon
Natural Computing Research and Applications Group, University College Dublin, Ireland
[email protected]

Daeho Cha
Griffith School of Engineering, Griffith University, Gold Coast Campus, QLD 4215, Australia
[email protected]

Ian Dempsey
Natural Computing Research and Applications Group, University College Dublin, Ireland
ian.dempsey@PipelineFinancial.com

Bernabé Dorronsoro
Department of Computer Science, University of Málaga
[email protected]

Crina Grosan
Department of Computer Science, Faculty of Mathematics and Computer Science, Babes-Bolyai University, Cluj-Napoca, Romania
[email protected]

Sheng Uei Guan
School of Engineering and Design, Brunel University, Uxbridge, Middlesex, UB8 3PH, UK
[email protected]

Dong-Sheng Jeng
School of Civil Engineering, The University of Sydney, NSW 2006, Australia
[email protected]

Dongwon Kim
Department of Electrical Engineering, Korea University, 1 5-ka, Anam-dong, Seongbuk-ku, Seoul 136-701, Korea
[email protected]

Urszula Markowska-Kaczmar
Wroclaw University of Technology, Institute of Applied Informatics, Wyb. Wyspianskiego 27, 50-370 Wroclaw, Poland
urszula.markowska-kaczmar@pwr.wroc.pl

Sung-Kwun Oh
Department of Electrical Engineering, The University of Suwon, San 2-2, Wau-ri, Bongdam-eup, Hwaseong-si, Gyeonggi-do, 445-743, South Korea
[email protected]

Michael O’Neill
Natural Computing Research and Applications Group, University College Dublin, Ireland
[email protected]

Gwi-Tae Park
Department of Electrical Engineering, Korea University, 1 5-ka, Anam-dong, Seongbuk-ku, Seoul 136-701, Korea
[email protected]

Witold Pedrycz
Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB, Canada T6G 2G6, and Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland
[email protected]

Samuel Pierre
Department of Computer Engineering, École Polytechnique de Montréal, C.P. 6079, succ. Centre-Ville, Montréal, Qué., Canada, H3C 3A7
[email protected]

Dilip Kumar Pratihar
Department of Mechanical Engineering, Indian Institute of Technology, Kharagpur-721 302, West Bengal, India
[email protected]

Mattia Prosperi
University of Roma TRE, Faculty of Computer Science Engineering, Department of Computer Science and Automation
[email protected]

Alejandro Quintero
Department of Computer Engineering, École Polytechnique de Montréal, C.P. 6079, succ. Centre-Ville, Montréal, Qué., Canada, H3C 3A7
[email protected]

Kiruthika Ramanathan
Department of Electrical and Computer Engineering, National University of Singapore, 10 Kent Ridge Crescent, Singapore 119260
kiruthika [email protected]

Andrea G.B. Tettamanzi
Università degli Studi di Milano, Dipartimento di Tecnologie dell’Informazione, via Bramante 65, 26013 Crema, Italy
[email protected]

Tushar
Department of Mechanical Engineering, Indian Institute of Technology, Kharagpur-721 302, West Bengal, India
[email protected]

Giovanni Ulivi
University of Roma TRE, Faculty of Computer Science Engineering, Department of Computer Science and Automation
[email protected]

Hong Zhang
Griffith School of Engineering, Griffith University, Gold Coast Campus, QLD 4215, Australia
[email protected]
Engineering Evolutionary Intelligent Systems: Methodologies, Architectures and Reviews Ajith Abraham and Crina Grosan
Summary. Designing intelligent paradigms using evolutionary algorithms is becoming popular due to their capabilities in handling real world problems involving complexity, noisy environments, imprecision, uncertainty and vagueness. In this Chapter, we illustrate the various possibilities for designing intelligent systems using evolutionary algorithms and present some of the generic evolutionary design architectures that have evolved during the last couple of decades. We also review some of the recent evolutionary intelligent system frameworks reported in the literature.
A. Abraham and C. Grosan: Engineering Evolutionary Intelligent Systems: Methodologies, Architectures and Reviews, Studies in Computational Intelligence (SCI) 82, 1–22 (2008). © Springer-Verlag Berlin Heidelberg 2008, www.springerlink.com

1 Introduction
Evolutionary Algorithms (EA) have recently received increased interest, particularly with regard to the manner in which they may be applied for practical problem solving. Under the term evolutionary computation or evolutionary algorithms we usually group the domains of Genetic Algorithms [34], Evolution Strategies [68], [69], Evolutionary Programming [20], Learning Classifier Systems [36], Genetic Programming [45], Differential Evolution [67] and Estimation of Distribution Algorithms [56]. They all share a common conceptual base of simulating the evolution of individual structures, and they differ in the way the problem is represented, in the processes of selection and in the usage/implementation of reproduction operators. The processes depend on the perceived performance of the individual structures as defined by the problem. Compared to other global optimization techniques, evolutionary algorithms are easy to implement and very often provide adequate solutions. The basic procedure is as follows. A population of candidate solutions (for the optimization task to be solved) is initialized. New solutions are created by applying reproduction operators (mutation and/or crossover). The fitness (how good the solutions are) of the resulting solutions is evaluated, and a suitable selection strategy is then applied to determine which solutions are maintained into the next generation. The procedure is then iterated.
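The generic loop described in the Introduction (initialize, apply reproduction operators, evaluate, select, iterate) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a method taken from the chapter: the real-valued encoding, one-point crossover, Gaussian mutation, elitist truncation selection, the `sphere` objective and all parameter values are choices made here for the example, and fitness is expressed as a cost to be minimized.

```python
import random


def evolve(cost, dim, pop_size=30, generations=60, sigma=0.3, seed=0):
    """Minimal EA: initialize, vary, evaluate, select, iterate."""
    rng = random.Random(seed)
    # Initialize a population of candidate solutions (real-valued vectors).
    pop = [[rng.uniform(-5.0, 5.0) for _ in range(dim)] for _ in range(pop_size)]
    history = []  # best cost after each generation
    for _ in range(generations):
        children = []
        for _ in range(pop_size):
            a, b = rng.sample(pop, 2)
            cut = rng.randrange(1, dim) if dim > 1 else 0  # one-point crossover
            child = a[:cut] + b[cut:]
            child = [g + rng.gauss(0.0, sigma) for g in child]  # Gaussian mutation
            children.append(child)
        # Selection: keep the pop_size best of parents and children (elitist).
        pop = sorted(pop + children, key=cost)[:pop_size]
        history.append(cost(pop[0]))
    return pop[0], history


def sphere(x):
    """Toy objective, assumed here for illustration: lower is better."""
    return sum(v * v for v in x)


best, history = evolve(sphere, dim=5)
```

Because parents compete with their children for survival, the best cost recorded in `history` can never get worse from one generation to the next; other selection strategies (tournament, roulette wheel, non-elitist replacement) trade this guarantee for more diversity.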
The rest of the chapter is organized as follows. In Section 2, the various architectures for engineering evolutionary intelligent systems are presented. In Section 3, we present evolutionary artificial neural networks and their recent applications, followed by evolutionary fuzzy systems and applications in Section 4. Evolutionary clustering is presented in Section 5, followed by recent applications of evolutionary design of complex paradigms in Section 6. Multiobjective evolutionary design of intelligent systems is presented in Section 7, and some conclusions are provided towards the end.
2 Architectures of Evolutionary Intelligent Systems
Hybridization of evolutionary algorithms with other intelligent paradigms is a promising research field, and various architectures for Evolutionary Intelligent Systems (EIS) can be formulated, as depicted in Figures 1–6. By problem, we refer to any data mining, optimization or function approximation problem; Intelligent Paradigm (IP) refers to any computational intelligence technique such as neural networks, machine learning schemes, fuzzy inference systems, clustering algorithms, etc.
Fig. 1. EIS architecture 1
Fig. 2. EIS architecture 2
Fig. 3. EIS architecture 3
Fig. 4. EIS architecture 4
Fig. 5. EIS architecture 5
Fig. 6. EIS architecture 6
[Block diagrams not reproduced. Each figure connects the blocks Evolutionary algorithm, Intelligent paradigm, Problem / Data and Solution (Output); Figs. 4 and 6 add an Error feedback path.]
Figure 1 illustrates a transformational architecture in which an evolutionary algorithm is used to optimize an intelligent paradigm and, at the same time, the intelligent paradigm is used to fine tune the parameters and performance of the evolutionary algorithm. An example is an evolutionary-fuzzy system where an evolutionary algorithm fine tunes the parameters of a fuzzy inference system (for a function approximation problem) and the fuzzy system controls the parameters of the evolutionary algorithm [32]. A concurrent hybrid architecture is depicted in Figure 2, where an EA is used as a pre-processor and the intelligent paradigm is used to fine tune the solutions formulated by the EA. The final solutions to the problem are provided by the IP. Both EA and IP are continuously required for the satisfactory performance of the system. The EA may also be used as a post-processor, as illustrated in [1], [4] and [28]. Architecture 3 (Figure 3) depicts a cooperative hybrid system where the evolutionary algorithm is used to fine tune the parameters of the IP only during the initialization of the system. Once the system is initialized, the EA is no longer required for the satisfactory functioning of the system. Architecture 4 uses error feedback from the output (performance measure); based on this error measure (critic information), the EA is used to fine tune the performance of the IP. Final solutions are provided by the IP, as illustrated in Figure 4. An ensemble model is depicted in Figure 5, where the EA receives inputs directly from the IP and independent solutions are provided by the EA and the IP. A slightly different architecture is depicted in Figure 6, where error feedback is generated by the EA and, depending on this input, the IP performance is fine tuned; final solutions are provided by the IP. In the following Section, some of the well established hybrid frameworks for optimizing the performance of evolutionary algorithms using intelligent paradigms are presented.
3 Evolutionary Artificial Neural Networks
Artificial neural networks are capable of performing a wide variety of tasks, yet in practice they sometimes deliver only marginal performance. Inappropriate topology selection and an inappropriate learning algorithm are frequently blamed. There is little reason to expect that one can find a uniformly best algorithm for selecting the weights in a feedforward artificial neural network. This is in accordance with the no free lunch theorem, which states that for any algorithm, any elevated performance over one class of problems is exactly paid for in performance over another class [75]. At present, neural network design relies heavily on human experts who have sufficient knowledge about the different aspects of the network and the problem domain. As the complexity of the problem domain increases, manual design becomes more difficult and unmanageable. Evolutionary design of artificial neural networks has been widely explored. Evolutionary algorithms are used to adapt the connection weights, network architecture and learning rules according to the problem environment. A distinct feature of evolutionary neural networks is their adaptability to a dynamic environment. In other words, such neural networks can adapt to an environment as well as to changes in the environment. The two forms of adaptation, evolution and learning, make the adaptation of evolutionary artificial neural networks to a dynamic environment much more effective and efficient than the conventional learning approach. In an Evolutionary Artificial Neural Network (EANN), evolution can be introduced at various levels. At the lowest level, evolution can be introduced into weight training, where ANN weights are evolved. At the next higher level, evolution can be introduced into neural network architecture adaptation, where the architecture (number of hidden layers, number of hidden neurons and node transfer functions) is evolved.
At the highest level, evolution can be introduced into the learning mechanism [77]. A general framework of EANNs which includes the above three levels of evolution is given in Figure 7 [2]. From the point of view of engineering, the decision on the level of evolution depends on what kind of prior knowledge is available. If there is more prior knowledge about EANN architectures than about their learning rules, or a particular class of architectures is pursued, it is better to implement the evolution of architectures at the highest level, because such knowledge can be used to reduce the search space and the lower level evolution of learning rules can be biased towards this kind of architecture. On the other hand, the evolution of learning rules should be at the highest level if there is more prior knowledge about them available or there is a special interest in a certain type of learning rule.

Fig. 7. A general framework for evolutionary artificial neural networks. [Diagram not reproduced. Three levels, from slow to fast: evolutionary search of learning rules; evolutionary search of architectures and node transfer functions; evolutionary search of connection weights.]

3.1 Evolutionary Search of Connection Weights
The neural network training process is formulated as a global search of connection weights towards an optimal set defined by the evolutionary algorithm. The search for optimal connection weights can be formulated as a global search problem wherein the architecture of the neural network is pre-defined and fixed during the evolution. Connection weights may be represented as binary strings or real numbers. The whole network is encoded by concatenating all the connection weights of the network in the chromosome. A heuristic concerning the order of the concatenation is to put connection weights to the same node together. Proper genetic operators are to be chosen depending upon the representation used. While gradient based techniques are very much dependent on the initial setting of weights, evolutionary search methods are generally much less sensitive to initial conditions. Whereas gradient descent and second order optimization techniques can only find a local optimum in a neighborhood of the initial solution, evolutionary algorithms always try to search for a globally optimal solution. Evolutionary search for connection weights is depicted in Algorithm 3.1.

Algorithm 3.1 Evolutionary search of connection weights
1. Generate an initial population of N weight chromosomes. Evaluate the fitness of each EANN depending on the problem.
2. Depending on the fitness and using suitable selection methods, reproduce a number of children for each individual in the current generation.
3. Apply genetic operators to each child individual generated above and obtain the next generation.
4. Check whether the network has achieved the required error rate or the specified number of generations has been reached. If not, go to Step 2.
5. End
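The steps of Algorithm 3.1 can be sketched as follows for a small fixed architecture. Everything specific here is an assumption made for illustration rather than part of the chapter: the 2-2-1 network, the XOR training set, real-valued genes, truncation selection with randomly chosen parents (a simplification of "a number of children for each individual"), Gaussian mutation as the only genetic operator, and the names `forward`, `error` and `evolve_weights`. The chromosome layout follows the heuristic of keeping the weights that feed the same node next to each other.

```python
import math
import random

XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
N_HIDDEN = 2
N_WEIGHTS = N_HIDDEN * 3 + (N_HIDDEN + 1)  # 2-2-1 network with biases: 9 genes


def forward(w, x):
    """Decode the flat chromosome; weights feeding the same node are adjacent."""
    hidden = []
    for h in range(N_HIDDEN):
        w1, w2, b = w[3 * h: 3 * h + 3]
        hidden.append(math.tanh(w1 * x[0] + w2 * x[1] + b))
    out = w[3 * N_HIDDEN:]  # hidden-to-output weights followed by the output bias
    z = sum(o * h for o, h in zip(out, hidden)) + out[-1]
    z = max(-60.0, min(60.0, z))  # clamp to keep exp() in range
    return 1.0 / (1.0 + math.exp(-z))


def error(w):
    """Mean squared error on the training patterns (lower is fitter)."""
    return sum((forward(w, x) - t) ** 2 for x, t in XOR) / len(XOR)


def evolve_weights(pop_size=40, generations=200, sigma=0.5, target=0.01, seed=1):
    rng = random.Random(seed)
    # Step 1: initial population of weight chromosomes, evaluated and ranked.
    pop = [[rng.uniform(-1.0, 1.0) for _ in range(N_WEIGHTS)] for _ in range(pop_size)]
    pop.sort(key=error)
    initial_error = error(pop[0])
    for _ in range(generations):
        if error(pop[0]) <= target:  # Step 4: required error rate reached
            break
        parents = pop[: pop_size // 2]  # Step 2: fitness-based selection
        children = [[g + rng.gauss(0.0, sigma) for g in rng.choice(parents)]
                    for _ in range(pop_size)]  # Step 3: mutation of each child
        pop = sorted(parents + children, key=error)[:pop_size]
    return pop[0], initial_error, error(pop[0])


best, initial_error, final_error = evolve_weights()
```

Since the selected parents survive alongside their children, `final_error` cannot exceed `initial_error`; whether the target error is actually reached depends on the seed and parameter settings.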
3.2 Evolutionary Search of Architectures
Evolutionary architecture adaptation can be achieved by constructive and destructive algorithms. Constructive algorithms add complexity to the network, starting from a very simple architecture, until the entire network is able to learn the task [23], [52]. Destructive algorithms start with large architectures and remove nodes and interconnections until the ANN is no longer able to perform its task [58], [66]; the last removal is then undone. For an optimal network, the selection of the required node transfer function (Gaussian, sigmoidal, etc.) can be formulated as a global search problem, which is evolved simultaneously with the search for architectures [49]. To minimize the size of the genotype string and improve scalability, it is efficient to use an indirect (high level) coding scheme when a priori knowledge of the architecture is available. For example, if two neighboring layers are fully connected, then the architecture can be coded simply by the number of layers and nodes. The blueprint representation is a popular indirect coding scheme which assumes that the architecture consists of various segments or areas. Each segment or area defines a set of neurons, their spatial arrangement and their efferent connectivity. Other rugged high level coding schemes include the graph generation system [44], Symbiotic Adaptive Neuro-Evolution (SANE) [54], marker based genetic coding [24], L-systems [10], cellular encoding [26], fractal representation [53] and cellular automata [29]. Global search of the transfer function and the connectivity of the ANN using evolutionary algorithms is formulated in Algorithm 3.2. The evolution of architectures has to be implemented such that weight chromosomes evolve at a faster rate, i.e. for every architecture chromosome there will be several weight chromosomes evolving at a faster time scale.

Algorithm 3.2 Evolutionary search of architectures
1. Generate an initial population of N architecture chromosomes. Evaluate the fitness of each EANN depending on the problem.
2. Depending on the fitness and using suitable selection methods, reproduce a number of children for each individual in the current generation.
3. Apply genetic operators to each child individual generated above and obtain the next generation.
4. Check whether the network has achieved the required error rate or the specified number of generations has been reached. If not, go to Step 2.
5. End
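The indirect coding idea mentioned above (fully connected neighboring layers coded only by the number of layers and nodes) can be illustrated with a short sketch. The function names `decode` and `mutate`, the connectivity-matrix form of the direct encoding and the add-or-remove-one-node operator are assumptions chosen for this example; the chapter does not prescribe them.

```python
import random


def decode(layer_sizes):
    """Expand an indirect code (nodes per layer) into a direct connectivity matrix."""
    n = sum(layer_sizes)
    conn = [[0] * n for _ in range(n)]
    start = 0
    for size, next_size in zip(layer_sizes, layer_sizes[1:]):
        # Fully connect every node of this layer to every node of the next layer.
        for i in range(start, start + size):
            for j in range(start + size, start + size + next_size):
                conn[i][j] = 1
        start += size
    return conn


def mutate(layer_sizes, rng):
    """Architecture operator: add or remove one node in a random hidden layer."""
    sizes = list(layer_sizes)
    if len(sizes) > 2:  # input and output layers are left untouched
        k = rng.randrange(1, len(sizes) - 1)
        sizes[k] = max(1, sizes[k] + rng.choice((-1, 1)))
    return sizes


genotype = [2, 3, 1]                 # indirect code: three integers
matrix = decode(genotype)            # direct code: 6 x 6 = 36 binary entries
n_connections = sum(map(sum, matrix))
mutant = mutate(genotype, random.Random(0))
```

For the `[2, 3, 1]` genotype the decoded matrix holds 2·3 + 3·1 = 9 connections, so three integers stand in for 36 matrix entries; the gap widens quickly as networks grow, which is the scalability argument for indirect coding.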
3.3 Evolutionary Search of Learning Rules
For the neural network to be fully optimal the learning rules are to be adapted dynamically according to its architecture and the given problem. Deciding the learning rate and momentum can be considered as the first attempt of learning rules [48]. The basic learning rule can be generalized by the function

\[
\Delta w(t) = \sum_{k=1}^{n} \sum_{i_1, i_2, \ldots, i_k = 1}^{n} \left( \theta_{i_1, i_2, \ldots, i_k} \prod_{j=1}^{k} x_{i_j}(t-1) \right) \tag{1}
\]
where t is the time, ∆w is the weight change, x_1, x_2, ..., x_n are local variables and the θ's are real-valued coefficients which are to be determined by the global search algorithm. In the above equation, different values of the θ's determine different learning rules. The equation is based on the assumption that the same rule is applicable at every node of the network and that the weight update depends only on the input/output activations and the connection weights at a particular node. The evolution of learning rules has to be implemented such that architecture chromosomes evolve at a faster rate, i.e. for every learning rule chromosome there will be several architecture chromosomes evolving at a faster time scale. Genotypes (θ's) can be encoded as real-valued coefficients, and the global search for learning rules using the evolutionary algorithm is formulated in Algorithm 3.3.

Algorithm 3.3 Evolutionary search of learning algorithms or rules
1. Generate an initial population of N learning rules. Evaluate the fitness of each EANN depending on the problem.
2. Depending on the fitness and using suitable selection methods, reproduce a number of children for each individual in the current generation.
3. Apply genetic operators to each child individual generated above and obtain the next generation.
4. Check whether the network has achieved the required error rate or the specified number of generations has been reached. If not, go to Step 2.
5. End

In the literature, several research works can be traced on how to formulate different optimal learning rules [8], [21]. The adaptive adjustment of back-propagation algorithm parameters, such as the learning rate and momentum, through evolution could be considered as the first attempt at the evolution of learning rules [30]. Sexton et al. [65] used a simulated annealing algorithm for optimization of learning. For optimization of the neural network
learning, in many cases a pre-defined architecture was used and in a few cases architectures were evolved together. Abraham [2] proposed the meta-learning evolutionary neural network with a tight interaction of the different evolutionary search mechanisms using the generic framework illustrated in Figure 7.

3.4 Recent Applications of Evolutionary Neural Networks in Practice
Cai et al. [11] used a hybrid of Particle Swarm Optimization (PSO) [41], [18] and EA to train Recurrent Neural Networks (RNNs) for the prediction of missing values in time series data. Experimental results illustrate that RNNs trained by the hybrid algorithm are able to predict the missing values in the time series with minimum error, in comparison with those trained with standard EA and PSO algorithms. Castillo et al. [12] explored several methods that combine evolutionary algorithms and local search to optimize multilayer perceptrons. The authors explored a method that optimizes the architecture and initial weights of multilayer perceptrons, a search algorithm for training algorithm parameters and, finally, a co-evolutionary algorithm that handles the architecture, the network’s initial weights and the training algorithm parameters. Experimental results show that the co-evolutionary method obtains similar or better results than the other approaches, requiring far fewer training epochs and thus reducing running time. Hui [37] proposed a new method for predicting the reliability of repairable systems using evolutionary neural networks. Genetic algorithms are used to globally optimize the number of neurons in the hidden layer and the learning parameters of the neural network architecture. Marwala [51] proposed a Bayesian neural network trained using Markov Chain Monte Carlo (MCMC) and genetic programming in binary space within the Metropolis framework.
The proposed algorithm could learn using samples obtained from previous steps, merged using concepts of natural evolution which include mutation, crossover and reproduction. The reproduction function is the Metropolis framework; binary mutation and simple crossover are also used. Kim and Cho [42] proposed an incremental evolution method for neural networks based on cellular automata and a method of combining several evolved modules by a rule-based approach. The incremental evolution method evolves the neural network by starting with a simple environment and gradually making it more complex. The multi-module integration method can produce complex behaviors by combining several modules evolved or programmed to perform simple behaviors. Kim [43] explored a genetic algorithm approach to instance selection in artificial neural networks when the amount of data is very large. The GA simultaneously optimizes the connection weights and the selection of relevant instances. Capi and Doya [13] implemented an extended multi-population genetic algorithm (EMPGA), where subpopulations apply different evolutionary strategies, for designing neural controllers in the real hardware of the Cyber Rodent robot. The EMPGA subpopulations compete and cooperate with each other. Bhattacharya et al. [7] used a meta-learning evolutionary artificial neural network to select the best Flexible Manufacturing System (FMS) from a group of candidate FMSs. An EA is used to evolve the architecture and weights of the proposed neural network method. Further, a Back-Propagation (BP) algorithm is used as the local search algorithm. All the randomly generated architectures of the initial population are trained by the BP algorithm for a fixed number of epochs. The learning rate and momentum of the BP algorithm have been adapted to suit the generated data of the multi-criteria decision-making (MCDM) problem.
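As an illustration of the evolutionary search of learning rules (Algorithm 3.3), the following minimal sketch evolves three real-valued coefficients of a local update rule for a single linear neuron. The toy task, the particular rule form dw_i = th0*e*x_i + th1*y*x_i + th2*w_i, the truncation selection and the mutation-only variation are all assumptions made for this example; they are not taken from the works cited above.

```python
import math
import random

# Toy task (assumed for illustration): a single linear neuron should learn y = 2*x1 - x2.
DATA = [((x1, x2), 2.0 * x1 - x2)
        for x1 in (-1.0, 0.0, 1.0) for x2 in (-1.0, 0.0, 1.0)]

def train_with_rule(theta, epochs=20):
    """Train with dw_i = th0*e*x_i + th1*y*x_i + th2*w_i and return the final MSE."""
    w = [0.0, 0.0]
    for _ in range(epochs):
        for x, target in DATA:
            y = w[0] * x[0] + w[1] * x[1]
            e = target - y
            w = [wi + theta[0] * e * xi + theta[1] * y * xi + theta[2] * wi
                 for wi, xi in zip(w, x)]
    mse = 0.0
    for x, target in DATA:
        e = target - (w[0] * x[0] + w[1] * x[1])
        mse += e * e / len(DATA)
    # Diverged rules (inf or nan) receive the worst possible fitness.
    return mse if math.isfinite(mse) else float("inf")

def evolve_rules(pop_size=20, generations=30, sigma=0.05, seed=0):
    """Evaluate, select, mutate, repeat for a fixed number of generations."""
    rng = random.Random(seed)
    pop = [[rng.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(pop_size)]
    history = []  # best fitness seen at the start of each generation
    for _ in range(generations):
        ranked = sorted(pop, key=train_with_rule)             # fitness evaluation
        history.append(train_with_rule(ranked[0]))
        parents = ranked[: pop_size // 2]                     # elitist truncation selection
        children = [[g + rng.gauss(0.0, sigma) for g in rng.choice(parents)]
                    for _ in range(pop_size - len(parents))]  # Gaussian mutation only
        pop = parents + children
    return min(pop, key=train_with_rule), history
```

Setting theta = (0.1, 0, 0) recovers the plain delta rule with learning rate 0.1, so the delta rule is one point in the space being searched. In a full meta-learning setting the fitness evaluation would itself contain the faster-evolving architecture search, which this sketch omits.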
4 Evolutionary Fuzzy Systems
A conventional fuzzy controller makes use of a model of the expert who is in a position to specify the most important properties of the process. A fuzzy controller consists of a fuzzification interface, which receives the current values of the input variables and transforms them into linguistic terms or fuzzy sets. The knowledge base contains information about the domains of the variables and the fuzzy sets associated with the linguistic terms. A rule base in the form of linguistic control rules is also stored in the knowledge base. The decision logic determines the information about the control variables with the help of the measured input values and the knowledge base. The task of the defuzzification interface is to create a crisp control value out of the information about the control variable of the decision logic by using a suitable transformation. The usual approach in fuzzy control is to define a number of concurrent if-then fuzzy rules. Most fuzzy systems employ the inference method proposed by Mamdani [50], in which the rule consequence is defined by fuzzy sets and has the following structure:
$$\text{if } x \text{ is } A_1 \text{ and } y \text{ is } B_1 \text{ then } f = C_1 \qquad (2)$$
Takagi, Sugeno and Kang (TSK) [72] proposed an inference scheme in which the conclusion of a fuzzy rule is constituted by a weighted linear combination of the crisp inputs rather than a fuzzy set and has the following structure:
$$\text{if } x \text{ is } A_1 \text{ and } y \text{ is } B_1, \text{ then } f = p_1 x + q_1 y + r_1 \qquad (3)$$
In the literature, several research works related to the evolutionary design of fuzzy systems can be found [59], [62]. The majority of the works are concerned with the automatic design or optimization of fuzzy logic controllers either by
adapting the fuzzy membership functions or by learning the fuzzy if-then rules [55], [33]. Figure 8 shows the architecture of the adaptive fuzzy control system wherein the fuzzy membership functions and the rule bases are optimized using a hybrid global search procedure. An optimal design of an adaptive fuzzy control system could be achieved by the adaptive evolution of membership functions and the learning rules that progress on different time scales. Figure 9 illustrates the general interaction mechanism with the global search of fuzzy rules evolving at the highest level on the slowest time scale. For each fuzzy rule base, global search of membership functions proceeds at a faster time scale in an environment decided by the problem.
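To make concrete what such a search actually tunes, the sketch below evaluates a small rule base of the TSK form (3). The triangular membership functions, the use of min for the "and" connective and the example rule parameters are assumptions chosen for illustration; the membership parameters (a, b, c) and the consequent coefficients (p, q, r) are the quantities an evolutionary algorithm would adapt.

```python
def tri(x, a, b, c):
    """Triangular membership function with feet at a and c and peak at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def tsk_infer(x, y, rules):
    """First-order TSK inference: firing-strength-weighted average of linear consequents."""
    num = den = 0.0
    for set_x, set_y, (p, q, r) in rules:
        strength = min(tri(x, *set_x), tri(y, *set_y))  # 'and' modelled as min (an assumption)
        num += strength * (p * x + q * y + r)
        den += strength
    return num / den if den > 0.0 else 0.0  # no rule fires: return a neutral output

# Two hypothetical rules: (fuzzy set for x, fuzzy set for y, consequent coefficients).
RULES = [
    ((-1.0, 0.0, 1.0), (-1.0, 0.0, 1.0), (1.0, 1.0, 1.0)),   # f = x + y + 1
    ((0.0, 1.0, 2.0), (0.0, 1.0, 2.0), (2.0, -1.0, 0.0)),    # f = 2x - y
]
```

At the point (0.5, 0.5) both rules fire with strength 0.5, so the output is the average of their consequents, 2.0 and 0.5.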
Fig. 8. Adaptive fuzzy control system architecture [diagram labels: Evolutionary search (adaptation of fuzzy sets and rule base); Performance measure; Knowledge base (fuzzy sets, if-then rules); Fuzzy controller; Process]
Fig. 9. Interaction of various search mechanisms in the design of optimal adaptive fuzzy control system [diagram labels: Evolutionary search of fuzzy rules (slow); Evolutionary search of membership functions (fast)]
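The two time scales of Fig. 9 can be sketched as a nested search: an outer (slow) loop over rule bases, where each candidate rule base is scored only after an inner (fast) search has tuned its membership functions. The toy model below (two rules with Gaussian membership functions, singleton consequents picked from a fixed set, and the target f(x) = x) is purely hypothetical and serves only to show the nesting.

```python
import math
import random

CONSEQUENTS = (0.0, 0.5, 1.0)        # hypothetical singleton rule outputs
GRID = [i / 10 for i in range(11)]   # toy task: approximate f(x) = x on [0, 1]

def error(rule_base, centers, width=0.3):
    """MSE of a two-rule fuzzy model with Gaussian membership functions."""
    total = 0.0
    for x in GRID:
        w = [math.exp(-((x - c) / width) ** 2) for c in centers]
        out = sum(wi * CONSEQUENTS[k] for wi, k in zip(w, rule_base)) / sum(w)
        total += (out - x) ** 2
    return total / len(GRID)

def tune_memberships(rule_base, rng, steps=30, sigma=0.1):
    """Fast time scale: hill-climb the membership centers for one fixed rule base."""
    best = [rng.random() for _ in rule_base]
    best_err = error(rule_base, best)
    for _ in range(steps):
        # Centers are clamped to [0, 1] so every membership stays strictly positive.
        cand = [min(1.0, max(0.0, c + rng.gauss(0.0, sigma))) for c in best]
        cand_err = error(rule_base, cand)
        if cand_err <= best_err:
            best, best_err = cand, cand_err
    return best_err, best

def evolve_rule_bases(generations=8, pop_size=6, seed=1):
    """Slow time scale: each rule base is scored only after its inner search."""
    rng = random.Random(seed)
    pop = [[rng.randrange(len(CONSEQUENTS)) for _ in range(2)] for _ in range(pop_size)]
    best = None
    for _ in range(generations):
        scored = sorted((tune_memberships(rb, rng) + (rb,) for rb in pop),
                        key=lambda s: s[0])
        if best is None or scored[0][0] < best[0]:
            best = scored[0]
        parents = [rb for _, _, rb in scored[: pop_size // 2]]
        children = []
        for rb in parents:
            child = list(rb)
            child[rng.randrange(len(child))] = rng.randrange(len(CONSEQUENTS))  # mutate one rule
            children.append(child)
        pop = parents + children
    return best  # (error, membership centers, rule base)
```

The inner search here is a simple hill climber rather than a full evolutionary algorithm; it was chosen to keep the example short, and any population-based method could take its place.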
4.1 Evolutionary Search of Fuzzy Membership Functions
The tuning of the scaling parameters and fuzzy membership functions (piecewise linear and/or differentiable functions) is an important task in the design of fuzzy systems and is popularly known as genetic tuning. Evolutionary algorithms could be used to search for the optimal shape, the number of membership functions per linguistic variable and their parameters [31]. The genome encodes parameters of trapezoidal, triangular, logistic, Laplace, hyperbolic-tangent or Gaussian membership functions, etc. Most of the existing methods assume the existence of a predefined collection of fuzzy membership functions giving meaning to the linguistic labels contained in the rules (database). Evolutionary algorithms are applied to obtain a suitable rule base, using chromosomes that code single rules or complete rule bases. If prior knowledge of the membership functions is available, a simplified chromosome representation could be formulated accordingly. The first decision a designer has to make is how to represent a solution in a chromosome structure. The first approach is to have the chromosome encode the complete rule base. Each chromosome differs only in the fuzzy rule membership functions as defined in the database. In the second approach, each chromosome encodes a different database definition based on the fuzzy domain partitions. The global search for membership functions using an evolutionary algorithm is formulated in Algorithm 4.1.
Algorithm 4.1 Evolution of learning of fuzzy membership functions and their parameters
1. Generate an initial population of N chromosomes using one of the approaches mentioned in Section 4.1. Evaluate the fitness of each fuzzy rule base depending on the problem.
2. Depending on the fitness and using suitable selection methods, reproduce a number of children for each individual in the current generation.
3. Apply genetic operators to each child individual generated above and obtain the next generation.
4. Check whether the fuzzy system has achieved the required error rate or the specified number of generations has been reached. If not, go to Step 2.
5. End
4.2 Evolutionary Search of Fuzzy Rule Base
The number of rules grows rapidly with an increasing number of variables and fuzzy sets. A literature scan reveals that several coding methods were used according to the nature of the problem. The rule base of the fuzzy system may be represented using a relational matrix, a decision table or a set of rules. In the Pittsburgh approach [71], each chromosome encodes a whole rule set. Crossover serves to provide a new combination of rules and mutation provides new rules. The disadvantage is the increased complexity of the search space and the additional computational burden, especially for online learning. The size of the genotype depends on the number of input/output variables and fuzzy sets. In the Michigan approach [35], each genotype represents a single fuzzy rule and the entire population represents a solution. The fuzzy knowledge base is adapted as a result of the antagonistic roles of competition and cooperation of fuzzy rules. A classifier rule triggers whenever its condition part matches the current input, in which case the proposed action is sent to the process to be controlled. The fuzzy behavior is created by an activation sequence of mutually collaborating fuzzy rules. In the Michigan approach, techniques for judging the performance of single rules are necessary. The Iterative Rule Learning (IRL) approach [27] is similar to the Michigan approach in that the chromosomes encode individual rules. In IRL, only the best individual is considered as the solution, discarding the remaining chromosomes in the population. The evolutionary algorithm generates new classifier rules based on the rule strengths acquired during the entire process. Defuzzification operators and their parameters may also be formulated as an evolutionary search problem [46], [40], [5].
4.3 Recent Applications of Evolutionary Fuzzy Systems in Practice
Tsang et al. [73] proposed a fuzzy rule-based system for intrusion detection, which is evolved from an agent-based evolutionary framework and multiobjective optimization. The proposed system can also act as a genetic feature selection wrapper to search for an optimal feature subset for dimensionality reduction. Edwards et al.
[19] modeled the complex export pattern behavior of multinational corporation subsidiaries in Malaysia using a Takagi-Sugeno fuzzy inference system. The proposed fuzzy inference system is optimized by using neural network learning and evolutionary computation. Empirical results show that the proposed approach could model the export behavior reasonably well compared to a direct neural network approach. Chen et al. [15] proposed an automatic way of evolving hierarchical Takagi-Sugeno Fuzzy Systems (TS-FS). The hierarchical structure is evolved using Probabilistic Incremental Program Evolution (PIPE) with specific instructions. The fine tuning of the if-then rule parameters encoded in the structure is accomplished using Evolutionary Programming (EP). The proposed method interleaves both PIPE and EP optimizations. Starting with random structures and rule parameters, it first tries to improve the hierarchical structure and then, as soon as an improved structure is found, it further fine tunes the rule parameters. It then goes back to improve the structure
and the rules' parameters. This loop continues until a satisfactory hierarchical TS-FS model is found or a time limit is reached. Pawar and Ganguli [61] developed a Genetic Fuzzy System (GFS) for online structural health monitoring of composite helicopter rotor blades. The authors formulated global and local GFSs. The global GFS is for matrix cracking and debonding/delamination detection along the whole blade and the local GFS is for matrix cracking and debonding/delamination detection in various parts of the blade. Chu et al. [16] proposed a GA-based fuzzy controller design for tunnel ventilation systems. The Fuzzy Logic Control (FLC) method was used because of the complex and nonlinear behavior of the system, and the FLC was optimized using the GA. Franke et al. [22] presented a genetic-fuzzy system for automatically generating online scheduling strategies for a complex objective defined by a machine provider. The scheduling algorithm is based on a rule system, which classifies all possible scheduling states and assigns a corresponding scheduling strategy. The authors compared two different approaches. In the first approach, an iterative method is applied that assigns a standard scheduling strategy to all situation classes. In the second approach, a symbiotic evolution varies the parameters of Gaussian membership functions to establish the different situation classes and also assigns the appropriate scheduling strategies.
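The Pittsburgh-style encoding discussed in Section 4.2, where one chromosome holds a whole rule set, can be sketched as follows. The integer label coding, the rule layout (one antecedent label per input followed by a consequent label) and the operator settings are hypothetical choices for illustration, not the encoding of any system cited above.

```python
import random

N_LABELS = 3   # e.g. 0 = low, 1 = medium, 2 = high (hypothetical linguistic labels)
N_INPUTS = 2

def random_rule(rng):
    """One rule: an antecedent label per input plus one consequent label."""
    return tuple(rng.randrange(N_LABELS) for _ in range(N_INPUTS + 1))

def random_rule_set(rng, n_rules=4):
    """Pittsburgh-style chromosome: a complete rule set."""
    return [random_rule(rng) for _ in range(n_rules)]

def crossover(a, b, rng):
    """One-point crossover at a rule boundary: recombines existing rules only."""
    cut = rng.randrange(1, min(len(a), len(b)))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def mutate(rule_set, rng, rate=0.2):
    """Mutation rewrites single labels and can thereby introduce new rules."""
    out = []
    for rule in rule_set:
        genes = [rng.randrange(N_LABELS) if rng.random() < rate else g for g in rule]
        out.append(tuple(genes))
    return out
```

In a Michigan-style system the same `random_rule` tuples would instead be the individuals, the whole population would form the rule base, and each rule would additionally carry its own strength for credit assignment.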
5 Evolutionary Clustering
Clustering means the act of partitioning an unlabeled dataset into groups of similar objects. Each group, called a 'cluster', consists of objects that are similar among themselves and dissimilar to objects of other groups. A comprehensive review of the state-of-the-art clustering methods can be found in [76], [64]. Data clustering is broadly based on two approaches: hierarchical and partitional. In hierarchical clustering, the output is a tree showing a sequence of clusterings, with each clustering being a partition of the data set. Hierarchical algorithms can be agglomerative (bottom-up) or divisive (top-down). Agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger clusters. Partitional clustering algorithms, on the other hand, attempt to decompose the data set directly into a set of disjoint clusters by optimizing certain criteria. The criterion function may emphasize the local structure of the data, as by assigning clusters to peaks in the probability density function, or the global structure. Typically, the global criteria involve minimizing some measure of dissimilarity in the samples within each cluster, while maximizing the dissimilarity of different clusters. The advantages of the hierarchical algorithms are the disadvantages of the partitional algorithms and vice versa.
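The bottom-up merging performed by agglomerative algorithms can be sketched in a few lines. The version below uses single linkage (distance between the closest members of two clusters) on one-dimensional data; both choices are assumptions made to keep the example small.

```python
def agglomerative(points, n_clusters):
    """Single-linkage agglomerative clustering of 1-D points."""
    clusters = [[p] for p in points]          # start: every element is its own cluster
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)        # merge the two closest clusters (j > i)
    return clusters
```

Recording the sequence of merges instead of stopping at `n_clusters` would yield the tree mentioned above.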
Clustering can also be performed in two different modes: crisp and fuzzy. In crisp clustering, the clusters are disjoint and non-overlapping in nature. Any pattern may belong to one and only one class in this case. In the case of fuzzy clustering, a pattern may belong to all the classes with a certain fuzzy membership grade. One of the widely used clustering methods is the fuzzy c-means (FCM) algorithm developed by Bezdek [9]. FCM partitions a collection of n vectors $x_j$, $j = 1, 2, \ldots, n$, into c fuzzy groups and finds a cluster center in each group such that a cost function of a dissimilarity measure is minimized. To accommodate the introduction of fuzzy partitioning, the membership matrix U is allowed to have elements with values between 0 and 1. The FCM objective function takes the form:
$$J(U, c_1, \ldots, c_c) = \sum_{i=1}^{c} J_i = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{m} d_{ij}^{2} \qquad (4)$$
where $u_{ij}$ is a numerical value in [0,1]; $c_i$ is the cluster center of fuzzy group i; $d_{ij} = \|c_i - x_j\|$ is the Euclidean distance between the ith cluster center and the jth data point; and m is called the exponential weight, which influences the degree of fuzziness of the membership (partition) matrix. Usually a number of cluster centers are randomly initialized, and the FCM algorithm provides an iterative approach to approximate the minimum of the objective function starting from a given position, leading to any of its local minima [3]. There is no guarantee that FCM converges to an optimal solution (it can be trapped by local extrema in the process of optimizing the clustering criterion). The performance is very sensitive to the initialization of the cluster centers. Research efforts have made it possible to view data clustering as an optimization problem. This view offers us a chance to apply an EA for evolving the optimal number of clusters and their cluster centers. The algorithm is initialized by constraining the initial values to be within the space defined by the vectors to be clustered. An important advantage of the EA is its ability to cope with local optima by maintaining, recombining and comparing several candidate solutions simultaneously. Abraham [3] proposed the concurrent architecture of a fuzzy clustering algorithm (to discover data clusters) and a fuzzy inference system for Web usage mining. A hybrid evolutionary FCM approach is proposed in that paper to optimally segregate similar user interests. The clustered data is then used to analyze the trends using a Takagi-Sugeno fuzzy inference system learned using a combination of evolutionary algorithm and neural network learning.
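The FCM iteration that minimizes the objective (4) can be sketched as below. The alternating membership and center updates are the standard ones for FCM and are not spelled out in the text above; the clamp on zero distances, the iteration count and the initialization from randomly drawn data points are assumptions of this sketch.

```python
import math
import random

def fcm(data, c, m=2.0, iters=30, seed=0):
    """Fuzzy c-means: returns (centers, membership matrix u[i][j], objective history)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(data, c)]   # initial centers drawn from the data
    n, dim = len(data), len(data[0])
    history = []
    for _ in range(iters):
        # Distances d_ij, clamped away from zero to keep the ratios below finite.
        d = [[max(math.dist(ci, xj), 1e-12) for xj in data] for ci in centers]
        # Membership update: u_ij = 1 / sum_k (d_ij / d_kj)^(2 / (m - 1)).
        u = [[1.0 / sum((d[i][j] / d[k][j]) ** (2.0 / (m - 1.0)) for k in range(c))
              for j in range(n)] for i in range(c)]
        # Objective (4) for the current centers and memberships.
        history.append(sum(u[i][j] ** m * d[i][j] ** 2
                           for i in range(c) for j in range(n)))
        # Center update: mean of the data weighted by u_ij^m.
        for i in range(c):
            weights = [u[i][j] ** m for j in range(n)]
            total = sum(weights)
            centers[i] = [sum(w * xj[k] for w, xj in zip(weights, data)) / total
                          for k in range(dim)]
    return centers, u, history
```

Because the result depends on the initial centers, an evolutionary algorithm could treat a set of candidate centers as an individual and use the objective value as its fitness, which is the optimization view described above.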
6 Recent Applications of Evolutionary Design of Complex Paradigms
Park et al. [60] used an EA to optimize Hybrid Self-Organizing Fuzzy Polynomial Neural Networks (HSOFPNN), which are based on genetically optimized multi-layer perceptrons. The architecture of the resulting HSOFPNN
combines fuzzy polynomial neurons (FPNs) [57] that are located at the first layer of the network with polynomial neurons (PNs) forming the remaining layers of the network. The GA-based design procedure applied at each layer of the HSOFPNN leads to the selection of preferred nodes of the network (FPNs or PNs) whose local characteristics (such as the number of input variables, the order of the polynomial, a collection of the specific subset of input variables, the number of membership functions for each input variable, and the type of membership function) can be easily adjusted. Juang and Chung [39] proposed a recurrent Takagi-Sugeno-Kang (TSK) fuzzy network design using the hybridization of a multi-group genetic algorithm and particle swarm optimization (R-MGAPSO). Both the number of fuzzy rules and the parameters in a TSK-type recurrent fuzzy network (TRFN) are designed simultaneously by R-MGAPSO. In R-MGAPSO, the techniques of variable-length individuals and the local version of particle swarm optimization are incorporated into a genetic algorithm, where individuals with the same length constitute the same group, and there are multiple groups in a population. Aouiti et al. [6] proposed an evolutionary method for the design of beta basis function neural networks (BBFNN) and beta fuzzy systems (BFS). The authors used a hierarchical genetic learning model of the BBFNN and the BFS. Chen et al. [14] introduced a new time-series forecasting model based on the flexible neural tree (FNT). The FNT model is generated initially as a flexible multi-layer feed-forward neural network and evolved using an evolutionary procedure. The FNT model can also select the appropriate input variables or time-lags for constructing a time-series model.
7 Multiobjective Evolutionary Design of Intelligent Paradigms
Even though some real-world problems can be reduced to a single objective, very often it is hard to define all the aspects in terms of a single objective. In single-objective optimization, the search space is often well defined. As soon as there are several possibly contradicting objectives to be optimized simultaneously, there is no longer a single optimal solution but rather a whole set of possible solutions of equivalent quality. When we try to optimize several objectives at the same time, the search space also becomes partially ordered. Instead of a single optimal solution, there is a set of optimal trade-offs between the conflicting objectives. A multiobjective optimization problem is defined by a function f which maps a set of constraint variables to a set of objective values. Delgado and Pegalajar [17] developed a multi-objective evolutionary algorithm which is able to determine the optimal size of recurrent neural networks for any particular application. The authors analyzed the case of grammatical inference: in particular, how to establish the optimal size of a recurrent neural network in order to learn positive and negative examples in a certain language,
and how to determine the corresponding automaton using a self-organizing map once the training has been completed. Serra and Bottura [70] proposed a gain scheduling adaptive control scheme based on fuzzy systems, neural networks and multiobjective genetic algorithms for nonlinear plants. An FLC is developed, which is a discrete-time version of a conventional one. Its data base as well as the controller gains are optimally designed by using a genetic algorithm for simultaneously satisfying the overshoot and settling time minimizations and output response smoothing. Kelesoglu [47] developed a method for solving fuzzy multiobjective optimization of space trusses using a GA. The displacement, tensile stress, fuzzy sets, membership functions and minimum size constraints are considered in the formulation of the design problem. Lin [48] proposed a multiobjective and multistage fuzzy competence set model using a hybrid genetic algorithm. The author illustrated that the proposed method can provide a sound fuzzy competence set model by considering the multiobjective and the multistage situations simultaneously. Ishibuchi and Nojima [38] examined the interpretability-accuracy tradeoff in fuzzy rule-based classifiers using a multiobjective fuzzy genetics-based machine learning (GBML) algorithm, which is a hybrid version of the Michigan and Pittsburgh approaches. Each fuzzy rule is represented by its antecedent fuzzy sets as an integer string of fixed length. Each fuzzy rule-based classifier, which is a set of fuzzy rules, is represented as a concatenated integer string of variable length. The GBML algorithm simultaneously maximizes the accuracy of rule sets and minimizes their complexity. The accuracy is measured by the number of correctly classified training patterns, while the complexity is measured by the number of fuzzy rules and/or the total number of antecedent conditions of fuzzy rules. Garcia-Pedrajas et al.
[25] developed a cooperative coevolutionary model for the evolution of neural network topology and weights, called MOBNET. MOBNET evolves subcomponents that must be combined in order to form a network, instead of whole networks. The subcomponents in a cooperative coevolutionary model must fulfill different criteria to be useful, and these criteria usually conflict with each other. The problem of evaluating the fitness of an individual based on many criteria that must be optimized together is approached as a multi-criteria optimization problem. Wang et al. [74] proposed a multiobjective hierarchical genetic algorithm (MOHGA) to extract interpretable rule-based knowledge from data. In order to remove the redundancy of the rule base proactively, the authors applied an interpretability-driven simplification method. Fuzzy clustering is used to generate an initial rule-based model and then MOHGA and the recursive least square method are used to obtain the optimized fuzzy models. Pettersson et al. [63] used an evolutionary multiobjective technique in the training process of a feed-forward neural network, using noisy data from an industrial iron blast furnace. The number of nodes in the hidden layer, the architecture of the lower part of the network, as well as the weights used in
them were kept as variables, and a Pareto front was effectively constructed by minimizing the training error along with the network size.
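The notion of a Pareto front over training error and network size can be made concrete with a short dominance check. The candidate values below are invented for illustration and do not come from any of the cited studies; both objectives are assumed to be minimized.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical candidates: (training error, number of hidden nodes).
CANDIDATES = [(0.30, 2), (0.20, 4), (0.25, 6), (0.10, 8), (0.10, 10)]
FRONT = pareto_front(CANDIDATES)   # [(0.30, 2), (0.20, 4), (0.10, 8)]
```

Here (0.25, 6) is dominated by (0.20, 4), and (0.10, 10) by (0.10, 8); the three remaining networks are the trade-offs among which a designer would choose.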
8 Conclusions
This chapter presented various architectures for designing intelligent paradigms using evolutionary algorithms. The main focus was on designing evolutionary neural networks and evolutionary fuzzy systems. We also illustrated some of the recent generic evolutionary design architectures reported in the literature, including fuzzy neural networks and multiobjective design strategies.
References
1. Abraham A, Grosan C, Han SY, Gelbukh A (2005) Evolutionary multiobjective optimization approach for evolving ensemble of intelligent paradigms for stock market modeling. In: Gelbukh A et al. (eds.) 4th Mexican international conference on artificial intelligence, Mexico, Lecture notes in computer science, Springer, Berlin Heidelberg New York, pp 673–681
2. Abraham A (2004) Meta-learning evolutionary artificial neural networks. Neurocomput J 56c:1–38
3. Abraham A (2003) i-Miner: A Web Usage Mining Framework Using Hierarchical Intelligent Systems, The IEEE International Conference on Fuzzy Systems, FUZZ-IEEE'03, IEEE Press, ISBN 0780378113, pp 1129–1134
4. Abraham A, Ramos V (2003) Web Usage Mining Using Artificial Ant Colony Clustering and Genetic Programming, 2003 IEEE Congress on Evolutionary Computation (CEC2003), Australia, IEEE Press, ISBN 0780378040, pp 1384–1391
5. Abraham A (2003) EvoNF: A Framework for Optimization of Fuzzy Inference Systems Using Neural Network Learning and Evolutionary Computation, The 17th IEEE International Symposium on Intelligent Control, ISIC'02, IEEE Press, ISBN 0780376218, pp 327–332
6. Aouiti C, Alimi AM, Karray F, Maalej A (2005) The design of beta basis function neural network and beta fuzzy systems by a hierarchical genetic algorithm. Fuzzy Sets Syst 154(2):251–274
7. Bhattacharya A, Abraham A, Vasant P, Grosan C (2007) Meta-learning evolutionary artificial neural network for selecting flexible manufacturing systems under disparate level-of-satisfaction of decision maker. Int J Innovative Comput Inf Control 3(1):131–140
8. Baxter J (1992) The evolution of learning algorithms for artificial neural networks, Complex systems, IOS, Amsterdam, pp 313–326
9. Bezdek JC (1981) Pattern recognition with fuzzy objective function algorithms. Plenum, New York
10. Boers EJW, Borst MV, Sprinkhuizen-Kuyper IG (1995) Artificial neural nets and genetic algorithms. In: Pearson DW et al. (eds.) Proceedings of the international conference in Ales, France, Springer, Berlin Heidelberg New York, pp 333–336
11. Cai X, Zhang N, Venayagamoorthy GK, Wunsch II DC (2007) Time series prediction with recurrent neural networks trained by a hybrid PSO-EA algorithm. Neurocomputing 70(13–15):2342–2353
12. Castillo PA, Merelo JJ, Arenas MG, Romero G (2007) Comparing evolutionary hybrid systems for design and optimization of multilayer perceptron structure along training parameters. Inf Sci 177(14):2884–2905
13. Capi G, Doya K (2005) Evolution of recurrent neural controllers using an extended parallel genetic algorithm. Rob Auton Syst 52(2–3):148–159
14. Chen Y, Yang B, Dong J, Abraham A (2005) Time-series forecasting using flexible neural tree model. Inf Sci 174(3–4):219–235
15. Chen Y, Yang B, Abraham A, Peng L (2007) Automatic design of hierarchical Takagi-Sugeno fuzzy systems using evolutionary algorithms. IEEE Trans Fuzzy Syst 15(3):385–397
16. Chu B, Kim D, Hong D, Park J, Chung JT, Chung JH, Kim TH (2008) GA-based fuzzy controller design for tunnel ventilation systems, Journal of Automation in Construction, 17(2):130–136
17. Delgado M, Pegalajar MC (2005) A multiobjective genetic algorithm for obtaining the optimal size of a recurrent neural network for grammatical inference. Pattern Recognit 38(9):1444–1456
18. Eberhart RC, Kennedy J (1995) A new optimizer using particle swarm theory. In: Proceedings of 6th International Symposium on Micro Machine and Human Science, Nagoya, Japan, IEEE Service Center, Piscataway, NJ, pp 39–43
19. Edwards R, Abraham A, Petrovic-Lazarevic S (2005) Computational intelligence to model the export behaviour of multinational corporation subsidiaries in Malaysia. Int J Am Soc Inf Sci Technol (JASIST) 56(11):1177–1186
20. Fogel LJ, Owens AJ, Walsh MJ (1966) Artificial intelligence through simulated evolution. Wiley, USA
21. Fontanari JF, Meir R (1991) Evolving a learning algorithm for the binary perceptron, Network, vol. 2, pp 353–359
22. Franke C, Hoffmann F, Lepping J, Schwiegelshohn U (2008) Development of scheduling strategies with Genetic Fuzzy systems, Applied Soft Computing Journal, 8(1):706–721
23. Frean M (1990) The upstart algorithm: a method for constructing and training feed forward neural networks. Neural Comput 2:198–209
24. Fullmer B, Miikkulainen R (1992) Using marker-based genetic encoding of neural networks to evolve finite-state behaviour. In: Varela FJ, Bourgine P (eds.) Proceedings of the first European conference on artificial life, France, pp 255–262
25. García-Pedrajas N, Hervás-Martínez C, Muñoz-Pérez J (2002) Multi-objective cooperative coevolution of artificial neural networks (multi-objective cooperative networks). Neural Netw 15(10):1259–1278
26. Gruau F (1992) Genetic synthesis of boolean neural networks with a cell rewriting developmental process. In: Whitley D, Schaffer JD (eds.) Proceedings of the international workshop on combinations of genetic algorithms and neural networks, IEEE Computer Society Press, CA, pp 55–74
27. Gonzalez A, Herrera F (1997) Multi-stage genetic fuzzy systems based on the iterative rule learning approach. Mathware Soft Comput 4(3)
28. Grosan C, Abraham A, Nicoara M (2005) Search optimization using hybrid particle sub-swarms and evolutionary algorithms. Int J Simul Syst, Sci Technol UK 6(10–11):60–79
29. Gutierrez G, Isasi P, Molina JM, Sanchis A, Galvan IM (2001) Evolutionary cellular configurations for designing feedforward neural network architectures. In: Mira J et al. (eds.) Connectionist models of neurons, learning processes, and artificial intelligence, Springer, Berlin Heidelberg New York, LNCS 2084, pp 514–521
30. Harp SA, Samad T, Guha A (1989) Towards the genetic synthesis of neural networks. In: Schaffer JD (ed.) Proceedings of the third international conference on genetic algorithms and their applications, Morgan Kaufmann, CA, pp 360–369
31. Herrera F, Lozano M, Verdegay JL (1995) Tuning fuzzy logic controllers by genetic algorithms. Int J Approximate Reasoning 12:299–315
32. Herrera F, Lozano M, Verdegay JL (1995) Tackling fuzzy genetic algorithms. In: Winter G, Periaux J, Galan M, Cuesta P (eds.) Genetic algorithms in engineering and computer science, Wiley, USA, pp 167–189
33. Hoffmann F (1999) The Role of Fuzzy Logic in Evolutionary Robotics. In: Saffiotti A, Driankov D (eds.) Fuzzy logic techniques for autonomous vehicle navigation, Springer, Berlin Heidelberg New York
34. Holland JH (1975) Adaptation in Natural and Artificial Systems, The University of Michigan Press, Ann Arbor, MI
35. Holland JH, Reitman JS (1978) Cognitive systems based on adaptive algorithms. In: Waterman DA, Hayes-Roth F (eds.) Pattern-directed inference systems. Academic, San Diego, CA
36. Holland JH (1980) Adaptive algorithms for discovering and using general patterns in growing knowledge bases. Int J Policy Anal Inf Sys 4(3):245–268
37. Hui LY (2007) Evolutionary neural network modeling for forecasting the field failure data of repairable systems. Expert Syst Appl 33(4):1090–1096
38. Ishibuchi H, Nojima Y (2007) Analysis of interpretability-accuracy tradeoff of fuzzy systems by multiobjective fuzzy genetics-based machine learning. Int J Approximate Reason 44(1):4–31
39. Juang CF, Chung IF (2007) Recurrent fuzzy network design using hybrid evolutionary learning algorithms, Neurocomputing 70(16–18):3001–3010
40. Jin Y, von Seelen W (1999) Evaluating flexible fuzzy controllers via evolution strategies. Fuzzy Sets Syst 108(3):243–252
41. Kennedy J, Eberhart RC (1995) Particle swarm optimization. In: Proceedings of IEEE International Conference on Neural Networks, Perth, Australia, pp 1942–1948
42. Kim KJ, Cho SB (2006) Evolved neural networks based on cellular automata for sensory-motor controller. Neurocomputing 69(16–18):2193–2207
43. Kim KJ (2006) Artificial neural networks with evolutionary instance selection for financial forecasting. Expert Syst Appl 30(3):519–526
44. Kitano H (1990) Designing neural networks using genetic algorithms with graph generation system. Complex Syst 4(4):461–476
45. Koza JR (1992) Genetic programming: on the programming of computers by means of natural selection, MIT, Cambridge, MA
46. Kosinski W (2007) Evolutionary algorithm determining defuzzyfication operators. Eng Appl Artif Intell 20(5):619–627
47. Kelesoglu O (2007) Fuzzy multiobjective optimization of truss-structures using genetic algorithm. Adv Eng Softw 38(10):717–721
48. Lin CM (2006) Multiobjective fuzzy competence set expansion problem by multistage decision-based hybrid genetic algorithms. Appl Math Comput 181(2):1402–1416
49. Liu Y, Yao X (1996) Evolutionary design of artificial neural networks with different node transfer functions. In: Proceedings of the Third IEEE International Conference on Evolutionary Computation, Nagoya, Japan, pp 670–675
50. Mamdani EH, Assilian S (1975) An experiment in linguistic synthesis with a fuzzy logic controller. Int J Man Mach Stud 7(1):1–13
51. Marwala T (2007) Bayesian training of neural networks using genetic programming. Pattern Recognit Lett 28(12):1452–1458
52. Mascioli F, Martinelli G (1995) A constructive algorithm for binary neural networks: the oil spot algorithm. IEEE Trans Neural Netw 6(3):794–797
53. Merrill JWL, Port RF (1991) Fractally configured neural networks. Neural Netw 4(1):53–60
54. Moriarty DE, Miikkulainen R (1997) Forming neural networks through efficient and adaptive coevolution. Evol Comput 5:373–399
55. Mohammadian M, Stonier RJ (1994) Generating fuzzy rules by genetic algorithms. In: Proceedings of 3rd IEEE International Workshop on Robot and Human Communication, Nagoya, pp 362–367
56. Mühlenbein H, Paaß G (1996) From recombination of genes to the estimation of distributions I. Binary parameters. In: Lecture notes in computer science 1411: parallel problem solving from nature-PPSN IV, pp 178–187
57. Oh SK, Pedrycz W, Roh SB (2006) Genetically optimized fuzzy polynomial neural networks with fuzzy set-based polynomial neurons. Inf Sci 176(23):3490–3519
58. Omlin CW, Giles CL (1993) Pruning recurrent neural networks for improved generalization performance. Technical report No 93-6, CS Department, Rensselaer Institute, Troy, NY
59. Cordon O, Herrera F, Hoffmann F, Magdalena L (2001) Genetic fuzzy systems: evolutionary tuning and learning of fuzzy knowledge bases, World Scientific, Singapore, ISBN 981-02-4016-3, p 462
60. Park HS, Pedrycz W, Oh SK (2007) Evolutionary design of hybrid self-organizing fuzzy polynomial neural networks with the aid of information granulation. Expert Syst Appl 33(4):830–846
61. Pawar PM, Ganguli R (2007) Genetic fuzzy system for online structural health monitoring of composite helicopter rotor blades. Mech Syst Signal Process 21(5):2212–2236
62. Pedrycz W (ed.) (1997) Fuzzy evolutionary computation, Kluwer Academic Publishers, Boston, ISBN 0-7923-9942-0, p 336
63. Pettersson F, Chakraborti N, Saxen H (2007) A genetic algorithms based multiobjective neural net applied to noisy blast furnace data. Appl Soft Comput 7(1):387–397
64. Rokach L, Maimon O (2005) Clustering methods. In: Data mining and knowledge discovery handbook, Springer, Berlin Heidelberg New York, pp 321–352
65. Sexton R, Dorsey R, Johnson J (1999) Optimization of neural networks: a comparative analysis of the genetic algorithm and simulated annealing. Eur J Oper Res 114:589–601
66. Stepniewski SW, Keane AJ (1997) Pruning back-propagation neural networks using modern stochastic optimization techniques. Neural Comput Appl 5:76–98
22
A. Abraham and C. Grosan
67. Storn R, Price K (1997) Differential evolution – a simple and efficient adaptive scheme for global optimization over continuous spaces. J Global Optim 11(4):341–359 68. Rechenberg I, (1973) Evolutions strategie: optimierung technischer Systeme nach Prinzipien der biologischen Evolution, Fromman-Holzboog, Stuttgart 69. Schwefel HP (1977) Numerische Optimierung von Computermodellen mittels der Evolutionsstrategie, Birkhaeuser, Basel 70. Serra GLO, Bottura CP (2006) Multiobjective evolution based fuzzy PI controller design for nonlinear systems. Eng Appl Artif Intell 19(2):157–167 71. Smith SF (1980) A learning system based on genetic adaptive algorithms. PhD thesis, University of Pittsburgh 72. Takagi T, Sugeno M (1983) Derivation of fuzzy logic control rules from human operators control actions. In: Proceedings of the IFAC symposium on fuzzy information representation and decision analysis, pp 55–60 73. Tsang CH, Kwong S, Wang A (2007) Genetic-fuzzy rule mining approach and evaluation of feature selection techniques for anomaly intrusion detection. Pattern Recognit 40(9):2373–2391 74. Wang H, Kwong S, Jin Y, Wei W, Man K (2005) A multi-objective hierarchical genetic algorithm for interpretable rule-based knowledge extraction. Fuzzy Sets Syst 149(1):149–186 75. Wolpert DH, Macready WG (1997) No free lunch theorems for optimization. IEEE Trans Evol Comput 1(1):67–82 76. Xu R, Wunsch D, (2005) Survey of clustering algorithms. IEEE Trans Neural Netw 16(3):645–678 77. Yao X (1999) Evolving artificial neural networks. Proc IEEE 87(9):423–1447
Genetically Optimized Hybrid Fuzzy Neural Networks: Analysis and Design of Rule-based Multi-layer Perceptron Architectures
Sung-Kwun Oh and Witold Pedrycz
Summary. In this study, we introduce an advanced architecture of genetically optimized Hybrid Fuzzy Neural Networks (gHFNN) and develop a comprehensive design methodology supporting their construction. A series of numeric experiments is included to illustrate the performance of the networks. The construction of the gHFNN exploits fundamental technologies of Computational Intelligence (CI), namely fuzzy sets, neural networks, and genetic algorithms (GAs). The architecture of the gHFNN results from a synergistic usage of the genetic optimization-driven hybrid system generated by combining Fuzzy Neural Networks (FNN) with Polynomial Neural Networks (PNN). In this tandem, an FNN supports the formation of the premise part of the rule-based structure of the gHFNN. The consequence part of the gHFNN is designed using PNNs. The optimization of the FNN is realized with the aid of a standard back-propagation learning algorithm and genetic optimization. We distinguish between two types of fuzzy rule-based FNN structures, showing how this taxonomy depends upon the type of fuzzy partition of the input variables. As to the consequence part of the gHFNN, the development of the PNN dwells on two general optimization mechanisms: the structural optimization is realized via GAs, whereas the parametric optimization proceeds with standard least squares-based learning. Through the consecutive process of such structural and parametric optimization, an optimized PNN is generated in a dynamic fashion. To evaluate the performance of the gHFNN, the models are tested on several representative numerical examples. A comparative analysis demonstrates that the proposed gHFNN comes with higher accuracy as well as superb predictive capabilities when compared with other neurofuzzy models.
S.-K. Oh and W. Pedrycz: Genetically Optimized Hybrid Fuzzy Neural Networks: Analysis and Design of Rule-based Multi-layer Perceptron Architectures, Studies in Computational Intelligence (SCI) 82, 23–57 (2008) © Springer-Verlag Berlin Heidelberg 2008 www.springerlink.com

1 Introductory remarks

Recently, a lot of attention has been devoted to advanced techniques for modeling complex systems inherently associated with nonlinearity, high-order dynamics, time-varying behavior, and imprecise measurements. It is anticipated that efficient modeling techniques should allow for a selection of pertinent variables and in this way help cope with the dimensionality of the problem at hand. The models should be able to take advantage of the existing domain knowledge (such as the prior experience of human observers or process operators) and augment it with available numeric data to form a coherent data-knowledge modeling entity. The omnipresent modeling tendency is the one that exploits techniques of Computational Intelligence (CI) by embracing fuzzy modeling [1–6], neurocomputing [7], and genetic optimization [8–10]. In particular, two of the most successful approaches have been hybridization attempts made in the framework of CI [11,12]. Neuro-fuzzy systems are one of them [13–20]. A different approach to hybridization leads to genetic fuzzy systems. Lately, to obtain a highly beneficial synergy effect, neural fuzzy systems and genetic fuzzy systems have hybridized the approximate inference method of fuzzy systems with the learning capabilities of neural networks and evolutionary algorithms [21]. In this study, we develop a hybrid modeling architecture, called genetically optimized Hybrid Fuzzy Neural Networks (gHFNN). In a nutshell, a gHFNN is composed of two main substructures driven by genetic optimization, namely a rule-based Fuzzy Neural Network (FNN) and a Polynomial Neural Network (PNN). From the standpoint of rule-based architectures (with their rules assuming the general form “if antecedent then consequent”), one can regard the FNN as an implementation of the antecedent (or premise) part of the rules, while the consequent part is realized with the aid of the PNN. The resulting gHFNN is an optimized architecture designed by combining the conventional Hybrid Fuzzy Neural Networks (HFNN [19,20,34,35]) with genetic algorithms (GAs). The conventional HFNNs use the FNN architecture as the premise part, while the PNN structures are commonly used as the conclusion part of the HFNNs. In this study, the FNNs come with two kinds of network architectures, namely fuzzy-set based FNN and fuzzy-relation based FNN.
The topology of the network proposed here relies on fuzzy partitions realized in terms of fuzzy sets or fuzzy relations, depending on whether its input variables are considered separately or simultaneously. Each of them is placed into one of two main categories according to the type of fuzzy inference, namely the simplified and the linear fuzzy inference. Moreover, the PNN structure is optimized by GAs; that is, a genetically optimized PNN (gPNN) is designed and applied to the consequence part of the gHFNN. The gPNN, which exhibits a flexible and versatile structure, is constructed on the basis of PNN [14,15] and GAs [8–10]. The gPNN leads to an effective reduction of the depth of the networks as well as the width of the layers, and avoids a substantial amount of the time-consuming iterations needed to find the most preferred networks in the conventional PNN. In this network, the number of layers and the number of nodes in each layer are not predetermined (unlike in the case of most neural networks) but can be generated in a dynamic fashion. The design procedure applied in the construction of each layer of the PNN deals with its structural optimization, involving the selection of optimal nodes (or PNs) with specific local characteristics (such as the number of input variables, the order of the polynomial, and a specific subset of input variables), and addresses specific aspects of parametric optimization.
The study is organized in the following manner. First, Section 2 delivers a brief introduction to the architecture of the conventional HFNN. In Section 3, we discuss the structure of the genetically optimized HFNN (gHFNN) and elaborate on the development of the networks. The genetic design of the gHFNN model, together with an overall description of the detailed design methodology, is presented in Section 4. In Section 5, we report on a comprehensive set of experiments. Finally, concluding remarks are covered in Section 6.
2 The architecture of conventional Hybrid Fuzzy Neural Networks (HFNN)

The conventional HFNN architecture, formed by combining the FNN and the PNN, is visualized in Figs. 1–3 [19,20,34,35]. Let us recall that the fuzzy inference-based FNN (both simplified and linear) is constructed with the aid of space partitioning realized not only by fuzzy sets defined for each input variable but also by fuzzy relations that effectively capture an ensemble of input variables. These networks arise as a synergy between two other general constructs, namely FNN and PNN. Based on the different PNN topologies (see Table 1), the HFNN embraces two kinds of architectures, namely a basic and a modified one. Moreover, for each architecture of the HFNN, we identify two cases; refer to Fig. 1 for the overall taxonomy. Depending on which of the two connection points (interfaces) is chosen when the FS FNN shown in Fig. 2 is used, we realize a different combination of FNN and PNN while forming the HFNN architecture. In particular, when dealing with the interface of FNN realized by means of PNNs, we note that if the number of input variables to the PNN used in the consequence part of the HFNN is less than three (or four), the generic type of HFNN does not generate a highly versatile structure. As visualized in Figs. 1–3, we also identify two types of topology, namely a generic and an advanced type. Observe that in Figs. 2–3, $z_i'$ (Case 2) in the 2nd layer or higher indicates that the polynomial order of the partial description (PD) of each node has a different type in comparison with $z_i$ of the 1st layer. The “NOP” node states that the Ath node of the current layer is the same as the node positioned in the previous layer (NOP stands for “no operation”). An arrow to the NOP node is used to show that the same node moves from the previous layer to the current one.

[Figure: taxonomy tree. The premise part (FNN, with simplified or linear fuzzy inference) is joined through an interface to the consequence part (PNN) of generic or advanced type; each type comes in a basic and a modified form, and each form in case 1 and case 2, giving the generic-type basic HFNN, generic-type modified HFNN, advanced-type basic HFNN, and advanced-type modified HFNN.]
Fig. 1. Overall diagram for generating the conventional HFNN architecture

Fig. 2. FS HFNN architecture combined with FS FNN and PNN

Fig. 3. FR HFNN architecture combined with FR FNN and PNN

Table 1. Taxonomy of various PNN architectures

| Layer of PNN | No. of input variables of polynomial | Order of polynomial |
| 1st layer | p | Type P |
| 2nd or higher layer | q | Type Q |

PNN architecture:
(1) p = q: Basic PNN. a) Type P = Type Q: Case 1; b) Type P ≠ Type Q: Case 2
(2) p ≠ q: Modified PNN. a) Type P = Type Q: Case 1; b) Type P ≠ Type Q: Case 2
(p = 2, 3, 4, 5; q = 2, 3, 4, 5; P = 1, 2, 3; Q = 1, 2, 3)
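The taxonomy of Table 1 can be expressed as a small lookup. The sketch below is illustrative only: the function name `classify_pnn` is hypothetical, and it assumes the reading of Table 1 in which the architecture is basic when the 1st layer and the higher layers use the same number of inputs (p = q) and modified otherwise, while Case 1 and Case 2 distinguish equal from different polynomial types (as the remark on Case 2 above suggests).

```python
def classify_pnn(p, q, P, Q):
    """Return (architecture, case) for a PNN, following one reading of Table 1.

    p, q: number of input variables per node in the 1st layer and in the
          2nd or higher layers (2..5).
    P, Q: polynomial type in the 1st layer and in the higher layers (1..3).
    """
    checks = (("p", p, range(2, 6)), ("q", q, range(2, 6)),
              ("P", P, range(1, 4)), ("Q", Q, range(1, 4)))
    for name, value, allowed in checks:
        if value not in allowed:
            raise ValueError(f"{name}={value} is outside the range of Table 1")
    # Assumption: basic means the same number of inputs in every layer.
    architecture = "Basic PNN" if p == q else "Modified PNN"
    # Assumption: Case 2 means the polynomial type changes after the 1st layer.
    case = "Case 1" if P == Q else "Case 2"
    return architecture, case


if __name__ == "__main__":
    print(classify_pnn(2, 2, 1, 1))
    print(classify_pnn(2, 3, 1, 2))
```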
3 The architecture and development of genetically optimized HFNN (gHFNN)

In this section, we elaborate on the architecture and the development process of the gHFNN. This network emerges from the genetically optimized multi-layer perceptron architecture based on fuzzy set-based or fuzzy relation-based FNN, PNN, and GAs. In the sequel, the gHFNN is designed by combining the conventional Hybrid Fuzzy Neural Networks (HFNN) with GAs. These networks result from a synergy between two other general constructs, namely FNN [24,32] and PNN [14,15]. First, we briefly discuss these two classes of models by underlining their profound features in Sections 3.1 and 3.2, respectively.

3.1 Fuzzy neural networks based on genetic optimization

We consider two kinds of FNNs (viz. FS FNN and FR FNN) based on two types of fuzzy inference, namely the simplified and the linear fuzzy inference. The structure of the FNN is the same as the one used in the premise of the conventional HFNN. The FNN is designed by using space partitioning realized in terms of the individual input variables or an ensemble of all variables. Each topology is concerned with a granulation carried out in terms of fuzzy sets defined for each input variable or fuzzy relations that capture an ensemble of input variables, respectively. The fuzzy partitions formed for each case lead us to the topologies visualized in Figs. 4–5.
Fig. 4. Topology of FS FNN by using space partitioning in terms of individual input variables
Fig. 5. Topology of FR FNN by using space partitioning in terms of an ensemble of input variables
The notation in these figures requires some clarification. The “circles” denote units of the FNN while “N” identifies a normalization procedure applied to the membership grades of the input variable $x_i$. The output $f_i(x_i)$ of the “∑” neuron is described by some nonlinear function $f_i$. This $f_i$ is not necessarily a sigmoid function of the kind encountered in conventional neural networks; we allow for more flexibility in this regard. Finally, in the case of FS FNN, the output $\hat{y}$ of the FNN is governed by the following expression:

\[ \hat{y} = f_1(x_1) + f_2(x_2) + \cdots + f_m(x_m) = \sum_{i=1}^{m} f_i(x_i) \tag{1} \]

with m being the number of the input variables (viz. the number of the outputs $f_i$ of the “∑” neurons in the network). As previously mentioned, FS FNN is affected by the introduced fuzzy partition of each input variable. In this sense, we can regard each $f_i$ as given by fuzzy rules as shown in Table 2(a). Table 2(a) compares the fuzzy rules, the inference result, and the learning for the two types of FNNs. In Table 2(a), $R_j$ is the j-th fuzzy rule while $A_{ij}$ denotes a fuzzy variable of the premise of the corresponding fuzzy rule and represents the membership function $\mu_{ij}$. In the simplified fuzzy inference, $\omega_{ij}$ is a constant consequence of the rules; in the linear fuzzy inference, $\omega s_{ij}$ is the constant term and $\omega_{ij}$ is the coefficient of the input variable in the consequence of the rules. They express a connection (weight) existing between the neurons, as we have already visualized in Fig. 4. The mapping from $x_i$ to $f_i(x_i)$ is determined by the fuzzy inferences and a standard defuzzification. The inference result for individual fuzzy rules follows a standard center of gravity aggregation. An input signal $x_i$ activates only two membership functions, so the inference results can be written as outlined in Table 2(a) [23,24]. The learning of the FNN is realized by adjusting the connections of the neurons and as such it follows a standard Back-Propagation (BP) algorithm [23,24]. The complete update formulas are covered in Table 2(a), where $\eta$ is a positive learning rate and $\alpha$ is a positive momentum coefficient. The case of FR FNN, see Table 2(b), is handled in the same manner as outlined in Table 2(a) (the case of FS FNN).

The task of optimizing any model involves two main phases. First, a class of optimization algorithms has to be chosen so that it meets the requirements implied by the problem at hand. Secondly, various parameters of the optimization algorithm need to be tuned in order to achieve its best performance.
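Returning to the FS FNN itself, the simplified inference of (1) and the weight update summarized in Table 2(a) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes complementary triangular membership functions placed at user-supplied `centers` (so that exactly two neighbouring grades are active and sum to 1, as the text requires), and the names `memberships`, `fs_fnn_output`, and `bp_step` are hypothetical.

```python
from bisect import bisect_right


def memberships(x, centers):
    """Grades of complementary triangular membership functions at x.

    Assumption: `centers` is sorted; at most two neighbouring functions
    (k and k+1) are non-zero and their grades sum to 1.
    """
    mu = [0.0] * len(centers)
    if x <= centers[0]:
        mu[0] = 1.0
    elif x >= centers[-1]:
        mu[-1] = 1.0
    else:
        k = bisect_right(centers, x) - 1
        right = (x - centers[k]) / (centers[k + 1] - centers[k])
        mu[k], mu[k + 1] = 1.0 - right, right
    return mu


def fs_fnn_output(x, centers, weights):
    """Simplified inference: y_hat = sum_i f_i(x_i), f_i = sum_j mu_ij * w_ij."""
    return sum(
        sum(m * w for m, w in zip(memberships(xi, c), w_i))
        for xi, c, w_i in zip(x, centers, weights)
    )


def bp_step(x, y, centers, weights, prev_delta, eta=0.1, alpha=0.0):
    """One update: delta_w_ij = 2*eta*(y - y_hat)*mu_ij + alpha*(previous delta).

    The momentum term alpha*(w(t) - w(t-1)) equals alpha times the last change.
    Returns the error (y - y_hat) measured before the update.
    """
    err = y - fs_fnn_output(x, centers, weights)
    for i, (xi, c) in enumerate(zip(x, centers)):
        for j, m in enumerate(memberships(xi, c)):
            delta = 2.0 * eta * err * m + alpha * prev_delta[i][j]
            weights[i][j] += delta
            prev_delta[i][j] = delta
    return err


if __name__ == "__main__":
    centers = [[0.0, 1.0], [0.0, 1.0]]   # two labels per input (illustrative)
    weights = [[0.0, 2.0], [1.0, 3.0]]
    prev = [[0.0, 0.0], [0.0, 0.0]]
    x, y = [0.25, 0.5], 3.0
    print(fs_fnn_output(x, centers, weights))
    print(bp_step(x, y, centers, weights, prev))
    print(fs_fnn_output(x, centers, weights))
```

Because only two grades per input are non-zero, each training pair changes at most two weights per input variable, which is what makes the closed-form inference in Table 2(a) so compact.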
Along this line, genetic algorithms (GAs), viewed as optimization techniques based on the principles of natural evolution, are worth considering. GAs have been experimentally demonstrated to provide robust search capabilities in problems involving complex spaces, thus offering a valid solution to problems requiring efficient searching. It is instructive to highlight the main features that tell GAs apart from some other optimization methods: (1) A GA operates on the codes of the variables, not on the variables themselves. (2) A GA searches for optimal points starting from a group (population) of points in the search space (potential solutions), rather than from a single point. (3) A GA's search is directed only by some fitness function whose form could be quite complex; we do not require its differentiability [8–10]. In order to enhance the learning of the FNN, we use GAs to adjust the learning rate, the momentum coefficient, and the parameters of the membership functions of the antecedents of the rules [19,20,34,35].

Table 2. Comparison of simplified with linear fuzzy inference-based FNNs

(a) In case of using FS FNN

Fuzzy rules ($j = 1, \ldots, z$):
Simplified fuzzy inference (Scheme I): $R_j$: If $x_i$ is $A_{ij}$ then $Cy_{ij} = \omega_{ij}$
Linear fuzzy inference (Scheme II): $R_j$: If $x_i$ is $A_{ij}$ then $Cy_{ij} = \omega s_{ij} + x_i \omega_{ij}$

Inference result:
Simplified:
\[ f_i(x_i) = \frac{\sum_{j=1}^{z} \mu_{ij}(x_i) \cdot \omega_{ij}}{\sum_{j=1}^{z} \mu_{ij}(x_i)} = \mu_{ik}(x_i) \cdot \omega_{ik} + \mu_{ik+1}(x_i) \cdot \omega_{ik+1}, \qquad \hat{y} = \sum_{i=1}^{n} f_i(x_i) \]
Linear:
\[ f_i(x_i) = \frac{\sum_{j=1}^{z} \mu_{ij}(x_i) \cdot (\omega s_{ij} + x_i \omega_{ij})}{\sum_{j=1}^{z} \mu_{ij}(x_i)} = \mu_{ik}(x_i) \cdot (\omega s_{ik} + x_i \omega_{ik}) + \mu_{ik+1}(x_i) \cdot (\omega s_{ik+1} + x_i \omega_{ik+1}), \qquad \hat{y} = \sum_{i=1}^{n} f_i(x_i) \]
(Here n is the number of input variables, written m in (1).)

Learning:
Simplified:
\[ \Delta\omega_{ij} = 2 \cdot \eta \cdot (y_p - \hat{y}_p) \cdot \mu_{ij}(x_i) + \alpha(\omega_{ij}(t) - \omega_{ij}(t-1)) \]
Linear:
\[ \begin{cases} \Delta\omega s_{ij} = 2 \cdot \eta \cdot (y - \hat{y}) \cdot \mu_{ij} + \alpha(\omega s_{ij}(t) - \omega s_{ij}(t-1)) \\ \Delta\omega_{ij} = 2 \cdot \eta \cdot (y - \hat{y}) \cdot \mu_{ij} \cdot x_i + \alpha(\omega_{ij}(t) - \omega_{ij}(t-1)) \end{cases} \]

(b) In case of using FR FNN

Fuzzy rules ($i = 1, \ldots, n$):
Premise part: $R_i$: If $x_1$ is $A_{1i}$, $\cdots$, and $x_k$ is $A_{ki}$
Consequence part, simplified fuzzy inference: then $Cy_i = \omega_i$
Consequence part, linear fuzzy inference: then $Cy_i = \omega_{0i} + \omega_{1i} \cdot x_1 + \cdots + \omega_{ki} \cdot x_k$

Inference result:
Simplified:
\[ \hat{y} = \sum_{i=1}^{n} f_i = \sum_{i=1}^{n} \bar{\mu}_i \cdot \omega_i = \frac{\sum_{i=1}^{n} \mu_i \cdot \omega_i}{\sum_{i=1}^{n} \mu_i} \]
Linear:
\[ \hat{y} = \sum_{i=1}^{n} f_i = \sum_{i=1}^{n} \bar{\mu}_i \cdot (\omega_{0i} + \omega_{1i} \cdot x_1 + \cdots + \omega_{ki} \cdot x_k) = \frac{\sum_{i=1}^{n} \mu_i \cdot (\omega_{0i} + \omega_{1i} \cdot x_1 + \cdots + \omega_{ki} \cdot x_k)}{\sum_{i=1}^{n} \mu_i} \]

Learning:
Simplified:
\[ \Delta\omega_i = 2 \cdot \eta \cdot (y - \hat{y}) \cdot \bar{\mu}_i + \alpha(\omega_i(t) - \omega_i(t-1)) \]
Linear:
\[ \begin{cases} \Delta\omega_{0i} = 2 \cdot \eta \cdot (y - \hat{y}) \cdot \bar{\mu}_i + \alpha(\omega_{0i}(t) - \omega_{0i}(t-1)) \\ \Delta\omega_{ki} = 2 \cdot \eta \cdot (y - \hat{y}) \cdot \bar{\mu}_i \cdot x_k + \alpha(\omega_{ki}(t) - \omega_{ki}(t-1)) \end{cases} \]

3.2 Genetically optimized PNN (gPNN)

As underlined, the PNN algorithm is based upon the GMDH [22] method and utilizes a class of polynomials such as linear, quadratic, modified quadratic, etc. to describe the basic processing realized there. By choosing the most significant input variables and an order of the polynomial among the various forms available, we can obtain the best one; it comes under the name of a partial description (PD). It is realized by selecting nodes at each layer and eventually generating additional layers until the best performance has been reached. Such a methodology leads to an optimal PNN structure [14,15]. To address the problems with the conventional PNN (see Fig. 6), we introduce a new genetic design approach; in turn, we will be referring to these networks as genetically optimized PNN (to be called “gPNN”). When we construct the PNs of each layer in the conventional PNN, such parameters as the number of input variables (nodes), the order of the polynomial, and the input variables available within a PN are fixed (selected) in advance by the designer. This has frequently contributed to difficulties in the design of the optimal network. To overcome this apparent drawback, we resort to genetic optimization; see Figs. 8–9 of the next section for a more detailed flow of the development activities. The overall genetically-driven structural optimization process of PNN is shown in Fig. 7.

[Figure: a PNN built from layers of polynomial neurons (PNs) fed by the inputs $z_1, \ldots, z_4$ and producing the output $\hat{f}$. Inset: a polynomial neuron with input variables $z_p$, $z_q$ and polynomial order 2 (partial description of Type 2), $z = c_0 + c_1 z_p + c_2 z_q + c_3 z_p^2 + c_4 z_q^2 + c_5 z_p z_q$.]
Fig. 6. A general topology of the PN-based PNN: note a biquadratic polynomial occurring in the partial description

[Figure: for each stage (1st stage: 1st layer, 2nd stage: 2nd layer), the genetic design (selection of the no. of input variables, selection of input variables, selection of the polynomial order) is followed by PNs selection and layer generation. E: entire inputs, S: selected PNs, $z_i$: preferred outputs in the ith stage ($z_i = z_{1i}, z_{2i}, \ldots, z_{Wi}$).]
Fig. 7. Overall genetically-driven structural optimization process of PNN

The determination of the optimal values of the parameters available within an individual PN (viz. the number of input variables, the order of the polynomial, and the input variables) leads to a structurally and parametrically optimized network. As a result, this network is more flexible and exhibits a simpler topology in comparison with the conventional PNN discussed in the previous research [14,15]. For the optimization of the PNN model, the GA uses the serial method of binary type, roulette-wheel selection, one-point crossover, and a binary inversion (complementation) operation as the mutation operator. To retain the best individual and carry it over to the next generation, we use an elitist strategy [8,9].
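The GA operators named above (binary chromosomes, roulette-wheel selection, one-point crossover, bit-inversion mutation, and an elitist strategy) can be sketched as follows. This is a generic illustration under stated assumptions rather than the authors' code: the crossover and mutation rates, the function names, and the toy fitness used in the usage lines are all hypothetical, and the fitness values are assumed to be positive.

```python
import random


def roulette_select(population, fitness, rng):
    """Pick one chromosome with probability proportional to its fitness."""
    pick = rng.uniform(0.0, sum(fitness))
    running = 0.0
    for chrom, fit in zip(population, fitness):
        running += fit
        if running >= pick:
            return chrom
    return population[-1]


def one_point_crossover(a, b, rng):
    """Swap the tails of two parents after a random cut point."""
    point = rng.randint(1, len(a) - 1)
    return a[:point] + b[point:], b[:point] + a[point:]


def invert_mutation(chrom, rate, rng):
    """Complement each bit independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in chrom]


def next_generation(population, fitness_fn, rng, p_cross=0.65, p_mut=0.1):
    """Produce a new population; the best individual is carried over unchanged."""
    fitness = [fitness_fn(c) for c in population]
    best_index = max(range(len(population)), key=fitness.__getitem__)
    children = [list(population[best_index])]          # elitist strategy
    while len(children) < len(population):
        a = roulette_select(population, fitness, rng)
        b = roulette_select(population, fitness, rng)
        if rng.random() < p_cross:
            a, b = one_point_crossover(a, b, rng)
        children.append(invert_mutation(a, p_mut, rng))
        if len(children) < len(population):
            children.append(invert_mutation(b, p_mut, rng))
    return children


if __name__ == "__main__":
    rng = random.Random(0)
    pop = [[rng.randint(0, 1) for _ in range(10)] for _ in range(8)]
    toy_fitness = lambda c: 1 + sum(c)   # placeholder fitness, not the paper's
    for _ in range(5):
        pop = next_generation(pop, toy_fitness, rng)
    print(max(toy_fitness(c) for c in pop))
```

Copying the best chromosome before any selection takes place guarantees that the best fitness in the population never decreases from one generation to the next.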
[Figure: genetically optimized Hybrid Fuzzy Neural Networks (gHFNN). The premise part (FNN, with simplified and linear fuzzy inference and membership parameters tuned by GAs) is connected through an interface to the consequence part (gPNN), whose 1st and 2nd layers are each generated by GAs from candidate PNs. S: selected PNs, $z_i$: outputs of the ith layer, $x_j$: input variables of the jth layer (j = i + 1).]
Fig. 8. Overall diagram for generation of gHFNN architecture
3.3 Optimization of gHFNN topologies

The topology of the gHFNN is constructed by combining a fuzzy set-based or fuzzy relation-based FNN for the premise part of the gHFNN with a PNN used as the consequence part. These networks emerge through a synergy between two other general constructs, namely FNNs and gPNNs. In what follows, the gHFNN is composed of two main substructures driven by genetic optimization; see Figs. 8–9. The role of the FNNs arising at the premise part is to support learning and interact with the input as well as to granulate the corresponding input space (viz. converting the numeric data into their granular representatives emerging at the level of fuzzy sets). In particular, two types of fuzzy inference-based FNN (viz. FS-FNN or FR-FNN), realized with the fuzzy partitioning of individual input variables or of an ensemble of input variables, are considered to enhance the adaptability of the hybrid network architecture. One should stress that the structure of the consequent gPNN is not fixed in advance but becomes dynamically organized during a growth process. In essence, the gPNN exhibits an ability of self-organization. The gPNN algorithm can produce an optimal nonlinear system by selecting significant input variables among dozens of those available at the input and forming various types of polynomials. For this very reason, we selected the FNN and the gPNN to design the gHFNN architecture. One may consider some other hybrid network architectures, such as a combination of FNNs and MLPs as well as ANFIS-like models combined with MLPs. While attractive on the surface, such hybridization may lead to several potential problems: 1) the repeated learning and optimization of each of the contributing structures (such as ANFIS and MLP) may result in excessive learning time as well as generate quite complex networks for relatively simple systems, and 2) owing to its fixed structure, it could be difficult to generate the flexible topologies of the networks that are required to deal with highly nonlinear dependencies.
[Figure: flowchart. Premise part of gHFNN (FNN): adjustment of the parameters of the membership functions using GAs and of the connection weights using BP; computing activation degrees of linguistic labels; normalization of an activation degree of the rule; multiplying the normalized activation degrees of the rules by the connection weights; choice of connection point (1 or 2); fuzzy inference for the output of the rules; output of FS_FNN/FR_FNN. Consequence part of gHFNN (gPNN): configuration of input variables for the consequence part and initial information for the GAs and the gPNN; initialization of the population; generation of a PN by a chromosome in the population; evaluation of PNs (fitness); reproduction by roulette-wheel selection, one-point crossover, and invert mutation; elitist strategy and selection of W PNs; generation of a layer of the gPNN consisting of the optimal PNs selected by GAs. Until the stop condition is met, the outputs of the preserved PNs serve as new inputs to the next layer ($x_1 = z_1, x_2 = z_2, \ldots, x_W = z_W$). The gHFNN is organized by FS_FNN and layers with optimal PNs.]
Fig. 9. Overall design flowchart of the gHFNN architecture
4 The algorithms and design procedure of genetically optimized HFNN (gHFNN)

In this section, we elaborate on the algorithmic details of the design method by considering the functionality of the individual layers in the network architectures. The design procedure for each layer in the premise and consequence parts of the gHFNN comprises the following steps:
4.1 The premise of gHFNN: in case of FS FNN

[Layer 1] Input layer: The role of this layer is to distribute the signals to the nodes in the next layer.

[Layer 2] Computing activation degrees of linguistic labels: Each node in this layer corresponds to one linguistic label (small, large, etc.) of the input variables in layer 1. The layer determines a degree of satisfaction (activation) of this label by the input.

[Layer 3] Normalization of a degree of activation (firing) of the rule: As described, a degree of activation of each rule was calculated in layer 2. In this layer, we normalize the activation level by using the following expression:

\[ \bar{\mu}_{ij} = \frac{\mu_{ij}}{\sum_{j=1}^{n} \mu_{ij}} = \frac{\mu_{ik}}{\mu_{ik} + \mu_{ik+1}} = \mu_{ik} \tag{2} \]

where n is the number of membership functions for each input variable. An input signal $x_i$ activates only two membership functions simultaneously, and the sum of the grades of these two neighboring membership functions, labeled by k and k + 1, is always equal to 1, that is $\mu_{ik}(x_i) + \mu_{ik+1}(x_i) = 1$; this leads to the simpler format shown in (2) [23,24].

[Layer 4] Multiplying a normalized activation degree of the rule by connection (weight): The activation degree calculated at the third layer is now calibrated through the connections, that is

\[ a_{ij} = \bar{\mu}_{ij} \times Cy_{ij} = \mu_{ij} \times Cy_{ij} \tag{3} \]

\[ \text{Simplified: } Cy_{ij} = \omega_{ij}, \qquad \text{Linear: } Cy_{ij} = \omega s_{ij} + \omega_{ij} \cdot x_i \tag{4} \]

If we choose CP (connection point) 1 for combining FS FNN with PNN as shown in Fig. 10, $a_{ij}$ is given as the input variable of the PNN. If we choose CP 2, $f_i(x_i)$, which corresponds to the input signal to the output layer of the FNN, is viewed as the input variable of the PNN.

[Figure: FS FNN (simplified) with inputs $x_1, \ldots, x_i$, membership grades $\mu_{1j}, \ldots, \mu_{ij}$, normalization nodes N, connection weights $w_{1j}, \ldots, w_{ij}$, ∑ nodes, and output $\hat{y}$. CP 1 is located at the weighted signals $a_{ij}$ and CP 2 at the outputs $f_i(x_i)$ of the ∑ nodes.]
Fig. 10. Connection points used for combining FS FNN (Simplified) with gPNN

[Layer 5] Fuzzy inference for output of the rules: Considering Fig. 4, the output of each node in the 5th layer of the premise part of the gHFNN is inferred by the center of gravity method [23,24]. If we choose CP 2, $f_i$ is the input variable of the gPNN that is the consequence part of the gHFNN.

\[ \text{Simplified: } f_i(x_i) = \sum_{j=1}^{z} a_{ij} = \frac{\sum_{j=1}^{z} \mu_{ij}(x_i) \cdot \omega_{ij}}{\sum_{j=1}^{z} \mu_{ij}(x_i)} = \mu_{ik}(x_i) \cdot \omega_{ik} + \mu_{ik+1}(x_i) \cdot \omega_{ik+1} \tag{5} \]

\[ \text{Linear: } f_i(x_i) = \sum_{j=1}^{z} a_{ij} = \frac{\sum_{j=1}^{z} \mu_{ij}(x_i) \cdot (\omega s_{ij} + x_i \omega_{ij})}{\sum_{j=1}^{z} \mu_{ij}(x_i)} = \mu_{ik}(x_i) \cdot (\omega s_{ik} + x_i \omega_{ik}) + \mu_{ik+1}(x_i) \cdot (\omega s_{ik+1} + x_i \omega_{ik+1}) \tag{6} \]

[Output layer of FNN] Computing output of basic FNN: The output becomes a sum of the individual contributions from the previous layer, see (1).

The design procedure for each layer in FR FNN is carried out in the same manner as the one presented for FS FNN.

4.2 The consequence of gHFNN: in case of gPNN combined with FS FNN

[Step 1] Configuration of input variables: Define the input variables $x_i$ (i = 1, 2, ..., n) to the gPNN of the consequent structure of the gHFNN. If we choose the first option to combine the structures of FNN and gPNN (CP 1), $a_{ij}$, which is the output of layer 4 in the premise structure of the gHFNN, is treated as the input of the consequence structure of the gHFNN, that is, $x_1 = a_{11}, x_2 = a_{12}, \cdots, x_n = a_{ij}$ (n = i × j). For the second option of combining the structures (viz. CP 2), we have $x_1 = f_1, x_2 = f_2, \cdots, x_n = f_m$ (n = m).

[Step 2] Decision of initial information for constructing the gPNN structure: We decide upon the design parameters of the PNN structure, which include: a) the stopping criterion, b) the maximum number of input variables coming to each node in the corresponding layer, c) the total number W of nodes to be retained (selected) at the next generation of the gPNN, d) the depth of the gPNN, to be selected to reduce a conflict between overfitting and generalization abilities of the developed gPNN, and e) the depth and width of the gPNN, to be selected as a result of a tradeoff between accuracy and complexity of the overall model. It is worth stressing that the decisions made with respect to (b)–(e) help us avoid building excessively large networks (which could be quite limited in terms of their predictive abilities).

[Step 3] Initialization of population: We create a population of chromosomes for a PN, where each chromosome is a binary vector of bits. All bits of each chromosome are initialized randomly.
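Steps 1 and 3 can be illustrated with a short sketch that turns the layer-4 signals of (3) and (4) into the gPNN input vector for either connection point and then draws a random binary population. It is only a sketch under stated assumptions: the normalized grades `mu_bar` are taken as given, and the function names (`layer4_outputs`, `gpnn_inputs`, `init_population`) are hypothetical.

```python
import random


def layer4_outputs(mu_bar, weights, x, weights_s=None):
    """a_ij = mu_bar_ij * Cy_ij, following (3) and (4).

    With weights_s=None the simplified consequence Cy_ij = w_ij is used;
    otherwise the linear one, Cy_ij = ws_ij + w_ij * x_i.
    """
    a = []
    for i, xi in enumerate(x):
        row = []
        for j, m in enumerate(mu_bar[i]):
            cy = weights[i][j]
            if weights_s is not None:
                cy = weights_s[i][j] + weights[i][j] * xi
            row.append(m * cy)
        a.append(row)
    return a


def gpnn_inputs(a, connection_point):
    """CP 1: every a_ij is an input (n = i x j); CP 2: f_i = sum_j a_ij (n = m)."""
    if connection_point == 1:
        return [a_ij for row in a for a_ij in row]
    if connection_point == 2:
        return [sum(row) for row in a]
    raise ValueError("connection_point must be 1 or 2")


def init_population(size, n_bits, rng):
    """Step 3: each chromosome is a randomly initialized binary vector."""
    return [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(size)]


if __name__ == "__main__":
    mu_bar = [[0.75, 0.25], [0.5, 0.5]]      # illustrative normalized grades
    weights = [[0.0, 2.0], [1.0, 3.0]]
    a = layer4_outputs(mu_bar, weights, x=[0.25, 0.5])
    print(gpnn_inputs(a, 1))                  # four inputs for CP 1
    print(gpnn_inputs(a, 2))                  # two inputs for CP 2
    print(init_population(3, 8, random.Random(1)))
```

CP 1 hands the gPNN a finer-grained but larger set of candidate inputs, while CP 2 passes one aggregated signal per original input variable.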
[Step 4] Decision of PN structure using genetic design: This concerns the selection of the number of input variables, the polynomial order, and the input variables to be assigned in each node of the corresponding layer. These important decisions are carried out through an extensive genetic optimization. When it comes to the organization of the chromosome representing a PN, we divide the chromosome into three sub-chromosomes as shown in Fig. 12. The 1st sub-chromosome contains the number of input variables, the 2nd sub-chromosome involves the order of the polynomial of the node, and the 3rd sub-chromosome (remaining bits) contains input variables coming to the corresponding node (PN). In nodes (PNs) of each layer of gPNN, we adhere to the notation of Fig. 11. ‘PNn ’ denotes the nth PN (node) of the corresponding layer, ‘N’ denotes the number of nodes (inputs or PNs) coming to the corresponding node, and ‘T’ denotes the polynomial order in the corresponding node. Each sub-step of the genetic design of the three types of the parameters available within the PN is structured as follows. [Step 4-1] Selection of the number of input variables (1st sub-chromosome) Sub-step 1) The first 3 bits of the given chromosome are assigned to the binary bits for the selection of the number of input variables. Sub-step 2) The selected 3 bits are decoded into a decimal. Sub-step 3) The above decimal value is converted into [1 N] and rounded off. N denotes the maximal number of input variables entering the corresponding node (PN). Sub-step 4) The normalized integer value is then treated as the number of input variables (or input nodes) coming to the corresponding node. [Step 4-2] Selection of the order of polynomial (2nd sub-chromosome) Sub-step 1) The 3 bits of the 2nd sub-chromosome are assigned to the binary bits for the selection of the order of polynomial. Sub-step 2) The 3 bits are decoded into a decimal format. 
Sub-step 3) The decimal value obtained is normalized into [1 3] and rounded off.
Sub-step 4) The normalized integer value is given as the polynomial order.
a) The normalized integer value is given as 1 ⇒ the order of polynomial is Type 1
b) The normalized integer value is given as 2 ⇒ the order of polynomial is Type 2
c) The normalized integer value is given as 3 ⇒ the order of polynomial is Type 3

[Figure: the nth polynomial neuron PNn with inputs xi and xj, output z, number of inputs N, and polynomial order (Type T)]
Fig. 11. Overall diagram for generation of gHFNN architecture

S.-K. Oh and W. Pedrycz

[Figure: selection of the node (PN) structure by chromosome. The chromosome is divided into i) bits for the selection of the number of input variables, ii) bits for the selection of the polynomial order, and iii) bits for the selection of input variables. Each group is decoded into a decimal and normalized (less than Max; 1 ~ 3; 1 ~ n (or W)), which gives the number of input variables r, the order of the polynomial (Type 1 ~ Type 3), and the input variables of the selected PN.]
Fig. 12. Overall design flowchart of the gHFNN architecture

Table 3. Different forms of regression polynomial forming a PN

Order of the polynomial    Number of inputs
                           2                3                 4
1 (Type 1)                 Bilinear         Trilinear         Tetralinear
2 (Type 2)                 Biquadratic-1    Triquadratic-1    Tetraquadratic-1
2 (Type 3)                 Biquadratic-2    Triquadratic-2    Tetraquadratic-2

The following types of polynomials are used:
• Bilinear = $c_0 + c_1 x_1 + c_2 x_2$
• Biquadratic-1 (Basic) = Bilinear $+\, c_3 x_1^2 + c_4 x_2^2 + c_5 x_1 x_2$
• Biquadratic-2 (Modified) = Bilinear $+\, c_3 x_1 x_2$
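The polynomial forms of Table 3 can be illustrated with a small Python sketch that lists the regressor terms of a single PN. The function names `pn_terms` and `pn_output` are hypothetical, and the extension of the two-input formulas to three or four inputs (all squares and all pairwise products) is an assumption made here for illustration, since the text spells out only the two-input case.

```python
from itertools import combinations


def pn_terms(inputs, poly_type):
    """Return the regressor values of one polynomial neuron (PN).

    poly_type follows Table 3: 1 = linear, 2 = quadratic with squares and
    cross terms (the "-1" forms), 3 = modified quadratic with cross terms
    only (the "-2" forms). The ordering of terms mirrors the two-input
    formulas; the generalization to more inputs is an assumption.
    """
    if poly_type not in (1, 2, 3):
        raise ValueError("poly_type must be 1, 2 or 3")
    xs = [float(x) for x in inputs]
    terms = [1.0] + xs                      # c0 + c1*x1 + c2*x2 + ...
    if poly_type == 2:
        terms += [x * x for x in xs]        # squared terms
    if poly_type in (2, 3):
        terms += [a * b for a, b in combinations(xs, 2)]  # cross terms
    return terms


def pn_output(coeffs, inputs, poly_type):
    """Evaluate the PN polynomial for given coefficients c0, c1, ..."""
    terms = pn_terms(inputs, poly_type)
    if len(coeffs) != len(terms):
        raise ValueError("expected %d coefficients" % len(terms))
    return sum(c * t for c, t in zip(coeffs, terms))
```

For two inputs this gives 3, 6, and 4 coefficients for Type 1, Type 2, and Type 3, matching the Bilinear, Biquadratic-1, and Biquadratic-2 forms.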
[Step 4-3] Selection of input variables (3rd sub-chromosome)
Sub-step 1) The remaining bits are assigned to the binary bits for the selection of input variables.
Sub-step 2) The remaining bits are divided by the value obtained in step 4-1.
Sub-step 3) Each bit structure is decoded into a decimal.
Sub-step 4) The decimal value obtained is normalized into [1 n (or W)] and rounded off. Here n is the number of the overall system’s inputs in the 1st layer, and W is the number of the selected nodes in the 2nd layer or higher.
Sub-step 5) The normalized integer values are then taken as the selected input variables while constructing each node of the corresponding layer. If the same input variable is selected more than once, the duplicates are treated as a single input variable.

[Step 5] Estimation of the coefficients of the polynomial assigned to the selected node and evaluation of a PN: The vector of coefficients is derived by minimizing the mean squared error between the data $y_i$ and the node output $\hat{y}_i$ [14,15]. To evaluate the approximation and generalization capability of a PN produced by each chromosome, we use the following fitness function (the objective function is given in Section 5).

$$\text{fitness function} = \frac{1}{1 + \text{Objective function}} \qquad (7)$$
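The decoding of Steps 4-1 to 4-3 can be sketched in Python as follows. This is only an illustration under stated assumptions: the text does not give the normalization formula, so a linear rescaling of the decoded value onto the target range is assumed; leftover bits that do not divide evenly among the inputs are ignored; and the names `scale_to_range` and `decode_pn` are hypothetical. Note that Python's `round` rounds exact halves to the nearest even integer, which may differ from the rounding used by the authors.

```python
def bits_to_int(bits):
    """Decode a list of 0/1 bits (most significant first) into an integer."""
    value = 0
    for b in bits:
        value = value * 2 + b
    return value


def scale_to_range(bits, low, high):
    """Map a bit group linearly onto the integers low..high (assumed rule)."""
    max_code = 2 ** len(bits) - 1
    return low + round(bits_to_int(bits) * (high - low) / max_code)


def decode_pn(chromosome, max_inputs, n_candidates):
    """Decode one chromosome into (polynomial type, selected input indices).

    chromosome   : list of bits; 3 bits for the number of inputs, 3 bits for
                   the polynomial order, remaining bits for the input choice
    max_inputs   : N, the maximal number of inputs entering a node
    n_candidates : n in the 1st layer, W in the 2nd layer or higher
    """
    n_inputs = scale_to_range(chromosome[0:3], 1, max_inputs)   # Step 4-1
    poly_type = scale_to_range(chromosome[3:6], 1, 3)           # Step 4-2
    rest = chromosome[6:]                                       # Step 4-3
    width = len(rest) // n_inputs
    if width == 0:
        raise ValueError("not enough bits for the requested inputs")
    selected = []
    for k in range(n_inputs):
        group = rest[k * width:(k + 1) * width]
        index = scale_to_range(group, 1, n_candidates)
        if index not in selected:        # duplicates count as one input
            selected.append(index)
    return poly_type, selected
```

With a 3 + 3 + 24 bit chromosome, `max_inputs = 4`, and six candidate inputs, four 6-bit groups are decoded; two groups that map to the same index yield a node with three distinct inputs.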
[Step 6] Elitist strategy and selection of nodes (PNs) with the best predictive capability: The nodes (PNs) are rearranged in descending order of their calculated fitness values (F1, F2, · · · , Fz). Among the rearranged nodes, we unify those with duplicated fitness values (viz. the case where one node has the same fitness value as other nodes). We then choose the W PNs characterized by the best fitness values. For the elitist strategy, we select the node that has the highest fitness value among the generated nodes.

[Step 7] Reproduction: To generate the population of the next generation, we carry out selection, crossover, and mutation operations using the genetic information and the fitness values. Steps 4–7 are repeated until the last generation.

[Step 8] Construction of a corresponding layer of the consequence part of gHFNN: The individuals evolved by the GAs produce the W optimal PNs. The generated PNs form the corresponding layer of the consequence part of the gHFNN.

[Step 9] Check the termination criterion: The termination condition builds a sound compromise between the high accuracy of the resulting model and its complexity as well as generalization abilities.

[Step 10] Determine new input variables for the next layer: If the termination criterion has not been met, the model is expanded. The outputs of the preserved nodes (z1, z2, · · · , zW) serve as new inputs to the next layer (x1, x2, · · · , xW). The gPNN is constructed by repeating steps 3–10.
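Steps 5 to 7 can be illustrated by one generation of a simple binary GA. The following Python sketch is not the authors' implementation: it assumes roulette-wheel selection, one-point crossover, bit inversion as mutation, and an elitist copy of the best chromosome, with the fitness of (7). The function names are hypothetical, `objective` stands for any routine returning the performance index of the PN encoded by a chromosome, and the default rates are the values quoted in the experimental section.

```python
import random


def fitness(objective_value):
    """Fitness of (7): 1 / (1 + objective function)."""
    return 1.0 / (1.0 + objective_value)


def roulette_select(population, fitnesses, rng):
    """Pick one chromosome with probability proportional to its fitness."""
    pick = rng.uniform(0.0, sum(fitnesses))
    running = 0.0
    for chrom, fit in zip(population, fitnesses):
        running += fit
        if running >= pick:
            return chrom
    return population[-1]  # guard against floating-point round-off


def one_point_crossover(a, b, rng):
    """Swap the tails of two equal-length chromosomes at a random point."""
    point = rng.randint(1, len(a) - 1)
    return a[:point] + b[point:], b[:point] + a[point:]


def invert_mutation(chrom, rate, rng):
    """Invert each bit independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in chrom]


def next_generation(population, objective, crossover_rate=0.75,
                    mutation_rate=0.065, rng=None):
    """Produce the next population; the best chromosome is kept unchanged."""
    rng = rng or random.Random()
    fits = [fitness(objective(c)) for c in population]
    best = max(range(len(population)), key=fits.__getitem__)
    children = [list(population[best])]           # elitist strategy
    while len(children) < len(population):
        p1 = roulette_select(population, fits, rng)
        p2 = roulette_select(population, fits, rng)
        if rng.random() < crossover_rate:
            c1, c2 = one_point_crossover(p1, p2, rng)
        else:
            c1, c2 = list(p1), list(p2)
        children.append(invert_mutation(c1, mutation_rate, rng))
        if len(children) < len(population):
            children.append(invert_mutation(c2, mutation_rate, rng))
    return children
```

After the last generation, the W chromosomes with the highest fitness would be decoded into the PNs retained for the layer, as described in Steps 6 and 8.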
5 Experimental studies

In this section, the performance of the gHFNN is illustrated with the aid of some well-known and widely used datasets. In the first experiment, the network is used to model a three-input nonlinear function [1,13,19,25,28,29]. In the second simulation, a gHFNN is used to model a time series of gas furnace (Box-Jenkins data) [2–6,26,30–35]. Finally, we use the gHFNN for the NOx emission process of a gas turbine power plant [27,29,32]. The performance indexes (objective functions) used here are (9) for the three-input nonlinear function and (8) for both the gas furnace process and the NOx emission process.

i) Mean Squared Error (MSE)

$$E(PI \text{ or } EPI) = \frac{1}{n}\sum_{p=1}^{n} (y_p - \hat{y}_p)^2 \qquad (8)$$

ii) Mean Magnitude of Relative Error (MMRE)

$$E(PI \text{ or } EPI) = \frac{1}{n}\sum_{p=1}^{n} \frac{|y_p - \hat{y}_p|}{y_p} \times 100\,(\%) \qquad (9)$$
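As a quick illustration, the two performance indexes (8) and (9) can be computed as below. This is a minimal sketch with hypothetical function names; it assumes that `targets` holds the data y_p, that `outputs` holds the model outputs, and, for the MMRE, that all targets are nonzero.

```python
def mse(targets, outputs):
    """Mean squared error, as in (8)."""
    n = len(targets)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(targets, outputs)) / n


def mmre(targets, outputs):
    """Mean magnitude of relative error in percent, as in (9)."""
    n = len(targets)
    total = sum(abs(y - y_hat) / y for y, y_hat in zip(targets, outputs))
    return 100.0 * total / n
```

Evaluating either function on the learning set gives PI, and on the testing set gives EPI.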
The genetic algorithm uses binary coding, roulette-wheel selection, one-point crossover, and an invert operation as the mutation operator. The crossover rate of the GAs is set to 0.75 and the probability of mutation is equal to 0.065. The values of these parameters come from experiments and are very much in line with typical values encountered in genetic optimization.

5.1 Nonlinear function

In this experiment, we use the same numerical data as in [1,13,19,25,28,29]. The nonlinear function to be determined is expressed as

$$y = \left(1 + x_1^{0.5} + x_2^{-1} + x_3^{-1.5}\right)^2 \qquad (10)$$
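For reference, the target function (10) is straightforward to evaluate. The sketch below uses a hypothetical function name and assumes x1 ≥ 0 and nonzero, positive x2 and x3, so that every power is defined.

```python
def nonlinear_target(x1, x2, x3):
    """Three-input nonlinear function of (10)."""
    return (1.0 + x1 ** 0.5 + x2 ** -1 + x3 ** -1.5) ** 2
```

For example, `nonlinear_target(1, 1, 1)` evaluates to 16.0, since every term inside the parentheses equals 1.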
We consider 40 pairs of the original input-output data. The performance index (PI) is defined by (9). 20 out of 40 pairs of input-output data are used as learning set; the remaining part serves as a testing set. Table 4 summarizes the list of parameters related to the genetic optimization of the network. Design information for the optimization of gHFNN distinguishes between information of two networks such as the premise FNN and the consequent gPNN. First, a chromosome used in genetic optimization
Table 4. Parameters of the optimization environment and computational effort

(a) In case of using FS FNN
GAs
  Generation                                      100
  Population size                                 60
  Elite population size (W)                       30
  String length, premise structure (FNN)          10 (per one variable)
  String length, consequence structure (PNN)      3 + 3 + 24
gHFNN, premise (FS FNN)
  No. of entire system inputs                     3
  Learning iteration                              1000
  Learning rate tuned, simplified                 0.039
  Learning rate tuned, linear                     0.335
  Momentum coefficient tuned, simplified          0.004
  Momentum coefficient tuned, linear              0.058
  No. of rules                                    6
gHFNN, consequence (gPNN)
  No. of entire inputs, CP 1                      6
  No. of entire inputs, CP 2                      3
  Maximal layer                                   5
  No. of inputs to be selected (N)                1 ≤ N ≤ 4 (Max)
  Type (T)                                        1 ≤ T ≤ 3
N, T: integer

(b) In case of using FR FNN
GAs
  Generation                                      150
  Population size                                 100
  Elite population size (W)                       50
  String length, premise structure (FNN)          10 (per one variable)
  String length, consequence structure (PNN)      3 + 3 + 28
gHFNN, premise (FS FNN)
  No. of entire system inputs                     3
  Learning iteration                              1000
  Learning rate tuned, simplified                 0.309
  Learning rate tuned, linear                     0.879
  Momentum coefficient tuned, simplified          0.056
  Momentum coefficient tuned, linear              0.022
  No. of rules                                    8
gHFNN, consequence (gPNN)
  No. of entire inputs                            8
  Maximal layer                                   5
  No. of inputs to be selected (N)                1 ≤ N ≤ 4 (Max)
  Type (T)                                        1 ≤ T ≤ 3
N, T: integer
of the premise FNN contains the vertices of 2 membership functions of each system input (here, 3 system input variables have been used), learning rate, and momentum coefficient. The numbers of bits allocated to a chromosome are equal to 60, 10, and 10, respectively, that is 10 bits is assigned to each one variable. The parameters such as learning rate, momentum coefficient, and membership parameters are tuned with the help of genetic optimization
of the FNN as shown in Table 4. Next, in case of the consequent gPNN, a chromosome used in the genetic optimization consists of a string including 3 sub-chromosomes. The numbers of bits allocated to each sub-chromosome are equal to 3, 3, and 24/28, respectively. The population size being selected from the total population size, 60/100 is equal to 30/50. The process is realized as follows. 60/100 nodes (PNs) are generated in each layer of the network. The parameters of all nodes generated in each layer are estimated and the network is evaluated using both the training and testing data sets. Then we compare these values and choose 30/50 PNs that produce the best (lowest) value of the performance index. The number of inputs to be selected is confined to a maximum of four entries. The order of the polynomial is chosen from three types, that is Type 1, Type 2, and Type 3. Tables 5(a) and (b) summarize the results of the genetically optimized HFNN architectures when exploiting two kinds of FNN (viz. FS FNN and FR FNN) based on each fuzzy inference method. In light of the values of Table 5(a) reported there, we distinguish with two network architectures such as the premise FNN and the overall gHFNN. First, in case of the premise FNN, the network comes in the form of two fuzzy inference methods. Here, the FNN uses two membership functions for each input variable and has six fuzzy rules. In this case, as mentioned previously, the parameters of the FNN are optimized with the aid of GAs and BP learning. When considering the simplified fuzzy inference-based FNN, the minimal value of the performance index, that is PI = 5.217 and EPI = 5.142 are obtained. In case of the linear fuzzy inference-based FNN, the best results are reported in the form of the performance index such that PI = 2.929 and EPI = 3.45. Next the values of the performance index of output of the gHFNN depend on each connection point based on the individual fuzzy inference methods. 
Table 5 shows the values of the performance index vis-à-vis the number of layers of the gHFNN, together with the optimized architecture in each layer of the network. That is, for the maximal number of inputs to be selected (Max = 4), the table lists the selected node numbers, the selected polynomial type (Type T), and the corresponding performance index (PI and EPI) obtained when the genetic optimization was carried out for each layer. For example, consider simplified fuzzy inference and connection point 2 of FS FNN in Table 5(a), and let us investigate the 2nd layer of the network (shadowed in Table 5(a)). The fitness value in layer 2 attains its maximum for Max = 4 when nodes 8, 7, 4, and 11 (that is, z8, z7, z4, z11), taken from the preferred nodes (W) chosen in the previous layer (the 1st layer), are selected as the node inputs in the present layer. Furthermore, 4 inputs and Type 2 (the second-order polynomial of Table 3) were selected as the result of the genetic optimization, refer to Fig. 13(b). In the “Input No.” item of Table 5, a blank entry marked by a period (·) indicates that no input has been selected there by the genetic operation. The performance of the conventional HFNN (called “SOFPNN” in the literature [19], which is composed of FNN and PNN with 4 inputs-Type 2 topology) was quantified
Table 5. Parameters of the optimization environment and computational effort

(a) In case of using FS FNN

Premise part: Simplified fuzzy inference, 6 rules (MFs: 2 + 2 + 2), PI = 5.21, EPI = 5.14
Consequence part:
  CP   Layer   No. of inputs   Input No.        T   PI       EPI
  01   1       4               6   2   5   3    2   2.070    2.536
  01   2       4               4   9   7   15   3   0.390    0.896
  01   3       3               28  7   18  ·    3   0.363    0.642
  01   4       4               5   1   7   19   1   0.350    0.539
  01   5       2               5   3   ·   ·    2   0.337    0.452
  02   1       3               3   1   2   ·    2   2.706    3.946
  02   2       4               8   7   4   11   2   0.299    0.517
  02   3       3               15  5   14  ·    3   0.299    0.467
  02   4       3               5   3   24  ·    3   0.299    0.412
  02   5       3               14  25  1   ·    2   0.299    0.398

Premise part: Linear fuzzy inference, 6 rules (MFs: 2 + 2 + 2), PI = 2.92, EPI = 3.45
Consequence part:
  CP   Layer   No. of inputs   Input No.        T   PI       EPI
  01   1       4               3   2   6   5    2   0.667    0.947
  01   2       4               1   7   28  15   2   0.087    0.315
  01   3       4               28  9   1   14   2   0.0029   0.258
  01   4       4               16  23  3   22   2   0.0014   0.136
  01   5       4               11  12  20  7    1   0.0014   0.112
  02   1       3               2   1   3   ·    2   0.908    1.423
  02   2       4               16  1   2   7    2   0.113    0.299
  02   3       4               15  2   28  24   3   0.029    0.151
  02   4       4               5   15  20  24   3   0.010    0.068
  02   5       3               12  3   7   ·    3   0.0092   0.056

(b) In case of using FR FNN

Premise part: Simplified fuzzy inference, 8 rules (MFs: 2 × 2 × 2), PI = 3.997, EPI = 3.269
Consequence part:
  Layer   No. of inputs   Input No.        T   PI      EPI
  1       4               5   6   4   7    1   8.138   10.68
  2       4               2   45  49  25   2   0.382   2.316
  3       4               42  9   40  49   3   0.313   1.403
  4       3               36  16  33  ·    1   0.311   0.734
  5       3               43  45  28  ·    2   0.309   0.610

Premise part: Linear fuzzy inference, 8 rules (MFs: 2 × 2 × 2), PI = 2.069, EPI = 2.518
Consequence part:
  Layer   No. of inputs   Input No.        T   PI      EPI
  1       4               6   1   5   7    2   0.423   4.601
  2       4               49  4   44  34   3   0.184   2.175
  3       2               43  2   ·   ·    2   0.105   1.361
  4       3               33  44  5   ·    3   0.063   0.761
  5       3               41  26  42  ·    4   0.039   0.587
by the values of PI equal to 0.299 and EPI given as 0.555, whereas under the condition given as similar performance, the best results for the proposed network related to the output node mentioned previously were reported as PI = 0.299 and EPI = 0.517. In the sequel, the depth (the number of layers) as well as the width (the number of nodes) of the proposed genetically optimized HFNN (gHFNN) can be lower in comparison to the “conventional HFNN” (which immensely contributes to the compactness of the resulting network), refer to Fig. 13. In what follows, the genetic design procedure at stage (layer) of HFNN leads to the selection of the preferred nodes (or PNs)
Fig. 13. Comparison of the proposed model architecture (gHFNN) and the conventional model architecture (SOFPNN [19])
Fig. 14. Optimal topology of genetically optimized HFNN for the nonlinear function (In case of using FS FNN)
with optimal local characteristics (such as the number of input variables, the order of the polynomial, and input variables). In addition, when considering linear fuzzy inference and CP2, the best results are reported in the form of the performance index such as PI = 0.113 and EPI = 0.299 for layer 2, and PI = 0.0092 and EPI = 0.056 for layer 5. Their optimal topologies are shown in Figs. 14 (a) and (b). Figs. 15(a) and (b) depict the optimization process by showing the values of the performance index in successive cycles of both BP learning and genetic optimization when using each linear fuzzy inference-based FNN. Noticeably, the variation ratio (slope) of the performance of the network changes radically around the 1st and 2nd layer. Therefore, to effectively reduce
[Figure: performance index (PI and E_PI) plotted against the BP learning iterations of the premise part (FNN) and against the GA generations of the 1st to 5th layers of the consequence part (gPNN). (a) In case of FS FNN (linear). (b) In case of FR FNN (linear).]
Fig. 15. Optimization procedure of HFNN by BP learning and GAs
a large number of nodes and avoid a substantial amount of time-consuming iterations concerning HFNN layers, the stopping criterion can be taken into consideration. Referring to Figs. 14 and 15 it becomes obvious that we can optimize the network up to maximally the 2nd layer. Table 6 covers a comparative analysis including several previous models. Sugeno’s model I and II were fuzzy models based on linear fuzzy inference method while Shin-ichi’s models formed fuzzy rules by using learning method of neural networks. The study of literature [29] is based on fuzzy-neural networks using HCM clustering and evolutionary fuzzy granulation. SOFPNN [19] is a network being called the “conventional HFNN” in this study. The proposed genetically optimized HFNN (gHFNN) comes with higher accuracy and improved prediction capabilities. 5.2 Gas furnace process We illustrate the performance of the network and elaborate on its development by experimenting with data coming from the gas furnace process. The time series data (296 input-output pairs) resulting from the gas furnace process has been intensively studied in the previous literature [2-6,30-35]. The delayed terms of methane gas flow rate, u(t) and carbon dioxide density, y(t) are used as system input variables such as u(t-3), u(t-2), u(t-1), y(t-3), y(t-2), and y(t1). We use two types of system input variables of FNN structure, Type I and Type II to design an optimal model from gas furnace data. Type I utilize two system input variables such as u(t-3) and y(t-1) and Type II utilizes 3 system input variables such as u(t-2), y(t-2), and y(t-1). The output variable is y(t). Table 7 summarizes the computational aspects related to the genetic optimization of gHFNN.
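The two choices of system inputs described above can be made concrete with a short Python sketch that builds the regressor vectors from the raw series. The function name `build_lagged_dataset` and the 0-based indexing of the series are assumptions for illustration; how the authors align the series to obtain their input-output pairs is not stated in the text.

```python
def build_lagged_dataset(u, y, input_type):
    """Build (inputs, targets) pairs from the gas furnace series.

    Type I  uses [u(t-3), y(t-1)];
    Type II uses [u(t-2), y(t-2), y(t-1)];
    the target is y(t) in both cases.
    """
    if len(u) != len(y):
        raise ValueError("u and y must have the same length")
    if input_type == "I":
        lags = [("u", 3), ("y", 1)]
    elif input_type == "II":
        lags = [("u", 2), ("y", 2), ("y", 1)]
    else:
        raise ValueError("input_type must be 'I' or 'II'")
    series = {"u": u, "y": y}
    start = max(lag for _, lag in lags)   # first t with all lags available
    inputs, targets = [], []
    for t in range(start, len(y)):
        inputs.append([series[name][t - lag] for name, lag in lags])
        targets.append(y[t])
    return inputs, targets
```

The first few samples of each series are consumed by the delays, so the number of usable pairs is slightly smaller than the series length.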
Table 6. Performance analysis of selected models

Model                                        PI       EPI      No. of rules
Linear model [25]                            12.7     11.1
GMDH [25,28]                                 4.7      5.7
Sugeno's [1,25]      Fuzzy model I           1.5      2.1      3
                     Fuzzy model II          1.1      3.6      4
Shin-ichi's [13]     FNN Type 1              0.84     1.22     8 (2^3)
                     FNN Type 2              0.73     1.28     4 (2^2)
                     FNN Type 3              0.63     1.25     8 (2^3)
FNN [29]             Simplified              2.865    3.206    9 (3 + 3 + 3)
                     Linear                  2.670    3.063    9 (3 + 3 + 3)
Multi-FNN [29]       Simplified              0.865    0.956    9 (3 + 3 + 3)
                     Linear                  0.174    0.689    9 (3 + 3 + 3)
SOFPNN [19]          BFPNN                   0.299    0.555    6 rules/5th layer
                     MFPNN                   0.116    0.360    8 rules/5th layer
Proposed model
  FS FNN             Simplified              5.21     5.14     6 (2 + 2 + 2)
                     Linear                  2.92     3.45     6 (2 + 2 + 2)
  FR FNN             Simplified              3.997    3.269    8 (2 × 2 × 2)
                     Linear                  2.069    2.518    8 (2 × 2 × 2)
  gHFNN (FS FNN)     Simplified              0.299    0.517    6 rules/2nd layer
                                             0.299    0.398    6 rules/5th layer
                     Linear                  0.113    0.299    6 rules/2nd layer
                                             0.0092   0.056    6 rules/5th layer
  gHFNN (FR FNN)     Simplified              0.309    0.610    8 rules/5th layer
                     Linear                  0.039    0.587    8 rules/5th layer
The GA-based design procedure is carried out in the same manner as in the previous experiment. Table 8 includes the results of the overall network for the various alternatives concerning the form of the FNN architecture, the type of fuzzy inference, and the location of the connection point. For the simplified fuzzy inference-based FS FNN with Type I (4 fuzzy rules), the minimal values of the performance index are PI = 0.035 and EPI = 0.281. For the linear fuzzy inference-based FS FNN with Type I (4 fuzzy rules), the best results are PI = 0.041 and EPI = 0.267. When using Type II (6 fuzzy rules), the best results are PI = 0.0248 and EPI = 0.126 for simplified fuzzy inference, and PI = 0.0256 and EPI = 0.143 for linear fuzzy inference. In case of using FR FNN and Type II, the best results are PI = 0.026, EPI = 0.115 and PI = 0.033, EPI = 0.119 for simplified and linear fuzzy inference, respectively. When using FS FNN and Type I, Fig. 16 illustrates the detailed optimal topology of the gHFNN with 3 layers of PNN; the network comes with the following values: PI = 0.017 and EPI = 0.267. The proposed network is structurally optimized and is simpler than the
Table 7. Computational aspects of the optimization of gHFNN

(a) In case of using FS FNN
GAs
  Generation                                      150
  Population size                                 60
  Elite population size (W)                       30
  String length, premise structure (FNN)          10 (per one variable)
  String length, consequence structure (PNN)      3 + 3 + 24
gHFNN, premise (FNN)
  No. of entire system inputs                     2
  Learning iteration                              300
  Learning rate tuned, simplified                 0.0014
  Learning rate tuned, linear                     0.0052
  Momentum coefficient tuned, simplified          0.0002
  Momentum coefficient tuned, linear              0.0004
  No. of rules                                    4/6
gHFNN, consequence (gPNN)
  No. of system inputs, CP 1                      4/6
  No. of system inputs, CP 2                      2/3
  Maximal layer                                   5
  No. of inputs to be selected (N)                1 ≤ N ≤ 4 (Max)
  Type (T)                                        1 ≤ T ≤ 3
N, T: integer

(b) In case of using FR FNN and Type II
GAs
  Generation                                      150
  Population size                                 60
  Elite population size (W)                       30
  String length, premise structure (FNN)          10 (per one variable)
  String length, consequence structure (PNN)      3 + 3 + 24
gHFNN, premise (FNN)
  No. of entire system inputs                     3
  Learning iteration                              500
  Learning rate tuned, simplified                 0.0524
  Learning rate tuned, linear                     0.0144
  Momentum coefficient tuned, simplified          0.00086
  Momentum coefficient tuned, linear              0.00064
  No. of rules                                    4/8
gHFNN, consequence (gPNN)
  No. of entire inputs                            4/8
  Maximal layer                                   5
  No. of inputs to be selected (N)                1 ≤ N ≤ 4 (Max)
  Type (T)                                        1 ≤ T ≤ 3
N, T: integer
conventional HFNN. Fig. 17(a) illustrates the optimization process by visualizing the performance index in successive cycles (iteration and generation). It also shows the optimized network architecture when taking into consideration HFNN based on linear fuzzy inference and CP1, refer to Table 8(a) and Fig. 16. As shown in Figs. 17(a) and (b), the variation ratio (slope) of the performance of the network is almost the same around the 2nd through 5th layer.
Table 8. Performance index of HFNN for the gas furnace

(a) In case of using FS FNN and Type I

Premise part: Simplified fuzzy inference, 4 rules (MFs: 2 + 2), PI = 0.035, EPI = 0.281
Consequence part:
  CP   Layer   No. of inputs   Input No.        T   PI      EPI
  01   1       2               3   1   ·   ·    3   0.025   0.328
  01   2       4               24  11  30  18   2   0.024   0.269
  01   3       3               9   10  23  ·    2   0.020   0.265
  01   4       4               7   13  27  20   2   0.019   0.262
  01   5       3               1   10  7   ·    2   0.018   0.254
  02   1       2               1   2   ·   ·    3   0.024   0.328
  02   2       4               1   4   5   7    3   0.021   0.282
  02   3       3               29  27  26  ·    2   0.020   0.270
  02   4       3               15  13  21  ·    3   0.019   0.268
  02   5       4               8   4   20  13   2   0.018   0.265

Premise part: Linear fuzzy inference, 4 rules (MFs: 2 + 2), PI = 0.041, EPI = 0.267
Consequence part:
  CP   Layer   No. of inputs   Input No.        T   PI      EPI
  01   1       4               4   2   1   3    3   0.019   0.292
  01   2       4               7   12  2   10   2   0.018   0.271
  01   3       4               20  21  5   3    2   0.017   0.267
  01   4       3               22  13  29  ·    2   0.016   0.263
  01   5       4               25  18  27  9    3   0.015   0.258
  02   1       2               1   2   ·   ·    3   0.027   0.310
  02   2       3               4   6   5   ·    2   0.021   0.279
  02   3       4               6   14  7   1    2   0.018   0.270
  02   4       3               15  3   2   ·    2   0.018   0.263
  02   5       3               16  6   14  ·    2   0.016   0.259

(b) In case of using FR FNN and Type II

Premise part: Simplified fuzzy inference, 8 rules (MFs: 2 × 2 × 2), PI = 0.026, EPI = 0.115
Consequence part:
  Layer   No. of inputs   Input No.        T   PI      EPI
  1       4               2   5   6   7    2   0.021   0.124
  2       3               26  15  16  ·    2   0.019   0.115
  3       3               21  3   27  ·    2   0.018   0.114
  4       3               24  3   18  ·    2   0.018   0.111
  5       4               13  12  29  20   1   0.018   0.109

Premise part: Linear fuzzy inference, 8 rules (MFs: 2 × 2 × 2), PI = 0.033, EPI = 0.119
Consequence part:
  Layer   No. of inputs   Input No.        T   PI      EPI
  1       4               6   5   2   8    1   0.083   0.146
  2       4               21  18  6   9    2   0.028   0.116
  3       4               4   24  5   6    2   0.022   0.110
  4       3               28  4   5   ·    2   0.021   0.106
  5       3               21  18  25  ·    1   0.021   0.104
Table 9 contrasts the performance of the genetically developed network with other fuzzy and fuzzy-neural networks reported in the literature. It becomes obvious that the proposed genetically optimized HFNN architectures outperform other models both in terms of their accuracy and generalization capabilities.
[Figure: network diagram with the inputs u(t-3) and y(t-1), the premise FNN, and the layers of selected PNs in the consequence part, producing the output ŷ.]
Fig. 16. Genetically optimized HFNN (gHFNN) with FS FNN (linear)

[Figure: performance index (PI and E_PI) plotted against the BP learning iterations of the premise part and against the GA generations of the 1st to 5th layers of the consequence part (gPNN). (a) In case of using FS FNN (linear) with Type I. (b) In case of using FR FNN (linear) and Type II.]
Fig. 17. Optimization procedure of HFNN by BP learning and GAs
5.3 NOx emission process of gas turbine power plant NOx emission process is modeled using the data of gas turbine power plant coming from a GE gas turbine power plant located in Virginia, US. The input variables include AT (Ambient Temperature at site), CS (Compressor Speed), LPTS (Low Pressure Turbine Speed), CDP (Compressor Discharge Pressure), and TET (Turbine Exhaust Temperature). The output variable is NOx [27,29,32]. The performance index is defined by (8). We consider 260 pairs of the original input-output data. 130 out of 260 pairs of input-output data are used as learning set; the remaining part serves as a
Table 9. Performance analysis of selected models
Model Kim, et al.’s model [30] Lin and Cunningham’s model [31] Simplified Min-Max [5] Linear Simplified Linear Simplified Complex [2] Linear Hybrid [4] Simplified (GAs+Complex) Linear Simplified HCM [3] Linear GAs [5]
Fuzzy
Simplified HCM+GAs [3] Linear Neural Networks [3] Oh’s Adaptive FNN [6] FNN [32] Multi-FNN [33] SOFPNN
Simplified Linear Simplified Linear Generic [34] Advanced [35] Simplified FS FNN Linear Simplified FR FNN Linear
Proposed model gHFNN (FS FNN)
gHFNN (FR FNN)
Simplified Linear Simplified Linear
PI EPI No. of rules 0.034 0.244 2 0.071 0.261 4 0.022 0.335 4(2 × 2) 0.022 0.336 6(3 × 2) 0.024 0.358 4(2 × 2) 0.020 0.362 6(3 × 2) 0.023 0.344 4(2 × 2) 0.018 0.264 4(2 × 2) 0.024 0.328 4(2 × 2) 0.023 0.306 4(2 × 2) 0.024 0.329 4(2 × 2) 0.017 0.289 4(2 × 2) 0.755 1.439 6(3 × 2) 0.018 0.286 6(3 × 2) 0.035 0.289 4(2 × 2) 0.022 0.333 6(3 × 2) 0.026 0.272 4(2 × 2) 0.020 0.2642 6(3 × 2) 0.034 4.997 0.021 0.332 9(3 × 3) 0.022 0.353 4(2 × 2) 0.043 0.264 6(3 + 3) 0.037 0.273 6(3 + 3) 0.025 0.274 6(3 + 3) 0.024 0.283 6(3 + 3) 0.017 0.250 4 rules/5th layer 0.019 0.264 6 rules/5th layer 0.035 0.281 4(2 + 2) 0.024 0.126 6(2 + 2 + 2) 0.041 0.267 4(2 + 2) 0.025 0.143 6(2 + 2 + 2) 0.024 0.329 4(2 × 2) 0.026 0.115 8(2 × 2 × 2) 0.025 0.265 4(2 × 2) 0.033 0.119 8(2 × 2 × 2) 0.018 0.254 4 rules/5th layer 0.018 0.112 6 rules/5th layer 0.015 0.258 4 rules/5th layer 0.018 0.110 6 rules/5th layer 0.017 0.250 4 rules/5th layer 0.018 0.109 8 rules/5th layer 0.016 0.249 4 rules/5th layer 0.021 0.104 8 rules/5th layer
testing set. Using the NOx emission process data, the regression model is

$$y = -163.77341 - 0.06709\,x_1 + 0.00322\,x_2 + 0.00235\,x_3 + 0.26365\,x_4 + 0.20893\,x_5 \qquad (11)$$

It comes with PI = 17.68 and EPI = 19.23. We will be using these results as a reference point when discussing gHFNN models. Table 10 summarizes the parameters of the optimization environment. The parameters used for the optimization of this process model are almost the same as those used in the previous experiments.

Table 10. Summary of the parameters of the optimization environment

(a) In case of using FS FNN
GAs
  Generation                                      150
  Population size                                 100
  Elite population size                           50
  String length, premise structure                10 (per one variable)
  String length, consequence structure, CP 1      3 + 3 + 70
  String length, consequence structure, CP 2      3 + 3 + 35
gHFNN, premise (FNN)
  No. of entire system inputs                     5
  Learning iteration                              1000
  Learning rate tuned, simplified                 0.052
  Learning rate tuned, linear                     0.034
  Momentum coefficient tuned, simplified          0.010
  Momentum coefficient tuned, linear              0.001
  No. of rules                                    10 (2 + 2 + 2 + 2 + 2)
gHFNN, consequence (gPNN)
  No. of system inputs, CP 1                      10
  No. of system inputs, CP 2                      5
  Maximal layer                                   5
  No. of inputs to be selected (N), CP 1          1 ≤ N ≤ 10 (Max)
  No. of inputs to be selected (N), CP 2          1 ≤ N ≤ 5 (Max)
  Type (T)                                        1 ≤ T ≤ 3
N, T: integer

(b) In case of using FR FNN and Type II
GAs
  Generation                                      150
  Population size                                 100
  Elite population size (W)                       50
  String length, premise structure (FNN)          10 (per one variable)
  String length, consequence structure (PNN)      3 + 3 + 105
gHFNN, premise
  No. of entire system inputs                     5
  Learning iteration                              1000
  Learning rate tuned, simplified                 0.568
  Learning rate tuned, linear                     0.651
  Momentum coefficient tuned, simplified          0.044
  Momentum coefficient tuned, linear              0.064
  No. of rules                                    32 (2 × 2 × 2 × 2 × 2)
gHFNN, consequence (gPNN)
  No. of entire inputs                            32
  Maximal layer                                   5
  No. of inputs to be selected (N)                1 ≤ N ≤ 15 (Max)
  Type (T)                                        1 ≤ T ≤ 3
N, T: integer

Table 11 summarizes the detailed results. When using FS FNN, the best results for the network are obtained when using linear fuzzy inference and CP 1 with Type 1 (linear function) and 9 nodes at input; this network comes with PI = 0.0045 and EPI = 0.026. In case of using FR FNN and simplified fuzzy inference, the most preferred network architecture is reported with PI = 0.041 and EPI = 0.111. As shown in Table 11 and Fig. 18, the variation ratio (slope) of the performance of the network changes radically at the 2nd layer. Therefore, to effectively reduce the large number of nodes and avoid a large amount of time-consuming iterations of the gHFNN, the stopping criterion can be set so that the network is built up to at most the 2nd layer.

Table 11. Parameters of the optimization environment and computational effort

(a) In case of using FS FNN

Premise part: Simplified fuzzy inference, 10 rules (MFs: 2 + 2 + 2 + 2 + 2), PI = 22.331, EPI = 19.783
Consequence part:
  CP   Layer   No. of inputs   Type   PI       EPI
  01   1       7               2      0.916    2.014
  01   2       6               2      0.623    1.430
  01   3       9               1      0.477    1.212
  01   4       10              1      0.386    1.077
  01   5       4               2      0.337    1.016
  02   1       5               2      1.072    2.220
  02   2       5               2      0.176    0.291
  02   3       5               3      0.105    0.168
  02   4       3               2      0.060    0.113
  02   5       4               2      0.049    0.081

Premise part: Linear fuzzy inference, 10 rules (MFs: 2 + 2 + 2 + 2 + 2), PI = 8.054, EPI = 12.147
Consequence part:
  CP   Layer   No. of inputs   Type   PI       EPI
  01   1       9               2      0.023    0.137
  01   2       5               2      0.0095   0.044
  01   3       9               1      0.0057   0.029
  01   4       2               2      0.0057   0.027
  01   5       9               1      0.0045   0.026
  02   1       5               2      2.117    4.426
  02   2       5               2      0.875    1.647
  02   3       5               2      0.550    1.144
  02   4       5               3      0.390    0.793
  02   5       3               2      0.340    0.680

(b) In case of using FR FNN

Premise part: Simplified fuzzy inference, 32 rules (MFs: 2 × 2 × 2 × 2 × 2), PI = 0.711, EPI = 1.699
Consequence part:
  Layer   No. of inputs   Type   PI      EPI
  1       13              2      0.149   0.921
  2       12              1      0.065   0.189
  3       11              1      0.046   0.134
  4       8               1      0.044   0.125
  5       3               3      0.041   0.111

Premise part: Linear fuzzy inference, 32 rules (MFs: 2 × 2 × 2 × 2 × 2), PI = 0.079, EPI = 0.204
Consequence part:
  Layer   No. of inputs   Type   PI      EPI
  1       10              2      0.205   1.522
  2       13              1      0.049   0.646
  3       3               3      0.028   0.437
  4       12              1      0.023   0.330
  5       3               3      0.019   0.286

[Figure: performance index (PI and E_PI) plotted against the BP learning iterations of the premise part (FNN) and against the GA generations of the 1st to 5th layers of the consequence part (gPNN). (a) Premise and consequence part. (b) Consequence part (extended).]
Fig. 18. Optimal procedure of gHFNN with FS FNN (linear) by BP and GAs

Table 12 covers a comparative analysis including several previous fuzzy-neural network models. The experimental results reveal that the proposed approach and the resulting model outperform the existing networks in terms of both approximation and generalization capabilities.
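The linear regression model (11), which serves as the reference point in this experiment, can be evaluated with a few lines of Python. The sketch below uses a hypothetical function name; the text does not state which measured variable corresponds to which of x1 to x5, so the inputs are treated simply as an ordered list of five values.

```python
# Coefficients of the regression model (11): intercept first, then x1..x5.
REGRESSION_COEFFS = (-163.77341, -0.06709, 0.00322, 0.00235, 0.26365, 0.20893)


def regression_nox(x):
    """Evaluate the reference regression model (11) for one input vector."""
    if len(x) != 5:
        raise ValueError("the model expects exactly five inputs")
    slopes = REGRESSION_COEFFS[1:]
    return REGRESSION_COEFFS[0] + sum(c * v for c, v in zip(slopes, x))
```

Applying a mean squared error such as (8) to the outputs of this function over the learning and testing sets would give the PI and EPI of the regression baseline.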
6 Concluding remarks

In this study, we have introduced a class of genetically optimized hybrid fuzzy neural networks (gHFNN) regarded as a modeling vehicle for nonlinear and complex systems. The genetically optimized HFNNs are constructed by combining FNNs with gPNNs. In contrast to the conventional HFNN structures and their learning, the proposed model comes with two kinds of rule-based FNNs (viz. FS FNN and FR FNN based on two types of fuzzy inference) as well as a diversity of local characteristics of PNs, which are useful when coping with the various nonlinear characteristics of the system under consideration.
Table 12. Performance analysis of selected models
Model PI EPI Regression model [32] 17.68 19.23 FNN 5.835 Ahn et al. [27] AIM 8.420 Simplified 6.269 8.778 FNN [32] Linear 3.725 5.291 Simplified 2.806 5.164 Multi-FNN [33] Linear 0.720 2.025 Simplified 22.331 19.783 FS FNN Linear 8.054 12.147 Simplified 0.711 1.699 FR FNN Linear 0.079 0.204 Proposed model 0.176 0.291 Simplified gHFNN 0.049 0.081 (FS FNN) 0.0095 0.044 Linear 0.0045 0.026 0.065 0.189 Simplified gHFNN 0.041 0.111 (FR FNN) 0.049 0.646 Linear 0.019 0.286
No. of rules
30(6 + 6 + 6 + 6 + 6) 30(6 + 6 + 6 + 6 + 6) 30(6 + 6 + 6 + 6 + 6) 30(6 + 6 + 6 + 6 + 6) 10(2 + 2 + 2 + 2 + 2) 10(2 + 2 + 2 + 2 + 2) 32(2 × 2 × 2 × 2 × 2) 32(2×2×2×2×2) 10 rules/2nd layer 10 rules/5th layer 10 rules/2nd layer 10 rules/5th layer 32 rules/2nd layer 32 rules/5th layer 32 rules/2nd layer 32 rules/5th layer
The comprehensive design methodology yields a network architecture that is optimized both parametrically and structurally. A few general notes are worth stressing: 1) for the premise structure of the gHFNN, the optimization of the rule-based FNN hinges on genetic algorithms (GAs) and the back-propagation (BP) learning algorithm: the GAs auto-tune the vertices of the membership functions, while the BP algorithm produces optimal parameters of the consequent polynomials of the fuzzy rules through learning; 2) the gPNN, which is the consequent structure of the gHFNN, is based on the extended GMDH algorithm and GAs: the extended GMDH method comprises both a structural phase, realized as a self-organizing and evolutionary algorithm, and a parametric phase driven by least square error (LSE)-based learning. Furthermore, the PNN architecture is optimized genetically, which leads to the selection of the optimal nodes (or PNs) with local characteristics such as the number of input variables, the order of the polynomial, and a specific subset of input variables. In this sense, we have constructed a coherent development platform in which all components of CI are fully utilized. A variety of architectures of the proposed genetically optimized gHFNN have been discussed. The model is inherently dynamic: the use of the genetically optimized PNN (gPNN) as the consequent structure of the overall network is essential to generating the “optimally self-organizing” network by selecting its width and depth. The series of experiments helped compare the network with other models, and we found the network to be of superior quality.
7 Acknowledgement
This work has been supported by KESRI (I-2004-0-074-0-00), which is funded by MOCIE (Ministry of Commerce, Industry and Energy).
References
1. Kang G, Sugeno M (1987) Fuzzy modeling. Trans Soc Instrum Control Eng 23(6cr):106–108
2. Oh SK, Pedrycz W (2000) Fuzzy identification by means of auto-tuning algorithm and its application to nonlinear systems. Fuzzy Sets Syst 115(2):205–230
3. Park BJ, Pedrycz W, Oh SK (2001) Identification of fuzzy models with the aid of evolutionary data granulation. IEE Proc-Control Theory Appl 148(5):406–418
4. Oh SK, Pedrycz W, Park BJ (2002) Hybrid identification of fuzzy rule-based models. Int J Intell Syst 17(1):77–103
5. Park BJ, Oh SK, Ahn TC, Kim HK (1999) Optimization of fuzzy systems by means of GA and weighting factor. Trans Korean Inst Electr Eng 48A(6):789–799 (In Korean)
6. Oh SK, Park CS, Park BJ (1999) On-line modeling of nonlinear process systems using the adaptive fuzzy-neural networks. Trans Korean Inst Electr Eng 48A(10):1293–1302 (In Korean)
7. Narendra KS, Parthasarathy K (1991) Gradient methods for the optimization of dynamical systems containing neural networks. IEEE Trans Neural Netw 2:252–262
8. Goldberg DE (1989) Genetic algorithms in search, optimization & machine learning. Addison-Wesley, Reading
9. Michalewicz Z (1996) Genetic Algorithms + Data Structures = Evolution Programs. Springer, Berlin Heidelberg New York
10. Holland JH (1975) Adaptation in natural and artificial systems. The University of Michigan Press, Ann Arbor
11. Pedrycz W, Peters JF (1998) Computational intelligence and software engineering. World Scientific, Singapore
12. Computational intelligence by programming focused on fuzzy neural networks and genetic algorithms. Naeha, Seoul (In Korean)
13. Horikawa S, Furuhashi T, Uchigawa Y (1992) On fuzzy modeling using fuzzy neural networks with the back propagation algorithm. IEEE Trans Neural Netw 3(5):801–806
14. Oh SK, Pedrycz W (2002) The design of self-organizing polynomial neural networks. Inf Sci 141(3–4):237–258
15. Oh SK, Pedrycz W, Park BJ (2003) Polynomial neural networks architecture: Analysis and Design. Comput Electr Eng 29(6):653–725
16. Ohtani T, Ichihashi H, Miyoshi T, Nagasaka K (1998) Orthogonal and successive projection methods for the learning of neurofuzzy GMDH. Inf Sci 110:5–24
17. Ohtani T, Ichihashi H, Miyoshi T, Nagasaka K (1998) Structural learning with M-Apoptosis in neurofuzzy GMDH. In: Proceedings of the 7th IEEE International Conference on Fuzzy Systems:1265–1270
18. Ichihashi H, Nagasaka K (1994) Differential minimum bias criterion for neurofuzzy GMDH. In: Proceedings of 3rd International Conference on Fuzzy Logic Neural Nets and Soft Computing IIZUKA’94:171–172
19. Park BJ, Pedrycz W, Oh SK (2002) Fuzzy polynomial neural networks: hybrid architectures of fuzzy modeling. IEEE Trans Fuzzy Syst 10(5):607–621
20. Oh SK, Pedrycz W, Park BJ (2003) Self-organizing neurofuzzy networks based on evolutionary fuzzy granulation. IEEE Trans Syst Man and Cybern A 33(2):271–277
21. Cordon O et al. (2004) Ten years of genetic fuzzy systems: current framework and new trends. Fuzzy Sets Syst 141(1):5–31
22. Ivahnenko AG (1968) The group method of data handling: a rival of method of stochastic approximation. Sov Autom Control 13(3):43–55
23. Yamakawa T (1993) A new effective learning algorithm for a neo fuzzy neuron model. 5th IFSA World Conference:1017–1020
24. Oh SK, Yoon KC, Kim HK (2000) The design of optimal fuzzy-neural networks structure by means of GA and an aggregate weighted performance index. J Control, Autom Syst Eng 6(3):273–283 (In Korean)
25. Park MY, Choi HS (1990) Fuzzy control system. Daeyoungsa, Seoul (In Korean)
26. Box GEP, Jenkins GM (1976) Time series analysis, forecasting, and control, 2nd edn. Holden-Day, San Francisco
27. Ahn TC, Oh SK (1997) Intelligent models concerning the pattern of an air pollutant emission in a thermal power plant, Final Report, EESRI
28. Kondo T (1986) Revised GMDH algorithm estimating degree of the complete polynomial. Trans Soc Instrum Control Eng 22(9):928–934
29. Park HS, Oh SK (2003) Multi-FNN identification based on HCM clustering and evolutionary fuzzy granulation. Int J Control, Autom Syst 1(2):194–202
30. Kim E, Lee H, Park M, Park M (1998) A simply identified Sugeno-type fuzzy model via double clustering. Inf Sci 110:25–39
31. Lin Y, Cunningham III GA (1997) A new approach to fuzzy-neural modeling. IEEE Trans Fuzzy Syst 3(2):190–197
32. Oh SK, Pedrycz W, Park HS (2003) Hybrid identification in fuzzy-neural networks. Fuzzy Sets Syst 138(2):399–426
33. Park HS, Oh SK (2000) Multi-FNN identification by means of HCM clustering and its optimization using genetic algorithms. J Fuzzy Logic Intell Syst 10(5):487–496 (In Korean)
34. Park BJ, Oh SK, Jang SW (2002) The design of adaptive fuzzy polynomial neural networks architectures based on fuzzy neural networks and self-organizing networks. J Control Autom Syst Eng 8(2):126–135 (In Korean)
35. Park BJ, Oh SK (2002) The analysis and design of advanced neurofuzzy polynomial networks. J Inst Electron Eng Korea 39-CI(3):18–31 (In Korean)
36. Park BJ, Oh SK, Pedrycz W, Kim HK (2005) Design of evolutionally optimized rule-based fuzzy neural networks on fuzzy relation and evolutionary optimization. International Conference on Computational Science. Lecture Notes in Computer Science 3516:1100–1103
37. Oh SK, Park BJ, Pedrycz W, Kim HK (2005) Evolutionally optimized fuzzy neural networks based on evolutionary fuzzy granulation. Lecture Notes in Computer Science 3483:887–895
38. Oh SK, Park BJ, Pedrycz W, Kim HK (2005) Genetically optimized hybrid fuzzy neural networks in modeling software data. Lecture Notes in Artificial Intelligence 3558:338–345
39. Zadeh NN, Darvizeh A, Jamali A, Moeini A (2005) Evolutionary design of generalized polynomial neural networks for modeling and prediction of explosive forming process. J Mater Process Technol 164(15):1561–1571
40. Delivopoulos E, Theocharis JB (2004) A modified PNN algorithm with optimal PD modeling using the orthogonal least squares method. Inf Sci 168(3):133–170
Genetically Optimized Self-organizing Neural Networks Based on Polynomial and Fuzzy Polynomial Neurons: Analysis and Design Sung-Kwun Oh and Witold Pedrycz
Summary. In this study, we introduce and investigate a class of self-organizing neural networks (SONN) based on a genetically optimized multilayer perceptron with polynomial neurons (PNs) or fuzzy polynomial neurons (FPNs), develop a comprehensive design methodology involving mechanisms of genetic optimization, and carry out a series of numeric experiments. The conventional SONN is based on a self-organizing and evolutionary algorithm rooted in the natural law of survival of the fittest, which is the main characteristic of the extended Group Method of Data Handling (GMDH). During the growth of the network, it uses a polynomial order (viz. linear, quadratic, and modified quadratic) and a number of node inputs that are fixed (selected in advance by the designer) at the nodes (PNs or FPNs) of each layer. Moreover, it does not guarantee that the SONN generated through learning has the optimal network architecture. We distinguish between two kinds of SONN architectures, that is, (a) Polynomial Neuron (PN) based and (b) Fuzzy Polynomial Neuron (FPN) based self-organizing neural networks. This taxonomy is based on the character of each neuron structure in the network. The augmented genetically optimized SONN (gSONN) results in a structurally optimized network and comes with a higher level of flexibility than the conventional SONN. The GA-based design procedure applied at each layer of the SONN leads to the selection of preferred nodes (PNs or FPNs) with specific local characteristics (such as the number of input variables, the order of the polynomial, and a specific subset of input variables) available within the network.
In the sequel, two general optimization mechanisms of the gSONN are explored: the structural optimization is realized via GAs whereas for the ensuing detailed parametric optimization we proceed with a standard least square method-based learning. Each node of the PN based gSONN exhibits a high level of flexibility and realizes a collection of preferred nodes as well as a preferred polynomial type of mapping (linear, quadratic, and modified quadratic) between input and output variables. FPN based gSONN dwells on the ideas of fuzzy rule-based computing and neural networks. The performance of the gSONN is quantified through experimentation that exploits standard data already used in fuzzy or neurofuzzy modeling. These results reveal superiority of the proposed networks over the existing fuzzy and neural models.
S.-K. Oh and W. Pedrycz: Genetically Optimized Self-organizing Neural Networks Based on Polynomial and Fuzzy Polynomial Neurons: Analysis and Design, Studies in Computational Intelligence (SCI) 82, 59–108 (2008) © Springer-Verlag Berlin Heidelberg 2008 www.springerlink.com
1 Introduction
Recently, much attention has been directed towards advanced techniques of complex system modeling. The challenging quest for constructing models that come with significant approximation and generalization abilities and are also easy to comprehend has occupied the community for decades. While neural networks, fuzzy sets and evolutionary computing, as the technologies of Computational Intelligence (CI), have expanded and enriched the field of modeling quite immensely, they have also given rise to a number of new methodological issues and increased our awareness of the tradeoffs one has to make in system modeling [1–4]. The most successful approaches to hybridizing fuzzy systems with learning and adaptation have been made in the realm of CI. In particular, neural fuzzy systems and genetic fuzzy systems hybridize the approximate inference method of fuzzy systems with the learning capabilities of neural networks and evolutionary algorithms [5]. When the dimensionality of the model goes up (say, the number of variables increases), so do the difficulties. Fuzzy sets emphasize the transparency of the models and the role of a model designer whose prior knowledge about the system may be very helpful in facilitating all identification pursuits. On the other hand, building models of substantial approximation capabilities calls for advanced tools. The art of modeling is to reconcile these two tendencies and find a workable and efficient synergistic environment. Moreover, it is also worth stressing that in many cases the nonlinear form of the model acts as a double-edged sword: while we gain flexibility to cope with experimental data, we are provided with an abundance of nonlinear dependencies that need to be exploited in a systematic manner. In particular, when dealing with high-order nonlinear and multivariable equations of the model, we require a vast amount of data for estimating all its parameters [1–2].
To help alleviate these problems, one of the first approaches to a systematic design of nonlinear relationships between a system’s inputs and outputs comes under the name of the Group Method of Data Handling (GMDH). GMDH was developed in the late 1960s by Ivakhnenko [6–9] as a vehicle for identifying nonlinear relations between input and output variables. While providing a systematic design procedure, GMDH comes with some drawbacks. First, it tends to generate quite complex polynomials even for relatively simple systems (experimental data). Second, owing to its limited generic structure (that is, quadratic two-variable polynomials), GMDH also tends to produce an overly complex network (model) when it comes to highly nonlinear systems. Third, if there are fewer than three input variables, the GMDH algorithm does not generate a highly versatile structure. To alleviate the problems associated with GMDH, Self-Organizing Neural Networks (SONN) (viz. polynomial neuron (PN)-based SONN and fuzzy polynomial neuron (FPN)-based SONN, also called SOPNN/FPNN) were introduced by Oh and Pedrycz [10–13] as a new category of neural networks or neuro-fuzzy networks. In a nutshell, these networks come with a high level of flexibility associated with
each node: every processing element forming a Partial Description (PD), viz. a Polynomial Neuron (PN) or a Fuzzy Polynomial Neuron (FPN), can have a different number of input variables as well as exploit a different order of the polynomial (say, linear, quadratic, cubic, etc.). In comparison to well-known neural networks or neuro-fuzzy networks, whose topologies are commonly selected and fixed prior to all detailed (parametric) learning, the SONN architecture is not fixed in advance but is generated and optimized in a dynamic way. As a consequence, SONNs show superb performance in comparison to previously presented intelligent models. Although the SONN has a flexible architecture whose potential can be fully utilized through a systematic design, it is difficult to obtain a structurally and parametrically optimized network because of the limited design of the nodes (viz. PNs or FPNs) located in each layer of the SONN. In other words, when we construct the nodes of each layer in the conventional SONN, such parameters as the number of input variables (nodes), the order of the polynomial, and the input variables available within a node (viz. PN or FPN) are fixed (selected) in advance by the designer. Accordingly, the SONN algorithm exhibits some tendency to produce overly complex networks and incurs a repetitive computation load due to trial and error and/or repeated parameter adjustment by the designer, as in the case of the original GMDH algorithm. In order to generate a structurally and parametrically optimized network, such parameters need to be optimal. In this study, in addressing the above problems with the conventional SONN as well as the GMDH algorithm, we introduce a new genetic design approach; as a consequence, we will be referring to these networks as genetically optimized SONN (gSONN). The determination of the optimal values of the parameters available within an individual PN or FPN (viz. the number of input variables, the order of the polynomial, and a collection of preferred nodes) leads to a structurally and parametrically optimized network. As a result, this network is more flexible and exhibits a simpler topology in comparison to the conventional SONN discussed in previous research. Let us reiterate that the objective of this study is to develop a general design methodology of gSONN modeling, come up with a logic-based structure of such a model, and propose a comprehensive evolutionary development environment in which the optimization of the models can be efficiently carried out at both the structural and the parametric level [14].
This chapter is organized in the following manner. First, Section 2 gives a brief introduction to the architecture and development of the SONNs. Section 3 introduces the genetic optimization used in SONN. Section 4 presents the genetic design of the SONN, with an overall description of a detailed design methodology based on a genetically optimized multi-layer perceptron architecture. In Section 5, we report on a comprehensive set of experiments. Finally, concluding remarks are covered in Section 6. To evaluate the performance of the proposed model, we exploit two well-known time series data sets [10, 11, 13, 15, 16, 20–41]. Furthermore, the network is directly contrasted with several existing neurofuzzy models reported in the literature.
2 The architecture and development of the self-organizing neural networks (SONN)
Proceeding with the overall SONN architecture, essential design decisions have to be made with regard to the number of input variables, the order of the polynomial, and a collection of the specific subset of input variables. We distinguish between two kinds of SONN architectures (PN-based SONN and FPN-based SONN).
2.1 Polynomial Neuron (PN) based SONN and its topology
As underlined, the SONN algorithm is based on the GMDH method and utilizes a class of polynomials such as linear, quadratic, modified quadratic, etc. to describe the basic processing realized there. By choosing the most significant input variables and an order of the polynomial among the various forms available, we can obtain the best one, which comes under the name of a partial description (PD). It is realized by selecting nodes at each layer and eventually generating additional layers until the best performance has been reached. Such a methodology leads to an optimal SONN structure. Let us recall that the input-output data are given in the form
\[
(X_i, y_i) = (x_{1i}, x_{2i}, \ldots, x_{Ni}, y_i), \quad i = 1, 2, 3, \ldots, n, \tag{1}
\]
where N is the number of input variables, i is the index of a data point for each input and output variable, and n denotes the number of data points in the dataset. The input-output relationship for the above data realized by the SONN algorithm can be described in the following manner
\[
y = f(x_1, x_2, \ldots, x_N). \tag{2}
\]
Here, $x_1, x_2, \ldots, x_N$ denote the outputs of the 1st layer of PN nodes (the inputs of the 2nd layer of PN nodes). The estimated output $\hat{y}$ reads as
\[
\hat{y} = c_0 + \sum_{i=1}^{N} c_i x_i + \sum_{i=1}^{N}\sum_{j=1}^{N} c_{ij} x_i x_j + \sum_{i=1}^{N}\sum_{j=1}^{N}\sum_{k=1}^{N} c_{ijk} x_i x_j x_k + \cdots \tag{3}
\]
Here, $C(c_0, c_i, c_{ij}, c_{ijk}, \ldots)$ and $X(x_i, x_j, x_k, \ldots)$, with $i, j, k, \ldots = 1, 2, \ldots, N$, are the vectors of the coefficients and input variables of the resulting multi-input single-output (MISO) system, respectively. The design of the SONN structure proceeds further and involves the generation of some additional layers. These layers consist of PNs (PDs) for which the number of input variables, the polynomial order, and a collection of the specific subset of input variables are genetically optimized across the layers. The detailed PN involving a certain regression polynomial is shown in Table 1.
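As a small worked case (an illustration added here, not part of the original derivation), consider expansion (3) for N = 2 inputs, truncated after the second-order terms and writing $\hat{y}$ for the estimated output. Since $x_1 x_2 = x_2 x_1$, the two cross terms merge into one:

```latex
\[
\hat{y} = c_0 + c_1 x_1 + c_2 x_2 + c_{11} x_1^2 + (c_{12} + c_{21})\, x_1 x_2 + c_{22} x_2^2 .
\]
```

After relabeling the coefficients, this is a two-input second-order polynomial with six free parameters, so at least six data points are needed to estimate them by least squares. More generally, a full second-order polynomial in N inputs has (N + 1)(N + 2)/2 coefficients, which gives 10 for N = 3.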
Table 1. Different forms of regression polynomial building a PN

Order of the polynomial    Number of inputs
                           1            2                3
1 (Type 1)                 Linear       Bilinear         Trilinear
2 (Type 2)                 Quadratic    Biquadratic-1    Triquadratic-1
2 (Type 3)                 Quadratic    Biquadratic-2    Triquadratic-2

The following types of the polynomials are used:
• Bilinear = $c_0 + c_1 x_1 + c_2 x_2$
• Biquadratic-1 (Basic) = Bilinear + $c_3 x_1^2 + c_4 x_2^2 + c_5 x_1 x_2$
• Biquadratic-2 (Modified) = Bilinear + $c_3 x_1 x_2$

[Figure 1: inputs $x_1$ to $x_4$ feed a 1st layer of PNs, followed by a 2nd layer or higher of PNs, producing the output $\hat{y}$. The inset shows a partial description with input variables $z_p$, $z_q$ and polynomial order 2, computing $z = C_0 + C_1 z_p + C_2 z_q + C_3 z_p^2 + C_4 z_q^2 + C_5 z_p z_q$.]
Fig. 1. A general topology of the PN-based SONN: note a biquadratic polynomial in the partial description (z: intermediate variable)
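To make the role of a single PN more concrete, the following minimal sketch fits the six coefficients of a biquadratic (Biquadratic-1) partial description by least squares. It is an illustration under stated assumptions, not the authors' implementation: the function names, the use of the normal equations with a small Gaussian-elimination solver, and the synthetic data are all hypothetical choices made here.

```python
from itertools import product


def biquadratic_features(x1, x2):
    """Regressors of the Biquadratic-1 form: 1, x1, x2, x1^2, x2^2, x1*x2."""
    return [1.0, x1, x2, x1 * x1, x2 * x2, x1 * x2]


def solve_linear_system(a, b):
    """Solve a @ c = b by Gaussian elimination with partial pivoting.

    Assumes the system is nonsingular (no check is made for a zero pivot).
    """
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    coeffs = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][k] * coeffs[k] for k in range(r + 1, n))
        coeffs[r] = (aug[r][n] - tail) / aug[r][r]
    return coeffs


def fit_pn(samples, targets):
    """Least-squares coefficients c0..c5 obtained from the normal equations."""
    rows = [biquadratic_features(x1, x2) for x1, x2 in samples]
    m = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(m)] for i in range(m)]
    aty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(m)]
    return solve_linear_system(ata, aty)


def predict_pn(coeffs, x1, x2):
    """Output z of the PN for the two selected inputs."""
    return sum(c * f for c, f in zip(coeffs, biquadratic_features(x1, x2)))


# Synthetic, noise-free data from a known biquadratic surface (illustrative only).
true_c = [1.0, 2.0, -1.0, 0.5, 0.25, -0.75]
samples = [(float(a), float(b)) for a, b in product(range(-2, 3), repeat=2)]
targets = [predict_pn(true_c, x1, x2) for x1, x2 in samples]
fitted = fit_pn(samples, targets)
```

On noise-free data the fitted coefficients should reproduce `true_c` up to rounding. For ill-conditioned data, a QR- or SVD-based solver would be a more robust choice than the normal equations used in this sketch.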
The architecture of the PN based SONN is visualized in Fig. 1. The structure of the SONN is genetically optimized on the basis of the design alternatives available within a PN occurring in each layer. In the sequel, the SONN embraces diverse topologies of PN being selected on the basis of the number of input variables, the order of the polynomial, and a collection of the specific subset of input variables (as shown in Table 1). The choice of the number of input variables, the polynomial order, and input variables available within each node itself helps select the best model with
respect to the characteristics of the data, model design strategy, nonlinearity and predictive capabilities.
2.2 Fuzzy Polynomial Neuron (FPN) based SONN and its topology
In this section, we introduce a fuzzy polynomial neuron (FPN). This neuron, regarded as a generic type of processing unit, dwells on the concepts of fuzzy sets and neural networks. We show that the FPN encapsulates a family of nonlinear “if-then” rules. When arranged together, FPNs build a self-organizing neural network (SONN). In the sequel, we investigate the architectures arising therein.
2.2.1 Fuzzy polynomial neuron (FPN)
As visualized in Fig. 2, the FPN consists of two basic functional modules. The first one, labeled F, is a collection of fuzzy sets (here $A_l$ and $B_k$) that form an interface between the input numeric variables and the processing part realized by the neuron. Here $x_q$ and $x_p$ denote input variables. The second module (denoted here by P) concerns the function-based nonlinear (polynomial) processing. This nonlinear processing involves some input variables ($x_i$ and $x_j$). Quite commonly, we will be using a polynomial form of the nonlinearity, hence the name of the fuzzy polynomial processing unit. The use of polynomials is motivated by their generality. In particular, they include constant and linear mappings as special cases (which are used quite often in rule-based systems). In other words, the FPN realizes a family of multiple-input single-output rules.

[Figure 2: FPN block diagram. The inputs $x_p$, $x_q$ enter the module F with fuzzy sets $\{A_l\}$, $\{B_k\}$, which yields the activation levels $\mu_1, \ldots, \mu_K$ and their normalized versions $\hat{\mu}_1, \ldots, \hat{\mu}_K$. The inputs $x_i$, $x_j$ enter the module P with the polynomials $P_1, \ldots, P_K$. The weighted polynomial outputs are summed ($\Sigma$) to give z.]
Fig. 2. A general topology of the generic FPN module (F: fuzzy set-based processing part, P: the polynomial form of mapping)

Each rule, refer again to Fig. 2, reads in the form
\[
\text{if } x_p \text{ is } A_l \text{ and } x_q \text{ is } B_k \text{ then } z \text{ is } P_{lk}(x_i, x_j, a_{lk}) \tag{4}
\]
where $a_{lk}$ is a vector of the parameters of the conclusion part of the rule while $P_{lk}(x_i, x_j, a_{lk})$ denotes the regression polynomial forming the consequence part of the fuzzy rule, which uses several types of high-order polynomials (linear, quadratic, and modified quadratic) besides the constant function forming the simplest version of the consequence; refer to Table 2. The types of the polynomial read as follows:
• Bilinear = $c_0 + c_1 x_1 + c_2 x_2$
• Biquadratic-1 = Bilinear + $c_3 x_1^2 + c_4 x_2^2 + c_5 x_1 x_2$
• Biquadratic-2 = Bilinear + $c_3 x_1 x_2$
• Trilinear = $c_0 + c_1 x_1 + c_2 x_2 + c_3 x_3$
• Triquadratic-1 = Trilinear + $c_4 x_1^2 + c_5 x_2^2 + c_6 x_3^2 + c_7 x_1 x_2 + c_8 x_1 x_3 + c_9 x_2 x_3$
• Triquadratic-2 = Trilinear + $c_4 x_1 x_2 + c_5 x_1 x_3 + c_6 x_2 x_3$

Table 2. Different forms of the regression polynomials standing in the consequence part of the fuzzy rules

Order of the polynomial    Number of inputs
                           1            2                3
0 (Type 1)                 Constant     Constant         Constant
1 (Type 2)                 Linear       Bilinear         Trilinear
2 (Type 3)                 Quadratic    Biquadratic-1    Triquadratic-1
2 (Type 4)                 Quadratic    Biquadratic-2    Triquadratic-2

1: Basic type, 2: Modified type
Alluding to the input variables of the FPN, especially the way in which they interact with the two functional blocks shown there, we use the notation FPN($x_p, x_q$; $x_i, x_j$) to explicitly point at the variables. The processing of the FPN is governed by the following expressions, which are in line with the rule-based computing existing in the literature [15, 16]. The activation of the rule “K” is computed as an and-combination of the activations of the fuzzy sets occurring in the rule. This combination of the subconditions is realized through any t-norm. In particular, we consider the minimum and product operations as two widely used models of the logic connectives. Subsequently, denote the resulting activation level of the rule by $\mu_K$. The activation levels of the rules contribute to the output of the FPN, which is computed as a weighted average of the individual conclusion parts (functional transformations) $P_K$ (note that the index of the rule, namely “K”, is a shorthand notation for the two indexes of fuzzy sets used in the rule (4), that is K = (l, k)):
\[
z = \frac{\sum_{K=1}^{\text{all rules}} \mu_K P_K(x_i, x_j, a_K)}{\sum_{K=1}^{\text{all rules}} \mu_K} = \sum_{K=1}^{\text{all rules}} \hat{\mu}_K P_K(x_i, x_j, a_K) \tag{5}
\]
In the above expression, we use an abbreviated notation to describe the normalized activation level of the “K”th rule, which is of the form
\[
\hat{\mu}_K = \frac{\mu_K}{\sum_{L=1}^{\text{all rules}} \mu_L} \tag{6}
\]
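The computation described by rule (4) and expressions (5) and (6) can be sketched as follows. This is a hedged illustration rather than the chapter's implementation: the triangular membership functions, the linear consequence polynomials, and all function and variable names are assumptions introduced here; only the minimum and product t-norms and the normalized weighted sum come from the text.

```python
from itertools import product


def triangular(x, left, peak, right):
    """Triangular membership grade of x (an assumed MF shape)."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


def fpn_output(xp, xq, xi, xj, sets_a, sets_b, coeffs, t_norm="product"):
    """Evaluate one FPN: rule activations, normalization, weighted sum.

    sets_a, sets_b: lists of (left, peak, right) triples for A_l and B_k.
    coeffs[(l, k)]: (a0, a1, a2) of a linear consequence a0 + a1*xi + a2*xj.
    """
    activations = {}
    for (l, a), (k, b) in product(enumerate(sets_a), enumerate(sets_b)):
        mu_a = triangular(xp, *a)
        mu_b = triangular(xq, *b)
        # and-combination of the two subconditions through a t-norm
        activations[(l, k)] = min(mu_a, mu_b) if t_norm == "min" else mu_a * mu_b
    total = sum(activations.values())
    if total == 0.0:
        raise ValueError("no rule is activated for this input")
    z = 0.0
    for key, mu in activations.items():
        a0, a1, a2 = coeffs[key]
        # normalized activation (6) times the rule's polynomial, summed as in (5)
        z += (mu / total) * (a0 + a1 * xi + a2 * xj)
    return z


# Two triangular fuzzy sets per input covering [0, 1] (hypothetical values).
sets = [(-1.0, 0.0, 1.0), (0.0, 1.0, 2.0)]
constant_rules = {(0, 0): (1.0, 0.0, 0.0), (0, 1): (2.0, 0.0, 0.0),
                  (1, 0): (3.0, 0.0, 0.0), (1, 1): (4.0, 0.0, 0.0)}
z_product = fpn_output(0.25, 0.5, 0.25, 0.5, sets, sets, constant_rules)
z_min = fpn_output(0.25, 0.5, 0.25, 0.5, sets, sets, constant_rules, t_norm="min")
```

With two fuzzy sets per premise variable the neuron holds four rules, one per pair (l, k). Because the activations are normalized, identical consequences in all rules return that common value regardless of the t-norm, which is a convenient sanity check.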
2.2.2 The topology of the fuzzy polynomial neuron (FPN) based SONN The topology of the FPN based SONN implies the ensuing learning mechanisms; in the description below we indicate some of these learning issues that permeate the overall architecture. First, the network is homogeneous in the sense it is constructed with the use of the FPNs. It is also heterogeneous in the sense that FPNs can be very different (as far as the detailed architecture is concerned) and this contributes to the generality of the architecture. The network may contain a number of hidden layers each of them of a different size (number of nodes). The nodes may have a different number of inputs and this triggers a certain pattern of connectivity of the network. The FPN itself promotes a number of interesting design options, see Fig. 3. These alternatives distinguish between two categories such as designer-based and GA-based. The former concerns a choice of the membership function (MF) type, the consequent input structure of the fuzzy rules, and the number of MFs per each input variable. The latter is related to a choice of the number of inputs, and a collection of the specific subset of input variables and its associated order of the polynomial realizing a consequence part of the rules based on fuzzy inference method. Proceeding with the FPN-based SONN architecture, see Fig. 4, essential design decisions have to be made with regard to the number of input variables
Fig. 3. The design alternatives available within a single FPN
Fig. 4. Configuration of the topology of the FPN-based SONN

Table 3. Polynomial type according to the number of input variables in the conclusion part of fuzzy rules

Type of the consequence    Selected input variables    Selected input variables      Entire system
polynomial                 in the premise part         in the consequence part       input variables
Type T                     A                           A                             B
Type T*                    A                           B                             B
and the order of the polynomial forming the conclusion part of the rules as well as a collection of the specific subset of input variables. The consequence part can be expressed by a linear, quadratic, or modified quadratic polynomial, as mentioned previously. For the consequence part in particular, we consider two kinds of input vector formats in the conclusion part of the fuzzy rules of the 1st layer, namely i) selected inputs and ii) entire system inputs, see Table 3.
i) The input variables of the consequence part of the fuzzy rules are the same as the input variables of the premise part.
ii) The input variables of the consequence part of the fuzzy rules in a node of the 1st layer are the same as the entire system input variables, and the input variables of the consequence part of the fuzzy rules in a node of the 2nd layer or higher are the same as the input variables of the premise part.
The notation is as follows. A: vector of the selected input variables $(x_1, x_2, \ldots, x_i)$; B: vector of the entire system input variables $(x_1, x_2, \ldots, x_i, x_j, \ldots)$; Type T: $f(A) = f(x_1, x_2, \ldots, x_i)$, the type of polynomial function standing in the consequence part of the fuzzy rules; Type T*: $f(B) = f(x_1, x_2, \ldots, x_i, x_j, \ldots)$, the type of polynomial function occurring in the consequence part of the fuzzy rules.
As shown in Table 3, A and B describe the vector of the selected input variables and the entire collection (set) of the input variables, respectively. Proceeding with each layer of the SONN, the design alternatives available within a single FPN can be carried out with regard to the entire collection of the input variables or its selected subset, as they occur in the consequence part of the fuzzy rules encountered at the 1st layer. Following these criteria, we distinguish between two fundamental types (Type T, Type T*), namely
Type T: the input variables in the conditional part of the fuzzy rules are kept as those in the conclusion part of the fuzzy rules ($Z_i$ of (A) FPN($x_p, x_q$; $x_i, x_j$)).
Type T*: the entire collection of the input variables is kept as the input variables in the conclusion part of the fuzzy rules ($Z_i$ of (B) FPN($x_p, x_q$; $x_1, x_2, x_3, x_4$)).
In the Fuzzy Polynomial Neuron (FPN) shown in Fig. 4, the variables in FPN(•) are enumerated in the form of two lists separated by a semicolon. The former and latter parts denote the premise input variables ($x_p, x_q$) and the input variables ($x_p, x_q$ or $x_1, x_2, x_3, x_4$) of the consequence regression polynomial of the fuzzy rules, respectively. In other words, $x_p$ and $x_q$ in both the former and latter parts stand for the selected input variables used in both the premise and consequence parts of the fuzzy rules, and $x_1$, $x_2$, $x_3$, and $x_4$ in the latter part stand for the system input variables used in the consequence polynomial of the fuzzy rules.
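The distinction between Type T and Type T* can be restated as a small selection rule. The helper below is a hypothetical sketch (its name, arguments, and the variable labels in the usage lines are assumptions); it only encodes items i) and ii) above: Type T always reuses the premise variables, whereas Type T* uses the entire system input vector in the 1st layer and the premise variables from the 2nd layer onward.

```python
def consequence_inputs(premise_inputs, system_inputs, layer, poly_type):
    """Return the names of the variables fed to the consequence polynomial.

    poly_type: "T"  -> same variables as the premise part (vector A);
               "T*" -> entire system inputs (vector B) in the 1st layer,
                       premise variables in the 2nd layer or higher.
    """
    if poly_type not in ("T", "T*"):
        raise ValueError("poly_type must be 'T' or 'T*'")
    if poly_type == "T*" and layer == 1:
        return list(system_inputs)
    return list(premise_inputs)


# Hypothetical variable names for a four-input system.
system = ["x1", "x2", "x3", "x4"]
type_t_layer1 = consequence_inputs(["x1", "x3"], system, layer=1, poly_type="T")
type_tstar_layer1 = consequence_inputs(["x1", "x3"], system, layer=1, poly_type="T*")
type_tstar_layer2 = consequence_inputs(["z1", "z2"], system, layer=2, poly_type="T*")
```

In this reading, Type T* enlarges the consequence polynomial only in the 1st layer, where the original system inputs are still directly available.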
3 Genetic optimization of SONN
The task of optimizing any complex model involves two main phases. First, a class of optimization algorithms has to be chosen so that it meets the requirements implied by the problem at hand. Second, various parameters of the optimization algorithm need to be tuned in order to achieve its best performance. Genetic algorithms (GAs) are optimization techniques based on the principles of natural evolution. In essence, they are search algorithms that use operations found in natural genetics to guide a comprehensive search over the parameter space. GAs have been theoretically and empirically demonstrated to provide robust search capabilities in complex spaces, thus offering a valid solution strategy for problems requiring efficient and effective searching. For optimization applied to real-world problems, many methods are available, including gradient-based search and direct search, which are based on techniques of mathematical programming [17]. In contrast to these, genetic algorithms are aimed at stochastic global search and involve a structured information exchange [18]. It is instructive to highlight the main features that set GAs apart from some other optimization methods: (1) GA operates on the codes of the variables, but not the variables themselves. (2) GA searches for optimal points starting from a group (population) of points in the search space (potential solutions), rather than a single point. (3) GA’s search
is directed only by a fitness function whose form can be quite complex; we do not require it to be differentiable. In this study, for the optimization of the SONN model, the GA uses serial binary coding, roulette-wheel selection, one-point crossover, and binary inversion (complementation) as the mutation operator. To retain the best individual and carry it over to the next generation, we use the elitist strategy [19]. The overall genetically-driven structural optimization process of the SONN is shown in Figs. 5–6. As mentioned, when we construct the PNs or FPNs of each layer in the conventional SONN, such parameters as the number of input variables (nodes),
Fig. 5. Overall genetically-driven structural optimization process of the PN-based SONN
Fig. 6. Overall genetically-driven structural optimization process of FPN-based SONN
Fig. 7. A general flow of genetic design of SONNs
the order of the polynomial, and the input variables available within a PN or an FPN are fixed (selected) in advance by the designer. This has frequently made it difficult to design the optimal network. To overcome this drawback, we resort to genetic optimization; see Fig. 7 for a more detailed flow of the development activities.
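The GA operators named above (roulette-wheel selection, one-point crossover, bit-inversion mutation, and the elitist strategy) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the list-of-bits chromosome representation, and the way the elite replaces the worst child are assumptions made for illustration, and fitness values are assumed to be positive.

```python
import random

def roulette_wheel_select(population, fitnesses, rng):
    """Draw one chromosome with probability proportional to its fitness."""
    pick = rng.uniform(0.0, sum(fitnesses))
    running = 0.0
    for chromosome, fit in zip(population, fitnesses):
        running += fit
        if pick <= running:
            return chromosome
    return population[-1]  # guard against floating-point round-off

def one_point_crossover(parent_a, parent_b, rng):
    """Swap the tails of two equal-length bit lists at a random cut point."""
    cut = rng.randint(1, len(parent_a) - 1)
    return parent_a[:cut] + parent_b[cut:], parent_b[:cut] + parent_a[cut:]

def invert_mutation(chromosome, rate, rng):
    """Flip (complement) each bit independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in chromosome]

def next_generation(population, fitness_fn, crossover_rate, mutation_rate, rng):
    """One GA generation; the best parent replaces the worst child (elitism)."""
    fitnesses = [fitness_fn(c) for c in population]
    elite = population[max(range(len(population)), key=fitnesses.__getitem__)]
    children = []
    while len(children) < len(population):
        a = roulette_wheel_select(population, fitnesses, rng)
        b = roulette_wheel_select(population, fitnesses, rng)
        if rng.random() < crossover_rate:
            a, b = one_point_crossover(a, b, rng)
        children.append(invert_mutation(a, mutation_rate, rng))
        children.append(invert_mutation(b, mutation_rate, rng))
    children = children[:len(population)]
    child_fitnesses = [fitness_fn(c) for c in children]
    worst = min(range(len(children)), key=child_fitnesses.__getitem__)
    children[worst] = list(elite)  # elitist strategy
    return children
```

Because the elite always survives, the best fitness in the population cannot decrease from one generation to the next.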
4 The algorithm and design procedure of genetically optimized SONN (gSONN)

The genetically-driven SONN comes with a highly versatile architecture, both in the flexibility of the individual nodes and in the interconnectivity between the nodes and the organization of the layers. Evidently, these features contribute to the significant flexibility of the networks, yet they require a prudent design methodology and a well-thought-out learning mechanism based on genetic optimization. Let us stress that there are several important differences that make this architecture distinct from the panoply of the well-known neurofuzzy architectures existing in the literature. The most important is that the
learning of SONNs dwells on the extended GMDH algorithm, which is crucial to the structural development of the network. The extended GMDH method comprises both a structural phase, consisting of self-organization and an evolutionary algorithm (rooted in the natural law of survival of the fittest), and a parametric phase of Least Square Estimation (LSE)-based learning. The structural and parametric optimization therefore help utilize the hybrid method (combining GAs with the structural phase of GMDH) and the LSE-based technique in the most efficient way. Overall, the design procedure of the SONN based on the genetically optimized multi-layer perceptron architecture comprises the following steps.
[Step 1] Determine system's input variables.
[Step 2] Using available experimental data, form a training and testing data set.
[Step 3] Decide initial information for constructing the SONN structure.
[Step 4] Decide a structure of the PN or FPN based SONN using genetic design.
• Selection of the number of input variables
• Selection of the polynomial order (PN) or the polynomial order of the consequent part of fuzzy rules (FPN)
• Selection of a collection of the specific subset of input variables
[Step 5] Estimate the coefficient parameters of the polynomial in the selected node.
• In case of a PN: estimate the coefficients of the polynomial assigned to the selected node (PN)
• In case of a FPN: carry out fuzzy inference and coefficient estimation for fuzzy identification in the selected node (FPN)
[Step 6] Select nodes (PNs or FPNs) with the best predictive capability and construct their corresponding layer.
[Step 7] Check the termination criterion.
[Step 8] Determine new input variables for the next layer.
In what follows, we describe each of these steps in more detail.
[Step 1] Determine system's input variables.
Define the system's input variables xi (i = 1, 2, · · · , n) related to the output variable y. If required, the input data are normalized as well.
[Step 2] Form a training and testing data.
The input-output data set (xi, yi) = (x1i, x2i, · · · , xni, yi), i = 1, 2, . . . , N (with N being the total number of data points) is divided into two parts, that is, a training and a testing data set. Denote their sizes by Nt and Nc, respectively. Obviously, we have N = Nt + Nc. The training data set is used to construct
the SONN. Next, the testing data set is used to evaluate the quality of the network.
[Step 3] Decide initial information for constructing the SONN structure.
We decide upon the design parameters of the SONN structure, which include:
a) The stopping criterion, for which two termination methods are exploited:
– A criterion level for comparing the minimal identification error of the current layer with that occurring at the previous layer of the network.
– The maximum number of layers (predetermined by the designer), with the intent of achieving a sound balance between model accuracy and complexity.
b) The maximum number of input variables coming to each node in the corresponding layer.
c) The total number W of nodes to be retained (selected) at the next generation of the SONN algorithm.
d) The depth of the SONN, selected to reduce the conflict between overfitting and the generalization abilities of the developed SONN.
e) The depth and width of the SONN, selected as a result of a tradeoff between accuracy and complexity of the overall model.
In addition, in case of the FPN-based SONN, the following parameters are considered besides those mentioned above.
f) The initial information for the fuzzy inference method and fuzzy identification:
– Fuzzy inference method
– MF type: triangular or Gaussian-like MF
– No. of MFs per input of a node (or FPN)
– Structure of the consequence part of the fuzzy rules
[Step 4] Decide a structure of the PN or FPN based SONN using genetic design.
This concerns the selection of the number of input variables, the polynomial order, and the input variables to be assigned to each node of the corresponding layer. These important decisions are carried out through an extensive genetic optimization. As for the organization of the chromosome representing (mapping) the structure of the SONN, we divide the chromosome used for genetic optimization into three sub-chromosomes, as shown in Figs. 8–9. The 1st sub-chromosome contains the number of input variables, the 2nd sub-chromosome involves the order of the polynomial of the node, and the 3rd sub-chromosome (remaining bits) contains the input variables coming to the corresponding node (PN or FPN). All these elements are optimized when running the GA. For the nodes (PNs or FPNs) of each layer of the SONN, we adhere to the notation of Fig. 10. ‘PNn’ or ‘FPNn’ denotes the nth node (PN or FPN) of the
Fig. 8. The PN design used in the SONN architecture - structural considerations and mapping the structure on a chromosome
Fig. 9. The FPN design used in the SONN architecture - structural considerations and mapping the structure on a chromosome
[Figure: the n-th polynomial or fuzzy polynomial neuron (PN or FPN), drawn as a node labelled PNn or FPNn with inputs xi and xj and output z; the label N gives the number of inputs and T the polynomial order (Type T).]
Fig. 10. Formation of each PN or FPN in SONN architecture
corresponding layer, ‘N’ denotes the number of nodes (inputs or PNs/FPNs) coming to the corresponding node, and ‘T’ denotes the order of the polynomial used in the corresponding node. The genetic design of the three types of parameters available within the PN or the FPN is structured into the following sub-steps.
[Step 4-1] Selection of the number of input variables (1st sub-chromosome)
Sub-step 1) The first 3 bits of the given chromosome are assigned to the selection of the number of input variables. The size of this bit structure depends on the number of input variables. For example, in case of 3 bits, the maximum number of input variables is limited to 7.
Sub-step 2) The 3 randomly selected bits are decoded into a decimal value as
$$\beta = (2^2 \times \mathrm{bit}(3)) + (2^1 \times \mathrm{bit}(2)) + (2^0 \times \mathrm{bit}(1)) \qquad (7)$$
Here, bit(1), bit(2) and bit(3) denote the locations of these three bits, each of which takes the value “0” or “1”.
Sub-step 3) The above decimal value is normalized and rounded off,
$$\gamma = (\beta/\alpha) \times (\mathrm{Max} - 1) + 1 \qquad (8)$$
where Max denotes the maximal number of input variables entering the corresponding node (PN or FPN), while α is the decoded decimal value corresponding to the situation when all bits of the 1st sub-chromosome are set to 1.
Sub-step 4) The normalized integer value is then treated as the number of input variables (or input nodes) coming to the corresponding node. Evidently, the maximal number (Max) of input variables is equal to or less than the number of all system's input variables (x1, x2, · · · , xn) coming to the 1st layer, that is, Max ≤ n.
[Step 4-2] Selection of the order of polynomial (2nd sub-chromosome)
Sub-step 1) The 3 bits of the 2nd sub-chromosome are assigned to the selection of the order of the polynomial.
Sub-step 2) The 3 randomly selected bits are decoded into a decimal value using (7).
Sub-step 3) The decimal value is normalized by means of (8) and rounded off. The value of Max is replaced with 3 (in case of PN) or 4 (in case of FPN); refer to Tables 1 and 2.
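As a worked illustration of (7) and (8), consider the two 3-bit sub-chromosomes 111 and 110 that appear in the example of Fig. 13, both with Max = 3. Rounding to the nearest integer is assumed here, since the text only states that the value is rounded off.

```latex
\alpha = 2^2 + 2^1 + 2^0 = 7
% 1st sub-chromosome, bits 111:
\beta = 2^2 \cdot 1 + 2^1 \cdot 1 + 2^0 \cdot 1 = 7, \qquad
\gamma = \tfrac{7}{7} \times (3 - 1) + 1 = 3
% 2nd sub-chromosome, bits 110:
\beta = 2^2 \cdot 1 + 2^1 \cdot 1 + 2^0 \cdot 0 = 6, \qquad
\gamma = \tfrac{6}{7} \times (3 - 1) + 1 \approx 2.71 \;\rightarrow\; 3
```

Both values agree with the normalization results shown in Fig. 13 (three input variables and polynomial order Type 3).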
Sub-step 4) The normalized integer value is taken as the selected polynomial order when constructing each node of the corresponding layer.
[Step 4-3] Selection of input variables (3rd sub-chromosome)
Sub-step 1) The remaining bits are assigned to the selection of input variables.
Sub-step 2) The remaining bits are divided by the value obtained in step 4-1. If the number of bits is not evenly divisible, we apply the following rule. For example, if 22 bits remain and the number of input variables obtained in step 4-1 is 4, the 1st, 2nd, and 3rd bit structures (spaces) for the selection of input variables are assigned 6 bits each, and the last (4th) bit structure is assigned the remaining 4 bits.
Sub-step 3) Each bit structure is decoded into decimal (through relationship (7)).
Sub-step 4) Each decimal value obtained in sub-step 3 is then normalized following (8) and rounded off. Here, Max is replaced with the total number of inputs (viz. input variables or input nodes) in the corresponding layer, n (or W). Note that the total number of inputs is the number of the overall system's inputs, n, in the 1st layer, and the number of the selected nodes, W, serving as the output nodes of the preceding layer, in the 2nd layer or higher.
Sub-step 5) The normalized integer values are then taken as the selected input variables when constructing each node of the corresponding layer. Here, if the same input variable is selected more than once (viz. the same input number), the duplicates are treated as a single input variable and the redundant copies are discarded.
[Step 5] Estimate the coefficient parameters of the polynomial in the selected node (PN or FPN).
[Step 5-1] In case of a PN
The vector of coefficients Ci is derived by minimizing the mean squared error between yi and zmi,
$$E = \frac{1}{N_{tr}} \sum_{i=1}^{N_{tr}} (y_i - z_{mi})^2 \qquad (9)$$
Using the training data subset, this gives rise to the set of linear equations
$$Y = X_i C_i \qquad (10)$$
Evidently, the coefficients of the PN of nodes in each layer are expressed in the form
$$C_i = (X_i^T X_i)^{-1} X_i^T Y \qquad (11)$$
where
$$Y = [y_1\ y_2\ \cdots\ y_{n_{tr}}]^T, \quad X_i = [X_{1i}\ X_{2i}\ \cdots\ X_{ki}\ \cdots\ X_{n_{tr} i}]^T,$$
$$X_{ki}^T = [1\ x_{ki1}\ x_{ki2}\ \cdots\ x_{kin}\ \cdots\ x_{ki1}^m\ x_{ki2}^m\ \cdots\ x_{kin}^m], \quad C_i = [c_{0i}\ c_{1i}\ c_{2i}\ \cdots\ c_{n'i}]^T,$$
[Figure: plots of the Small and Big fuzzy sets over the interval from x_min to x_max, with the membership axis marked at 0, 0.5 and 1, for two membership function types.
(a) Triangular MF: $\mu_S(x) = 1$ if $x \le x_{min}$; $\mu_S(x) = (x_{max} - x)/(x_{max} - x_{min})$ if $x_{min} < x < x_{max}$; $\mu_S(x) = 0$ if $x \ge x_{max}$; $\mu_B(x) = 1 - \mu_S(x)$.
(b) Gaussian-like MF: Small and Big sets with $\mu_B(x) = 1 - \mu_S(x)$, built from $\mu(x) = e^{-0.6931\,((x - c)/\sigma)^2}$, with case distinctions on the position of the center c relative to x_min and x_max.]
Fig. 11. Triangular and Gaussian-like membership functions and their parameters
with the following notation: i is the node number, k the data number, ntr the number of data in the training subset, n the number of the selected input variables, m the maximum order, and n′ the number of estimated coefficients.
[Step 5-2] In case of a FPN
At this step, the regression polynomial inference is considered. The inference method deals with regression polynomial functions viewed as the consequents of the rules. The regression polynomials (or, in the very specific case, a constant value) standing in the conclusion part of the fuzzy rules come as Type 1, 2, 3, or 4; see Table 2. In the fuzzy inference, we consider two types of membership functions, namely triangular and Gaussian-like membership functions. The regression fuzzy inference (reasoning scheme) is envisioned as follows: the consequence part can be expressed by a linear, quadratic, or modified quadratic polynomial equation, as shown in Table 2. The use of the regression polynomial inference method gives rise to the expression
$$R_i:\ \text{If } x_1 \text{ is } A_{i1}, \cdots, x_k \text{ is } A_{ik} \text{ then } y_i = f_i(x_1, x_2, \cdots, x_k) \qquad (12)$$
where Ri is the i-th fuzzy rule, xl (l = 1, 2, · · · , k) is an input variable, Aik is a membership function of the fuzzy sets, k denotes the number of the input variables, and fi(•) is a regression polynomial function of the input variables. The numeric output of the model is calculated in the well-known form
$$\hat{y} = \frac{\sum_{i=1}^{n} \mu_i f_i(x_1, x_2, \cdots, x_k)}{\sum_{i=1}^{n} \mu_i} = \sum_{i=1}^{n} \hat{\mu}_i f_i(x_1, x_2, \cdots, x_k) \qquad (13)$$
where n is the number of the fuzzy rules, $\hat{y}$ is the inferred value, $\mu_i$ is the premise fitness of Ri, and $\hat{\mu}_i$ is the normalized premise fitness of $\mu_i$. Here we consider a regression polynomial function of the input variables. The consequence parameters are produced by the standard least squares method, that is,
$$\hat{a}_j = (X_j^T X_j)^{-1} X_j^T Y \qquad (14)$$
where
$$a_j^T = [a_{10j}, \cdots, a_{n0j}, a_{11j}, \cdots, a_{n1j}, \cdots, a_{1kj}, \cdots, a_{nkj}], \quad X_j = [x_{1j}, x_{2j}, \cdots, x_{ij}, \cdots, x_{mj}]^T,$$
$$x_{lj}^T = [\hat{\mu}_{lj1}, \cdots, \hat{\mu}_{ljn}, x_{lj1}\hat{\mu}_{lj1}, \cdots, x_{lj1}\hat{\mu}_{ljn}, \cdots, x_{ljk}\hat{\mu}_{lj1}, \cdots, x_{ljk}\hat{\mu}_{ljn}], \quad Y = [y_1, \cdots, y_m]^T.$$
Here, j is the node number, m denotes the total number of data points, k stands for the number of the input variables, l is the data index, and n is the number of the fuzzy rules. The procedure described above is implemented iteratively for all nodes of the layer and also for all layers of the SONN; we start from the input layer and move towards the output layer.
[Step 6] Select nodes (PNs or FPNs) with the best predictive capability and construct their corresponding layer.
Fig. 12 depicts the genetic optimization procedure for the generation of the optimal nodes in the corresponding layer. As shown in Fig. 12, all nodes of the corresponding layer of the SONN architecture are constructed through the genetic optimization. The generation process can be organized as the following sequence of steps.
Sub-step 1) We set up the initial genetic information necessary for the generation of the SONN architecture. This concerns the number of generations and populations, the mutation rate, the crossover rate, and the length of the chromosome.
Sub-step 2) The nodes (PNs or FPNs) are generated through the genetic design. Here, a single population assumes the same role as the node (PN or FPN) in the SONN architecture and its underlying processing is visualized
[Figure: flow of the genetic optimization over Generations 1, 2 and 3. In each generation, the population (1~T) forms an array of nodes (PNs or FPNs: 1~T); the fitness of the nodes is evaluated, the optimal node with the highest fitness value is selected, and a selected population of preferred nodes (PNs or FPNs: 1~W) is formed; reproduction (roulette-wheel selection, one-point crossover, invert mutation) produces the next population. From the 2nd generation on, the elitist strategy is applied, and the selected populations of the previous and the current generation are merged into an array of nodes (1~2W) from which the preferred nodes (1~W) are selected. T: total population size; W: selected population size. Arrow legend: GA operation flow; optimization operation of SOFPNN within the GA flow.]
Fig. 12. The genetic optimization used in the generation of the optimal nodes in the given layer of the network
in Figs. 8–9. The optimal parameters of the corresponding polynomial are computed by the standard least squares method.
Sub-step 3) The nodes (PNs or FPNs) are constructed using the training dataset, and their performance is evaluated using the testing dataset. Based on this performance index, we calculate the fitness function, which reads as
$$F\ (\text{fitness function}) = \frac{1}{1 + EPI} \qquad (15)$$
where EPI denotes the performance index for the testing data (or validation data). That is, the model is built from the training data, and EPI is computed on the testing data (or validation data) for this model.
Sub-step 4) To move on to the next generation, we carry out selection, crossover, and mutation operations using the initial genetic information and the fitness values obtained in sub-step 3.
Sub-step 5) The nodes (PNs or FPNs) are rearranged in descending order of their calculated fitness values (F1, F2, · · · , Fz). Among the rearranged nodes, we unify those with duplicated fitness values (viz. the case where one node has the same fitness value as other nodes). We then choose several nodes (PNs or FPNs) characterized by the best fitness values. Here, we use the pre-defined number W of nodes (PNs or FPNs) with better predictive capability that need to be preserved to assure an optimal operation at the next iteration of the SONN algorithm. The outputs
of the retained nodes serve as inputs to the next layer of the network. There are two cases as to the number of the retained nodes:
(i) If W* < W, then the number of the nodes (PNs or FPNs) retained for the next layer is equal to W*. Here, W* denotes the number of nodes remaining in each layer after the nodes with duplicated fitness values have been removed.
(ii) If W* ≥ W, then the number of the nodes (PNs or FPNs) retained for the next layer is equal to W.
The above design pattern is carried out for the successive layers of the network. For the construction of the nodes in the corresponding layer of the original SONN structure, the nodes obtained from (i) or (ii) are rearranged in ascending order of their initial population number. This step is needed to construct the final network architecture, as it lets us trace the location of the original nodes of each layer generated by the genetic optimization.
Sub-step 6) For the elitist strategy, we select the node that has the highest fitness value among the selected nodes (W).
Sub-step 7) We generate the population of the next generation using the GA operators of sub-step 4, together with the elitist strategy. This sub-step is carried out by repeating sub-steps 2–6. In particular, in sub-step 5, we replace the node that has the lowest fitness value in the current generation with the node that reached the highest fitness value in the previous generation (obtained in sub-step 6).
Sub-step 8) We combine the nodes (W populations) obtained in the previous generation with the nodes (W populations) obtained in the current generation. Then, the W nodes that have the highest fitness values among these 2W nodes are selected. That is, this sub-step is carried out by repeating sub-step 5.
Sub-step 9) Sub-steps 7–8 are repeated until the last generation. The iterative process generates the optimal nodes of the given layer of the SONN.
[Step 7] Check the termination criterion.
The termination condition that controls the growth of the model consists of two components, that is, the performance index and the size of the network (expressed in terms of the maximal number of layers). As far as the performance index is concerned (which reflects the numeric accuracy of the layers), the termination condition is straightforward and comes in the form
$$F_1 \le F_* \qquad (16)$$
where F1 denotes the maximal fitness value occurring at the current layer, whereas F∗ stands for the maximal fitness value that occurred at the previous layer. As far as the depth of the network is concerned, the generation process is stopped at a depth of less than five layers. This size of the network has been experimentally found to offer a sound compromise between the high accuracy of the resulting model and its complexity as well as generalization abilities.
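The node selection of sub-step 5 in Step 6 and the termination test (16) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names and the `max_layers` parameter are hypothetical, duplicated fitness values are detected by exact equality, and of several nodes with the same fitness the one with the lowest index is kept.

```python
def select_nodes(fitnesses, w):
    """Return the indices of the retained nodes in ascending index order.

    Nodes are ranked by descending fitness, nodes with a duplicated fitness
    value are unified (only the first is kept), and at most w are retained.
    """
    ranked = sorted(range(len(fitnesses)), key=lambda i: fitnesses[i], reverse=True)
    unique, seen = [], set()
    for i in ranked:
        if fitnesses[i] not in seen:  # exact-equality duplicate test (assumption)
            seen.add(fitnesses[i])
            unique.append(i)
    # If fewer than w unique nodes exist (W* < W), all of them are retained.
    return sorted(unique[:w])

def should_stop(current_best, previous_best, layer, max_layers):
    """Termination test: relation (16) or the layer limit has been reached."""
    return current_best <= previous_best or layer >= max_layers
```

For instance, `select_nodes([0.5, 0.9, 0.9, 0.7, 0.2], 3)` keeps nodes 0, 1 and 3: node 2 duplicates the fitness of node 1 and is dropped before the three best are taken.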
In this study, we use two measures (performance indexes), namely the Mean Squared Error (MSE) and the Root Mean Squared Error (RMSE).
i) Mean Squared Error
$$E(PI_s \text{ or } EPI_s) = \frac{1}{N} \sum_{p=1}^{N} (y_p - \hat{y}_p)^2 \qquad (17)$$
ii) Root Mean Squared Error
$$E(PI_s \text{ or } EPI_s) = \sqrt{\frac{1}{N} \sum_{p=1}^{N} (y_p - \hat{y}_p)^2} \qquad (18)$$
where $y_p$ is the p-th target output and $\hat{y}_p$ stands for the p-th actual output of the model for this specific data point. N is the number of training (PIs) or testing (EPIs) input-output data pairs, and E is an overall (global) performance index aggregating the errors over the N data pairs.
[Step 8] Determine new input variables for the next layer.
If (16) has not been met, the model is expanded. The outputs of the preserved nodes (z1i, z2i, · · · , zWi) serve as new inputs to the next layer (x1j, x2j, · · · , xWj) (j = i + 1). This is captured by the expression
$$x_{1j} = z_{1i},\ x_{2j} = z_{2i},\ \cdots,\ x_{Wj} = z_{Wi} \qquad (19)$$
The SONN algorithm is carried out by repeating steps 4–8 of the algorithm.
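To make Step 5-1 and the fitness evaluation of Step 6 concrete, the following Python sketch fits a single PN by least squares, as in (11), and scores it with the MSE (17) and the fitness function (15). It is a minimal illustration, not the authors' code: the function names are hypothetical, the mapping of Type 1, 2 and 3 to linear, quadratic and modified quadratic polynomials follows the descriptions given in the text, and the normal equations are solved by plain Gaussian elimination, which fails with a division by zero if the regressor matrix is rank deficient.

```python
from itertools import combinations

def regressors(x, poly_type):
    """Regressor row of one sample: Type 1 linear, 2 quadratic, 3 modified quadratic."""
    row = [1.0] + [float(v) for v in x]
    if poly_type == 2:                      # quadratic: add squared terms
        row += [v * v for v in x]
    if poly_type in (2, 3):                 # Types 2 and 3: add cross terms
        row += [a * b for a, b in combinations(x, 2)]
    return row

def solve(a, b):
    """Solve the square system a c = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    c = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][k] * c[k] for k in range(i + 1, n))
        c[i] = (m[i][n] - tail) / m[i][i]
    return c

def fit_pn(xs, ys, poly_type):
    """Least squares coefficients C = (X^T X)^{-1} X^T Y of one PN."""
    rows = [regressors(x, poly_type) for x in xs]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(p)]
    return solve(xtx, xty)

def mse(coeffs, xs, ys, poly_type):
    """Mean squared error between targets and PN outputs."""
    errors = [y - sum(c * v for c, v in zip(coeffs, regressors(x, poly_type)))
              for x, y in zip(xs, ys)]
    return sum(e * e for e in errors) / len(errors)

def fitness(epi):
    """Fitness function F = 1 / (1 + EPI)."""
    return 1.0 / (1.0 + epi)
```

In use, `fit_pn` would be called on the training subset, `mse` on the testing subset to obtain EPI, and `fitness(epi)` would then rank the node within the GA.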
5 Experimental studies

In this section, we illustrate the development of the SONN and show its performance on two well-known and widely used datasets. The first one is the gas furnace time series (Box-Jenkins data), which was studied previously in [10, 11, 13, 15, 16, 20–34]. The other one is a chaotic time series (Mackey-Glass time series data) [35–41].

5.1 Gas furnace process

We illustrate the performance of the network and elaborate on its development by experimenting with data coming from the gas furnace process. The time series data (296 input-output pairs) resulting from the gas furnace process has been intensively studied in the previous literature [10, 11, 13, 15, 16, 20–34]. The delayed terms of the methane gas flow rate, u(t), and the carbon dioxide density, y(t), are used as system input variables, namely u(t − 3), u(t − 2), u(t − 1), y(t − 3), y(t − 2), and y(t − 1). The output variable is y(t). The first part of the dataset (consisting of 148 pairs) was used for training. The remaining part
Table 4. System's input vector formats for the design of SONN

Number of System Inputs (SI) | Inputs and output
2 (Type I) | u(t−3), y(t−1) : y(t)
3 (Type II) | u(t−2), y(t−2), y(t−1) : y(t)
4 (Type III) | u(t−2), u(t−1), y(t−2), y(t−1) : y(t)
6 (Type IV) | u(t−3), u(t−2), u(t−1), y(t−3), y(t−2), y(t−1) : y(t)
of the series serves as a testing set. We consider the MSE given by (17) to be a pertinent performance index. The input variables of the nodes in the 1st layer of the SONN architecture are chosen from these system input variables. We use four vector formats of system input variables, Type I, Type II, Type III, and Type IV, as shown in Table 4; they utilize two, three, four, and six system input variables, respectively. Table 5 summarizes the list of parameters used in the genetic optimization of the PN-based and the FPN-based SONN. In the optimization of each layer, we use 100 generations, a population of 60 individuals, a string of 36 bits, a crossover rate of 0.65, and a mutation probability of 0.1. A chromosome used in the genetic optimization consists of a string comprising 3 sub-chromosomes. The 1st sub-chromosome contains the number of input variables, the 2nd sub-chromosome contains the order of the polynomial, and the 3rd sub-chromosome contains the input variables. The numbers of bits allocated to the sub-chromosomes are equal to 3, 3, and 30, respectively. The size of the population selected from the total population (60) is equal to 30. The process is realized as follows. 60 nodes (PNs or FPNs) are generated in each layer of the network. The parameters of all nodes generated in each layer are estimated and the network is evaluated using both the training and testing data sets. Then we compare these values and choose the 30 nodes (PNs or FPNs) that produce the best (lowest) values of the performance index. The maximal number (Max) of inputs to be selected is confined to the range 2–5. In case of the PN-based SONN, the order of the polynomial is chosen from three types, that is, Type 1, Type 2, and Type 3 (refer to Table 1), while in case of the FPN-based SONN, the polynomial order of the consequent part of the fuzzy rules is chosen from four types, that is, Type 1, Type 2, Type 3, and Type 4, as shown in Table 2.
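With the 3 + 3 + 30 bit layout described above, decoding one chromosome into a node structure might look as follows. This is a minimal Python sketch rather than the authors' implementation: the function names, the most-significant-bit-first ordering, rounding to the nearest integer, and the ceiling rule for uneven splits (inferred from the 22-bit example of Step 4-3) are all assumptions.

```python
import math

def decode_field(bits, max_value):
    """Decode a bit list via (7), then normalize via (8) to an integer in 1..max_value."""
    beta = int("".join(str(b) for b in bits), 2)   # bits[0] is the most significant bit
    alpha = 2 ** len(bits) - 1                     # decoded value when all bits are 1
    gamma = (beta / alpha) * (max_value - 1) + 1
    return math.floor(gamma + 0.5)                 # round to nearest (assumption)

def split_bits(bits, parts):
    """Split bits into `parts` fields; the last field takes whatever remains."""
    size = math.ceil(len(bits) / parts)
    return [bits[i * size:(i + 1) * size] for i in range(parts)]

def decode_chromosome(chrom, max_inputs, max_order, n_candidates):
    """Return (number of inputs, polynomial type, sorted selected input indices)."""
    n_inputs = decode_field(chrom[0:3], max_inputs)   # 1st sub-chromosome
    order = decode_field(chrom[3:6], max_order)       # 2nd sub-chromosome
    selected = set()                                  # duplicates collapse to one input
    for field in split_bits(chrom[6:], n_inputs):     # 3rd sub-chromosome
        if field:
            selected.add(decode_field(field, n_candidates))
    return n_inputs, order, sorted(selected)
```

As a check against Fig. 13, 6-bit fields with decoded values 50, 7 and 60 and six candidate inputs map to input numbers 5, 2 and 6 under these assumptions.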
As usual in fuzzy systems, we may exploit a variety of membership functions in the condition part of the rules, and this is another factor contributing to the flexibility of the network. Overall, triangular and Gaussian fuzzy sets are of general interest. The first class, triangular membership functions, offers a very simple implementation. The second class is useful because of the infinite support of its fuzzy sets.
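The two membership function classes of Fig. 11 and the normalized inference of (13) can be sketched in Python as follows. The function names are hypothetical. The triangular Small and Big sets follow the piecewise definition given with Fig. 11; for the Gaussian-like function, the squared ratio in the exponent and the choice of the center c and spread σ are assumptions, since the figure is only partly recoverable.

```python
import math

def small_triangular(x, x_min, x_max):
    """Triangular 'Small' set: 1 at or below x_min, 0 at or above x_max."""
    if x <= x_min:
        return 1.0
    if x >= x_max:
        return 0.0
    return (x_max - x) / (x_max - x_min)

def big_triangular(x, x_min, x_max):
    """Triangular 'Big' set, the complement of 'Small'."""
    return 1.0 - small_triangular(x, x_min, x_max)

def gaussian_like(x, c, sigma):
    """Gaussian-like grade; 0.6931 is close to ln 2, so the grade is about 0.5 at c +/- sigma."""
    return math.exp(-0.6931 * ((x - c) / sigma) ** 2)

def infer(activations, consequents):
    """Weighted output: sum of mu_i * f_i divided by the sum of mu_i."""
    total = sum(activations)
    return sum(mu * f for mu, f in zip(activations, consequents)) / total
```

Here `activations` holds the premise fitness of each rule and `consequents` the values of the consequent polynomials for the current input; how the premise fitness is combined across several inputs depends on the chosen fuzzy inference method and is left open in this sketch.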
Table 5. Computational aspects of the genetic optimization of PN-based and FPN-based SONN

Parameters | 1st layer | 2nd layer | 3rd layer | 4th layer | 5th layer
GA
Maximum generation | 100 | 100 | 100 | 100 | 100
Total population size | 60 | 60 | 60 | 60 | 60
Selected population size (W) | 30 | 30 | 30 | 30 | 30
Crossover rate | 0.65 | 0.65 | 0.65 | 0.65 | 0.65
Mutation rate | 0.1 | 0.1 | 0.1 | 0.1 | 0.1
String length | 3 + 3 + 30 | 3 + 3 + 30 | 3 + 3 + 30 | 3 + 3 + 30 | 3 + 3 + 30
PN-based SONN
Maximal no. (Max) of inputs to be selected | 1 ≤ l ≤ Max (2–5) | 1 ≤ l ≤ Max (2–5) | 1 ≤ l ≤ Max (2–5) | 1 ≤ l ≤ Max (2–5) | 1 ≤ l ≤ Max (2–5)
Polynomial type (Type T) (#) | 1 ≤ T ≤ 3 | 1 ≤ T ≤ 3 | 1 ≤ T ≤ 3 | 1 ≤ T ≤ 3 | 1 ≤ T ≤ 3
FPN-based SONN
Maximal no. (Max) of inputs to be selected | 1 ≤ l ≤ Max (2–5) | 1 ≤ l ≤ Max (2–5) | 1 ≤ l ≤ Max (2–5) | 1 ≤ l ≤ Max (2–5) | 1 ≤ l ≤ Max (2–5)
Polynomial type (Type T) of the consequent part of fuzzy rules (##) | 1 ≤ T ≤ 4 | 1 ≤ T ≤ 4 | 1 ≤ T ≤ 4 | 1 ≤ T ≤ 4 | 1 ≤ T ≤ 4
Consequent input type to be used for Type T (###) | Type T | Type T | Type T | Type T | Type T
 | Type T* | Type T | Type T | Type T | Type T
Membership Function (MF) type | Triangular | Triangular | Triangular | Triangular | Triangular
 | Gaussian | Gaussian | Gaussian | Gaussian | Gaussian
No. of MFs per input | 2 | 2 | 2 | 2 | 2

l, T, Max: integers; #, ## and ###: refer to Tables 1–3, respectively.
5.1.1 PN-based SONN

Fig. 13 shows an example of the PN design driven by a specific chromosome; it refers to the case where the performance values are PI = 0.022 and EPI = 0.136 in the 1st layer when using Max = 3 in Type IV, as shown in Table 6(b). Here, the number of input variables considered in the 1st layer is 6, which is the number of all system input variables. In the 2nd layer or higher, the number of all input variables is W, that is, the number of nodes selected as the output nodes of the preceding layer. Refer to sub-step 4 of step 4-3 of the introduced
[Figure: selection of a node (PN) structure by a chromosome (genetic design). i) Bits for the selection of the no. of input variables: 1 1 1, decoding 7, normalization 3, giving 3 selected input variables. ii) Bits for the selection of the polynomial order: 1 1 0, decoding 6, normalization 3, giving the selected polynomial order Type 3. iii) Bits for the selection of input variables: three bit structures with decoding 50, 7 and 60 and normalization 5, 2 and 6, giving the selected input variables 2, 5, 6 (x2, x5, x6). Selected PN: modified quadratic polynomial a0 + a1x2 + a2x5 + a3x6 + a4x2x5 + a5x2x6 + a6x5x6.]
Fig. 13. The example of the PN design guided by some chromosome (Type IV and Max = 3 in layer 1)
design process. The maximal number of input variables for the selection is confined to 3 over the nodes of each layer of the network. Table 6 summarizes the results when using Type II and Type IV: for each maximal number of inputs to be selected (Max = 2 to 5), it shows the selected node numbers, the selected polynomial type (Type T), and the corresponding performance indexes (PI and EPI) obtained when the genetic optimization of each layer was carried out. “Node” denotes the nodes for which the fitness value is maximal in each layer. For example, in case of Table 6(b), the fitness value in layer 1 is maximal for Max = 5 when nodes 3, 4, 5, 6 occurring in the previous layer are selected as the node inputs in the present layer. Only 4 inputs and Type 1 (linear function) were selected as the result of the genetic optimization. Here, node “0” indicates that the input has not been selected by the genetic operation. Therefore the width (the number of nodes) of the layer can be lower in comparison to the conventional PN-based SONN (which contributes considerably to the compactness of the resulting network). In that case, the minimal values of the performance index at the node, PI = 0.035 and EPI = 0.125, are obtained. Fig. 14 depicts the values of the performance index of the PN-based gSONN with respect to the maximal number of inputs to be selected when
Table 6. Performance index of the PN-based SONN viewed with regard to the increasing number of the layers
[Table body: for each maximal number of inputs to be selected (Max = 2, 3, 4, 5) and each of the 1st to 5th layers, the table lists the selected nodes (Node), the polynomial type (T), PI and EPI, in two parts: (a) In case of Type II; (b) In case of Type IV. The rows for Max = 5 read as follows.]

(a) In case of Type II, Max = 5
Layer | Node | T | PI | EPI
1st | 1 2 3 0 0 | 3 | 0.022 | 0.135
2nd | 4 7 18 0 0 | 2 | 0.045 | 0.121
3rd | 1 15 30 0 0 | 2 | 0.022 | 0.115
4th | 23 29 30 0 0 | 2 | 0.021 | 0.110
5th | 13 14 23 30 0 | 1 | 0.019 | 0.108

(b) In case of Type IV, Max = 5
Layer | Node | T | PI | EPI
1st | 3 4 5 6 0 | 1 | 0.035 | 0.125
2nd | 4 15 22 0 0 | 2 | 0.015 | 0.112
3rd | 7 9 18 27 0 | 3 | 0.015 | 0.107
4th | 3 23 30 0 0 | 2 | 0.013 | 0.096
5th | 6 24 25 0 0 | 3 | 0.012 | 0.091
[Figure: training error and testing error plotted against the layer (1 to 5) for the maximal number of inputs to be selected, Max = 2 (A), 3 (B), 4 (C) and 5 (D). Panels: (a) In case of Type II: (a-1) Training error, (a-2) Testing error; (b) In case of Type IV: (b-1) Training error, (b-2) Testing error. The curves are annotated with the selected node numbers of each layer:
Type II, layer 1: A (2 3), B (1 2 3), C (1 2 3 0), D (1 2 3 0 0)
Type II, layer 2: A (6 10), B (5 7 14), C (5 15 16 0), D (4 7 18 0 0)
Type II, layer 3: A (10 12), B (16 18 28), C (8 13 14 21), D (1 15 30 0 0)
Type II, layer 4: A (8 16), B (12 16 23), C (1 8 25 29), D (23 29 30 0 0)
Type II, layer 5: A (23 30), B (14 18 21), C (10 16 28 0), D (13 14 23 30 0)
Type IV, layer 1: A (5 6), B (2 5 6), C (3 4 5 6), D (3 4 5 6 0)
Type IV, layer 2: A (24 27), B (24 25 0), C (6 9 15 27), D (4 15 22 0 0)
Type IV, layer 3: A (2 25), B (4 22 28), C (6 12 13 19), D (7 9 18 27 0)
Type IV, layer 4: A (8 19), B (12 13 0), C (4 7 10 28), D (3 23 30 0 0)
Type IV, layer 5: A (7 9), B (4 6 18), C (27 28 29 0), D (6 24 25 0 0)]
Fig. 14. Performance index treated as a function of the maximal number of inputs to be selected in Type II or Type IV
the number of system inputs is equal to 6 (Type IV). Next, Fig. 15 shows the performance index of the gSONN with respect to the number of entire system inputs when using Max = 5. Considering the training and testing data sets in Type IV with Max = 5, the best results for the network in the 2nd layer are obtained with the Type 2 polynomial (quadratic function) and 3 node inputs (nodes 4, 15, and 22); the performance of this network is quantified by PI = 0.015 and EPI = 0.112. The best results for the network in the 5th layer, PI = 0.012 and EPI = 0.091, are obtained when using Max = 5 with the Type 3 polynomial and 3 node inputs (nodes 6, 24, and 25). In Fig. 14, A(•)–D(•) denote the optimal node numbers at each layer of the network, namely those with the best predictive performance. Here, the node numbers of the 1st layer represent system input numbers, and the node numbers of the 2nd layer or higher represent the output node numbers of the preceding layer that feed the optimal node, i.e., the node with the best output performance in the current layer. Fig. 16 illustrates the detailed optimal topologies of the network with 2 or 3 layers, compared to the conventional optimized network of Fig. 17. That is, in Fig. 17, the
[Fig. 15 plots: (a) training error and (b) testing error versus layer (1 to 5) for 2, 3, 4, and 6 system inputs.]
Fig. 15. Performance index regarded as a function of the number of system inputs (SI) (in case of Max = 5)
[Fig. 16 diagrams: network topologies built from PN nodes, with system inputs drawn from u(t-3), u(t-2), u(t-1), y(t-3), y(t-2), y(t-1) and output ŷ. Panels: (a) Type II with 3 layers and Max = 5; (b) Type IV with 2 layers and Max = 5.]
Fig. 16. PN-based genetically optimized SONN (gSONN) architecture
[Fig. 17 diagrams: network topologies built from PD and NOP nodes, with system inputs drawn from u(t-3), u(t-2), u(t-1), y(t-3), y(t-2), y(t-1) and output ŷ. Panels: (a) Type II with 5 layers; (b) Type IV with 5 layers.]
Fig. 17. Conventional optimized PN-based SONN architecture
[Fig. 18 plots: (a) training error (PI) and (b) testing error (EPI) versus generation (0 to 500), with the 1st to 5th layers marked along the curve.]
Fig. 18. The optimization process reported in terms of PI and EPI
performance of the conventional optimized SONN in Type II or IV with 5 layers was quantified by the values of PI = 0.020, EPI = 0.119 or PI = 0.017, EPI = 0.101, respectively, whereas, for similar performance values, two types of gSONN architectures are depicted in Fig. 16: (a) Type II with 3 layers and Max = 5 (PI = 0.022, EPI = 0.115; compare Table 6(a), where PI = 0.019, EPI = 0.108 for 5 layers with Max = 5) and (b) Type IV with 2 layers and Max = 5 (PI = 0.015, EPI = 0.112; compare Table 6(b), where PI = 0.012, EPI = 0.091 for 5 layers with Max = 5). As shown in Fig. 16, the genetic design procedure at each stage (layer) of the SONN leads to the selection of the preferred nodes (or PNs) with optimal local characteristics (such as the number of input variables, the order of the polynomial, and the input variables). Fig. 18 illustrates the optimization process by visualizing the performance index in successive generations of the genetic optimization in the case of Type IV with Max = 5. It also shows the optimized network architecture over 5 layers. Noticeably, the variation ratio (slope) of the performance of the network changes radically around the 2nd layer from the viewpoint of both PI and EPI. Therefore, to effectively reduce the number of nodes and avoid a substantial amount of time-consuming iterations over SONN layers, a stopping criterion can be taken into consideration. Referring to Fig. 16(b), the network was optimized up to at most the 2nd layer. Table 7 summarizes the detailed results of the optimal architecture for each Type of the network.

5.1.2 FPN-based SONN

Fig. 19 shows an example of the FPN design that is driven by a specific chromosome; it corresponds to the case in Table 8(b-2) where the performance values are PI = 0.012, EPI = 0.145 in the 1st layer when using Type IV, Gaussian-like MF, and Max = 4. In the FPN of each layer, two Gaussian-like membership
Table 7. Performance index of PN-based SONN architectures for Types I, II, III, and IV

Type (SI)          Max  Layer  No. of inputs (node no.)  Polynomial type  PIs    EPIs
Type I (SI: 2)     4    1      2(1,2)                    Type 1           0.022  0.335
                        2      4(1,4,5,6)                Type 2           0.019  0.282
                        3      4(1,2,14,25)              Type 2           0.020  0.273
                        4      3(17,22,25)               Type 2           0.018  0.268
                        5      4(4,14,25,26)             Type 2           0.018  0.265
                   5    1      2(1,2)                    Type 1           0.022  0.335
                        2      4(3,4,5,7)                Type 2           0.020  0.282
                        3      5(7,9,15,24,28)           Type 3           0.020  0.271
                        4      3(21,27,30)               Type 2           0.017  0.263
                        5      3(2,15,13)                Type 2           0.017  0.259
Type II (SI: 3)    4    1      3(1,2,3)                  Type 3           0.022  0.135
                        2      3(5,15,16)                Type 2           0.045  0.121
                        3      4(8,13,14,21)             Type 3           0.042  0.112
                        4      4(1,8,25,29)              Type 3           0.039  0.108
                        5      3(10,16,28)               Type 1           0.038  0.106
                   5    1      3(1,2,3)                  Type 3           0.022  0.135
                        2      3(4,17,18)                Type 2           0.045  0.121
                        3      3(1,15,30)                Type 2           0.022  0.115
                        4      3(23,29,30)               Type 2           0.021  0.110
                        5      4(13,14,23,30)            Type 1           0.019  0.108
Type III (SI: 4)   4    1      3(1,3,4)                  Type 3           0.022  0.135
                        2      4(24,26,29,30)            Type 3           0.016  0.124
                        3      4(7,11,18,26)             Type 2           0.014  0.113
                        4      3(2,28,29)                Type 2           0.014  0.102
                        5      3(3,12,22)                Type 1           0.013  0.099
                   5    1      3(1,3,4)                  Type 3           0.022  0.135
                        2      5(12,13,15,18,29)         Type 2           0.015  0.119
                        3      4(7,9,10,14)              Type 2           0.014  0.104
                        4      3(1,3,5)                  Type 2           0.014  0.100
                        5      4(8,9,23,24)              Type 1           0.013  0.098
Type IV (SI: 6)    4    1      4(3,4,5,6)                Type 1           0.035  0.125
                        2      4(6,9,15,27)              Type 3           0.018  0.122
                        3      4(6,12,13,19)             Type 3           0.017  0.113
                        4      4(4,7,10,28)              Type 2           0.016  0.106
                        5      3(27,28,29)               Type 3           0.014  0.100
                   5    1      4(3,4,5,6)                Type 1           0.035  0.125
                        2      3(4,15,22)                Type 2           0.015  0.112
                        3      4(7,9,18,27)              Type 3           0.015  0.107
                        4      3(3,23,30)                Type 2           0.013  0.096
                        5      3(6,24,25)                Type 3           0.012  0.091
[Fig. 19 diagram: selection of an FPN structure by a chromosome. The chromosome is divided into sub-chromosomes holding (i) bits for the selection of the number of input variables, (ii) bits for the selection of the polynomial order, and (iii) bits for the selection of the input variables. Each bit group is decoded and normalized, which here yields 2 selected input variables (x2 and x3), polynomial order Type 2, and two Gaussian MFs per input. The resulting fuzzy rules include
R1: If x2 is Small and x3 is Small, then y1 = f1(x1, x2, x3, x4) with Type 2
R4: If x2 is Big and x3 is Big, then y4 = f4(x1, x2, x3, x4) with Type 2]
Fig. 19. The example of the FPN design guided by some chromosome (Type IV, Gaussian-like MF, Max = 4, 1st layer)
functions for each input variable are used. Here, the number of entire input variables (here, entire system input variables) considered in the 1st layer is 6 (Type IV). The polynomial order selected is Type 2. Furthermore, the entire system input variables were considered for the regression polynomial function of the consequent part of the fuzzy rules in the first layer. That is, “Type 2*” is used as the consequent input type in the 1st layer, and in the 2nd layer or higher the input variables in the conclusion part of the fuzzy rules are kept the same as those in the conditional part of the fuzzy rules. In particular, in the 2nd layer or higher, the number of entire input variables is W, the number of nodes selected in the current layer as the output nodes of the preceding layer (refer to sub-step 4 of step 4–3 of the introduced design process). As mentioned previously, the maximal number (Max) of input variables for the selection is confined to 4, and two variables (x2 and x3) were selected among them. The parameters of the conclusion part (polynomial) of the fuzzy rules can be determined by the standard least-squares method. Table 8 summarizes the results when using Types II and IV: According to the maximal number of inputs to be selected (Max = 2 to 5), the selected node numbers, the selected polynomial type (Type T), and its corresponding
Table 8. Performance index of the network of each layer versus the increase of maximal number of inputs to be selected for Types II and IV with Type T*

(a) In case of Type II
(a-1) Triangular MF
Layer  Max  Node            T  PI     EPI
1st    2    1 0             4  0.020  0.133
1st    3    1 0 0           4  0.020  0.133
1st    4    1 0 0 0         4  0.020  0.133
1st    5    1 0 0 0 0       4  0.020  0.133
2nd    2    8 19            1  0.049  0.129
2nd    3    8 20 0          1  0.049  0.129
2nd    4    18 21 0 0       1  0.049  0.129
2nd    5    4 27 0 0 0      1  0.049  0.129
3rd    2    16 0            2  0.049  0.127
3rd    3    5 0 0           2  0.049  0.127
3rd    4    1 4 13 19       1  0.019  0.126
3rd    5    11 22 0 0 0     2  0.019  0.124
4th    2    7 9             3  0.018  0.122
4th    3    11 20 24        1  0.017  0.123
4th    4    10 20 27 30     1  0.018  0.123
4th    5    5 23 0 0 0      2  0.018  0.119
5th    2    22 30           2  0.018  0.119
5th    3    9 17 0          4  0.017  0.119
5th    4    14 23 0 0       1  0.018  0.119
5th    5    17 21 26 27 0   1  0.016  0.114

(a-2) Gaussian-like MF
Layer  Max  Node            T  PI     EPI
1st    2    1 2             4  0.017  0.142
1st    3    1 2 0           4  0.017  0.142
1st    4    1 2 0 0         4  0.017  0.142
1st    5    1 2 0 0 0       4  0.017  0.142
2nd    2    8 14            2  0.017  0.134
2nd    3    25 16 17        1  0.025  0.131
2nd    4    3 9 14 25       1  0.021  0.125
2nd    5    5 6 9 24 0      1  0.020  0.127
3rd    2    17 26           2  0.017  0.132
3rd    3    10 20 26        1  0.024  0.127
3rd    4    2 3 9 10        1  0.023  0.111
3rd    5    4 7 17 20 0     1  0.022  0.117
4th    2    15 25           4  0.017  0.130
4th    3    2 4 0           2  0.016  0.119
4th    4    10 20 21 28     1  0.030  0.108
4th    5    17 29 0 0 0     2  0.014  0.114
5th    2    1 5             4  0.016  0.128
5th    3    16 25 0         2  0.015  0.117
5th    4    3 13 0 0        2  0.018  0.101
5th    5    19 29 0 0 0     3  0.014  0.108

(b) In case of Type IV
(b-1) Triangular MF
Layer  Max  Node            T  PI     EPI
1st    2    2 3             2  0.013  0.149
1st    3    2 5 6           1  0.021  0.136
1st    4    2 5 6 0         1  0.021  0.136
1st    5    2 5 6 0 0       1  0.021  0.136
2nd    2    25 0            3  0.012  0.148
2nd    3    3 14 0          1  0.021  0.124
2nd    4    12 27 0 0       1  0.021  0.124
2nd    5    16 27 0 0 0     1  0.021  0.124
3rd    2    2 23            2  0.013  0.143
3rd    3    2 4 6           2  0.015  0.117
3rd    4    19 25 0 0       3  0.018  0.122
3rd    5    5 6 10 0 0      4  0.015  0.106
4th    2    3 26            3  0.013  0.141
4th    3    10 12 30        1  0.017  0.114
4th    4    8 14 15 19      1  0.017  0.114
4th    5    5 27 0 0 0      2  0.015  0.105
5th    2    21 30           2  0.012  0.125
5th    3    4 12 21         4  0.011  0.103
5th    4    6 10 14 28      1  0.011  0.106
5th    5    7 19 0 0 0      3  0.015  0.103

(b-2) Gaussian-like MF
Layer  Max  Node            T  PI     EPI
1st    2    2 3             2  0.012  0.145
1st    3    2 3 0           2  0.012  0.145
1st    4    2 3 0 0         2  0.012  0.145
1st    5    2 3 0 0 0       2  0.012  0.145
2nd    2    25 0            3  0.012  0.146
2nd    3    10 29 0         4  0.012  0.139
2nd    4    19 29 0 0       4  0.011  0.125
2nd    5    5 27 28 0 0     1  0.015  0.138
3rd    2    2 0             3  0.013  0.147
3rd    3    11 13 23        1  0.023  0.129
3rd    4    4 13 29 0       1  0.020  0.104
3rd    5    11 27 29 0 0    1  0.021  0.114
4th    2    19 0            3  0.015  0.149
4th    3    2 3 7           1  0.016  0.116
4th    4    29 0 0 0        3  0.011  0.106
4th    5    3 5 0 0 0       2  0.009  0.106
5th    2    8 14            4  0.010  0.152
5th    3    7 14 0          2  0.008  0.119
5th    4    2 20 24 27      1  0.008  0.093
5th    5    5 15 20 0 0     1  0.016  0.100
performance index (PI and EPI) were shown when the genetic optimization for each layer was carried out. “Node” denotes the nodes for which the fitness value is maximal in each layer. For example, in Table 8(b-2), the fitness value in layer 5 is maximal for Max = 5 when nodes 5, 15, and 20 of the previous layer are selected as the node inputs of the present layer. Only 3 inputs and a Type 1 polynomial (constant) were selected as the result of the genetic optimization. Here, node “0” indicates that the corresponding input slot was not selected by the genetic operation. Therefore the width (the number of nodes) of the layer can be lower than in the conventional FPNN, which contributes substantially to the compactness of the resulting network. In that case, the minimal values of the performance index at the node, PI = 0.016 and EPI = 0.100, are obtained. Fig. 20 shows the values of the performance index versus the number of layers of the gSONN with respect to the maximal number of inputs to be selected, for the optimal architectures of each layer of the network included in Table 8(b), while Fig. 21 summarizes the values of the performance index versus the increasing number of layers with regard to the number of
[Fig. 20 plots: training and testing error versus layer (1 to 5) for Max = 2 (A), 3 (B), 4 (C), and 5 (D), with the selected node numbers annotated at each layer. Panels: (a) Triangular MF, (a-1) training error, (a-2) testing error; (b) Gaussian-like MF, (b-1) training error, (b-2) testing error.]
Fig. 20. Performance index of gSONN for Type IV according to the increase of number of layers
[Fig. 21 plots: training and testing error versus layer (1 to 5) for 2, 3, 4, and 6 system inputs, with separate curves for selected input variables and entire system input variables. Panels: (a) Triangular MF, (a-1) training error, (a-2) testing error; (b) Gaussian-like MF, (b-1) training error, (b-2) testing error.]
Fig. 21. Performance index of gSONN with respect to the increase of number of system inputs (in case of Max = 5)
entire system inputs being used in gSONN when using Max = 5. In Fig. 20, A(•)–D(•) denote the optimal node numbers at each layer of the network, namely those with the best predictive performance. Here, the node numbers of the 1st layer represent system input numbers, and the node numbers of the 2nd layer or higher represent the output node numbers of the preceding layer that feed the optimal node, i.e., the node with the best output performance in the current layer. Fig. 22 illustrates the detailed optimal topology of the network with 2 layers, compared to the conventional optimized network of Fig. 23. That is, in Fig. 23, the performance of two types of the conventional optimized SONNs was quantified by the values of PI = 0.020, EPI = 0.130 for Type II, and PI = 0.016, EPI = 0.128 for Type III, whereas, for similar performance values, two types of gSONN architectures are depicted in Fig. 22(a) and (b). As shown in Fig. 22, the genetic design procedure at each stage (layer) of the SONN leads to the selection of the preferred nodes (or FPNs) with optimal local characteristics (such as the number of input variables, the order
[Fig. 22 diagrams: network topologies built from FPN nodes, with system inputs drawn from u(t-3), u(t-2), u(t-1), y(t-3), y(t-2), y(t-1) and output ŷ. Panels: (a) FPN-based gSONN with 2 layers for Type II, Max = 5, and Gaussian-like MF; (b) FPN-based gSONN with 2 layers for Type IV, Max = 4, and Gaussian-like MF.]
Fig. 22. Genetically optimized FPNN (gFPNN) architecture
[Fig. 23 diagrams: network topologies built from FPN and NOP nodes, with system inputs drawn from u(t-2), u(t-1), y(t-2), y(t-1) and output ŷ. Panels: (a) Type II; (b) Type III.]
Fig. 23. Genetically optimized FPNN (gFPNN) architecture
of polynomial of the consequent part of fuzzy rules, and a collection of the specific subset of input variables). In the sequel, the best results for the Type II network in the 2nd layer are obtained when using Max = 5 and Gaussian-like MF, with the Type 1 polynomial (constant) and 4 node inputs (nodes 5, 6, 9, and 24); the performance of this network is quantified by PI = 0.020 and EPI = 0.127. In addition, the most preferred results for the Type IV network in the 2nd layer, PI = 0.011 and EPI = 0.125, are obtained when using Max = 4 and Gaussian-like MF with the Type 4 polynomial and 2 node inputs (nodes 19 and 29). Therefore the width (the number of nodes) of the layer as well as the depth (the number of layers) of the network can be lower in comparison
Table 9. Performance index of FPN-based SONN architectures for Types I, II, III, and IV

Type (SI)          Max (MF)            Layer  No. of inputs (node no.)  Polynomial type  PIs    EPIs
Type I (SI: 2)     5 (Triangular)      1      1(1)                      Type 2           0.022  0.335
                                       2      4(7,9,10,11)              Type 1           0.018  0.271
                                       3      4(6,12,25,28)             Type 1           0.017  0.270
                                       4      4(6,11,12,27)             Type 1           0.017  0.268
                                       5      3(10,14,19)               Type 1           0.016  0.264
                   5 (Gaussian-like)   1      2(1,2)                    Type 3           0.022  0.281
                                       2      3(6,9,11)                 Type 2           0.020  0.267
                                       3      4(5,13,15,25)             Type 1           0.020  0.260
                                       4      2(3,6)                    Type 3           0.017  0.252
                                       5      2(28,30)                  Type 2           0.017  0.249
Type II (SI: 3)    5 (Triangular)      1      1(1)                      Type 4           0.020  0.133
                                       2      2(4,27)                   Type 1           0.049  0.129
                                       3      2(11,22)                  Type 2           0.019  0.124
                                       4      2(5,23)                   Type 2           0.018  0.119
                                       5      4(17,21,26,27)            Type 1           0.016  0.114
                   4 (Gaussian-like)   1      2(1,2)                    Type 4           0.017  0.142
                                       2      4(3,9,14,25)              Type 1           0.021  0.125
                                       3      4(2,3,9,10)               Type 1           0.023  0.111
                                       4      4(10,20,21,28)            Type 1           0.030  0.108
                                       5      2(3,13)                   Type 2           0.018  0.101
Type III (SI: 4)   5 (Triangular)      1      3(1,3,4)                  Type 1           0.021  0.135
                                       2      2(1,26)                   Type 1           0.021  0.125
                                       3      4(2,6,13,15)              Type 1           0.017  0.123
                                       4      4(10,14,17,28)            Type 1           0.010  0.120
                                       5      4(9,14,17,24)             Type 1           0.010  0.115
                   5 (Gaussian-like)   1      3(1,3,4)                  Type 2           0.016  0.146
                                       2      4(1,11,24,27)             Type 1           0.043  0.128
                                       3      4(1,8,19,24)              Type 1           0.024  0.117
                                       4      2(2,22)                   Type 3           0.012  0.104
                                       5      4(8,9,23,24)              Type 1           0.014  0.099
Type IV (SI: 6)    5 (Triangular)      1      3(2,5,6)                  Type 1           0.021  0.136
                                       2      2(16,27)                  Type 1           0.021  0.124
                                       3      3(5,6,10)                 Type 4           0.015  0.106
                                       4      2(5,27)                   Type 2           0.015  0.105
                                       5      2(7,19)                   Type 3           0.015  0.103
                   4 (Gaussian-like)   1      2(2,3)                    Type 2           0.012  0.145
                                       2      2(19,29)                  Type 4           0.011  0.125
                                       3      3(4,13,29)                Type 1           0.020  0.104
                                       4      1(29)                     Type 3           0.011  0.106
                                       5      4(2,20,24,27)             Type 1           0.008  0.093
to the conventional SONN, which contributes substantially to the compactness of the resulting network. Table 9 summarizes the detailed results of the optimal architecture for each Type of the network. Table 10 contrasts the performance of the genetically developed network with other fuzzy and neuro-fuzzy models studied in the literature. The experimental results reveal that the proposed approach and the resulting model outperform the existing networks both in terms of approximation capabilities (lower values of the performance index on the training data, PIs) and generalization abilities (expressed by the performance index on the testing data, EPIs). In addition, the structurally optimized gSONN leads to the effective reduction of the depth of the network as well as the width of the layer and the avoidance of a substantial amount
Table 10. Comparative analysis of the performance of the network; considered are models reported in the literature

Model                                                                              Performance index
Box and Jenkins’ model [20]                                                        0.710
Tong’s model [21]                                                                  0.469
Sugeno and Yasukawa’s model [22]                                                   0.355
Sugeno and Yasukawa’s model [23]                                                   0.190
Xu and Zailu’s model [24]                                                          0.328
Pedrycz’s model [15]                                                               0.320
Chen’s model [25]                                                                  0.268
Gomez-Skarmeta’s model [26]                                                        0.157
Oh and Pedrycz’s model [16]                                                        0.123  0.020  0.271
Kim et al.’s model [27]                                                            0.055
Kim et al.’s model [28]                                                            0.034  0.244
Leski and Czogala’s model [29]                                                     0.047
Lin and Cunningham’s model [30]                                                    0.071  0.261
NNFS model [31]                                                                    0.128
FPNN [32], CASE I (SI = 4, 5th layer)                                              0.016  0.116
FPNN [32], CASE II (SI = 4, 5th layer)                                             0.016  0.128
PNN [33], Basic (SI = 4, 5th layer)                                                0.021  0.110
PNN [33], Modified (SI = 4, 5th layer)                                             0.015  0.103
HFPNN [34], Triangular (SI = 4, 5th layer)                                         0.019  0.134
HFPNN [34], Gaussian (SI = 4, 5th layer)                                           0.021  0.119
Generic SOPNN [10], Basic SOPNN (SI = 4, 5th layer)                                0.027  0.021  0.085
Generic SOPNN [10], Modified SOPNN (SI = 4, 5th layer)                             0.035  0.017  0.095
Advanced SOPNN [11], Basic SOPNN (SI = 4, 5th layer)                               0.020  0.119
Advanced SOPNN [11], Modified SOPNN (SI = 4, 5th layer)                            0.018  0.118
SONN* [13], Type I (SI = 2), Basic SONN, Case 1 (5th layer)                        0.016  0.266
SONN* [13], Type I (SI = 2), Basic SONN, Case 2 (5th layer)                        0.016  0.265
SONN* [13], Type I (SI = 2), Modified SONN, Case 1 (5th layer)                     0.013  0.267
SONN* [13], Type I (SI = 2), Modified SONN, Case 2 (5th layer)                     0.013  0.272
SONN* [13], Type II (SI = 4), Basic SONN, Case 1 (5th layer)                       0.016  0.116
SONN* [13], Type II (SI = 4), Basic SONN, Case 2 (5th layer)                       0.016  0.128
SONN* [13], Type II (SI = 4), Modified SONN, Case 1 (5th layer)                    0.016  0.133
SONN* [13], Type II (SI = 4), Modified SONN, Case 2 (5th layer)                    0.018  0.131
Proposed gSONN, PN-based, Type I (SI = 2), 2nd layer (Max = 4)                     0.019  0.282
Proposed gSONN, PN-based, Type I (SI = 2), 5th layer (Max = 5)                     0.017  0.259
Proposed gSONN, PN-based, Type II (SI = 3), 1st layer (Max = 5)                    0.022  0.135
Proposed gSONN, PN-based, Type II (SI = 3), 3rd layer (Max = 5)                    0.022  0.115
Proposed gSONN, PN-based, Type III (SI = 4), 2nd layer (Max = 5)                   0.015  0.119
Proposed gSONN, PN-based, Type III (SI = 4), 5th layer (Max = 5)                   0.018  0.098
Proposed gSONN, PN-based, Type IV (SI = 6), 2nd layer (Max = 5)                    0.015  0.112
Proposed gSONN, PN-based, Type IV (SI = 6), 3rd layer (Max = 5)                    0.012  0.091
Proposed gSONN, FPN-based, Type I (SI = 2), Gaussian-like, 1st layer (Max = 5)     0.017  0.281
Proposed gSONN, FPN-based, Type I (SI = 2), Gaussian-like, 5th layer (Max = 5)     0.014  0.249
Proposed gSONN, FPN-based, Type II (SI = 3), Gaussian-like, 1st layer (Max = 5)    0.017  0.142
Proposed gSONN, FPN-based, Type II (SI = 3), Gaussian-like, 5th layer (Max = 5)    0.014  0.108
Proposed gSONN, FPN-based, Type III (SI = 4), Gaussian-like, 1st layer (Max = 5)   0.016  0.146
Proposed gSONN, FPN-based, Type III (SI = 4), Gaussian-like, 5th layer (Max = 5)   0.014  0.099
Proposed gSONN, FPN-based, Type IV (SI = 6), Gaussian-like, 2nd layer (Max = 5)    0.012  0.145
Proposed gSONN, FPN-based, Type IV (SI = 6), Gaussian-like, 5th layer (Max = 4)    0.008  0.093

PI: performance index over the entire data set; PIs: performance index on the training data; EPIs: performance index on the testing data.
*: denotes “conventional optimized FPN-based SONN”.
of time-consuming iterations for finding the most preferred network in the conventional SONN. PIs (EPIs) is defined as the mean square error (MSE) between the experimental data and the respective outputs of the model (network).

5.2 Chaotic time series

In this section, we demonstrate how the GA-based SONN can be utilized to predict future values of a chaotic Mackey-Glass time series. The performance of the network is also contrasted with some other models existing in the literature [35–41]. The time series is generated by the chaotic Mackey-Glass differential delay equation [42], which comes in the form

\[ \dot{x}(t) = \frac{0.2\,x(t-\tau)}{1 + x^{10}(t-\tau)} - 0.1\,x(t) \qquad (20) \]
The prediction of future values of this series is a benchmark problem that has been used and reported by a number of researchers. To obtain the time series value at each integer point, we applied the fourth-order Runge-Kutta method to find the numerical solution to (20). From the Mackey-Glass time series x(t), we extracted 1000 input-output data pairs in three vector formats, Type I, Type II, and Type III, as shown in Table 11. We choose the input variables of the nodes in the 1st layer of the SONN architecture from these input variables. The first 500 pairs were used as the training data set while the remaining 500 pairs formed the testing data set. To come up with a quantitative evaluation of the network, we use the standard RMSE performance index as given by (18). The parameters used for the optimization of this process model are the same as in the previous experiments. In particular, in the optimization of each layer of the PN-based SONN, we use 100 generations and a population size of 150. The number of individuals selected from the total population (150) is equal to 100. The GA-based design procedure is carried out in the same manner as in the previous experiments as well. The consequent input type used in this process is the same as that used for the gas furnace process.

Table 11. System’s input vector formats for the design of SONN

Number of system inputs (SI)   Inputs; output
4 (Type I)                     x(t−18), x(t−12), x(t−6), x(t); x(t+6)
5 (Type II)                    x(t−24), x(t−18), x(t−12), x(t−6), x(t); x(t+6)
6 (Type III)                   x(t−30), x(t−24), x(t−18), x(t−12), x(t−6), x(t); x(t+6)
where t = 118 to 1117.
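The data preparation described above can be sketched as follows. This is a minimal illustration, not the authors' code: the delay tau = 17, the constant initial history x(t) = 1.2 for t ≤ 0, the integration step h = 0.1, and the linear interpolation of the delayed term at the half step are all assumptions, since the text does not state them. The function names `mackey_glass` and `make_pairs` are hypothetical. Only the Type I format of Table 11 is built here.

```python
# Sketch only: tau, x0, and h are assumed values that the text does not give.

def mackey_glass(n_points, tau=17.0, x0=1.2, h=0.1):
    """Integrate Eq. (20) with classical RK4; return x(0), ..., x(n_points - 1)."""
    steps_per_unit = round(1.0 / h)
    delay = round(tau / h)          # delay measured in integration steps (must be >= 1)
    hist = [x0] * (delay + 1)       # constant history on [-tau, 0]; hist[-1] is x(t)

    def f(x, x_delayed):
        return 0.2 * x_delayed / (1.0 + x_delayed ** 10) - 0.1 * x

    for _ in range((n_points - 1) * steps_per_unit):
        x = hist[-1]
        d0 = hist[-1 - delay]       # x(t - tau)
        d1 = hist[-delay]           # x(t + h - tau)
        dm = 0.5 * (d0 + d1)        # x(t + h/2 - tau), linear interpolation (assumption)
        k1 = f(x, d0)
        k2 = f(x + 0.5 * h * k1, dm)
        k3 = f(x + 0.5 * h * k2, dm)
        k4 = f(x + h * k3, d1)
        hist.append(x + h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0)

    # Keep only the values at integer time points.
    return [hist[delay + i * steps_per_unit] for i in range(n_points)]


def make_pairs(series, lags=(18, 12, 6, 0), horizon=6, t_start=118, t_end=1117):
    """Build Type I pairs: inputs x(t-18), x(t-12), x(t-6), x(t); output x(t+6)."""
    inputs = [[series[t - lag] for lag in lags] for t in range(t_start, t_end + 1)]
    outputs = [series[t + horizon] for t in range(t_start, t_end + 1)]
    return inputs, outputs


series = mackey_glass(1124)                    # need x(0) ... x(1123) for t up to 1117
X, y = make_pairs(series)                      # 1000 input-output pairs
X_train, y_train = X[:500], y[:500]            # first 500 pairs: training set
X_test, y_test = X[500:], y[500:]              # remaining 500 pairs: testing set
```

Passing `lags=(24, 18, 12, 6, 0)` or `lags=(30, 24, 18, 12, 6, 0)` to `make_pairs` would give the Type II and Type III formats of Table 11.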
5.2.1 PN-based SONN

Table 12 summarizes the results for Type III when the maximal number of inputs to be selected is 2, 3, 4, 5, or 10. The best results for the network in the 5th layer are PI = 0.0009, EPI = 0.0009 when using Max = 10. Fig. 24 depicts the values of the performance index of the PN-based gSONN with respect to the maximal number of inputs to be selected when the number of system inputs is equal to 6 (Type III). Next, Fig. 25 shows the performance index of the gSONN with respect to the number of entire system inputs when using Max = 10. Table 13 summarizes the detailed results of the optimal architecture for each Type of the network.

5.2.2 FPN-based SONN

Table 14 summarizes the performance of the 1st to 5th layers of the network when changing the maximal number of inputs to be selected; here Max was set to 2 through 5. Fig. 26 depicts the performance index of each layer of the FPN-based gSONN according to the increase of the maximal number of inputs to be selected. Fig. 27 summarizes the values of the performance index versus the increasing number of layers with regard to the number of selected inputs (or entire system inputs) being used in gSONN when using Max = 5. Fig. 28(a)–(b) illustrate the detailed optimal topologies of gSONN for 1 layer when using triangular MF: the results of the network are PI = 4.0e-4, EPI = 4.0e-4 for Max = 3, and PI = 7.7e-5, EPI = 1.6e-4 for Max = 5. Fig. 28(c)–(d) illustrate the detailed optimal topologies of gSONN for 1 layer in the case of Gaussian-like MF: those are quantified as PI = 3.0e-4, EPI = 3.0e-4 for Max = 2, and PI = 3.6e-5, EPI = 4.5e-5 for Max = 5. As shown in Fig. 28, the proposed network is structurally more optimized and simpler than the conventional SONN. Fig. 29 illustrates the optimization process by visualizing the values of the performance index obtained in successive generations of the GA. It also shows the optimized network architecture when using Gaussian-like MF (the maximal number (Max) of inputs to be selected is set to 5, with the structure composed of 5 layers). As shown in Fig. 29, the variation ratio (slope) of the performance of the network is almost the same from the 2nd through the 5th layer; therefore, in this case, the stopping criterion can limit the network to at most 1 or 2 layers in order to effectively reduce the number of nodes as well as the number of layers (that is, the width and depth of the gSONN, for a compact network). Table 15 summarizes the results of the optimal architecture for each Type (Types I, II, and III) of the network when using Type T* as the consequent input type.
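The idea of stopping the growth of the network once the layer-to-layer improvement flattens out can be sketched as below. The text does not specify the stopping rule, so the relative-gain test, the threshold `min_rel_gain`, and the function name `choose_depth` are assumptions made purely for illustration. The sample values are the training errors (PIs) of the gas furnace Type IV network with Max = 5 listed in Table 7.

```python
def choose_depth(pi_per_layer, min_rel_gain=0.10):
    """Return how many layers to keep.

    Growth stops at the first layer whose successor lowers the performance
    index by less than min_rel_gain (relative to the current layer's value).
    The rule and the threshold are illustrative assumptions.
    """
    for k in range(1, len(pi_per_layer)):
        prev, cur = pi_per_layer[k - 1], pi_per_layer[k]
        rel_gain = (prev - cur) / prev
        if rel_gain < min_rel_gain:
            return k                      # keep layers 1..k, drop the rest
    return len(pi_per_layer)


# PIs of layers 1 to 5 (Table 7, Type IV, Max = 5).
pi_type_iv = [0.035, 0.015, 0.015, 0.013, 0.012]
depth = choose_depth(pi_type_iv)
print(depth)
```

With these values the large drop occurs between the 1st and 2nd layers, and the 3rd layer brings no further gain in PI, so the sketch keeps 2 layers. A practical rule would probably also monitor EPI so that the depth is not chosen on training error alone.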
EPI 17
10
2 0.0211 0.0203
73
50
73
2 0.0036 0.0034
2 0.0106 0.0105 47 51 61 69
2 0.0161 0.0159 1
64
Node
PI
EPI 31
2 0.0021 0.0019
2 0.0063 0.0062 6 31 46
66
93
86
Node
2 0.0107 0.0103 15 80
2 0.0215 0.0211
T
PI
EPI 70
14
43
73
Node
2 0.0013 0.0012
2 0.0052 0.0052 29 40 75 95
2 0.0080 0.0078 8
2 0.0173 0.0170
T
PI
EPI
2 0.0009 0.0009
2 0.0045 0.0044
2 0.0067 0.0065
2 0.0156 0.0154
T
1 3 4 5 6 2 0.0231 0.0223 2 51 70 74 83 2 0.0126 0.0124 74 76 79 86 95 2 0.0066 0.0063 16 36 43 53 58 2 0.0049 0.0046 10 21 34 46 67 2 0.0041 0.0038
39
PI
2 0.0309 0.0305
T
5
6 2 0.0347 0.0339 31
23
Node
1 3 4 6 2 0.0293 0.0283 4 25 59 63
12
5th layer
3 4
EPI
4th layer
4
PI
2 0.0502 0.0497
3rd layer
3
6
2nd layer
4
Node T
1st layer
2
Max
Table 12. Performance index of the PN-based SONN of each layer versus the increase of maximal number of inputs to be selected for Type III
[Fig. 24 plots: (a) training error and (b) testing error versus layer (1 to 5) for Max = 2 (A), 3 (B), 4 (C), 5 (D), and 10 (E), with the selected node numbers annotated at each layer.]
Fig. 24. Performance index according to the increase of number of layers in case of Type III
[Fig. 25 plots: (a) training error and (b) testing error versus layer (1 to 5) for 4, 5, and 6 system inputs.]
Fig. 25. Performance index of gPNN with respect to the increase of number of system inputs (Max = 10)
Table 16 gives a comparative summary of the network with other models. The experimental results reveal that it outperforms the existing models both in terms of approximation capabilities (lower values of the performance index on the training data, PIs) and generalization abilities (expressed by the performance index on the testing data, EPIs). PIs (EPIs) is defined as the root mean square error (RMSE) computed for the experimental data and the respective outputs of the network.
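For concreteness, the two performance indexes used in the experiments can be computed as in the short sketch below: the MSE for the gas furnace data and the RMSE for the Mackey-Glass series. It assumes the usual textbook definitions, since Eq. (18) is not reproduced in this part of the chapter; the function names are hypothetical. PIs is obtained by applying the function to the training pairs and EPIs by applying it to the testing pairs.

```python
import math


def mse(targets, outputs):
    """Mean square error between experimental data and model outputs."""
    if len(targets) != len(outputs) or not targets:
        raise ValueError("targets and outputs must be non-empty and of equal length")
    return sum((t - o) ** 2 for t, o in zip(targets, outputs)) / len(targets)


def rmse(targets, outputs):
    """Root mean square error: the square root of the MSE."""
    return math.sqrt(mse(targets, outputs))


# Toy check: errors of 3 and 4 give MSE = (9 + 16) / 2 = 12.5.
example_mse = mse([0.0, 0.0], [3.0, 4.0])
example_rmse = rmse([0.0, 0.0], [3.0, 4.0])
```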
Table 13. Performance index of PN based SONN architectures for Types I, II, and III

PN-based SONN               Layer  No. of inputs (Node no.)              Polynomial type  PIs     EPIs
Type I (SI: 4), Max = 5     1      4(1, 2, 3, 4)                         Type 2           0.0302  0.0294
                            2      5(19, 26, 28, 34, 41)                 Type 2           0.0082  0.0081
                            3      5(2, 17, 23, 48, 86)                  Type 2           0.0063  0.0061
                            4      5(3, 25, 71, 75, 78)                  Type 2           0.0050  0.0049
                            5      5(1, 3, 54, 94, 98)                   Type 2           0.0043  0.0043
Type I (SI: 4), Max = 10    1      4(1, 2, 3, 4)                         Type 2           0.0302  0.0294
                            2      10(2,4,8,10,11,12,13,15,21,23)        Type 2           0.0047  0.0045
                            3      10(12,34,35,43,52,60,62,72,89,95)     Type 2           0.0039  0.0038
                            4      10(12,13,28,29,32,41,57,73,91,94)     Type 2           0.0032  0.0031
                            5      9(44,51,53,67,69,75,86,92,93)         Type 2           0.0026  0.0026
Type II (SI: 5), Max = 5    1      5(1, 2, 3, 4, 5)                      Type 2           0.0282  0.0274
                            2      5(39, 59, 71, 74, 84)                 Type 2           0.0088  0.0087
                            3      5(46, 51, 66, 67, 88)                 Type 2           0.0054  0.0052
                            4      4(29, 38, 58, 72, 83)                 Type 2           0.0042  0.0041
                            5      5(10, 21, 23, 63, 85)                 Type 2           0.0037  0.0035
Type II (SI: 5), Max = 10   1      5(1, 2, 3, 4, 5)                      Type 2           0.0282  0.0274
                            2      10(7,38,39,40,47,48,56,59,60,68)      Type 2           0.0026  0.0024
                            3      10(3,11,18,35,42,50,52,53,67,73)      Type 2           0.0019  0.0018
                            4      10(2,8,26,30,33,35,48,79,81,85)       Type 2           0.0015  0.0014
                            5      10(3,7,10,31,54,59,79,89,95,97)       Type 2           0.0012  0.0012
Type III (SI: 6), Max = 5   1      5(1, 3, 4, 5, 6)                      Type 2           0.0231  0.0223
                            2      5(2, 51, 70, 74, 83)                  Type 2           0.0126  0.0124
                            3      5(74, 76, 79, 86, 95)                 Type 2           0.0066  0.0063
                            4      5(16, 36, 43, 53, 58)                 Type 2           0.0049  0.0046
                            5      5(10, 21, 34, 46, 67)                 Type 2           0.0041  0.0038
Type III (SI: 6), Max = 10  1      6(1, 2, 3, 4, 5, 6)                   Type 2           0.0211  0.0203
                            2      10(1,35,38,42,45,56,57,70,91,97)      Type 2           0.0036  0.0034
                            3      10(2,3,4,31,52,61,72,78,81,85)        Type 2           0.0021  0.0019
                            4      10(3,4,20,27,38,57,80,82,91,96)       Type 2           0.0013  0.0012
                            5      10(3,8,23,38,50,53,56,63,85,99)       Type 2           0.0009  0.0009
6 Concluding remarks
In this study, we introduced a class of genetically optimized self-organizing neural networks, discussed their topologies, developed a detailed genetic design procedure, and applied these networks to nonlinear system modeling. The comprehensive experimental studies involving well-known datasets quantify the strong performance of the network in comparison to the existing fuzzy and neuro-fuzzy models. The key features of this approach can be enumerated as follows:
• The gSONN is a sophisticated and optimized architecture capable of constructing models from a small number of system inputs as well as from a limited data set.
• The proposed design methodology helps reach a compromise between approximation and generalization capabilities of the constructed gSONN model.
• The depth (layer size) and width (node size of each layer) of the gSONN can be selected as a result of a tradeoff between accuracy and complexity of the overall model.
Table 14. Performance index of the network of each layer versus the increase of maximal number of inputs to be selected for Type III (in case of Type T∗)

(a) Triangular MF, Max = 5
Layer  Node            T  PI      EPI
1st    3 4 5 6 0       3  7.7e-5  1.6e-4
2nd    6 16 23 0 0     4  6.7e-5  1.2e-4
3rd    12 18 0 0 0     4  6.1e-5  1.0e-4
4th    3 15 28 0 0     3  5.6e-5  9.7e-5
5th    13 15 19 21 0   1  5.7e-5  9.4e-5

(b) Gaussian-like MF, Max = 5
Layer  Node            T  PI      EPI
1st    1 4 5 0 0       3  3.6e-5  4.5e-5
2nd    12 14 18 0 0    2  3.2e-5  4.1e-5
3rd    7 8 13 14 20    4  2.7e-5  3.6e-5
4th    6 14 15 0 0     3  2.4e-5  3.4e-5
5th    18 26 28 0 0    4  2.3e-5  3.3e-5

[Rows for Max = 2, 3, and 4 not recoverable]
Fig. 26. Performance index according to the increase of number of layers: (a) triangular MF, (a-1) training error, (a-2) testing error; (b) Gaussian-like MF, (b-1) training error, (b-2) testing error (error versus layer 1–5 for Max = 2 (A), 3 (B), 4 (C), and 5 (D); the legend lists the nodes selected in each layer)
• The structure of the network is not predetermined (as in most of the existing neural networks) but becomes dynamically adjusted and optimized during the development process.
• With a properly selected type of membership functions and organization of the layers, the FPN-based gSONN performs better than other fuzzy and neuro-fuzzy models.
• The gSONN comes with a diversity of local neuron characteristics such as PNs or FPNs that are useful in coping with various nonlinear characteristics of nonlinear systems. The GA-based design procedure at each stage (layer) of the gSONN leads to the optimal selection of these preferred nodes (PNs or FPNs) with local characteristics (such as the number of input variables, the order of the polynomial, and a specific subset of input variables) available within a single node. Based on these selections, we build the flexible and optimized architecture of the gSONN.
Fig. 27. Performance index of gFPNN with respect to the increase of number of system inputs (Max = 5): (a) triangular MF, (a-1) training error, (a-2) testing error; (b) Gaussian-like MF, (b-1) training error, (b-2) testing error (error versus layer 1–5 for 4, 5, and 6 system inputs; the legend distinguishes selected input variables from entire system input variables)
Fig. 28. FPN-based gSONN architecture with inputs x(t-30), x(t-24), x(t-18), x(t-12), x(t-6), x(t) and output ŷ: (a) Triangular MF (1 layer and Type III, Max = 3), (b) Triangular MF (1 layer and Type III, Max = 5), (c) Gaussian-like MF (1 layer and Type III, Max = 2), (d) Gaussian-like MF (1 layer and Type III, Max = 5)
Fig. 29. The optimization process of each performance index by the genetic algorithms (Type III, Max = 5, Gaussian): (a) training error, (b) testing error (error versus generation, 0–500, for the 1st to 5th layers)

Table 15. Performance index of FPN based SONN architectures for Types I, II, and III

FPN-based SONN                                  Layer  No. of inputs (Node no.)  Polynomial type  PIs     EPIs
Type I (SI: 4), Max = 5 (Triangular MF)         1      4(1, 2, 3, 4)             Type 3           0.0017  0.0016
                                                2      5(8, 24, 25, 27, 28)      Type 2           0.0014  0.0013
                                                3      4(2, 5, 9, 15)            Type 2           0.0011  0.0011
                                                4      4(2, 6, 17, 26)           Type 3           0.0009  0.0010
                                                5      4(1, 3, 17, 24)           Type 1           0.0009  0.0010
Type I (SI: 4), Max = 5 (Gaussian-like MF)      1      4(1, 2, 3, 4)             Type 3           0.0005  0.0006
                                                2      5(7, 12, 18, 22, 29)      Type 4           0.0004  0.0006
                                                3      4(10, 18, 20, 22)         Type 2           0.0003  0.0005
                                                4      4(3, 7, 21, 26)           Type 2           0.0003  0.0005
                                                5      2(2, 15)                  Type 4           0.0003  0.0005
Type II (SI: 5), Max = 5 (Triangular MF)        1      5(1, 2, 3, 4, 5)          Type 4           0.0003  0.0004
                                                2      4(1, 8, 9, 14)            Type 4           0.0003  0.0003
                                                3      4(2, 17, 23, 24)          Type 3           0.0002  0.0003
                                                4      5(8, 11, 18, 21, 26)      Type 1           0.0002  0.0003
                                                5      4(7, 9, 14, 29)           Type 1           0.0002  0.0003
Type II (SI: 5), Max = 4 (Gaussian-like MF)     1      3(2, 3, 5)                Type 3           0.0004  0.0003
                                                2      4(9, 11, 21, 30)          Type 2           0.0003  0.0003
                                                3      4(2, 8, 19, 21)           Type 4           0.0002  0.0002
                                                4      2(22, 24)                 Type 4           0.0002  0.0002
                                                5      3(8, 18, 27)              Type 3           0.0002  0.0002
Type III (SI: 6), Max = 5 (Triangular MF)       1      4(3, 4, 5, 6)             Type 3           7.7e-5  1.6e-4
                                                2      3(6, 16, 23)              Type 4           6.7e-5  1.2e-4
                                                3      2(12, 18)                 Type 4           6.1e-5  1.0e-4
                                                4      3(3, 15, 28)              Type 3           5.6e-5  9.7e-5
                                                5      4(13, 15, 19, 21)         Type 1           5.7e-5  9.4e-5
Type III (SI: 6), Max = 5 (Gaussian-like MF)    1      3(1, 4, 5)                Type 3           3.6e-5  4.5e-5
                                                2      3(12, 14, 18)             Type 2           3.2e-5  4.1e-5
                                                3      5(7, 8, 13, 14, 20)       Type 4           2.7e-5  3.6e-5
                                                4      3(6, 14, 15)              Type 3           2.4e-5  3.4e-5
                                                5      3(18, 26, 28)             Type 4           2.3e-5  3.3e-5
Table 16. Comparative analysis of the performance of the network; considered are models reported in the literature

Model                                                              PI      PIs       EPIs     NDEI*
Wang's model [35]                                                  0.044   -         -        -
                                                                   0.013   -         -        -
                                                                   0.010   -         -        -
Cascaded-correlation NN [36]                                       -       -         -        0.06
Backpropagation MLP [36]                                           -       -         -        0.02
6th-order polynomial [36]                                          -       -         -        0.04
ANFIS [37]                                                         -       0.0016    0.0015   0.007
FNN model [38]                                                     -       0.014     0.009    -
Recurrent neural network [39]                                      0.0138  -         -        -
SONN** [40]    Basic (5th layer), Type I            Case 1         -       0.0011    0.0011   0.005
                                                    Case 2         -       0.0027    0.0028   0.011
               Modified (5th layer)                 Case 1         -       0.0012    0.0011   0.005
                                                    Case 2         -       0.0038    0.0038   0.016
               Basic (5th layer), Type II           Case 1         -       0.0003    0.0005   0.0016
                                                    Case 2         -       0.0002    0.0004   0.0011
               Modified (5th layer), Type III       Case 1         -       0.000001  0.00009  0.000006
                                                    Case 2         -       0.00004   0.00007  0.00015
Proposed gSONN, PN-based    Type I (5th layer)      Max = 10       -       0.0026    0.0026   -
                            Type II (5th layer)     Max = 10       -       0.0012    0.0012   -
                            Type III (5th layer)    Max = 10       -       0.0009    0.0009   -
Proposed gSONN, FPN-based   Type I (1st layer)      Triangular, Max = 5   -   0.0017    0.0016   -
                                                    Gaussian, Max = 5     -   0.0005    0.0006   -
                            Type II (1st layer)     Triangular, Max = 5   -   0.0003    0.0004   -
                                                    Gaussian, Max = 5     -   0.0004    0.0003   -
                            Type III (1st layer)    Triangular, Max = 5   -   7.7e-5    1.6e-4   -
                                                    Gaussian, Max = 5     -   3.6e-5    4.5e-5   -

* Non-dimensional error index (NDEI) as used in [41] is defined as the root mean square error divided by the standard deviation of the target series.
** is called "conventional optimized FPN-based SONN".
• The design methodology comes with hybrid structural optimization and parametric learning viewed as two phases of model building. The GMDH method comprises both a structural phase, which combines self-organization with an evolutionary algorithm (rooted in the natural law of survival of the fittest), and a parametric phase of Least Square Estimation (LSE)-based learning. Therefore, the structural and parametric optimization help utilize the hybrid method (combining GAs with a structural phase of GMDH) and the LSE-based technique in the most efficient way.
7 Acknowledgements
This work was supported by the Korea Research Foundation Grant funded by the Korean Government (MOEHRD) (KRF-2006-311-D00194 and KRF-2006D00019).
References
1. Cherkassky V, Gehring D, Mulier F (1996) Comparison of adaptive methods for function estimation from samples. IEEE Trans Neural Netw 7:969–984
2. Dickerson JA, Kosko B (1996) Fuzzy function approximation with ellipsoidal rules. IEEE Trans Syst Man Cybern Part B 26:542–560
3. Sommer V, Tobias P, Kohl D, Sundgren H, Lundstrom L (1995) Neural networks and abductive networks for chemical sensor signals: a case comparison. Sens Actuators B 28:217–222
4. Kleinsteuber S, Sepehri N (1996) A polynomial network modeling approach to a class of large-scale hydraulic systems. Comput Elect Eng 22:151–168
5. Cordon O, et al. (2004) Ten years of genetic fuzzy systems: current framework and new trends. Fuzzy Set Syst 141(1):5–31
6. Ivakhnenko AG (1971) Polynomial theory of complex systems. IEEE Trans Syst Man Cybern SMC-1:364–378
7. Ivakhnenko AG, Madala HR (1994) Inductive learning algorithms for complex systems modeling. CRC, Boca Raton, FL
8. Ivakhnenko AG, Ivakhnenko GA (1995) The review of problems solvable by algorithms of the group method of data handling (GMDH). Pattern Recogn Image Anal 5(4):527–535
9. Ivakhnenko AG, Ivakhnenko GA, Muller JA (1994) Self-organization of neural networks with active neurons. Pattern Recogn Image Anal 4(2):185–196
10. Oh SK, Pedrycz W (2002) The design of self-organizing polynomial neural networks. Inf Sci 141:237–258
11. Oh SK, Pedrycz W, Park BJ (2003) Polynomial neural networks architecture: analysis and design. Comput Electr Eng 29(6):703–725
12. Park HS, Park BJ, Oh SK (2002) Optimal design of self-organizing polynomial neural networks by means of genetic algorithms. J Res Inst Eng Technol Dev 22:111–121 (in Korean)
13. Oh SK, Pedrycz W (2003) Fuzzy polynomial neuron-based self-organizing neural networks. Int J Gen Syst 32(3):237–250
14. Pedrycz W, Reformat M (1996) Evolutionary optimization of fuzzy models in fuzzy logic: a framework for the new millennium. In: Dimitrov V, Korotkich V (eds.) Studies in fuzziness and soft computing, pp 51–67
15. Pedrycz W (1984) An identification algorithm in fuzzy relational system. Fuzzy Set Syst 13:153–167
16. Oh SK, Pedrycz W (2000) Identification of fuzzy systems by means of an auto-tuning algorithm and its application to nonlinear systems. Fuzzy Set Syst 115(2):205–230
17. Michalewicz Z (1996) Genetic algorithms + Data structures = Evolution programs. Springer, Berlin Heidelberg New York
18. Holland JH (1975) Adaptation in natural and artificial systems. The University of Michigan Press, Ann Arbor
19. De Jong KA (1996) Are genetic algorithms function optimizers? In: Manner R, Manderick B (eds.) Parallel problem solving from nature 2, North-Holland, Amsterdam
20. Box GEP, Jenkins GM (1976) Time series analysis, forecasting and control. Holden Day, California
21. Tong RM (1980) The evaluation of fuzzy models derived from experimental data. Fuzzy Set Syst 13:1–12
22. Sugeno M, Yasukawa T (1991) Linguistic modeling based on numerical data. IFSA 91, Brussels, Comput Manage & Syst Sci:264–267
23. Sugeno M, Yasukawa T (1993) A fuzzy-logic-based approach to qualitative modeling. IEEE Trans Fuzzy Syst 1(1):7–31
24. Xu CW, Zailu Y (1987) Fuzzy model identification self-learning for dynamic system. IEEE Trans Syst Man Cybern SMC 17(4):683–689
25. Chen JQ, Xi YG, Zhang ZJ (1998) A clustering algorithm for fuzzy model identification. Fuzzy Set Syst 98:319–329
26. Gomez-Skarmeta AF, Delgado M, Vila MA (1999) About the use of fuzzy clustering techniques for fuzzy model identification. Fuzzy Set Syst 106:179–188
27. Kim ET, et al. (1997) A new approach to fuzzy modeling. IEEE Trans Fuzzy Syst 5(3):328–337
28. Kim ET, et al. (1998) A simple identified Sugeno-type fuzzy model via double clustering. Inf Sci 110:25–39
29. Leski J, Czogala E (1999) A new artificial neural networks based fuzzy inference system with moving consequents in if-then rules and selected applications. Fuzzy Set Syst 108:289–297
30. Lin Y, Cunningham III GA (1995) A new approach to fuzzy-neural modeling. IEEE Trans Fuzzy Syst 3(2):190–197
31. Wang Y, Rong G (1999) A self-organizing neural-network-based fuzzy system. Fuzzy Set Syst 103:1–11
32. Park HS, Oh SK, Yoon YW (2001) A new modeling approach to fuzzy-neural networks architecture. J Control Autom Syst Eng 7(8):664–674 (in Korean)
33. Oh SK, Kim DW, Park BJ (2000) A study on the optimal design of polynomial neural networks structure. Trans Korean Inst Electr Eng 49D(3):145–156 (in Korean)
34. Oh SK, Pedrycz W, Kim DW (2002) Hybrid fuzzy polynomial neural networks. Int J Uncertain, Fuzziness Knowl-Based Syst 10(3):257–280
35. Wang LX, Mendel JM (1992) Generating fuzzy rules from numerical data with applications. IEEE Trans Syst Man Cybern 22(6):1414–1427
36. Crowder III RS (1990) Predicting the Mackey-Glass time series with cascade-correlation learning. In: Touretzky D, Hinton G, Sejnowski T (eds.) Proceedings of the 1990 connectionist models summer school, Carnegie Mellon University
37. Jang JSR (1993) ANFIS: adaptive-network-based fuzzy inference system. IEEE Trans Syst Man Cybern 23(3):665–685
38. Maguire LP, Roche B, McGinnity TM, McDaid LJ (1998) Predicting a chaotic time series using a fuzzy neural network. Inf Sci 112:125–136
39. James LC, Huang TY (1999) Automatic structure and parameter training methods for modeling of mechanical systems by recurrent neural networks. Appl Math Model 23:933–944
40. Oh SK, Pedrycz W, Ahn TC (2002) Self-organizing neural networks with fuzzy polynomial neurons. Appl Soft Comput 2(IF):1–10
41. Lapedes AS, Farber R (1987) Non-linear signal processing using neural networks: prediction and system modeling. Technical Report LA-UR-87-2662, Los Alamos National Laboratory, Los Alamos, New Mexico
42. Mackey MC, Glass L (1977) Oscillation and chaos in physiological control systems. Science 197:287–289
43. Cho MW, Kim GH, Seo TI, H YC, Cheng HH (2006) Integrated machining error compensation method using OMM data and modified PNN algorithm. International Journal of Machine Tool and Manufacture 46:1417–1427
44. Menezes LM, Nikolaev NY (2006) Forecasting with genetically programmed polynomial neural networks. International Journal of Forecasting 22(2):249–265
45. Pei JS, Wright JP, Smyth AW (2005) Mapping polynomial fitting into feedforward neural networks for modeling nonlinear dynamic systems and beyond. Comput Methods Appl Mech Eng 194(42–44):4481–4505
46. Nariman-Zadeh N, Darvizeh A, Jamali A, Moeini A (2005) Evolutionary design of generalized polynomial neural networks for modelling and prediction of explosive forming process. J Mater Process Technol 164–165:1561–1571
47. Delivopoulos E, Theocharis JB (2004) A modified PNN algorithm with optimal PD modeling using the orthogonal least squares method. Inf Sci 168(1–4):133–170
48. Huang LL, Shimizu A, Hagihara Y, Kobatake H (2003) Face detection from cluttered images using a polynomial neural network. Neurocomputing 51:197–211
Evolution of Inductive Self-organizing Networks
Dongwon Kim and Gwi-Tae Park
Summary. We discuss a new design methodology for a self-organizing approximation technique that uses an evolutionary algorithm (EA). In this technique, the self-organizing network builds on the idea of the group method of data handling. The performance of the network depends strongly on the number of input variables available to the model, the number of input variables of each node, and the type (order) of the polynomial in each node. In the conventional approach, these must be fixed by the designer before the architecture is constructed, usually by trial and error, which carries a heavy computational burden and low efficiency. Moreover, trial and error does not guarantee that the obtained model is the best one. In this chapter, we propose evolved inductive self-organizing networks to alleviate these problems. The order of the polynomial, the number of input variables, and the optimum input variables are encoded as a chromosome, and the fitness of each chromosome is computed. The appropriate information of each node is evolved accordingly and tuned gradually throughout the EA iterations. The evolved network is a sophisticated and versatile architecture, which can construct models from a limited data set as well as from poorly defined complex problems. Comprehensive comparisons showed that the proposed model, which has a much simpler structure than the conventional model and previous identification methods, significantly improved approximation and prediction abilities.
D. Kim and G.-T. Park: Evolution of Inductive Self-organizing Networks, Studies in Computational Intelligence (SCI) 82, 109–128 (2008). © Springer-Verlag Berlin Heidelberg 2008. www.springerlink.com

1 Introduction
System modelling and identification is important in system analysis, control, and automation as well as scientific research, so much effort has been directed to developing advanced techniques of system modelling. Neural networks (NNs) and fuzzy systems have been widely used for modelling nonlinear systems. The approximation capability of neural networks, such as multilayer perceptrons, radial basis function (RBF) networks, or dynamic recurrent neural networks, has been investigated [1–3]. On the other hand, fuzzy systems can approximate nonlinear functions with arbitrary accuracy [4–5]. But the resultant neural network representation is very complex and difficult to understand, and fuzzy systems require too many fuzzy rules for accurate function approximation, particularly in the case of multidimensional inputs.

Alternatively, there is the GMDH-type algorithm. The group method of data handling (GMDH) was introduced by Ivakhnenko in the early 1970's [6–10]. GMDH-type algorithms have been extensively used since the mid-1970's for prediction and modelling of complex nonlinear processes. GMDH is mainly characterized as a self-organizing algorithm, which can provide automated selection of essential input variables without the use of prior information about the relationship among input-output variables [11].

Self-organizing Polynomial Neural Networks (SOPNN) [12–13], a GMDH-type algorithm, is a useful approximation technique. SOPNN has an architecture similar to that of feedforward neural networks whose neurons are replaced by polynomial nodes. The output of each node in the SOPNN structure is obtained by using several types of polynomials, such as linear, quadratic, and modified quadratic polynomials of the input variables. These polynomials are called partial descriptions (PDs). SOPNNs have fewer nodes than NNs, but their nodes are more flexible.

Although the SOPNN is structured by a systematic design procedure, it has some drawbacks that must be solved. If there is a sufficiently large number of input variables and data points, the SOPNN algorithm has a tendency to produce overly complex networks. On the other hand, for a small number of available input variables, SOPNN does not maintain good performance. Moreover, the performance of SOPNN depends strongly on the number of input variables available to the model as well as on the number of input variables and the polynomial type (order) in each PD. These must be chosen before the architecture of SOPNN is constructed. In most cases, they are determined by a trial and error method, which bears a heavy computational burden at low efficiency. Moreover, the SOPNN algorithm is a heuristic method, so it does not guarantee that the obtained SOPNN is the best one for nonlinear system modelling. Therefore, these drawbacks must be overcome.

In this chapter, we present a new design methodology of SOPNN using an evolutionary algorithm (EA) to alleviate the above-mentioned drawbacks of the SOPNN. This new network is called the EA-based SOPNN. The EA has been widely used as a parallel global search method for optimization problems [14–16]. Here the EA is used to determine how many input variables are chosen for each node, which input variables are optimally chosen among the many input candidates, and what the appropriate type of polynomial is in each PD.

This chapter is organized as follows. The conventional SOPNN and its drawbacks are briefly explained to illustrate the proposed modelling algorithm in Section 2. The new algorithm, the design methodology of EA-based SOPNN, is described, and the coding of the key factors of the SOPNN, the representation of the chromosome, and the fitness function are also discussed in Section 2. The proposed EA-based SOPNN is applied to nonlinear systems modelling to assess its performance, and its simulation results are compared with those of other methods, including the conventional SOPNN, in Section 3. Conclusions are given in Section 4. Finally, the Appendix contains a summary of the design procedure of the conventional SOPNN algorithm.
2 Design of EA-based SOPNN
The conventional SOPNN algorithm is based on the GMDH method and utilizes various types of polynomials such as linear, quadratic, and modified quadratic. By choosing the most significant input variables and polynomial types, the PDs in each layer can be obtained. The framework of the design procedure of the conventional SOPNN is summarized in the Appendix, and further discussion on the conventional SOPNN can be found in [13]. As provided in the Appendix, when the final layer is constructed, the node with the best predictive capability is selected as the output node. All remaining nodes in the final layer are discarded. Furthermore, all the nodes in the previous layers that do not influence the output node are also removed by tracing the data flow path of each layer.

The SOPNN is a flexible neural architecture whose structure is developed through a modeling process. Each PD can have a different number of input variables and can exploit a different order of the polynomial. As a result, SOPNN provides a systematic design procedure. But the number of input variables and the polynomial order type must be fixed by the designer before the architecture is constructed, so the trial and error method bears a heavy computational burden at low efficiency. That is, it does not guarantee the best model, only a good model for a certain system, and its performance depends strongly on the few factors stated in Section 1.

In this section, we propose a new design procedure using EA for the systematic design of SOPNN with the optimum performance. In the SOPNN algorithm, the key problems are the determination of the optimal number of input variables, the selection of input variables, and the selection of the order of the polynomial forming a PD in each node. In [13], these factors are determined in advance by the trial and error method. In this chapter, these problems are solved automatically by using EA.
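The pruning step described earlier in this section, where only the nodes that influence the output node are kept, can be sketched as a backward trace through the layers. This is only an illustration under assumed data structures: the name `trace_active_nodes` and the representation of each layer as a dictionary mapping a node index to the indices of its inputs in the previous layer are hypothetical and do not come from the chapter.

```python
def trace_active_nodes(layers, output_node):
    """Return, for each layer, the set of node indices that feed the output node.

    layers[k][j] lists the indices of the nodes in layer k-1 that are inputs
    to node j of layer k (for k = 0 the indices refer to system inputs).
    This layout is an assumption made for the sketch.
    """
    active = [set() for _ in layers]
    active[-1].add(output_node)
    # Walk from the final layer back to the first, following the data flow path.
    for k in range(len(layers) - 1, 0, -1):
        for node in active[k]:
            active[k - 1].update(layers[k][node])
    return active


# Hypothetical three-layer network; node 1 of the final layer is the output node.
example_layers = [
    {0: [0, 1], 1: [1, 2], 2: [0, 3]},
    {0: [0, 1], 1: [1, 2]},
    {0: [0, 1], 1: [0]},
]
kept = trace_active_nodes(example_layers, output_node=1)
```

In this toy network, node 2 of the first layer and node 1 of the second layer never reach the output node, so they would be removed.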
The EA is implemented using crossover and mutation probability rates for better exploitation of the optimal inputs and order of polynomial in each node of SOPNN. All of the initial EA populations are randomized so that minimal heuristic knowledge is used. The appropriate inputs and order are evolved accordingly and are tuned gradually throughout the EA iterations. In the evolutionary design procedure, the key issues are how to encode the order of the polynomial, the number of input variables, and the optimum input variables as a chromosome, and how to define a criterion to compute the fitness of each chromosome. A detailed representation of the coding strategy and the choice of fitness function are given in the following sections.
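The EA operators can be sketched as below. The chapter reports roulette-wheel selection, one-point crossover with probability 0.85, and bit-flip mutation with rate 0.05 later in this section; everything else here is an assumption for illustration, including the function names, the list-of-bits chromosome representation, and the requirement that fitness values be non-negative with a positive sum (the actual fitness function is defined elsewhere in the chapter).

```python
import random

P_CROSSOVER = 0.85  # crossover probability reported in the chapter
P_MUTATION = 0.05   # mutation rate reported in the chapter


def roulette_select(population, fitnesses, rng):
    """Pick one chromosome with probability proportional to its fitness."""
    pick = rng.uniform(0.0, sum(fitnesses))
    accumulated = 0.0
    for chromosome, fitness in zip(population, fitnesses):
        accumulated += fitness
        if pick < accumulated:
            return chromosome
    return population[-1]  # guard against floating-point round-off


def one_point_crossover(a, b, rng, pc=P_CROSSOVER):
    """Exchange all bits after a random crossing site with probability pc."""
    if len(a) > 1 and rng.random() < pc:
        site = rng.randint(1, len(a) - 1)
        return a[:site] + b[site:], b[:site] + a[site:]
    return a[:], b[:]


def mutate(chromosome, rng, pm=P_MUTATION):
    """Flip each bit independently with probability pm."""
    return [1 - bit if rng.random() < pm else bit for bit in chromosome]


def next_generation(population, fitnesses, rng):
    """Reproduction, crossover, and mutation applied in sequence."""
    pool = [roulette_select(population, fitnesses, rng) for _ in population]
    children = []
    for i in range(0, len(pool) - 1, 2):
        c1, c2 = one_point_crossover(pool[i], pool[i + 1], rng)
        children.extend([mutate(c1, rng), mutate(c2, rng)])
    if len(pool) % 2:  # odd population size: carry the last parent over
        children.append(mutate(pool[-1], rng))
    return children
```

Passing an explicit `random.Random` instance keeps runs reproducible, which is convenient when comparing settings of the two rates.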
2.1 Representation of chromosome for appropriate information of each PD
When the SOPNN is designed by using EA, the most important consideration is the representation strategy, that is, the process by which the key factors of the SOPNN are encoded into the chromosome. A binary coding is employed for the available design specifications. The order and the inputs of each node in the SOPNN are coded as a finite-length string. Our chromosomes are made of three sub-chromosomes. The first one consists of 2 bits for the order of polynomial (PD), the second one consists of 3 bits for the number of inputs of PD, and the last one consists of N bits, where N is equal to the total number of input candidates in the current layer. These input candidates are the node outputs of the previous layer. The representation of binary chromosomes is illustrated in Fig. 1.

Fig. 1. Structure of binary chromosome for a PD (1st sub-chromosome: 2 bits for the order of PD; 2nd sub-chromosome: 3 bits for the number of inputs of PD; 3rd sub-chromosome: N bits equal to input candidates in the current layer)

The 1st sub-chromosome is made of 2 bits, which represent the several types of the order of PD. The relationship between bits in the 1st sub-chromosome and the order of PD is shown in Table 1. Thus, each node can exploit a different order of the polynomial.

Table 1. Relationship between bits in the 1st sub-chromosome and order of PD

Bits in the 1st sub-chromosome   Order of polynomial (PD)
00                               Type 1 - Linear
01                               Type 2 - Quadratic
10                               Type 2 - Quadratic
11                               Type 3 - Modified quadratic

Table 2. Relationship between bits in the 2nd sub-chromosome and number of inputs to PD

Bits in the 2nd sub-chromosome   Number of inputs to PD
000                              1
001                              2
010                              2
011                              3
100                              3
101                              4
110                              4
111                              5

The 3rd sub-chromosome has N bits, which are a concatenated string of 0's and 1's. An input candidate is represented by a '1' bit if it is chosen as an input variable to the PD and by a '0' bit if it is not chosen. This type of representation solves the problem of which input variables are to be chosen. If many input candidates are chosen for a model design, the modelling becomes computationally complex and normally requires a lot of time to achieve good results. Also, complex modelling can produce inaccurate results and poor generalization. Good approximation performance does not necessarily guarantee good
generalization capability [18]. To overcome this drawback, we introduced the 2nd sub-chromosome into the chromosome. The 2nd sub-chromosome consists of 3 bits and represents the number of input variables to be selected. The number encoded by the 2nd sub-chromosome is shown in Table 2. The number of input variables selected for each node among all input candidates equals the number represented in the 2nd sub-chromosome. The designer must determine the maximum number, considering the characteristics of the system, the design specification, and prior knowledge of the model. With this method, we can address problems such as the conflict between overfitting and generalization and excessive computation time.

The relationship between chromosome and PD information is shown in Fig. 2, and the node with the PD corresponding to that chromosome is shown in Fig. 3. Figure 2 shows an example of a PD whose required information is obtained from its chromosome. The 1st sub-chromosome shows that the polynomial order is Type 2 (quadratic form). The 2nd sub-chromosome identifies two input variables to this node. The 3rd sub-chromosome indicates that x1 and x6 are selected as the input variables. Thus, the output of this PD can be expressed as (1).

$$\hat{y} = f(x_1, x_6) = c_0 + c_1 x_1 + c_2 x_6 + c_3 x_1^2 + c_4 x_6^2 + c_5 x_1 x_6 \qquad (1)$$
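As a rough illustration of the decoding in Tables 1 and 2 and of Eq. (1), the sketch below turns a binary chromosome into PD information and evaluates the quadratic PD for two inputs. The function names are hypothetical. One rule is an explicit assumption: when the 3rd sub-chromosome flags more candidates than the number given by the 2nd sub-chromosome, the sketch keeps the first flagged candidates, because the text does not say how such a case is resolved.

```python
# Table 1: bits of the 1st sub-chromosome -> polynomial type of the PD.
ORDER_BY_BITS = {"00": 1, "01": 2, "10": 2, "11": 3}
# Table 2: bits of the 2nd sub-chromosome -> number of inputs to the PD.
N_INPUTS_BY_BITS = {
    "000": 1, "001": 2, "010": 2, "011": 3,
    "100": 3, "101": 4, "110": 4, "111": 5,
}


def decode_chromosome(bits):
    """Split a chromosome into (polynomial type, number of inputs, selected inputs).

    Selected inputs are 1-based indices of the input candidates x1, x2, ...
    Assumption: if more candidates are flagged than allowed, keep the first ones.
    """
    text = "".join(str(b) for b in bits)
    order_type = ORDER_BY_BITS[text[:2]]
    n_inputs = N_INPUTS_BY_BITS[text[2:5]]
    flagged = [i + 1 for i, bit in enumerate(text[5:]) if bit == "1"]
    return order_type, n_inputs, flagged[:n_inputs]


def quadratic_pd(coeffs, xa, xb):
    """Evaluate the Type 2 (quadratic) PD of Eq. (1) for two inputs."""
    c0, c1, c2, c3, c4, c5 = coeffs
    return c0 + c1 * xa + c2 * xb + c3 * xa ** 2 + c4 * xb ** 2 + c5 * xa * xb


# Chromosome of the Fig. 2 example: "10" | "010" | "100001" over x1..x6.
fig2_chromosome = [1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1]
pd_info = decode_chromosome(fig2_chromosome)
```

For the Fig. 2 chromosome this yields a Type 2 polynomial with two inputs, x1 and x6, matching the description in the text.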
The coefficients c0, c1, . . . , c5 are evaluated by using the training data set and the standard least square estimation (LSE). Therefore, the polynomial function of the PD is formed automatically according to the information provided by the sub-chromosomes. The design procedure of the EA-based SOPNN is shown in Fig. 4. At the beginning of the process, the initial populations comprise a set of chromosomes that are scattered all over the search space. The populations
Fig. 2. Example of a PD whose required information is obtained from its chromosome. The 1st sub-chromosome gives the order of the polynomial (Type 2, quadratic), the 2nd sub-chromosome gives the number of inputs (2), and the 3rd sub-chromosome marks each input candidate x1 to x6 as selected (x1, x6) or ignored (x2 to x5)
Fig. 3. Node with PD corresponding to the chromosome in Fig. 2 (inputs x1 and x6, PD with 2 inputs, Type 2, output ŷ)
are all randomly initialized; thus, the use of heuristic knowledge is minimized. The fitness assignment in the EA guides the search toward the optimal solution. The fitness function for specific modeling cases is explained later. After each chromosome has been evaluated and assigned a fitness value, the current population undergoes reproduction to create the next generation. The roulette-wheel selection scheme is used to determine the members of the new generation. After the new population is built, a mating pool is formed for crossover. Crossover proceeds in three steps. First, two newly reproduced strings are selected from the mating pool produced by reproduction. Second, a position (one point) along the two strings is selected uniformly at random. Third, all characters following the crossing site are exchanged between the two strings. We use a one-point crossover operator with a crossover probability pc of 0.85. Crossover is followed by mutation, the occasional alteration of the value at a particular bit position (the bit is flipped from 0 to 1 or vice versa). Mutation serves as an insurance policy against the loss of a particular piece of information (any single bit). The mutation rate pm is fixed at 0.05. Generally, after these three operations, the overall fitness of the population improves. Each of the
Fig. 4. Block diagram of the design procedure of EA-based SOPNN. Flow: start; generation of the initial population (the parameters are encoded into a chromosome); evaluation (each chromosome is evaluated and has its fitness value); termination condition. If the condition is not met: reproduction (roulette wheel), one-point crossover, and invert mutation, so that the fitness values of the new chromosomes improve through the generations. If it is met: chromosomes with good fitness values are selected as the new input variables of the next layer; at the end, when the 3rd layer is reached, the one chromosome (PD) with the best performance is selected as the output. Crossover example: A = 0 0 0 0 0 0 0 1 1 1 1 and B = 1 1 0 0 0 1 1 0 0 1 1 become A' = 0 0 0 0 0 0 0 0 0 1 1 and B' = 1 1 0 0 0 1 1 1 1 1 1. Mutation example: A' = 0 0 0 0 0 0 0 0 0 1 1 becomes 0 0 0 1 0 0 0 0 0 1 1
population generated then goes through a series of evaluation, reproduction, crossover, and mutation, and the procedure is repeated until a termination condition is reached. After the evolution process, the final generation of the population consists of highly fit strings that provide optimal solutions. After the termination condition is satisfied, the one chromosome (PD) with the best performance in the final generation of the population is selected as the output PD. All remaining chromosomes are discarded, and all nodes in the previous layers that do not influence this output PD are also removed. Finally, the EA-based SOPNN model is obtained.

2.2 Fitness function for modelling

In an EA, the fitness function must be specified. The genotype representation encodes the problem into a string, and the fitness function measures the performance of the model. Finding a good fitness measure is quite important for evolving systems. To construct models with superior approximation and generalization ability, we introduce the error function

\[ E = \Theta \times PI + (1 - \Theta) \times EPI \tag{2} \]

where Θ ∈ [0, 1] is a weighting factor for PI and EPI, which are the performance index values of the training data and testing data, respectively,
as expressed in (4). The fitness value [12] is then determined as

\[ F = \frac{1}{1 + E} \tag{3} \]

Maximizing F is identical to minimizing E. The choice of Θ establishes a tradeoff between the approximation and generalization ability of the EA-based SOPNN.
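The genetic operators of Sect. 2.1 and the fitness measure of (2) and (3) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: all function names are hypothetical, chromosomes are represented as strings of '0' and '1', and the mutation rate pm is assumed to apply independently to each bit. Only the Table 2 lookup is shown for decoding, because the remaining decoding rules are not fully specified in this part of the chapter.

```python
import random

# Table 2: 3-bit 2nd sub-chromosome -> number of inputs to the PD.
NUM_INPUTS = {"000": 1, "001": 2, "010": 2, "011": 3,
              "100": 3, "101": 4, "110": 4, "111": 5}


def one_point_crossover(a, b, site):
    """Exchange all characters that follow the crossing site."""
    return a[:site] + b[site:], b[:site] + a[site:]


def flip_bit(chromosome, position):
    """Invert one bit (0-based position)."""
    flipped = "1" if chromosome[position] == "0" else "0"
    return chromosome[:position] + flipped + chromosome[position + 1:]


def mutate(chromosome, pm=0.05, rng=random):
    """Flip each bit independently with probability pm (assumed per-bit rate)."""
    for position in range(len(chromosome)):
        if rng.random() < pm:
            chromosome = flip_bit(chromosome, position)
    return chromosome


def roulette_wheel(population, fitnesses, rng=random):
    """Draw a new population; each pick is proportional to fitness."""
    return rng.choices(population, weights=fitnesses, k=len(population))


def error_function(pi, epi, theta):
    """Eq. (2): E = theta * PI + (1 - theta) * EPI."""
    return theta * pi + (1.0 - theta) * epi


def fitness(pi, epi, theta):
    """Eq. (3): F = 1 / (1 + E)."""
    return 1.0 / (1.0 + error_function(pi, epi, theta))
```

With the strings of Fig. 4, `one_point_crossover("00000001111", "11000110011", 7)` returns `("00000000011", "11000111111")`, and flipping the fourth bit of the first child with `flip_bit(child, 3)` gives `"00010000011"`. For Θ = 0.5 and the third-layer values of Table 4 (PI = 0.0129, EPI = 0.1086), (2) gives E = 0.06075 and (3) gives F ≈ 0.9427.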
3 Simulation Results

In this section, we show the performance of the new EA-based SOPNN on two well-known nonlinear system modeling problems. One is a time series of a gas furnace (Box-Jenkins data) [19], which was studied previously in [19–21], and the other is a nonlinear system already exploited in fuzzy modeling [22–26].

3.1 Gas furnace process

The delayed terms of the methane gas flow rate u(t) and the carbon dioxide density y(t), namely u(t−3), u(t−2), u(t−1), y(t−3), y(t−2), and y(t−1), are used as input variables to the EA-based SOPNN. The actual system output y(t) is used as the target output variable for this model. We choose the input variables of nodes in the 1st layer from these input variables. The total data set of 296 input-output pairs is split into two parts. The first part (148 pairs) is used as the training data set, and the remaining part serves as the testing data set. Using the training data set, the EA-based SOPNN estimates the coefficients of the polynomial by standard LSE. The performance index is defined as the mean squared error

\[ PI\,(EPI) = \frac{1}{m} \sum_{i=1}^{m} (y_i - \hat{y}_i)^2 \tag{4} \]

where y_i is the actual system output, ŷ_i is the estimated output of each node, and m is the number of data points. The design parameters of the EA-based SOPNN for modeling are shown in Table 3. In the 1st layer, 20 chromosomes are generated and evolved over 40 generations, where each chromosome in the population corresponds to a node. Thus 20 nodes (PDs) are produced in the 1st layer by the EA operators. All PDs are estimated and evaluated using the training and testing data sets, respectively. They are also evaluated by the fitness function of (3) and ranked according to their fitness value. We choose the w highest-ranking nodes, where w is a predetermined number, and use their outputs as new input variables to the nodes in the next layer. In other words, the chosen PDs (w nodes) must be preserved for the design of the next layer, and the outputs of the preserved PDs serve as inputs to the
Table 3. Design parameters of EA-based SOPNN for modeling

Parameters              1st layer   2nd layer   3rd layer
Maximum generations     40          60          80
Population size:(w)     20:(15)     60:(50)     80
String length           11          20          55
Crossover rate (pc)     0.85 (all layers)
Mutation rate (pm)      0.05 (all layers)
Weighting factor (Θ)    0.1~0.9 (all layers)
Type (order)            1~3 (all layers)

Table 4. Values of performance indices of the proposed EA-based SOPNN

Weighting factor   1st layer (PI – EPI)   2nd layer (PI – EPI)   3rd layer (PI – EPI)
0.1                0.0214 – 0.1260        0.0200 – 0.1231        0.0199 – 0.1228
0.25               0.0214 – 0.1260        0.0149 – 0.1228        0.0145 – 0.1191
0.5                0.0214 – 0.1260        0.0139 – 0.1212        0.0129 – 0.1086
0.75               0.0214 – 0.1260        0.0139 – 0.1293        0.0138 – 0.1235
0.9                0.0173 – 0.1411        0.0137 – 0.1315        0.0129 – 0.1278
next layer. The value of w differs from layer to layer and is also shown in Table 3. This procedure is repeated for the 2nd layer and the 3rd layer. Table 4 summarizes the values of the performance indices, PI and EPI, of the proposed EA-based SOPNN according to the weighting factor. These values are the lowest in each layer. The overall lowest values of the performance indices are obtained at the third layer when the weighting factor is 0.5. If this model is designed to have a fourth or higher layer, the performance values become much lower, and the computation time increases considerably for a model with such a complex network. Fig. 5 depicts the trend of the performance index values produced in successive generations of the EA for the weighting factor Θ = 0.5. Fig. 6 illustrates the values of the error function and fitness function in successive EA generations when Θ = 0.5. Fig. 7 shows the proposed EA-based SOPNN model with 3 layers and its identification performance for Θ = 0.5. The model output follows the actual output very well. The values of the performance indices of the proposed method are PI = 0.012 and EPI = 0.108.
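The data preparation described at the start of this subsection and the performance index (4) can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the first three samples are dropped because they lack a complete set of delayed terms (the chapter does not say how the initial samples are handled), and the split simply takes the first half of the rows for training.

```python
def lagged_inputs(u, y, max_lag=3):
    """Build rows [u(t-3), u(t-2), u(t-1), y(t-3), y(t-2), y(t-1)] with target y(t)."""
    rows, targets = [], []
    for t in range(max_lag, len(y)):
        u_lags = [u[t - k] for k in range(max_lag, 0, -1)]
        y_lags = [y[t - k] for k in range(max_lag, 0, -1)]
        rows.append(u_lags + y_lags)
        targets.append(y[t])
    return rows, targets


def split_in_half(rows, targets):
    """First half for training, remainder for testing (assumed split rule)."""
    half = len(rows) // 2
    return (rows[:half], targets[:half]), (rows[half:], targets[half:])


def mean_squared_error(actual, predicted):
    """Eq. (4): PI on the training data, EPI on the testing data."""
    m = len(actual)
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / m
```

Calling `mean_squared_error` on the training targets and on the testing targets gives PI and EPI, respectively.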
Fig. 5. Trend of performance index values with respect to generations through layers: (a) performance index for the training data set (PI), (b) performance index for the testing data set (EPI). Horizontal axis: generations (0 to 180), covering the 1st, 2nd, and 3rd layers
Fig. 6. Values of the error function and fitness function for successive generations: (a) error function E, (b) fitness function F. Horizontal axis: generations (0 to 180), covering the 1st, 2nd, and 3rd layers
Fig. 7. Proposed EA-based SOPNN model with 3 layers and its identification performance: (a) proposed EA-based SOPNN model with 3 layers (inputs u(t-3), u(t-2), u(t-1), y(t-3), y(t-2), y(t-1); output ŷ), (b) actual output versus model output (CO2 concentration versus data number, training and testing data), (c) error versus data number
Fig. 8. (a) Basic SOPNN & Case 1, (b) Modified SOPNN & Case 2: Conventional SOPNN models with 5 layers
To compare the network size of the proposed EA-based SOPNN with that of the conventional SOPNN, conventional SOPNN models are shown in Fig. 8. The structure of the basic SOPNN with Case 1 in Fig. 8(a) is obtained by using 4 input variables and a Type 3 polynomial for every node in all layers up to the fifth layer. Its performance is PI = 0.012 and EPI = 0.084. The structure of the modified SOPNN with Case 2 in Fig. 8(b), on the other hand, is obtained by using 2 input variables and a Type 1 polynomial for every node in the 1st layer and 3 input variables and a Type 2 polynomial for every node from the 2nd layer to the 5th layer. For this model, PI is 0.016 and EPI is 0.101. Figures 7 and 8 show that the structure of the EA-based SOPNN is much simpler than that of the conventional SOPNN in terms of the number of nodes and layers, despite their similar performance. Table 5 compares the proposed model with other techniques on the basis of the same performance index for the training and testing data sets, where PI is the performance index of the model for the training data set and EPI that for the testing data set. The proposed EA-based SOPNN model outperforms the models of [12, 20, 21] in both accuracy and generalization capability and, with only 3 layers, performs comparably to the 5-layer conventional SOPNN.
Table 5. Values of performance index of some identification models

Model                                        PI      EPI
Kim’s model [20]                             0.034   0.244
Lin’s model [21]                             0.071   0.261
Kim’s model [12]                             0.013   0.126
SOPNN (5 layers) [13]   Basic & case 1       0.012   0.084
                        Modified & case 2    0.016   0.101
EA-based SOPNN (3 layers)                    0.012   0.108
3.2 Three-input nonlinear function

This example demonstrates the application of the proposed EA-based SOPNN model to the identification of a highly nonlinear function. The performance of this model is compared with that of earlier models. The function to be identified is the three-input nonlinear function

\[ y = \left(1 + x_1^{0.5} + x_2^{-1} + x_3^{-1.5}\right)^2 \tag{5} \]

which has been widely used by Takagi and Hayashi [22], Sugeno and Kang [23], and Kondo [24] to test their modelling approaches. Table 6 shows 40 pairs of input-output data obtained from (5) [26]. The input x4 is a dummy variable, which is not related to (5). The data in Table 6 are divided into the training data set (Nos. 1–20) and the testing data set (Nos. 21–40). To compare the performance, the same performance index adopted in [22–26], the average percentage error (APE), is used:

\[ APE = \frac{1}{m} \sum_{i=1}^{m} \frac{|y_i - \hat{y}_i|}{y_i} \times 100 \tag{6} \]

where m is the number of data pairs and y_i and ŷ_i are the i-th actual output and model output, respectively. Again, a series of comprehensive experiments was conducted, and the results are summarized. The design parameters of the EA-based SOPNN for each layer are shown in Table 7. The simulation results of the EA-based SOPNN are summarized in Table 8. The lowest values of the performance indices, PI = 0.188 and EPI = 1.087, are obtained at the third layer when the weighting factor (Θ) is 0.25. Figure 9 illustrates the trend of the performance index values produced in successive generations of the EA for the weighting factor Θ = 0.25. Figure 10 shows the values of the error function and fitness function in successive EA generations when Θ is 0.25.
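As a quick check of (5) and (6), the function and the error measure can be written directly in Python. The function names are hypothetical and the sketch is only illustrative; evaluating the function at the inputs of Table 6 reproduces the tabulated outputs to the printed precision, for example No. 1 (x1 = 1, x2 = 3, x3 = 1) and No. 9 (x1 = 5, x2 = 1, x3 = 1).

```python
def three_input_function(x1, x2, x3):
    """Eq. (5): y = (1 + x1^0.5 + x2^-1 + x3^-1.5)^2; x4 is a dummy input and unused."""
    return (1.0 + x1 ** 0.5 + x2 ** -1 + x3 ** -1.5) ** 2


def average_percentage_error(actual, predicted):
    """Eq. (6): mean of |y_i - yhat_i| / y_i, expressed in percent."""
    m = len(actual)
    return 100.0 / m * sum(abs(a - p) / a for a, p in zip(actual, predicted))
```

Computing `average_percentage_error` on data Nos. 1–20 gives PI and on Nos. 21–40 gives EPI for this example.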
Table 6. Input-output data of three-input nonlinear function

No.  x1  x2  x3  x4  y         No.  x1  x2  x3  x4  y
1    1   3   1   1   11.11     21   1   1   5   1   9.545
2    1   5   2   1   6.521     22   1   3   4   1   6.043
3    1   1   3   5   10.19     23   1   5   3   5   5.724
4    1   3   4   5   6.043     24   1   1   2   5   11.25
5    1   5   5   1   5.242     25   1   3   1   1   11.11
6    5   1   4   1   19.02     26   5   5   2   1   14.36
7    5   3   3   5   14.15     27   5   1   3   5   19.61
8    5   5   2   5   14.36     28   5   3   4   5   13.65
9    5   1   1   1   27.42     29   5   5   5   1   12.43
10   5   3   2   1   15.39     30   5   1   4   1   19.02
11   1   5   3   5   5.724     31   1   3   3   5   6.38
12   1   1   4   5   9.766     32   1   5   2   5   6.521
13   1   3   5   1   5.87      33   1   1   1   1   16
14   1   5   4   1   5.406     34   1   3   2   1   7.219
15   1   1   3   5   10.19     35   1   5   3   5   5.724
16   5   3   2   5   15.39     36   5   1   4   5   19.02
17   5   5   1   1   19.68     37   5   3   5   1   13.39
18   5   1   2   1   21.06     38   5   5   4   1   12.68
19   5   3   3   5   14.15     39   5   1   3   5   19.61
20   5   5   4   5   12.68     40   5   3   2   5   15.39
Table 7. Design parameters of EA-based SOPNN for modeling

Parameters              1st layer   2nd layer   3rd layer
Maximum generations     40          60          80
Population size:(w)     20:(15)     60:(50)     80
String length           8           20          55
Crossover rate (pc)     0.85 (all layers)
Mutation rate (pm)      0.05 (all layers)
Weighting factor (Θ)    0.1~0.9 (all layers)
Type (order)            1~3 (all layers)
Figure 11 depicts the proposed EA-based SOPNN model with 3 layers for Θ = 0.25. The structure of the EA-based SOPNN is very simple and gives good performance. The conventional SOPNN, however, has difficulty in structuring a model for this nonlinear function. Therefore, only a small
Table 8. Values of performance indices of the proposed EA-based SOPNN

Weighting factor   1st layer (PI – EPI)   2nd layer (PI – EPI)   3rd layer (PI – EPI)
0.1                5.7845 – 6.8199        2.3895 – 3.3400        2.2837 – 3.1418
0.25               5.7845 – 6.8199        0.8535 – 3.1356        0.1881 – 1.0879
0.5                5.7845 – 6.8199        1.6324 – 5.5291        1.2268 – 3.5526
0.75               5.7845 – 6.8199        1.9092 – 4.0896        0.5634 – 2.2097
0.9                5.7845 – 6.8199        2.5083 – 5.1444        0.0002 – 4.8804

Fig. 9. Trend of performance index values with respect to generations through layers: (a) performance index for the training data set (PI), (b) performance index for the testing data set (EPI). Horizontal axis: generations (0 to 180), covering the 1st, 2nd, and 3rd layers
Fig. 10. Values of the error function and fitness function for successive generations: (a) error function E, (b) fitness function F. Horizontal axis: generations (0 to 180), covering the 1st, 2nd, and 3rd layers
number of input candidates are considered. Fig. 12 shows the identification performance and error of the proposed EA-based SOPNN when Θ is 0.25. The output of the EA-based SOPNN follows the actual output very well.
Fig. 11. Structure of the EA-based SOPNN model with 3 layers (Θ = 0.25); inputs X1, X2, X3, output ŷ
Fig. 12. Identification performance and errors of EA-based SOPNN model with 3 layers: (a) actual output versus model output of training data, (b) errors of (a), (c) actual output versus model output of testing data, (d) errors of (c). Horizontal axes: data number
Table 9 shows the performance of the proposed EA-based SOPNN model and of other models studied in the literature. The experimental results show that the proposed model outperforms the existing models in terms of both approximation capability (PI) and generalization ability (EPI). The conventional SOPNN, by contrast, cannot be applied properly to the identification of this example.
Table 9. Performance comparison of various identification models

Model                                       PI           EPI
GMDH model [24]                             4.7          5.7
Fuzzy model [23]        model 1             1.5          2.1
                        model 2             0.59         3.4
FNN [26]                type 1              0.84         1.22
                        type 2              0.73         1.28
                        type 3              0.63         1.25
GD-FNN [25]                                 2.11         1.54
SOPNN (5 layers) [13]   Basic & case 1      2.59         8.52
                        Modified            Impossible   Impossible
EA-based SOPNN (3 layers)                   0.188        1.087
4 Conclusions

In this chapter, we proposed a new design methodology for the SOPNN using an evolutionary algorithm, called the EA-based SOPNN, and studied its properties. The EA-based SOPNN is a sophisticated and versatile architecture that can construct models for poorly defined complex problems from a limited data set. Moreover, the architecture of the model is not predetermined but is self-organized automatically during the design process. The conflict between overfitting and generalization can be avoided by using a fitness function with a weighting factor. The experimental results showed that the proposed EA-based SOPNN is superior to the conventional SOPNN models as well as to other previous models in terms of modeling performance.
5 Acknowledgement

The authors gratefully acknowledge the financial support of the Korea Science & Engineering Foundation. This work was supported by grant No. R01-2005-00011044-0 from the Basic Research Program of the Korea Science & Engineering Foundation. The authors are also very grateful to the anonymous reviewers for their valuable comments.
APPENDIX: SELF-ORGANIZING POLYNOMIAL NEURAL NETWORK [13]

This appendix summarizes the design procedure of the conventional SOPNN algorithm.

Step 1: Determine system’s input variables
We define the input variables x_{1i}, x_{2i}, ..., x_{Ni} related to the output variable y_i, where N is the number of input variables and i is the index of the input-output data pair. The input data are normalized if required.

Step 2: Form training and testing data
The input-output data set is separated into the training data set (n_tr) and the testing data set (n_te), so that n = n_tr + n_te. The training data set is used to construct the SOPNN model, and the testing data set is used to evaluate the constructed model.

Step 3: Choose a structure of the SOPNN
The structure of the SOPNN depends strongly on the number of input variables and the order of the PDs in each layer. Two kinds of SOPNN structures are available, the basic SOPNN structure and the modified SOPNN structure. Each of them is specified with two cases.
(a) Basic SOPNN structure: the number of input variables of the PDs is the same in every layer.
    Case 1. The polynomial order of the PDs is the same in each layer of the network.
    Case 2. The polynomial order of the PDs in the 2nd or higher layer is different from that of the PDs in the 1st layer.
(b) Modified SOPNN structure: the number of input variables of the PDs varies from layer to layer.
    Case 1. The polynomial order of the PDs is the same in every layer.
    Case 2. The polynomial order of the PDs in the 2nd or higher layer is different from that of the PDs in the 1st layer.

Step 4: Determine the number of input variables and the order of the polynomial forming a PD
We determine arbitrarily the number of input variables and the type of the polynomial in the PDs. The polynomials differ according to the number of input variables and the polynomial order. The total number of PDs located at the current layer is determined by the number of selected input variables (r) from the nodes of the preceding layer, because the outputs of the nodes of the preceding layer become the input variables to the current layer. The total number of PDs in the current layer is equal to the number of combinations N!/(r!(N−r)!), where N is the number of nodes in the preceding layer.

Step 5: Estimate the coefficients of the PD
The vector of coefficients of each PD is determined by minimizing the mean squared error (MSE) index

\[ E_k = \frac{1}{n_{tr}} \sum_{i=1}^{n_{tr}} (y_i - z_{ki})^2, \qquad k = 1, 2, \ldots, \frac{N!}{r!(N-r)!} \tag{7} \]

where z_{ki} denotes the output of the k-th node with respect to the i-th data. This step is repeated for all the nodes in the current layer and, subsequently, for all layers of the SOPNN from the input layer to the output layer.

Step 6: Select PDs with good predictive capability
The predictive capability of each PD is evaluated by the performance index using the testing data set. Then we choose w PDs among the N!/(r!(N−r)!) PDs in order of predictive capability (lowest value of the performance index first). Here, w is the predefined number of PDs that must be preserved for the next layer. The outputs of the chosen PDs serve as inputs to the next layer.

Step 7: Check the stopping criterion
The SOPNN algorithm terminates when the number of layers predetermined by the designer is reached.

Step 8: Determine new input variables for the next layer
If the stopping criterion is not satisfied, the next layer is constructed by repeating step 4 through step 8. The overall architecture of the SOPNN is shown in Fig. 13.

Fig. 13. Overall architecture of the SOPNN (possible inputs x_{1i}, ..., x_{Ni}; each PD in the j-th layer takes inputs z^{j-1}_p, z^{j-1}_q selected from the (j-1)th layer and, for order Type 2, computes c0 + c1 z^{j-1}_p + c2 z^{j-1}_q + c3 (z^{j-1}_p)^2 + c4 (z^{j-1}_q)^2 + c5 z^{j-1}_p z^{j-1}_q; choice of estimated models / stop conditions; optimal model ŷ_i)
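Steps 4 and 5 of the procedure above can be illustrated with a short, dependency-free Python sketch for a two-input Type 2 PD. It is a sketch under assumptions, not the algorithm of [13]: all names are hypothetical, and the least-squares problem is solved through the normal equations with Gauss-Jordan elimination, which is one simple choice among several (the appendix only states that the MSE index (7) is minimized).

```python
from itertools import combinations
from math import comb


def number_of_pds(n_prev, r):
    """Step 4: N! / (r! (N - r)!) candidate PDs in the current layer."""
    return comb(n_prev, r)


def candidate_nodes(n_prev, r):
    """All r-input combinations of the n_prev outputs of the preceding layer."""
    return list(combinations(range(n_prev), r))


def quadratic_terms(zp, zq):
    """Regressors of a two-input Type 2 PD: 1, zp, zq, zp^2, zq^2, zp*zq."""
    return [1.0, zp, zq, zp * zp, zq * zq, zp * zq]


def solve_linear_system(a, b):
    """Gauss-Jordan elimination with partial pivoting (assumes a is non-singular)."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_pd(inputs, targets):
    """Step 5: coefficients c0..c5 that minimize the squared error on the training data."""
    phi = [quadratic_terms(zp, zq) for zp, zq in inputs]
    k = len(phi[0])
    ata = [[sum(row[i] * row[j] for row in phi) for j in range(k)] for i in range(k)]
    atb = [sum(row[i] * t for row, t in zip(phi, targets)) for i in range(k)]
    return solve_linear_system(ata, atb)


def pd_output(coeffs, zp, zq):
    """Output of the PD for one pair of inputs."""
    return sum(c * term for c, term in zip(coeffs, quadratic_terms(zp, zq)))


def training_error(targets, outputs):
    """Eq. (7): mean squared error of one node over the training data."""
    return sum((y - z) ** 2 for y, z in zip(targets, outputs)) / len(targets)
```

For example, with N = 6 nodes in the preceding layer and r = 2 inputs per PD, `number_of_pds(6, 2)` gives 15 candidate PDs; each would be fitted with `fit_pd` and then ranked on the testing data as in Step 6.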
References

1. Hornik K, Stinchcombe M, White H (1989) Multilayer feedforward networks are universal approximators. Neural Netw 2:359–366
2. Chen T, Chen H (1995) Approximation capability to functions of several variables, nonlinear functions, and operators by radial basis function neural networks. IEEE Trans Neural Netw 6:904–910
3. Li K (1992) Approximation theory and recurrent networks. Proc IJCNN 2:266–271
4. Wang LX, Mendel JM (1992) Generating fuzzy rules by learning from examples. IEEE Trans Syst Man Cybern 22:1414–1427
5. Wang LX, Mendel JM (1992) Fuzzy basis function, universal approximation, and orthogonal least-squares learning. IEEE Trans Neural Netw 3:807–814
6. Ivakhnenko AG (1971) Polynomial theory of complex systems. IEEE Trans Syst Man Cybern SMC-1:364–378
7. Ivakhnenko AG, Ivakhnenko NA (1974) Long-term prediction by GMDH algorithms using the unbiased criterion and the balance-of-variables criterion. Sov Automat Control 7:40–45
8. Ivakhnenko AG, Ivakhnenko NA (1975) Long-term prediction by GMDH algorithms using the unbiased criterion and the balance-of-variables criterion. Part 2. Sov Automat Control 8:24–38
9. Ivakhnenko AG, Vysotskiy VN, Ivakhnenko NA (1978) Principal version of the minimum bias criterion for a model and an investigation of their noise immunity. Sov Automat Control 11:27–45
10. Ivakhnenko AG, Krotov GI, Ivakhnenko NA (1970) Identification of the mathematical model of a complex system by the self-organization method. In: Halfon E (ed.) Advances and case studies Theoretical Systems Ecology, Academic, New York
11. Farlow SJ (1984) Self-organizing methods in modeling: GMDH type-algorithms. Marcel Dekker, New York
12. Kim DW (2002) Evolutionary design of self-organizing polynomial neural networks. MA Thesis, Wonkwang University, Korea
13. Oh SK, Pedrycz W (2002) The design of self-organizing polynomial neural networks. Inf Sci 141:237–258
14. Shi Y, Eberhart R, Chen Y (1999) Implementation of evolutionary fuzzy systems. IEEE Trans Fuzzy Syst 7:109–119
15. Kristinsson K, Dumont GA (1992) System identification and control using genetic algorithms. IEEE Trans Syst Man Cybern 22:1033–1046
16. Uckun S, Bagchi S, Kawamura K (1993) Managing genetic search in job shop scheduling. IEEE Expert 8:15–24
17. Goldberg DE (1989) Genetic algorithms in search, optimization and machine learning. Addison-Wesley, Reading, MA
18. Kung SY, Taur JS (1995) Decision-based neural networks with signal/image classification applications. IEEE Trans Neural Netw 6:170–181
19. Box GEP, Jenkins GM (1976) Time series analysis: forecasting and control. Holden-Day, San Francisco, CA
20. Kim E, Lee H, Park M, Park M (1998) A simple identified Sugeno-type fuzzy model via double clustering. Inf Sci 110:25–39
21. Lin Y, Cunningham GA (1995) A new approach to fuzzy-neural modeling. IEEE Trans Fuzzy Syst 3:190–197
22. Takagi H, Hayashi I (1991) NN-driven fuzzy reasoning. Int J Approx Reasoning 5:191–212
23. Sugeno M, Kang GT (1988) Structure identification of fuzzy model. Fuzzy Sets Syst 28:15–33
24. Kondo T (1986) Revised GMDH algorithm estimating degree of the complete polynomial. Trans Soc Instrum Control Eng 22:928–934
25. Wu S, Er MJ, Gao Y (2001) A fast approach for automatic generation of fuzzy rules by generalized dynamic fuzzy neural networks. IEEE Trans Fuzzy Syst 9:578–594
26. Horikawa SI, Furuhashi T, Uchikawa Y (1992) On fuzzy modeling using fuzzy neural networks with the back-propagation algorithm. IEEE Trans Neural Netw 3:801–806
Recursive Pattern based Hybrid Supervised Training
Kiruthika Ramanathan and Sheng Uei Guan

Summary. We propose, theorize and implement the Recursive Pattern-based Hybrid Supervised (RPHS) learning algorithm. The algorithm uses the concept of pseudo global optimal solutions to evolve a set of neural networks, each of which can correctly solve a subset of patterns. The pattern-based algorithm uses the topology of the training and validation data patterns to find a set of pseudo-optima, each learning a subset of patterns. It is therefore well adapted to the pattern set provided. We begin by showing that finding a set of local optimal solutions is theoretically equivalent to, and more efficient than, finding a single global optimum in terms of generalization accuracy and training time. We also highlight that, as each local optimum is found using a decreasing number of samples, the efficiency of the training algorithm increases. We then compare our algorithm, both theoretically and empirically, with different recursive and subset based algorithms. On average, the RPHS algorithm shows better generalization accuracy, with an improvement of up to 60% over traditional methods. Moreover, certain versions of the RPHS algorithm also exhibit shorter training times than other recent algorithms in the same domain. To increase the relevance of this paper to practitioners, we have added pseudo code, remarks, and parameter and algorithmic considerations where appropriate.
1 Introduction

In this chapter we study the Recursive Pattern-based Hybrid Supervised (RPHS) learning system, a hybrid evolutionary-learning based approach to task decomposition. Typically, the RPHS system consists of a pattern distributor and several neural networks, or sub-networks. The input signal propagates through the system in a forward direction: the pattern distributor chooses the best sub-network to solve the problem, and that sub-network outputs the corresponding solution. RPHS has been applied successfully to some difficult and diverse classification problems, with the sub-networks trained by a recursive hybrid approach. The algorithm is based on the concept of pseudo global optima, which in turn is based on the idea of local optima and the gradient descent rule.

K. Ramanathan and S.U. Guan: Recursive Pattern based Hybrid Supervised Training, Studies in Computational Intelligence (SCI) 82, 129–156 (2008) © Springer-Verlag Berlin Heidelberg 2008 www.springerlink.com
Basically, RPHS learning involves four steps: evolutionary learning [19], decomposition, gradient descent [26], and integration. During evolutionary learning, a set of T patterns is fed to a population of chromosomes to decide the structure and weights of a preliminary pseudo global optimal solution (also known as a preliminary subnetwork). Decomposition identifies the learnt and unlearnt patterns of the preliminary subnetwork. A gradient descent approach then optimizes the preliminary subnetwork using the learnt patterns. This process produces a pseudo-global optimal solution, or final subnetwork. The whole process is then repeated recursively with the unlearnt patterns. Integration works by creating a pattern distributor, a classification system that associates a given pattern with a corresponding subnetwork.

The RPHS system has two distinctive characteristics:

1. Evolutionary algorithms are used to find the vicinity of a pseudo global optimum, and gradient descent is used to find the actual optimum. The evolutionary algorithms therefore aim to span a larger area of the search space and to improve the performance of the gradient descent algorithm.
2. Two validation algorithms are used in training the system: one to identify the correct stopping point for the gradient descent [26] and one to identify when to stop the recursive decomposition. These two validation algorithms ensure that the RPHS system is completely adapted to the topology of the training and validation data.

It is through a combination of these characteristics, together with the ability to unlearn old information and adapt to new information, that the RPHS system derives its high accuracy.

1.1 Motivation

The RPHS system performs task decomposition. Task decomposition is founded on the strategy of divide and conquer [3, 20]. Given a large problem, it makes more sense to split the task into several subtasks and hand them over to specialized individuals to solve. The job is done more efficiently than it would be if one individual were given the complete responsibility. Yet splitting a task into subtasks is a challenge in itself, and many factors need to be determined, such as the method of decomposition and the number of decompositions. To motivate the RPHS system, we look at other divide-and-conquer approaches proposed in the literature.

Some recent task decomposition work proposed the output parallelism and output partitioning [9, 10] algorithms. These algorithms decompose the task according to class labels. The assumption is that a two-class problem is easier to solve than an n-class problem. An n-class problem is therefore divided into n two-class problems, and a subnetwork is applied to solve each problem. While this approach has been shown to be effective on various benchmark classification problems, the class-based decomposition is limited, as it means that the algorithm can be applied to classification problems only. Further, the
assumption that a two-class problem is easier to solve than an n-class problem does not necessarily hold, in which case the effectiveness of output parallelism is questionable. Recursive algorithms overcome this dependency on class labels. One of the earliest recursive algorithms is the decision tree algorithm [2, 25], which develops a set of rules based on the information presented in the training data. Incremental neural network based learning algorithms [9, 10, 13, 14] also attempt to overcome the dependency on class labels, but here the decomposition is based on the input attributes. Another set partitioning algorithm is the multisieving algorithm [24], which uses a succession of neural networks to train the system until all the patterns are learnt. While these approaches are efficient, the neural network based and decision tree based divide-and-conquer approaches can become trapped in local optima. Further, the recursive algorithms focus on optimal training accuracy and do not explicitly target optimal generalization accuracy. To overcome the problem of being trapped in local optima, genetic algorithm based counterparts were proposed for divide-and-conquer algorithms. The work in [16] proposed class based decomposition for GA-based classifiers, which encoded the solution in a rule-based system. The same authors also proposed input attribute based decomposition using genetic algorithms [17]. Topology-based subset selection algorithms [22, 32] provide a genetic algorithm counterpart to the multisieving algorithm. However, while these approaches solve the local optima problem, the problem of training time remains: the long training time of genetic algorithms is one of the bottlenecks to their adoption in real world applications.
We therefore have the neural network based divide-and-conquer approaches, which are fast but risk being stuck in a local optimum, and the evolutionary approaches, which solve this problem but take much more time. There is, further, the problem of generalization accuracy. How do we guarantee that we obtain the best possible generalization accuracy? Can a different configuration of decomposition result in better generalization accuracy? In output parallelism [10, 15], for instance, the best partitioning of outputs is a problem-dependent variable and plays a significant part in the generalization accuracy of the system. Solving these problems encountered in divide-and-conquer approaches is the goal of the RPHS system. Two validation procedures are incorporated to solve the problem of generalization accuracy. The problems of local optima and training time are overcome by the use of hybrid Genetic Algorithm based neural networks, or GANNs [11, 30], and gradient descent training [26]. However, the hybridization algorithm proposed is not the same as that implemented in earlier works. Instead, we use the genetic algorithm simply to find the best possible partial solution before the gradient descent algorithm takes over to optimize the partial solution.
The use of a hybrid combination of genetic algorithms and neural networks is widespread in the literature. An efficient combination of GAs and backpropagation has been shown to be effective in various problems including forecasting applications [1], classification [11, 31] and image processing [4]. A comprehensive review of GANN based hybridization methods can be found in [30]. In RPHS, we make use of GA-based neural networks to serve a dual purpose: to evolve the structure and (partially) the weights of the subnetwork.
1.2 Organization of the chapter
In this chapter, we study the Recursive Pattern based Hybrid Supervised (RPHS) learning system, as well as the associated validation and pattern distribution algorithms. The chapter is organized as follows: We begin with some preliminaries and related work in section 2 and pave the way for the introduction of the RPHS system and training algorithm. In section 3, we present a detailed description of the algorithm. To increase the practical significance of the chapter, we include pseudo code, remarks and parameter considerations where appropriate. A summary of the algorithm is then presented in section 4. In section 5, we illustrate the use of the RPHS system by solving the two-spiral problem and the gauss problem, which are difficult to solve with non-recursive approaches. In section 6, we present some heuristics and practical guidelines for making the algorithm perform better. Section 7 presents the results of the RPHS algorithm on some benchmark pattern recognition problems, comparing them with non-hybrid and non-recursive approaches. In section 8, we complete the study of the RPHS algorithm. We summarize the important advantages and limitations of the algorithm and conclude with some general discussion and future work.
2 Some preliminaries
2.1 Notation
m : Input dimension
n : Output dimension
K : Number of RPHS recursions
I : Input
O : Output
Tr : Training
Val : Validation
P : Ensemble of subsets
S : Neural network solution
E : Error
$N_H$ : Number of hidden nodes in the 3-layered perceptron
Fig. 1. The architecture of the RPHS system
ε : Mean square error
ξ : Error tolerance for defining learnt patterns
T : Number of training patterns
i : Recursion index
λ : Number of epochs of gradient descent training
t : Time taken
$N_{pop}$ : Number of chromosomes
2.2 Simplified architecture
Figure 1 shows a simplified architecture of the RPHS system executed with K recursions. The input is an m-dimensional pattern vector and the output is an n-dimensional vector. The integrator provides the select inputs to the multiplexer, which then outputs the corresponding data input. The data inputs to the multiplexer are the outputs of each of the subnetworks. Each subnetwork is a neural network (in this case, a three-layered perceptron [26]) with m inputs and n outputs. The integrator is a nearest neighbor classifier [29] with m inputs and K outputs.
2.3 Problem formulation
Let $I_{tr} = \{I_1, I_2, \ldots, I_T\}$ be a representation of T training inputs. $I_j$ is defined, for any $j \in \{1, \ldots, T\}$, over an m-dimensional feature space, i.e., $I_j \in \mathbb{R}^m$. Let $O_{tr} = \{O_1, O_2, \ldots, O_T\}$ be a representation of the corresponding T training outputs. $O_j$ is a binary string of length n. Tr is defined such that $Tr = \{I_{tr}, O_{tr}\}$.
Further, let $I_v = \{I_1, I_2, \ldots, I_{T_v}\}$ and $O_v = \{O_1, O_2, \ldots, O_{T_v}\}$ represent the input and output patterns of a set of validation data, such that $Val = \{I_v, O_v\}$. We wish to take Tr as training patterns to the system and Val as the validation data and come up with an ensemble of K subsets. Let P represent this ensemble of K subsets:
$$P = \{P^1, P^2, \ldots, P^K\}, \quad \text{where, for } i \in \{1, \ldots, K\}, \quad P^i = \{Tr^i, Val^i\}, \; Tr^i = \{I_{tr}^i, O_{tr}^i\}, \; Val^i = \{I_v^i, O_v^i\}$$
Here, $I_{tr}^i$ and $O_{tr}^i$ are $m \times T^i$ and $n \times T^i$ matrices respectively, and $I_v^i$ and $O_v^i$ are $m \times T_v^i$ and $n \times T_v^i$ matrices respectively, such that $\sum_{i=1}^{K} T^i = T$ and $\sum_{i=1}^{K} T_v^i = T_v$. We need to find a set of neural networks $S = \{S^1, S^2, \ldots, S^K\}$, where $S^1$ solves $P^1$, $S^2$ solves $P^2$ and so on. P should fulfill two conditions:
1. The individual subsets can be trained with a small mean square training error, i.e., $E_{tr}^i = \|O_{tr}^i - S^i(I_{tr}^i)\| \to 0$.
2. None of the subsets $Tr^1$ to $Tr^K$ are overtrained, i.e., $\sum_{i=1}^{j} E_{val}^i < \sum_{i=1}^{j+1} E_{val}^i$; $j, j+1 \in \{1, \ldots, K\}$.
The first property implies that each of the neural networks is a global optimum with respect to its training subset. The second property implies two things: firstly, each individual network should not be overtrained, and secondly, none of the decompositions should be detrimental to the system.
2.4 Variable length genetic algorithm
In RPHS, we make use of the GA based neural networks to serve a dual purpose: to evolve the structure and (partially) the weights of the subnetwork. For this purpose, a variable length genetic algorithm is employed. The use of variable length genetic algorithms was inspired by the concept of messy genetic algorithms. Messy genetic algorithms (mGAs) [7] allow the use of variable length strings which may be overspecified or underspecified with respect to the problem being solved. The original work by Goldberg shows that mGAs obtain tight building blocks and are thus more explorative in solving a given problem. The variable length genetic algorithms used in this chapter are also aimed at building blocks that are more explorative in solving the given problem. In a three-layered neural network, the number of free parameters, $N_P$, is determined by the number of inputs, the number of outputs and the number of hidden nodes ($N_H$):
$$N_P = m N_H + N_H n + n + N_H \tag{1}$$
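As a quick check of equation 1, and anticipating the chromosome encoding described next, the following is a minimal Python sketch of how a variable length chromosome could be generated and decoded. The function names, the gene ordering inside the chromosome, the initialization range and the upper bound `max_hidden` are assumptions made for illustration; the chapter does not specify them.

```python
import numpy as np

def num_free_parameters(m, n, n_hidden):
    """Equation 1: weights and biases of an m-N_H-n perceptron."""
    return m * n_hidden + n_hidden * n + n + n_hidden

def random_chromosome(m, n, max_hidden, rng):
    """Draw a hidden-layer size, then a flat real-valued chromosome of matching length."""
    n_hidden = int(rng.integers(1, max_hidden + 1))
    genes = rng.uniform(-1.0, 1.0, size=num_free_parameters(m, n, n_hidden))
    return n_hidden, genes

def decode(genes, m, n, n_hidden):
    """Split a flat chromosome into (W1, b1, W2, b2); this gene order is an assumption."""
    i = 0
    w1 = genes[i:i + m * n_hidden].reshape(n_hidden, m)
    i += m * n_hidden
    b1 = genes[i:i + n_hidden]
    i += n_hidden
    w2 = genes[i:i + n_hidden * n].reshape(n, n_hidden)
    i += n_hidden * n
    b2 = genes[i:i + n]
    return w1, b1, w2, b2

rng = np.random.default_rng(0)
# Five individuals, each with its own hidden-layer size and chromosome length.
population = [random_chromosome(m=4, n=3, max_hidden=10, rng=rng) for _ in range(5)]
```

Because every individual carries its own hidden-layer size, chromosomes in the same population can have different lengths, which is what makes the encoding variable length.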
Each of these free parameters is one element of the chromosome and represents one of the weights or biases in the network. According to equation 1, the length of a chromosome is therefore defined by the value of $N_P$, which in turn depends on the value of $N_H$. Initialization of the population is done by generating a random number of hidden nodes for each individual and a chromosome based on this number.
2.5 Pseudo global optima
The performance of the RPHS algorithm can be attributed to the fact that RPHS aims to find several pseudo-global solutions as opposed to a single global solution. We define a pseudo-global optimal solution as follows:
Definition 1. A pseudo-global optimum is a global optimum when viewed from the perspective of a subset of training patterns, but could be a local (or global) optimum when viewed from the perspective of all the training patterns.
In this section, we highlight the differences among the multisieving algorithm [24], single staged GANN training algorithms [11, 31], and the RPHS algorithm, based on the concept and simplified model of pseudo-global optima. Consider the use of the RPHS algorithm to model a function $S^i$ such that $O_{tr}^i = S^i(w, I_{tr}^i)$, where w is the weight vector of values to be optimized. The training error at any point of time is given by
$$E_{tr} = E_{tr}^i + E_{unlearnt} \approx T^i \varepsilon_{tr}^i + (T - T^i)\,\varepsilon_{unlearnt} \tag{2}$$
We know that at any given point, the training error can be split into the error of the learnt patterns $T^i$ and the error of the unlearnt patterns $(T - T^i)$.
Definition 2. A pattern is considered learnt if its output differs from the ideal output by a value no greater than an error tolerance ξ. The number of total patterns learnt is therefore
$$T^i = \sum_{i=0}^{T} \delta\left\{\left[\frac{1}{O}\sum_{j=0}^{O}\phi\left(\xi - \left|O_{i,j} - \hat{O}_{i,j}\right|\right)\right] - 1\right\} \tag{3}$$
where δ(.) is the unit impulse function, φ(.) is the unit step function and $\hat{O}_{i,j}$ is the network approximation to $O_{i,j}$. By definition of learnt and unlearnt patterns, $\varepsilon_{tr}^i \le \xi < \varepsilon_{unlearnt}$. Also, as we approach the optimal points, $E_{tr}^i \to 0$. Also, consider that at the end of evolutionary training, all the learnt patterns have an error less than the error tolerance ξ, i.e.
$$E_{tr} = E_{tr}^i + E_{unlearnt} < \xi T^i + E_{unlearnt} \tag{4}$$
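The learnt-pattern test of Definition 2 and equation 3 can be sketched in a few lines of Python. This is only an illustration: the function name `split_learnt` and the toy arrays are hypothetical, and the sketch treats a pattern as learnt when every one of its outputs lies within the tolerance ξ, which is how the impulse and step functions in equation 3 combine.

```python
import numpy as np

def split_learnt(targets, outputs, xi):
    """Return a boolean mask of learnt patterns and their count (equation 3)."""
    targets = np.asarray(targets, dtype=float)
    outputs = np.asarray(outputs, dtype=float)
    within = np.abs(targets - outputs) <= xi   # unit step applied to each output
    learnt = within.all(axis=1)                # impulse fires only if all outputs pass
    return learnt, int(learnt.sum())

# Three patterns with two outputs each (rows are patterns).
targets = np.array([[1, 0], [0, 1], [1, 0]])
outputs = np.array([[0.9, 0.1], [0.4, 0.7], [0.95, 0.3]])
mask, count = split_learnt(targets, outputs, xi=0.2)
```

With these toy values only the first pattern is within tolerance on both outputs, so it alone would pass to the gradient descent stage while the other two would be handed to the next recursion.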
RPHS splits up the training patterns after the evolutionary training $G_k$ of recursion k, such that the gradient descent training $L_k$ of recursion k is carried
Fig. 2. Illustration of solutions found by (i) RPHS(SRP HS ), (ii) Single staged hybrid search (Sss ), (iii) Multisieving algorithm (Sm )
out with $T^i$ patterns and the step $G_{k+1}$ with $T - \sum_{i=1}^{k} T^i$ patterns. The value of
$E_{unlearnt}$ is therefore a constant during $L_k$, i.e. for any given gradient descent epoch,
$$E_{tr} = E_{tr}^i + C \tag{5}$$
Figure 2 illustrates how the RPHS algorithm, a single-staged training algorithm, and the multisieving algorithm [24] find their solutions. The graph shows a hypothetical one-dimensional error surface to be optimized with respect to w. Assume that at the end of the evolutionary training phase, solution $S_g$ has an error value $E_g$ computed according to equation 3. A single-staged algorithm such as backpropagation or a GA classifier will either try to search for an optimum at this stage or, if the probability of finding the optimum is too small, climb the hill and reach the local optimum marked by $S_{ss}$. However, by virtue of equation 5, $E_{unlearnt}$ is a constant value C. The error curve (represented by the dotted line) is just a vertically translated copy of the part of the original curve which is of interest to us. Now, if we consider the multisieving algorithm [24] or the topology-based selection algorithm [22], the data splitting occurs with learnt patterns being classified as those with $|O_j - \hat{O}_j| < \xi$. The splitting of the data therefore depends on the error tolerance of learnt patterns ξ, as defined in equation 3. With respect to figure 2, we can think of this algorithm as finding the solution $S_g$ using a gradient descent algorithm. The final solution $S_m$ is a vertically translated $S_g$. However,
$S_m$ can only be equal to the translated local optimum if the error tolerance ξ is set to its optimum value. This is because the solution to the multisieving algorithm is considered found when the pattern is learnt to the error tolerance ξ. On the other hand, we can see from figure 2 that the translated local optimum due to the splitting of patterns is more optimal than the other optimal solutions, i.e.,
$$E_{TLO} \le E_{global} \tag{6}$$
This is by virtue of the fact that $E_{tr}^i \to 0$ as we approach the optimal point. Further, from equation 5, $\partial E_{tr}/\partial w \to 0$ as $\partial E_{tr}^i/\partial w \to 0$. Therefore, the solution found by the RPHS algorithm is a pseudo global optimum, i.e., it could be a local optimum but it appears global from the perspective of a pattern subset. In contrast to the multisieving algorithm, the RPHS solution adapts itself to the problem topology regardless of the error tolerance ξ, due to gradient descent at the end of each recursion. Finding a pseudo global optimum therefore reduces the dependence of the algorithm on the error tolerance of learnt patterns ξ. It is also the natural optimum based on the data subset.
Note: Since early stopping is implemented during backpropagation so as to prevent overtraining, the optimum found by RPHS may not necessarily be $S_{RPHS}$, but in the vicinity of $S_{RPHS}$.
3 The RPHS training algorithm
3.1 Hybrid recursive training
The RPHS training algorithm can be summarized as a hybrid, recursive algorithm. While hybrid combinations of genetic algorithms and neural networks are used in various works to improve the accuracy of the neural network, the RPHS hybrid algorithm is a novel recursive hybrid and works as outlined below. The hybrid algorithm uses genetic algorithms to find a partial solution with a set of learnt and unlearnt patterns. Neural networks are used to learn “to perfection” the learnt patterns, and genetic algorithms are used again to tackle the unlearnt patterns. The process is repeated recursively until an increase in the number of recursions leads to overfitting. The training process is described in detail below.
1. As we are only looking for a fast partial solution, we use GANNs to perform the global search across the solution space with all the available training patterns.
2. We continue training until one of the following two conditions is satisfied: (a) there is stagnation, or (b) a percentage of the patterns are learnt.
3. In this stage, we use a condition similar to that in [22] and the multisieving network [24] to identify learnt patterns, i.e., a pattern is considered learnt if $|O_j - \hat{O}_j| < \xi$. More formally, we can define the number of total patterns learnt as in equation 3. Note that, similar to the multisieving algorithm, a tolerance ξ is used to identify learnt patterns; however, the arbitrarily set value of ξ for RPHS does not affect the performance of the algorithm, as explained in section 2.
4. The dataset is now split into learnt and unlearnt patterns. With the unlearnt patterns, we repeat steps 1 to 3.
5. Since the learnt patterns are only learnt up to a tolerance ξ, we use gradient descent to train the learnt patterns. The aim of gradient descent is to best adapt the solution to the data topology. Backpropagation is used in all the recursions except the last one, for which constructive backpropagation is used. The optimum thus found is called the pseudo-global optimal solution, and is found using a validation set of data to prevent overtraining and to overcome the dependence of the algorithm on ξ.
As the number of patterns in a data subset is small, especially as the number of recursions increases, it is possible for the pseudo global optimal solution to overfit the data in the subset. In order to avoid this possibility, we use a validation dataset. The validation dataset is used along with the training data to detect generalization loss using an algorithm in [10]. The data decomposition technique of the RPHS algorithm is best described by figure 3. During the first recursion, the entire training set (size T) is learnt using evolutionary training until stagnation occurs. Only the learnt patterns are learnt further using backpropagation, with measures to prevent overtraining. This ensures the finding of a pseudo global optimal solution.
Legend: BP: Backpropagation; CBP: Constructive backpropagation; GANN: Genetic algorithm evolved neural nets; $S^i$: Solution corresponding to the dataset $Tr^i$
Fig. 3. Recursive data decomposition employed by RPHS
Fig. 4. The two-level RPHS problem solver
The second recursion repeats the same procedure with the unlearnt patterns. The process repeats until the total number of patterns in a given recursion (recursion K) is too small, in which case constructive backpropagation is applied to the whole dataset to learn the remaining patterns to the best possible extent.
3.2 Testing
Testing in the RPHS algorithm is implemented using a k-nearest neighbor (KNN) [29] based pattern distributor. KNN was used to implement the pattern distributor due to the ease of its implementation. At the end of the RPHS training phase, we have K subsets of data. A given test pattern is matched with its nearest neighbor. If the neighbor belongs to subset i, the pattern is also deemed to belong to subset i. The solution for subset i is then used to find the output of the pattern. A multiplexer is used for this function. The KNN distributor provides the select input for the multiplexer, while the outputs of subnetworks 1 to K are the data inputs. This process is illustrated in figure 4.
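The testing procedure above can be illustrated with a short Python sketch of a nearest neighbor pattern distributor feeding a multiplexer. The names `distribute` and `rphs_predict`, the Euclidean distance metric, and the stand-in subnetworks are assumptions for illustration only; the chapter does not prescribe an implementation.

```python
import numpy as np

def distribute(pattern, train_inputs, subset_ids):
    """Return the subset index of the training pattern nearest to `pattern`."""
    distances = np.linalg.norm(train_inputs - pattern, axis=1)
    return int(subset_ids[np.argmin(distances)])

def rphs_predict(pattern, train_inputs, subset_ids, subnetworks):
    """Multiplexer: the distributor selects which subnetwork output is used."""
    select = distribute(pattern, train_inputs, subset_ids)
    return subnetworks[select](pattern)

# Toy data: two subsets, each "solved" by a stand-in callable.
train_inputs = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
subset_ids = np.array([0, 0, 1, 1])
subnetworks = [lambda x: "subnetwork 1", lambda x: "subnetwork 2"]
result = rphs_predict(np.array([0.95, 0.9]), train_inputs, subset_ids, subnetworks)
```

In a real system each entry of `subnetworks` would be the trained network of one recursion, and the distributor would store every training input together with the recursion that learnt it.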
4 Summary of the RPHS algorithm
Figure 5 presents the pseudo code for training the RPHS system. Train is initially called with i = 1 and with Tr and Val as the whole training and validation sets. In addition to the algorithm described in the previous section, Figure 5
Train (Tr, Val, i) {
  Use genetic algorithms to learn the dataset Tr using a new set of chromosomes
  IF stagnation occurs {
    1. Identify the learnt patterns
    2. Split Tr into Tr^i (consisting of the learnt patterns) and (Tr − Tr^i) (consisting of the unlearnt patterns). Find corresponding Val^i and (Val − Val^i)
    3. Tr^i is now trained with the existing solution using the backpropagation algorithm. The procedure is validated using dataset Val^i
    4. IF local training is complete (stagnation OR generalization loss)
       IF (Tr − Tr^i) has too few patterns {
         a. Tr^i = Tr − Tr^i
         b. Locally train Tr^i until generalization loss OR stagnation
         c. STORE network
         d. END Training
       } ELSE {
         FREEZE
         Train ((Tr − Tr^i), (Val − Val^i), i + 1)
       }
  }
}
Fig. 5. The pseudo code for training the RPHS system
also introduces the two validation procedures used in terminating the system (indicated in bold). Step 2 uses the validation subset when training the patterns in a given recursion, to ensure that the neural network does not over-represent the patterns in question. Step 4 uses the validation procedure to determine when the recursions should be stopped and whether a subsequent decomposition is detrimental to the RPHS system.
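The recursive structure of figure 5 can be sketched in Python as follows. This is a deliberately simplified illustration, not the authors' implementation: `global_search` and `local_train` are hypothetical callables standing in for GANN evolution and backpropagation, both validation procedures are omitted, and when too few unlearnt patterns remain the sketch simply trains locally on all patterns left in that recursion.

```python
import numpy as np

def rphs_train(X, Y, global_search, local_train, xi=0.2, min_patterns=2):
    """Return one (subnetwork, learnt inputs) pair per recursion."""
    model = global_search(X, Y)                          # stand-in for GANN evolution
    learnt = (np.abs(model(X) - Y) <= xi).all(axis=1)    # tolerance test on every output
    n_unlearnt = int((~learnt).sum())
    if not learnt.any() or n_unlearnt < min_patterns:
        # Too few patterns left to decompose: train locally on all of them and stop.
        return [(local_train(model, X, Y), X)]
    refined = local_train(model, X[learnt], Y[learnt])   # stand-in for backpropagation
    rest = rphs_train(X[~learnt], Y[~learnt], global_search, local_train, xi, min_patterns)
    return [(refined, X[learnt])] + rest

def majority_search(X, Y):
    """Toy global search: a constant predictor equal to the rounded mean target."""
    value = np.round(Y.mean(axis=0))
    return lambda inputs: np.tile(value, (len(inputs), 1))

def keep_model(model, X, Y):
    """Toy local training that leaves the model unchanged."""
    return model

X = np.array([[0.0], [0.1], [0.2], [0.8], [0.9]])
Y = np.array([[1.0], [1.0], [1.0], [0.0], [0.0]])
solutions = rphs_train(X, Y, majority_search, keep_model)
```

On this toy data the first recursion learns the three patterns of the majority class and the second recursion learns the remaining two, so two subnetworks are returned. Each recursion removes at least one pattern, which guarantees that the recursion terminates.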
5 The two spiral problem
The two spiral problem (part of whose data is shown in figure 6a) is considered to be complicated, since there is no obvious separation between the two spirals. Furthermore, it is difficult to solve this problem with a 3-layered neural network, and even with more than one hidden layer, choices such as the number of neurons and the number of hidden layers play an important part in telling the spirals apart [21]. The problem becomes more complicated when the spirals are distorted by noise. In this section, we illustrate how the use of evolutionary algorithms in the RPHS algorithm splits the two spiral data into easily separable subsets. The training data (50%) of the two spiral dataset is shown in figure 6a. With the use of the RPHS algorithm, evolutionary training is used to split
Fig. 6. The two spiral data set and an example of how it can be decomposed into several smaller datasets that are more easily separable
the data into easily learnable subsets. Backpropagation is used to further reduce the error on the EA trained data. The data points in $Tr^1$, $Tr^2$ and $Tr^3$ are shown in figures 6b to 6d. We can observe that, while the original two spiral dataset is not easy to classify, each of the decomposed datasets is far simpler and can be classified by a simple neural network. This is a remarkable improvement over the data decomposition employed by the multisieving algorithm [24], where genetic algorithms are not used in decomposition: the subsets found in each recursion of that algorithm are not as separable as the subsets in figure 6.
6 Heuristics for making the RPHS algorithm better
Haykin [18] says that the design of a neural network system is more an art than a science, in the sense that many of the numerous factors involved in the design are a result of one's personal experience. While the statement is true, we wish to make the RPHS system as little of an art as possible. Therefore, we propose here several methods which improve the algorithm and make its implementation more focused.
6.1 Minimal coded genetic algorithms
The implementation of Minimal coded Genetic Algorithms (MGG) [8, 27] was considered because the bulk of the training time of an evolutionary neural network is due to the evaluation of the fitness of a chromosome. In Minimal coded GAs, however, only a minimal number of offspring is generated at each stage. The algorithm is outlined briefly below.
1. From the population P, select u parents randomly.
2. Generate θ offspring from the u parents using recombination/mutation.
3. Choose 2 parents at random from u.
4. Of the two parents, one is replaced with the best of the θ offspring and the other is replaced by a solution chosen by roulette wheel selection from the combined population of the θ offspring and the 2 selected parents.
In order to make the genetic algorithm efficient timewise, we choose u = 4 and θ = 1 for the GA based neural networks. Therefore, except for the initial population evaluation, the time taken for evolving one epoch using MGG is equivalent to the forward pass of the backpropagation algorithm.
6.2 Separability
In section 3, we described the RPHS testing algorithm as a k-nearest neighbor based pattern distributor. If the data solved by recursion i and recursion j are well separated, then the nearest neighbor distributor will give error-free pattern distribution. However, the RPHS algorithm described so far does not guarantee that data subsets from two recursions are well separated. Error can therefore be introduced into the system by the pattern distributor. In this section we discuss the efforts made to increase the separation between data subsets. Empirically, there is some improvement in experimental results when the separation criterion is implemented, although there is a tradeoff in time. We outline below the algorithm proposed to implement subset separability.
Definition 2.
Inter-recursion separation is defined as the separation between the learnt data of recursion k, $(Tr^k)$, and the data learnt by other recursions, $(\overline{Tr^k})$. The two data subsets are mutually exclusive.
Definition 3. Intra-recursion separation represents the separation of the data in the same subset of RPHS. If M classes of patterns are present in the subset i (patterns learnt by the solution of recursion i), then intra-recursion separation is expressed by $\frac{1}{M^2}\sum_{j=1}^{M}\sum_{k=1}^{M}\mathrm{sep}(\omega_j, \omega_k)$ (see footnote 1), where $\mathrm{sep}(\omega_j, \omega_k)$ is the separation between patterns of class $\omega_j$ and $\omega_k$.
Footnote 1: It is noted that in the case of learning with neural networks, the MSE error of the k classes can be used as a substitute for the intra-recursion separation.
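Separability criterion
The separability criterion is a mathematical expression that evaluates the separation between two sets of data. In this work, we will use the Bhattacharyya criterion of separability for a 2-class problem [6]:
$$D_{Batt} = \frac{1}{8}\left(\mu^{(l)} - \mu^{(u)}\right)^{T}\left[\frac{\Sigma^{(l)} + \Sigma^{(u)}}{2}\right]^{-1}\left(\mu^{(l)} - \mu^{(u)}\right) + \frac{1}{2}\log\frac{\left|\frac{1}{2}\left(\Sigma^{(l)} + \Sigma^{(u)}\right)\right|}{\left|\Sigma^{(l)}\right|^{1/2}\left|\Sigma^{(u)}\right|^{1/2}} \tag{7}$$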
In the equation above, µ is the data mean, Σ is the covariance matrix and the superscripts (l) and (u) represent the learnt and unlearnt patterns of the current recursion. It should be noted that the selection of the Bhattacharyya criterion is purely arbitrary. Other criteria that can be used are Fisher’s criterion [6], the Mahalanobis distance [5], and so on.
Objective function for evolutionary training
In order to increase the inter-recursion separation, we modify the fitness function for GANNs as given by the equation below.
$$g(x) = w_1 \varepsilon(x) - (1 - w_1) D_{Batt}(x) \tag{8}$$
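Equations 7 and 8 can be evaluated with the following Python sketch. It uses the standard Bhattacharyya distance between two Gaussian-modelled sets, estimated from sample means and covariances; the function names are hypothetical, and reading equation 8 as a quantity to be minimized is an assumption, since the text does not state the direction of optimization.

```python
import numpy as np

def bhattacharyya(learnt, unlearnt):
    """Bhattacharyya distance between two Gaussian-modelled sets (rows are patterns)."""
    mu_l, mu_u = learnt.mean(axis=0), unlearnt.mean(axis=0)
    cov_l = np.atleast_2d(np.cov(learnt, rowvar=False))
    cov_u = np.atleast_2d(np.cov(unlearnt, rowvar=False))
    cov = 0.5 * (cov_l + cov_u)
    diff = mu_l - mu_u
    mean_term = 0.125 * diff @ np.linalg.solve(cov, diff)

    def logdet(a):
        # slogdet returns (sign, log|det|); working in logs avoids overflow.
        return np.linalg.slogdet(a)[1]

    cov_term = 0.5 * (logdet(cov) - 0.5 * logdet(cov_l) - 0.5 * logdet(cov_u))
    return float(mean_term + cov_term)

def fitness(mse, d_batt, w1=0.5):
    """Equation 8; lower is better under the assumption that the GA minimizes g."""
    return w1 * mse - (1.0 - w1) * d_batt

rng = np.random.default_rng(1)
learnt = rng.normal(size=(50, 2))
d_near = bhattacharyya(learnt, learnt + 1.0)   # unlearnt set shifted slightly
d_far = bhattacharyya(learnt, learnt + 5.0)    # unlearnt set shifted further away
```

Shifting the second set further from the first increases the distance, so a chromosome whose learnt and unlearnt patterns are better separated receives a lower value of g for the same MSE.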
The fitness g(x) of the chromosome x is therefore dependent on both the inter-recursion separation between the learnt and unlearnt data of the chromosome x, $D_{Batt}(x)$, and the intra-recursion separation, which can be expressed by the MSE error, ε(x), of the chromosome. $w_1$ is the importance of the intra-recursion separation with respect to the chromosome fitness. In the next section, we present our results with $w_1 = 0.5$.
6.3 Computation intensity and population size
Here we present an argument on the computational complexity of the RPHS algorithm. Let the time taken to forward pass a single pattern through a neural network be t and the number of training patterns be T. For simplicity, we assume the following.
1. The neural network architecture is the same throughout.
2. The time required for other computations (backpropagation, crossover, mutation, selection, etc.) is negligible when compared to the evaluation time.
The second assumption is valid as the exp function of the forward pass stage is more computationally intensive than the other functions. The total time required for λ epochs of CBP is therefore $t_{CBP} = \lambda T t$.
The total time required for RPHS with MGG can also be expressed as a summation of the time taken in each recursion i, $t_{RPHS} = t_{PD} + \sum_{i}^{R} t_i$. The time taken for each recursion is given as below.
$$t_i = \lambda_{gi} T_i t + \lambda_{li} (T_i - \alpha_i) t + N_{pop} T_i t \tag{9}$$
where $\lambda_{gi}$ and $\lambda_{li}$ refer to the number of epochs required for evolutionary and backpropagation training in recursion i, and $\alpha_i$ is the number of learnt patterns at the end of the recursion. The last term refers to the initialization of the recursion population with $N_{pop}$ chromosomes (as explained in figure 5). The bulk of the time in the equation above depends on the third term, i.e., the initial evaluation of the chromosome population in each recursion. The justification of the above claim follows from the properties of RPHS and evolutionary search:
1. As the patterns in each backpropagation epoch are already learnt, fewer epochs are required than for the training of modules in output parallelism, i.e., $\lambda_{li,RPHS} < \lambda_{i,OP}$.
2. From the experimental results observed, and due to the capability of genetic algorithms to find partial solutions faster, we can also say that $\lambda_{gi,RPHS}$ is small. In the experiments carried out, the value of $\lambda_{gi}$ is usually less than 20 epochs. The location of the pseudo global optimal solution found by the GA is relatively unimportant, as the pseudo global optimum is always globally optimal in terms of the patterns selected.
This implies that with a small population size, the RPHS algorithm is likely to be faster than the output parallelism algorithm. In order to observe the effect of the number of chromosomes $N_{pop}$ on the training time and the generalization accuracy of the RPHS system, we performed a set of experiments using the MGG based RPHS algorithm with a varying number of chromosomes. The graphs in figure 7 show the trend in training time and generalization accuracy for initial population sizes between 5 and 30. The population size of 5 chromosomes was chosen so that MGG can be implemented with 4 chromosomes for mating while still retaining the best fitness values.
It is interesting to note that the number of chromosomes $N_{pop}$ in the initial population of each recursion does not play a big role in the generalization accuracy of the system. This is, once again, an expected property of the RPHS algorithm, as it is the backpropagation algorithm that completes the training of the system according to the validation data. The genetic algorithm only performs partial training, and it is the presence of the local optimum, not its relative position, that is important for the RPHS algorithm. Therefore, if training time is an issue, using the minimal requirement of 5 chromosomes and implementing MGG can solve the problem to an accuracy comparable to that obtained using a larger population.
Fig. 7. The effect of using different sized initial populations for RPHS with the (a) SEGMENTATION, (b) VOWEL, (c) SPAM datasets. Graphs show the training time ( ) and generalization error(...) against the number of chromosomes in the initial population.
Therefore, the most efficient training time for the RPHS algorithm will be as given by equation 10, which is based on equation 9,
$$t_{RPHS} = t_{PD} + \sum_{i=1}^{R} t_i = t_{PD} + \sum_{i=1}^{R}\left[\lambda_{gi} T_i t + \lambda_{li}(T_i - \alpha_i) t + 5 T_i t\right] \tag{10}$$
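To make equations 9 and 10 concrete, the following Python sketch evaluates them for two hypothetical recursions. All numbers (epoch counts, pattern counts, the per-pattern forward pass time t) are invented for illustration and are not taken from the experiments in this chapter.

```python
def recursion_time(epochs_ga, epochs_bp, patterns, alpha, n_pop, t):
    """Equation 9 for one recursion; all arguments are illustrative numbers."""
    return (epochs_ga * patterns * t
            + epochs_bp * (patterns - alpha) * t
            + n_pop * patterns * t)

def rphs_time(recursions, t, t_pd=0.0, n_pop=5):
    """Equation 10: distributor time plus the sum over recursions, with N_pop = 5."""
    return t_pd + sum(recursion_time(g, l, T_i, a, n_pop, t)
                      for g, l, T_i, a in recursions)

# Each tuple: (GA epochs, backpropagation epochs, T_i, alpha_i).
recursions = [(20, 100, 1000, 400), (15, 80, 400, 150)]
total = rphs_time(recursions, t=1e-6)
```

With these values the two recursions cost 85000 and 28000 forward passes respectively, giving a total of about 0.113 time units when t is one microsecond and the distributor time is neglected.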
6.4 Validation data
For optimal training, it is necessary to use suitable validation data for each decomposed training set. In this section we propose and justify the algorithm for choosing the optimal validation data for each subset of training data. Consider the distribution of data shown in figure 8. Each colored zone represents data from a different recursion. The patterns learnt by solution i are explicitly exclusive of the patterns learnt by solution j, ∀ i ≠ j. The RPHS decomposition tree in figure 3 can therefore be expressed as shown in figure 9. According to figure 9 and the RPHS training algorithm described in section 4, the first recursion begins with Tr, the data to be trained using EAs. At the end of the recursion, Tr is split into $Tr^1$ (data to be trained with backpropagation to give $S^1$, the network representing the data $Tr^1$) and $(Tr - Tr^1)$ (data to be trained using EAs to give solutions 2 to n). We represent all the networks
Fig. 8. Sample data distribution
Fig. 9. The data distribution of the RPHS recursion tree
that represent $(Tr - Tr^1)$ as $\overline{S^1}$, i.e., the data that is represented by $S^1$ can never be represented by $\overline{S^1}$. We therefore propose the following pseudo code for validation.
Do until stagnation or early stopping
    Optimize MSE criterion locally()
    Validate()
End

Validate()
    FindVi()
    Use the validation set Val^i to validate the solution S^i for recursion i

FindVi()
    For each validation pattern
        Use KNN (or intermediate pattern distributor)
        If pattern ∈ Tr^i
            Add pattern to Val^i
Given a set of patterns, FindVi() finds out which patterns can possibly be solved by the solutions that exist. Patterns that can be solved are isolated and used as specific validation sets. Besides a more accurate validation dataset, it is also possible to obtain the intermediate generalization capability of the system, which is useful in stopping recursions, as described in the next section.
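The FindVi() step can be sketched in Python as a two-output nearest neighbor test. The function name, the Euclidean metric and the toy data are assumptions for illustration; the sketch only shows how validation patterns could be routed to the current recursion's validation subset.

```python
import numpy as np

def find_validation_subset(val_inputs, train_inputs, in_current_subset):
    """Mask of validation patterns whose nearest training pattern lies in Tr^i."""
    mask = np.zeros(len(val_inputs), dtype=bool)
    for k, pattern in enumerate(val_inputs):
        nearest = np.argmin(np.linalg.norm(train_inputs - pattern, axis=1))
        mask[k] = in_current_subset[nearest]
    return mask

train_inputs = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.2, 0.9]])
in_current_subset = np.array([True, True, False, False])   # True: pattern belongs to Tr^i
val_inputs = np.array([[0.1, 0.0], [1.1, 1.0], [0.3, 0.2]])
val_mask = find_validation_subset(val_inputs, train_inputs, in_current_subset)
```

Validation patterns selected by the mask would form Val^i, while the remaining ones stay available for later recursions.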
The intermediate pattern distributor is similar to that described in Section 3 except that it has only two outputs. Its responsibility is to decide whether or not a pattern is suitable for validating the subset of patterns in question.

6.5 Early stopping
Decompositions of data in the RPHS algorithm are done as follows: an intermediate pattern distributor with two outputs is implemented after each recursion, as described in the previous section. Using the intermediate pattern distributor, we obtain the validation error $E^{i}_{val}$ and the training error $E^{i}_{tr}$ ² of recursion i at the end of each recursion. If $E^{i}_{val} > E^{i-1}_{val}$, recursion i is overtraining the system. Therefore, only the results of the first i − 1 recursions are considered. The overall RPHS training algorithm can therefore be described as shown in figure 10.
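The stopping rule above can be illustrated with a short sketch. It assumes the validation errors of the system with 1, 2, ... recursions have been collected in a list; the function name and the sample error values are hypothetical, and a real run would stop training at the first increase rather than inspect a full list afterwards.

```python
def recursions_to_keep(val_errors):
    """Return how many recursions to keep under the early stopping rule.

    val_errors[i - 1] holds E_val^i, the validation error (%) of the system
    with i recursions. The first recursion whose error exceeds that of the
    previous one is treated as overtraining and is discarded.
    """
    for idx in range(1, len(val_errors)):
        if val_errors[idx] > val_errors[idx - 1]:
            return idx  # recursion idx + 1 overtrains, keep the first idx
    return len(val_errors)

# Hypothetical validation errors for a system with 1..5 recursions.
errors = [12.0, 9.5, 8.1, 8.6, 7.0]
kept = recursions_to_keep(errors)
print(kept)
```

Here the fourth recursion raises the validation error, so only the first three recursions are kept, even though a later recursion would have lowered the error again.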
7 Experiments and results
7.1 Problems considered
Table 1 summarizes the five classification problems considered in this paper. The problems were chosen such that they varied in terms of input and output dimensions as well as in the number of training, testing and validation patterns made available. All the datasets, other than the two-spiral dataset, were obtained from the UCI repository. The results of the Spam and the two-spiral datasets were compared with the constructive backpropagation, multisieving and topology based subset selection algorithms only. This is because the Spam and two-spiral problems are two-class problems, so implementing output parallelism would make no difference to the results obtained by CBP. The two-spiral dataset consists of 194 patterns. To ensure a fair comparison with the dynamic subset selection algorithm [22], test and validation datasets of 192 patterns were constructed by choosing points next to the original points in the dataset, as mentioned in that paper.

Table 1. Summary of problems considered
Problem name        Training set size  Test set size  Validation set size  Number of inputs  Number of outputs
Segmentation        1155               578            577                  18                7
Vowel               495                248            247                  10                11
Letter recognition  10000              5000           5000                 16                26
Two-spiral          194                192            192                  2                 2
Spam                2301               1150           1150                 57                2

² Both $E^{i}_{val}$ and $E^{i}_{tr}$ represent the percentage of (validation and training) patterns in error of the RPHS system with i recursions.
Note: The flowchart above describes the following process.
1. The unlearnt data for recursion i (all the data for i = 1) is used to train a GANN subnetwork.
2. Based on the learnt and unlearnt data from recursion i, the nearest neighbor algorithm is used to decompose the validation data into validation subset i, $V_i$ (patterns belonging to recursion i), and $\bar{V}_i$ (patterns belonging to recursions other than i).
3. The training subset i and validation subset i are used together with the GANN subnetwork to obtain the final subnetwork.
4. If the validation accuracy of the first i subnetworks is lower than the validation accuracy of the first i − 1 subnetworks, the final subnetwork i is retrained using the remaining unlearnt data and the training subset i to the best extent possible.
5. If the condition in step 4 does not hold, steps 1, 2 and 3 are repeated with the remaining unlearnt patterns.
Fig. 10. The overall RPHS training algorithm including training and validation set decomposition, the use of backpropagation and constructive backpropagation and the recursion stopping algorithm
7.2 Experimental parameters and control algorithms implemented
Table 2 summarizes the parameters used in the experiments. As we wish the RPHS technique to be as problem independent as possible, all the experimental parameters are kept constant across problems, as given below. Each experiment was run 40 times, with 4-fold cross validation. For comparison purposes, the following algorithms were designed and implemented. The constructive backpropagation algorithm was implemented as a single staged (non hybrid) algorithm which conducts gradient descent based search with the possibility of evolutionary training by the addition of a hidden node. The multisieving algorithm (a recursive non hybrid algorithm) was implemented to show the necessity of finding the correct pseudo global optimal solution. The following control experiments were carried out based on the multisieving algorithm and the dynamic topology based subset finding algorithm. Both versions of output parallelism implemented also show the effect of the hybrid RPHS algorithm.
1. Multisieving with KNN based pattern distributor
2. Dynamic topology based subset selection
3. Output parallelism without pattern distributor [10]
4. Output parallelism with pattern distributor [15]

Table 2. Summary of parameters used
Parameter group                          Parameter                                                                Value
Evolutionary search parameters           Population size                                                          20
                                         Crossover probability                                                    0.9
                                         Small change mutation probability                                        0.1
                                         Large change mutation probability                                        0.7
                                         Pattern learning tolerance of EA training ξ (also used for               0.2
                                         identifying learnt patterns in TSS and multisieving)
MGG parameters                                                                                                    µ = 4, θ = 1
Neural network parameters                Generalization loss tolerance for validation                             1.5
                                         Backpropagation learning rate                                            10^−2
                                         Number of stagnation epochs before CBP increases one hidden node         25
Pattern distribution related parameters  Number of neighbors in the KNN pattern distributor                       1
                                         Weight of intra recursion separation in the modified obj function (w1)   0.5
7.3 Experimental results
The graphs in figure 11 compare the mean square training error for the various datasets. The typical training curves for constructive backpropagation [23] and RPHS training are shown. From figure 11, we can observe that the training error obtained by RPHS is lower than the training error obtained using CBP [23]. The empirical comparison shows that, typically, the RPHS training curve converges to a better solution than the CBP curve. At this stage, a note should be made on the shape of the RPHS curve. The MSE value increases at the end of each recursion before reducing further. This is because the RPHS algorithm reinitializes the population at the end of each recursion. The reinitialization benefits the solution set reached: the new population is focused on learning the "unlearnt" patterns, thereby enabling the search to exit from the pseudo global (local) optima of the previous recursion. Tables 3 to 7 summarize the training time and generalization accuracy obtained by CBP [23], output parallelism with [15] and without [10] the pattern distributor, and RPHS. The output parallelism algorithm is chosen so as to illustrate the difference between manual choice of subsets and evolutionary search based choice of subsets.
Fig. 11. The data distribution of the RPHS recursion tree
Table 3. Summary of the results obtained from the Segmentation problem (average number of recursions: 7.2)
Algorithm used                               Training time (s)  Classification error (%)
Constructive backpropagation                 693.8              5.74
Multisieving with KNN pattern distributor    760.64             7.28
Output parallelism                           1719.6             5.18
Output parallelism with pattern distributor  2219.2             5.44
RPHS-GAND                                    1004.8             5.62
RPHS-GAD                                     1151.8             4.32
RPHS-MGGND                                   545.8              5.27
RPHS-MGGD                                    688.34             4.41
RPHS-with separation                         1435.6             4.17
Table 4. Summary of the results obtained from the Vowel problem (average number of recursions: 6.425)
Algorithm used                               Training time (s)  Classification error (%)
Constructive backpropagation                 237.9              37.16
Multisieving with KNN pattern distributor    318.23             39.43
Output parallelism                           418.9              25.54
Output parallelism with pattern distributor  534.3              24.89
RPHS-GAND                                    812.95             25.271
RPHS-GAD                                     842.16             16.721
RPHS-MGGND                                   396.34             23.24
RPHS-MGGD                                    473.88             17.73
RPHS-with separation                         884.55             14.82
Table 5. Summary of the results obtained from the Letter recognition problem (average number of recursions: 21.3)
Algorithm used                               Training time (s)  Classification error (%)
Constructive backpropagation                 20845              21.672
Multisieving with KNN pattern distributor    55349              65.04
Output parallelism                           42785.4            20.06
Output parallelism with pattern distributor  45625.4            18.636
RPHS-GAND                                    38461              13.14
RPHS-GAD                                     47447              11.1
RPHS-MGGND                                   27282              13.08
RPHS-MGGD                                    29701              13.42
RPHS-with separation                         94.898             10.12
Table 6. Summary of the results obtained from the Spam problem (average number of recursions: 2.475)
Algorithm used                               Training time (s)  Classification error (%)
Constructive backpropagation                 43.649             27.92
Multisieving with KNN pattern distributor    123.12             21.06
RPHS-GAND                                    142.24             21.00
RPHS-GAD                                     156.81             20.75
RPHS-MGGND                                   58.721             22.11
RPHS-MGGD                                    82.803             20.97
RPHS-with separation                         517.68             18.76
Table 7. Summary of the results obtained from the Two-spiral problem (average number of recursions: 2.475) ³
Algorithm used                                  Training time (s)  Classification error (%)
Constructive backpropagation                    15.58              49.38
Multisieving with KNN pattern distributor       35.89              23.61
Dynamic topology based subset selection (TSS)   −                  28.0
RPHS-GAND                                       76.25              15.42
RPHS-GAD                                        87.91              10.54
RPHS-MGGND                                      45.72              13.25
RPHS-MGGD                                       59.97              11.08
RPHS-with separation                            129.24             10.31
To show the effect of the heuristics proposed, the RPHS training is carried out with four options:
1. Genetic algorithms with no decomposition of validation patterns (RPHS-GAND)
2. Genetic algorithms with decomposition of validation patterns (RPHS-GAD)
3. MGG with no decomposition of validation patterns (RPHS-MGGND)
4. MGG with decomposition of validation patterns (RPHS-MGGD)
The graphs in figure 11 compare CBP with the fourth training option (RPHS-MGGD). Based on the results presented, we can make the following observations and classify them according to training time and generalization accuracy.

³ Results of the topology based subset selection algorithm [22]
Generalization accuracy
• All the RPHS algorithms give better generalization accuracy when compared to the traditional algorithms (CBP, topology based selection and multisieving).
• The algorithms which include the decomposition of validation data, although marginally slower than those without decomposition, have better generalization accuracy than output parallelism. As the algorithms implementing output parallelism do so with a manual decomposition of validation data, it follows that a version of RPHS will be more accurate than a corresponding algorithm based on output parallelism.
• Implementing RPHS with the separation criterion gives the best generalization accuracy, although there is a large tradeoff in time. This is discussed in greater detail in the following section.
• The RPHS algorithm that uses MGG with the decomposition of validation patterns (MGGD) provides the best tradeoff between training time and generalization accuracy. When compared to RPHS-GAD and RPHS with separation, the loss in generalization accuracy is minimal relative to the reduction in training time.
• The number of recursions required by RPHS is, on average, lower than the number of classes in a problem, yet RPHS gives better generalization accuracy. This suggests that classwise decomposition of data is not always optimal.

Training time
• The training time required by CBP is the shortest of all the algorithms. However, as seen from the graphs, this short training time is most likely due to premature convergence of the CBP algorithm.
• The training of RPHS is also more gradual. While premature convergence is easily observed in the case of CBP, RPHS converges more slowly. The recursions reduce the training error in small steps, learning a small number of patterns at a time.
• Apart from the CBP algorithm, the RPHS algorithm carried out with MGG has a shorter training time than the output parallelism algorithms. The training time of the multisieving algorithm may be longer or shorter than that of the RPHS-MGG based algorithms, depending on the dataset. This is expected, as the nature of the dataset determines the number of levels at which multisieving has to be implemented and therefore influences the training time.
• The basic contribution of the minimal generation gap (MGG) model is the reduction of training time. However, there is a small tradeoff in generalization accuracy when MGG is used. This can be observed across all the problems.
• The use of the separation criterion with the RPHS algorithm increases the training time severalfold. This is expected, as the training time includes the calculation of the inverse covariance matrix. This is the tradeoff for
obtaining marginally better generalization accuracy. The time taken using the separation criterion may or may not be acceptable depending on the problem dimension, the number of patterns, etc. However, when the primary goal is to improve the generalization accuracy of the system and the learning is done offline, the separability criterion can be included for better results.
8 Conclusions
Task decomposition is a natural extension to supervised learning as it decomposes the problem into smaller subsets. In this paper, we have proposed the RPHS algorithm, a topology adaptive method to implement task decomposition automatically. With a combination of automatic selection of validation patterns and adaptive detection of decomposition extent, the algorithm efficiently decomposes the data into subsets such that the generalization accuracy of the problem is improved. We have compared the classification accuracy and training time of the algorithm with four algorithms, illustrating the effectiveness of (1) recursive subset finding, (2) pattern topology oriented recursions, and (3) efficient combination of gradient descent and evolutionary training. We found that the classification accuracy of the algorithm is better than that of both the constructive backpropagation algorithm and output parallelism. The improvement in classification error is up to 60% relative to constructive backpropagation and 40% relative to output parallelism. The training time of the algorithm is also shorter than the time required by the output parallelism algorithm.

On a conceptual level, the main contribution of this paper is twofold. Firstly, the algorithm shows, both theoretically and empirically, that when training is performed based on pattern topology using a combination of evolutionary training and gradient descent, generalization is better than when the data is partitioned based on output classes. It also shows that the combination of EAs and gradient descent is better than the use of gradient descent only, as in the case of the multisieving algorithm [24]. Secondly, the paper presents a data separation method to further improve the generalization accuracy of the system by consciously reducing the pattern distributor error. While this is shown, both conceptually and empirically, to reduce the generalization error, the algorithm incurs some cost due to its increased training time. One direction for future work is to investigate how this training time can be reduced without compromising the accuracy. Another is to investigate the decomposition mechanism of the evolutionary algorithms and to improve its efficiency.
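The relative improvements quoted in the conclusions can be checked against the reported tables. The sketch below uses the Vowel results (Table 4); that these are the rows behind the quoted percentages is an assumption, and the helper name is hypothetical.

```python
def relative_improvement(baseline_error, new_error):
    """Percentage reduction in classification error relative to a baseline."""
    return 100.0 * (baseline_error - new_error) / baseline_error

# Classification errors (%) taken from Table 4 (Vowel problem).
cbp_error = 37.16        # constructive backpropagation
op_pd_error = 24.89      # output parallelism with pattern distributor
rphs_sep_error = 14.82   # RPHS with separation

vs_cbp = relative_improvement(cbp_error, rphs_sep_error)
vs_op = relative_improvement(op_pd_error, rphs_sep_error)
print(f"vs CBP: {vs_cbp:.1f}%, vs output parallelism: {vs_op:.1f}%")
```

For these rows the reduction is roughly 60% against constructive backpropagation and roughly 40% against output parallelism, which is consistent with the figures stated above.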
References
1. Andreou AS, Efstratios F, Spiridon G, Likothanassis D (2002) Exchange rates forecasting: a hybrid algorithm based on genetically optimized adaptive neural networks. Comput Econ 20(3):191–200
2. Breiman L (1984) Classification and regression trees. Wadsworth International Group, Belmont, California
3. Chiang CC, Fu HH (1994) Divide-and-conquer methodology for modular supervised neural network design. In: Proceedings of the 1994 IEEE international conference on neural networks 1:119–124
4. Dokur Z (2002) Segmentation of MR and CT images using hybrid neural network trained by genetic algorithms. Neural Process Lett 16(3):211–225
5. Foody GM (1998) Issues in training set selection and refinement for classification by a feedforward neural network. Geoscience and remote sensing symposium proceeding:401–411
6. Fukunaga K (1990) Introduction to statistical pattern recognition. Academic, Boston
7. Goldberg DE, Deb K, Korb B (1991) Don't worry, be messy. In: Belew R, Booker L (eds.) Proceedings of the fourth international conference in genetic algorithms and their applications, pp 24–30
8. Gong DX, Ruan XG, Qiao JF (2004) A neuro computing model for real coded genetic algorithm with the minimal generation gap. Neural Comput Appl 13:221–228
9. Guan SU, Liu J (2002) Incremental ordered neural network training. J Intell Syst 12(3):137–172
10. Guan SU, Li S (2002) Parallel growing and training of neural networks using output parallelism. IEEE Trans Neural Netw 13(3):542–550
11. Guan SU, Ramanathan K (2007) Percentage-based hybrid pattern training with neural network specific cross over. J Intell Syst 16(1):1–26
12. Guan SU, Li P (2002) A hierarchical incremental learning approach to task decomposition. J Intell Syst 12(3):201–226
13. Guan SU, Li S, Liu J (2002) Incremental learning with an increasing input dimension. J Inst Eng Singapore 42(4):33–38
14. Guan SU, Liu J (2004) Incremental neural network training with an increasing input dimension. J Intell Syst 13(1):43–69
15. Guan SU, Neo TN, Bao C (2004) Task decomposition using pattern distributor. J Intell Syst 13(2):123–150
16. Guan SU, Zhu F (2004) Class decomposition for GA-based classifier agents – a pitt approach. IEEE Trans Syst Man Cybern, Part B: Cybern 34(1):381–392
17. Guan SU, Zhu F (2005) An incremental approach to genetic algorithms based classification. IEEE Trans Syst Man Cybern Part B 35(2):227–239
18. Haykin S (1999) Neural networks, a comprehensive foundation. Prentice Hall, Englewood Cliffs, NJ
19. Holland JH (1973) Genetic algorithms and the optimal allocation of trials. SIAM J Comput 2(2):88–105
20. Kim SP, Sanchez JC, Erdogmus D, Rao YN, Wessberg J, Principe J, Nicolelis M (2003) Divide and conquer approach for brain machine interfaces: non linear mixture of competitive linear models. Neural Netw 16(5–6):865–871
21. Lang KJ, Witbrock MJ (1988) Learning to tell two spirals apart. In: Touretzky D, Hinton G, Sejnowski T (eds.) Proceedings of the 1988 connectionist models summer school, Morgan Kaufmann, San Mateo, CA
22. Lasarczyk CWG, Dittrich P, Banzhaf W (2004) Dynamic subset selection based on a fitness case topology. Evol Comput 12(4):223–242
23. Lehtokangas M (1999) Modeling with constructive backpropagation. Neural Netw 12:707–714
24. Lu BL, Ito K, Kita H, Nishikawa Y (1995) Parallel and modular multi-sieving neural network architecture for constructive learning. In: Proceedings of the 4th international conference on artificial neural networks 409:92–97
25. Quinlan JR (1986) Induction of decision trees. Mach Learn 1:81–106
26. Rumelhart D, Hinton G, Williams R (1986) Learning internal representations by error propagation. In: Rumelhart D, McClelland J (eds.) Parallel distributed processing, 1: Foundations. MIT, Cambridge, MA
27. Satoh H, Yamamura M, Kobayashi S (1996) Minimal generation gap model for GAs considering both exploration and exploitation. In: Proceedings of 4th Int Conference on Soft Computing, Iizuka:494–497
28. The UCI Machine Learning repository: http://www.ics.uci.edu/mlearn/MLRepository.html
29. Wong MA, Lane T (1983) A kth nearest neighbor clustering procedure. J R Stat Soc B 45(3):362–368
30. Yao X (1993) A review of evolutionary artificial neural networks. Int J Intell Syst 8(4):539–567
31. Yasunaga M, Yoshida E, Yoshihara I (1999) Parallel backpropagation using genetic algorithm: real-time BP learning on the massively parallel computer CP-PACS. In: International Joint Conference on Neural Networks 6:4175–4180
32. Zhang BT, Cho DY (1998) Genetic programming with active data selection. Simulated Evol Learn 1485:146–153
Enhancing Recursive Supervised Learning Using Clustering and Combinatorial Optimization (RSL-CC) Kiruthika Ramanathan and Sheng Uei Guan
Summary. The use of a team of weak learners to learn a dataset has been shown to be better than the use of a single strong learner. In fact, the idea is so successful that boosting, an algorithm combining several weak learners for supervised learning, has been considered to be one of the best off-the-shelf classifiers. However, some problems still remain, including determining the optimal number of weak learners and the overfitting of data. In an earlier work, we developed the RPHP algorithm, which solves both these problems by using a combination of genetic algorithm, weak learner and pattern distributor. In this paper, we revise the global search component by replacing it with a cluster-based combinatorial optimization. Patterns are clustered according to the output space of the problem, i.e., natural clusters are formed based on patterns belonging to each class. A combinatorial optimization problem is therefore formed, which is solved using evolutionary algorithms. The evolutionary algorithms identify the "easy" and the "difficult" clusters in the system. The removal of the easy patterns then gives way to the focused learning of the more complicated patterns. The problem therefore becomes recursively simpler. Overfitting is overcome by using a set of validation patterns along with a pattern distributor. An algorithm is also proposed to use the pattern distributor to determine the optimal number of recursions and hence the optimal number of weak learners for the problem. Empirical studies show generally good performance when compared to other state-of-the-art methods.
1 Introduction
Recursive supervised learners are a combination of weak learners, data decomposition and integration. Instead of learning the whole dataset, different learners (neural networks, for instance) are used to learn different subsets of the data, resulting in several sub solutions (or sub-networks). These subnetworks are then integrated together to form the final solution to the system. Figure 1 shows the general architecture of such learners.

K. Ramanathan and S.U. Guan: Enhancing Recursive Supervised Learning Using Clustering and Combinatorial Optimization (RSL-CC), Studies in Computational Intelligence (SCI) 82, 157–176 (2008) © Springer-Verlag Berlin Heidelberg 2008 www.springerlink.com
Fig. 1. Architecture of the final solution of recursive learners
In the design of recursive learners, several factors come into play in determining the generalization accuracy of the system. These factors include:
1. The accuracy of the subnetworks
2. The accuracy of the integrator
3. The number of subnetworks
The choice of subsets for training each of the subnetworks plays an important part in determining the effect of all these factors. Various methods have been used for subset selection in the literature. Several works, including topology based selection [17] and recursive pattern based training [6], use evolutionary algorithms to choose subsets of data. Algorithms such as active data selection [23], boosting [21] and multisieving [19] implement this subset choice using neural networks trained with various training methods. Multiple weak learners have also been used, with the best weak learner given the responsibility of training and solving a subset [9]. Other algorithms make use of more brute force methods such as the Mahalanobis distance [3]. Still other algorithms, such as the output parallelism algorithm, manually decompose their tasks [5], [7], [10]. Clustering has also been used to decompose datasets in some cases [2]. The common method used in these algorithms (except for manual decomposition) is to allow a network to learn some patterns and declare these patterns as learnt. The system then creates other networks to deal with the unlearnt patterns. The process is repeated recursively with successively decreasing subset size, allowing the system to concentrate more and more on the difficult patterns. While all these methods work relatively well, the hitch lies in the fact that, with the exception of manual partitioning, most of the techniques above use
some kind of intelligent learner to split the data. While intelligent learners and algorithms such as neural networks [12] and genetic algorithms [13] are effective, they are usually considered black boxes [1], with little known about the structure of the underlying solution. In this chapter, our aim is to reduce, to a certain degree, this black box nature of recursive data decomposition and training. As in previous works, genetic algorithms are used to select subsets. However, unlike in previous works, genetic algorithms are not used to select the patterns for a subset, but to select clusters of patterns for a subset. By using this approach, we hope to group patterns into subsets and derive a more optimal partitioning of data. We also aim to gain a better understanding of optimal data partitioning and the features of well partitioned data.

The system proposed consists of a pre-trainer and a trainer. The pre-trainer is made up of a clusterer and a pattern distributor. The clusterer splits the data set into clusters of patterns using agglomerative hierarchical clustering. The pattern distributor assigns validation patterns to each of these clusters. The trainer then solves a combinatorial optimization problem, choosing the clusters that can be learnt with the best training and validation accuracy. These clusters form the easy patterns, which are then learnt using gradient descent with the constructive backpropagation algorithm [18] to create the first subnetwork (a three layered perceptron). The remaining clusters form the difficult patterns. The trainer now focuses attention on the difficult patterns, thereby recursively isolating and learning increasingly difficult patterns and creating several corresponding subnetworks.

The use of genetic algorithms in selecting clusters is expected to be more efficient than their use in the selection of patterns for two reasons:
1. The number of combinations is now ${}^{n}C_{k}$ as opposed to ${}^{T}C_{L}$, where the number of available clusters n is less than the number of training patterns T. Similarly, the number of clusters chosen, k, is smaller than the number of training patterns chosen, L. The solution space is now smaller, therefore increasing the probability of finding a better solution.
2. The distribution of validation information is performed during pre-training, as opposed to during the training time. Validation pattern distribution is therefore a one-off process, thereby saving training time.

The rest of the paper is organized as follows. We begin with some preliminaries and related work in section 2. In section 3, we present a detailed description of the proposed Recursive supervised learning with clustering and combinatorial optimization (RSL-CC) algorithm. To increase the practical significance of the paper, we include pseudo code, remarks and parameter considerations where appropriate. More details and specifications of the algorithm are then presented in section 4. In section 5, we present some heuristics and practical guidelines for making the algorithm perform better. Section 5.4 presents the results of the RSL-CC algorithm on some benchmark pattern recognition problems, comparing them with other recursive hybrid
and non-hybrid techniques. In section 6, we complete the study of the RSL-CC algorithm. We summarize the important advantages and limitations of the algorithm and conclude with some general discussion and future work.
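The search-space argument made in the introduction (choosing k of n clusters rather than L of T patterns) can be made concrete with a small calculation. The sketch below is illustrative only: T is taken from the Vowel training set size reported earlier in this volume, while L, n and k are hypothetical values chosen for the example.

```python
import math

# T: number of training patterns (Vowel training set size).
# L, n, k: hypothetical values, chosen only to illustrate the orders of magnitude.
T, L = 495, 100   # select L patterns out of T
n, k = 20, 5      # select k clusters out of n

pattern_subsets = math.comb(T, L)   # size of the pattern-level search space
cluster_subsets = math.comb(n, k)   # size of the cluster-level search space

print("cluster-level combinations:", cluster_subsets)
print("pattern-level combinations have", len(str(pattern_subsets)), "decimal digits")
```

Under these assumed values the cluster-level search space contains a few thousand candidates, whereas the pattern-level space is astronomically larger, which is the intuition behind the first of the two reasons given above.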
2 Some preliminaries
2.1 Notation
m       : Input dimension
n       : Output dimension
K       : Number of recursions
I       : Input
O       : Output
Tr      : Training
Val     : Validation
P       : Ensemble of subsets
S       : Neural network solution
E       : Error
T       : Number of training patterns
r       : Recursion index
Nchrom  : Number of chromosomes
Nc      : Number of clusters
2.2 Problem formulation for recursive learning
Let $I_{tr} = \{I_1, I_2, \ldots, I_T\}$ be a representation of T training inputs. $I_j$ is defined, for any $j \in \{1, \ldots, T\}$, over an m dimensional feature space, i.e., $I_j \in \mathbb{R}^m$. Let $O_{tr} = \{O_1, O_2, \ldots, O_T\}$ be a representation of the corresponding T training outputs. $O_j$ is a binary string of length n. $Tr$ is defined such that $Tr = \{I_{tr}, O_{tr}\}$. Similarly, $I_v = \{I_{v,1}, I_{v,2}, \ldots, I_{v,T_v}\}$ and $O_v = \{O_{v,1}, O_{v,2}, \ldots, O_{v,T_v}\}$ represent the input and output patterns of a set of validation data, such that $Val = \{I_v, O_v\}$. We wish to take $Tr$ as training patterns to the system and $Val$ as the validation data and come up with an ensemble of K subsets. Let P represent this ensemble of K subsets:

$P = \{P^1, P^2, \ldots, P^K\}$, where, for $i \in \{1, \ldots, K\}$,
$P^i = \{Tr^i, Val^i\}$, $Tr^i = \{I^i_{tr}, O^i_{tr}\}$, $Val^i = \{I^i_v, O^i_v\}$

Here, $I^i_{tr}$ and $O^i_{tr}$ are $m \times T_i$ and $n \times T_i$ matrices respectively, and $I^i_v$ and $O^i_v$ are $m \times T^i_v$ and $n \times T^i_v$ matrices respectively, such that $\sum_{i=1}^{K} T_i = T$ and $\sum_{i=1}^{K} T^i_v = T_v$. We need to find a set of neural networks $S = \{S^1, S^2, \ldots, S^K\}$, where $S^1$ solves $P^1$, $S^2$ solves $P^2$ and so on.
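As a small illustration of the bookkeeping implied by this formulation, the sketch below checks that K training subsets and K validation subsets, given as lists of pattern indices, account for every pattern exactly once. The index-based representation, the function name and the requirement that subsets be disjoint are assumptions made for the example; the formulation itself only states that the subset sizes sum to T and T_v.

```python
def is_valid_ensemble(train_subsets, val_subsets, T, Tv):
    """Check a hypothetical index-based ensemble P = {P^1, ..., P^K}.

    train_subsets[i] and val_subsets[i] list the pattern indices placed in
    Tr^(i+1) and Val^(i+1). The ensemble is accepted when both collections
    cover indices 0..T-1 and 0..Tv-1 exactly once (sum of sizes equals T, Tv).
    """
    if len(train_subsets) != len(val_subsets):
        return False  # every P^i needs both a training and a validation part
    train_all = sorted(j for subset in train_subsets for j in subset)
    val_all = sorted(j for subset in val_subsets for j in subset)
    return train_all == list(range(T)) and val_all == list(range(Tv))

# K = 2 subsets over T = 5 training and Tv = 3 validation patterns.
ok = is_valid_ensemble([[0, 2, 4], [1, 3]], [[0], [1, 2]], T=5, Tv=3)
bad = is_valid_ensemble([[0, 2, 4], [2, 3]], [[0], [1, 2]], T=5, Tv=3)
print(ok, bad)
```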
P should fulfill two conditions:

Condition set 1: Conditions for a good ensemble
1. The individual subsets can be trained with a small mean square training error, i.e., $E^i_{tr} = O^i_{tr} - S^i(I^i_{tr}) \to 0$.
2. None of the subsets $Tr^1$ to $Tr^K$ are overtrained, i.e., $\sum_{i=1}^{j} E^i_{val} < \sum_{i=1}^{j+1} E^i_{val}$; $j, j+1 \in K$

2.3 Related work
As mentioned in the introduction, several methods are used to select a suitable partition $P = \{P^1, P^2, \ldots, P^K\}$ that fulfills conditions 1 and 2 above. In this section we discuss some of them in greater detail, along with their pros and cons.

Manual decomposition
Output parallelism was developed by Guan and Li [5]. The idea involves splitting an n-class problem into n two-class problems. The rationale behind the decomposition is that a two-class problem is easier to solve than an n-class problem and hence is more efficient. Each of the n sub-problems consists of two outputs, class i and not class i ($\bar{i}$). Guan et al. [7] later added an integrator in the form of a pattern distributor to the system. The output parallelism algorithm essentially develops a set of n sub-networks, each catering to a two-class problem, and integrates them using a pattern distributor. While the algorithm is shown to be effective in terms of both training time and generalization accuracy, a major drawback of the algorithm is its class based manual decomposition. In fact, research [7] shows empirically that the two-class decomposition is not necessarily the optimum decomposition for a problem. The optimum decomposition is problem dependent. Some problems are better solved when decomposed into three-class sub-problems, others when decomposed into sub-problems with a variable number of classes. While automatic algorithms have been developed to overcome this problem of manual decomposition [8], the net result is an algorithm which is computationally expensive.
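The class-based decomposition used by output parallelism can be sketched as follows. This is a minimal illustration of splitting an n-class label vector into n two-output target sets ("class i" versus "not class i"); it does not reproduce the training or the pattern distributor of [5] and [7], and the function name and the integer label encoding are assumptions.

```python
def output_parallel_targets(labels, n_classes):
    """Build one two-output target list per class.

    For sub-problem i, a pattern of class i gets the target (1, 0)
    ("class i") and every other pattern gets (0, 1) ("not class i").
    """
    return [
        [(1, 0) if label == i else (0, 1) for label in labels]
        for i in range(n_classes)
    ]

# Three patterns with integer class labels 0, 2 and 1 (hypothetical data).
labels = [0, 2, 1]
sub_problems = output_parallel_targets(labels, n_classes=3)
for i, targets in enumerate(sub_problems):
    print("sub-problem", i, targets)
```

Each of the n target lists would then be used to train its own sub-network on the full set of inputs, which is why the number of sub-networks in this scheme is fixed by the number of classes rather than by the data.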
The other concern associated with output parallelism is that it can only be applied to classification problems.

Genetic algorithm based partitioning algorithms
GA based algorithms are shown to be effective in partitioning the datasets into simpler subsets. One interesting observation was that using GA for partitioning leads to simpler and easily separable subsets [6]. However, the problem with these approaches is that the use of GA is computationally costly. Also, a criterion has to be established, right at the
beginning, to separate the difficult patterns from the easy ones. In the case of various algorithms, criterions used include the history of difficulty, the degree of learning [17] and so on. In the Recursive Pattern based hybrid supervised learning (RPHS) algorithm [6], the problem of the degree of learning is overcome by hybridizing the algorithm and adapting the solution to the problem topology by using a neural network. In this algorithm, GA is only a pattern selection tool. However, the problem of computational cost still remains. Brute force methods Brute force mechanisms include the use of a distance or similarity measure such as the Mahalanobhis distance [3]. Subsets are selected to ensure good separation of patterns in one output class from another, resulting in good separation between classes in each subset. A similar method, bounding boxes, has been developed [9] which clusters patterns based on their Euclidean distances and overlap from patterns of other classes. While the objective of the approach is direct, much effort in involved in pretraining as it involves brute force distance computation. In [9], the authors have shown that the complexity of the brute force distance measure increases almost exponentially with the problem dimensionality. Also, a single distance based criterion may not be the most suitable for different problems, or even different classes. Neural network approaches Neural network approaches to decomposition are similar to genetic algorithms based approaches. Multisieving [19], for instance, allows a neural network to learn patterns, after which it isolates the network and allows another network to learn the “unlearnt” patterns. The process continues until all the patterns are learnt. Boosting [21] is similar in nature, except that instead of isolating the “learnt” patterns, a weight is assigned to them. An unlearnt pattern has more weight and is thus concentrated on by the new network. 
Boosting is a successful method and has been regarded as “one of the best off the shelf classifiers” in the world [11]. However, the system uses neural networks to select patterns, which solves the problem but results in a black-box-like structure. The underlying reason for the solution formation is unknown, i.e., we do not know why some patterns are better learnt. While one can solve the dataset with these approaches, not much information is gained about the data in the solving process. Also, the multisieving algorithm, for instance, does not address generalization, which is often an issue in task decomposition approaches. Subsets which are too small can result in the overtraining of the corresponding subnetwork.
RSL Using Clustering and Combinatorial Optimization
Clustering based approaches

Clustering based approaches to task decomposition for supervised learning are manifold [2]. Most of them divide the dataset into clusters and solve individual clusters separately. However, this particular approach is not very good, as it creates a bias: many subsets have several patterns in one class and few patterns in the other classes. The effect of this bias has been observed in the PCA reduced visualizations of the data and also in the generalization accuracy of the resulting network. Therefore, the networks created are not sufficiently robust.

Separability

For the purpose of this chapter, we are interested in subsets which fulfill the following conditions:

Condition set 2 Separability conditions for good subset partitioning
1. Each class in a subset must be well separated from other classes in the same subset.
2. Each subset must be well separated from another subset.

Intuitively, we can observe that fulfilling these two conditions is equivalent to fulfilling condition set 1.
3 The RSL-CC algorithm

The RSL-CC algorithm can be described in two parts, pre-training and training. In this section, we explain these two aspects of training in detail.

3.1 Pre-training

1. We express $I_{tr}$, $O_{tr}$, $I_v$ and $O_v$ as a combination of n classes of patterns, i.e.,

$I_{tr} = \{I^{C_1}, I^{C_2}, \ldots, I^{C_n}\}$
$O_{tr} = \{O^{C_1}, O^{C_2}, \ldots, O^{C_n}\}$
$I_v = \{I_v^{C_1}, I_v^{C_2}, \ldots, I_v^{C_n}\}$
$O_v = \{O_v^{C_1}, O_v^{C_2}, \ldots, O_v^{C_n}\}$

2. The datasets $I_{tr}$, $O_{tr}$, $I_v$ and $O_v$ are split into n subsets,

$\{I^{C_1}, O^{C_1}, I_v^{C_1}, O_v^{C_1}\}, \{I^{C_2}, O^{C_2}, I_v^{C_2}, O_v^{C_2}\}, \ldots, \{I^{C_n}, O^{C_n}, I_v^{C_n}, O_v^{C_n}\}$ (1)

where each subset in expression 1 consists of only patterns from one class.

3. Each subset $\{I^{C_i}, O^{C_i}, I_v^{C_i}, O_v^{C_i}\}$, $i \in \{1, \ldots, n\}$, now undergoes a clustering treatment as below:
• Cluster $I^{C_i}$ into $k^{C_i}$ partitions or natural clusters. Any clustering algorithm can be used, including SOMs, K-means [14], Agglomerative Hierarchical clustering [15] and Bounding Boxes [9].
• Using a pattern distributor, patterns in $I_v^{C_i}$ are assigned to one of the $k^{C_i}$ clusters. In this paper, we implement the pattern distributor using the nearest neighbor algorithm [22].
• Each pattern, validation or training, in a given cluster $j^{C_i}$, $j \in \{1, \ldots, k^{C_i}\}$, has the same output pattern.

4. The total number of clusters is now the sum of the natural clusters formed in each class,

$N_c = \sum_{i=1}^{n} k^{C_i}$ (2)
3.2 Training

1. Number of recursions r = 1.
2. A set of binary chromosomes is created, each chromosome having $N_c$ elements, where $N_c$ is defined in equation 2. An element in a chromosome is set at 0 or 1, with 1 indicating that the corresponding cluster will be selected for solving in recursion r.
3. A genetic algorithm is executed to minimize $E_{tot}$, the average of the training and validation errors $E_{tr}$ and $E_{val}$:

$E_{tot} = \frac{1}{2}(E_{tr} + E_{val})$ (3)

4. The best chromosome $Chrom_{best}$ is a binary string with a combination of 0s and 1s, with the size $N_c$. The following steps are executed:
a) $N_c^r = 0$, $Tr^r = []$, $Val^r = []$
b) For e = 1 to $N_c$
c) if $Chrom_{best}(e) == 1$
   $N_c^r$++
   $Tr^r = Tr^r + Tr_{Chrom_{best}(e)}$
   $Val^r = Val^r + Val_{Chrom_{best}(e)}$
d) The data is updated as follows:
   $Tr = Tr - Tr^r$
   $Val = Val - Val^r$
   $N_c = N_c - N_c^r$
   r++
e) $Tr^r$ and $Val^r$ are used to find $S^r$, the solution network corresponding to the subset of data in recursion r.
5. Steps 2 to 4 are repeated with the new values of Tr, Val, $N_c$ and r.
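Step 4 of the training procedure can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `split_by_chromosome` and the representation of clusters as lists of patterns are assumptions made here for clarity. The function collects the patterns of every cluster whose chromosome bit is 1 (the subset used in recursion r) and returns the unselected clusters for the next recursion.

```python
from typing import List, Sequence, Tuple

Pattern = Tuple[float, ...]


def split_by_chromosome(
    chrom_best: Sequence[int],
    train_clusters: List[List[Pattern]],
    val_clusters: List[List[Pattern]],
):
    """Apply the best chromosome of one recursion to the cluster lists.

    Returns (tr_r, val_r, rest_train, rest_val): the selected training and
    validation patterns, and the clusters left for the next recursion.
    """
    if not (len(chrom_best) == len(train_clusters) == len(val_clusters)):
        raise ValueError("chromosome length must equal the number of clusters")

    tr_r, val_r = [], []            # selected patterns start empty (step a)
    rest_train, rest_val = [], []   # clusters that stay in Tr and Val (step d)
    for bit, tr_c, val_c in zip(chrom_best, train_clusters, val_clusters):
        if bit == 1:                # steps b and c: cluster is selected
            tr_r.extend(tr_c)
            val_r.extend(val_c)
        else:
            rest_train.append(tr_c)
            rest_val.append(val_c)
    return tr_r, val_r, rest_train, rest_val
```

The number of clusters left, `len(rest_train)`, plays the role of the updated $N_c$ in step d.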
Fig. 2. Using a KNN distributor to test the RSL-CC system
3.3 Simulation

Simulation and testing of the RSL-CC algorithm are implemented using a Kth nearest neighbor (KNN) [22] based pattern distributor. KNN was used to implement the pattern distributor due to the ease of its implementation. At the end of training, we have K subsets of data. A given test pattern is matched with its nearest neighbor. If the neighbor belongs to subset i, the pattern is also deemed as belonging to subset i. The solution for subset i is then used to find the output of the pattern. A multiplexer is used for this function. The KNN distributor provides the select input for the multiplexer, while the outputs of subnetworks 1 to K are the data inputs. This process is illustrated by figure 2.
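The distributor and multiplexer described above can be sketched as follows. This is a hedged illustration only: the function names, the use of a single nearest neighbor and the Euclidean distance are assumptions made here, and the subnetworks are represented by arbitrary Python callables.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def nearest_subset(x: Vector, train_inputs: List[Vector],
                   subset_ids: List[int]) -> int:
    """Return the subset index of the training pattern closest to x."""
    nearest = min(range(len(train_inputs)),
                  key=lambda i: math.dist(x, train_inputs[i]))
    return subset_ids[nearest]


def rsl_cc_predict(x: Vector, train_inputs: List[Vector],
                   subset_ids: List[int],
                   subnetworks: List[Callable[[Vector], object]]):
    """Multiplexer: the distributor selects which subnetwork output is used."""
    select = nearest_subset(x, train_inputs, subset_ids)
    return subnetworks[select](x)
```

Here `subset_ids[i]` records the subset (recursion) to which training pattern `i` was assigned, so the distributor needs no information beyond the decomposition produced during training.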
4 Details of the RSL-CC algorithm

Figure 3 summarizes the data decomposition of the RSL-CC algorithm. Conceptually, we can view the algorithm as finding the simpler subset of data and developing a subnetwork to solve the subset. The size of the “complicated” subset becomes smaller as training proceeds, thereby allowing the system to focus more on the “complicated” data. When the size of the remaining dataset becomes too small, we find that there is no motivation for further decomposition and the remaining data is trained in the best possible
Fig. 3. Illustration of the data decomposition of the RSL-CC algorithm
way. Later in this paper, we observe how the use of GA’s combinatorial optimization automatically takes care of when to stop recursions. The use of GAs to select patterns [17], [6] requires extensive tests for detrimental decomposition and overtraining. The proposed RSL-CC algorithm eliminates the need for such tests. As a result, the resulting algorithm is self sufficient and very simple, with minimal adaptations.
4.1 Illustration

Figure 4 shows a hypothetical condition where the RSL-CC algorithm is applied to create a system to learn the dataset shown. The steps performed on the dataset are traced below. With the data in Figure 4, the best chromosome selected at the end of the first recursion has the configuration “0 0 1 0 1 1 1 0 0 1 1 0 0 1 0 0 1”, and the chromosome selected at the end of the second recursion has the configuration “1 0 1 0 1 1 1 1 1”. At the end of the third recursion, the chromosome has the configuration “1 1”. All the remaining data is selected and the training is considered complete.

Note:
1. Hypothetical data.
2. Pre-training: patterns are clustered according to (1) class labels and (2) natural clusters within the class. Clusters 1, 3, 5, 8, 12, 14, 15 and 17 contain patterns from class 1 and the rest of the clusters contain patterns from class 2.
3. The combinatorial optimization procedure of the 1st recursion selects the above clusters as the “easy” patterns. They are isolated and separately learnt.
4. These patterns are the “difficult” patterns of the 1st recursion; they are focused on in the second recursion.
5. The above patterns are considered “easy” by the combinatorial optimization of the second recursion and are isolated and learnt separately.
6. The remaining “difficult” patterns of the second recursion are solved by the 3rd recursion.

4.2 Termination criteria

The grouping of patterns means that clusters of patterns are selected for each subset. Further, in contrast with any other method, the proposed GA based recursive subset selection selects the optimal subset combination. Therefore, we can assume that the decomposition performed is optimal. We prove this by apagoge (proof by contradiction).

Proof. If the subset chosen at recursion r is not optimal, an alternative subset will be chosen. The largest possible alternative subset is the training set remaining for the recursion, $Tr^r = Tr - \sum_{i=1}^{r-1} Tr^i$. Therefore, if decomposition at a particular recursion is suboptimal, no decomposition will be performed and the training will be completed.

We therefore have the following termination conditions:

Condition set 3 Termination conditions
1. No clusters of patterns are left in the system
Fig. 4. RSL-CC decomposition illustration (panels a to e)
2. Only one cluster is left in the remaining data
3. More than one cluster is present in the remaining data, but all the clusters belong to the same class

Condition 1 above occurs when the optimal choice in a system is to choose all the clusters, as decomposition is not favorable. Conditions 2 and 3 cover cases where it is not necessary to create a classifier due to the homogeneity of output classes.

Fitness function for combinatorial optimization

In equation 3, we defined the fitness function as an average of the training and validation errors obtained when training the subset selected by the chromosome, $\frac{1}{2}(E_{tr} + E_{val})$. The values of $E_{tr}$ and $E_{val}$ are calculated as follows:
1. Design a three layered neural network with an arbitrary number of hidden nodes (we use 10 nodes for the purpose of this paper).
2. Use the training and validation subsets selected by the corresponding chromosome to train the network.

The chromosome whose network performs best is chosen as $Chrom_{best}$.
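The fitness value of equation 3, the termination conditions of condition set 3 and the chromosome trace of Section 4.1 can be checked with a short sketch. The helper names below are hypothetical and chosen only for this illustration; `remaining_classes` is assumed to hold one class label per cluster still left in the data.

```python
from typing import List, Sequence


def total_error(e_tr: float, e_val: float) -> float:
    """Equation 3: average of the training and validation errors."""
    return 0.5 * (e_tr + e_val)


def should_terminate(remaining_classes: Sequence[int]) -> bool:
    """Condition set 3: no cluster, one cluster, or one class is left."""
    return len(remaining_classes) <= 1 or len(set(remaining_classes)) == 1


def clusters_left(chromosomes: Sequence[str]) -> List[int]:
    """Count the clusters left (bits equal to 0) after each recursion."""
    left: List[int] = []
    for chrom in chromosomes:
        bits = [int(b) for b in chrom.split()]
        # Each chromosome must have one bit per cluster still in the data.
        if left and len(bits) != left[-1]:
            raise ValueError("chromosome length must match the clusters left")
        left.append(bits.count(0))
    return left


trace = clusters_left([
    "0 0 1 0 1 1 1 0 0 1 1 0 0 1 0 0 1",  # first recursion, 17 clusters
    "1 0 1 0 1 1 1 1 1",                  # second recursion
    "1 1",                                # third recursion
])
print(trace)  # [9, 2, 0]
```

For the three chromosomes quoted in Section 4.1, 9 clusters remain after the first recursion, 2 after the second and none after the third, which corresponds to termination condition 1.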
5 Heuristics for improving the performance of the RSL-CC algorithm

The design of a neural network system “is more an art than a science in the sense that many of the numerous factors involved in the design are as a result of one’s personal experience.” [12]. While the statement is true, we wish to make the RSL-CC system as scientific, and as little dependent on art, as possible. Therefore, we propose here several methods which improve the algorithm and make its implementation “to the point”.

5.1 Minimal coded genetic algorithms

The implementation of Minimal coded Genetic Algorithms (MGG) [4], [20] was considered because the bulk of the training time of an evolutionary neural network is due to the evaluation of the fitness of a chromosome. In Minimal coded GAs, however, only a minimal number of offspring is generated at each stage. The algorithm is outlined briefly below.
1. From the population P, select µ parents randomly.
2. Generate θ offspring from the µ parents using recombination/mutation.
3. Choose two parents at random from the µ parents.
4. Of the two parents, one is replaced with the best of the θ offspring and the other is replaced by a solution chosen by a roulette wheel selection procedure from a combined population of the θ offspring and the two selected parents.
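One generation of the four steps above might look as follows in Python. This is a sketch under stated assumptions rather than the implementation of [4] or [20]: the one-point crossover, the bit-flip mutation, the roulette weights proportional to the inverse error and all function names are choices made here for illustration. `error_fn` is assumed to return a non-negative error (lower is better).

```python
import random
from typing import Callable, List

Chromosome = List[int]


def recombine(a: Chromosome, b: Chromosome, rng: random.Random,
              p_mut: float = 0.1) -> Chromosome:
    """One-point crossover plus bit-flip mutation (assumed operators)."""
    cut = rng.randrange(1, len(a)) if len(a) > 1 else 0
    child = a[:cut] + b[cut:]
    return [1 - g if rng.random() < p_mut else g for g in child]


def roulette(pool: List[Chromosome], errors: List[float],
             rng: random.Random) -> Chromosome:
    """Pick one chromosome with probability proportional to 1 / error."""
    weights = [1.0 / (e + 1e-9) for e in errors]
    return rng.choices(pool, weights=weights, k=1)[0]


def mgg_step(population: List[Chromosome],
             error_fn: Callable[[Chromosome], float],
             rng: random.Random, mu: int = 4, theta: int = 1) -> None:
    """Run one MGG generation; the population is modified in place."""
    parents = rng.sample(range(len(population)), mu)           # step 1
    offspring = []
    for _ in range(theta):                                      # step 2
        i, j = rng.sample(parents, 2)
        offspring.append(recombine(population[i], population[j], rng))
    first, second = rng.sample(parents, 2)                      # step 3
    best_child = min(offspring, key=error_fn)                   # step 4
    pool = offspring + [population[first], population[second]]
    picked = roulette(pool, [error_fn(c) for c in pool], rng)
    population[first] = list(best_child)
    population[second] = list(picked)
```

In a real run the error of each chromosome would be cached, since every call to `error_fn` implies training a network; the sketch recomputes it for brevity.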
In order to make the genetic algorithm efficient timewise, we choose the values µ = 4 and θ = 1 for the GA based neural networks. Therefore, except for the initial population evaluation, the time taken for evolving one epoch using MGG is equivalent to the forward pass of the backpropagation algorithm.

5.2 Population size

The number of elements in each chromosome depends on the total number of clusters formed. However, the number of chromosomes in the population, in this paper, is evaluated as follows:

$N_{chrom} = \min(2^{N_c}, POPSIZE)$ (4)
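Equation 4 can be sketched in Python as below. The function name `initial_population` and the random initialisation of large populations are assumptions made for this illustration; the default of 20 for POPSIZE is the population size listed in Table 2. When $2^{N_c}$ is the smaller value, every possible binary chromosome can simply be enumerated once.

```python
import random
from itertools import product
from typing import List, Optional


def initial_population(n_clusters: int, pop_size: int = 20,
                       rng: Optional[random.Random] = None) -> List[List[int]]:
    """Create min(2**n_clusters, pop_size) binary chromosomes."""
    rng = rng or random.Random()
    n_chrom = min(2 ** n_clusters, pop_size)
    if n_chrom == 2 ** n_clusters:
        # Small search space: list every chromosome exactly once.
        return [list(bits) for bits in product((0, 1), repeat=n_clusters)]
    # Large search space: draw pop_size random chromosomes (assumed scheme).
    return [[rng.randint(0, 1) for _ in range(n_clusters)]
            for _ in range(n_chrom)]
```

With 4 clusters this yields all 16 distinct chromosomes, while with 17 clusters it yields 20 chromosomes of 17 bits each.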
Equation 4 means that the population size is either POPSIZE, a constant for the maximal population size, or, if $N_c$ is small, $2^{N_c}$. The argument behind the use of a smaller population size is that when there are 4 clusters, for example, it is inefficient to evaluate a large number of chromosomes. So only 16 chromosomes are created and evaluated.

5.3 Number of generations

In the case where the number of chromosomes is $2^{N_c}$, only one generation is executed. This step is to ensure efficiency of the algorithm.

5.4 Duplication of chromosomes

Again with efficiency in mind, in the case where the population size is $2^{N_c}$, we ensure that all the chromosomes are unique. Therefore, when the number of clusters is small, the algorithm is a brute force technique.

Experimental results

5.5 Problems Considered

Table 1 summarizes the three classification problems considered in this paper. The problems were chosen such that they varied in terms of input and output dimensions as well as in the number of training, testing and validation patterns made available. All the datasets, other than the two-spiral dataset, were obtained from the UCI repository. The results of the two-spiral dataset were compared with constructive backpropagation, multisieving and the topology based subset selection algorithms only. This was because the SPAM and two-spiral problems were two-class problems. Therefore, implementing output parallelism will not make a difference to the results obtained by CBP.
The two-spiral dataset consists of 194 patterns. To ensure a fair comparison to the Dynamic subset selection algorithm [17], test and validation datasets of 192 patterns were constructed by choosing points next to the original points in the dataset, as mentioned in the paper.

5.6 Experimental parameters and control algorithms implemented

Table 2 summarizes the parameters used in the experiments. As we wish for the RSL-CC technique to be as problem independent as possible, we keep all the experimental parameters constant for all problems, as given below. Each experiment was run 40 times, with 4-fold cross validation.

For comparison purposes, the following algorithms were designed and implemented. The constructive backpropagation algorithm [18] was implemented as a single staged (non hybrid) algorithm which conducts gradient descent based search with the possibility of evolutionary training by the addition of a hidden node. The Multisieving algorithm [19] (recursive non hybrid pattern based selection) was implemented to show the necessity of finding the correct pseudo global optimal solution.

Table 1. Summary of the problems considered

Problem name          Training set size   Test set size   Validation set size   Number of inputs   Number of outputs
Vowel                 495                 248             247                   10                 11
Letter recognition    10000               5000            5000                  16                 26
Two spiral            194                 192             192                   2                  2
Table 2. Summary of parameters used

Parameter                                                 Value
Evolutionary search parameters
  Population size                                         20
  Crossover probability                                   0.9
  Small change mutation probability                       0.1
MGG parameters                                            µ = 4, θ = 1
Neural network parameters
  Generalization loss tolerance for validation            1.5
  Backpropagation learning rate                           10^−2
  Number of neighbors in the KNN pattern distributor      1
The following control experiments were carried out based on the multisieving algorithm and the dynamic topology based subset finding algorithm. Both the versions of output parallelism implemented also show the effect of hybrid selection. In order to illustrate the effect of the GA based combinatorial optimization, we also implement the single cluster algorithm explained in section 2 [2]. The algorithm, in contrast to the RSL-CC algorithm, simply divides the data into clusters and develops a network to solve each cluster separately. The RSL-CC algorithm was also compared to our earlier work on RPHS [6], which uses a hybrid algorithm to recursively select patterns, as opposed to clusters. In a nutshell, the following algorithms were implemented to compare with the various properties of the RSL-CC algorithm:
1. Constructive Backpropagation
2. Multisieving¹
3. Dynamic topology based subset finding
4. Output parallelism without pattern distributor [5]
5. Output parallelism with pattern distributor [7]
6. Single clustering for supervised learning
7. Recursive pattern based hybrid supervised learning
Table 2 summarizes the parameters used. For clustering, agglomerative hierarchical clustering (AHC) is employed with complete linkage and the cityblock distance. Using thresholding, the natural clusters of patterns in each class were obtained. AHC was preferred to other clustering methods such as K-means or SOMs due to its parametric nature and since the number of target clusters is not required beforehand.

5.7 Results

We divide this section into two parts. In the first part, we compare the mean generalization accuracies of the various recent algorithms described in this paper with the generalization accuracy of the RSL-CC algorithm. In the second part, we present the clusters and data decomposition employed by RSL-CC, illustrating the finding of simple subsets.

Comparison of generalization accuracies

From Tables 3 to 5, we can observe that the generalization error of RSL-CC is comparable to the generalization error of the RPHS algorithm and is a general improvement over other recent algorithms. A particularly significant improvement can be observed in the vowel dataset. There is some tradeoff observed in terms of training time. The training times for the RSL-CC algorithm for the vowel and two-spiral problems are

¹ The multisieving algorithm [19] did not propose a testing system. We are testing the generalization accuracy of the system using the KNN pattern distributor, similar to the RSL-CC pattern distributor.
Table 3. Summary of the results obtained from the VOWEL problem (38 clusters, 12 recursions)

Algorithm used                                  Training time (s)   Classification error (%)
Constructive backpropagation                    237.9               37.16
Multisieving with KNN pattern distributor       318.23              39.43
Output Parallelism                              418.9               25.54
Output parallelism with pattern distributor     534.3               24.89
RPHS                                            473.88              17.733
Single clustering                               458.43              25.24
RSL-CC                                          547.37              9.84
Table 4. Summary of the results obtained from the LETTER RECOGNITION problem (100 clusters, 16 recursions)

Algorithm used                                  Training time (s)   Classification error (%)
Constructive Backpropagation                    20845.05            21.672
Multisieving with KNN pattern distributor       55349               65.04
Output Parallelism                              42785.4             20.06
Output parallelism with pattern distributor     45625.4             18.636
RPHS-MGGD                                       29701               12.42
RSL-CC                                          12682               13.04
higher than other methods. However, it is interesting to note that the training time for the Letter Recognition problem is 50% less than any of the recent algorithms. We believe that this reduction in training time comes from the reduction of the problem space from the selection of patterns to the selection of clusters: clusters are selected from 100 possible clusters, while RPHS has to select patterns out of 10,000, so the solution space is reduced 100 fold. On the other hand, for the vowel problem, the problem space is reduced only about 13 fold. RSL-CC is more efficient when the saving from the reduction of the problem space outweighs the cost of the GA-based combinatorial optimization.

The RSL-CC decomposition figures

Figures 5 and 6 illustrate the data decomposition for the letter recognition and the vowel problems. Only one instance of decomposition is presented in the figures. From the figures, we can observe the data being split into increasingly smaller subsets, thereby increasing focus on the difficult patterns. The decomposition presented is the 2 dimensional projection of the input space on the principal component axes (PCA) [3].
Table 5. Summary of the results obtained from the TWO-spiral problem (4 clusters, 2 recursions)

Algorithm used                                      Training time (s)   Classification error (%)
Constructive Backpropagation                        15.58               49.38
Multisieving with KNN pattern distributor           35.89               23.61
Dynamic Topology Based subset selection (TSS)       –                   28.0
RPHS                                                59.97               11.08
Single clustering                                   14.35               10.82
RSL-CC                                              30.61               10.82
Fig. 5. Decomposition of data for the Letter recognition problem
Fig. 6. Decomposition of data for the vowel problem
6 Conclusions and future directions

In this chapter, we present the RSL-CC algorithm, which divides the problem space into class based clusters, where a combination of clusters forms a subset. The problem therefore becomes a combinatorial optimization problem, where the clusters chosen for the subset become the parameter to be optimized. Genetic algorithms are used to solve this problem and select a good subset. The subset chosen is then trained separately, and the combinatorial optimization problem is repeated with the remaining clusters. The process continues recursively until all the patterns are learnt. The subnetworks are then integrated using a KNN based pattern distributor and a multiplexer. Results show that reducing the problem space into clusters simplifies the problem space and produces generalization accuracies which are either comparable to or better than other recent algorithms in the same domain.

Future directions would include parallelizing the RSL-CC algorithm and exploring the use of other clustering methods such as K-means or SOMs with the algorithm. The study of the effect of various clustering algorithms will help us to better determine the simplicity and robustness of the algorithm. Also to be studied are methods to further reduce the training time of combinatorial optimization, alternative fitness functions and ways to determine the robustness of class based clustering.
References

1. Dayhoff JE, DeLeo JM (2001) Artificial neural networks: Opening the black box. Cancer 91(8):1615–1635
2. Engelbrecht AP, Brits R (2002) Supervised training using an unsupervised approach to active learning. Neural Process Lett 15(3):247–260
3. Fukunaga K (1990) Introduction to statistical pattern recognition. Academic, Boston
4. Gong DX, Ruan XG, Qiao JF (2004) A neuro computing model for real coded genetic algorithm with the minimal generation gap. Neural Comput Appl 13:221–228
5. Guan SU, Li S (2002) Parallel growing and training of neural networks using output parallelism. IEEE Trans Neural Netw 13(3):542–550
6. Guan SU, Ramanathan K (2004) Recursive percentage based hybrid pattern training. In: Proceedings of the IEEE conference on cybernetics and intelligent systems, pp 455–460
7. Guan SU, Neo TN, Bao C (2004) Task decomposition using pattern distributor. J Intell Syst 13(2):123–150
8. Guan SU, Qi Y, Tan SK, Li S (2005) Output partitioning of neural networks. Neurocomputing 68:38–53
9. Guan SU, Ramanathan K, Iyer LR (2006) Multi learner based recursive training. In: Proceedings of the IEEE conference on cybernetics and intelligent systems (Accepted)
10. Guan SU, Zhu F (2004) Class decomposition for GA-based classifier agents – a Pitt approach. IEEE Trans Syst Man Cybern B Cybern 34(1):381–392
11. Hastie T, Tibshirani R, Friedman J (2001) The elements of statistical learning: Data mining, inference, and prediction. Springer, New York
12. Haykin S (1999) Neural networks, a comprehensive foundation. Prentice Hall, Upper Saddle River, NJ
13. Holland JH (1973) Genetic algorithms and the optimal allocation of trials. SIAM J Comput 2(2):88–105
14. MacQueen JB (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of 5th Berkeley symposium on mathematical statistics and probability, vol 1. University of California Press, Berkeley, pp 281–297
15. Jain AK, Murty MN, Flynn PJ (1999) Data clustering: A review. ACM Comput Surv 31(3):264–323
16. Kohonen T (1997) Self organizing maps. Springer, Berlin
17. Lasarczyk CWG, Dittrich P, Banzhaf W (2004) Dynamic subset selection based on a fitness case topology. Evol Comput 12(4):223–242
18. Lehtokangas M (1999) Modeling with constructive backpropagation. Neural Netw 12:707–714
19. Lu BL, Ito K, Kita H, Nishikawa Y (1995) Parallel and modular multi-sieving neural network architecture for constructive learning. In: Proceedings of the 4th international conference on artificial neural networks, 409, pp 92–97
20. Satoh H, Yamamura M, Kobayashi S (1996) Minimal generation gap model for GAs considering both exploration and exploitation. In: Proceedings of 4th international conference on soft computing, Iizuka, pp 494–497
21. Schapire RE (1999) A brief introduction to boosting. In: Proceedings of the 16th international joint conference on artificial intelligence, pp 1–5
22. Wong MA, Lane T (1983) A kth nearest neighbor clustering procedure. J Roy Stat Soc B 45(3):362–368
23. Zhang BT, Cho DY (1998) Genetic programming with active data selection. Lect Notes Comput Sci 1585:146–153
Evolutionary Approaches to Rule Extraction from Neural Networks

Urszula Markowska-Kaczmar
Summary. The chapter starts with a short survey of existing methods of rule extraction from neural networks. Because searching for rules is similar to an NP-hard problem, the application of evolutionary algorithms to rule extraction is justified. The survey also contains a short description of evolutionary based methods. It provides the background for presenting our own experience with successful applications of evolutionary algorithms to this process. Two methods of rule extraction, namely REX and GEX, are presented in detail. They represent a global approach to rule extraction, perceiving a neural network through the set of pairs: input pattern and response produced by the neural network. REX uses propositional fuzzy rules and is composed of two methods, REX Michigan and REX Pitt. GEX takes advantage of classical crisp rules. All details of these methods are described in the chapter. Their efficiency was tested in experimental studies using different benchmark data sets from the UCI repository. A comparison to other existing methods was made and is presented in the chapter.
1 Introduction

Neural networks are widely used in real life. Many successful applications in various areas may be listed here, for example: in show business [29], in pattern recognition [36] and [11], in medicine, e.g. in drug development, image analysis and patient diagnosis (two major cornerstones are detection of coronary artery disease and processing of EEG signals [30]), in robotics [27], in industry [37] and in optimization [7]. Their popularity is a consequence of their ability to learn. Instead of an algorithm that describes how to solve the problem, neural networks need a training set with patterns representing the proper transformation of the input patterns onto the output patterns. After training, they can generalize the knowledge acquired during training, performing the trained task on previously unseen data. They are also well known for their skill in removing noise from data. A further advantage is their resistance to damage.

U. Markowska-Kaczmar: Evolutionary Approaches to Rule Extraction from Neural Networks, Studies in Computational Intelligence (SCI) 82, 177–209 (2008) © Springer-Verlag Berlin Heidelberg 2008 www.springerlink.com
Although the number of neural network applications increases, one of the obstacles to their acceptance in real life is their black box image. In industrial and bio-medical projects, the clients need to be convinced that the neural network output can be justified in an understandable way. This is the main reason why methods of knowledge extraction from neural networks are developed. Knowledge acquired from neural networks is represented as crisp propositional rules. Fuzzy rules [26], first order rules and decision trees are used as well. The most popular representation is propositional rules, because they are easily comprehensible for humans. As presented in Section 3, searching for the conditions in the premise part of a rule is similar to an NP-hard problem. That is why the evolutionary approach, which is well known for its ability to search quickly, is very helpful in this case.
2 The basics of neural networks

In this section the elementary concepts of neural networks are presented. They give grounds for understanding the way in which rules are extracted from a neural network. Generally speaking, each neural network consists of elements that perform very simple operations: multiplication, addition and transformation of the obtained result by a given function. It is remarkable that such simple elements, once connected together, are able to solve such complex problems. Fig. 1 presents a model of a neuron. It determines a net-input value on the basis of all its input connections. Typically, we calculate the net value, denoted net, by summing the input values multiplied by the corresponding weights, as in Eq. (1):

$net_i = \sum_j x_j w_{ij}$ (1)

Once the net input is calculated, it is converted to the output value by applying an activation function f:

$y_i = f_i(net_i)$ (2)

Various types of functions can be applied as an activation function. Fig. 2 shows a typical one, the sigmoidal function. Another popular one is the hyperbolic tangent. Its shape is similar to the sigmoidal function, but the
Fig. 1. The model of a neuron
Fig. 2. The example of sigmoidal function
Fig. 3. The feedforward neural network
values belong to the range (−1,1) instead of (0,1). These simple elements connected together create a neural network. Depending on the way they are joined, one can distinguish feedforward and recurrent neural networks. Here we will focus on the first type of neural network architecture. Frequently, in the feedforward network, neurons are organized in layers. The neurons in a given layer are not connected to each other. Connections exist only between neurons of neighbouring layers. Such a network is shown in Fig. 3. One can distinguish the input layer, which distributes the input information to the neurons in the next layer. Then, the information is processed by the hidden layer. Each neuron in this layer adds weighted signals and processes the total net by an activation function. Once all neurons in this layer have calculated their output values, the neurons in the output layer become active. They calculate the total net and process it by the activation function, as described for the previous layer. This is the way in which information is processed by the network. In general, the network can contain more than one hidden layer. As we mentioned above, the network acquires its ability to solve the problem through the training process. It consists in searching the weights in the
iterative way. After training, the network can produce the response on the basis of the input value. To train the network, it is necessary to collect data in the training set T:

$T = \{(\mathbf{x}_1, \mathbf{y}_1), (\mathbf{x}_2, \mathbf{y}_2), \ldots, (\mathbf{x}_p, \mathbf{y}_p)\}$ (3)

Each element is a pair, which is an example of the proper transformation of an input value represented by the vector $\mathbf{x} = [x_1, x_2, \ldots, x_N]$ onto the desired output value represented by the vector $\mathbf{y} = [y_1, y_2, \ldots, y_M]$. The most popular training rule is backpropagation. It minimizes the squared error $E_p$ between the desired output $\mathbf{y}$ for the given pattern $\mathbf{x}$ and the answer of the neural network for this input, represented by the vector $\mathbf{o} = [o_1, o_2, \ldots, o_M]$:

$E_p = \frac{1}{2} \sum_k (y_{pk} - o_{pk})^2$ (4)

The weights are updated iteratively until the error reaches the assumed value. In each time step the weight is changed according to Eq. (5):

$w(t + 1) = w(t) + \alpha \cdot \Delta w(t)$, (5)

where α is the training coefficient, which influences the speed of training, and ∆w(t) is the change in weight during time step t. This value is calculated on the basis of the gradient of the error $E_p$. However, because of the large number of neurons in a network and the parallelism of processing, it is difficult to describe clearly how a network solves the problem. Broadly speaking, the knowledge of this problem is encoded in the network’s architecture, weights and activation functions of individual neurons.
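Equations (1), (2), (4) and (5) can be illustrated for a single sigmoidal neuron with the short Python sketch below. It is an illustration under assumptions, not the full backpropagation algorithm: the function names are chosen here, and ∆w is taken as the negative gradient of $E_p$ with respect to the weights of one neuron.

```python
import math
from typing import List, Sequence


def sigmoid(net: float) -> float:
    """Sigmoidal activation function with values in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-net))


def neuron_output(x: Sequence[float], w: Sequence[float]) -> float:
    net = sum(xj * wj for xj, wj in zip(x, w))   # Eq. (1): weighted sum
    return sigmoid(net)                           # Eq. (2): activation


def squared_error(y: Sequence[float], o: Sequence[float]) -> float:
    """Eq. (4): half the sum of squared differences over the outputs."""
    return 0.5 * sum((yk - ok) ** 2 for yk, ok in zip(y, o))


def update_weights(x: Sequence[float], w: Sequence[float],
                   y: float, alpha: float) -> List[float]:
    """Eq. (5) with delta_w taken as the negative gradient of E_p."""
    o = neuron_output(x, w)
    delta_w = [(y - o) * o * (1.0 - o) * xj for xj in x]
    return [wj + alpha * dwj for wj, dwj in zip(w, delta_w)]
```

For example, with x = [1.0, 0.5], w = [0.1, −0.2] and desired output y = 1, the net input is 0 and the output is 0.5; one update with α = 0.5 moves the output towards 1 and therefore lowers $E_p$.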
3 Rule extraction from neural networks

In the literature two main groups of neural network applications are considered: classification and regression. The majority of methods are applied to networks that perform the classification task, although papers concerning rule extraction for networks solving a regression task appear as well [32]. Here we focus on the first group.

3.1 Problem formulation

Let’s assume that a network solves a classification problem. This amounts to dividing objects (patterns) into k mutually separable classes $C_1, C_2, \ldots, C_k$. For the network, the class of an object is encoded by the vector $\mathbf{y} = [y_1, y_2, \ldots, y_k]$, in which exactly one $y_i = 1$ (and $y_j = 0$ for $j \neq i$). Fig. 4 represents the scheme of the neural network solving a classification task. As an input it takes a vector $\mathbf{x}$ describing attributes (features) of the pattern (object) and as an output it produces the vector $\mathbf{y}$ encoding the class
Rule Extraction from Neural Networks
Fig. 4. The scheme of the neural network for classification
of this pattern. We assume that during rule extraction we know the patterns from the training set. The aim of rule extraction from a network is to find a set of rules that describes the performance of the neural network. The most popular form is the propositional rule, although predicates are applied as well. Because of their comprehensibility we concentrate on propositional rules, which have the following form:

$$\text{IF } prem_1 \text{ AND } prem_2 \text{ AND } \ldots \text{ AND } prem_N \text{ THEN } class_j \tag{6}$$
The premise $prem_i$ refers to the i-th network input and formulates a condition that has to be satisfied by the value of the i-th attribute in order to classify the pattern to the class indicated by the conclusion of the rule. The other popular form is the propositional fuzzy rule, shown below:

$$\text{IF } x_1 \text{ is } Z_{1r} \text{ AND } \ldots \text{ AND } x_N \text{ is } Z_{Nb} \text{ THEN } y_1 y_2 \ldots y_k \tag{7}$$
where $x_i$ corresponds to the i-th input of the NN, the premise "$x_i$ is $Z_{ib}$" states that the attribute (input variable) $x_i$ belongs to the fuzzy set $Z_{ib}$ ($i \in [1, N]$), and $y_1 y_2 \ldots y_k$ is the code of a class, where k is the number of classes and only one $y_c = 1$ ($c \in [1, k]$; $y_i = 0$ for $i \neq c$). In general, the number of premises is less than or equal to N, the number of inputs of the neural network. Fuzzy sets can take different shapes. The most popular are triangular, trapezoidal and Gaussian functions. An example of triangular fuzzy sets is shown in Fig. 5.

There are many criteria for evaluating the quality of the extracted rules. The most frequently used include:
• fidelity – stands for the degree to which the rules reflect the behaviour of the network they have been extracted from,
• accuracy – is determined on the basis of the number of previously unseen patterns that have been correctly classified,
• consistency – occurs if, during different training sessions, the network produces sets of rules that classify unseen patterns in the same way,
U. Markowska-Kaczmar
Fig. 5. The way of assigning of triangular fuzzy sets; a < b < c
• comprehensibility – is defined as the ratio of the number of rules to the number of premises in a single rule.

In [2] these four requirements are abbreviated to FACC. These criteria are also discussed in detail by Ghosh and Taha in [34]. In real applications not all of them need to be taken into account, and their importance weights may differ. The main problem in rule extraction is that these criteria, especially fidelity and comprehensibility, tend to be contradictory. The least complex sets, consisting of few rules, usually cover only the most typical cases, which are represented by large numbers of training patterns. If we want to improve a given set's fidelity, we must add new rules that deal with the remaining patterns and exceptions, thus making the set more complicated and less understandable. Therefore rule extraction requires finding a compromise between different criteria, since their simultaneous optimization is practically infeasible. A good rule extraction algorithm should have the following properties [35]:
• independence from the architecture of the network,
• no restrictions on the process of a network's training,
• guaranteed correctness of obtained results,
• a mechanism of accurate description of a network's performance.
3.2 The existing methods of rule extraction from neural networks

Research on effective methods of acquiring knowledge from neural networks has been carried out since the 1990s, which testifies not only to the importance of the problem but also to its complexity. The developed approaches may be divided into two main categories: global and local. This taxonomy is based on the degree to which a method examines the structure of a network. In the local approach the first stage consists in describing the conditions of a single neuron's activation, i.e. in determining the values of its inputs that produce an output equal to 1. Let us consider the example shown in Fig. 6, which comes from [2]. The neuron has five binary inputs and one binary output (for simplicity the step function is used in this case
Fig. 6. A single neuron and extracted rules
with threshold θ). The neuron fires if the following condition is satisfied:

$$\sum_{i=1}^{5} x_i w_i > \theta \tag{8}$$
Let us consider the rule: IF x1 AND x2 AND NOT x5 THEN 1, which in Fig. 6 is represented as $y \leftarrow x_1 \wedge x_2 \wedge \neg x_5$. The rule says that if $x_1$ = True, $x_2$ = True and $x_5$ = False, then the output y is 1, i.e. True. This rule holds independently of the values of the other attributes because:

$$0 \le x_3 w_3 + x_4 w_4 \le 4 \tag{9}$$
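The argument can be checked mechanically by enumerating every assignment of the inputs that the rule leaves free. The sketch below does this for a step-function neuron. The weights and the threshold are hypothetical values chosen only to agree with the bound in (9); the actual values are given in Fig. 6, which is not reproduced here.

```python
from itertools import product

# Hypothetical weights and threshold, chosen to agree with the bound in (9);
# the actual values appear only in Fig. 6.
WEIGHTS = [6, 4, 4, 0, -4]
THETA = 9


def fires(x, weights=WEIGHTS, theta=THETA):
    """Step-function neuron from Eq. (8): fires when the weighted sum exceeds theta."""
    return sum(xi * wi for xi, wi in zip(x, weights)) > theta


def rule_holds(fixed, n_inputs=5):
    """Check that the neuron fires for every assignment of the inputs not fixed by the rule.

    `fixed` maps a 0-based input index to the binary value required by the rule premise.
    """
    free = [i for i in range(n_inputs) if i not in fixed]
    for values in product([0, 1], repeat=len(free)):
        x = [0] * n_inputs
        for i, v in fixed.items():
            x[i] = v
        for i, v in zip(free, values):
            x[i] = v
        if not fires(x):
            return False
    return True


# IF x1 AND x2 AND NOT x5 THEN 1
rule_ok = rule_holds({0: 1, 1: 1, 4: 0})
# Dropping the premise on x5 breaks the rule: x5 = 1 can pull the sum below theta.
weaker_rule_ok = rule_holds({0: 1, 1: 1})
```

The enumeration grows exponentially with the number of free inputs, which is one reason why the local approach is simple only for small networks.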
What does rule extraction mean in the context of a neural network with many hidden and output neurons and a continuous activation function? When a sigmoidal function or hyperbolic tangent is applied, then by setting an appropriate value of β we can model the step function (Fig. 2). In this case, to obtain rules for a single neuron we can follow the procedure presented above. Such rules are created for all neurons and concatenated on the basis of mutual dependencies. Thus we obtain rules that describe the relations between the inputs and outputs of the entire network, as illustrated in Fig. 7. Examples include the following methods: Partial RE [34], M of N [12], Full RE [34], RULEX [3]. To create a rule describing the activation of a neuron, the majority of algorithms belonging to this group consider the influence of the sign of the weight. When the weight is positive, the input helps to move the neuron towards firing (output = 1); when the weight is negative, the input hinders the firing of the neuron. The problem of rule extraction by means of the local approach is simple if the network is relatively small.

Global methods treat a network as a black box, observing its inputs and the responses produced at the outputs. In other words, a network provides the method with training patterns. The examples include VIA [35], which uses a procedure similar to classical sensitivity analysis, BIO-RE [34], which applies
Fig. 7. The local approach to rule extraction; source: [2]
truth tables to extract rules, Ruleneg [13], where an adaptation of the PAC algorithm is applied, or [28], which is based on inversion. It is worth mentioning that with such an approach the architecture of a neural network is insignificant. Most of the methods of rule extraction concern multilayer perceptrons (MLP networks, Fig. 3). However, methods dedicated to other networks are developed as well, e.g. [33], [10]. A method that would meet all these requirements (or at least the vast majority) has not been developed yet. Some of the methods are applicable to enumerable or real data only, some require repeating the process of training or changing the network's architecture, and some require providing a default rule that is used if no other rule can be applied in a given case. The methods differ in computational complexity (which is not specified in most cases). Hence the necessity of developing new methods. Some of them use evolutionary algorithms to solve this problem, which will be presented in Section 5.
4 Basic concepts of evolutionary algorithms

Evolutionary algorithms (EA), inspired by biological evolution, work with a population of individuals that encode the solution¹. Individuals evolve toward better and better individuals by means of genetic operators such as mutation and crossover, and by selection and reproduction. The operators, especially mutation, are designed specifically for the way the solution is encoded. For some problems specialized genetic operators are defined.
¹ Although the term evolutionary algorithms is used in the literature [9], [25] as a general term for different concepts such as genetic algorithms, evolutionary strategies and genetic programming, here we use it in a narrowed sense in order to underline the difference with respect to the genetic algorithm. In the classical genetic algorithm a solution is encoded in a binary way. Here real numbers are used to encode information. This requires specifying special genetic operators, but the general outline of the algorithm is the same.
The outline of an evolutionary algorithm consists of the following steps:
1. Randomly generate an initial population P(0) of individuals I.
2. Compute and save the fitness f(I) for each individual I in the current population P(t).
3. Define selection probabilities p(I) for each individual I in P(t), so that p(I) is proportional to f(I).
4. Generate P(t + 1) by selecting individuals from P(t) to produce an offspring population by using genetic operators.
5. Go to step 2 until a satisfying solution is obtained.

These steps can be summarised as follows. An evolutionary algorithm starts with an initial population, which in most cases is created randomly. Then individuals are evaluated on the basis of the fitness function, the design of which is a crucial issue in the design of the algorithm. It expresses the function to be optimized in the target problem. The higher the fitness of an individual I, the higher the probability that it will be selected for the new generation. There are different selection schemes; the main ones are rank, roulette wheel and tournament selection. After selection, genetic operators, typically mutation and crossover, are applied to form a new generation of individuals. Crossover is the operator that exchanges genetic material between parents with probability $p_c$. After crossover, the mutation operator is applied, which is designed depending on the information encoding. For binary encoding, mutation flips a bit. For real encoding, the actual value is usually changed by a small amount with probability $p_m$. With reference to rule extraction, these genetic operators will be described in the next sections.
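The five steps above can be sketched as follows for a real-valued encoding. This is an illustrative sketch rather than any specific published algorithm: the toy fitness function, the parameter values, the choice of uniform crossover and Gaussian mutation, and the fixed generation budget used as the stopping rule are all assumptions.

```python
import random


def fitness(individual):
    """Toy fitness to maximise (an assumption): peaks at 1.0 when every gene is 0."""
    return 1.0 / (1.0 + sum(g * g for g in individual))


def evolve(dim=3, pop_size=30, generations=50, p_c=0.7, p_m=0.1, seed=0):
    rng = random.Random(seed)
    # Step 1: random initial population P(0) with real-valued genes.
    population = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        # Step 2: evaluate every individual in P(t).
        scores = [fitness(ind) for ind in population]
        # Steps 3-4: roulette-wheel selection (probability proportional to fitness),
        # followed by crossover and mutation to build P(t + 1).
        offspring = []
        while len(offspring) < pop_size:
            a, b = rng.choices(population, weights=scores, k=2)
            child = list(a)
            if rng.random() < p_c:
                # Uniform crossover: each gene comes from either parent.
                child = [ga if rng.random() < 0.5 else gb for ga, gb in zip(a, b)]
            # Real-valued mutation: shift a gene by a small amount with probability p_m.
            child = [g + rng.gauss(0, 0.1) if rng.random() < p_m else g for g in child]
            offspring.append(child)
        population = offspring
        # Step 5 is replaced here by a fixed budget; the best individual seen so far is kept.
        best = max(population + [best], key=fitness)
    return best


best = evolve()
```

Roulette-wheel selection requires positive fitness values, which the toy function guarantees; rank or tournament selection would remove that restriction.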
5 Evolutionary methods in rule extraction from neural networks

The description of rule extraction from neural networks given in Section 3 gave the reader an idea of the scale of the search space and of the difficulties connected with it. That is why there is a need for other techniques that limit the search space. Evolutionary algorithms are well known for their ability to search for a solution in a huge space. Typically, they do not offer the optimal solution in the global sense² but they produce a solution relatively quickly. Since searching for the conditions in the premises of rules is a very hard problem, evolutionary algorithms may be very useful in this case.
² There are evolutionary algorithms with proven convergence to a global optimum [8].
5.1 Local approach

Let us recall that the local approach to rule extraction (Section 3.2), which starts by searching for rules describing the activation of each neuron in the network, is simple only when the network is relatively small. Otherwise, methods that decrease the number of connections and the number of neurons are used. Examples of solutions applied in this case include clustering of hidden neuron activations (a cluster of neurons is substituted by one neuron) and optimizing the structure of a neural network by introducing a special training procedure that prunes the network. Using evolutionary algorithms, as shown in [31], [15], may also be very helpful.

Let us start our survey of EA applications in rule extraction from neural networks by presenting the approach from [31]. The authors evolve the topology of feedforward neural networks, which are trained by the RPROP algorithm, a gradient method similar to backpropagation. They use the ENZO tool, which evolves a fully connected feedforward neural network with a single hidden layer. The tool is extended by an implementation of the RX algorithm of rule extraction from neural networks (proposed in [19]), which is a local method. In ENZO each gene in the genotype representing an individual is interpreted as a connection between two neurons. The selection is based on ranking: the higher the ranking of an individual, the higher the probability that it will be selected. Two crossover operators are designed. The first one inserts all connections from the parents into the child, and the weights are the average of the weights inherited from the parents. The second one inserts a randomly chosen hidden neuron and its connections from the parent with the higher fitness value into the current topology of the child. The mutation operator inserts a neuron into or removes a neuron from the current topology. The next crucial element of a method based on an evolutionary algorithm is the fitness function.
Here the fitness function is expressed as follows:

$$fitness = 1 - W_{acc}\, Acc + W_{com}\, comprehensibility \tag{10}$$
where $W_{acc}$, $W_{com}$ are the importance weights for the accuracy and comprehensibility terms, Acc is the accuracy, expressed as the quotient of the number of true positive covered patterns to the number of all patterns, and comprehensibility in turn is defined as follows:

$$comprehensibility = 1 - \frac{2\,\frac{R}{MaxR} + \frac{C}{MaxC}}{3} \tag{11}$$

In this equation R is the number of rules, C is the total number of rule conditions, MaxR is the maximum number of rules extracted from an individual, and MaxC is the maximum number of rule conditions among all individuals evolved so far.
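A short worked example may help to see how the comprehensibility term behaves. The sketch below reads Eq. (11) as 1 − (2·R/MaxR + C/MaxC)/3, which is an assumed reconstruction of the printed formula; the function name and the counts are hypothetical.

```python
def comprehensibility(n_rules, n_conditions, max_rules, max_conditions):
    """Eq. (11), read as 1 - (2 * R/MaxR + C/MaxC) / 3 (an assumed reconstruction)."""
    return 1.0 - (2.0 * n_rules / max_rules + n_conditions / max_conditions) / 3.0


# Hypothetical counts: a compact rule set and the largest one seen so far.
compact = comprehensibility(n_rules=1, n_conditions=3, max_rules=10, max_conditions=30)
largest = comprehensibility(n_rules=10, n_conditions=30, max_rules=10, max_conditions=30)
```

Under this reading the measure stays within [0, 1]: the largest rule set scores 0, and the number of rules counts twice as much as the number of conditions.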
Fig. 8. One genetic cycle in extended ENZO includes rule extraction and evaluation
Evolution lasts for an assumed number of generations, which is a parameter of the application. Compared to the classical genetic algorithm, the evolution cycle contains, besides selection, crossover and mutation, the training of each neural network, rule extraction, and evaluation of the rules acquired for each individual (Fig. 8).

A similar but extended approach is used in [16], where the genotype of an individual is composed of the connection matrix and the weight matrix. The evolutionary algorithm searches for a minimal structure and the weights of the neural network. The final values of the weights are tuned by the RPROP algorithm. The crossover operator is not implemented, but there are four mutation operators: insertion of a hidden neuron, deletion of a hidden neuron, insertion of a connection and deletion of a connection. To mutate the weight matrix, Gaussian mutation is used. In contrast to the previously described approach, the authors apply multiobjective optimization. Optimization of the neural network structure is expressed as a biobjective optimization problem:

$$\min(f_1, f_2) \tag{12}$$

where:

$$f_1 = E \tag{13}$$

$$f_2 = \Omega \tag{14}$$
where E is the error of the neural network and Ω is the regularization term, expressed by the number of connections in the network. The paper gives an illustrative example that demonstrates the feasibility of searching for the optimal architecture of a neural network and acquiring rules for that network. It is the breast cancer benchmark problem from the UCI repository [4]. It contains 699 examples with 9 input features that belong to two classes: benign and malignant. Only the first output of the neural network is considered in the
Fig. 9. The simplest network found by approach proposed in [16]
work. The simplest network obtained by the authors from 41 networks is shown in Fig. 9. On the basis of this simple network it was very easy to acquire two rules, assuming that the class malignant is decided when y < 0.25 and benign when y > 0.75:

$$R_1: \text{IF } x_2 \text{ (clump thickness)} \ge 0.5 \text{ THEN malignant} \tag{15}$$

$$R_2: \text{IF } x_2 \text{ (clump thickness)} \le 0.2 \text{ THEN benign} \tag{16}$$
Based on these two rules, only 2 out of 100 test samples were misclassified, and 4 of them could not be decided, with a predicted value of 0.49.

The paper [14] is an example of the application of an evolutionary algorithm to clustering the activations of hidden neurons. It uses a simple encoding and the chromosomes have a fixed length, which allows classical genetic operators to be used. The fitness function is based on the Euclidean distance between activation values of hidden neurons and the number of objects belonging to the given cluster. The rules in this case have the following form:

IF $v_{1min} \le a_1 \le v_{1max}$ AND $v_{2min} \le a_2 \le v_{2max}$ ... AND $v_{nmin} \le a_n \le v_{nmax}$ THEN class $C_j$,

where $a_i$ is the activation value of the i-th hidden neuron, $v_{imin}$ and $v_{imax}$ are respectively its minimal and maximal values, and n is the number of hidden neurons. To acquire rules, in the first step the equations describing the activation of each hidden neuron are created. Then, while the training patterns are processed, the activation value and the class label are recorded for each neuron. Next, the activation values obtained in the previous step are separated according to the class, and the hidden neuron activations are clustered with the application of an EA. It means that the EA searches for groups of activation values of hidden neurons. Finally, in order to acquire rules of the form given above, the minimal and maximal values for each cluster are determined.
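The last step, turning a cluster of hidden activations into an interval rule, can be sketched as below. The clustering itself, which [14] performs with an evolutionary algorithm, is not reproduced: the cluster is simply given. The function names and the activation values are hypothetical.

```python
def interval_rule(cluster, class_label):
    """Build one rule from a cluster of hidden-layer activation vectors.

    Each vector holds the activations a_1..a_n of the n hidden neurons;
    the rule stores (v_i_min, v_i_max) for every neuron.
    """
    n = len(cluster[0])
    bounds = [
        (min(vec[i] for vec in cluster), max(vec[i] for vec in cluster))
        for i in range(n)
    ]
    return bounds, class_label


def rule_fires(rule, activations):
    """True when every activation lies inside the interval of its neuron."""
    bounds, _ = rule
    return all(lo <= a <= hi for (lo, hi), a in zip(bounds, activations))


# Hypothetical cluster of activations recorded for patterns of class "C1".
cluster_c1 = [[0.10, 0.80], [0.20, 0.90], [0.15, 0.85]]
rule = interval_rule(cluster_c1, "C1")
```

A pattern whose hidden activations fall inside all intervals, for example `rule_fires(rule, [0.12, 0.88])`, is assigned to the class of the rule.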
5.2 Evolutionary algorithms in a global approach to rule extraction from neural networks

Global methods of rule extraction from neural networks seem very attractive because they are independent of the neural network architecture. Below a few examples are described in more detail.

Searching for essential dependency

We start with a relatively simple method, proposed in [17], where an evolutionary algorithm searches for a path composed of connections from an input neuron to an output neuron. Knowing that the neural network solves a classification task, the method searches for an essential input feature that causes patterns to be classified to the given class. A single gene encodes one connection, which means that in order to find a path from the input layer to the output layer of a two-layer network, the chromosome consists of two genes. The fitness function is defined as the product of the weights of the connections that belong to the path. It expresses the strength with which the connections on the path contribute to producing the output. Classical genetic operators are used.

Another idea is shown in [18]. In this paper, for each attribute the range of values is divided into 10 subranges encoded by natural numbers. Each nominal attribute is substituted by a number of binary attributes equal to the number of values of this attribute. The chromosome has as many genes as the neural network has inputs. An example is shown in Fig. 10. A gene value equal to 0 has a special meaning: the condition referring to this attribute is omitted in the final form of the rule. The chromosome encodes the premise part of the rule. The conclusion is established on the basis of the neural network response. Fig. 10 explains the way the chromosome is decoded. Knowing that 1 corresponds to the first subrange of the attribute, in the rule creation phase the first subrange is taken for the premise referring to the first attribute. The values of the second and third genes are transformed in the same way. The pattern is processed by the neural network, and in response the code of the class is delivered on the output layer. The output of the neural network is coded in a local way. According to this principle, the biggest value in the output layer is searched for. If this value is greater than an assumed threshold, the chromosome creates a rule.
Fig. 10. The example of the chromosome in the method described in [18]
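The decoding of such a chromosome can be sketched as follows. The sketch assumes ten equal-width subranges per attribute (the source does not state how the range is split), and the chromosome, the attribute ranges and the network outputs are hypothetical values.

```python
N_SUBRANGES = 10


def decode_premises(chromosome, attribute_ranges):
    """Map each non-zero gene to the interval of its subrange.

    Gene value g in 1..10 selects the g-th of ten equal-width subranges of the
    attribute's range (equal widths are an assumption); gene value 0 means the
    attribute is left out of the rule.
    """
    premises = {}
    for index, (gene, (low, high)) in enumerate(zip(chromosome, attribute_ranges)):
        if gene == 0:
            continue
        width = (high - low) / N_SUBRANGES
        premises[index] = (low + (gene - 1) * width, low + gene * width)
    return premises


def conclusion(network_output, threshold):
    """Locally coded output: pick the largest output, or None if it is not above the threshold."""
    best_class = max(range(len(network_output)), key=lambda k: network_output[k])
    return best_class if network_output[best_class] > threshold else None


# Hypothetical chromosome for three attributes; the second attribute is skipped.
premises = decode_premises([1, 0, 3], [(0.0, 10.0), (0.0, 1.0), (0.0, 100.0)])
label = conclusion([0.1, 0.9, 0.2], threshold=0.5)
```

Here the rule keeps premises on the first and third attributes only, and the conclusion is the index of the winning output neuron; when no output exceeds the threshold, the chromosome yields no rule.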
Fig. 11. The idea of the evolutionary algorithm application in a global method of rule extraction
Treating a neural network as a black box

Now we focus on the approaches that treat a neural network as a black box that delivers patterns for the rule extraction method. Fig. 11 illustrates the idea of this approach. The evolutionary algorithm works here as a tool for extracting rules. In principle, this application of an evolutionary algorithm may be seen as similar to rule extraction directly from data. The main difference lies in the form of the fitness function, which in the case of rule extraction from neural networks has to consider the criteria described in Section 3.1. Generally speaking, there are two possibilities for extracting a set of rules that describes the classification of patterns. In the so-called Michigan approach [25], one individual encodes one rule. In order to obtain the set of rules, different techniques are used. Examples include:
• sequential covering of examples: the patterns covered by a rule are removed from the set of patterns, and a new rule is searched for the remaining patterns,
• an application of two hierarchical evolutionary algorithms, where one searches for single rules, the set of which is then optimized by an evolutionary algorithm on the higher level,
• special heuristics, applied to define when a rule can be substituted by a more general rule in the final set of rules.

In the Pitt approach [25] the set of rules is encoded in one individual, which makes all of the above-mentioned solutions unnecessary, but the chromosome is much more complex.

REX methods - fuzzy rule extraction

Now the Pitt and Michigan approaches to rule extraction from neural networks will be presented on the basis of [23]. In this paper two methods of rule
extraction from neural networks are described: REX Pitt and REX Michigan. Both of them use fuzzy propositional rules instead of the crisp rules considered so far. The fuzzy rules extracted by REX have the form expressed by (7). REX Pitt keeps all knowledge extracted from the neural network in the genotype of one individual, whereas REX Michigan is composed of two evolutionary algorithms, one of which is responsible for generating rules (one rule is encoded in the genotype of an individual) while the other searches for fuzzy set parameters (centers and widths). Each fuzzy set in REX is described by:
• F – a flag defining whether the given fuzzy set is active or not,
• Coding data – one real number corresponding to the apex d of the triangle of a given fuzzy set (Fig. 12).

In the case of REX, rules and fuzzy sets are coded independently. According to the form of the rule presented in (7), the code of a rule is composed of the codes of the premises and the code of the conclusion (Fig. 13). The code of the rule contains a constant number of genes representing premises, equal to the number of neural network inputs, but the activation bit standing before each premise indicates whether it is active or not. The part of the chromosome coding the collection of groups of fuzzy sets is presented in Fig. 12. The number of groups is equal to the number n of input neurons of the neural network. One group consists of a constant number z of fuzzy sets. Each fuzzy set is coded by a real number representing the apex $d_i$ and the binary activation flag F of the fuzzy set. After decoding, only active fuzzy sets (F = 1) form the group. It is worth noting that the flag of activation of the premise and the flag of activation of the fuzzy set are
Fig. 12. A part of the chromosome coding fuzzy sets; $G_n$ – gene of the n-th group of fuzzy sets; $FS_{n,l}$ – code of the l-th fuzzy set in the n-th group; $F_{l,k}$ – flag; $D_{l,k}$ – code of the apex $d_k$ of the triangle of the fuzzy set $FS_{l,k}$
Fig. 13. A part of the chromosome coding rule
Fig. 14. The scheme of the chromosome in REX Pitt; F – activation bit, PC – code of premise, CC – code of conclusion, D – the number corresponding to the triangle apex of the fuzzy set
independent. The first one decides which premises are present in the rule, while the second one determines the form of a fuzzy set group (the number of fuzzy sets and their ranges). Fig. 14 presents the general scheme of an individual in REX Pitt. F is the activation bit. It stands before each rule, premise and fuzzy set. If it is set to 1, the part after this bit is active. PC is the gene of a premise and is coded as an integer indicating the index of the fuzzy set in the group for a given input variable. The description of fuzzy sets included in one individual applies to all rules coded in the individual.

To form a new generation, genetic operators are applied. One of them is mutation, which consists in a random change of chosen elements in the chromosome. There are several mutation operators in the REX algorithm. The mutation of the central point of a fuzzy set (gene D in Fig. 14, which corresponds to d in Fig. 5) is based on adding or subtracting a random floating-point number (the operation is chosen randomly). It is performed according to (17):

$$d \leftarrow d \pm \frac{rand() \cdot range}{10} \tag{17}$$
where rand() is a function giving a random number uniformly distributed in the range [0, 1), and range is the range of possible values of d. The constant 10 in the denominator ensures a small change of d. The mutation of an integer value (PC) relies on adding a random integer modulo z, which is formally expressed by (18):

$$i \leftarrow (i + rand(z - 1)) \bmod z \tag{18}$$
where z is the maximum number of fuzzy sets in a group and is set by the user. The mutation of a bit (for example bit F, bits in CC) is simply its negation. There is also a mutation operator which changes the sequence of the rules. As mentioned, the sequence of rules is essential for the inference process.
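The three basic mutation operators can be sketched as below. The function names are hypothetical, and two points are assumptions: rand(z − 1) in (18) is taken to return an integer between 1 and z − 1, so that the index always changes, and the mutated apex is not clipped to its range because the source does not say whether it should be.

```python
import random


def mutate_apex(d, value_range, rng):
    """Eq. (17): move the apex d by less than a tenth of its range, sign chosen at random."""
    step = rng.random() * value_range / 10
    return d + step if rng.random() < 0.5 else d - step


def mutate_index(i, z, rng):
    """Eq. (18), assuming rand(z - 1) returns an integer in 1..z-1 (requires z >= 2)."""
    return (i + rng.randint(1, z - 1)) % z


def mutate_bit(bit):
    """Bit mutation is negation."""
    return 1 - bit


rng = random.Random(0)
new_d = mutate_apex(0.5, value_range=1.0, rng=rng)
new_i = mutate_index(2, z=5, rng=rng)
```

With z = 5 the mutated premise gene always points to a different fuzzy set of the same group, while the apex moves by less than 0.1 for a range of width 1.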
The second operator is crossover, which is uniform [25]. Because of the floating-point representation and the complex semantics of the chromosome in the REX method, uniform crossover is performed at the level of genes representing the rules and groups of fuzzy sets (see Fig. 14). This means that the offspring inherits the genes of a whole rule or a whole group of fuzzy sets from one parent or the other. After these genetic operations, an individual that is not consistent has to be repaired.

One of the most important elements in the evolutionary approach is the design of the fitness function. Here, the evaluation of individuals consists in a fuzzy reasoning process that uses the decoded set of rules with training and/or testing examples, and in the calculation of the metrics defined below, which are used in the fitness function expressed by (19). The fuzzy reasoning result is compared with that of the NN, giving the following metrics:
• corr – the number of correctly classified patterns, i.e. those patterns which fired at least one rule and for which the result of reasoning (classification) was equal to the output of the neural network; this parameter determines the fidelity of the rule set;
• incorr – the number of incorrectly classified patterns, i.e. those patterns for which the reasoning gave a result different from that of the neural network;
• unclass – the number of patterns that were not classified, i.e. those patterns for which none of the rules fired, so that it was impossible to perform the reasoning.

There are also metrics describing the complexity of rules and fuzzy sets:
• prem – the total number of active premises in active rules,
• fsets – the total number of active fuzzy sets in an individual.

All the metrics mentioned above are used to create the following evaluation function (19):

$$f(i) = \beta \cdot corr_i \cdot \Psi(incorr_i) + \chi \cdot corr_i \cdot \Psi(unclass_i) + \delta \cdot \Psi(prem_i) + \varepsilon \cdot \Psi(fsets_i) \tag{19}$$
where i is the index of the individual; β, χ, δ, ε are weight coefficients; Ψ(x) is a function equal to 2 when x = 0 and equal to 1/x when x ≠ 0. Each component of the fitness function expresses one criterion of the evaluation of an individual. The first one rewards increasing the ratio between correctly and incorrectly classified patterns. The second one is greater the greater the number of correctly classified patterns and the smaller the number of unclassified patterns. The last two components ensure minimizing the number of premises and the number of fuzzy sets. The algorithm terminates when one of the following conditions occurs: a maximum number of steps has elapsed; there is no progress for a certain number
Fig. 15. The idea of REX Michigan
of steps; or the evaluation function for the best individual reaches a certain value.

REX Michigan consists of two specialized evolutionary algorithms alternating with each other (see Fig. 15). The first one, EARules, searches for rules, while the second evolutionary algorithm, which we call EAFuzzySets, optimises the membership functions of the fuzzy sets applied in the rules. This approach could be promising when the initial form of the fuzzy sets is given by experts. The role of EAFuzzySets would then be only the tuning of the fuzzy sets. One individual in EARules codes one rule. In one cycle, EARules is run several times to find a set of rules describing the neural network. Each time, patterns covered by the previously found rule are removed from the training set. This method is known as sequential covering. The rules are evaluated by using simplified fuzzy reasoning, where one of the fuzzy sets found by EAFuzzySets is used. In the first cycle of REX Michigan the set of fuzzy sets is chosen at random or can be established by an expert (in the presented experiments the first option was used). In the evolutionary algorithm optimising fuzzy sets (EAFuzzySets), one individual codes a collection of fuzzy set groups. They are evaluated on the basis of simplified fuzzy reasoning as well. EAFuzzySets runs for a given number of generations ($G_F$), then the best individual is passed to EARules. The cycle of the alternate work
of these two stages lasts until the final set of rules with the appropriate fuzzy sets represents the knowledge contained in the neural network at the required level, or until the given number of cycles has elapsed.

The evolutionary algorithm EARules concentrates on the optimization of the rules. It periodically searches for the best rule describing the neural network, which is then passed to the set of rules. All training examples for which the result of reasoning with the investigated rule is the same as the neural network answer are marked as covered by the rule. This makes it possible to search for a rule for the uncovered examples in the next cycle. The approach is similar to [6]. The form of the rule is the same as in REX Pitt, and the mutation operator from REX Pitt is also applied in REX Michigan. The crossover is a uniform operator. After crossover or mutation, the consistency of each rule is tested, and the individuals are repaired if necessary. To evaluate an individual, simplified fuzzy reasoning is performed for each training example and appropriate metrics are calculated on the basis of the activation of the rule. The following metrics are used in the fitness function:
• corr_i – the number of correctly classified examples, i.e. those examples for which the i-th rule fired with an appropriate strength (greater than θ) and its conclusion was the same as the classification of the neural network for that pattern. This metric is computed over all training examples, even if they were covered by other rules;
• corr_new_i – the number of newly covered examples, i.e. correctly classified examples that were not yet marked as covered;
• incorr_i – the number of incorrectly classified examples: the i-th rule fired with an appropriate strength (the activation of the rule is greater than the threshold θ), but its conclusion was inconsistent with the neural network output. Here all examples are considered, including those marked as covered;
• unclass_i – the number of unclassified examples, for which the i-th rule does not fire although the conclusion of the rule is consistent with the neural network output for this example. All examples are considered in this case;
• unclass_new_i – the number of unclassified examples that are not marked as covered by another rule.

Additionally, the fitness function applies the metric prem_i, which gives the number of active premises in the rule. The fitness function used at this level is given by (20):

$$ f(i) = \alpha \cdot corr\_new_i \cdot \Psi(incorr_i) + \beta \cdot \Psi\left( \frac{corr\_new_i \cdot (incorr_i + 1)}{(corr\_new_i + incorr_i + 1)^3} \right) + \delta \cdot \Psi(prem_i) \tag{20} $$

where i is the number of an individual (rule); α, β, δ are the weight coefficients; corr_new_i, incorr_i, prem_i are the metrics for the i-th rule; Ψ is the function described by (19).
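The bookkeeping behind these metrics can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: the function name `rule_metrics`, its argument layout and the toy data are assumptions introduced here, and only the counting rules follow the definitions given above.

```python
def rule_metrics(activations, rule_class, nn_classes, covered, theta):
    """Count the per-rule metrics used by the EARules fitness function.

    activations -- activation of the rule for each training example
    rule_class  -- conclusion (class) of the evaluated rule
    nn_classes  -- class assigned to each example by the neural network
    covered     -- True where an example is already marked as covered
    theta       -- firing threshold
    """
    counts = {"corr": 0, "corr_new": 0, "incorr": 0,
              "unclass": 0, "unclass_new": 0}
    for act, nn_class, is_covered in zip(activations, nn_classes, covered):
        fired = act > theta
        agrees = rule_class == nn_class
        if fired and agrees:
            counts["corr"] += 1              # counted over all examples
            if not is_covered:
                counts["corr_new"] += 1      # only unmarked examples
        elif fired:
            counts["incorr"] += 1            # fired, but disagrees with the network
        elif agrees:
            counts["unclass"] += 1           # should have fired, but did not
            if not is_covered:
                counts["unclass_new"] += 1
    return counts


# Toy data (hypothetical): six examples, a rule concluding class 1.
metrics = rule_metrics(
    activations=[0.9, 0.8, 0.7, 0.1, 0.2, 0.05],
    rule_class=1,
    nn_classes=[1, 1, 0, 1, 1, 0],
    covered=[True, False, False, False, True, False],
    theta=0.5,
)
```

The fitness value (20) itself is not computed here because it depends on Ψ from (19), which lies outside this passage.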
The first component of the fitness function (20) maximizes the number of newly and correctly classified patterns while minimizing the number of incorrectly classified ones. The second one has a strong selective role, because its value for an individual representing a rule that does not cover any new pattern is much lower than for a rule that covers only one example. The third component minimizes the number of premises in the rule.

The optimization of the fuzzy sets is performed by an evolutionary algorithm called EAFuzzySets. A collection of groups of fuzzy sets is coded in one individual. One group G_j corresponds to one input attribute, i.e. to one neural network input. Additionally, a gene RS coding the sequence of fuzzy rules is included. The initial experiments have shown that classical fuzzy reasoning was not an effective solution in this case, so determining the sequence of fuzzy rules was necessary. The form of the chromosome in EAFuzzySets is presented in Fig. 16. The coding of a group of fuzzy sets is common to both approaches (REX Pitt and REX Michigan). The way of coding a single rule is similar to that in REX Pitt. The first number in the premise indicates whether the premise is active. The second number is simply the index of a fuzzy set in the group corresponding to the given neural network input. Generalising, we can say that the information contained in one chromosome in REX Pitt is split into two chromosomes in REX Michigan. These two chromosomes represent individuals in EARules and EAFuzzySets.

The evaluation of individuals involves their decoding. As a result, in EARules the set of rules is obtained. Then, for each example in the training set, simplified fuzzy reasoning takes place. For individual j at this level, the following statistics are collected during this process: corr_j, incorr_j, unclass_j. Additionally, the complexity of the collection of fuzzy sets is measured by the parameter f_sets_j, the number of active fuzzy sets in the collection. The fitness function has the form (21):

$$ f(j) = \beta \cdot corr_j \cdot \Psi(incorr_j) + \chi \cdot corr_j \cdot \Psi(unclass_j) + \delta \cdot \Psi(f\_sets_j) \tag{21} $$

where j is the index of an individual; β, χ, δ are weight coefficients; corr_j, incorr_j, unclass_j, f_sets_j are the metrics for the j-th individual. Function Ψ is defined by (19). The first component of (21) grows with the number of correctly classified patterns and decreases with the number of incorrectly classified ones. The second one grows with the number of correctly classified patterns and decreases with the number of unclassified ones. The last component minimises the number of fuzzy sets in the individual.
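As an illustration, Eq. (21) can be transcribed directly into code. This is only a sketch: Ψ is defined by (19) earlier in the chapter and is therefore passed in as an argument. The stand-in `psi_placeholder` and the weight values in the usage lines are assumptions made here for the sake of a runnable example, not the chapter's settings.

```python
def fuzzy_sets_fitness(corr, incorr, unclass, f_sets, psi, beta, chi, delta):
    """Eq. (21): fitness of an EAFuzzySets individual.

    psi is the function from Eq. (19); it must be supplied by the caller.
    """
    return (beta * corr * psi(incorr)
            + chi * corr * psi(unclass)
            + delta * psi(f_sets))


def psi_placeholder(x):
    """Hypothetical decreasing stand-in for Eq. (19); NOT the chapter's definition."""
    return 1.0 / (1.0 + x)


# Usage with made-up statistics and unit weights.
value = fuzzy_sets_fitness(corr=10, incorr=1, unclass=4, f_sets=3,
                           psi=psi_placeholder, beta=1.0, chi=1.0, delta=1.0)
```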
Fig. 16. The scheme of the chromosome at the EAFuzzySets level
Fig. 17. The comparison of the extracted rules for REX Pitt and Full–RE for the IRIS data set
Experimental studies performed with the Iris, Wisconsin Breast Cancer and Wine data sets have shown that the REX methods give satisfactory results: a small number of comprehensible rules. The comparison with Full–RE [34], made on the basis of the Iris data divided into two partitions (89 patterns in the learning partition and 61 patterns in the testing one), has shown that REX Pitt and Full–RE give comparable results in terms of fidelity (Fig. 17). One can observe that the rules extracted by REX Pitt are very similar to those extracted by Full–RE with respect to the number of rules and the number of premises. Also, the boundary values for each attribute in the rules extracted by Full–RE are very close to the centres and widths of the fuzzy sets in the premises of the rules extracted by REX. Because the REX methods extract fuzzy rules, they express knowledge in a way that is more natural for humans. REX Pitt produces a smaller number of rules than the other methods. REX Michigan seems to be a worse solution. The greatest problem with this approach is that it creates too many rules. The algorithm does not ensure the optimization of the number of rules. It searches for solutions where a single rule covers as many examples as possible from the proper class and as few examples as possible from the other classes.

GEX - crisp rule extraction

A much more successful application of the Michigan approach to rule extraction from neural networks may be found in [20] and [21]. Both papers describe GEX, a method in which one individual represents a crisp rule, but the final set of rules is optimized by heuristics, and the parameters of the method make it possible to influence the number of rules in the final set. The idea of GEX is shown in Fig. 18. It is composed of as many evolutionary
Fig. 18. The idea of GEX performance
algorithms as there are classes in the classification problem solved by the neural network. The individuals in a subpopulation can evolve independently or, optionally, migration of individuals is possible. Each species contains individuals corresponding to one class recognized by the neural network. One individual in a subpopulation encodes one rule. The form of the rule is described by (6). A premise in a rule expresses a condition that has to be satisfied by the value of the corresponding input of the neural network in order to classify the pattern to the class indicated by the conclusion of the rule. The form of the premise depends on the type of the attribute included in the pattern. In GEX the following types of attributes (features X_j) are considered:
• real V_r: X_j ∈ V_r ⇔ X_j ∈ ℝ. Two subtypes are distinguished:
  – continuous V_c: the domain is defined by a range of real numbers, X_j ∈ V_c ⇔ d(X_j) = (x_jmin; x_jmax) ⊂ ℝ;
  – discrete V_d: the domain is a countable set W_d of values w_i on which an order relation is defined, X_j ∈ V_d ⇔ d(X_j) = {w_i ∈ ℝ, i = 1, ..., k, k ∈ ℕ};
• nominal V_w: the domain is a set of discrete unordered values, X_j ∈ V_w ⇔ d(X_j) = {w_1, w_2, ..., w_w}, where w_i is a symbolic value;
• binary V_b: the domain is composed of only two values, True and False, X_j ∈ V_b ⇔ d(X_j) = {True, False}.

For attributes of the real type (discrete and continuous) the following premises are covered:
• x_i < value1,
• x_i < value2,
• x_i > value1,
• x_i > value2,
• value1 < x_i ∧ x_i < value2 ⇔ x_i ∈ (value1; value2),
• x_i < value1 ∨ value2 < x_i.

For a discrete attribute, the inequalities (≤, ≥) are used instead of (<, >). It is assumed that value1 < value2. For enumerative (nominal) attributes only two relation operators are used, {=, ≠}, so the premise has one of the following forms:
• x_i = value_i,
• x_i ≠ value_i.

For boolean attributes there is only one relation operator, =, so the premise can take the following forms:
• x_i = True,
• x_i = False.

All rules in one subpopulation have an identical conclusion. The evolutionary algorithm is performed in a classical way (Fig. 19). The only difference between the classical evolutionary algorithm and the proposed one lies in the evaluation of individuals, which requires a decision system based on the processing of rules. After each generation the rules are evaluated using the set of patterns, which are processed by the neural network and by the rules from the current population (Fig. 11). To realize this, a decision system
Fig. 19. The schema of evolutionary algorithm in GEX
consisting in searching for the rules that cover a given pattern is implemented. A rule covers a given example according to Definition 1.

Definition 1. The rule r_i covers a given example p when, for all values of the attributes present in this pattern, the premises in the rule are true.

The comparison of the results of classification is the basis for the evaluation of each rule, which is expressed by the value of a fitness function. An evolutionary algorithm working in the presented way looks for the best rules, i.e. those that cover as many patterns as possible. In this case there is a risk that some patterns would never be covered by any rule. To solve this problem, a niche mechanism is implemented in GEX. The final set of rules is created on the basis of the best rules found by the evolutionary algorithm, but some heuristics are also applied in order to optimize it. The details of the evolutionary algorithm and of these heuristics are presented below.

Figure 20 shows the general scheme of the genotype in GEX. It is composed of chromosomes corresponding to the inputs of the neural network and a single gene of conclusion. A chromosome consists of a gene that acts as a flag and of genes encoding the premise, which are specific to the type of the attribute the premise refers to. The flag allows rules to have different lengths, because a premise is included in the body of the rule only if its flag is set to 1. The chromosome is designed according to the type of the attribute (Fig. 21) in order to reflect the condition in the premise. For a real attribute the chromosome consists of the code of the relation operator and two values determining the limits of the range (Fig. 21c). For a nominal attribute there is a code of the operator and a value (Fig. 21b). Fig. 21a represents a chromosome for a binary attribute. Besides the flag gene, it consists of one gene referring to the value of the attribute.
Fig. 20. Scheme of a chromosome in GEX
Fig. 21. The designed genes in GEX
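The genotype layout and the covering test of Definition 1 can be sketched as below. This is a minimal sketch under stated assumptions, not the GEX implementation: the dictionary layout, the attribute names and the operator codes 1 to 6 (taken in the order in which the premises for real attributes are listed in the text) are all hypothetical, and only the strict inequalities used for continuous attributes are modelled.

```python
# Operator codes 1-6 follow the order of the premise list for real attributes;
# this numbering is an assumption made for the sketch.
REAL_OPERATORS = {
    1: lambda x, v1, v2: x < v1,
    2: lambda x, v1, v2: x < v2,
    3: lambda x, v1, v2: x > v1,
    4: lambda x, v1, v2: x > v2,
    5: lambda x, v1, v2: v1 < x < v2,
    6: lambda x, v1, v2: x < v1 or x > v2,
}


def premise_holds(chromosome, x):
    """Evaluate the premise encoded by one chromosome for attribute value x."""
    kind = chromosome["kind"]
    if kind == "real":
        return REAL_OPERATORS[chromosome["op"]](x, chromosome["v1"], chromosome["v2"])
    if kind == "nominal":
        equal = x == chromosome["value"]
        return equal if chromosome["op"] == "=" else not equal
    if kind == "binary":
        return x == chromosome["value"]
    raise ValueError(f"unknown attribute type: {kind}")


def covers(genotype, example):
    """Definition 1: every active premise (flag == 1) must be true."""
    return all(
        premise_holds(chrom, example[name])
        for name, chrom in genotype["chromosomes"].items()
        if chrom["flag"] == 1
    )


# Hypothetical rule: IF 1.0 < length < 2.5 AND winged = True THEN class_a.
# The 'colour' premise is switched off by its flag and is ignored.
rule = {
    "chromosomes": {
        "length": {"flag": 1, "kind": "real", "op": 5, "v1": 1.0, "v2": 2.5},
        "colour": {"flag": 0, "kind": "nominal", "op": "=", "value": "red"},
        "winged": {"flag": 1, "kind": "binary", "value": True},
    },
    "conclusion": "class_a",
}

inside = covers(rule, {"length": 1.4, "colour": "blue", "winged": True})
outside = covers(rule, {"length": 3.0, "colour": "red", "winged": True})
```

Keeping the flag next to the premise genes means that switching a premise on or off never disturbs the encoded operator and limits, which is what allows rules of different length within a fixed-size genotype.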
The initial population is created randomly with the number of individuals equal to StartSize. The basic operators used in GEX are crossover and mutation. They are applied after a selection of individuals that creates a pool of parents for the offspring population. Roulette wheel selection is used. The individuals that are not chosen to become parents are moved to the pool of weak individuals (Fig. 18). In each generation the size of the population is decreased by 1. When the population size reaches the value defined by the parameter MinSize, the migration operator becomes active. It takes individuals from the pool of weak individuals (Fig. 18) to increase the size of the population to Nsize. If migration is inactive, a kind of macromutation is used.

In the experimental studies two-point crossover was used. A couple of parent genotypes is chosen with the probability p_cw, then two points are chosen at random and the information between them is exchanged. These points can only lie between chromosomes; it is not allowed to cut an individual between genes in the middle of a chromosome.

The mutation is specifically designed for each type of gene and depends strongly on the type of the chromosome (premise) it refers to. It changes the information contained in a gene. The following parameters define this operator:
• p_mu-op – the probability of mutation of the relation operator or binary value,
• p_mu-range – the probability of mutation of the range limits,
• p_mu-act – the probability of mutation of the value for genes in chromosomes for nominal attributes,
• r_ch – the range change.

The mutation of the flag A changes its current value to the opposite one with the probability p_mu-op. The mutation of the gene containing the value in the chromosome of a binary attribute changes the gene value to its opposite with the probability p_mu-op (it flips the value True to False or False to True). The mutation of the gene Operator, independently of the chromosome, changes the operator to another operator defined for this type of premise with the probability p_mu-op. The mutation of the gene referring to the value in chromosomes for nominal attributes changes the current value to another one specified for this type with the probability p_mu-act. The mutation of the genes encoding the limits of a range in chromosomes for real attributes changes value1 and value2. It is realized differently for continuous and discrete values. For continuous attributes the limits are changed by adding a value from the range (22):

$$ \left( -(x_{i\max} - x_{i\min}) \cdot r_{ch};\ (x_{i\max} - x_{i\min}) \cdot r_{ch} \right) \tag{22} $$

where x_imax and x_imin are, respectively, the maximal and minimal values of the i-th attribute, and r_ch is the parameter that defines how much the limits of the range can be changed. For the discrete type a new value is chosen at random from the values defined for this type.

The assumed fitness function is defined as the weighted average of the following parameters: accuracy (acc), classCovering (classCov), inaccuracy (inacc), and comprehensibility (compr):

$$ Fun = \frac{A \cdot acc + B \cdot inacc + C \cdot classCov + D \cdot compr}{A + B + C + D} \tag{23} $$
The weights (A, B, C, D) are implemented as parameters of the application. Accuracy measures how well the rule mimics the knowledge contained in the neural network. It is defined by (24):

$$ acc = \frac{correctFires}{totalFiresCount} \tag{24} $$

inacc is a measure of the incorrect classification made by the rule. It is expressed by Eq. (25):

$$ inacc = \frac{missingFires}{totalFiresCount} \tag{25} $$

The parameter classCovering, abbreviated as classCov, gives the fraction of all patterns from a given class that are covered by the evaluated rule. It is formally defined by Eq. (26):

$$ classCov = \frac{correctFires}{classExamplesCount} \tag{26} $$

where classExamplesCount is the number of patterns from a given class. The last parameter, comprehensibility, abbreviated as compr, is calculated on the basis of Eq. (27):

$$ compr = \frac{maxConditionCount - ruleLength}{maxConditionCount - 1} \tag{27} $$

where ruleLength is the number of premises of the rule and maxConditionCount is the maximal number of premises in the rule. In other words, it is the number of inputs of the neural network.

During the evolution the set of rules is updated. Some rules are added and some are removed. In each generation, individuals with accuracy and classCovering greater than minAccuracy and minClassCovering are the candidates for updating the set of rules. The values minAccuracy and minClassCovering are parameters of the method. Rules are added to the set of rules when they are more general than the rules currently in the set, according to Definition 2.

Definition 2. Rule r1 is more general than rule r2 when the set of examples covered by r2 is a subset of the set of examples covered by r1. If the rules r1 and r2 cover the same examples, the rule that has the bigger fitness value is assumed to be more general.
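Equations (23) to (27) and the test of Definition 2 can be put together in a short sketch. It is not the GEX code: the function names, the handling of a rule that never fires, and the weights in the usage line are assumptions. In particular, the text does not state the sign of B; a negative value is used below only so that inaccuracy lowers the score.

```python
def gex_fitness(correct_fires, missing_fires, total_fires,
                class_examples, rule_length, max_conditions, weights):
    """Weighted average of Eq. (23), built from Eqs. (24)-(27)."""
    a, b, c, d = weights
    # Eqs. (24) and (25); a rule that never fired scores 0 for both (assumption).
    acc = correct_fires / total_fires if total_fires else 0.0
    inacc = missing_fires / total_fires if total_fires else 0.0
    class_cov = correct_fires / class_examples              # Eq. (26)
    # Eq. (27); requires at least two network inputs (max_conditions > 1).
    compr = (max_conditions - rule_length) / (max_conditions - 1)
    return (a * acc + b * inacc + c * class_cov + d * compr) / (a + b + c + d)


def is_more_general(covered_1, covered_2, fitness_1, fitness_2):
    """Definition 2: is rule 1 more general than rule 2?

    covered_1, covered_2 are sets of indices of the examples each rule covers.
    """
    if covered_1 == covered_2:
        return fitness_1 > fitness_2       # same coverage: higher fitness wins
    return covered_2 <= covered_1          # rule 2 covers a subset of rule 1


# Hypothetical rule: fired 10 times, 8 of them correctly, class of 20 patterns,
# 2 premises out of 5 network inputs.
score = gex_fitness(8, 2, 10, 20, 2, 5, weights=(2.0, -1.0, 1.0, 1.0))
```

Note that Eq. (27) is undefined for a network with a single input, so a practical implementation would need to guard that case.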
Furthermore, the less general rules are removed. After all patterns have been presented, usability is calculated for each rule according to Eq. (28):

$$ usability = \frac{usabilityCount}{examplesCount} \tag{28} $$

All rules with usability less than minUsability, which is a parameter set by the user, are removed from the set of rules. We can say that the optimization of the set of rules consists in removing less general and rarely used rules and replacing them with more general rules from the current generation. The following statistics characterize the quality of the set of rules. The value covering defines the percentage of classified examples among all examples used in the evaluation of the set of rules (Eq. 29):

$$ covering = \frac{classifiedCount}{examplesCount} \tag{29} $$

Fidelity, expressed in Eq. (30), describes the percentage of correctly classified examples (according to the neural network answer) among all examples classified by the set of rules:

$$ fidelity = \frac{correctClassifiedCount}{classifiedCount} \tag{30} $$

Covering and fidelity are two measures of the quality of the acquired set of rules that describe its accuracy and generalization. Additionally, the performance (Eq. 31) is defined, which gives the percentage of correctly classified examples among all examples used in the evaluation process:

$$ performance = \frac{correctClassifiedCount}{examplesCount} \tag{31} $$
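A small sketch shows how the three statistics of Eqs. (29) to (31) relate to each other. The function name and the convention of using `None` for an example that no rule classifies are assumptions made here for illustration.

```python
def rule_set_quality(rule_answers, nn_answers):
    """Return (covering, fidelity, performance) as in Eqs. (29)-(31).

    rule_answers -- class given by the rule set for each example,
                    or None when no rule covers the example
    nn_answers   -- class given by the neural network for each example
    """
    examples = len(nn_answers)
    classified = sum(answer is not None for answer in rule_answers)
    correct = sum(
        answer is not None and answer == target
        for answer, target in zip(rule_answers, nn_answers)
    )
    covering = classified / examples
    # With nothing classified, fidelity is undefined; 0.0 is returned by convention.
    fidelity = correct / classified if classified else 0.0
    performance = correct / examples
    return covering, fidelity, performance


# Five hypothetical examples: one is left unclassified, one is misclassified.
covering, fidelity, performance = rule_set_quality(
    rule_answers=[0, 1, None, 1, 0],
    nn_answers=[0, 1, 1, 0, 0],
)
```

It follows directly from the definitions that performance = covering · fidelity whenever at least one example is classified, so a rule set can reach high fidelity with low performance simply by leaving many examples unclassified.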
Table 1 shows the results of experimental studies for benchmark data files. The aim of the experiments was to test the quality of the rule extraction made by GEX for data sets with different types of attributes. The procedure was as follows. First, the data set was divided into 10 different parts. Each time, 9 parts were used for training and the remaining part was used for testing. This was repeated 25 times (which is equivalent to performing 10-fold cross validation 2.5 times). Then the results were averaged, and the mean and the standard deviation were calculated. The 15 tested data sets come from the UCI repository. They represent a spectrum of data sets with different types of attributes. The files Agaricus lepiota, Breast cancer, Tictactoe and Monk1 have nominal attributes only. WDBC, Vehicle, Wine and Liver are examples of files with continuous attributes. Discrete attributes are in the Dermatology file, logical ones in the Zoo data file. Mixed attributes are in the files Australian credit, Heart and Ecoli. In these experiments the values of the parameters were set as follows:
• mutation = 0.2,
• crossover = 0.5,
• number of individuals = 40,
• minimal number of individuals = 20,
• type of crossover = two point,
• operators for real attributes = 1, 2, 3, 4, 5,
• minAccuracy = 1,
• minUsability = 1,
• max number of generation = 2000,
• niching and migration = on.

Table 1. The results of experiments with GEX in terms of the number of generations (Ngenerations), the number of rules (Nrules), covering and fidelity for data sets from the UCI repository with different types of attributes, using 10-fold cross validation (average value ± standard deviation is shown)

file             Ngenerations      Nrules          covering         fidelity
Agaricus         111.6 ± 43.2      27.48 ± 5.42    0.985 ± 0.006    1.00 ± 0.0000
Breast Cancer    220.8 ± 200.1     19.64 ± 2.33    0.981 ± 0.02     0.982 ± 0.014
Tictactoe        920.9 ± 218       47.12 ± 7.61    0.790 ± 0.05     0.979 ± 0.022
Monk-1           47.3 ± 18.8       9.16 ± 1.40     0.983 ± 0.041    1.000 ± 0.997
WDBC             1847.9 ± 141.8    28.40 ± 4.51    0.772 ± 0.110    0.977 ± 0.024
Vehicle          1651.8 ± 296.3    16.88 ± 2.0     0.804 ± 0.134    0.723 ± 0.152
Pima             1459 ± 500.4      27.8 ± 3.56     0.889 ± 0.062    0.970 ± 0.022
Wine             120.4 ± 30.3      10.04 ± 1.84    0.953 ± 0.042    0.972 ± 0.042
Liver            1867.4 ± 127.7    33.68 ± 4.05    0.199 ± 0.080    0.685 ± 0.166
Dermatology      1680.3 ± 344.4    24.8 ± 3.12     0.690 ± 0.061    0.985 ± 0.026
Austral. credit  1660.5 ± 321.9    67.52 ± 4.98    0.643 ± 0.056    0.899 ± 0.045
Heart            1420.8 ± 449.7    17.28 ± 2.01    0.910 ± 0.047    0.877 ± 0.057
E. coli          1488.2 ± 465.2    18.2 ± 1.94     0.914 ± 0.046    0.925 ± 0.055
Hypothyroid      561.7 ± 425.9     17.56 ± 3.36    0.978 ± 0.008    0.996 ± 0.004
Zoo              357.6 ± 433.2     8.36 ± 0.76     0.968 ± 0.059    0.963 ± 0.059
The stopping criterion was set as achieving 98% performance, or 250 generations without any improvement of the performance, or exceeding 2000 generations. The results of these extensive tests allow us to conclude that GEX is a general method that enables acquiring rules for classification problems with various types of attributes. However, the results depend strongly on the parameters of the method, which need to be tuned for a given data set. This refers to the number of individuals and the number of generations needed to obtain the assumed accuracy and performance of the acquired rules. It can be perceived as a certain drawback of the proposed method.
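The three-part stopping rule stated at the beginning of this paragraph can be written as a single predicate. The sketch below is an assumption about how the conditions combine (a plain logical OR, with `generation` counting completed generations); the function and argument names are hypothetical.

```python
def should_stop(performance, stagnant_generations, generation,
                target=0.98, patience=250, max_generations=2000):
    """Stop when any of the three conditions from the text holds.

    performance          -- current performance of the rule set, Eq. (31)
    stagnant_generations -- generations since the performance last improved
    generation           -- number of generations completed so far
    """
    return (performance >= target
            or stagnant_generations >= patience
            or generation >= max_generations)
```

For example, `should_stop(0.5, 10, 100)` is false, while reaching 98% performance at any generation ends the run immediately.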
Other tests, made with the following data sets from UCI: SAT (36 continuous attributes, 4435 patterns), Artificial characters (16 discrete attributes, 16 000 patterns), Sonar (60 continuous attributes, 208 patterns), Animals (72 nominal and continuous attributes, 256 patterns), German Credit (21 nominal, discrete and continuous attributes, 1000 patterns) and Shuttle (9 continuous attributes, 43 500 patterns), confirmed that the method is scalable.

The comprehensibility of the rules acquired by GEX was tested as well. For this purpose the LED data set was chosen, because its patterns are easily visualized. The simple version of this data file contains seven attributes representing the segments of an LED display, as shown in Fig. 22. In the applied file these 7 attributes were extended by 17 additional attributes, which are set at random to 0 or 1. To keep the task nontrivial, 1% noise was introduced to the proper attributes representing the segments of the LED (1% of the values of the proper attributes were changed to the opposite value). This problem was not easy for the neural network to classify either: it was trained to a performance of 80%. An example of the set of rules extracted after several runs of the GEX application with different parameter values is presented in Table 2. It can be noticed that only the rules for digit 3 contain premises referring to the additional, artificial attributes. For the remaining digits the rules are perfect. They are visualized in Fig. 23 (except for the rules for digit 3). During the tests one could observe that the time of rule extraction and the final result (the rules for all classes and the number of rules for each class) depended strongly on the first rule found for one of the similar digits (6, 9, 8). In these experiments the crucial parameters were minAccuracy and minUsability. For noisy data, minAccuracy equal to 0.9 or even 0.8 gave good results. The greater the value of minUsability, the more general the rules in the final set.

Fig. 22. The description of segments in LED

Table 2. The example of the set of rules for LED data set
IF not up-left AND mid-center AND down-right AND att16 THEN 3
IF not up-left AND mid-center AND down-right AND not att16 THEN 3
IF not up-left AND down-right AND down-center AND att9 THEN 3
IF not mid-center AND down-left THEN 0
IF not up-center AND not mid-center THEN 1
IF not down-right THEN 2
IF not up-center AND mid-center THEN 4
IF not up-right AND not down-left THEN 5
IF not up-right AND down-left THEN 6
IF up-center AND not down-center THEN 7
IF up-left AND up-right AND mid-center AND down-left THEN 8
IF up-left AND up-right AND not down-left AND down-center THEN 9

Fig. 23. Visualization of LED data set - the first row, visualization of rules extracted for LED data set - the second row

Table 3 presents a comparison of the results obtained by GEX and by other methods of rule extraction from neural networks. The results do not allow us to state that GEX is always better than its counterparts (e.g. see the Iris data set for Santos's method), but in most cases GEX surpasses the other methods. Besides, it delivers rules for all classes without the need for a default rule.

Table 3. The comparison of GEX performance to performance of other methods of rule extraction from neural networks for different data sets

data set                  test method                                  GEX     compared method         result
Breast cancer Wisconsin   10-fold cross validation                     97.81   Mozer's method [1]      95.2
Iris                      70% patterns = training set, 30% = testing   92.48   Santos's method [31]    93.33
Breast cancer Wisconsin   70% patterns = training set, 30% = testing   90.63   Santos's method [31]    85.32
Breast cancer Wisconsin   10-fold cross validation                     97.81   Bologna's method [5]    96.68
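The claim that the rules of Table 2 are perfect for noise-free digits can be checked mechanically. The sketch below encodes the twelve rules and tests them against ideal seven-segment patterns. The segment layout of each digit is the standard seven-segment convention and is assumed here to match Fig. 22; the data structures and function names are likewise illustrative.

```python
from itertools import product

SEGMENTS = ("up-center", "up-left", "up-right", "mid-center",
            "down-left", "down-right", "down-center")

# Lit segments of each digit on a standard seven-segment display (assumed layout).
DIGITS = {
    0: {"up-center", "up-left", "up-right", "down-left", "down-right", "down-center"},
    1: {"up-right", "down-right"},
    2: {"up-center", "up-right", "mid-center", "down-left", "down-center"},
    3: {"up-center", "up-right", "mid-center", "down-right", "down-center"},
    4: {"up-left", "up-right", "mid-center", "down-right"},
    5: {"up-center", "up-left", "mid-center", "down-right", "down-center"},
    6: {"up-center", "up-left", "mid-center", "down-left", "down-right", "down-center"},
    7: {"up-center", "up-right", "down-right"},
    8: set(SEGMENTS),
    9: {"up-center", "up-left", "up-right", "mid-center", "down-right", "down-center"},
}

# Table 2 as (attributes that must be on, attributes that must be off, digit).
RULES = [
    ({"mid-center", "down-right", "att16"}, {"up-left"}, 3),
    ({"mid-center", "down-right"}, {"up-left", "att16"}, 3),
    ({"down-right", "down-center", "att9"}, {"up-left"}, 3),
    ({"down-left"}, {"mid-center"}, 0),
    (set(), {"up-center", "mid-center"}, 1),
    (set(), {"down-right"}, 2),
    ({"mid-center"}, {"up-center"}, 4),
    (set(), {"up-right", "down-left"}, 5),
    ({"down-left"}, {"up-right"}, 6),
    ({"up-center"}, {"down-center"}, 7),
    ({"up-left", "up-right", "mid-center", "down-left"}, set(), 8),
    ({"up-left", "up-right", "down-center"}, {"down-left"}, 9),
]


def classify(active):
    """Return the set of digits whose rules fire for the given active attributes."""
    return {digit for on, off, digit in RULES
            if on <= active and not (off & active)}


def rules_are_exact():
    """Check every clean digit under all settings of the noise attributes att9, att16."""
    for digit, lit in DIGITS.items():
        for att9, att16 in product((False, True), repeat=2):
            active = set(lit)
            if att9:
                active.add("att9")
            if att16:
                active.add("att16")
            if classify(active) != {digit}:
                return False
    return True
```

Under these assumptions every clean digit fires only rules with its own conclusion, whatever the values of att9 and att16, because the first two rules for digit 3 together cover both values of att16.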
6 Conclusion

Neural networks are very efficient in solving various problems, but they are not able to explain their answers or to present the gathered knowledge in a comprehensible way. In recent years numerous methods of rule extraction have been developed. Two main approaches are used, namely the global one, which treats a network as a black box, and the local one, which examines its structure. The purpose of these algorithms is to produce a set of rules that describes a network's performance with the highest possible fidelity, taking into account its ability to generalise, in a comprehensible way. Because the problem of rule extraction is very complex, using evolutionary algorithms for this purpose is increasingly popular, especially for networks that solve classification problems. In this chapter examples of their usage were presented for both local and global rule extraction. Effective application of the local approach depends on the number of neurons in the net, so the existing methods tend to limit the architecture of the neural network. An evolutionary algorithm can be applied to cluster the activations of hidden neurons and to substitute a cluster of neurons by one neuron. In other applications evolutionary algorithms optimize the structure of a neural network. Global methods treat a neural network as a black box, which delivers the training examples for the algorithm acquiring the rules. In this group three different methods based on evolutionary algorithms were presented. The experimental studies presented in this chapter show that methods of rule extraction from neural networks based on evolutionary algorithms are not only a promising idea: they can deliver fuzzy or crisp rules that are accurate and comprehensible for humans in an acceptable time. Because of the limited space of this chapter, the global methods using multiobjective optimization in the Pareto sense are not presented here. Interested readers are referred to [24] and [22].
References

1. Alexander JA, Mozer M (1999) Template-based procedures for neural network interpretation. Neural Netw 12:479–498
2. Andrews R, Diederich J, Tickle A (1995) Survey and critique of techniques for extracting rules from trained artificial neural networks. Knowl-Based Syst 8(6):373–389
3. Andrews R, Geva S (1994) Rule extraction from constrained error back propagation MLP. In: Proceedings of 5th Australian conference on neural networks, Brisbane, Queensland, pp 9–12
4. Blake CC, Merz C (1998) UCI Repository of Machine Learning Databases. University of California, Irvine, Department of Information and Computer Sciences
5. Bologna G (2000) A study on rule extraction from neural network applied to medical databases. In: The 4th European conference on principles and practice of knowledge discovery
6. Castillo L, González A, Pérez R (2001) Including a simplicity criterion in the selection of the best rule in a genetic algorithm. Fuzzy Sets Syst 120:309–321
7. Cichocki A, Unbehauen R (1993) Neural networks for optimization and signal processing. Wiley, London
8. Eiben A, Aarts E, Hee K (1991) Parallel problem solving from nature. Chapter Global convergence of genetic algorithms: a Markov chain analysis, Springer, Berlin Heidelberg New York, p 412
9. Eiben AE, Smith J (2003) Introduction to evolutionary computing. Natural computing series, Springer, Berlin Heidelberg New York
10. Fu X, Wang L (2001) Rule extraction by genetic algorithms based on a simplified RBF neural network. In: Proceedings congress on evolutionary computation, pp 753–758
11. Haddadnia J, Ahmadi M, Faez P (2002) A hybrid learning RBF neural network for human face recognition with pseudo Zernike moment invariant. In: IEEE international joint conference on neural network
12. Hayashi Y, Setiono R, Yoshida K (2000) Learning M of N concepts for medical diagnosis using neural networks. J Adv Comput Intell 4:294–301
13. Hayward R, Ho-Stuart C, Diederich J, et al. (1996) RULENEG: extracting rules from trained ANN by stepwise negation. Technical report, Neurocomputing Research Centre, Queensland University of Technology, Brisbane, Qld
14. Hruschka ER, Ebecken N (2000) A clustering genetic algorithm for extracting rules from supervised neural network models in data mining tasks. IJCSS 1(1)
15. Jin Y, Sendhoff B, Koerner E (2005a) Evolutionary multi-objective optimization for simultaneous generation of signal-type and symbol-type representations. In: The third international conference on evolutionary multi-criterion optimization, pp 752–766
16. Jin Y, Sendhoff B, Korner E (2005b) Evolutionary multiobjective optimization for simultaneous generation of signal type and symbol type representation. In: EMO 2005 LNCS, pp 752–766
17. Keedwell E, Narayanan A, Savic D (1999) Using genetic algorithm to extract rules from trained neural networks. In: Proceedings of the genetic and evolutionary computing conference, pp 793–804
18. Keedwell E, Narayanan A, Savic D (2000) Creating rules from trained neural networks using genetic algorithms. IJCSS 1(1):30–43
19. Lu H, Setiono R, Liu H (1995) NeuroRule: a connectionist approach to data mining. In: Proceedings of the 21st conference on very large databases, Zurich, pp 478–489
20. Markowska-Kaczmar U (2005) The influence of parameters in evolutionary based rule extraction method from neural network. In: Proceedings of 5th international conference on intelligent systems design and applications, pp 106–111
21. Markowska-Kaczmar U, Chumieja M (2004) Discovering the mysteries of neural networks. Int J Hybrid Intell Syst 1(3/4):153–163
22. Markowska-Kaczmar U, Mularczyk K (2006) GA-based pareto optimization, Vol. 16 of Studies in computational intelligence. Springer, Berlin Heidelberg New York
23. Markowska-Kaczmar U, Trelak W (2005) Fuzzy logic and evolutionary algorithm - two techniques in rule extraction from neural networks. Neurocomputing 63:359–379
24. Markowska-Kaczmar U, Wnuk-Lipinski P (2004) Rule extraction from neural network by genetic algorithm with pareto optimization. In: Rutkowski L (ed.) Artificial intelligence and soft computing, pp 450–455
25. Michalewicz Z (1996) Genetic algorithms + Data structures = Evolution programs. Springer, Berlin Heidelberg New York
26. Mitra S, Yoichi H (2000) Neuro-fuzzy rule generation: survey in soft computing framework. IEEE Trans Neural Netw 11(3):748–768
27. Omidvar O, Van der Smagt P (eds.) (1997) Neural systems for robotics. Academic, New York
28. Palade V, Neagu DC, Patton RJ (2001) Interpretation of trained neural networks by rule extraction. Fuzzy days 2001, LNCS 2206, pp 152–161
29. Reil T, Husbands P (2002) Evolution of central pattern generators for bipedal walking in a real-time physics environment. IEEE Trans Evol Comput 6(2):159–168
30. Robert C, Gaudy JF, Limoge A (2002) Electroencephalogram processing using neural network. Clin Neurophysiol 113(5):694–701
31. Santos R, Nievola J, Freitas A (2000) Extracting comprehensible rules from neural networks via genetic algorithms. Symposium on combinations of evolutionary computation and neural network 1:130–139
32. Setiono R, Leow WK, Zurada J (2002) Extraction of rules from artificial neural networks for nonlinear regression. IEEE Trans Neural Netw 13(3):564–577
33. Siponen M, Vesanto J, Simula O, Vasara P (2001) An approach to automated interpretation of SOM. In: Proceedings of workshop on self-organizing map 2001 (WSOM2001), pp 89–94
34. Taha I, Ghosh J (1999) Symbolic interpretation of artificial neural networks. IEEE Trans Knowl Data Eng 11(3):448–463
35. Thrun SB (1995) Advances in neural information processing systems. MIT, San Mateo, CA
36. Van der Zwaag B-J (2001) Handwritten digit recognition: a neural network demo. In: Computational intelligence: theory and applications, Vol. 2206 of Springer LNCS. Dortmund, Germany, pp 762–771
37. Widrow B, Rumelhart DE, Lehr M (1994) Neural networks: applications in industry, business and science. Commun ACM 37(3):93–105
Cluster-wise Design of Takagi and Sugeno Approach of Fuzzy Logic Controller

Tushar and Dilip Kumar Pratihar

Summary. We have a natural quest to know the input-output relationships of a process. A fuzzy logic controller (FLC) is an appropriate choice for tackling this problem if there are imprecisions and uncertainties associated with the data. The present chapter deals with the Takagi and Sugeno approach of FLC, in which a better accuracy is generally obtained compared to the Mamdani approach, but at the cost of interpretability. For most real-world processes, the input-output relationships might be nonlinear in nature. Moreover, the degree of non-linearity might not be the same over the entire range of the variables. Thus, the same set of response equations (obtained through statistical regression analysis for the entire range of the variables) might not hold equally well in different regions of the variable space. Realizing this, an attempt has been made to cluster the data based on their mutual similarity, and then cluster-wise linear regression analysis has been carried out to determine the response equations. Moreover, attempts have been made to develop the Takagi and Sugeno approach of FLC cluster-wise, by utilizing the above linear response equations as the consequent part of the rules, which have been further optimized to improve the performance of the controllers. In addition, two types of membership function distributions, namely linear (i.e., triangular) and non-linear (i.e., 3rd order polynomial), have been considered for the input variables and optimized during the training. The performances of the three developed approaches (Approach 1: cluster-wise linear regression; Approach 2: cluster-wise Takagi and Sugeno model of FLC with linear membership function distributions for the input variables; Approach 3: cluster-wise Takagi and Sugeno model of FLC with nonlinear (polynomial) membership function distributions for the input variables) have been tested on two physical problems. Approach 3 is found to outperform the other two approaches.
List of Symbols and Abbreviations

$a$: Abrasive mesh size (micro-m)
$a_0, a_1, \ldots, a_p$: Coefficients
$A_1, A_2, \ldots, A_p$: Membership function distributions corresponding to the linguistic terms
$c$: % concentration of the abrasive
$C_1$: Welding speed (cm/min)
$C_2$: Wire feed rate (cm/min)
$C_3$: % cleaning
$C_4$: Arc gap (mm)
$C_5$: Welding current (Amp)
$CO_{ij}$: Calculated value of j-th output for i-th case
$D_{ij}$: Euclidean distance between i and j
$\bar{D}$: Mean distance
$E_i$: Entropy of i-th data point
$g_k$: Half base-width of membership function distribution of $x_k$ variable
$G$: Generation number
$h_{ijk}$: Variation of the value of a co-efficient
$n$: No. of clusters
$n$: No. of cycles
$N$: No. of data points
$O_1$: Front height of weld bead (mm)
$O_2$: Front width of weld bead (mm)
$O_3$: Back height of weld bead (mm)
$O_4$: Back width of weld bead (mm)
$p$: No. of input variables
$p_c$: Crossover probability
$p_m$: Mutation probability
$P$: Population size
$q$: No. of output variables
$Q$: No. of fired rules
$R$: Dimension of data points
$R_a$: Surface roughness (micro-m)
$S_{ij}$: Similarity between the data points i and j
$T$: No. of training cases
$[T]$: Hyperspace
$TO_{ij}$: Target value of j-th output for i-th case
$v$: Flow speed of abrasive media (cm/min)
$w^i$: Control action of i-th rule
$x_1, x_2, \ldots, x_p$: Set of input dimensions of a variable
$x_1, x_2, \ldots, x_N$: Data points
$y$: Output
$\alpha$: Constant
$\beta$: Threshold for similarity
$\gamma$: Threshold for determining a valid cluster
$\mu$: Membership function value
FLC: Fuzzy logic controller
GA: Genetic Algorithm
H: High
KB: Knowledge base
L: Low
M: Medium
MRR: Material removal rate (mg/min)

Tushar and D.K. Pratihar: Cluster-wise Design of Takagi and Sugeno Approach of Fuzzy Logic Controller, Studies in Computational Intelligence (SCI) 82, 211–250 (2008) © Springer-Verlag Berlin Heidelberg 2008 www.springerlink.com
1 Introduction

We, human beings, have a natural thirst to gather the input-output relationships of a process, which are necessary, particularly for its on-line control. Several attempts were made in the past to capture these relationships for a number of processes, by using the tools of traditional mathematics (e.g., differential equations and their solutions), statistical analysis (e.g., regression analysis based on experimental data collected in a particular fashion, say full factorial design, fractional factorial design, central composite design, and others), and others. Most real-world problems are so complex that it might be difficult to formulate them in the form of differential equations. Moreover, even if it is possible to determine the differential equation, it could be difficult to get its solution. The situation becomes worse when the input and output variables are associated with imprecision and uncertainty. A fuzzy logic controller (FLC), which works based on Zadeh's fuzzy set theory [1], could be a natural choice to solve the above problem, as it is a potential tool for dealing with imprecision and uncertainty. FLCs are becoming more and more popular nowadays, due to their simplicity, ease of implementation and ability to tackle complex real-world problems. To design an FLC for controlling a process, we try to model artificially the human reasoning used to solve the said problem. The variables are expressed in terms of some linguistic terms (such as VN: very near, N: near, and others) and the degree of belongingness of an element to a class is expressed by its membership function value (which lies between 0 and 1). The rules generally express the relationships among the input and output variables of a process. The performance of an FLC depends on its knowledge base (KB), which consists of both the data base (i.e., membership function distributions) and the rule base.
Two basic approaches of FLC, namely Mamdani Approach [2] and Takagi and Sugeno Approach [3], are generally available in the literature. An FLC developed based on Mamdani Approach may not be accurate enough but it is interpretable. On the other hand, Takagi and Sugeno Approach can yield a more accurate controller compared to that designed based on the Mamdani Approach but at the cost of interpretability. In Mamdani Approach, the crisp value corresponding to the fuzzified output of the controller is determined by using a defuzzification method, which could be computationally expensive, whereas in Takagi and Sugeno Approach, the output is directly expressed as
a function (either linear or nonlinear) of the input variables and as a result of which, the defuzzification is not required at all. The present chapter deals with Takagi and Sugeno approach of FLC. The input-output relationships of a process may not always be linear in nature. Moreover, the degree of non-linearity could be different from one region of the input-output space to another and truly speaking, there might be a number of ups and downs in the above space. Thus, one set of response equations may not be sufficient to represent the input-output relationships for the entire range of the variables. Realizing the above fact, an attempt will be made to cluster the entire space into a number of regions based on similarity. Moreover, FLCs (based on Takagi and Sugeno Approach) will be developed cluster-wise, to hopefully yield better prediction of the outputs. As an FLC does not have an optimization module in-built, an optimization tool is to be used during its training, to improve the performance. Genetic algorithm (GA) [4], a population-based search and optimization tool working based on the mechanics of natural selection and natural genetics, has been utilized extensively, to improve the performance of the FLC and as a result of which, a new field of research called genetic-fuzzy system has emerged. The said field started with the work of Karr [5], Thrift [6] and later on, its principle has been utilized by a number of investigators, to solve a variety of problems. In this system, attempts are made to optimize either the data base or the rule base or the complete KB of the FLC by using a GA. Interested readers may refer to the recent survey on genetic-fuzzy system, carried out by Cordon et al. [7]. There are three basic approaches of the genetic-fuzzy system, such as Pittsburgh Approach [8], Michigan Approach [9] and iterative rule learning approach [10]. 
In Pittsburgh Approach, a GA-string represents the entire rule base of the FLC, whereas in the Michigan Approach, one rule is indicated by a GA-string. The genetic operators are used to modify the KB of the FLC and ultimately the optimal KB is obtained. In an iterative rule learning approach, the GA-string carries information of the rules. The GA through its iterations tries to add some good rules to the rule base and delete the bad rules. Thus, an optimal rule base of the FLC will be evolved gradually. Several encoding schemes have been developed in this context, some of those are mentioned below. Furuhashi et al. [11] developed a variable length decoding method known as the Nagoya Approach, in which the lengths of the chromosomes are not fixed. But, it suffers from some difficulties in the crossover operation. Yupu et al. [12] used a GA to determine appropriate fuzzy rules and the membership function distributions were optimized by using a neural network. More recently, some trials have also been made to automatically design the FLC by using a GA [13, 14]. It is important to mention that the GA might evolve some redundant rules, due to its iterative nature. Some attempts were also made to remove the redundant rules from the rule base. In this context, the work of Ghosh and Nath [15], Hui and Pratihar [16] are important to mention.
Attempts were also made to optimize the Takagi and Sugeno type of FLC. Smith and Comer used the Least Mean Square (LMS) learning algorithm to tune a general Takagi and Sugeno type FLC [17]. The co-efficients of the output functions in the Takagi-Sugeno type FLC were optimized using a GA by Kim et al. [18]. The structure of ANFIS (adaptive-network-based fuzzy inference system) was developed by Jang [19], in which the main aim was to design an optimal Takagi-Sugeno type FLC by using a neural network. The present chapter deals with GA-based (off-line) tuning of the FLCs working based on the Takagi and Sugeno approach. After the training of the FLCs is over, their performances have been tested on two physical problems. The rest of the text is organized as follows: Section 2 explains the principle of the Takagi and Sugeno approach of FLC. Section 3 gives a brief introduction to the GA. The issues related to entropy-based fuzzy clustering and cluster-wise linear regression are discussed in Section 4. The proposed method for cluster-wise design of the FLCs working based on the Takagi and Sugeno approach and their GA-based tuning are explained in Section 5. The performances of the developed approaches are compared among themselves on two physical problems in Section 6. Some concluding remarks are made in Section 7 and the scope for future work is indicated in Section 8.
2 Takagi and Sugeno Approach of FLC

In this approach, a rule of an FLC consists of fuzzy antecedent and functional consequent parts. Thus, a rule (say, i-th) can be represented as follows:

If $x_1$ is $A_1^i$ and $x_2$ is $A_2^i$ ..... and $x_p$ is $A_p^i$ then $y^i = a_0^i + a_1^i x_1 + \ldots + a_p^i x_p$,

where $a_0^i, a_1^i, \ldots, a_p^i$ are the coefficients. A nonlinear system is considered in this way, as a combination of several linear systems. Control action of i-th rule can be determined for a set of inputs $(x_1, x_2, \ldots, x_p)$ like the following.

$$w^i = \mu_{A_1}^i(x_1)\, \mu_{A_2}^i(x_2) \ldots \mu_{A_p}^i(x_p), \tag{1}$$

where $A_1, A_2, \ldots, A_p$ indicate the membership function distributions of the linguistic terms used to represent the input variables and $\mu$ denotes the membership function value. Thus, the combined control action can be determined as follows.

$$y = \frac{\sum_{i=1}^{k} w^i y^i}{\sum_{i=1}^{k} w^i}, \tag{2}$$

where k is the number of fired rules.
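Equations (1) and (2) can be illustrated with a minimal Python sketch. The helper names (`triangular`, `ts_output`), the triangular membership shape and the two example rules are assumptions made only for illustration; they are not taken from the chapter.

```python
from typing import Callable, List, Sequence, Tuple

Membership = Callable[[float], float]
# A rule is (antecedent membership functions, [a0, a1, ..., ap]).
Rule = Tuple[Sequence[Membership], Sequence[float]]


def triangular(left: float, peak: float, right: float) -> Membership:
    """Triangular membership function (assumed shape, left < peak < right)."""
    def mu(x: float) -> float:
        if x <= left or x >= right:
            return 0.0
        if x <= peak:
            return (x - left) / (peak - left)
        return (right - x) / (right - peak)
    return mu


def ts_output(x: Sequence[float], rules: List[Rule]) -> float:
    """Weighted average of the rule consequents, as in equation (2)."""
    num = 0.0
    den = 0.0
    for antecedents, coeffs in rules:
        w = 1.0
        for mu, xk in zip(antecedents, x):
            w *= mu(xk)                      # equation (1): product of memberships
        if w == 0.0:
            continue                         # the rule is not fired
        y_i = coeffs[0] + sum(a * xk for a, xk in zip(coeffs[1:], x))
        num += w * y_i
        den += w
    if den == 0.0:
        raise ValueError("no rule fired for this input")
    return num / den


# Hypothetical single-input example with two rules.
low = triangular(-10.0, 0.0, 10.0)
high = triangular(0.0, 10.0, 20.0)
rules = [([low], [1.0, 2.0]),      # If x1 is low  then y = 1 + 2 x1
         ([high], [10.0, -1.0])]   # If x1 is high then y = 10 - x1
result = ts_output([2.5], rules)   # weights 0.75 and 0.25
```

For the input 2.5 both rules fire, and the output is the weighted average of the two linear consequents, so no separate defuzzification step is needed.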
3 Genetic Algorithm

Genetic algorithm (GA) is a population-based probabilistic search and optimization technique, which works based on the mechanism of natural genetics and Darwin's principle of natural selection (i.e., survival of the fittest) [4]. The concept of GA was introduced by Prof. John Holland of the University of Michigan, Ann Arbor, USA, in the year 1965, but his seminal book was published in the year 1975 [20]. This book lays the foundation of genetic algorithms. It is basically a heuristic search technique, which works using the concept of probability. The working principle of a GA can be explained briefly with the help of Fig. 1.

• A GA starts with a population of initial solutions, chosen at random.
• The fitness/goodness value (i.e., objective function value in case of a maximization problem) of each solution in the population is calculated.
• The population of solutions is then modified by using different operators, namely reproduction, crossover, mutation, and others.
• All the solutions in a population may not be equally good in terms of their fitness values. An operator named reproduction is used to select the good solutions by using their fitness information. Thus, reproduction forms a mating pool, which hopefully consists of good solutions. It is to be mentioned that there could be multiple copies of a particular good solution in the mating pool. The size of the mating pool is kept equal to that of the population of solutions before reproduction. Thus, the average fitness of the mating pool is expected to be higher (for a maximization problem) than that of the pre-reproduction population of solutions. There exists a number of reproduction schemes in the GA-literature, namely proportionate selection (such as roulette-wheel selection), tournament selection, ranking selection, and others [21].
• The mating pairs (also known as parents), which will take part in crossover, are selected at random from the mating pool. In crossover, there is an exchange of properties of the parents, as a result of which new children solutions are created. It is important to note that if the parents are good, the children are expected to be good. There are various types of crossover operators in the GA-literature, such as single-point crossover, two-point crossover, multi-point crossover, uniform crossover, and others [22].
• The word mutation means a sudden change of parameter. It is used for achieving a local change around the current solution. Thus, if a solution gets stuck at a local minimum, this operator may help it to come out of that local basin.
• After reproduction, crossover and mutation are applied to the whole population of solutions, one generation of a GA is completed. Different criteria are used to terminate the program, such as the maximum number of generations, a desired accuracy in the solution, and others.

Various types of the GA are available in the literature, namely binary-coded GA, real-coded GA, micro-GA, messy-GA, and others. In the present work, a binary-coded GA has been used.

Fig. 1. A schematic diagram showing the working cycle of a GA (flowchart: Start; initialize a population of solutions, Gen = 0; check Gen > Max_gen, ending if yes; otherwise assign fitness to all solutions in the population; reproduction; crossover; mutation, Gen = Gen + 1; return to the check)
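The working cycle described above can be sketched as a small binary-coded GA in Python. This is only an illustrative sketch under stated assumptions: binary tournament selection and single-point crossover are picked from the operators the text lists, the function name `run_ga` and its default parameter values are hypothetical, and the toy fitness (the number of ones in the string) stands in for a real objective to be maximized.

```python
import random


def run_ga(fitness, n_bits, pop_size=30, max_gen=50, pc=0.9, pm=0.01, seed=0):
    """Maximize `fitness` over bit strings; returns (best string, its fitness)."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(max_gen):
        scores = [fitness(s) for s in pop]

        # Reproduction: binary tournament selection fills the mating pool,
        # which has the same size as the population.
        def pick():
            i, j = rng.randrange(pop_size), rng.randrange(pop_size)
            return pop[i] if scores[i] >= scores[j] else pop[j]

        pool = [pick() for _ in range(pop_size)]

        # Crossover: single-point, applied to consecutive pairs with probability pc.
        children = []
        for k in range(0, pop_size - 1, 2):
            p1, p2 = pool[k][:], pool[k + 1][:]
            if rng.random() < pc:
                cut = rng.randrange(1, n_bits)
                p1, p2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            children += [p1, p2]
        if len(children) < pop_size:          # odd population size
            children.append(pool[-1][:])

        # Mutation: flip each bit with probability pm.
        for s in children:
            for b in range(n_bits):
                if rng.random() < pm:
                    s[b] ^= 1

        pop = children
        best = max(pop + [best], key=fitness)  # keep the best string seen so far
    return best, fitness(best)


best, score = run_ga(sum, n_bits=20)  # toy fitness: number of ones
```

Keeping the best string seen so far is a convenience of this sketch; the text does not say whether any elitism was used in the chapter.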
4 Clustering and Linear Regression Analysis Using the Clustered Data

Clustering is a technique of putting similar elements into a group (also known as a cluster), so that dissimilar elements fall into different groups. It is important to mention that the clusters could be either hard or soft (fuzzy) in nature. Hard clusters are separated by well-defined fixed boundaries, whereas fuzzy clusters have no fixed boundaries. There are several methods of fuzzy clustering, such as the fuzzy C-means algorithm [23], the entropy-based fuzzy clustering algorithm [24], and others. The present chapter deals with the entropy-based fuzzy clustering algorithm, which is explained below.

4.1 Entropy-based Fuzzy Clustering (EFC)

Let us consider a set of N data points in an R-dimensional hyperspace. Each data point $x_i$ (i = 1, 2, . . . , N) is represented by a vector of R values (i.e., $x_{i1}, x_{i2}, \ldots, x_{iR}$). The following steps are followed to determine the entropy of each point.

• Arrange the data set in rows and columns. Thus, there are N rows and R columns.
• Determine the Euclidean distance between the points i and j as follows.

$$D_{ij} = \sqrt{\sum_{k=1}^{R} (x_{ik} - x_{jk})^2}. \tag{3}$$

It is to be noted that $N^2$ distances are possible among N data points. Out of these $N^2$ distances, ${}^{N}C_2$ distances belong to $D_{ij}$ and $D_{ji}$ each, and there are N diagonal elements of the distance matrix. As $D_{ij}$ is equal to $D_{ji}$ and each of the diagonal elements of the matrix is equal to zero, it is required to calculate ${}^{N}C_2$ distances ($D_{ij}$) only.
• Calculate the similarity $S_{ij}$ between two data points (i and j) by using the following expression.

$$S_{ij} = e^{-\alpha D_{ij}}, \tag{4}$$

where $\alpha$ is a constant determined as follows. We assume a similarity $S_{ij}$ of 0.5, when the distance between two data points $D_{ij}$ becomes equal to the mean distance of all pairs of data points, i.e., $D_{ij} = \bar{D}$, where

$$\bar{D} = \frac{1}{{}^{N}C_2} \sum_{i=1}^{N} \sum_{j>i}^{N} D_{ij}. \tag{5}$$

Substituting $S_{ij} = 0.5$ and $D_{ij} = \bar{D}$ in equation (4) gives $\alpha = \ln 2 / \bar{D}$.
• Calculate the entropy of a data point with respect to another data point from their similarity S as follows.

$$E = -S \log_2 S - (1 - S) \log_2 (1 - S). \tag{6}$$

From the above equation, it can be observed that entropy E becomes equal to 0.0, for a value of S = 0 and S = 1.0. Moreover, entropy E takes the maximum value of 1.0, corresponding to a value of S = 0.5. Thus, entropy of a point with respect to another point varies between 0.0 and 1.0.
• Calculate the total entropy value at a data point $x_i$ with respect to all other data points by using the expression given below.

$$E_i = -\sum_{j \in X,\, j \neq i} \left( S_{ij} \log_2 S_{ij} + (1 - S_{ij}) \log_2 (1 - S_{ij}) \right). \tag{7}$$

It is also important to note that during clustering, the point having the minimum total entropy may be selected as the cluster center. It is so, because (1 − E) indicates the probability of a point for getting selected as a cluster center.
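The entropy calculation above can be sketched with NumPy as follows. The function name `efc_entropy` is hypothetical, and clipping the similarities slightly away from 0 and 1 is an implementation assumption that avoids evaluating log2(0); at S = 0 or S = 1 the pairwise entropy is 0, as stated in the text.

```python
import numpy as np


def efc_entropy(X):
    """Return (D, S, E): pairwise distances, similarities and total entropies."""
    X = np.asarray(X, dtype=float)
    N = X.shape[0]

    # Euclidean distance between every pair of points, equation (3).
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=2))

    # Mean distance over the N-choose-2 distinct pairs; alpha is chosen so
    # that S = 0.5 when D equals the mean distance.
    d_mean = D[np.triu_indices(N, k=1)].mean()
    alpha = np.log(2.0) / d_mean

    S = np.exp(-alpha * D)                       # equation (4)

    # Pairwise entropy, clipped to avoid log2(0); the diagonal (j = i) is
    # excluded from the sum in equation (7).
    s = np.clip(S, 1e-12, 1.0 - 1e-12)
    H = -(s * np.log2(s) + (1.0 - s) * np.log2(1.0 - s))
    np.fill_diagonal(H, 0.0)
    E = H.sum(axis=1)
    return D, S, E


# Two points only: their distance equals the mean distance, so S = 0.5
# and each point has a total entropy of 1.
D, S, E = efc_entropy([[0.0, 0.0], [3.0, 4.0]])
```

The sketch assumes that at least two data points are distinct; otherwise the mean distance is zero and the constant cannot be computed.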
Clustering Algorithm

Let us suppose that [T] is the data set containing N data points and each data point has R dimensions. Thus, [T] is an N × R matrix. The clustering algorithm consists of the following steps.

• Step 1: Calculate entropy $E_i$ for each data point $x_i$ lying in [T].
• Step 2: Locate the $x_i$ which has the minimum $E_i$ value and select it ($x_{i,minimum}$) as the cluster center.
• Step 3: Put $x_{i,minimum}$ and the data points having similarity with $x_{i,minimum}$ greater than β (a threshold value of similarity) in a cluster and remove them from [T].
• Step 4: Check if [T] is empty. If yes, terminate the program, else go to Step 2.

In the above algorithm, entropy has been defined in such a way that a data point which is far away from the rest of the data may also be selected as a cluster center. It may happen so, because a very distant point (from the rest of the data points) may also have a low value of entropy. To overcome this, another parameter γ (in %) is introduced, which is a threshold used to declare a cluster to be a valid one. After the clustering is over, we count the number of data points in each cluster and if this number is greater than or equal to γ% of the total number of data points, we declare this cluster as a valid one. Otherwise, the data points which could not form a valid cluster will be declared as outliers.

4.2 Approach 1: Cluster-wise Linear Regression

Let us suppose that the data set related to input-output relationships of a process has been clustered into n groups by using the above entropy-based fuzzy clustering algorithm. Cluster-wise linear regression is then carried out, to determine the relationship between each output and the inputs separately. Thus, each of the outputs will be expressed as a linear function of the input variables. Let us assume that each R-dimensional data point consists of two parts: p dimensions are used to indicate the inputs and the outputs are represented by q dimensions.
Thus, q response equations will be derived for each cluster and a total of q × n response equations will be obtained for n such clusters. It is to be noted that each of these q × n response equations is a function of p input variables. To determine the outputs for a set of test inputs, we first find the cluster to which this set of inputs will belong, by calculating the Euclidean distance between the said input data and the different cluster centers obtained above. The set of test input data will belong to that cluster, whose center is found to be the closest from that data. Once the belongingness of a set of inputs to a particular cluster is known, the response equations corresponding to that cluster can be utilized to determine the required outputs.
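A compact Python sketch of Steps 1 to 4 followed by Approach 1 is given below. It rests on several assumptions that the chapter does not spell out: the function names are hypothetical, ordinary least squares (`numpy.linalg.lstsq`) stands in for the regression package, and a test input is assigned to a cluster by comparing it with only the first p (input) coordinates of each cluster center.

```python
import numpy as np


def efc_clusters(T, beta, gamma_pct):
    """Cluster the rows of T; returns (valid clusters, outlier indices)."""
    T = np.asarray(T, dtype=float)
    N = len(T)
    D = np.sqrt(((T[:, None, :] - T[None, :, :]) ** 2).sum(axis=2))
    alpha = np.log(2.0) / D[np.triu_indices(N, k=1)].mean()
    S = np.exp(-alpha * D)
    s = np.clip(S, 1e-12, 1.0 - 1e-12)            # avoid log2(0)
    H = -(s * np.log2(s) + (1.0 - s) * np.log2(1.0 - s))
    np.fill_diagonal(H, 0.0)
    E = H.sum(axis=1)                              # Step 1

    remaining = np.ones(N, dtype=bool)
    found = []
    while remaining.any():                         # Step 4
        idx = np.flatnonzero(remaining)
        center = idx[np.argmin(E[idx])]            # Step 2
        # Step 3: the center plus all remaining points more similar than beta.
        members = np.union1d(idx[S[center, idx] > beta], [center])
        remaining[members] = False
        found.append((int(center), members))

    min_size = gamma_pct / 100.0 * N               # validity threshold gamma
    clusters = [(c, m) for c, m in found if len(m) >= min_size]
    small = [m for _, m in found if len(m) < min_size]
    outliers = np.concatenate(small) if small else np.array([], dtype=int)
    return clusters, outliers


def fit_clusterwise(T, p, clusters):
    """Least-squares fit of the q outputs on the p inputs, per cluster."""
    T = np.asarray(T, dtype=float)
    models = []
    for _, members in clusters:
        A = np.column_stack([np.ones(len(members)), T[members, :p]])
        coef, *_ = np.linalg.lstsq(A, T[members, p:], rcond=None)
        models.append(coef)                        # shape (p + 1, q)
    return models


def predict(x, T, p, clusters, models):
    """Use the response equations of the cluster with the closest center."""
    T = np.asarray(T, dtype=float)
    x = np.asarray(x, dtype=float)
    centers = np.array([T[c, :p] for c, _ in clusters])
    k = int(np.argmin(np.linalg.norm(centers - x, axis=1)))
    return models[k][0] + x @ models[k][1:]


# Hypothetical data: one input, one output, two well-separated linear regions.
T = [[0.0, 1.0], [0.1, 1.2], [0.2, 1.4], [0.3, 1.6],
     [10.0, -5.0], [10.1, -5.1], [10.2, -5.2], [10.3, -5.3]]
clusters, outliers = efc_clusters(T, beta=0.5, gamma_pct=10.0)
models = fit_clusterwise(T, p=1, clusters=clusters)
```

In this toy data set each region follows its own straight line, so the two cluster-wise fits recover the two lines, whereas a single regression over all eight points could not.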
5 GA-based Tuning of Takagi and Sugeno Approach of FLC

In the Takagi and Sugeno approach of FLC, the output of the controller is expressed as a function of the input variables, determination of which might be a difficult task. It is to be noted that the above response equations obtained cluster-wise can be used to determine the appropriate functional form of the outputs of a controller. Thus, a separate FLC will be developed to represent input-output relationships for each cluster. As the linear regression equations have been developed cluster-wise, they will be able to determine the outputs in the optimal sense for the set of inputs, with the help of which the above response equations have been derived. To tune the above FLCs, so that they can perform reasonably well in the wide range of the variables, a genetic algorithm (GA) has been used. As a GA is computationally expensive, the tuning is done off-line, with the help of a large amount of known input-output data following the batch mode of training. Before we go for the GA-based tuning of the FLCs, let us explain the way it has been formulated as an optimization problem. Let us assume that i-th output (i = 1, 2, . . . , q) of the FLC contained in j-th cluster (j = 1, 2, . . . , n) is represented as follows.

$$y^{ij} = a_0^{ij} + a_1^{ij} x_1 + a_2^{ij} x_2 + \ldots + a_k^{ij} x_k + \ldots + a_p^{ij} x_p, \tag{8}$$

where $a_0^{ij}, a_1^{ij}, \ldots, a_p^{ij}$ are the co-efficients to be tuned properly during the GA-based learning. Let us also suppose that each of the input dimensions of a variable (i.e., $x_1, x_2, \ldots, x_p$) is expressed by using three linguistic terms: L (Low), M (Medium) and H (High). The membership function distributions of the input variables are assumed to be linear in Approach 2, whereas they are expressed by using third order polynomial functions [25] in Approach 3 (refer to Fig. 2). It is to be noted that the parameter $g_k$ (k = 1, 2, . . . , p) can be varied, while carrying out optimization by using a GA.

Fig. 2. Membership function distributions of k-th input variable: (a) linear function, (b) nonlinear (third order polynomial) function. Both panels plot µ (0.0 to 1.0) against $x_k$ for the terms L, M and H, with $x_{k,min}$ and the half base-width $g_k$ marked on the $x_k$ axis.

As three linguistic terms (such as L, M and H) have been utilized to represent each input dimension of a variable, its corresponding coefficient used in equation (8) might have three different values. It is assumed that the value of the coefficient ($a_k^{ij}$) will remain the same as the value obtained through the regression analysis, if the input $x_k$ is M, and it will increase or decrease by an amount $h_{ijk}$, if $x_k$ is H or L, respectively. Thus, the following three conditions may occur in the fired rule:

• If $x_k$ is L, then modified $a_k^{*ij} = a_k^{ij} - h_{ijk}$,
• If $x_k$ is M, then modified $a_k^{*ij} = a_k^{ij}$,
• If $x_k$ is H, then modified $a_k^{*ij} = a_k^{ij} + h_{ijk}$.
It is important to mention that $h_{ijk}$ will be varied during optimization. The GA, in total, will have to deal with (p + q × n × p) real variables during optimization, as there are p values of $g_k$ and q × n × p values of $h_{ijk}$. As there are p dimensions of an input variable and each is assumed to be represented by using three linguistic terms (L, M and H), there is a maximum of $3^p$ possible rules. Out of these $3^p$ rules, a maximum of $2^p$ rules could be fired. Now, corresponding to i-th output, the strength of a rule, say q-th, can be determined as follows.

$$w_q^i = \mu^i(x_1) \times \mu^i(x_2) \times \ldots \times \mu^i(x_p), \tag{9}$$

where $\mu^i(x_1)$ indicates the membership function value of $x_1$, and so on. It is important to mention that a rule, say q-th, is said to be fired, if the value of $w_q^i$ comes out to be non-zero. Moreover, the output of a fired rule (say, q-th) can be calculated by using the modified version of equation (8), like the following.

$$y_q^{ij} = a_0^{ij} + a_1^{*ij} x_1 + a_2^{*ij} x_2 + \ldots + a_k^{*ij} x_k + \ldots + a_p^{*ij} x_p. \tag{10}$$

It is important to mention that the constant term $a_0^{ij}$ has been kept unaltered during optimization. Thus, the combined control action corresponding to i-th output of the FLC representing j-th cluster can be determined as follows.

$$y^{ij} = \frac{\sum_{q=1}^{Q} w_q^i\, y_q^{ij}}{\sum_{q=1}^{Q} w_q^i}, \tag{11}$$

where $Q \le 2^p$ is the number of fired rules.

5.1 Genetic-Fuzzy System

A genetic-fuzzy system has been developed, in which the performances of the FLCs have been improved by using a GA-based tuning of their knowledge bases (KBs) off-line. Fig. 3 shows the schematic diagram of the developed genetic-fuzzy system. It is to be noted that a batch mode of training has been adopted in the present work.
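Equations (9) to (11), together with the L/M/H coefficient shifts, can be sketched for one output of one cluster as follows. The sketch covers only the linear (triangular) membership functions of Approach 2. The placement of the L, M and H peaks at x_min, x_min + g and x_min + 2g is an assumption read off the labels of Fig. 2, and the function names and numbers are hypothetical.

```python
from itertools import product


def memberships(x, x_min, g):
    """Triangular L, M, H memberships with half base-width g (assumed layout)."""
    peaks = {"L": x_min, "M": x_min + g, "H": x_min + 2.0 * g}
    return {term: max(0.0, 1.0 - abs(x - peak) / g) for term, peak in peaks.items()}


def flc_output(x, x_min, g, a, h):
    """One output of one cluster's FLC.

    a = [a0, a1, ..., ap] are the regression coefficients and
    h = [h1, ..., hp] are the coefficient shifts for this output and cluster.
    """
    mus = [memberships(xk, lo, gk) for xk, lo, gk in zip(x, x_min, g)]
    shift = {"L": -1.0, "M": 0.0, "H": 1.0}
    num = 0.0
    den = 0.0
    for terms in product("LMH", repeat=len(x)):       # the 3^p candidate rules
        w = 1.0
        for mu, term in zip(mus, terms):
            w *= mu[term]                              # equation (9)
        if w == 0.0:
            continue                                   # rule not fired
        # Equation (10): a0 is unaltered, a_k is shifted by -h, 0 or +h.
        y = a[0] + sum((ak + shift[term] * hk) * xk
                       for ak, hk, term, xk in zip(a[1:], h, terms, x))
        num += w * y
        den += w
    if den == 0.0:
        raise ValueError("input lies outside the range covered by L, M and H")
    return num / den                                   # equation (11)


# Hypothetical single-input case: x = 5 lies halfway between L and M.
y = flc_output([5.0], x_min=[0.0], g=[10.0], a=[1.0, 2.0], h=[0.5])
```

With all shifts set to zero the sketch reduces to the regression equation of the cluster, which matches the statement that the coefficient is left unchanged when the input is M.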
Fig. 3. A schematic diagram of the genetic-fuzzy system (blocks: GA-based tuning, off-line; knowledge base; FLC with inputs and outputs, on-line)
A binary-coded GA is used to carry out the above optimization, in which a particular string will look as follows.

$$\underbrace{10\ldots0}_{g_1} \ldots \underbrace{10\ldots1}_{g_k} \ldots \underbrace{00\ldots0}_{g_p}\; \underbrace{01\ldots1}_{h_{111}} \ldots \underbrace{01\ldots1}_{h_{ijk}} \ldots \underbrace{01\ldots1}_{h_{qnp}}$$
The fitness of a GA-string is calculated as the average absolute % deviation in prediction, like the following.

$$f = \frac{1}{T q} \sum_{i=1}^{T} \sum_{j=1}^{q} \left| \frac{TO_{ij} - CO_{ij}}{TO_{ij}} \right| \times 100, \tag{12}$$
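The following Python sketch shows how a binary substring might be decoded into a real value and how the fitness of equation (12) could be evaluated. The linear decoding rule and the 10-bit substring length are assumptions: the chapter does not state them, although the Table 1 value of 28.890518 for g1 (range 10 to 35) is consistent with a 10-bit substring decoded this way, since 10 + 25 × 773/1023 ≈ 28.890518. Averaging over both the training cases and the outputs is also an assumption about the normalization.

```python
def decode(bits, lo, hi):
    """Map a bit list linearly onto [lo, hi] (assumed decoding rule)."""
    value = int("".join(str(b) for b in bits), 2)
    return lo + (hi - lo) * value / (2 ** len(bits) - 1)


def decode_string(bits, ranges, n_bits=10):
    """Split a GA-string into equal substrings and decode each one."""
    return [decode(bits[i * n_bits:(i + 1) * n_bits], lo, hi)
            for i, (lo, hi) in enumerate(ranges)]


def avg_abs_pct_deviation(targets, calculated):
    """Average absolute % deviation over all cases and outputs.

    targets[i][j] and calculated[i][j] hold the j-th output of the i-th case.
    """
    total = 0.0
    count = 0
    for t_row, c_row in zip(targets, calculated):
        for t, c in zip(t_row, c_row):
            total += abs((t - c) / t) * 100.0
            count += 1
    return total / count


# 773 in 10 bits, decoded over the g1 range of Table 1.
g1 = decode([1, 1, 0, 0, 0, 0, 0, 1, 0, 1], 10.0, 35.0)

# Hypothetical targets and predictions: deviations of 10 %, 0 %, 25 % and 0 %.
f = avg_abs_pct_deviation([[1.0, 2.0], [4.0, 5.0]], [[1.1, 2.0], [3.0, 5.0]])
```

Since a smaller deviation is better, a GA that maximizes fitness would need this value to be transformed or the problem to be posed as a minimization; the text does not say which was done.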
where T indicates the number of training cases, q represents the number of outputs, and the target and calculated values are indicated by $TO_{ij}$ and $CO_{ij}$, respectively. To summarize the above discussion, the following steps have been considered in the developed algorithm:

• Cluster a large amount of training data (known input-output relationships) using a similarity-based fuzzy clustering algorithm.
• Carry out linear regression analysis cluster-wise, to express each output as a linear function of the inputs by utilizing the data falling into that particular cluster.
• Design the FLCs based on Takagi and Sugeno's approach, in which each FLC will represent a particular cluster.
• Tune the FLCs by using a GA, off-line.

It is to be noted that once optimized, the FLCs can be used for making on-line predictions of the outputs, for a set of inputs. Thus, three different approaches have been developed to establish input-output relationships of a process, which are as follows.

• Approach 1: Cluster-wise Linear Regression
• Approach 2: Cluster-wise Takagi and Sugeno Model of FLC with Linear Membership Function Distributions for the Input Variables
• Approach 3: Cluster-wise Takagi and Sugeno Model of FLC with Nonlinear (Polynomial) Membership Function Distributions for the Input Variables
6 Results and Discussion

The performances of the above three developed approaches have been tested and compared among themselves on two different problems: one is related to Abrasive Flow Machining (AFM) [26] and the other deals with Tungsten Inert Gas (TIG) welding [27], which are discussed below.

6.1 Modeling of Abrasive Flow Machining (AFM) Process [26]

To model input-output relationships in the AFM process, four inputs (flow speed of abrasive media v, % concentration of abrasive c, abrasive mesh size a and number of cycles n) and two outputs, namely material removal rate (MRR) and surface roughness ($R_a$), have been considered. One thousand data points related to the above relationships have been generated artificially by selecting the values of the variables at random within their respective ranges and substituting them in the following empirical relationships:

$$MRR = 5.25 \times 10^{-7}\, v^{1.6469}\, c^{3.0776}\, a^{-0.9371}\, n^{-0.1893}, \tag{13}$$

$$R_a = 282751\, v^{-1.8221}\, c^{-1.3222}\, a^{0.1368}\, n^{-0.2258}, \tag{14}$$

where 40.0 ≤ v ≤ 85.0, 33.0 ≤ c ≤ 45.0, 100.0 ≤ a ≤ 240.0, 20 ≤ n ≤ 120. It is important to note that each of these 1000 data points has six dimensions: the first four indicate the input variables (i.e., v, c, a and n) and the last two represent the output variables, namely MRR and $R_a$. Thus, the above data set has a size of 1000 × 6. The entropy-based fuzzy clustering algorithm has been used to make the clusters based on similarity. As the performance of the clustering algorithm depends on the threshold value of similarity β, a thorough study is carried out to determine a suitable value of it. Fig. 4 shows the variations of the number of clusters and outliers with different values of β. The number of clusters initially increases with the value of β, reaches the maximum value corresponding to certain values of β and then decreases with a further increase in the value of β, as expected. From the above study, a suitable value of β, i.e., 0.52, has been selected, corresponding to which four clusters have been generated and there are two outliers.
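The artificial data set described above could be generated along the following lines. This is a sketch under assumptions: the chapter only says that the values were selected at random within their ranges, so uniform sampling, a continuous (non-integer) number of cycles, the seed and the function names are all choices made here for illustration.

```python
import numpy as np

# Ranges of the four inputs, taken from the text.
RANGES = {"v": (40.0, 85.0), "c": (33.0, 45.0), "a": (100.0, 240.0), "n": (20.0, 120.0)}


def afm_outputs(v, c, a, n):
    """Empirical relationships (13) and (14) for MRR and Ra."""
    mrr = 5.25e-7 * v**1.6469 * c**3.0776 * a**-0.9371 * n**-0.1893
    ra = 282751.0 * v**-1.8221 * c**-1.3222 * a**0.1368 * n**-0.2258
    return mrr, ra


def generate_afm_data(n_samples=1000, seed=0):
    """Return an (n_samples x 6) array with columns v, c, a, n, MRR, Ra."""
    rng = np.random.default_rng(seed)
    cols = [rng.uniform(lo, hi, n_samples) for lo, hi in RANGES.values()]
    mrr, ra = afm_outputs(*cols)
    return np.column_stack(cols + [mrr, ra])


data = generate_afm_data()   # shape (1000, 6)
```

Each row of `data` is one six-dimensional point of the kind that the clustering algorithm takes as input.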
Fig. 4. Variations of no. of clusters and no. of outliers with beta: (a) No. of clusters vs. beta value, (b) No. of outliers vs. beta value
Results of Approach 1 To establish input-output relationships cluster-wise, linear regression analysis is carried out using a commercial software package named MINITAB-14. The following response equations are obtained for the different clusters.
Cluster 1

$$MRR = -0.278 + 0.0029v + 0.00852c - 0.000512a - 0.000183n \tag{15}$$

$$R_a = 4.11 - 0.0306v - 0.0299c + 0.000918a - 0.00235n \tag{16}$$

Cluster 2

$$MRR = -0.47 + 0.0049v + 0.0149c - 0.00137a - 0.000617n \tag{17}$$

$$R_a = 4.37 - 0.0306v - 0.0335c + 0.000845a - 0.00327n \tag{18}$$

Cluster 3

$$MRR = -0.328 + 0.00351v + 0.0105c - 0.000637a - 0.000774n \tag{19}$$

$$R_a = 5.27 - 0.0366v - 0.0441c + 0.00123a - 0.00688n \tag{20}$$

Cluster 4

$$MRR = -0.279 + 0.00372v + 0.011c - 0.00119a - 0.000443n \tag{21}$$

$$R_a = 4.93 - 0.0326v - 0.0372c - 0.00034a - 0.00406n \tag{22}$$
Fig. 5 shows the comparisons among the target (determined by using the empirical relationships) and calculated values (obtained by utilizing the above regression equations) of two outputs, namely M RR and Ra , for 50 randomlygenerated test cases. The belongingness of a test case to a particular cluster is decided by considering the minimum of its Euclidean distances from the four cluster centers obtained above. The model is able to make reasonably good predictions of M RR for a number of test cases (corresponding to which the points are lying on the ideal y = x line) but not all (refer to Fig. 5(a)). The above figure shows that the model has under-estimated the M RR values for a few test cases. It is interesting to notice from Fig. 5(b) that the model has predicted surface roughness values almost accurately for most of the test cases but not all. It is also important to note that the points are found to lie on both the sides of the ideal y = x line. Thus, reasonably good predictions of both the outputs have been made by Approach 1, for most of the random test cases. The actual input-output relationships of the above process may be nonlinear in nature. Moreover, the degree of non-linearity may not be the same throughout the entire input-output space. As the clustering is done based on the concept of similarity, the similar points are expected to form a cluster. Here, the non-linear input-output space has been divided into four clusters based on the similarity and at each cluster, the input-output relationships have been determined using the linear regression analysis. Thus, a non-linear space has been divided into four regions and at each region, the input-output relationships have been approximated as the linear ones. The deviations in predictions as shown in Fig. 5 could be due to the above reason.
Fig. 5. Performance testing of cluster-wise linear regression (AFM data): (a) Target vs. calculated values of MRR, (b) Target vs. calculated values of Ra
Results of Approach 2 Fuzzy logic controllers (FLCs) have been designed based on Takagi and Sugeno’s approach cluster-wise, in which each of the outputs has been expressed as the linear function of inputs, as obtained above. The Knowledge
Takagi and Sugeno Approach of FLC
227
Base (KB) of the FLCs have been tuned to improve their performance by using a GA. As the performance of the GA depends on its parameters, a systematic study is conducted to determine the optimal GA-parameters, in which only one parameter has been varied at a time after keeping the others fixed. Fig. 6 shows the results of the above parametric study. It is important to note that 9.3 9.25
Fitness
9.2 9.15 9.1 9.05 9 8.95 0.65
0.7
0.75 0.8 0.85 0.9 Crossover probability
0.95
1
(a) Fitness vs. pc 9.45 9.4 9.35 9.3 Fitness
9.25 9.2 9.15 9.1 9.05 9 8.95 0
0.0005
0.001 0.0015 0.002 Mutation probability
0.0025
(b) Fitness vs. pm Fig. 6. Results of the GA-parametric study
0.003
228
Tushar and D.K. Pratihar 10.4 10.2 10
Fitness
9.8 9.6 9.4 9.2 9 8.8
0
20
40
60
80 100 120 140 160 180 200 Population size
(c) Fitness vs. population size 10.6 10.4 10.2
Fitness
10 9.8 9.6 9.4 9.2 9 8.8
0
50 100 150 200 250 300 350 400 450 500 No. of generations
(d) Fitness vs. no. of generations Fig. 6. (Continued)
the best results are obtained with the following GA-parameters: probability of crossover pc = 0.87, probability of mutation pm = 0.00223, population size P = 190 and maximum number of generations G = 450. In this problem, there are four input variables (i.e, p = 4), two outputs (i.e., q = 2) and the number of clusters n is set equal to 4. Thus, a total of p +q × n × p = 4 + 2 × 4 × 4 = 36 values are to be varied within their respective ranges (refer to Table 1), during
Table 1. Ranges and optimized values of 36 parameters – AFM data

Parameter   Range (min, max)       Optimal value, Approach 2   Optimal value, Approach 3
g1          10, 35                 28.890518                   33.191593
g2          2, 12                  11.824047                   9.507331
g3          20, 100                59.960899                   59.960899
g4          10, 70                 43.782991                   37.689284
h111        0.00001, 0.0006        0.00001                     0.00001
h112        0.00001, 0.0006        0.00001                     0.00001
h113        0.000001, 0.00006      0.000001                    0.000001
h114        0.000001, 0.00006      0.000001                    0.000001
h121        0.001, 0.006           0.00405                     0.004363
h122        0.001, 0.006           0.002711                    0.001371
h123        0.000001, 0.00006      0.000001                    0.000001
h124        0.0001, 0.0006         0.000286                    0.00035
h131        0.00001, 0.0006        0.00001                     0.00001
h132        0.0001, 0.006          0.0001                      0.0001
h133        0.00001, 0.0006        0.000147                    0.000158
h134        0.00001, 0.00008       0.00008                     0.00008
h141        0.001, 0.006           0.004094                    0.004754
h142        0.001, 0.006           0.003502                    0.001938
h143        0.000001, 0.00006      0.000001                    0.000001
h144        0.0001, 0.0006         0.000506                    0.000599
h211        0.0001, 0.0006         0.0001                      0.0001
h212        0.0001, 0.006          0.0001                      0.0001
h213        0.00001, 0.00006       0.00001                     0.00001
h214        0.00001, 0.00006       0.00006                     0.00006
h221        0.001, 0.006           0.004436                    0.004651
h222        0.001, 0.006           0.00316                     0.002237
h223        0.0001, 0.0006         0.000113                    0.000209
h224        0.0001, 0.0006         0.000579                    0.0006
h231        0.00001, 0.0006        0.00001                     0.000064
h232        0.0001, 0.006          0.000152                    0.0001
h233        0.0001, 0.0006         0.0001                      0.0001
h234        0.00001, 0.00006       0.000016                    0.00001
h241        0.001, 0.006           0.003498                    0.004123
h242        0.001, 0.006           0.003761                    0.00221
h243        0.00001, 0.00006       0.000058                    0.00005
h244        0.0001, 0.0006         0.000592                    0.0006
optimization. The GA has trained the FLCs in batch mode, using the 1000 artificial training data generated earlier. The optimized values of the above 36 parameters (i.e., 4 values of g_k and 32 values of h_ijk) obtained during the training are also shown in Table 1. Fig. 7 compares the model-predicted (i.e., Approach 2) values of MRR and Ra with
Fig. 7. Performance testing of the FLC having linear membership function distributions – AFM data: (a) target vs. calculated values of MRR; (b) target vs. calculated values of Ra.
their respective target values (obtained from the empirical expressions). It is interesting to note that Approach 2 has yielded slightly better predictions of MRR than Approach 1. Moreover, Approach 2 has shown a tendency to shift the points (shown in Fig. 7(b)) towards one side of the ideal line y = x, i.e., to underestimate the surface roughness values. Approach 2 has performed slightly better than Approach 1 because, in Approach 2, a GA has been used to fine-tune the coefficients of the linear regression equations and the membership function distributions of the variables. The GA thus adds an adaptive, learning component to the FLC, and as a result Approach 2 is expected to adapt better to new cases.
Results of Approach 3
A parametric study has been carried out for this approach by following the same procedure explained above, and the following GA-parameters are found to yield the best results: pc = 0.93, pm = 0.00133, P = 190 and G = 430. The optimal values of g_k and h_ijk obtained by using this approach are shown in Table 1. The values of MRR and Ra predicted by using Approach 3 have been compared with their respective target values in Fig. 8. It is interesting to note that Approach 3 has predicted both MRR and Ra with almost the same accuracy as Approach 2.
Comparisons
The above three approaches have been compared in terms of % deviation in prediction of MRR and Ra, as shown in Fig. 9. The values of % deviation in prediction of MRR and Ra yielded by Approaches 2 and 3 vary within narrower ranges than those obtained by Approach 1. Thus, both Approaches 2 and 3 are seen to outperform Approach 1 in most of the test cases. This could be due to the GA-based optimization of the membership function distributions of the input variables and of the coefficients of the linear regression equations.
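The % deviation used in this comparison can be illustrated with a minimal Python sketch. The chapter does not restate the formula in this section, so the definition below (deviation taken relative to the target value, signed as target minus prediction) is an assumption, and the function names and sample numbers are hypothetical.

```python
def percent_deviation(target, predicted):
    """Signed % deviation of a prediction from its target.

    Assumed definition: (target - predicted) / target * 100.
    The sign convention is an assumption, not taken from the chapter.
    """
    if target == 0:
        raise ValueError("target must be non-zero for a relative deviation")
    return (target - predicted) / target * 100.0


def average_absolute_percent_deviation(targets, predictions):
    """Mean of |% deviation| over a set of test cases."""
    deviations = [abs(percent_deviation(t, p)) for t, p in zip(targets, predictions)]
    return sum(deviations) / len(deviations)


# Hypothetical target/predicted pairs, for illustration only.
targets = [0.20, 0.35, 0.50]
predictions = [0.21, 0.35, 0.49]
print(average_absolute_percent_deviation(targets, predictions))
```

A relative measure of this kind becomes very large when a target value is close to zero, which is worth keeping in mind when reading deviation plots for outputs that cross zero.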
The values of average absolute % deviation in prediction of the outputs have been calculated for the three approaches and are found to be equal to 0.1949155, 0.1565955 and 0.151972 for Approaches 1, 2 and 3, respectively. Approach 3 has shown a slightly better performance than Approach 2. This may be because the linear membership function distributions used in Approach 2 have been replaced by nonlinear distributions in Approach 3. The advantage of Approach 3 over Approach 2 indicates that the input-output relationships of the process might be nonlinear in nature.
6.2 Modeling of Tungsten Inert Gas (TIG) Process [27]
In the present work, five inputs, namely welding speed (C1), wire feed rate (C2), % cleaning (C3), arc gap (C4) and welding current (C5), and four outputs,
Fig. 8. Performance testing of the FLC having nonlinear membership function distributions – AFM data: (a) target vs. calculated values of MRR; (b) target vs. calculated values of Ra.
namely weld bead front height (O1 ), front width (O2 ), back height (O3 ), back width (O4 ), have been considered to model the TIG welding process. The ranges of different input parameters have been set as follows: C1 (24.0 to 46.0 cm/min), C2 (1.5 to 2.5 cm/min), C3 (30.0 to 70.0), C4 (2.4 to 3.2 mm) and C5 (80.0 to 110.0 amp). To establish input-output relationships of this
Fig. 9. Comparison of three approaches in terms of % deviation in prediction of different outputs – AFM: (a) Output 1: MRR (legend keys 'CWLRMRR', 'LRFLCMRR', 'NONLRFLCMRR'); (b) Output 2: surface roughness Ra (legend keys 'CWLRRa', 'LRFLCRa', 'NONLRFLCRa'). Both panels plot % deviation against the test cases.
process by using statistical regression analysis, the data collected as per full factorial design of experiments (as there are five input variables, the number of experiments will be equal to 2^5 = 32) [27], have been used. Table 2 shows the set of above 32 data involving input-output relationships of the process. The following response equations have been obtained by using MINITAB14 software package on the above 32 data.

\begin{align}
O_1 ={}& -17.2504 + 0.6202C_1 + 4.6762C_2 + 0.0866C_3 + 7.4479C_4 + 0.0431C_5 \notag\\
& - 0.1870C_1C_2 - 0.0058C_1C_3 - 0.2210C_1C_4 - 0.0029C_1C_5 + 0.0018C_2C_3 \notag\\
& - 1.8396C_2C_4 + 0.0191C_2C_5 - 0.0586C_3C_4 + 0.0018C_3C_5 - 0.0352C_4C_5 \notag\\
& + 0.014C_1C_2C_3 + 0.0623C_1C_2C_4 + 0.0002C_1C_2C_5 + 0.0022C_1C_3C_4 \notag\\
& - 0.0070 \times 10^{-3}C_1C_3C_5 + 0.0011C_1C_4C_5 + 0.0061C_2C_3C_4 - 0.0014C_2C_3C_5 \notag\\
& - 0.0030C_2C_4C_5 - 0.0003C_3C_4C_5 - 0.0004C_1C_2C_3C_4 + 0.0189 \times 10^{-3}C_1C_2C_3C_5 \notag\\
& - 0.0460 \times 10^{-3}C_1C_2C_4C_5 - 0.0009C_1C_3C_4C_5 - 0.0004C_2C_3C_4C_5 \notag\\
& - 0.0069C_1C_2C_3C_4C_5, \tag{23}\\
O_2 ={}& -329.6758 + 8.2539C_1 + 167.1041C_2 + 5.8187C_3 + 101.462C_4 + 3.9953C_5 \notag\\
& - 4.0707C_1C_2 - 0.1414C_1C_3 - 2.5489C_1C_4 - 0.0991C_1C_5 - 2.9150C_2C_3 \notag\\
& - 54.1378C_2C_4 - 1.9883C_2C_5 - 1.8510C_3C_4 - 0.0686C_3C_5 - 1.2150C_4C_5 \notag\\
& + 0.07C_1C_2C_3 + 1.3175C_1C_2C_4 + 0.0486C_1C_2C_5 + 0.0441C_1C_3C_4 \notag\\
& + 0.0017C_1C_3C_5 + 0.0308C_1C_4C_5 + 0.9399C_2C_3C_4 + 0.0345C_2C_3C_5 \notag\\
& + 0.6524C_2C_4C_5 + 0.0223C_3C_4C_5 - 0.0223C_1C_2C_3C_4 - 0.8392 \times 10^{-3}C_1C_2C_3C_5 \notag\\
& - 0.0159C_1C_2C_4C_5 - 0.5426 \times 10^{-3}C_1C_3C_4C_5 + 0.0113C_2C_3C_4C_5 \notag\\
& + 0.2718 \times 10^{-3}C_1C_2C_3C_4C_5, \tag{24}\\
O_3 ={}& 20.7999 - 0.3831C_1 - 3.5745C_2 + 0.1079C_3 - 9.3284C_4 - 0.0924C_5 \notag\\
& + 0.0058C_1C_2 - 0.0054C_1C_3 + 0.1665C_1C_4 + 0.0005C_1C_5 - 0.1114C_2C_3 \notag\\
& + 2.2936C_2C_4 - 0.0168C_2C_5 - 0.0092C_3C_4 - 0.0037C_3C_5 + 0.0576C_4C_5 \notag\\
& + 0.0044C_1C_2C_3 - 0.0237C_1C_2C_4 + 0.0015C_1C_2C_5 + 0.0016C_1C_3C_4 \notag\\
& + 0.0001C_1C_3C_5 - 0.0006C_1C_4C_5 + 0.0262C_2C_3C_4 + 0.0024C_2C_3C_5 \notag\\
& - 0.0041C_2C_4C_5 + 0.0009C_3C_4C_5 - 0.0013C_1C_2C_3C_4 - 0.769 \times 10^{-3}C_1C_2C_3C_5 \notag\\
& - 0.0003C_1C_2C_4C_5 - 0.0393C_1C_3C_4C_5 - 0.0007C_2C_3C_4C_5 \notag\\
& + 0.028 \times 10^{-3}C_1C_2C_3C_4C_5, \tag{25}\\
O_4 ={}& -179.4354 + 4.1209C_1 + 104.7708C_2 + 4.1113C_3 + 52.8753C_4 + 2.4368C_5 \notag\\
& - 2.5474C_1C_2 - 0.0946C_1C_3 - 1.2695C_1C_4 - 0.0573C_1C_5 - 2.2272C_2C_3 \notag\\
& - 34.1677C_2C_4 - 1.2973C_2C_5 - 1.3856C_3C_4 - 0.0508C_3C_5 - 0.7198C_4C_5 \notag\\
& + 0.0520C_1C_2C_3 + 0.8188C_1C_2C_4 + 0.0321C_1C_2C_5 + 0.0318C_1C_3C_4 \notag\\
& + 0.0012C_1C_3C_5 + 0.0182C_1C_4C_5 + 0.7684C_2C_3C_4 + 0.0268C_2C_3C_5 \notag\\
& + 0.04353C_2C_4C_5 + 0.0176C_3C_4C_5 - 0.0178C_1C_2C_3C_4 - 0.6427 \times 10^{-3}C_1C_2C_3C_5 \notag\\
& - 0.0108C_1C_2C_4C_5 - 0.4212C_1C_3C_4C_5 - 0.0094C_2C_3C_4C_5 \notag\\
& + 0.2246 \times 10^{-3}C_1C_2C_3C_4C_5. \tag{26}
\end{align}
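Each of Eqs. (23)–(26) has one term per subset of the five factors (1 constant, 5 main effects, 10 + 10 + 5 + 1 interaction terms, i.e., 32 terms for 32 experiments). A minimal Python sketch of how such a model can be stored and evaluated is given below. The function names and the dictionary layout are hypothetical, and the example coefficients are only the first few terms of Eq. (23), so the printed value is not a prediction of O1.

```python
from itertools import combinations
from math import prod


def all_terms(n_factors):
    """All subsets of factor indices 1..n_factors, ordered by size.

    The empty tuple () stands for the constant term.
    """
    factors = range(1, n_factors + 1)
    return [s for r in range(n_factors + 1) for s in combinations(factors, r)]


def evaluate_interaction_model(coeffs, x):
    """Sum of coeffs[S] * product of x[i] for i in S, over all stored subsets S.

    coeffs: dict mapping tuples of 1-based factor indices to coefficients.
    x: sequence of factor values (x[0] is C1, x[1] is C2, ...).
    """
    return sum(c * prod(x[i - 1] for i in subset) for subset, c in coeffs.items())


# Truncated illustration: only four of the 32 terms of Eq. (23).
partial_o1 = {(): -17.2504, (1,): 0.6202, (2,): 4.6762, (1, 2): -0.1870}

print(len(all_terms(5)))                                   # number of terms in a full model
print(evaluate_interaction_model(partial_o1, [24.0, 1.5, 30.0, 2.4, 80.0]))
```

Storing the coefficients keyed by factor subset keeps the code independent of the number of factors and makes it easy to check that no term has been dropped.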
Table 2. Data (as per full factorial design of experiments) used to carry out regression analysis

Sl. No.   Treatment        Level of the factors   Response values
          combination      C1 C2 C3 C4 C5         O1 (mm)   O2 (mm)   O3 (mm)   O4 (mm)
1         1                −  −  −  −  −          −0.149    6.090     0.672     5.664
2         C1               +  −  −  −  −          0.357     4.982     0.001     2.255
3         C2               −  +  −  −  −          0.155     6.676     0.743     5.960
4         C3               −  −  +  −  −          −0.179    7.432     0.593     7.058
5         C4               −  −  −  +  −          0.027     6.411     0.412     5.197
6         C5               −  −  −  −  +          −0.599    11.348    0.805     11.679
7         C1C2             +  +  −  −  −          0.390     4.780     0.062     1.330
8         C1C3             +  −  +  −  −          0.088     5.020     0.281     3.302
9         C1C4             +  −  −  +  −          0.168     4.898     0.277     2.998
10        C1C5             +  −  −  −  +          −0.217    6.092     0.359     6.419
11        C2C3             −  +  +  −  −          −0.129    7.009     0.878     6.989
12        C2C4             −  +  −  +  −          0.099     6.824     0.803     5.732
13        C2C5             −  +  −  −  +          −0.232    9.338     0.866     10.611
14        C3C4             −  −  +  +  −          −0.306    7.287     0.630     6.895
15        C3C5             −  −  +  −  +          −0.254    11.237    0.470     12.000
16        C4C5             −  −  −  +  +          −0.745    11.491    1.100     11.848
17        C1C2C3           +  +  +  −  −          0.380     5.231     0.397     2.817
18        C1C2C4           +  +  −  +  −          0.487     4.992     0.139     1.600
19        C1C2C5           +  +  −  −  +          −0.010    6.396     0.536     6.197
20        C1C3C4           +  −  +  +  −          0.090     4.423     0.420     3.172
21        C1C3C5           +  −  +  −  +          −0.249    7.719     0.492     7.706
22        C1C4C5           +  −  −  +  +          −0.339    7.335     0.619     7.520
23        C2C3C4           −  +  +  +  −          −0.077    7.460     0.820     7.809
24        C2C3C5           −  +  +  −  +          −0.623    11.767    1.128     12.860
25        C2C4C5           −  +  −  +  +          −0.557    12.348    1.139     12.403
26        C3C4C5           −  −  +  +  +          −0.683    12.946    0.945     13.921
27        C1C2C3C4         +  +  +  +  −          0.394     5.337     0.378     3.041
28        C1C2C3C5         +  +  +  −  +          −0.201    7.052     0.658     7.480
29        C1C2C4C5         +  +  −  +  +          0.074     6.863     0.484     6.072
30        C1C3C4C5         +  −  +  +  +          −0.396    7.633     0.458     7.601
31        C2C3C4C5         −  +  +  +  +          −0.617    12.533    1.084     13.346
32        C1C2C3C4C5       +  +  +  +  +          −0.358    7.759     0.798     7.917
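The 32 treatment combinations of Table 2 follow a regular pattern: all subsets of the five factors, ordered by size, with "+" for the factors named in the treatment label and "−" for the rest. The short Python sketch below regenerates that layout. The function names are hypothetical, and mapping "−" to the lower bound and "+" to the upper bound of each input range is an assumption, since the chapter does not state the level values explicitly.

```python
from itertools import combinations

# Input ranges quoted in the text for C1..C5 (lower bound, upper bound).
RANGES = [(24.0, 46.0), (1.5, 2.5), (30.0, 70.0), (2.4, 3.2), (80.0, 110.0)]


def full_factorial_design(n_factors):
    """Return (treatment label, coded levels) for all 2**n_factors runs.

    Coded levels are +1 for factors in the treatment label and -1 otherwise.
    The run with no factor at its high level is labelled "1".
    """
    factors = range(1, n_factors + 1)
    rows = []
    for r in range(n_factors + 1):
        for high in combinations(factors, r):
            label = "".join(f"C{i}" for i in high) or "1"
            levels = tuple(+1 if i in high else -1 for i in factors)
            rows.append((label, levels))
    return rows


def decode(levels, ranges=RANGES):
    """Map coded levels to factor values (assumed: -1 -> lower, +1 -> upper bound)."""
    return tuple(hi if lv > 0 else lo for lv, (lo, hi) in zip(levels, ranges))


design = full_factorial_design(5)
for label, levels in design[:8]:
    print(label, levels, decode(levels))
```

Generating the design programmatically gives a simple cross-check of the sign columns of a hand-typed table.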
One thousand data have been generated artificially by using the above regression equations and those are grouped into a number of clusters based on similarity by utilizing the entropy-based fuzzy clustering algorithm. The best set of four clusters with zero outlier is obtained, corresponding to a threshold of similarity β = 0.43. Three approaches have been developed with the best set of clusters obtained above, the results of which are explained below.
Results of Approach 1
To re-establish the input-output relationships of the process, linear regression analysis is carried out cluster-wise using the software – MINITAB-14. The cluster-wise response equations have been obtained as follows.

Cluster 1:
\begin{align}
O_1 &= 0.918 + 0.0168C_1 + 0.138C_2 - 0.00378C_3 - 0.0806C_4 - 0.0157C_5 \tag{27}\\
O_2 &= -2.51 - 0.134C_1 + 0.0608C_2 + 0.0157C_3 + 0.516C_4 + 0.133C_5 \tag{28}\\
O_3 &= -0.231 - 0.0180C_1 + 0.182C_2 + 0.00257C_3 + 0.0806C_4 + 0.00790C_5 \tag{29}\\
O_4 &= -5.98 - 0.204C_1 - 0.0169C_2 + 0.0342C_3 + 0.496C_4 + 0.181C_5 \tag{30}
\end{align}

Cluster 2:
\begin{align}
O_1 &= 0.795 + 0.0164C_1 + 0.208C_2 - 0.00364C_3 - 0.0555C_4 - 0.0167C_5 \tag{31}\\
O_2 &= -0.146 - 0.126C_1 + 0.0560C_2 + 0.0178C_3 + 0.576C_4 + 0.100C_5 \tag{32}\\
O_3 &= -0.282 - 0.0222C_1 + 0.0862C_2 + 0.00338C_3 + 0.114C_4 + 0.0108C_5 \tag{33}\\
O_4 &= -4.23 - 0.194C_1 - 0.419C_2 + 0.0327C_3 + 0.427C_4 + 0.169C_5 \tag{34}
\end{align}

Cluster 3:
\begin{align}
O_1 &= 1.30 + 0.0145C_1 + 0.0987C_2 - 0.00319C_3 - 0.158C_4 - 0.0161C_5 \tag{35}\\
O_2 &= 1.83 - 0.192C_1 - 0.129C_2 + 0.0236C_3 + 0.859C_4 + 0.0989C_5 \tag{36}\\
O_3 &= -0.443 - 0.0175C_1 + 0.195C_2 + 0.00175C_3 + 0.144C_4 + 0.00830C_5 \tag{37}\\
O_4 &= -3.34 - 0.234C_1 - 0.177C_2 + 0.0321C_3 + 0.655C_4 + 0.165C_5 \tag{38}
\end{align}

Cluster 4:
\begin{align}
O_1 &= 1.07 + 0.0170C_1 + 0.200C_2 - 0.00233C_3 - 0.165C_4 - 0.0169C_5 \tag{39}\\
O_2 &= -3.43 - 0.190C_1 - 0.258C_2 + 0.0257C_3 + 1.32C_4 + 0.137C_5 \tag{40}\\
O_3 &= -0.410 - 0.0207C_1 + 0.120C_2 + 0.000108C_3 + 0.193C_4 + 0.00988C_5 \tag{41}\\
O_4 &= -6.43 - 0.227C_1 - 0.301C_2 + 0.0358C_3 + 0.811C_4 + 0.188C_5 \tag{42}
\end{align}
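As a worked illustration of how one set of cluster-wise equations is applied, the Python sketch below evaluates Eqs. (27)–(30) for Cluster 1 at a single input vector. The coefficients are copied from those equations; the input values are hypothetical mid-range settings chosen only for the example, and the rule for assigning a test case to a cluster is not covered here.

```python
# Coefficients of Eqs. (27)-(30): (constant, C1, C2, C3, C4, C5) for each output.
CLUSTER_1 = {
    "O1": (0.918, 0.0168, 0.138, -0.00378, -0.0806, -0.0157),
    "O2": (-2.51, -0.134, 0.0608, 0.0157, 0.516, 0.133),
    "O3": (-0.231, -0.0180, 0.182, 0.00257, 0.0806, 0.00790),
    "O4": (-5.98, -0.204, -0.0169, 0.0342, 0.496, 0.181),
}


def predict_linear(cluster, inputs):
    """Evaluate every linear response equation of one cluster.

    cluster: dict of output name -> (b0, b1, ..., bp).
    inputs: sequence (C1, ..., Cp).
    """
    return {
        name: b[0] + sum(bi * ci for bi, ci in zip(b[1:], inputs))
        for name, b in cluster.items()
    }


# Hypothetical test point inside the stated input ranges.
test_point = (35.0, 2.0, 50.0, 2.8, 95.0)
for name, value in predict_linear(CLUSTER_1, test_point).items():
    print(f"{name} = {value:.4f} mm")
```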
The performance of this approach has been tested on 50 randomly-generated cases, the results of which are shown in Fig. 10. The calculated values of the outputs have been compared with their respective target values. Good predictions are obtained by Approach 1 for all four outputs. The points in Fig. 10 lie either on the ideal y = x line or on both sides of it. The scatter about the ideal line could be due to the fact that, within each cluster, the responses have been determined as linear functions of the input process parameters, while their interaction and nonlinear terms have been neglected for simplicity.
Fig. 10. Performance testing of cluster-wise linear regression – TIG data: (a) target vs. calculated values of output 1; (b) target vs. calculated values of output 2; (c) target vs. calculated values of output 3; (d) target vs. calculated values of output 4.
Results of Approach 2
In this approach, FLCs have been developed cluster-wise by following the Takagi and Sugeno approach and utilizing the linear response equations obtained above. A binary-coded GA has been used to improve their KBs. The GA is found to perform best with the following parameters: pc = 0.89, pm = 0.00097, P = 100, G = 500. During optimization, the values of p + q × n × p = 5 + 4 × 4 × 5 = 85 parameters have been varied within their respective ranges, as shown in Table 3. A batch mode of training has been adopted using the same set of 1000 artificially-generated data. The optimal values of the above 85 parameters obtained using Approach 2 are also shown in Table 3. The calculated values of the four outputs obtained by utilizing Approach 2 have been compared with their respective target values in Fig. 11. It is interesting to note that Approach 2 has yielded slightly better predictions of the outputs than Approach 1, and as a result the points in the above figure are seen to move closer to the ideal y = x line. Approach 2 is found to be more adaptable to the test cases than Approach 1, as the adaptability has been injected by the GA.
Results of Approach 3
In this approach, the linear membership function distributions of the input variables adopted in Approach 2 have been replaced by nonlinear distributions, such as a third order polynomial function. A systematic parametric study has been carried out to decide the optimal GA-parameters, which are found to be as follows: pc = 0.78, pm = 0.00133, P = 100, G = 500. During the GA-based tuning of the FLCs, a batch mode of training (with the help of the same set of 1000 data) has been provided. Fig. 12 compares the calculated values of the outputs (obtained by using Approach 3) with their respective target values. A close look at Figures 10 through 12 reveals that Approach 3 has yielded better predictions than Approach 1. Moreover, Approach 3 is seen to be marginally better than Approach 2 in predicting the outputs. This could be because the membership function distributions of the variables have been assumed to be nonlinear in Approach 3.
Comparisons
Fig. 13 compares the performances of the three approaches in terms of % deviation in prediction of the different outputs. For the test cases, the values of average absolute % deviation in prediction of the outputs have been determined for the three approaches. Approach 1 has yielded a value of 10.910930%, whereas the values for Approaches 2 and 3 are 8.082850% and 8.058067%, respectively. Thus, Approach 3 has given the lowest average deviation of the three approaches.
The performances of the three developed approaches have been checked on the data related to two physical problems – Abrasive Flow Machining (AFM) and Tungsten Inert Gas (TIG) welding. In both the problems, Approaches 2
Table 3. Ranges and optimized values of 85 parameters – TIG data

Parameter   Range (min, max)       Optimal value, Approach 2   Optimal value, Approach 3
g1          1, 20                  1                           1.928641
g2          0.05, 0.8              0.706891                    0.244282
g3          1, 35                  35                          35
g4          0.05, 0.6              0.36129                     0.342473
g5          1, 25                  25                          24.131966
h111        0.0001, 0.0006         0.0001                      0.0001
h112        0.001, 0.006           0.006                       0.005541
h113        0.00001, 0.00006       0.00001                     0.00001
h114        0.0001, 0.0006         0.0006                      0.0006
h115        0.0001, 0.0006         0.000133                    0.000444
h121        0.001, 0.006           0.002256                    0.003517
h122        0.0001, 0.0006         0.000572                    0.000256
h123        0.0001, 0.0006         0.000447                    0.000105
h124        0.001, 0.006           0.001313                    0.001039
h125        0.001, 0.006           0.002261                    0.003498
h131        0.0001, 0.0006         0.000103                    0.0001
h132        0.001, 0.006           0.001                       0.001
h133        0.00001, 0.00006       0.000054                    0.00006
h134        0.0001, 0.0006         0.000134                    0.00013
h135        0.00001, 0.00006       0.000052                    0.00006
h141        0.001, 0.006           0.003146                    0.001029
h142        0.0001, 0.0006         0.000268                    0.000582
h143        0.0001, 0.0006         0.000104                    0.000101
h144        0.001, 0.006           0.001166                    0.001293
h145        0.001, 0.006           0.002261                    0.00122
h211        0.0001, 0.0006         0.000225                    0.000227
h212        0.001, 0.006           0.001                       0.001
h213        0.00001, 0.00006       0.000011                    0.000041
h214        0.0001, 0.0006         0.000599                    0.000368
h215        0.0001, 0.0006         0.0001                      0.000124
h221        0.001, 0.006           0.003502                    0.002706
h222        0.0001, 0.0006         0.000393                    0.0003
h223        0.0001, 0.0006         0.000586                    0.000592
h224        0.001, 0.006           0.001132                    0.001034
h225        0.001, 0.006           0.001362                    0.001318
h231        0.0001, 0.0006         0.000164                    0.0001
h232        0.0001, 0.0006         0.000129                    0.000103
h233        0.00001, 0.00006       0.000059                    0.000056
h234        0.001, 0.006           0.001039                    0.001073
h235        0.0001, 0.0006         0.000173                    0.000167
h241        0.001, 0.006           0.003507                    0.002667
h242        0.001, 0.006           0.002075                    0.001088
h243        0.0001, 0.0006         0.000594                    0.000151
h244        0.001, 0.006           0.001                       0.001015
h245        0.001, 0.006           0.001073                    0.001005
h311        0.0001, 0.0006         0.0001                      0.0001
h312        0.0001, 0.0006         0.0006                      0.000106
h313        0.00001, 0.00006       0.000012                    0.000012
h314        0.001, 0.006           0.004451                    0.004148
h315        0.0001, 0.0006         0.000101                    0.0001
h321        0.001, 0.006           0.001                       0.001
h322        0.001, 0.006           0.005242                    0.001191
h323        0.0001, 0.0006         0.000589                    0.000586
h324        0.001, 0.006           0.00102                     0.00101
h325        0.0001, 0.0006         0.00011                     0.000104
h331        0.0001, 0.0006         0.0001                      0.0001
h332        0.001, 0.006           0.001029                    0.001
h333        0.00001, 0.00006       0.000058                    0.000059
h334        0.001, 0.006           0.001039                    0.00101
h335        0.00001, 0.00006       0.000052                    0.000016
h341        0.001, 0.006           0.001                       0.001
h342        0.001, 0.006           0.005936                    0.001083
h343        0.0001, 0.0006         0.000598                    0.000595
h344        0.001, 0.006           0.00101                     0.001073
h345        0.001, 0.006           0.001                       0.001
h411        0.0001, 0.0006         0.00021                     0.000104
h412        0.001, 0.006           0.006                       0.00101
h413        0.00001, 0.00006       0.000059                    0.00006
h414        0.001, 0.006           0.005976                    0.004636
h415        0.0001, 0.0006         0.000349                    0.000132
h421        0.001, 0.006           0.001024                    0.00101
h422        0.001, 0.006           0.005189                    0.003639
h423        0.0001, 0.0006         0.000576                    0.00057
h424        0.001, 0.006           0.001015                    0.001029
h425        0.001, 0.006           0.001005                    0.001
h431        0.0001, 0.0006         0.0001                      0.0001
h432        0.001, 0.006           0.001274                    0.001
h433        0.000001, 0.000006     0.000004                    0.000003
h434        0.001, 0.006           0.001                       0.001005
h435        0.00001, 0.00006       0.000059                    0.000057
h441        0.001, 0.006           0.001543                    0.001078
h442        0.001, 0.006           0.002017                    0.001538
h443        0.0001, 0.0006         0.000115                    0.000103
h444        0.001, 0.006           0.001406                    0.001406
h445        0.001, 0.006           0.001                       0.001
Fig. 11. Performance testing of the FLC having linear membership function distributions – TIG data: (a) target vs. calculated values of output 1; (b) target vs. calculated values of output 2; (c) target vs. calculated values of output 3; (d) target vs. calculated values of output 4.
and 3 are found to perform better than Approach 1, i.e., GA-tuned FLCs (developed cluster-wise) are seen to outperform the cluster-wise linear regression analysis. It could be due to the fact that the input-output relationships of the above processes are not exactly linear in nature, although they have been assumed to be so. In both Approach 2 and Approach 3, the GA has brought the necessary adaptability during the training phase. Thus, they are able to handle new situations (i.e., testing on random test cases) more efficiently. On the other hand, Approach 1 lacks that adaptability. Moreover,
Fig. 12. Performance testing of the FLC having nonlinear membership function distributions – TIG data: (a) target vs. calculated values of output 1; (b) target vs. calculated values of output 2; (c) target vs. calculated values of output 3; (d) target vs. calculated values of output 4.
a close look at the above results indicates that Approach 3 has performed slightly better than Approach 2. This might be because the linear membership function distributions of the input variables used in Approach 2 have been replaced by nonlinear ones in Approach 3. Thus, Approach 3
Fig. 13. Comparison of three approaches in terms of % deviation in prediction of different outputs – TIG: (a) Output 1: weld bead front height (legend keys 'CWLR1', 'LRFLC1', 'NLRFLC1'); (b) Output 2: weld bead front width ('CWLR2', 'LRFLC2', 'NLRFLC2'); (c) Output 3: weld bead back height ('CWLR3', 'LRFLC3', 'NLRFLC3'); (d) Output 4: weld bead back width ('CWLR4', 'LRFLC4', 'NLRFLC4'). Each panel plots % deviation against the test cases.
has recorded better performance than Approach 2 in capturing the nonlinear relationships of the processes.
7 Concluding Remarks
To establish input-output relationships of two physical processes, three approaches (one is a statistical regression analysis and the other two deal with GA-tuned FLCs) have been developed cluster-wise, and their performances have been tested on 50 randomly-generated new cases. From the above study, the following conclusions have been drawn:
1. GA-tuned FLCs (both Approaches 2 and 3) developed based on the Takagi and Sugeno approach have outperformed the linear regression analysis, i.e., Approach 1, for the random test cases. This might be because adaptability has been injected into the FLCs during their training carried out by using a GA, whereas Approach 1 has no provision to gain such a property.
2. Approach 3 has yielded better results than Approach 2. This could be because the membership function distributions of the input variables have been assumed to be nonlinear in Approach 3, whereas those in Approach 2 are linear. Thus, Approach 3 is able to capture the nonlinearity of the processes more effectively. In fact, Approach 3 is found to be the best of the three.
3. Performances of the approaches are problem-dependent.
8 Scope for Future Work
The present chapter deals with an effective way of designing FLCs cluster-wise, based on the Takagi and Sugeno approach. However, there is scope for further improvement of the developed algorithms. For example, the consequent parts of the FLCs have been considered as linear functions of the input variables. This may not work well for a highly nonlinear process, for which second or higher order functions should be tried. Moreover, a sensitivity analysis may be carried out for the developed approaches in future.
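To make the suggested extension concrete, the following minimal Python sketch shows what a second order consequent could look like: the input vector is expanded with squares and pairwise products before the coefficients are applied. This is only an illustration of the idea and not something developed in the chapter; the function names and coefficient values are hypothetical.

```python
from itertools import combinations_with_replacement


def second_order_terms(x):
    """Expand inputs into [1, x_i ..., x_i * x_j for i <= j].

    For p inputs this gives 1 + p + p * (p + 1) / 2 terms.
    """
    terms = [1.0] + [float(v) for v in x]
    terms += [
        float(x[i] * x[j])
        for i, j in combinations_with_replacement(range(len(x)), 2)
    ]
    return terms


def second_order_consequent(coeffs, x):
    """Consequent value as a dot product of coefficients and expanded terms."""
    terms = second_order_terms(x)
    if len(coeffs) != len(terms):
        raise ValueError("one coefficient is needed per expanded term")
    return sum(c * t for c, t in zip(coeffs, terms))


# Two hypothetical inputs give 1 + 2 + 3 = 6 terms: 1, x1, x2, x1^2, x1*x2, x2^2.
print(second_order_terms([1.0, 2.0]))
print(second_order_consequent([0.5, 1.0, 1.0, 0.1, 0.2, 0.3], [1.0, 2.0]))
```

With five inputs, as in the TIG problem, each consequent would need 21 coefficients instead of 6, so the number of values tuned by the GA would grow accordingly.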
References
1. Zadeh LA (1965) Fuzzy sets. Information and Control 8(3):338–353
2. Mamdani EH, Assilian S (1975) An experiment in linguistic synthesis with a fuzzy logic controller. Int J Man Mach Stud 7:1–13
3. Takagi T, Sugeno M (1985) Fuzzy identification of systems and its application to modeling and control. IEEE Trans Syst Man Cybern SMC-15:116–132
4. Goldberg DE (1989) Genetic algorithms in search, optimization, and machine learning. Addison-Wesley, Reading, MA
5. Karr C (1991) Genetic algorithms for fuzzy controllers. AI Expert, pp 26–33
6. Thrift P (1991) Fuzzy logic synthesis with genetic algorithms. In: Proceedings of 4th International Conference on Genetic Algorithms (ICGA'91), pp 509–513
7. Cordon O, Gomide F, Herrera F, Hoffmann F, Magdalena L (2004) Ten years of genetic-fuzzy system: current framework and new trends. Fuzzy Set Syst 141:5–31
8. Hoffmann F, Pfister G (1997) Evolutionary design of a fuzzy knowledge base for a mobile robot. Int J Approx Reason 17(4):447–469
9. Ishibuchi H, Nakashima T, Murata T (1999) Performance evaluation of fuzzy classifier systems for multidimensional pattern classification problems. IEEE Trans Syst Man Cybern 29:601–618
10. Cordon O, del Jesus M, Herrera F, Lozano M (1999) MOGUL: a methodology to obtain genetic fuzzy rule-based systems under the iterative rule learning approach. Int J Intell Syst 14(11):1123–1153
11. Furuhashi T, Miyata Y, Nakaoka K, Uchikawa Y (1995) A new approach to genetic based machine learning and an efficient finding of fuzzy rules – proposal of Nagoya Approach. Lecture notes on artificial intelligence 101:178–189
12. Yupu Y, Xiaoming X, Wengyuan Z (1998) Real-time stable self-learning FNN controller using genetic algorithm. Fuzzy Set Syst 100:173–178
13. Angelov P, Buswell R (2003) Automatic generation of fuzzy rule-based models from data by genetic algorithms. Inf Sci 50:17–131
14. Nandi A, Pratihar D (2004) Automatic design of fuzzy logic controller using a genetic algorithm – to predict power requirement and surface finish in grinding. J Mater Process Technol 148:288–300
15. Ghosh A, Nath B (2004) Multi-objective rule mining using genetic algorithms. Inf Sci 163:123–133
16. Hui N, Pratihar D (2005) Automatic design of fuzzy logic controller using a genetic algorithm for collision-free, time-optimal navigation of a car-like robot. Int J Hybrid Intell Syst 2:161–187
17. Smith SM, Comer DJ (1992) An algorithm for automated fuzzy controller tuning. In: Proceedings of IEEE conference on fuzzy systems, pp 615–622
18. Kim JH, Cho HC, Choi YK, Jeon HT (1995) On design of the self-organizing fuzzy logic system based on genetic algorithms. In: Proceedings of the fifth IFSA world congress on fuzzy logic and its applications to engineering. In: Bien Z, Min KC (eds.) Information sciences and intelligent systems, Kluwer, Dordrecht, Netherlands, pp 101–110
19. Jang JSR (1993) ANFIS: adaptive-network-based fuzzy inference system. IEEE Trans Syst Man Cybern 23(3):665–685
20. Holland J (1975) Adaptation in natural and artificial systems. The University of Michigan Press, Ann Arbor, USA
21. Goldberg DE, Deb K (1991) A comparison of selection schemes used in genetic algorithms. In: Rawlins GJE (ed.) Proceedings of foundations of genetic algorithms, pp 69–93
22. Spears WM, De Jong KA (1991) An analysis of multi-point crossover. In: Rawlins GJE (ed.) Proceedings of foundations of genetic algorithms, pp 301–315
23. Bezdek JC (1981) Pattern recognition with fuzzy objective function algorithms. Kluwer Academic Publishers, Norwell, MA, USA
24. Yao J, Dash M, Tan ST (2000) Entropy-based fuzzy clustering and fuzzy modeling. Fuzzy Set Syst 113:381–388
25. Nandi AK, Pratihar DK (2004) Design of a genetic-fuzzy system to predict surface finish and power requirement in grinding. Fuzzy Set Syst 148:487–504
26. Jain RK, Jain VK (2000) Optimum selection of machining conditions in abrasive flow machining using neural networks. J Mater Process Technol 108:62–67
27. Juang SC, Tarng YS, Lii HR (1998) A comparison between the back-propagation and counter-propagation networks in the modeling of the TIG welding process. J Mater Process Technol 75:54–62
Evolutionary Fuzzy Modelling for Drug Resistant HIV-1 Treatment Optimization
Mattia Prosperi and Giovanni Ulivi
Summary. Fuzzy relational models for genotypic drug resistance analysis in Human Immunodeficiency Virus type 1 (HIV-1) are discussed. Fuzzy logic is introduced to model high-level medical language and viral and pharmacological dynamics. In-vitro experiments on genotype/phenotype pairs and in-vivo clinical databases form the basis for knowledge mining. Fuzzy evolutionary algorithms and fuzzy evaluation functions are proposed to mine resistance rules, to improve computational performance and to select relevant features.
1 Introduction
1.1 Artificial Intelligence in Medicine
In recent years medicine and artificial intelligence have converged. Statistical analysis has long been a valid support for epidemiology and diagnosis, but once the first genomic regions were sequenced and interpreted, the scenario and its needs became more complex, involving biology and biochemistry. Today computer science, through machine learning and intelligent systems, supports medicine and biology in several fields, from sequence analysis to protein structure and function prediction, gene regulatory network modelling, molecular design and medical diagnosis. Medical and biological databases have started to grow and to assume standard structures. Biological systems are complex systems; medical measures are extremely variable even under the same conditions and are only indirect indicators of the real processes. One way to handle them is to use the concepts of uncertainty and vagueness from Fuzzy Logic, which is therefore a suitable modelling framework. In the HIV treatment optimization scenario, drug resistance develops through the genomic variation of the virus under drug pressure: the high mutation rate produces a huge state-variable space. A large number of different drugs attack different viral target genes and have to be combined in order to maintain viral suppression and limit the chance that resistance arises. Moreover, in the human body virus/drug interactions are affected by a host of cofactors, either uncontrollable or unobservable, thus requiring approximate reasoning models.
M. Prosperi and G. Ulivi: Evolutionary Fuzzy Modelling for Drug Resistant HIV-1 Treatment Optimization, Studies in Computational Intelligence (SCI) 82, 251–287 (2008) © Springer-Verlag Berlin Heidelberg 2008 www.springerlink.com
1.2 Road Map
The first sections (2 and 3) of this chapter provide the necessary introduction and literature references for further reading. In detail, section 2 gives the biological background on antiviral treatments, resistance development, viral sequence analysis and data collection. Section 3 is an overview of the current machine learning approaches for drug resistant HIV-1 treatment optimization. Section 4 introduces fuzzy modelling for medical science: building on previous studies in the field, it defines the fuzzy relational system for in-vitro and in-vivo modelling. Section 5 discusses optimization techniques, describing Fuzzy Genetic Algorithms, Random Searches and Fuzzy Feature Selection criteria. Section 6 presents application results for phenotype and in-vivo clinical prediction, with conclusions and future perspectives.
2 Background on HIV Treatment and Drug Resistance Onset

Many microorganisms can enter the human body and cause harm, including viruses, fungi, bacteria and protozoa. Once inside the body, the primary goal of a microorganism is to survive and reproduce itself. Most antimicrobial agents are designed to kill these pathogens or prevent them from reproducing. When a microorganism such as a virus continues to replicate despite the pressure of a drug, mutants are selected that adapt more efficiently to growing in the presence of a certain drug concentration: this results in the phenomenon of drug resistance. When drug resistance occurs, the drug (or combination of drugs) can no longer keep the microorganism in check. Over time, the treatment can stop working completely. Evolution consists of a selective pressure from the environment that acts on organisms: it selects the best individuals from populations in whose gene pool random mutations appear, and the advantages acquired from mutations are transmitted to the progeny. The Human Immunodeficiency Virus (a Lentivirus divided into two major families, HIV-1 and HIV-2) has a rapid rate of mutation and has through this developed resistance to antivirals. A brief introduction for non-biologists is given in [18], which is the reference for the following description. HIV-1 causes a progressive deterioration of the immune system, leading almost relentlessly to AIDS (Acquired Immune Deficiency Syndrome) and death due to opportunistic infections. Modelling the mechanisms of resistance requires investigating the viral genome (which is in the form of RNA) and the genes encoded within it. A gene is a sequence of nucleotides (of four varieties), while the genome encodes proteins that are important in the virus life cycle. A protein is a sequence of amino acids, which are encoded by blocks of three adjacent nucleotides in the genome, called codons. Genomic sequences are the building blocks of biological mechanisms: computer science is today necessary to investigate the genes and their functions, since even simple organisms like viruses are characterized by long character sequences. The basic reference for sequence analysis is [5], which is also a complete and general guide for the whole set of derived subtasks.

2.1 HIV Replication and Treatment Design

In the virus life cycle (when it reproduces) the genome string has to be copied from one generation to the next. Soon after HIV enters the body, the virus begins reproducing at a rapid rate and billions of new viruses are produced every day. In the process, HIV produces both perfect copies of itself (wild type) and copies containing errors (mutants): copying errors occur frequently. Mutations can change the virus structure or functions and thus modify its interaction with the environment: the high mutation rate of HIV (combined with the fact that it attacks the immune system) leads to difficulties in the design of a vaccine and to rapid selection of mutant strains resistant to drugs. At present three classes of drugs are approved by the FDA (the US Food and Drug Administration) as antiviral treatment against HIV: Reverse Transcriptase inhibitors (RTi), Protease inhibitors (PRi) and Fusion inhibitors (Fi). Each class acts against a step of the viral replication process, and there are around 15 different molecules on the market. The viral genotype is an RNA sequence over a 4-character alphabet, from which mutations are usually extracted by comparing the sequence with the wild type. Mutations are usually identified by a number representing a codon (a position in the genomic sequence), preceded by a letter that indicates the amino acid present in the wild type (i.e. the standard virus, without mutations) and followed by another letter that indicates the amino acid that replaces it in the mutant.
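The mutation notation described above can be illustrated with a minimal sketch. The function names `parse_mutation` and `list_mutations` and the short amino-acid fragments are hypothetical and chosen only for illustration (they are not real, complete Reverse Transcriptase sequences); the sketch assumes the two sequences are already aligned.

```python
import re

def parse_mutation(label: str) -> tuple[str, int, str]:
    """Split a label such as 'M184V' into (wild-type aa, codon, mutant aa)."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", label)
    if match is None:
        raise ValueError(f"not a single-substitution label: {label!r}")
    wild_aa, codon, mutant_aa = match.groups()
    return wild_aa, int(codon), mutant_aa

def list_mutations(wild_type: str, sample: str, first_codon: int = 1) -> list[str]:
    """Compare two aligned amino-acid strings and list the substitutions."""
    if len(wild_type) != len(sample):
        raise ValueError("sequences must be aligned and of equal length")
    mutations = []
    for offset, (wt_aa, aa) in enumerate(zip(wild_type, sample)):
        if wt_aa != aa:
            mutations.append(f"{wt_aa}{first_codon + offset}{aa}")
    return mutations

# Toy fragments (assumption): eight residues numbered from codon 180.
wild_fragment = "IYQYMDDL"
sample_fragment = "IYQYVDDL"
found = list_mutations(wild_fragment, sample_fragment, first_codon=180)
print(found)
print(parse_mutation(found[0]))
```

Numbering the fragment from `first_codon` keeps the labels in the codon coordinates of the gene rather than in string offsets.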
For instance, a mutation that usually confers resistance to Lamivudine (3TC) is M184V: it indicates that at codon 184 the amino acid Methionine (M) has been replaced by Valine (V). During infection there is not a single virus in the body, but a large population of mixed viruses called quasispecies. The wild type virus is the one that has naturally evolved with the highest replicative capacity: before therapy is started, it is the most abundant in the body and dominates all other quasispecies. Some mutant variants are too weak to survive and/or cannot reproduce. Others are strong enough to reproduce but still cannot compete with the fitter wild type; as a result, they are less numerous in the body than the wild type. A drug usually works by blocking a key step in the virus life cycle. Some variants have mutations that allow the virus to partly, or even fully, resist an antiretroviral drug. Under constant therapy, resistant mutant strains can become dominant in the patient (despite having lower replicative capacity or fitness). This is called selective resistance, because the mutant is selected by the drug.
If it is not recognized, treatment loses its efficacy. Selected mutants are more challenging to treat because therapy options are reduced. If the drug regimen is changed, new mutations can be selected; furthermore, there are mutations (such as the insertion at codon 69 in the Reverse Transcriptase gene) that cause cross-resistance to a whole class of antiretrovirals. Treatment interruptions have shown that within a couple of months HIV reverts to the wild type but maintains low concentrations of resistant mutants, so if a heavily experienced drug is reused, resistance arises shortly. Combined therapies that involve multiple drugs are an approach to avoiding resistance. If the virus changes to resist one drug but is inhibited by several others, it can be suppressed to undetectable levels (even though complete eradication is not possible). Combined treatments can contain from three to five different drugs, but often lead to tolerability and toxicity problems. Such therapies (usually two or three RTis and at least one PRi) are called HAARTs (Highly Active Anti Retroviral Therapies) or cARTs (combined Anti Retroviral Therapies): HAARTs usually produce an appreciable reduction of viral load within three to four weeks, and can be sustained over a long time window. Unfortunately mutations also occur under HAARTs, even though at a lower rate.

2.2 Experimental Settings and Data Collection

Before being approved by the FDA and commercialized, drugs follow a long process in which their efficacy is tested through different phases (namely, from phase I to phase IV): first they are designed, synthesized and put in viral cultures; once they are proven to be effective in-vitro, they are tested for absorption levels and toxicity in-vivo, until they are judged to be relatively safe for the human body and effective in viral eradication. In-vitro studies, however, continue to be carried out even after commercialization, in order to detect further resistance development.
In-vitro and in-vivo studies are the data available for modelling. In-vitro studies are collections of experiments that measure how a mutated virus responds in a culture to inhibition by a single drug, compared with the replication of the wild type under the same drug pressure. The phenotype is a numeric indicator of viral replication power, expressed as the Fold Change of the drug concentration needed to inhibit 50% of the viral replication as compared to a wild type, drug-susceptible reference viral strain; the data sets are pairs of genotype sequences and Fold Change values. At present, thousands of such pairs are freely available for each drug, and their quality is fairly high, thanks to fixed environmental conditions and repeatability. These tests are expensive compared to the cost of sequencing a viral strain, so a first attempt is to define models that predict the phenotype from genotype sequences. In-vivo studies usually are databases collecting patients' Follow Ups, i.e. analyses carried out before and after a therapy switch: usually a therapy is stopped and considered failing when the Viral Load¹ is detectable in the patient's blood and/or the CD4+ T² cell counts are very low; when the virus is detectable, it can also be sequenced, so it is possible to find out which mutations have been selected. Unfortunately, it is not always possible to obtain clean and large data sets. Prospective cohorts (called clinical trials: studies on precise therapeutic protocols led by a team of physicians, in which patients are checked weekly) are the most reliable ones, but they are not always free and not so large. Retrospective cohorts (collections of clinical reports from hospitals) are larger in size, but suffer from noise such as time delays, missing data and non-adherent patients. In addition, in-vivo measures are biased by the systematic errors of the instruments: viral load measures are reliable within 1 Log and cannot detect copies under certain limits (500 or 50 cp/ml); genotype sequencing methods have an accuracy of 90% in revealing mutations, but their performance decreases with plasma samples of low viral concentration. Even input errors are not negligible: databases are not automated and have poor relational structures and implementations, and most data are copied manually from paper clinical reports into spreadsheets. There are thousands of instances available, but the variability of in vivo data is extremely high and the space of investigation is huge: in theory there are about $20^{400}$ possible genotypes (in the sequenced region) and $\binom{15}{1} + \cdots + \binom{15}{5} = 4943$ therapeutic combinations.
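The size of the search space quoted above can be checked with a few lines of arithmetic. This is only a sanity-check sketch; it assumes, as the figures in the text suggest, 15 available drugs combined into regimens of one to five drugs, and 20 possible amino acids at each of 400 sequenced positions.

```python
from math import comb

N_DRUGS = 15            # approximate number of marketed molecules (from the text)
MAX_REGIMEN_SIZE = 5    # regimens of one up to five drugs
N_AMINO_ACIDS = 20
N_POSITIONS = 400       # assumed length of the sequenced region, in codons

# Number of drug subsets of size 1..5: C(15,1) + ... + C(15,5)
n_regimens = sum(comb(N_DRUGS, k) for k in range(1, MAX_REGIMEN_SIZE + 1))

# Number of theoretically possible amino-acid sequences in the region
n_genotypes = N_AMINO_ACIDS ** N_POSITIONS

print("therapeutic combinations:", n_regimens)
print("decimal digits of the genotype count:", len(str(n_genotypes)))
```

Python integers are arbitrary precision, so the genotype count can be computed exactly even though it is far too large to enumerate.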
¹ Viral Load is the virion count in the plasma.
² CD4+ T cells are immune cells targeted and infected by the virus, so their count is a measure of the immune response to the infection.

3 Machine Learning for Drug Resistant HIV

cARTs with three or more antiretroviral drugs have led to significant decreases in HIV-related morbidity and mortality by reducing HIV replication. The goal of therapy is to suppress the plasma viral load as much as possible for as long as possible. Given that viral eradication is not feasible with the current treatment armamentarium, HIV mutants eventually develop, with different degrees of decreased susceptibility to the ongoing treatment regimen and of cross-resistance to other agents. This results in virologic rebound and eventually disease progression. For these patients, the aim is to build a therapy optimization tool that explores the efficacies of a set of possible cARTs and ensures the best subsequent Viral Load reduction, taking into account input attributes such as the viral genotype, the Baseline Viral Load (the virion count in the plasma taken at a therapy switch), CD4+ T cell counts, pharmacodynamics and pharmacokinetics, viral drug resistance mechanisms, et cetera. One of the first published studies on drug resistant HIV treatment optimization was CTSHIV (Customized Treatment Strategy for HIV, see [18]): the system operated on a predetermined, fuzzy-like set of in vivo viral drug resistance rules applied to a patient's viral genotype and nearby mutants, and used a branch-and-bound algorithm to find optimal drug combinations. However, this study did not focus on mining knowledge from data (i.e. finding the rules), which became a must in later studies. Nowadays, together with the choice of an appropriate model, the main issue is to mine drug resistance mechanisms from the clinical data. The support for this task is given by in-vitro (coming from cell cultures) and in-vivo (coming from plasma analyses on patients) data sets. Predicting single in-vitro phenotypes from viral genotypic data is a widely explored task. Linear Regressors, Decision Trees and Support Vector Machines applied to genotype-phenotype pairs are able to make predictions that correctly explain 30 to 79% of phenotypic variance (depending on the drug) [20, 24–26, 41]. Nevertheless, the final goal is to predict the real (in-vivo) viral load change (and consequently the CD4+ T cell count response) for cART regimens, either by analyzing associations of genotypes with phenotypes and subsequently of phenotypes with treatment response, or by directly analyzing associations of genotypes with treatment response. Various algorithms for viral genotype interpretation are currently available that give indications about drug resistance: some base their inferences on Neural Networks [30–32, 42], others are based on Case Based Reasoning (CBR) and the k-Nearest Neighbor (kNN) algorithm [33], others are Rule-Based [37, 43–45], others are Fuzzy-Rule-Based [29], and others take into account the modelling of evolutionary pathways or compute genetic barrier probabilities [7, 21–23, 27, 28, 34, 35]. The Rule-Based algorithms display predictions in terms of resistance classes (such as high, intermediate, low), while the Neural Network approach and CBR-kNN attempt to predict actual viral load changes by regression.
Each approach has advantages and disadvantages. The rule-based ones are easy to interpret, but their tuning and updating are difficult to automate; moreover, correlating the real outcomes with the resistance classes involves arbitrary steps. Neural Networks, on the other hand, are powerful in finding non-linear interactions and easy to train, but they act as black boxes and biological mechanisms cannot easily be extracted from them. Finally, CBR-kNN systems need ad-hoc similarity functions, provide no model and must scan the Case Base every time, but they provide an accurate estimate of the error and an easy-to-understand collection of similar cases. Furthermore, while the CBR-kNN approach provides a desirable local optimization, searching the local neighborhood of input instances, it lacks generalization for unseen regions of the input space; the other methods are capable of making extrapolations, though assuming global optimization.
4 Fuzzy Modelling for HIV Drug Resistance Interpretation

The Human Immunodeficiency Virus develops resistance to various drug combinations through positively selected mutations. It has been shown that a mono- or bi-therapeutic regimen is not able to keep viral loads low and leads in a short time to the selection of resistant strains. With the introduction of HAART techniques an appreciable reduction in viral loads, replication capacity and mutation can be seen. Unfortunately resistance still occurs under these multiple regimens (especially if they are carried out after long single-drug therapies), even though over a longer time scale. Physicians face clear difficulties when they have to choose a drug cocktail among many combinations for each patient, considering viral genotype, tolerability conditions and toxicity. In fact, the present medical knowledge on the relations between mutations and drug resistance is inadequate: rules for therapy optimization are mostly those derived from in vitro studies, but the scenario changes when drugs act in the human body (absorption, immune response, cytotoxic response, latent reservoirs in which the virus escapes, ...) and drug combinations are given to patients. Nevertheless, the (fairly linear) functions that proved feasible in phenotype prediction can be a starting point for in vivo strategies, as can the statistical analyses provided for clinical trials (even though these are often carried out under restricted backgrounds and on small instances). In this section a fuzzy representation model of such a complex system is introduced. References for the theory of fuzzy sets, norms and relations are [1, 2, 11, 13]. The guiding thread will be the attempt to rediscover (to confirm or to reject) medical hypotheses on the mechanisms involved between mutation selection and resistance to antiretrovirals, and to find new associations. The aim is to define a decision system that associates with each pathological instance (i.e. a patient in a certain clinical state, defined by an input attribute set) an appropriate therapeutic regimen, possibly the best one, which ensures the maximum viral load reduction, and which is capable of bringing itself up to date when new medical knowledge or data become available.
The first task is to represent the biological behavior of the virus under drug pressure in fuzzy terms. It has been observed from in vitro studies that, as mutations accumulate over time under drug pressure, resistance increases. There are primary mutations that immediately confer high-level resistance and secondary mutations that contribute to the increase. Some mutations confer cross-resistance to the drugs of the same class, while others can confer susceptibility to one drug (i.e. the treatment is more effective compared to the wild type response) and resistance to another. In vivo viral behavior is different: first, the virus is under multiple drug pressures; second, drugs must be absorbed by the body and concentrations vary; third, HIV attacks different cells and hides in latent reservoirs; fourth, depending on drug combinations and many other factors, evolutionary mutational pathways are different. Moreover, mutations do cluster: among polymorphisms and many uncorrelated single mutations, some aggregate in patterns. Clusters usually overlap (i.e. a mutation can be present in different patterns), but there are situations in which mutually antagonistic mutations determine mutually exclusive clusters: models for the evolutionary pathways of the virus under selective drug pressure are presented in [7, 21–23, 27, 28, 34].
The following subsections present a fuzzy relational framework that models the in-vitro experiments (for genotype→phenotype prediction) and, with an extension, translates a limited set of viral-host dynamics for in-vivo settings. The model takes into account the effect of increased resistance through accumulation of mutations, the possibility of susceptibility mutations and the difference between major and minor mutations. Mutations are examined individually, and weighted functions of them, built with interactive operators, calculate resistance and susceptibility for each single drug. Combined therapies are viewed as functions of single-drug effects. Section 6.3 presents a deeper description of fuzzy terms, rules, relations and input space partitioning, suitable for handling a wider set of biological behaviors but limited in practice by the unavailability of a sufficient amount of training data.

4.1 Fuzzy Medical Diagnosis

The advantages of fuzzy modelling come from the strong flexibility gained by using linguistic variables, i.e. the possibility of modelling hypotheses in a complex natural language such as the medical one. However, due to the multi-dimensional and heterogeneous characteristics of the input attribute space (discrete-character viral genotype mutational sequences, cARTs and real-valued plasma analyses), defining linguistic variables is complex, and mining them is even harder. Before introducing the HIV models, it is worth citing the CADIAG system, a fuzzy inference relational model proposed by Sanchez, Adlassnig and Gupta [9, 10] and developed as an automated tool for medical diagnosis: this study in fact inspired the HIV model. In the Sanchez approach the medical knowledge is represented as a fuzzy relation between symptoms and diseases.
So, given the fuzzy set A of the symptoms observed in the patient³ and the fuzzy relation R = (A → B) representing the medical knowledge that relates symptoms s ∈ S and diseases d ∈ D, a fuzzy set B of the patient's possible diseases can be calculated through

$$B = A \circ R \tag{1}$$

or equivalently

$$\mu_B(d) = \max_{s \in S} \{\min\{\mu_A(s), \mu_R(s, d)\}\} \quad \forall d \in D \tag{2}$$
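As a small worked instance of the max-min composition in (2), the following sketch computes the disease memberships from symptom memberships. The membership values are invented for illustration and the function name `max_min_composition` is hypothetical; this is not the CADIAG implementation.

```python
def max_min_composition(mu_a: list[float], mu_r: list[list[float]]) -> list[float]:
    """mu_B(d) = max over symptoms s of min(mu_A(s), mu_R(s, d))."""
    n_diseases = len(mu_r[0])
    return [
        max(min(a, row[d]) for a, row in zip(mu_a, mu_r))
        for d in range(n_diseases)
    ]

# Hypothetical memberships: three symptoms observed in one patient ...
mu_A = [0.8, 0.3, 0.0]
# ... and a symptom-by-disease relation R (rows: symptoms, columns: diseases).
mu_R = [
    [0.9, 0.1],
    [0.4, 0.7],
    [0.2, 1.0],
]

mu_B = max_min_composition(mu_A, mu_R)
print(mu_B)
```

Here the first disease is supported mainly by the first symptom (min(0.8, 0.9) = 0.8), while the second disease is limited by the weak second symptom (min(0.3, 0.7) = 0.3).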
³ Here membership functions are set up on healthy/ill distributions from real-valued analyses.

The fuzzy relation R should be found by solving

$$T = Q \circ R \tag{3}$$
where R is unknown and Q, T represent respectively the symptoms and the diagnoses made for a set of known cases. Methods to solve the above equation (and its generalizations) are given in [1, 8, 9, 12, 14], and the maximal solution was taken. Regarding the solution of fuzzy relational equations, exact solution algorithms (and solution lattices) exist only for a restricted set of compositional operators (like max-min and max-product); Genetic Algorithms are used to approximate solutions for general ⊤-norm/⊥-conorm compositions. The model was actually built from various R relations (somewhat related to the concepts of plausibility and belief). The CADIAG system (and its refinements) performed very well, yielding 93% accuracy compared with the clinical evidence.

4.2 Fuzzy Relational System for In-Vitro Cultures

Genotype→phenotype prediction is modelled using a simple fuzzy relational composition: by maintaining this structure it will be possible to extend the system in order to handle more complex behaviors. Define a first relation M that represents the mutations from the wild type virus present in a viral genotype (sequenced from a patient's plasma sample or cell culture). This relation is a matrix in which each row corresponds to an observation, i.e. to a sequenced genotype that has been selected for a phenotypic test. The generic element $m_{i,j}$ of M contains the fraction of mutation j in genotype i. At present the sequencing instruments are not able to measure the exact prevalence of an amino acid in the total population; they only retrieve the most prevalent mixtures. So, for instance, if viral genotype sequence i reveals M184MVI, it means that at position 184 there can be two amino acid substitutions, V and I, living alongside the wild type M ("can be" means that the instrument was not able to discriminate either which nucleotide was present or how many); a-priori equal fraction values are assigned to them (i.e. $m_{i,184M} = m_{i,184V} = m_{i,184I} = \frac{1}{3}$).
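The a-priori equal split of a sequencing mixture can be sketched as follows. The function name `mixture_fractions` and the dictionary keys (codon followed by amino acid) are assumptions made for illustration; they are one possible way to index the columns of M, not the encoding used by the authors.

```python
import re

def mixture_fractions(call: str) -> dict[str, float]:
    """Turn a call such as 'M184MVI' into equal fractions per observed amino acid.

    The leading letter is the wild-type amino acid, the number is the codon,
    and the trailing letters are the amino acids reported at that codon.
    """
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z]+)", call)
    if match is None:
        raise ValueError(f"unrecognized mutation call: {call!r}")
    _wild_aa, codon, observed = match.groups()
    share = 1.0 / len(observed)   # equal a-priori fraction for each amino acid
    return {f"{codon}{aa}": share for aa in observed}

row_entries = mixture_fractions("M184MVI")
for key, value in row_entries.items():
    print(key, round(value, 3))
```

A call without a mixture, such as `M184V`, simply yields a single entry with fraction 1.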
The m variable is assumed to be a single amino acid substitution in a codon, but it can equally be a mixture of amino acids in the same codon or, more generally, a set of codons (i.e. a cluster, although different distance measures for the membership functions are then needed). Let then W be a weight matrix in which the generic element $w_{j,d}$ is the degree of resistance (in [0,1]) shown by the single mutation j to the drug d. The matrix R is the fuzzy composition M ◦ W, yielding the overall resistance values $r_{i,d}$ for each viral genotype i and drug d:

$$
M \circ W = R, \qquad
\begin{pmatrix}
m_{1,1} & \cdots & m_{1,M} \\
\vdots & \ddots & \vdots \\
m_{N,1} & \cdots & m_{N,M}
\end{pmatrix}
\circ
\begin{pmatrix}
w_{1,AZT} & \cdots & w_{1,3TC} \\
\vdots & \ddots & \vdots \\
w_{M,AZT} & \cdots & w_{M,3TC}
\end{pmatrix}
=
\begin{pmatrix}
r_{1,AZT} & \cdots & r_{1,3TC} \\
\vdots & \ddots & \vdots \\
r_{N,AZT} & \cdots & r_{N,3TC}
\end{pmatrix}
\tag{4}
$$
Define furthermore another weight matrix W′, containing the degrees of susceptibility (i.e. how much a mutation increases the power of a drug), and obtain a result matrix S:

$$M \circ W' = S \tag{5}$$
If the fuzzy ⊤-norm and ⊥-conorm compositional operators used are interactive ones (like bounded sum and bounded product, or the algebraic norms), the accumulation effect is explained. The two relations R and S give information about two opposite viral behaviors: in fact, nothing ensures that a virus resistant to a certain drug is not also susceptible to it, and vice versa. For instance, suppose that drug d selects the resistance mutations m1 and m2. Drug d′ also selects m1 as a resistance mutation, but not m2, which alone instead confers susceptibility to d′. So, if a mutated genotype with m1 and m2 (selected by d) is grown under d′ pressure, the corresponding cell culture will see the growth of a strain whose phenotype shares both resistance and susceptibility characteristics. One of the two will dominate, or maybe the effect will be compensatory. There are several ways to interpret this contradiction, which in any case remains well depicted by the two matrices R and S: the ratio between resistance and susceptibility, the max value (domination effect), the mean, a formula like ¬susceptible ∧ resistant, and so on. The resistance and susceptibility result matrices R and S need to be combined and transformed in order to yield a Log Fold Change value for phenotype prediction. The transforming function has to be a general f : [0, 1] → [−∞, ∞], but must take into account the actual bounds registered by the experiments. A solution is to calculate the phenotypic matrix P element-wise as $p_{i,d} = \tanh(r_{i,d} - s_{i,d})$: in Section 6.1, which presents the results, this function is shown to provide accurate predictions. The relations W and W′ then have to be estimated from the data.
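To make the accumulation effect and the tanh transformation concrete, here is a minimal sketch for one genotype and one drug. It assumes the algebraic product as ⊤-norm and the algebraic (probabilistic) sum as ⊥-conorm, which is only one of the interactive operator pairs mentioned in the text; the weights and the function names are hypothetical.

```python
import math

def algebraic_composition(m_row: list[float], w_col: list[float]) -> float:
    """Sup-T composition with algebraic operators.

    T-norm: m * w; conorm: a + b - a*b, aggregated over all mutations,
    which equals 1 - prod_j (1 - m_j * w_j).
    """
    complement = 1.0
    for m, w in zip(m_row, w_col):
        complement *= 1.0 - m * w
    return 1.0 - complement

def predicted_phenotype(m_row, w_resistance, w_susceptibility) -> float:
    """p = tanh(r - s) for one genotype and one drug."""
    r = algebraic_composition(m_row, w_resistance)
    s = algebraic_composition(m_row, w_susceptibility)
    return math.tanh(r - s)

# Hypothetical weights for three mutations and a single drug.
w_res = [0.6, 0.5, 0.9]
w_sus = [0.0, 0.0, 0.4]

one_mutation = [1.0, 0.0, 0.0]    # only the first mutation is present
two_mutations = [1.0, 1.0, 0.0]   # first and second mutation are present

print(algebraic_composition(one_mutation, w_res))
print(algebraic_composition(two_mutations, w_res))
print(predicted_phenotype(two_mutations, w_res, w_sus))
```

With these operators each additional resistance mutation raises r without ever exceeding 1, which is the accumulation behavior that a plain max-min composition would not show.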
Given a genotype and a drug, there is a corresponding Log Fold Change value. In order to estimate the weight matrices, the M relation is filled with the values coming from each sequenced genotype in the available data, and the corresponding phenotypic values are stored in a matrix P: define a loss function $L(f(R = M \circ W,\, S = M \circ W'),\, P) = L(\hat{P}, P)$, where $\hat{P}$ denotes the predicted phenotypes, and minimize it with an optimization algorithm. In Section 5 fast optimization algorithms (Fuzzy Genetic Algorithms and Random Searches) and a set of suitable loss functions (that also take Feature Selection into account) are discussed.

4.3 Models for In-Vivo Clinical Data

Facing in vivo drug resistance and viral outcome prediction is a harder task. The best way would probably be to have a predetermined rule system and optimize its parameters: the problem is that the rules have to be found too. Relying on the existing rules would avoid the problem of searching for a partition of the input space; however, the optimization would be biased and would not allow new associations to be found. Moreover, the databases collecting in-vivo analyses are still fragmented, small in size and incomplete. The following models are a compromise due to this lack of training data. Two methods are presented: the first one, already known in the literature, uses the in-vitro predictor to infer conclusions for in-vivo treatments; the second is an extension of the in-vitro fuzzy relational predictor introduced above, modified to capture in-vivo dynamics, which does not need in-vitro knowledge and results.

Existing Model

This model, proposed by Beerenwinkel in [7], is one of the few complete and well documented studies in the literature that can be compared and/or integrated with the system presented in the previous subsection. It is worth citing here because it is not a standard machine learning technique and shares many of the modelling assumptions of the fuzzy system. The author assumes that in vitro genotype→phenotype predictions are a reliable starting point to predict in vivo cART efficacies. This is an arbitrary hypothesis, and statistical studies of the relation between phenotype and clinical response still do not give precise confidences, but this application nevertheless showed promising results. Assume a genotype→phenotype in vitro predictor is available: for instance the fuzzy predictor described above, a Linear Regressor or a Support Vector Machine regressor (the latter was the one chosen in [7]). Analysis of phenotype predictions (among naive and treated patients) shows large differences in range, location and deviation across models, but in general reveals a bimodal nature of the distributions (resistance and susceptibility) over the whole set of drugs (even though not clearly for some drugs), as can be seen in Figure 1.
Thus Beerenwinkel models the probability density of predicted phenotypes (y) using a two-component Gaussian mixture model for each drug: $\alpha\,\phi(y, \mu_1, \sigma_1) + (1 - \alpha)\,\phi(y, \mu_2, \sigma_2)$, where $\phi(y, \mu_i, \sigma_i)$ is the density of the normal distribution i with mean $\mu_i$ and standard deviation $\sigma_i$, and α is the mixing parameter. Parameters are estimated by the Expectation Maximization (EM) algorithm. Then a log-likelihood ratio is defined to decide whether a given phenotype more likely belongs to the resistant or to the susceptible subpopulation:

$$l(y) = \log \frac{\Pr(res \mid y)}{\Pr(sus \mid y)} \tag{6}$$
Fig. 1. Two-Component Gaussian Mixture Model for a Generic Drug - Phenotype Log Fold Change on x axis, Probability of Resistance on y
From Bayes' formula it follows that

$$l(y) = \log \frac{\Pr(y \mid res)\,\Pr(res) / \Pr(y)}{\Pr(y \mid sus)\,\Pr(sus) / \Pr(y)} \tag{7}$$

$$l(y) = \log \frac{\phi(y, \mu_1, \sigma_1)}{\phi(y, \mu_2, \sigma_2)} + \log \frac{\Pr(res)}{\Pr(sus)} \tag{8}$$
Then l(y) is approximated by its tangent l′(y) at its zero $y_0$. Finally the probability score ps is introduced as the logistic function of l′(y) for a given genotype with respect to a drug:

$$ps = \frac{1}{1 + e^{-l'(y)}} \approx \Pr(res \mid y) \tag{9}$$
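The chain from the mixture parameters to the probability score can be sketched numerically. The sketch below makes several assumptions that are not stated in the source: component 1 is taken as the resistant one with Pr(res) = α, the zero of l is located by bisection between the two component means, and the mixture parameters are invented rather than fitted by EM. All function names are hypothetical.

```python
import math

def log_normal_pdf(y, mu, sigma):
    """Logarithm of the normal density phi(y, mu, sigma)."""
    return -math.log(sigma * math.sqrt(2.0 * math.pi)) - (y - mu) ** 2 / (2.0 * sigma ** 2)

def log_ratio(y, alpha, mu_res, sd_res, mu_sus, sd_sus):
    """l(y) as in (8), assuming Pr(res) = alpha and Pr(sus) = 1 - alpha."""
    return (log_normal_pdf(y, mu_res, sd_res) - log_normal_pdf(y, mu_sus, sd_sus)
            + math.log(alpha / (1.0 - alpha)))

def log_ratio_slope(y, mu_res, sd_res, mu_sus, sd_sus):
    """Derivative of l(y) with respect to y."""
    return -(y - mu_res) / sd_res ** 2 + (y - mu_sus) / sd_sus ** 2

def find_zero(params, lo, hi, iterations=100):
    """Bisection for l(y0) = 0; assumes l changes sign exactly once on [lo, hi]."""
    f_lo = log_ratio(lo, *params)
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        f_mid = log_ratio(mid, *params)
        if (f_mid < 0) == (f_lo < 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def probability_score(y, params):
    """Logistic function of the tangent of l at its zero, as in (9)."""
    _alpha, mu_res, sd_res, mu_sus, sd_sus = params
    y0 = find_zero(params, mu_sus, mu_res)
    tangent = log_ratio_slope(y0, mu_res, sd_res, mu_sus, sd_sus) * (y - y0)
    return 1.0 / (1.0 + math.exp(-tangent))

# Invented mixture parameters: (alpha, mu_res, sd_res, mu_sus, sd_sus)
params = (0.4, 1.5, 0.5, 0.0, 0.3)
for y in (0.0, 0.65, 1.5):
    print(y, round(probability_score(y, params), 4))
```

Because the tangent is taken at the zero of l, the score equals 0.5 exactly at the decision boundary and moves towards 1 for phenotypes on the resistant side.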
A scoring function can be defined as an estimate of the activity of a treatment against a given viral strain:

$$activity(d, x) = \frac{1}{1 + e^{\,l'(f_d(x))}} \tag{10}$$

where d is a drug, x is a genotype sequence and $y_d = f_d(x)$ is the Log Fold Change phenotype (prediction) for sequence x and drug d, so that $activity(d, x) \approx \Pr(sus \mid y_d)$. To calculate a score that takes drug combinations into account, it is important to note that treatments with drugs coming from different drug classes (at present there are the NRTi, NNRTi, PRi and Fi classes) benefit from synergistic effects, while combinations restricted to a single drug class are in general less potent. Thus the score is additive for drugs coming from different classes and taken as the maximum for drugs in the same class: given a drug combination restricted to a single class, $C_i = \{d_1 \ldots d_n\} \subset D_i$, where $D_i$ is the set of all available drugs in class i,

$$activity(C_i, x) = \max\{activity(d_1, x) \ldots activity(d_n, x)\} \tag{11}$$
and, letting $C = \{C_1 \ldots C_k\}$ be a multi-class cART,

$$activity(C, x) = \sum_{i=1}^{k} activity(C_i, x) \tag{12}$$
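The max-within-class, sum-across-classes rule of (11) and (12) can be sketched as follows. The single-drug activity values are invented for illustration (in the model they would come from (10)), and the function name `regimen_activity` is hypothetical.

```python
def regimen_activity(single_drug_activity: dict[str, float],
                     drug_class: dict[str, str]) -> float:
    """Sum over drug classes of the best single-drug activity in each class."""
    best_per_class: dict[str, float] = {}
    for drug, activity in single_drug_activity.items():
        cls = drug_class[drug]
        best_per_class[cls] = max(best_per_class.get(cls, 0.0), activity)
    return sum(best_per_class.values())

# Class membership of three drugs mentioned in the chapter.
drug_class = {"AZT": "NRTi", "3TC": "NRTi", "LPV": "PRi"}

# Invented single-drug activities for one genotype x.
activities = {"AZT": 0.7, "3TC": 0.2, "LPV": 0.9}

print(regimen_activity(activities, drug_class))
```

In this example 3TC adds nothing to the score because AZT, from the same class, is already more active; only LPV, from a different class, contributes additively.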
This system is designed to use the overall activity score as a predictor of short-term (4-week) virological response. A refinement of this model is then discussed towards long-term response prediction (from 12 weeks). It is pointed out that in a prolonged treatment, success or failure will depend not only on the initial (mutant) viral population (and the corresponding phenotype) but also on the ability of the virus to escape the drug pressure by selecting new mutants. Resistant mutations will be closely related to the starting mutational configuration of the viral population. Thus the model was modified to explore activity prediction against a worst case mutant in different mutational neighborhoods of a given genotype sequence. Fixing a drug class i, for a given genotype sequence $x_0$, let $N_r(x_0)$ be the mutational neighborhood of $x_0$ at distance r, i.e. the set of all sequences at distance less than r from $x_0$ (the Hamming distance is used). A worst case mutant for a fixed cART C is characterized by attaining the minimum $\min_{x \in N_r(x_0)} \{activity(C_i, x)\}$, where $C_i \subset C$ is the subset of drugs in class i. Since the sequence space is huge, exhaustive searches are practical only for very restricted neighborhoods. Instead, Beerenwinkel proposes a Beam Search algorithm (BS, explained in [7]). Note that the concept of mutational neighborhood used here is a way to handle uncertainty and vagueness: the search in the variable space, however, is huge (even if the Beam Search appreciably speeds up computation without losing much optimality) and must be executed each time for each patient. This is where fuzzy logic proves suitable for representing such a concept.

Extension of the Fuzzy System for cART optimization

Starting from the matrix compositions M ◦ W = R and M ◦ W′ = S described in Section 4.2, suppose that the relevant mutations and weights (possibly different from the in-vitro ones) are known.
Indicators of viral efficacy (resistance or susceptibility) against a single drug can then be calculated for a given genotype. What is needed is a function that combines single-drug efficacies into an overall combined value for a cART, and then a transformation of this value in order to regress the viral load changes in the body. Single-drug therapies were widely used before 1996, because there were only a few drugs on the market, but they were not able to keep viral loads undetectable and resistance arose in a short time. As a larger set of drugs became available, HAARTs became the accepted protocol and AIDS progression was significantly delayed: in fact the more drugs are taken, the lower the chances for the virus to develop resistance, because it is attacked at different replicative steps (depending on the drug class) and must select cross-resistance mutations while also being present in low concentrations. Roughly speaking, drugs have different power: an old drug like Zidovudine (AZT) is able to bring the viral load of a naive patient (i.e. wild type) down by 1 Log from the baseline after three months of single-drug therapy, while Lopinavir (LPV) can provide a 2 Log reduction. Drug power is closely related to pharmacodynamics and pharmacokinetics, but no precise values can be obtained: consider Saquinavir (SQV), a Protease inhibitor that shows high viral load reduction in vitro but is poorly absorbed by the human body; its power, however, increases when it is taken together with small doses of Ritonavir (RTV). A combined therapy of AZT+LPV for a naive patient will not lead to a simple 1 + 2 Log reduction, but rather to a slightly lower value. The goal is to obtain a unique indicator of cART activity, taking into account genotypic resistance and susceptibility, drug powers and combination effects. This leads again to a fuzzy formula:

$$a_D = \mathop{\bot}_{d \in D} \big((\neg r_d \wedge p_d) \vee s_d\big) \tag{13}$$
where aD is the overall activity of the cART D, d is a drug included in the cART, rd is the viral genotypic resistance to drug d (its negation has the meaning of an efficacy), pd is the power of the single drug d and sd is the viral genotypic susceptibility to drug d. Clearly r and s are given by the above matrix compositions. The power is coupled first with the efficacy and then with the susceptibility, in order to take into account the fact that a drug can be more powerful (thus more efficient) than usual in the presence of hypersusceptibility mutations. The ∧ and ∨ operators are the algebraic norm and conorm, while ⊥ is the Hamacher conorm:

$$\perp_{Hamacher}(a, b, \gamma) = \frac{a + b - ab - (1 - \gamma)ab}{1 - (1 - \gamma)ab} \qquad (14)$$
where a, b are two generic single-drug activities and γ > 0 is a parameter that accounts for drug synergies (if γ < 1 then ⊥Hamacher < ⊥Algebraic). The γ and pd parameters were not estimated, but provided by physicians and virologists according to their experimental evidence. There is now a single indicator in [0,1]. The next step is to transform it (as was done for the phenotype prediction) into a Viral Load value: help comes from [19], in which a system of differential equations, parameterized by drug activities, models HIV-1 replication in the human body. This again makes it possible to define a loss function and optimize it.
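As an illustration of equations 13 and 14, the following is a minimal Python sketch, not the authors' implementation. It assumes the standard fuzzy negation ¬r = 1 − r, the algebraic product and sum for ∧ and ∨ as stated in the text, and a left-to-right fold of the Hamacher conorm over the drugs of the regimen. The function names and the tuple layout (r, p, s) are hypothetical.

```python
from functools import reduce

def hamacher_conorm(a, b, gamma=1.0):
    """Hamacher conorm of equation (14); gamma = 1 gives the algebraic sum."""
    den = 1.0 - (1.0 - gamma) * a * b
    if den == 0.0:
        # Degenerate case, reachable only for gamma = 0 and a = b = 1.
        return 1.0
    return (a + b - a * b - (1.0 - gamma) * a * b) / den

def single_drug_activity(r, p, s):
    """((not r) AND p) OR s, with algebraic product and algebraic sum.
    Assumption: the negation is the standard 1 - r."""
    efficacy = (1.0 - r) * p
    return efficacy + s - efficacy * s

def cart_activity(drugs, gamma=1.0):
    """Overall activity of a regimen, following equation (13).
    `drugs` is a list of (resistance, power, susceptibility) triples in [0, 1]."""
    activities = [single_drug_activity(r, p, s) for (r, p, s) in drugs]
    # 0 is the neutral element of any conorm, so it is a safe starting value.
    return reduce(lambda a, b: hamacher_conorm(a, b, gamma), activities, 0.0)
```

For a wild-type virus (r = s = 0) the single-drug activity reduces to the drug power, so the call `cart_activity([(0.0, 0.8, 0.0), (0.0, 0.99, 0.0)], gamma=0.5)` combines two powers with a γ below 1, which yields a value lower than the plain algebraic sum.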
Evolutionary Fuzzy HIV Modelling
Fig. 2. HIV-1 Schematic summary of the dynamics of HIV-1 infection in vivo: shown in the center is the cell-free virion population sampled in the plasma - Image taken from [46]
Differential Equations for HIV-1 Dynamics
HIV-1 replication in the human body follows the process described in figure 2. Drugs target different replication steps, as can be seen in figure 3. Several mathematical models have been proposed; the one from Perelson [19] is:

\begin{align}
\frac{dT}{dt} &= s + pT\left(1 - \frac{T}{T_{max}}\right) - d_T T - k V_i T \tag{15}\\
\frac{dT^*}{dt} &= (1 - \eta_{RT})\, k V_i T - \delta T^* \tag{16}\\
\frac{dV_i}{dt} &= (1 - \eta_{PR})\, N \delta T^* - c V_i \tag{17}\\
\frac{dV_{ni}}{dt} &= \eta_{PR}\, N \delta T^* - c V_{ni} \tag{18}
\end{align}
where T are uninfected CD4+ T cells, T* are infected CD4+ T cells, Vi and Vni are infectious and non-infectious virions respectively, and ηRT, ηPR are the drug class efficacies. Cytotoxic response (CD8), latent and long-lived cells are not included in the model, nor is the presence of multiple viral strains. Unfortunately, information about T cells is too often missing. A simplified model is:

$$\frac{dV}{dt} = (1 - \alpha)\, c V_{eq} - c V \qquad (19)$$
Fig. 3. HIV-1 Drug Targets - Image taken from [47]
where α is the cART overall activity, c is the viral clearance rate (about three days) and Veq is the Steady State Viral Load. The solution of this equation is

$$V(t) = V_0 e^{-ct} + (1 - \alpha) V_{eq} \left(1 - e^{-ct}\right) \qquad (20)$$

where V0 is the Baseline Viral Load. After three months of therapy the Viral Load Log change from Baseline can be assumed equal to

$$\Delta Log\,(\text{12-weeks}) = Log \frac{V_0}{(1 - \alpha) V_{eq}} \qquad (21)$$
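Equations 20 and 21 can be evaluated directly. The sketch below is only illustrative: the function names are hypothetical, Log is assumed to be the base-10 logarithm (the usual convention for viral loads), and the clearance parameter c is left to the caller because the text gives it only approximately.

```python
import math

def viral_load(t, v0, veq, alpha, c):
    """Equation (20): viral load at time t for baseline v0, steady state veq,
    overall cART activity alpha in [0, 1] and clearance parameter c."""
    decay = math.exp(-c * t)
    return v0 * decay + (1.0 - alpha) * veq * (1.0 - decay)

def log_change_12_weeks(v0, veq, alpha):
    """Equation (21): Log reduction from baseline once the transient has died out.
    Assumption: Log is base 10."""
    if alpha >= 1.0:
        # A perfectly active regimen drives the modelled load to zero.
        return math.inf
    return math.log10(v0 / ((1.0 - alpha) * veq))
```

With equal baseline and steady-state loads, an activity of 0.99 corresponds to a reduction of about 2 Log, and an activity of 0 to no reduction at all.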
5 Optimization Techniques
Simple optimization algorithms are often limited to regular convex functions. In practice, most real problems involve multi-modal, discontinuous, non-differentiable functions. To optimize such functions, traditional search techniques use gradient-based algorithms [4], while newer approaches rely on stochastic mechanisms: the latter choose the next point of the search through stochastic decision rules rather than deterministic processes. Genetic Algorithms, Simulated Annealing and Random Searches (see again [4]) are among these, and are often used either when the problem is difficult to define or when “comfortable” properties -such as differentiability
or continuity- are missing. Genetic Algorithms search the solution space of a function through evolutionary processes; they maintain and manipulate a population of solutions, implementing a strategy of survival of the best: in general the best individuals within a population can reproduce and survive to the next generation, improving over time the search for the best solution. In addition, they are implicitly suitable for parallelism, permitting exploration of different solution sets, with computational advantages. Complete reviews are [3] and [4], in which solutions to linear and non-linear problems are examined. The aim of this section is not to describe the GA in detail, for which references are given, but to review the fuzzy tools used to improve the performance of GAs and the fuzzy evaluation functions defined for optimization and Feature Selection. Then the implementation settings for the Fuzzy Relational system will be discussed.

5.1 Fuzzy Genetic Algorithms
There are two possible ways to integrate Fuzzy Logic and Genetic Algorithms. The first involves the application of GAs to solve optimization and search problems concerning fuzzy sets and rule systems; the second uses fuzzy tools to model different components of the GA in order to improve computational performance. In this chapter both ways are discussed: a GA is used to mine fuzzy rules and the same GA is optimized through fuzzy tools. Fuzzy Logic can be integrated into GAs in:
Chromosome Representation: classical binary or real-valued representations can be generalized into fuzzy sets and membership functions.
Crossover Operators: fuzzy operators can be used to design crossover operators able to guarantee an adequate and parameterizable level of diversity in the population, in order to avoid problems of premature convergence.
Evaluation Criteria: uncertainty, vagueness, belief and plausibility measures can be introduced to define more powerful fitness functions.
GA Components Based on Fuzzy Tools
The behavior of a GA is strongly determined by the balance between exploitation and exploration. This balance between creation and reduction of diversity is essential to reach good performance of GAs on complex problems. When such an equilibrium is not reached, premature convergence problems appear and cause stagnation of the GA in local optima. A collection of crossover functions proposed in [17], called Fuzzy Connective Based (FCB) crossovers, will be described in this section. These fuzzy crossovers were implemented in the application.

F, S, M Crossover Operators
Let X = (x1 . . . xn) and Y = (y1 . . . yn) be two real-valued chromosomes, with xi, yi ∈ [ai, bi], xi < yi and i = 1 . . . n. There are three intervals [ai, xi], [xi, yi], [yi, bi]
in which genes resulting from combinations of xi and yi may lie. These intervals can be classified as exploration ([ai, xi], [yi, bi]) or exploitation ([xi, yi]) zones. Define then three monotonic non-decreasing functions F, S, M : [a, b] × [a, b] → [a, b] such that, ∀c, c′ ∈ [a, b]:

F(c, c′) ≤ min{c, c′}
S(c, c′) ≥ max{c, c′}
min{c, c′} ≤ M(c, c′) ≤ max{c, c′}

Fuzzy connectives (⊤-norms, ⊥-conorms and mean operators) can be used for F, S, M respectively, having [a, b] = [0, 1]. Tables 1 and 2 present a few examples. Furthermore, the following conditions hold:

⊤E ≤ ⊤A ≤ ⊤Z ≤ ΥA ≤ ΥE ≤ ΥZ ≤ ⊥Z ≤ ⊥A ≤ ⊥E

Table 1. F, S Crossover Operators

family      ⊤-norm                                   ⊥-conorm
Zadeh       ⊤Z(x, y) = min{x, y}                     ⊥Z(x, y) = max{x, y}
Algebraic   ⊤A(x, y) = xy                            ⊥A(x, y) = x + y − xy
Einstein    ⊤E(x, y) = xy / (1 + (1 − x)(1 − y))     ⊥E(x, y) = (x + y) / (1 + xy)

Table 2. M Crossover Operators

family      mean operator (0 ≤ λ ≤ 1)
Zadeh       ΥZ(x, y) = (1 − λ)x + λy
Algebraic   ΥA(x, y) = x^(1−λ) y^λ
Einstein    ΥE(x, y) = 2 / (1 + ((2 − x)/x)^(1−λ) ((2 − y)/y)^λ)

Dynamic FCB Crossover Operators
One idea to avoid premature convergence problems consists in preferring exploration at the beginning of the search process and exploitation at the end. Starting from the FCB crossovers, dynamic FCBs that guarantee the desired levels of diversity can be defined using parameterized fuzzy connectives. Consider two genes xi, yi ∈ [ai, bi], xi < yi, to be recombined during generation t, and let gmax be the maximum number of generations. The F, S, M crossovers can be extended to time series F = (F^1 . . . F^gmax), S = (S^1 . . . S^gmax) and M = (M^1 . . . M^gmax), which (for 1 ≤ t ≤ gmax, ∀c, c′ ∈ [a, b]) satisfy:

F^t(c, c′) ≤ F^(t+1)(c, c′) ∧ F^gmax(c, c′) ∼ min{c, c′}
S^t(c, c′) ≥ S^(t+1)(c, c′) ∧ S^gmax(c, c′) ∼ max{c, c′}
(M^t(c, c′) ≥ M^(t+1)(c, c′) ∨ M^t(c, c′) ≤ M^(t+1)(c, c′)) ∀t ∧ M^gmax(c, c′) ∼ Mlim(c, c′)

where Mlim is an M limit function. Denote furthermore by M+ and M− the two families of M functions that fulfill respectively the first and the second part of the last property. The F and S families can be built using a parameterized ⊤-norm and ⊥-conorm that converge to Zadeh's. Tables 3 and 4 depict different choices.

Table 3. F family

type     ⊤-norm                                                         param
Frank    ⊤F^q(x, y) = log_q(1 + (q^x − 1)(q^y − 1) / (q − 1))           q > 0, q ≠ 1
Dubois   ⊤D^q(x, y) = xy / max{x, y, q}                                 0 ≤ q ≤ 1

Table 4. S family

type     ⊥-conorm                                                               param
Frank    ⊥F^q(x, y) = 1 − log_q(1 + (q^(1−x) − 1)(q^(1−y) − 1) / (q − 1))       q > 0, q ≠ 1
Dubois   ⊥D^q(x, y) = 1 − (1 − x)(1 − y) / max{(1 − x), (1 − y), q}             0 ≤ q ≤ 1

The M family can be obtained using parameterized mean operators, like

$$\Upsilon^q(x, y) = \left(\frac{x^q + y^q}{2}\right)^{1/q}; \quad -\infty \le q \le \infty, \ \forall x, y \in [0, 1]$$

and Mlim = (x + y)/2.
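The F, S, M operators can be sketched in a few lines of Python. This is an illustrative sketch rather than the implementation used in the application: it assumes genes already scaled to [0, 1], uses the algebraic norm and conorm with the weighted (Zadeh) mean, and adds a dynamic F operator based on the Dubois family with a linear schedule for q that is purely hypothetical.

```python
def t_norm_algebraic(x, y):
    """F operator: algebraic product, never above min(x, y) on [0, 1]."""
    return x * y

def t_conorm_algebraic(x, y):
    """S operator: algebraic sum, never below max(x, y) on [0, 1]."""
    return x + y - x * y

def mean_zadeh(x, y, lam=0.5):
    """M operator: weighted mean, always between min(x, y) and max(x, y)."""
    return (1.0 - lam) * x + lam * y

def fcb_crossover(parent_x, parent_y, lam=0.5):
    """Build three offspring gene by gene: two exploring the outer
    intervals (F, S) and one exploiting the inner interval (M)."""
    pairs = list(zip(parent_x, parent_y))
    f_child = [t_norm_algebraic(x, y) for x, y in pairs]
    s_child = [t_conorm_algebraic(x, y) for x, y in pairs]
    m_child = [mean_zadeh(x, y, lam) for x, y in pairs]
    return f_child, s_child, m_child

def dubois_t_norm(x, y, q):
    """Dubois parameterized norm; q = 0 recovers min, q = 1 the product."""
    denom = max(x, y, q)
    return 0.0 if denom == 0.0 else x * y / denom

def dynamic_f(x, y, t, g_max):
    """Dynamic F operator at generation t (1 <= t <= g_max).
    Assumption: q decreases linearly from nearly 1 to 0, so that
    F^t <= F^(t+1) and F^g_max equals min, as required above."""
    q = 1.0 - t / g_max
    return dubois_t_norm(x, y, q)
```

Early generations thus produce F offspring far below both parents (exploration), while at the last generation the F offspring coincides with the smaller parent gene.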
Soft Genetic Operators and Template Fuzzy Crossover
Other approaches to crossover operators have been presented that are worth citing. In [16], Soft Genetic Operators are introduced, while in [15] Sanchez proposes the Template Fuzzy Crossover.

Performance Comparison of GAs and FGAs
FGAs have been reported to be more efficient than GAs in different scenarios. Performance tests of GAs and FGAs on the minimization of non-linear and non-differentiable problems are shown in [4, 16, 17].
5.2 Random Searches
In GAs, when the mutation rate is high (above 0.1), performance approaches that of a primitive random search. The advantage of such an optimization algorithm is that it remains derivative-free and does not need to keep a large population of solutions alive: just one solution is iteratively modified and evaluated. As described in [4], Random Search explores the parameter space of an objective function by sequentially adding random values to the solution vector, in order to find an optimal point that minimizes or maximizes the objective function. Despite its simplicity, it has been proved to converge to the global optimum. Let f(x) be an objective function to be minimized and x the current point. The basic algorithm iterates the following steps:
1. x is the current starting point
2. add a random vector dx to x and evaluate f(x + dx)
3. IF f(x + dx) < f(x) THEN set x = x + dx
4. IF (optimal f(x) or maximum number of iterations is reached) THEN stop ELSE go to 2
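The four steps above can be sketched as follows. This is a minimal illustration, not the implementation used in the application: the uniform perturbation in [−step, step], the default parameter values and the optional `target` stopping value are assumptions made for the example.

```python
import random

def random_search(f, x0, step=0.1, max_iter=10_000, target=None, seed=0):
    """Primitive random search minimizing f, starting from the point x0.
    Assumption: each coordinate is perturbed uniformly in [-step, step]."""
    rng = random.Random(seed)
    x = list(x0)                      # step 1: current starting point
    fx = f(x)
    for _ in range(max_iter):         # step 4: stop on iteration budget
        if target is not None and fx <= target:
            break                     # step 4: stop when f is good enough
        dx = [rng.uniform(-step, step) for _ in x]
        candidate = [xi + di for xi, di in zip(x, dx)]
        f_candidate = f(candidate)    # step 2: evaluate f(x + dx)
        if f_candidate < fx:          # step 3: keep only improvements
            x, fx = candidate, f_candidate
    return x, fx
```

A call such as `random_search(lambda v: sum(t * t for t in v), [1.0, -1.0])` minimizes a simple quadratic bowl; only one solution is kept in memory at any time, in contrast with the population maintained by a GA.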
Improvements to this primitive version are suggested in [4].

5.3 Implementation
In the previous sections the models for in-vitro and in-vivo prediction were defined. The aim of this section is the estimation of their parameters. For each model there is a loss function to be minimized:

L(f(R = M ◦ W, S = M ◦ W′), P) = L(P′, P)

For the in-vitro model, f = f(R, S) = f(W, W′) and pi,d = tanh(ri,d − si,d). The same holds for the in-vivo model: there are the relational compositions M ◦ W = R and M ◦ W′ = S, then R and S are combined through equation 13 (to get the overall cART activity α) and then through equation 21 to get Viral Loads. The parameters to be estimated are, for both models, the values in the relations W and W′. The idea is to use either a FGA or a RS to minimize the loss function (which will be part of the fitness function) and to compare different approximate solutions through different runs. Unlike a derivative-based approach (for instance Gradient Descent), in which the solutions can get stuck in local optima, an evolutionary algorithm in this scenario can explore a wider set of solutions with fewer constraints and still run in reasonable computational times. In order to define the chromosome coding, note that the W and W′ relations are matrices of real numbers in [0, 1]: these matrices can be used directly as chromosomes, since fuzzy crossover operators can be applied to them directly. The FGA thus searches for solutions within a population of weight matrices.
Fuzzy Connective Based crossovers are used (specifically, algebraic norms and weighted mean for the F, S, M families), while the fitness function is the Squared Error Loss

$$L(P', P) = (P' - P)^T (P' - P) = \sum_{i=1}^{N} (p'_i - p_i)^2$$

for phenotype prediction and the Squared Linear Correlation

$$L = g(h(W, W'), O) = g(O', O) = \left(\frac{\sigma_{o'o}}{\sigma_{o'}\,\sigma_{o}}\right)^2 = \rho^2_{o'o}$$

(where σ²o′, σ²o are the variances and σo′o is the covariance) for in-vivo Viral Load outcome prediction, where h(W, W′) is the composition of the functions given in equations 13 and 21. In the following section 5.4 a Fuzzy Feature Selection function will be introduced to modify the fitness function, in order to select relevant variables and compact solutions at the same error loss. For the RS, a slightly different implementation is used. Different increments dxj = [0 . . . 0, di,j, 0 . . . 0] are iteratively added to the solution vector x = [x0 . . . xi−1, xi, xi+1 . . . xM], i.e. just the i-th coordinate of x is explored, and the increment for which f(x + dxj) is minimal (for Squared Error Loss) among the different js is retained. However, it is not proven whether this affects the optimality property, because it is a greedy search along one direction of the space.

5.4 Feature Selection
The loss functions described above tend to minimize the error or maximize the correlation between observed and predicted vectors. They do not take into account the number of parameters used. In usual engineering scenarios the parameters to be optimized are few and related to significant variables such as position, speed and acceleration; in this biological framework there is instead a huge number of variables (all the mutations in the viral genotype) and a correspondingly large parameter space. Many of the input variables may not be significant for the model, and many parameters mean that the system can easily be overfitted.
Feature Selection is closely related to Occam's principle, probabilistically interpreted as the Minimum Description Length (MDL) principle (see [6]), according to which models that use a smaller number of parameters are preferred at equal prediction performance. This is useful when dealing with high-dimensional data sets, where many input attributes could be irrelevant or redundant with respect to the dependent variables and act just as noise. By allowing learning algorithms to focus only on highly predictive variables, their accuracy can even be improved. Feature Selection methods can be classified in two groups: Filter and Wrapper methods (see again [6]). Filter methods usually rank each attribute individually by different quality criteria (for example p-values of t-tests, mutual information values, et cetera) and then select the subset of best-ranked attributes. Wrapper methods evaluate the performance of the learning algorithms using different subsets of the attributes: an exhaustive search through all subsets is clearly not possible, since it would lead to 2^n variable subsets to be evaluated, so different search algorithms use (greedy) heuristics to move towards promising subsets. One such search method can be the Genetic Algorithm.
The approach suggested here relies on the fact that, in the GA, the fitness function can include mechanisms that evaluate both the prediction performance and the characteristics of the attribute subset (like the number of variables included, the adjusted ρ2 . . . ).

Akaike Information Criterion
The Akaike Information Criterion (AIC) is a statistical measure of model fit. It quantifies the relative goodness of fit of various previously derived statistical models, given a sample of data. It uses a rigorous framework of information analysis based on the concept of entropy. The driving idea behind the AIC is to examine the complexity of the model together with the goodness of its fit to the sample data, and to produce a measure which balances the two. The formula is

AIC = 2k − 2 ln(L)    (22)

where k is the number of parameters and L is the likelihood function. When errors are assumed to be normally distributed, the AIC is computed as AIC = 2k + n · ln(RSS/n), where n is the number of observations and RSS is the residual sum of squares. A model with many parameters will provide a very good fit to the data, but will have few degrees of freedom and be of limited utility. This balanced approach discourages overfitting. The preferred model is the one with the lowest AIC value.

Fuzzy Feature Selection Functions
The idea for Fuzzy Feature Selection arises from the AIC definition and is extended in fuzzy terms. While the AIC formula is fixed and selects variables only on the basis of their statistical significance, families of parameterized fuzzy functions that take into account the number of parameters and the loss function can be designed, in order to decide with more flexibility how simple the model has to be (i.e. how many parameters are included) in conjunction with its goodness of fit.
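For reference, the least-squares form of the AIC quoted in the Akaike Information Criterion paragraph can be computed as in the short sketch below. The function name is hypothetical, and the formula is the one stated in the text for normally distributed errors.

```python
import math

def aic_least_squares(rss, n, k):
    """AIC = 2k + n * ln(RSS / n) for a model with k parameters,
    n observations and residual sum of squares rss (rss > 0)."""
    return 2 * k + n * math.log(rss / n)
```

Between two models with the same residual sum of squares, the one with fewer parameters gets the lower (preferred) value, which is the trade-off that the fuzzy functions of this section make adjustable.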
Fuzzy formulae are set up in order to select models with high squared correlation ρ2 (or low (Mean) Squared Error SE) and few parameters:

(Error is low) ∧ (v is low)
(ρ2 is high) ∧ (v is low)

where v is related to the number of active parameters. The fuzzy set for v can be defined as a parameterized function of the variable weights: the closer they are to zero the better, because they then do not participate in the model. Excluding non-interactive operators (like min or drastic product)
every fuzzy ⊤-norm is admitted: the algebraic product was the choice for the application. Two simple examples of membership functions are:

$$\mu_v(w) = 1 - \frac{1}{M}\sum_{i=1}^{M} e^{-\frac{w_i^2}{2\sigma^2}} \qquad (23)$$

$$\mu_v(w) = 1 - \frac{|\{w \in W : abs(w) < \sigma\}|}{M} \qquad (24)$$

where | · | is the cardinality of the set. A reasonable value for σ is 0.01, which was used in the application. The first function is smoother, while the second cuts variables regardless of their weight, just by fixing a bound. For SE, the same holds:

$$\mu_{Error}(SE) = 1 - \frac{1}{N}\sum_{i=1}^{N} e^{-\frac{(x'_i - x_i)^2}{2\sigma^2}} \qquad (25)$$

where x′i and xi are the predicted and observed values respectively; here σ can be set to 0.5 (the choice arose from the variability of plasma analyses, within ±0.5 Log). Finally, ρ2 is by itself a goodness of fit indicator defined in [0,1]. Results in section 6 will show the advantages gained with the Fuzzy Feature Selection.
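The membership functions 23 to 25 and their combination can be sketched in Python as follows. This is an illustrative reading, not the authors' code. As written, each membership equals 0 when all weights vanish or when the fit is perfect, so the sketch assumes that the truth degree of "is low" is the complement of the corresponding membership and combines the two degrees with the algebraic product; the function names and this choice are assumptions.

```python
import math

def mu_v_smooth(weights, sigma=0.01):
    """Equation (23): smooth measure of how many weights are active."""
    m = len(weights)
    return 1.0 - sum(math.exp(-w * w / (2.0 * sigma ** 2)) for w in weights) / m

def mu_v_hard(weights, sigma=0.01):
    """Equation (24): fraction of weights whose magnitude reaches sigma."""
    inactive = sum(1 for w in weights if abs(w) < sigma)
    return 1.0 - inactive / len(weights)

def mu_error(predicted, observed, sigma=0.5):
    """Equation (25): 0 for a perfect fit, approaching 1 for large errors."""
    n = len(observed)
    total = sum(math.exp(-(p - o) ** 2 / (2.0 * sigma ** 2))
                for p, o in zip(predicted, observed))
    return 1.0 - total / n

def fitness(weights, predicted, observed):
    """Degree of '(Error is low) AND (v is low)' to be maximized.
    Assumption: 'is low' is the complement of the membership value,
    and AND is the algebraic product chosen in the text."""
    error_is_low = 1.0 - mu_error(predicted, observed)
    v_is_low = 1.0 - mu_v_smooth(weights)
    return error_is_low * v_is_low
```

Under these assumptions, two weight vectors that give the same predictions are ranked by sparsity: the one with more weights close to zero receives the higher fitness.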
6 Application

6.1 Phenotype Prediction

Data Set Description and System Setting
Genotype/phenotype pairs were collected from the public Stanford data base [37] and from VIRCO Laboratories [38]. Available data set sizes ranged from 700 to 1000 pairs for the drugs {AZT, DDI, DDC, 3TC, TDF, NVP, ABC, NFV, EFV, SQV, IDV, LPV, APV, D4T, DLV, RTV}, while for {TPV, ATV, FTC} sizes were in the order of 70 to 300. Data sets were split into training (90%) and validation (10%) sets in order to assess the robustness of results. Viral nucleotide sequences were aligned to the consensus B wild type viral reference strain with a global alignment algorithm (CLUSTALW), using high gap penalties for insertions and deletions. Mutations were then extracted, handling also ambiguous sequencing. The M relation was filled according to the definition given in section 4.2. All the mutational positions were included in the system (550 in the RT gene and 330 in the PR gene), but only positions related to the corresponding drug target were allowed to have a weight, i.e. only mutations in the Protease gene were considered for Protease inhibitors, and likewise for Reverse Transcriptase; this was done in order to respect real biological mechanisms. The Fuzzy Feature Selection function used in conjunction with either the FGA or the RS was (Error is low) ∧ (v is low), as proposed in section 5.4.
Results
In order to assess performance, the Fuzzy System was trained and validated on the whole set of commercial drugs for which phenotypic tests are available. The system was then compared with a Linear Regressor, a literature standard for genotype→phenotype prediction. Since the number of input variables is huge, the Linear Regressor was enhanced in two ways: first, a Singular Value Decomposition (SVD) cut the quasi-collinear attributes; secondly, a stepwise selection heuristic (starting from an input variable subset, variables are added or removed according to the Akaike Information Criterion) was used to reduce the number of variables. The performance indicator was the squared linear correlation ρ2 between the predicted and observed vectors, a widely used measure in biology. While the validation performances of the three models were not significantly different (Kruskal-Wallis rank sum test), i.e. the models have the same predictive power, the Fuzzy Feature Selection method selected a significantly lower number of input variables (p < 4 · 10^−7, Wilcoxon rank sum test). Table 5 summarizes the results.

Table 5. Validation Results and Model Comparisons for Phenotype prediction

              Lin Reg SVD        Lin Reg + Stepwise   Fuzzy
drug   n      no. var   ρ2       no. var   ρ2         no. var   ρ2
AZT    94     537       0.7619   138       0.8183     7         0.8756
DDI    89     526       0.7926   138       0.8075     4         0.8609
TDF    77     530       0.7468   87        0.7707     6         0.8148
TPV    7      126       0.4414   107       0.4408     3         0.9426
NVP    97     537       0.8045   160       0.7076     15        0.7960
ABC    89     529       0.8378   134       0.8694     6         0.7730
NFV    94     332       0.8906   101       0.8854     20        0.8912
ATV    35     253       0.8649   117       0.8995     5         0.9222
FTC    28     374       0.7693   0         0          8         0.6939
3TC    93     538       0.8080   121       0.9409     10        0.9234
EFV    95     531       0.7665   138       0.8591     14        0.8373
SQV    95     333       0.9333   98        0.9409     15        0.9276
IDV    94     332       0.9124   86        0.9178     6         0.828
LPV    75     327       0.9730   113       0.9783     17        0.8566
APV    94     331       0.7841   109       0.8081     6         0.8606
D4T    92     534       0.8527   158       0.8778     10        0.6455
DLV    96     532       0.7006   126       0.7239     2         0.8441
DDC    91     523       0.8619   114       0.9218     4         0.7788
RTV    94     330       0.9174   89        0.9444     9         0.8682

The SVD LR produced robust models, but the interpretation of the weights is difficult because (even though cut in the decomposition) they are re-projected into the original attribute space. The
stepwise heuristic function reduced the input attribute space, but the resulting models had a higher number of variables (by about one order of magnitude) than the Fuzzy engine. A simpler model has the advantage of being more understandable and of pointing out features that can have real biological meaning. In fact, for all drugs, the Fuzzy model optimized with the Fuzzy Feature Selection yielded a set of weights that resembles medical hypotheses with high accuracy. Table 6 shows the estimated weights for three drugs, which completely agree with the list of resistance/susceptibility mutations approved by IAS/USA [36]. Note that the Fuzzy Feature Selection function independently selected these among more than 500 input variables. Usually Machine Learners are trained using only the IAS/USA list.

Table 6. Weight Estimation for Phenotype Prediction

AZT
RT mutation   weight   effect
67N           0.35     resistance
70R           0.35     resistance
116Y          0.85     resistance
184V          0.1      susceptibility
210W          0.6      resistance
215F          0.6      resistance
215Y          0.6      resistance

IDV
PR mutation   weight   effect
46I           0.5      resistance
46L           0.55     resistance
48V           0.55     resistance
54V           0.55     resistance
84V           0.4      resistance
90M           0.45     resistance

EFV
RT mutation   weight   effect
100I          0.85     resistance
101E          0.55     resistance
101P          0.9      resistance
101Q          0.5      resistance
103N          0.9      resistance
103S          0.8      resistance
108I          0.45     resistance
181C          0.35     resistance
184V          0.05     resistance
188L          0.95     resistance
190A          0.85     resistance
190S          0.95     resistance
221Y          0.3      resistance
225H          0.7      resistance
Fig. 4. Log Fold Change regression - Fuzzy Relational System + Feature Selection - AZT Validation Set
Fig. 5. Log Fold Change regression - Fuzzy Relational System + Feature Selection - ATV Validation Set
Regarding the performance of the optimization algorithms, different executions of the FGA and the RS did not show differences in the time needed to find an acceptable solution. Figures 4, 5, 6 and 7 depict validation results for different drugs.

6.2 In-Vivo Prediction

Clinical Data Sets
The available data bases were the five clinical trials {GART, HAVANA, ARGENTA, ACTG 320, ACTG 364} and the retrospective cohort ARCA (taken from [37, 39]): 1329 instances were selected, according to the following constraints:
• Viral Load Equilibrium was the maximum viral load value ever observed in a patient
Fig. 6. Log Fold Change regression - Fuzzy Relational System + Feature Selection - SQV Validation Set
Fig. 7. Log Fold Change regression - Fuzzy Relational System + Feature Selection - IDV Validation Set
• Baseline Viral Load had to be collected in the interval [−15, 7] days from the therapy switch date
• Viral Genotype had to be sequenced in the interval [−90, 30] days from the therapy switch date
• 12-Weeks Viral Load had to be taken from 8 to 16 weeks after the therapy switch date

For the clinical trials, the equilibrium viral load was not really reliable, because patients were already being treated when enrolled and little information about the patients' history was available. Retrospective cohorts, on the other hand, were often missing the baseline measure. Mutations were extracted by aligning each patient's viral genotype with the consensus B wild type reference strain, as for the in-vitro tests. 12 ARVs were included in the model: {AZT, 3TC, D4T, DDI, ABC, EFV, NVP, NFV, SQV, LPV, RTV, IDV}. Data were split into training (90%) and validation (10%) sets. Furthermore, a blind-validation set of 42 observations from patients recently recorded in ARCA, coming from different clinics (with complete information about Baseline, Equilibrium Viral Load, Genotype and Follow Up), was considered as an additional test set. The Fuzzy Feature Selection function was (ρ2 is high) ∧ (v is low), as proposed in section 5.4: this time the squared correlation ρ2 was preferred to the MSE because the assumptions made on drug synergies and powers are not precise. Drug powers, given by physicians after pharmacokinetics studies, are summarized in Table 7.

Table 7. Drug Powers

NRTi or NNRTi   power        PRi           power
AZT             0.8          IDV           0.97
3TC             0.95         NFV           0.97
DDI             0.95         RTV           0.97
D4T             0.8          SQV           0.9
ABC             0.96         LPV           0.99
EFV             0.98         RTV BOOSTER   0.088
NVP             0.96
Variance and Bias Estimation
Unlike the in-vitro scenario, in-vivo clinical data sets show high variability: this is due to patients' non-adherence, different drug absorption levels, different surrounding conditions (psychological state, co-infections . . . ), and even systematic instrument errors and wrong data entries. Furthermore, Viral Loads (and CD4+) are just a surrogate of the real disease progression in the body, because they reflect only the viral strains present in the peripheral blood (while the infection mostly acts in the lymphatic tissues). Input attributes are chosen among a set of variables that have been shown to be relevant in vitro and in a limited number of in-vivo statistical tests, so they might not be the most predictive ones, or might not be predictive at all. For instance, important information such as the therapy history is rarely recorded, and thus cannot be included in a model. Before presenting the results, it is useful to show the high variance in the follow-up Viral Loads among observations that share the same input attributes. Figures 8 and 9 illustrate the situation for two selected observation subsets: the outcome distributions for patients that are wild type (do not possess mutations), have the same Baseline Viral Load and take the same cART are almost flat. Whatever Machine Learning technique is used, the results will be poor for such biased data.
Fig. 8. 12-Weeks Viral Load Log distribution for 6 Wild Type patients under AZT+3TC+SQV with 5 Baseline Viral Load Log
Fig. 9. 12-Weeks Viral Load Log distribution for 4 Wild Type patients under AZT+3TC with 4.75 Baseline Viral Load Log
Results
The Fuzzy system was trained and validated under different parametric conditions:
• zero-resistance/susceptibility model: a null model in which mutations are assumed not to contribute to resistance or susceptibility (useful to compare the performance of the others)
• input variables taken from the IAS/USA [36] list of relevant mutations (with or without Fuzzy Feature Selection)
• input variables on the entire RT + PR genes (all mutational positions, all amino acid substitutions) with Fuzzy Feature Selection

All the results are summarized in Table 8. The zero-resistance/susceptibility model, which assumes perfect inhibition regardless of mutational profiles and just
Table 8. In-Vivo Prediction Performances

fuzzy model                                   data set type                n      ρ2
zero-resistance/susceptibility (null model)   validation                   133    0.2045
all mutations + Feature Selection             training                     1329   0.6672
all mutations + Feature Selection             validation                   133    0.3553
IAS mutations no Feature Selection            training                     1329   0.36
IAS mutations no Feature Selection            validation                   133    0.2488
IAS mutations + Feature Selection             training                     1329   0.3971
IAS mutations + Feature Selection             validation                   133    0.3762
IAS mutations + Feature Selection             validation (diff. clinics)   42     0.3899
Beerenwinkel's model (trained in-vitro)       validation                   96     0.368
Fig. 10. In Vivo 12-Weeks Viral Load Log regression - Fuzzy model, no Feature Selection - Mutations from IAS/USA - Validation Set - observed values on x axis, predictions on y axis
relies on drug powers and Viral Load exponential decrease, was the poorest predictor: the Fuzzy system explains the data better. Figure 10 depicts validation results (real outcomes vs predictions) obtained after training the system on the IAS/USA list without Feature Selection: validation ρ2 was poor, only 0.2488 (training yielded ρ2 = 0.36), and the weight matrices were quite unstable when the algorithms were executed several times with different starting points. Note that the perfectly aligned bunch of points is due to saturation at the undetectable level. Results improved when using the Fuzzy Feature Selection and the IAS/USA list: Figures 11, 12 and 13 show training and validation performances (including the blind-validation set coming from the different clinics), for which ρ2 was always above 0.37. Different runs with perturbations of the starting points yielded slightly different weight matrices and included variables, but with low variability. Furthermore, weights often resembled
Fig. 11. In Vivo 12-Weeks Viral Load Log regression - Fuzzy model + Feature Selection - Mutations from IAS/USA - Training Set - observed values on x axis, predictions on y axis
Fig. 12. In Vivo 12-Weeks Viral Load Log regression - Fuzzy model + Feature Selection - Mutations from IAS/USA - Validation set - observed values on x axis, predictions on y axis
Fig. 13. In Vivo 12-Weeks Viral Load Log regression - Fuzzy model + Feature Selection - Mutations from IAS/USA - Validation set from different clinics - observed values on x axis, predictions on y axis
Table 9. Weight Matrices for Fuzzy System. NRT is for mutations in Reverse Transcriptase targeted by Nucleoside/Nucleotide analogues, NNRT is for Non-Nucleoside RT inhibitors, PR is for Protease inhibitors. Weights were forced to zero for mutations in regions not targeted by the corresponding drug class, in order to reflect physical behavior.

mutation   drug   weight   effect
NRT 41     DDI    0.1      resistance
NRT 41     ABC    0.95     resistance
NRT 69     ABC    0.45     resistance
NRT 77     3TC    0.95     susceptibility
NRT 115    3TC    0.95     susceptibility
NRT 151    DDI    1.0      susceptibility
NRT 184    3TC    0.95     resistance
NRT 219    DDI    0.5      resistance
NNRT 106   NVP    0.95     resistance
NNRT 181   EFV    0.6      resistance
NNRT 181   NVP    0.95     resistance
NNRT 190   NVP    0.95     resistance
PR 10      NFV    0.1      resistance
PR 10      RTV    0.9      resistance
PR 10      SQV    0.7      resistance
PR 30      SQV    1.0      susceptibility
PR 32      SQV    0.95     resistance
PR 46      RTV    0.95     resistance
PR 54      NFV    0.95     resistance
PR 54      LPV    0.95     resistance
PR 71      NFV    0.95     resistance
PR 90      IDV    0.7      resistance
PR 90      NFV    0.95     resistance
PR 90      SQV    0.95     resistance
medical hypotheses. Table 9 reports a small set of weights (amino acid substitutions are not shown for clarity) that is capable of reaching ρ2 = 0.39 in validation: emphasized terms disagree with medical hypotheses. Unlike in the phenotype prediction, the FGA did not improve and escape quickly from the zero-resistance/susceptibility model, remaining stuck in this local optimum, while the RS was able to find better solutions in a shorter time. The last test was made using the complete mutational regions in RT and PR, just forcing mutations in RT not to interact with PR drugs and vice versa, and relying on the Fuzzy Feature Selection function for the feature selection. The parameter search space was huge: around 400 mutations and 3000 weights to be estimated (not considering amino acid substitutions, and having just about a thousand training examples). Training performance was the best obtained, yielding ρ2 = 0.6672, but validation results did not improve, yielding ρ2 = 0.3553. The system was clearly over-parameterized and, even though the Feature Selection was used, the weights in the relational matrices were unstable, changing their values considerably among different executions of the FGA and RS algorithms. A final comparison was made with Beerenwinkel's model described in section 4.3, which uses an in-vitro predictor: this system was tested on a set of
96 therapy switch episodes (not drawn from the sets used here), using the overall activity score as a predictor of the 4-week (28 ± 10 days) virological response (a very short-term response). Linear least squares regression analysis gave ρ² = 0.368. Since the validation settings differ, the models cannot be compared directly; however, because this model was trained with in-vitro data, the result is at least an indication that the two experimental settings are related.

6.3 Conclusions

The Fuzzy Relational system has been shown to be accurate, robust and compact for in-vitro prediction. Unlike other models such as Linear Regressors or SVMs, it has the advantage of providing a meaningful explanation of biological mechanisms and, combined with the Fuzzy Feature Selection function, it selects the best models according to Occam’s principle. Its extension to in-vivo cART optimization and Viral Load prediction gives encouraging results while still providing a compact model; moreover, its derivation from the in-vitro framework and the comparison with Beerenwinkel’s model emphasize the relationships between in-vitro tests and in-vivo treatments. However, for the sole purpose of therapy optimization, the correlation results are still not satisfactory; the variance estimation nevertheless shows how much the clinical data are biased, due either to the limited attribute recording or to the intrinsic variability of the human body.

Future Perspectives for In-Vivo Modelling

The Fuzzy Relational system described in this chapter was designed under a limited-data scenario. For instance, the differential equation system that models viral reproduction had to be simplified by ignoring the contribution of CD4+ T cells, because these values were too often missing from the data bases. Mutations were treated separately, because a preliminary clustering did not work owing to the large number of therapy combinations compared to the small number of training instances.
Therapeutic history, which could have a crucial role in the learning process, is missing as well. However, the public availability of clinical data bases is increasing, as confirmed by the EuResist data base [40] (a European project that aims to integrate several clinical data bases on HIV and build a treatment decision system), as are the quality of the data sets and the number of attributes recorded in them. In this perspective, it is possible to design more complex models while maintaining the aim of modelling biological mechanisms meaningfully and handling uncertainty and vagueness. The Fuzzy Relational system can be modified and extended in order to produce a rule set capable of inferring predictions, at least eliminating the noise produced by (previously) unseen significant attributes. An appropriate rule base for an in-vivo HIV resistance/susceptibility FIS (Mamdani. . . ), which still waits to be trained with a sufficient amount of data, is defined in Table 10.
Table 10. Fuzzy System for In-Vivo cART Optimization

premise 0 (fact)           pm is M and pd is D and pv is V and . . . and ph is H and pp is P
premise 1 (rule)           if pm is M1 and pd is D1 and pv is V1 and . . . and ph is H1 and pp is P1 then po is O1
premise 2 (rule)           if pm is M2 and pd is D2 and pv is V2 and . . . and ph is H2 and pp is P2 then po is O2
...                        ...
premise i (rule)           if pm is Mi and pd is Di and pv is Vi and . . . and ph is Hi and pp is Pi then po is Oi
consequence (conclusion)   po is O
In detail, pk is the patient’s state on the K fuzzy sets: Mi is a set of viral mutational clusters, Di is a cART, Vi represents plasma analyses (Viral Load, CD4+), Pi is the phenotype corresponding to the Mi set, and Hi is designed on therapy history. Additional input features (such as viral subtype, risk factor, coinfections) can be included. O is the 12-week follow-up. The above inferences can be represented through compositions of fuzzy relations. The importance of this equivalence is that relations are often easier to handle computationally: they are basically n-dimensional matrices, so they can be manipulated through many algebraic properties; moreover, they are suitable for the application of evolutionary algorithms. The problem is to define membership functions and then input space partitions. While this is feasible for Viral Loads and CD4+, a patient’s viral genotype is a vector of mutations with respect to the wild type: a distance function has to be defined to calculate neighborhoods (Hamming, Jaccard, Levenshtein, phylogenetic distances. . . ), remembering that in each position there can be a mixture of amino acids (for instance M184MVI). For therapy history, an exponentially decreasing function parameterized on time of exposure and time of interruption could be suitable. Once such an enlarged input attribute set and the corresponding similarity functions have been defined, membership functions can be tuned and rules discovered with a heuristic algorithm.
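As a rough illustration of two ingredients mentioned above, the following Python sketch shows a composition of a fuzzy fact with a fuzzy relation stored as a matrix, and a Jaccard distance between two genotypes in which a position may carry a mixture of amino acids. The choice of the max-min operator, the function names, the encoding of a genotype as a dictionary from position to a set of amino acids, and all numeric values are assumptions made for illustration only; the chapter does not fix any of them.

```python
def max_min_compose(fact, relation):
    """Compose a fuzzy set (list of memberships) with a fuzzy relation (matrix).

    out[j] = max_i min(fact[i], relation[i][j])  (max-min composition, assumed here)
    """
    n_out = len(relation[0])
    return [
        max(min(f, row[j]) for f, row in zip(fact, relation))
        for j in range(n_out)
    ]


def jaccard_distance(genotype_a, genotype_b):
    """Jaccard distance between two mutation profiles.

    Each genotype maps a position to the set of amino acids observed there,
    so a mixture such as M184MVI becomes {184: {"M", "V", "I"}} (hypothetical encoding).
    """
    items_a = {(pos, aa) for pos, aas in genotype_a.items() for aa in aas}
    items_b = {(pos, aa) for pos, aas in genotype_b.items() for aa in aas}
    union = items_a | items_b
    if not union:
        return 0.0  # two empty profiles are treated as identical
    return 1.0 - len(items_a & items_b) / len(union)


# Hypothetical example values, not taken from the chapter's data.
fact = [0.2, 0.9]                     # patient memberships in two input sets
relation = [[0.5, 0.1], [0.7, 0.3]]   # input sets (rows) x outcome sets (columns)
outcome = max_min_compose(fact, relation)

g1 = {184: {"M", "V", "I"}, 41: {"L"}}
g2 = {184: {"V"}, 41: {"L"}, 215: {"Y"}}
distance = jaccard_distance(g1, g2)
```

Representing the relation as a plain nested list keeps the sketch dependency-free; a matrix library would serve equally well for larger rule bases.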
7 Acknowledgements

We want to thank physician Andrea De Luca (Institute of Clinical Infection Disease - Catholic University of Rome - UCSC) and virologist Maurizio Zazzi (University of Siena, Italy), who collaborated actively in this study. Furthermore, we want to thank the ARCA consortium [39], which provided the in-vivo
retrospective data sets, Stanford University [37] for in-vivo clinical trials and VIRCO [38] for in-vitro genotype/phenotype data sets.
References

1. Klir G (1988) Fuzzy sets, uncertainty and information. Prentice-Hall, Englewood Cliffs, NJ
2. Bandemer H (1992) Fuzzy data analysis. Kluwer Academic, Dordrecht
3. Michalewicz Z (1994) Genetic algorithms + data structures = evolution programs. AI Series. Springer, Berlin Heidelberg New York
4. Jang JSR, Sun CT, Mizutani E (1997) Neuro-fuzzy and soft computing. Prentice Hall, Englewood Cliffs, NJ
5. Brunak S, Baldi P (2001) Bioinformatics: the machine learning approach. MIT, Cambridge, MA
6. Witten IH, Frank E (2005) Data mining: practical machine learning tools and techniques. Morgan Kaufmann, Los Altos, CA
7. Beerenwinkel N (2003) Computational analysis of HIV drug resistant data. PhD Thesis, MPS-MPI for Informatics, University of Saarland, Saarbruecken, Germany
8. Sanchez E (1977) Solutions in composite fuzzy relation equations. In: Gupta, Saridis, Gaines (eds) Fuzzy automata and decision processes. North-Holland, New York, pp 221–234
9. Sanchez E (1979) Medical diagnosis and composite fuzzy relations. In: Gupta, Ragade, Yager (eds) Advances in fuzzy set theory and applications. North-Holland, New York, pp 437–444
10. Adlassnig K, Kolarz G (1982) CADIAG-2: Computer-assisted medical diagnosis using fuzzy subsets. In: Gupta, Sanchez (eds) Approximate reasoning in decision analysis. North-Holland, New York, pp 203–217, 219–247
11. Zadeh L (1965) Fuzzy sets. Inf Control 8:338–353
12. Sanchez E (1984) Solution of fuzzy equations with extended operations. Fuzzy Sets Syst 12:237–248
13. Mizumoto M (1989) Pictorial representations of fuzzy connectives, part II: cases of compensatory operators and self-dual operators. Fuzzy Sets Syst 32:45–79
14. Pedrycz W (1993) Fuzzy relational equations. Fuzzy Sets Syst 59:189–195
15. Sanchez E (1993) Fuzzy genetic algorithms in soft computing environment. Invited Plenary Lecture in Fifth IFSA World Congress, Seoul
16. Voigt HM (1995) Soft genetic operators in evolutionary algorithms. In: Banzhaf W, Eeckman FH (eds) Evolution and biocomputation. Lecture notes in computer science, vol 899. Springer, Berlin Heidelberg New York, pp 123–121
17. Herrera F (1996) Dynamic and heuristic fuzzy connective based crossover operators for controlling the diversity and the convergence of real coded algorithms. Int J Intell Syst 11:1013–1041
18. Lathrop R, Pazzani MJ (1999) Combinatorial optimization in rapidly mutating drug-resistant viruses. J Combinatorial Optimiz 3:301–320
19. Perelson AS, Nelson PW (1999) Mathematical analysis of HIV-1 dynamics in vivo. SIAM Rev 41:3–44
20. Beerenwinkel N, Schmidt B, Walter H, Kaiser R, Lengauer T, Hoffmann D, Korn K, Selbig J (2001) Geno2pheno: interpreting genotypic HIV drug resistance tests. IEEE Intell Syst Biol 16(6):35–41
21. Beerenwinkel N, Kaiser R, Schmidt B, Walter H, Korn K, Hoffmann D, Lengauer T, Selbig J (2001) Clustering resistance factors: identification of complex categories of drug resistance. Antiviral Ther 6(1):105
22. Beerenwinkel N, Schmidt B, Walter H, Kaiser R, Lengauer T, Hoffmann D, Korn K, Selbig J (2001) Identifying drug resistance-associated patterns in HIV genotypes. Proceedings of the German conference on bioinformatics, October 7–10, 2001, pp 126–130
23. Beerenwinkel N, Sing T, Daumer M, Kaiser R, Lengauer T (2004) Computing the genetic barrier. Antiviral Ther 9:S125
24. Beerenwinkel N, Daumer M, Sierra S, Schmidt B, Walter H, Korn K, Oette M et al. (2002) Geno2pheno is predictive of short-term virological response. Antiviral Ther 7:S74
25. Beerenwinkel N, Schmidt B, Walter H, Kaiser R, Lengauer T, Hoffmann D et al. (2002) Diversity and complexity of HIV-1 drug resistance: a bioinformatics approach to predict phenotype from genotype. Proc Natl Acad Sci USA 99(12):8271–8276
26. Beerenwinkel N, Daumer M et al. (2003) Geno2pheno: estimating phenotypic drug resistance from HIV-1 genotypes. Nucleic Acids Res 31(13):3850–3855
27. Beerenwinkel N, Lengauer T, Daumer M, Kaiser R, Walter H, Korn K, Hoffmann D, Selbig J (2003) Methods for optimizing antiviral combination therapies. Bioinf 19(1)(ISMB ’03):i16–i25
28. Beerenwinkel N, Kaiser R, Rahnenfuhrer J, Daumer M, Hoffmann D, Selbig J, Lengauer T (2003) Tree models for the evolution of drug resistance. Antiviral Ther 8:S107
29. De Luca A, Vendittelli M, Baldini F, Di Giambenedetto S, Trotta MP, Cingolani A, Bacarelli A, Gori C, Perno CF, Antinori A, Ulivi G (2004) Construction, training and clinical validation of an interpretation system for genotypic HIV-1 drug resistance based on fuzzy rules revised by virological outcomes. Antiviral Ther 9(4)
30. Larder BA, Revell A, Wang D, Harrigan R, Montaner J, Wegner S, Lane C (2005) Neural networks are more accurate predictors of virological response to HAART than rules-based genotype interpretation systems. Poster presentation at 10th European AIDS conference/EACS, Dublin, Ireland, 17–20 November
31. Wang D, Larder BA, Revell A, Harrigan R, Montaner J, Wegner S, Lane C (2005) Treatment history improves the accuracy of neural networks predicting virologic response to HIV therapy. Abstract & poster presentation at BioSapiens-viRgil workshop on bioinformatics for infectious diseases, Caesar Bonn, Germany, September 21–23
32. Revell A, Larder BA, Wang D, Wegner S, Harrigan R, Montaner J, Lane C (2005) Global neural network models are superior to single clinic models as general quantitative predictors of virologic treatment response. Poster presentation at third IAS conference on HIV pathogenesis and treatment, 24–27 July, Rio de Janeiro, Brazil
33. Prosperi M, Zazzi M, De Luca A, Di Giambenedetto S, Ulivi G et al. (2005) Common law applied to treatment decisions for drug resistant HIV. Antiviral Ther 10:S62 (abstract & poster at XIV International HIV Drug Resistance Workshop, Quebec City)
34. Prosperi M, Zazzi M, Gonnelli A, Trezzi M, Corsi P, Morfini M, Nerli A, De Gennaro M, Giacometti A, Ulivi G, Di Giambenedetto S, De Luca A (2005) Modelling in vivo HIV evolutionary mutational pathways under AZT3TC regimen through Markov chains. Abstract, poster & selected lecture at BioSapiens-viRgil workshop on bioinformatics for infectious diseases, Caesar Bonn, Germany, September 21–23
35. Savenkov I, Beerenwinkel N, Sing T, Daumer M, Rhee SY, Horberg M, Scarsella A, Zolopa A, Lee SY, Hurley L, Fessel WJ, Shafer RW, Kaiser R, Lengauer T (2005) Probabilistic modelling of genetic barriers enables reliable prediction of HAART outcome at below 15% error rate. Abstract at BioSapiens-viRgil workshop on bioinformatics for infectious diseases, Caesar Bonn, Germany, September 21–23
36. IAS-USA. http://iasusa.org
37. Stanford HIV data base. http://hivdb.stanford.edu
38. VIRCO Labs. http://vircolab.com
39. ARCA consortium. http://www.hivarca.net
40. EuResist. http://www.euresist.org
41. Geno2Pheno system. http://www.genafor.org
42. RDI. http://www.hivrdi.org/
43. ANRS. http://www.hivfrenchresistance.org/
44. ANRS. http://pugliese.club.fr/index.htm
45. HIV-1 Genotypic Drug Resistance rules from REGA Institute. http://www.kuleuven.be/rega/cev/pdf/ResistanceAlgorithm6 22.pdf
46. The cellML. http://www.cellml.org
47. The body: complete HIV resource. http://bbs.thebody.com/index.html
A New Genetic Approach for Neural Network Design Antonia Azzini and Andrea G.B. Tettamanzi
Summary. Neuro-genetic systems, a particular type of evolving systems, have become a very important topic of study in evolutionary design. They are biologically inspired computational models that use evolutionary algorithms (EAs) in conjunction with neural networks (NNs) to solve problems. EAs are based on the natural genetic evolution of individuals in a defined environment, and they are useful for complex optimization problems with a huge number of parameters and where analytical solutions are difficult to obtain. This work presents an approach to the joint optimization of neural network structure and weights, using the backpropagation algorithm as a specialized decoder, and defining a simultaneous evolution of the architecture and weights of neural networks.
A. Azzini and A.G.B. Tettamanzi: A New Genetic Approach for Neural Network Design, Studies in Computational Intelligence (SCI) 82, 289–323 (2008)
www.springerlink.com © Springer-Verlag Berlin Heidelberg 2008

1 Introduction

Evolutionary algorithms (EAs) are models based on the natural evolution of individuals in a defined environment. They are especially useful for complex optimization problems where the number of parameters is large and analytical solutions are difficult to obtain. Reference and survey texts in the field of evolutionary algorithms are [8,36], and recent advances in evolutionary computation are described in [60], which notes that these approaches attract increasing interest from both academia and industry. EAs can help to find the global optimum over a domain, although convergence to the global optimum may only be guaranteed in probability, provided certain rather mild assumptions are met [1]. They have been applied in different areas such as fuzzy control, path planning, modeling and classification. Their strength is essentially due to their updating of a whole population of possible solutions at each iteration; this is equivalent to carrying out parallel explorations of the overall search space of a problem. New evolving systems, neuro-genetic systems, have become a very important topic of study in evolutionary computation. As indicated by Yao et al.
in [61], they are biologically inspired computational models that use evolutionary algorithms in conjunction with neural networks (NNs) to solve problems. Such an evolutionary algorithm is a more integrated way of designing artificial neural networks (ANNs), since it allows all aspects of NN design to be taken into account at once and does not require expert knowledge of the problem. An important issue in ANN design is the training process, carried out by adjusting the connection weights iteratively so that the trained ANN can perform the desired task. Weight training is usually formulated as the minimization of an error function, such as the mean square error between target and actual outputs averaged over all examples. Among the most frequently used methods to train neural networks, the backpropagation (BP) algorithm has emerged as the standard algorithm for finding a set of good connection weights and biases [50]. Like conjugate gradient methods, BP is based on gradient descent [38]. It generally uses a least-squares optimality criterion, defining a method for calculating the gradient of the error with respect to the weights for a given input by propagating the error backwards through the network. There have been some successful applications of BP in various areas [28], but BP has drawbacks due to its use of gradient descent. Usually, a gradient descent algorithm is used to adapt the weights based on a comparison between the desired and actual network response to a given input stimulus. In each iteration of backpropagation, the gradient of the search surface is calculated and the network weights are changed in the direction opposite to the gradient. This can be computationally expensive if a large number of iterations is required to find an acceptable network and to avoid entrapment in local minima.
Consequently, BP often gets trapped in a local minimum of the error function and is unable to find a global minimum if the error function is multimodal and/or non-differentiable. Moreover, BP is sensitive to initial conditions and can become slow. Oscillations may occur during learning and, if the error function is shallow, the gradient is very small, leading to small weight changes. One way to overcome the shortcomings of gradient-descent-based training algorithms is to adopt EAs in order to formulate the training process as the evolution of connection weights in the environment determined by the architecture and the learning task. For this reason, much research has been undertaken on the combination of EAs and NNs. Through the use of EAs, the problem of designing a NN is regarded as an optimization problem. Some EAs implement a search over the topology space, or a search for the optimal learning parameters. Others focus on weight optimization: these can be regarded as alternative training algorithms, and in this case the evolution of weights assumes that the architecture of the network is static. Evolutionary algorithms and the state of the art in the design of Evolutionary Artificial Neural Networks (EANNs) were also introduced by Abraham [2], followed by the proposed MLEANN framework. In that framework, in addition to the evolutionary search of the connection weights and architectures, local search techniques were used to fine-tune the weights (meta-learning).
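The idea of treating weight training as the evolution of connection weights on a fixed architecture can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not any of the cited methods: the XOR data set, the 2-2-1 sigmoid network, Gaussian mutation, the elitist parents-plus-offspring selection, and all parameter values are hypothetical choices.

```python
import math
import random

# XOR data set: a small training problem chosen only for illustration.
DATA = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]
N_WEIGHTS = 9  # fixed 2-2-1 architecture: 2 * (2 inputs + bias) + (2 hidden + bias)


def sigmoid(a):
    """Logistic function with the argument clamped to avoid overflow."""
    a = max(-60.0, min(60.0, a))
    return 1.0 / (1.0 + math.exp(-a))


def forward(weights, x):
    """Output of the fixed 2-2-1 network for input x."""
    h0 = sigmoid(weights[0] * x[0] + weights[1] * x[1] + weights[2])
    h1 = sigmoid(weights[3] * x[0] + weights[4] * x[1] + weights[5])
    return sigmoid(weights[6] * h0 + weights[7] * h1 + weights[8])


def mse(weights):
    """Mean square error between target and actual outputs over all examples."""
    return sum((forward(weights, x) - t) ** 2 for x, t in DATA) / len(DATA)


def evolve_weights(pop_size=20, generations=200, sigma=0.5, seed=0):
    """Elitist evolution of weight vectors; returns (initial best, final best) error."""
    rng = random.Random(seed)
    population = [[rng.uniform(-1.0, 1.0) for _ in range(N_WEIGHTS)]
                  for _ in range(pop_size)]
    initial_best = min(mse(w) for w in population)
    for _ in range(generations):
        offspring = [[w + rng.gauss(0.0, sigma) for w in parent]
                     for parent in population]
        # Parents compete with offspring, so the best error never increases.
        population = sorted(population + offspring, key=mse)[:pop_size]
    return initial_best, mse(population[0])


initial_error, final_error = evolve_weights()
```

Note that fitness here depends only on the error of the whole network, so no gradient (and hence no differentiable error function) is required.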
This work presents an approach to the joint optimization of neural network structure and weights which can take advantage of BP as a specialized decoder. In our approach the initial settings of the weights are decoded: they become the inputs of the backpropagation algorithm, which produces a neural network with its trained connection weights, corresponding to the ‘phenotype’. Backpropagation thus decodes the initial values into the weights of the neural network. This is reflected in the reproduction operator, in which backpropagation is not applied to the weights, since only the initial values of a parent, which correspond to the ‘genotype’, are assigned to the connection weights of its offspring. The approach is successfully applied to three different real-world applications. The first concerns an electrical engine fault diagnosis problem, in order to estimate the fault probability of an engine with particular attention to reduced power consumption and silicon area occupation. The second describes an application to brain wave signal processing, in particular as a classification algorithm in the analysis of the P300 Evoked Potential. Finally, the third application considers a financial problem, whereby a factor model capturing the mutual relationship among several financial instruments is sought. This chapter is organized as follows. Section 2 discusses different approaches to evolving ANNs and points to related work in the literature. Section 3 describes the new evolutionary approach to the simultaneous evolution of neural network structure and weights which can take advantage of backpropagation as a specialized decoder. The genetic operators implemented are presented in detail. Finally, three real-world applications are discussed in Section 4, considering different problems: electrical engine fault diagnosis, brain wave signal classification and financial modeling.
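The decoder idea described in this introduction can be illustrated with a deliberately tiny Python sketch: the genotype holds initial weights, gradient descent decodes it into trained weights (the phenotype), fitness is measured on the phenotype, and an offspring is built from the parent's genotype only. Everything concrete here is an assumption for illustration: a single linear neuron stands in for a network trained by backpropagation, and the data, learning rate, number of epochs, and the Gaussian perturbation in `reproduce` are hypothetical.

```python
import random

# Toy regression data for y = 2*x + 1 (hypothetical, for illustration only).
DATA = [(k / 4.0, 2.0 * (k / 4.0) + 1.0) for k in range(-4, 5)]


def mse(weights):
    """Mean square error of the linear neuron y = w*x + b on DATA."""
    w, b = weights
    return sum((w * x + b - t) ** 2 for x, t in DATA) / len(DATA)


def decode(genotype, epochs=50, lr=0.1):
    """Decoder: gradient descent maps initial weights to trained weights."""
    w, b = genotype
    n = len(DATA)
    for _ in range(epochs):
        grad_w = sum(2.0 * (w * x + b - t) * x for x, t in DATA) / n
        grad_b = sum(2.0 * (w * x + b - t) for x, t in DATA) / n
        w -= lr * grad_w  # move in the direction opposite to the gradient
        b -= lr * grad_b
    return (w, b)


def reproduce(parent_genotype, rng, sigma=0.1):
    """Offspring starts from the parent's initial weights, not its trained ones.

    The Gaussian perturbation is an assumed stand-in for the chapter's operators.
    """
    return tuple(g + rng.gauss(0.0, sigma) for g in parent_genotype)


rng = random.Random(1)
parent = (rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0))  # genotype
phenotype = decode(parent)       # trained weights
fitness = mse(phenotype)         # fitness is evaluated on the phenotype
child = reproduce(parent, rng)   # built from the genotype only
```

Because the trained weights are never written back into the genotype, what is inherited is the starting point of training rather than its result.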
2 Evolving ANNs

There are several approaches to evolving ANNs, and EAs are used to perform various tasks, such as connection weight training, architecture design, learning rule adaptation, connection weight initialization, rule extraction from ANNs, etc. Three of them are considered the most popular approaches at these levels:

• Connection weights, which concentrates only on weight optimization, assuming that the architecture of the network is static. The evolution of connection weights introduces an adaptive and global approach to training, especially in the reinforcement learning and recurrent network learning paradigms, where gradient-based training algorithms often experience great difficulties.
• Learning rules, which can be regarded as a process of learning to learn in ANNs, where the adaptation of learning rules is achieved through evolution. It can also be regarded as an adaptive process of automatic discovery of novel learning rules.
• Architecture evolution, which enables ANNs to adapt their topologies to different tasks without human intervention and thus provides an approach to automatic ANN design, as both ANN connection weights and structures can be evolved. In this case a further subdivision can be made by distinguishing a ‘pure’ architecture evolution from a simultaneous evolution of both architectures and weights.

Other approaches consider the evolution of the transfer function of a neural network, but this is usually applied in conjunction with one of the three methods above in order to obtain better results.

Different types of evolutionary algorithms are defined in the literature: the first proposal comes from Holland, with genetic algorithms (GA) [25]. Fogel’s evolutionary programming (EP) [20] is a technique for searching through a space of finite-state machines. The evolutionary strategies (ES), introduced by Rechenberg [46], are algorithms that imitate the principles of natural evolution for parameter optimization problems. The most recent development is so-called genetic programming (GP), proposed by Koza [29] to search for the fittest computer program to solve a particular task. Genetic algorithms are based on a representation independent of the problem, such as a string of binary, integer or real numbers. This representation (the genotype) encodes a network (the phenotype) and gives rise to a dual representation scheme. The ability to create better solutions in a genetic algorithm relies mainly on the genetic recombination operator. The benefits of the genetic operators come from the ability to form connected substrings of the representation that correspond to problem solutions. Evolutionary programming and genetic programming use the same paradigm as genetic algorithms, but they use different representations for individuals in the population, and they put emphasis on different operators.
For particular tasks, they use specialized data structures, such as finite state machines and tree-structured computer programs, and specialized genetic operators, in order to perform evolution. Evolutionary strategies were developed as a method to solve parameter optimization problems with continuously changeable parameters, and were later extended to discrete problems. They differ from genetic algorithms in several respects. Indeed, they apply deterministic selection after reproduction, and a solution is not coded and, together with the probabilities of mutation and crossover, constitutes a chromosome. These probabilities differ among individuals and change during the evolution. In this approach, wrong solutions are eliminated from the population. Evolution programs are a particular kind of modified genetic algorithm, introduced by Michalewicz [36]. They follow the idea of Davis [16], who believes that: ‘a genetic algorithm should incorporate real-world knowledge in one’s algorithm by adding it to one’s decoder or by expanding one’s operator set’.
Evolution programs would leave the problem unchanged, modifying a chromosome representation of a potential solution, using natural data structures, and applying appropriate genetic operators. In this algorithm a possible solution is directly mapped into an encoding scheme. They offer a major advantage over genetic algorithms when evolving ANNs, since the representation scheme allows networks to be manipulated directly, avoiding the problems associated with a dual representation. The use of evolutionary learning for designing neural networks dates back no more than two decades. However, a lot of work has been done in these years, which has produced many approaches and working models for different ANN optimizations. Some of these are reported below.

Weight Optimization

The evolution of weights can be regarded as an alternative to training algorithms, and it assumes that the architecture of the network is static. The primary motivation for using evolutionary techniques to establish the weight values, rather than traditional gradient descent techniques such as BP [49], lies in the latter’s tendency to get trapped in local minima and in their requirement that the function be differentiable. For this reason, rather than adapting weights based on local improvement only, EAs evolve weights based on the fitness of the whole network. Several works in this direction have been described by Montana and Davis [39] and by Whitley and colleagues [55]; in [56], they also implemented a purely evolutionary approach using binary codings of weights. In other cases an EA and a gradient descent algorithm have been combined [27]. A few years ago, Yang and colleagues [58] proposed an improved genetic algorithm based on a kind of evolutionary strategy. During the application of GAs, problems of premature convergence and stagnation of the solution can often occur [21]. This algorithm was implemented to keep the balance between population diversity and convergence speed during the evolution.
This is carried out by means of a kind of mutation operator in conjunction with a controller stable factor. Zalzala and Mordaunt [40] studied an evolutionary NN suitable for gait analysis of human motion, evolving the connection weights of a predefined NN structure. The evolution was explored with mutation and a multi-point crossover, separately implemented, as the best combination search mechanism.

Learning Rules Optimization

Several standard learning rules govern the speed and the accuracy with which the network will be trained and, when there is little knowledge about the most suitable architecture for a given problem, the possibly dynamic adaptation of the learning rules becomes very useful. Examples of these are the
learning rate and momentum, which can be difficult to assign by hand and therefore become good candidates for evolutionary adaptation. One of the first studies in this field was conducted by Chalmers [13]. The aim of his work was to see whether the well-known delta rule, or a fitter variant, could be evolved automatically by a genetic algorithm. With a suitable chromosome encoding and using a number of linearly separable mappings for training, Chalmers was able to evolve a rule analogous to the delta rule, as well as some of its variants. Although this study was limited to somewhat constrained network and parameter spaces, it paved the way for further progress. Several studies have been carried out in this direction: Merelo and colleagues [35] present a search for the optimal learning parameters of multilayer competitive learning neural networks. Another work, based on simulated annealing, is proposed by Castillo et al. in [12].

Architecture Optimization

There are two major ways in which EAs have been used for searching network topologies: either all aspects of a network architecture are encoded into an individual, or a compressed description of the network is evolved. The first case defines a direct encoding, while the second leads to an indirect encoding.

• Direct Encoding specifies each parameter of the neural network, and little effort is required in decoding, since a direct transformation of genotypes into phenotypes is defined. Several examples of this approach are shown in the literature, as in [37, 56] and in [61], in which the direct encoding scheme is used to represent ANN architectures and connection weights (including biases). EPNet [61] is based on evolutionary programming with several different sophisticated mutation operators.
• Indirect Encoding requires considerable effort for decoding the neural network but, in some cases, the network can be pre-structured, using restrictions in order to rule out undesirable architectures, which makes the search space much smaller. A few sophisticated encoding methods are implemented based on network parameter definitions. These parameters may represent the number of layers, the size of the layers (i.e., the number of neurons in each layer), the bias of each neuron and the connections among them. This kind of encoding is an interesting idea that has been further pursued by other researchers, such as Filho and colleagues [19], and Harp and colleagues [23]. Their method is aimed at the choice of the architecture and connections, and uses a representation which describes the main components of the networks, dividing them into two classes, i.e., parameter and layer sections.

The design of an optimal NN architecture can be formulated as a search problem in the architecture space, where each point represents an architecture. As pointed out by Yao [59, 61, 62], given some performance (optimality) criteria, e.g., minimum error, learning speed, lower complexity, etc., about
architectures, the performance level of all these architectures forms a surface in the design space. Determining the optimal architecture design is equivalent to finding the highest point on this surface. There are several arguments which make the case for using EAs to search for the best network topology [37, 53]. Pattern classification approaches [54] can also be used to design the network structure, and constructive and destructive algorithms can be implemented [59]. The constructive algorithm starts with a small network. Hidden layers, nodes, and connections are added to expand the network dynamically [61]. The destructive algorithm starts with a large network. Hidden layers, nodes, and connections are then deleted to contract the network dynamically [41]. Stanley and Miikkulainen in [53] presented the NeuroEvolution of Augmenting Topologies (NEAT) method. This algorithm outperforms fixed-topology methods by employing a principled method of crossover of different topologies, protecting structural innovation using speciation, and incrementally growing from minimal structure. One of the most important forms of deception in ANN structure optimization arises from the many-to-one and one-to-many mappings from genotypes in the representation space to phenotypes in the evaluation space. The existence of functionally equivalent networks with different encodings makes the evolution inefficient. This problem is usually termed the permutation problem [22] or the competing conventions problem [51]. It is clear that the evolution of pure architectures has difficulties in evaluating fitness accurately. As a result, the evolution would be very inefficient.

Simultaneous Evolution of Architecture and Weights

One solution to decrease the noisy fitness evaluation problem in ANN structure optimization is to consider a one-to-one mapping between genotypes and phenotypes of each individual.
This is possible by considering a simultaneous evolution of the architecture and the network weights. The advantage of combining these two basic elements of a NN is that a completely functioning network can be evolved without any intervention by an expert. Some methods that evolve both the network structure and connection weights were proposed in the literature. In the ANNA ELEONORA algorithm [34], a new genetic operator, called GA-simplex [9], and an encoding procedure, called granularity encoding [32, 33], which allows the algorithm to autonomously identify an appropriate length of the coding string, were introduced. Each gene consists of two parts: the connectivity bits and the connection weight bits. The former indicate the absence or the presence of a link, and the latter indicate the value of the weight of a link. This approach employs four genetic operators (reproduction, crossover, mutation and GA-simplex), and it has been presented in both a parallel and a sequential version.
In another evolutionary approach, the GNARL algorithm [3], the number of hidden nodes and connection links for each network is first randomly chosen within some defined ranges. Three steps are then implemented to generate an offspring: copying the parents, determining the mutations to be performed, and mutating the copy. The mutation of a copy is separated into two classes: a parametric mutation that can alter the connection weights, and a structural mutation that modifies the number of hidden nodes and links of the considered network.
An evolutionary system, EPNet [61], was also presented for evolving feedforward ANNs. The idea behind EPNet is to put more emphasis on evolving ANN behaviors. For this reason a number of techniques have been adopted to maintain a close behavioral link between parents and their offspring. For example, the training mutation is always attempted before any architectural mutation, since it causes less behavioral disruption, and a hidden node is not added to an existing ANN at random, but through splitting an existing node. EPNet evolves ANN architectures and weights simultaneously in order to reduce the noise in fitness evaluation, even though the evolution simulated by this approach is closer to Lamarckian evolution than to Darwinian evolution.
Leung and colleagues developed a new system [30] for tuning the structure and parameters of a neural network in a simple manner. A given fully connected feedforward neural network may become a partially connected network after training. The topology and weights are tuned simultaneously using a proposed improved GA. In this approach the weights of the network links govern the input-output relationship of the NN, while the structure of the neural network is governed by introducing switch elements for each NN connection. 
Great attention to the simultaneous evolution of the architecture and weights of a network was also given in a new kind of model for evolving ANNs, named Cooperative Coevolutionary models. These are based on the coevolution of several species of subnetworks (called nodules) that must cooperate to form networks for solving a given problem. In the work by Pedrajas and colleagues [45] an example of this approach has been implemented in the COVNET model. Here a population of networks, evolved by means of a steady-state genetic algorithm, kept track of the best combinations of nodules for solving the problem.
Further works were carried out in order to address the drawbacks of the BP gradient descent approach introduced in Section 1. P.P. Palmes and colleagues [44] implemented a mutation-based genetic NN (MGNN) to replace BP by using the mutation strategy of local adaptation of evolutionary programming (EP) to effect weight learning. This algorithm also dynamically evolved structure and weights at the same time. In MGNN a Gaussian perturbation was implemented together with a first stochastic (GA-inspired) and a scheduled stochastic (EP-inspired) mutation.
3 Neuro-Genetic Approach
This work presents an approach to the design of NNs based on EAs, whose aim is both to find an optimal network architecture and to train the network on a given data set. The approach is designed to be able to take advantage of BP if that is possible and beneficial; however, it can also do without it. The basic idea is to exploit the ability of the EA to find a solution close enough to the global optimum, together with the ability of the BP algorithm to finely tune a solution and reach the nearest local minimum.
As indicated in Section 1, this research was prompted by an industrial application [4,5], in which it is required to design neural engine controllers to be implemented in hardware, with particular attention to reduced power consumption and silicon area occupation. The validity of the resulting approach, however, is by no means limited to hardware implementations of NNs. A second application concerns brain wave signal processing [7], in particular a classification algorithm for the analysis of the P300 Evoked Potential. Finally, the third application considers a financial problem, whereby a factor model capturing the mutual relationships among several financial instruments is sought [6].
The attention is restricted to a specific subset of the NN architectures, namely the Multi-Layer Perceptron (MLP). MLPs are feedforward NNs with one layer of input neurons, one layer of one or more output neurons and zero or more ‘hidden’ (i.e., internal) layers of neurons in between; neurons in a layer can take inputs from the previous layer only. A peculiar aspect of this approach is that BP is not used as a genetic operator, as is the case in some related work [11]. Instead, the EA optimizes both the topology and the weights of the networks; BP is optionally used to decode a genotype into a phenotype NN. 
Accordingly, it is the genotype which undergoes the genetic operators and which reproduces itself, whereas the phenotype is used only for calculating the genotype’s fitness. The idea proposed in this work is close to the solution presented in EPNet [61], an evolutionary system for evolving feedforward ANNs that puts emphasis on evolving ANN behaviors. Like EPNet, this neuro-genetic approach evolves ANN architecture and connection weights simultaneously, in order to reduce the noise in fitness evaluation. A close behavioral link between parent and offspring is also maintained by applying different techniques, like weight mutation and partial training, in order to reduce behavioral disruption. The first technique is attempted before any architectural mutation; the second is employed after each architectural mutation. Moreover, a hidden node is not added to an existing ANN at random, but through splitting an existing node. In our work we carry out weight mutation before topology mutation since we want to perturb the connection weights of the neurons in a neural network, and then carry out a weight control in order to delete neurons whose contribution is negligible with respect to the overall network output. This allows us to implement, if it is
possible, a reduction of the computational cost of the entire network before each architecture mutation.
3.1 Evolutionary Algorithm
The overall evolutionary process can be described by the following pseudo-code:
1. Initialize the population, either by generating new random individuals or by loading a previously saved population.
2. Create for each genotype the corresponding MLP, and calculate its mean square error (mse), its cost and its fitness values.
3. Save the best individual as the best-so-far individual.
4. While not termination condition do
   a) Apply genetic operators to each network.
   b) Decode each new genotype into the corresponding network.
   c) Compute the fitness value for each network.
   d) Save statistics.
The application of the genetic operators to each network is described by the following pseudo-code:
1. Select from the population (of size n) n/2 individuals by truncation and create a new population of size n with copies of the selected individuals.
2. For all individuals in the population:
   a) Perform crossover.
   b) Mutate the weights and the topology of the offspring.
   c) Train the resulting network using the training and validation sets if bp = 1.
   d) Calculate f and f̂ (see Section 3.3).
   e) Save the individual with lowest f̂ as the best-so-far individual if the f̂ of the previously saved best-so-far individual is higher (worse).
3. Save statistics.
If a new population is to be generated, the corresponding networks will be initialized with different hidden layer sizes, using two exponential distributions to determine the number of hidden layers and neurons for each individual, and a normal distribution to determine the weights and bias values. Variance matrices will also be defined for all weight and bias matrices; they will be applied in conjunction with evolution strategies in order to perturb network weights and biases. Variance matrices will be initialized with matrices of all ones. 
In both cases, unlike other approaches like [62], the maximum size and number of the hidden layers is not determined in advance, nor bounded, even though the fitness function may penalize large networks.
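As a rough illustration of step 1 of the genetic-operator pseudo-code, together with the fitness defined in Section 3.3 (Equations 1 and 2), the following Python sketch computes the cost and fitness of a network from its layer sizes and then applies truncation selection. The function names, the representation of an individual as a (layers, mse) pair, and the assumption that the MLP is fully connected (so that the number of synapses is the sum of the products of consecutive layer sizes) are illustrative choices, not taken from the text; the constants are the defaults of Table 2.

```python
import random

# Default constants from Table 2; everything else here is illustrative.
ALPHA, BETA, LAMBDA, K = 2.0, 4.0, 0.5, 1e-5


def network_cost(n_in, layers):
    """Cost c = alpha * N_hn + beta * N_syn (Equation 2).

    `layers` lists the hidden layer sizes followed by the output size.
    Assumption: the MLP is fully connected, so N_syn is the sum of the
    products of consecutive layer sizes.
    """
    sizes = [n_in] + list(layers)
    n_hidden = sum(layers[:-1])
    n_syn = sum(a * b for a, b in zip(sizes, sizes[1:]))
    return ALPHA * n_hidden + BETA * n_syn


def fitness(n_in, layers, mse):
    """Fitness f = lambda * k * c + (1 - lambda) * mse; lower is better."""
    return LAMBDA * K * network_cost(n_in, layers) + (1.0 - LAMBDA) * mse


def truncation_selection(population, key, rng=random):
    """Drop the worst half, duplicate the survivors, then shuffle.

    Assumes an even population size. The duplicates are references to
    the same objects; a real implementation would copy them before
    mutating.
    """
    ranked = sorted(population, key=key)
    survivors = ranked[: len(population) // 2]
    new_population = survivors + list(survivors)
    rng.shuffle(new_population)
    return new_population


# Toy population: (layers, mse) pairs for a problem with 8 inputs.
population = [([3, 1], 0.02), ([10, 5, 1], 0.01), ([1], 0.30), ([4, 1], 0.05)]
selected = truncation_selection(population, key=lambda ind: fitness(8, *ind))
```

In this toy population the two individuals with the lowest fitness survive and appear twice each in `selected`; with the default k the cost term stays small compared with the mse, so it mainly separates networks of similar accuracy.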
3.2 Individual Encoding
Each individual is encoded in a structure that maintains basic information on the net as illustrated in Table 1. The values of all these parameters are affected by the genetic operators during evolution, in order to perform incremental (adding hidden neurons or hidden layers) and decremental (pruning hidden neurons or hidden layers) learning. The bp parameter selects between the two types of genetic encoding schemes, direct and indirect, already defined in Section 2. Generally, indirect encoding allows a more compact representation than direct encoding, because not every connection and node is specified in the genome, although they can be derived from it. On the other hand, the major drawback of indirect schemes is that they require more detailed genetic and neural knowledge. In this approach, if no BP-based network training is employed, a direct encoding is defined, in which the network structure is directly translated into the corresponding phenotype; otherwise, there is an indirect encoding of networks, where the phenotype is obtained by training an initial (embryonic) network using BP.
While promising results can be obtained by combining backpropagation and evolutionary search, fast variants of backpropagation are sometimes required to improve the efficiency of these algorithms. We use a Darwinian model of evolution, in which the influences of the environment on the phenotype, corresponding to the BP application, do not perturb the genotype, as previously indicated, and consequently do not affect the features inherited by the offspring. In this work, considering the computational trade-offs between local and evolutionary search, the Resilient backpropagation algorithm RPROP [47] is adopted as the local search method, since it is useful for fine-tuning the solution. Although evolutionary learning can be slow for some problems in comparison with fast variants of BP, evolutionary algorithms are generally much less sensitive to the initial conditions of training and they are useful for searching for a globally optimal solution.

Table 1. Individual Representation
Element      Description
l            Length of the topology string, corresponding to the number of layers.
topology     String of integer values that represent the number of neurons in each layer.
W(0)         Weights matrix of the input layer neurons of the network.
Var(0)       Variance matrix of the input layer neurons of the network.
W(i)         Weights matrix for the ith layer, i = 1, . . . , l.
Var(i)       Variance matrix for the ith layer, i = 1, . . . , l.
b_ij         Bias of the jth neuron in the ith layer.
Var(b_ij)    Variance of the bias of the jth neuron in the ith layer.

Some problem-specific parameters of the algorithm are the cost α of a neuron and β of a synapse, used to establish a parsimony criterion for the network architecture; a bp parameter, which enables the use of BP if set to 1; and other parameters like probability values, used to define topology, weight distribution and ad hoc genetic operators. Table 2 lists all the parameters of the algorithm, and specifies the default values that they assume in this work.

Table 2. Parameters of the Algorithm
Symbol       Meaning                                                        Default Value
n            Population size                                                60
seed         Previously saved population                                    none
bp           Backpropagation selection                                      1
p+_layer     Probability to insert a hidden layer                           0.1
p−_layer     Probability to delete a hidden layer                           0.05
p+_neuron    Probability to insert a neuron in a hidden layer               0.05
p_cross      Probability to crossover                                       0.2
r            Parameter for use in weight mutation for neuron elimination    1.5
             Threshold for alternative use for neuron elimination           0
h            Mean for the exponential distribution                          3
N_in         Number of network inputs                                       *
N_out        Number of network outputs                                      *
α            Cost of single neuron                                          2
β            Cost of single synapse                                         4
λ            Desired tradeoff between network cost and accuracy             0.5
k            Constant for scaling cost and mse in the same range            10^−5
* depends on the problem.

3.3 Fitness Function
The fitness of an individual depends both on its accuracy (i.e., its mse) and on its cost. Although it is customary in EAs to assume that better individuals have higher fitness, the convention that a lower fitness means a better NN is adopted in this work. This maps directly to the objective function of the genetic problem, which is a cost minimization problem. Therefore, the fitness grows with both the mse and the cost of the considered network. It is defined as
$$f = \lambda k c + (1 - \lambda)\,\mathrm{mse}, \tag{1}$$
where λ ∈ [0, 1] is a parameter which specifies the desired trade-off between network cost and accuracy, k is a constant for scaling the cost and the mse of the network to a comparable scale, and c is the overall cost of the considered network, defined as
$$c = \alpha N_{hn} + \beta N_{syn}, \tag{2}$$
where $N_{hn}$ is the number of hidden neurons, and $N_{syn}$ is the number of synapses. The mse depends on the Activation Function, which calculates all the output values for each single layer of the neural network. In this work the tangent sigmoid transfer function
$$y = \frac{2}{1 + e^{-2x}} - 1 \tag{3}$$
is implemented. The rationale behind introducing a cost term in the objective function is to search for networks which use a reasonable amount of resources (neurons and synapses), which makes sense in particular when a hardware implementation is envisaged. To be more precise, two fitness values are actually calculated for each individual: the fitness f, used by the selection operator, and a test fitness f̂. Following the commonly accepted practice of machine learning, the problem data are partitioned into three sets:
• training set, used to train the network;
• test set, used to decide when to stop the training and avoid overfitting;
• validation set, used to test the generalization capabilities of a network.
It is important to stress that these dataset definitions are not used consistently in the literature. Now, f̂ is calculated according to Equation 1 by using the mse over the test set, while f is calculated according to the same equation by using the mse over the training set. When BP is used, i.e., if bp = 1, f = f̂, since in our approach an individual is considered better if it achieves good results on input data that are not used during the learning phase. Otherwise (bp = 0), f is calculated according to Equation 1 by using the mse over the training and test sets together, maintaining the best individual over the test set.
3.4 Selection
In the evolution process two important and strongly related issues are the population diversity and the selective pressure. Indeed, an increase in the selective pressure decreases the diversity of the population, and vice versa. As indicated by Michalewicz [36], it is important to strike a balance between these two factors, and sampling mechanisms attempt to achieve this goal. As observed in that work, many of the parameters used in the genetic search affect these factors. In this sense, as selective pressure is increased, the search
focuses on the top individuals in the population, causing a loss of population diversity. Using a larger population, or reducing the selective pressure, increases exploration, since more genotypes are involved in the search. In De Jong's work [26], several variations of the simple selection method were considered; the first variation, named the elitist model, strengthens the genetic algorithm by preserving the best chromosome during the evolution. An important result by G. Rudolph [48] is that elitism is a necessary condition for convergence of an evolutionary algorithm; of course, convergence is only probabilistic, and there is no guarantee that just one run of an evolutionary algorithm for a given number of generations will yield the globally optimal solution.
The selection method implemented in this work is based on the breeder genetic algorithm approach [42], which differs from natural probabilistic selection since the evolution of a population considers only the individuals that better adapt themselves to the environment. Elitism is also considered in this work, allowing the best individual to survive unchanged into the next generation and the solutions to get better over time. The selection strategy used in this genetic algorithm is truncation: starting from a population of n individuals, the worst n/2 (with respect to f) are eliminated. The remaining individuals are duplicated in order to replace those eliminated. Finally, the population is randomly permuted.
3.5 Mutation
Two types of mutation operators are used: a general random perturbation of weights, applied before the BP learning rule, and three mutation operators which affect the network architecture. The weight mutation is applied first, followed by the topology mutations, as follows:
1. Weight mutation: all the weight matrices $W^{(i)}$, i = 0, . . . , l and the biases are perturbed by using variance matrices and evolution strategies applied to the number of synapses of the entire neural network, $N_{syn}$. 
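This mutation is implemented by the following equations:
$$W_j^{(i)} \leftarrow W_j^{(i)} + N(0,1) \cdot \mathrm{Var}_j^{(i)}, \tag{4}$$
$$\mathrm{Var}_j^{(i)} \leftarrow \mathrm{Var}_j^{(i)} \cdot e^{\tau' N(0,1) + \tau N(0,1)}, \tag{5}$$
with
$$\tau' = \frac{1}{\sqrt{2 N_{syn}}}, \tag{6}$$
$$\tau = \frac{1}{\sqrt{2 \sqrt{N_{syn}}}}. \tag{7}$$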
After this perturbation has been applied, neurons whose contribution to the network output is negligible are eliminated, based on a threshold. In
this work two different kinds of threshold are considered and alternatively applied to the weight perturbation. The first is a fixed threshold, simply defined by a parameter set before execution. The following pseudo-code is implemented in the mutation operator; it compares that parameter with the norms of the columns of all weight matrices.
   for i = 1 to l − 1 do
      if N_i > 1
         for j = 1 to N_i do
            if ||W_j^(i)|| < threshold
               delete the jth neuron
where $N_i$ is the number of neurons in the ith layer, and $W_j^{(i)}$ is the jth column of matrix $W^{(i)}$. This solution has the drawback that the fixed threshold value can be difficult to set for different real-world applications. A solution to this problem has been implemented in this approach by defining a variable threshold. In this case the new threshold depends on a norm (in this case $L_\infty$) of the weight vector for each node, and on the average and standard deviation of the norms in the considered layer. This task is carried out according to the following pseudo-code:
   for i = 1 to l − 1 do
      if N_i > 1
         for j = 1 to N_i do
            if ||W_j^(i)|| < (avg_k(||W_k^(i)||) − r · stdev_k(||W_k^(i)||))
               delete the jth neuron
where $N_i$ is the number of neurons in the ith layer, $W_j^{(i)}$ is the jth column of matrix $W^{(i)}$, and r is a parameter which allows the user to tune how many standard deviations below the layer average the contribution of a neuron must be before it is deleted. In this solution the r parameter only tunes the standard-deviation term, so its setting has a less invasive effect on the mutation.
2. Topology mutation: this operator affects the network structure (i.e., the number of neurons in each layer and the number of hidden layers). In particular, three mutations can occur:
   a) Insertion of one hidden layer: with probability $p^{+}_{layer}$, a hidden layer i is randomly selected and a new hidden layer i − 1 with the same number of neurons is inserted before it, with $W^{(i-1)} = I(N_i)$ and $b_{i-1,j} = b_{ij}$, with $j = 1, \ldots, N_i = N_{i-1}$, where $I(N_i)$ is the $N_i \times N_i$ identity matrix.
   b) Deletion of one hidden layer: with probability $p^{-}_{layer}$, a hidden layer i is randomly selected; if the network has at least two layers and layer i has exactly one neuron, layer i is removed and the connections between
the (i − 1)th layer and the (i + 1)th layer (to become the ith layer) are rewired as follows:
$$W^{(i-1)} \leftarrow W^{(i-1)} \cdot W^{(i)}. \tag{8}$$
Since $W^{(i-1)}$ is a row vector and $W^{(i)}$ is a column vector, the result of the product of their transposes is a $N_{i+1} \times N_{i-1}$ matrix.
   c) Insertion of a neuron: with probability $p^{+}_{neuron}$, the jth neuron in the hidden layer i is randomly selected for duplication. A copy of it is inserted into the same layer i as the $(N_i + 1)$th neuron; the weight matrices are then updated as follows:
      i. a new row is appended to $W^{(i-1)}$, which is a copy of the jth row of $W^{(i-1)}$;
      ii. a new column $W_{N_i+1}^{(i)}$ is appended to $W^{(i)}$, where
$$W_j^{(i)} \leftarrow \frac{1}{2} W_j^{(i)}, \tag{9}$$
$$W_{N_i+1}^{(i)} \leftarrow W_j^{(i)}. \tag{10}$$
The rationale for halving the output weights from both the jth neuron and its copy is that, by doing so, the overall network behavior remains unchanged, i.e., this kind of mutation is neutral. All three topology mutation operators are designed so as to minimize their impact on the behavior of the network; in other words, they are designed to be as little disruptive (and as neutral) as possible.
3.6 Recombination
As indicated in [34], there has been some debate in the literature about the advisability of applying crossover to ANN evolution, because of the disruptive effects that it could have on the neural model. In this approach two kinds of crossover are independently implemented: the first is a kind of single-point crossover with different cutting points; the second is a kind of vertical crossover, defining a merge operator between the topologies and weight matrices of two parents in order to create the offspring.
Single-Point Crossover
This is a kind of single-point crossover where a cutting point is extracted for each parent, since the genotype length is variable. Furthermore, the genotypes can be cut only in ‘meaningful’ places, i.e., only between one layer and the next: this means that a new weight matrix has to be created to connect the two layers at the crossover point in the offspring. These new weight matrices are initialized from a normal distribution, while the corresponding variance matrices are set to matrices of all ones. This kind of crossover is shown in Figure 1.
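Returning briefly to the mutation operators of Section 3.5, the following Python sketch illustrates the self-adaptive weight perturbation of Equations 4 to 7 and the neuron duplication of Equations 9 and 10. It is a minimal sketch under stated assumptions: matrices are plain lists of rows, biases are omitted, both normal samples in Equation 5 are drawn independently for every entry (the text does not say whether one of them is shared), and the function names are hypothetical.

```python
import math
import random


def mutate_weights(weights, variances, n_syn, rng=random):
    """Perturb one weight matrix in place (Equations 4-7).

    Assumption: every N(0, 1) sample is drawn independently per entry.
    """
    tau_prime = 1.0 / math.sqrt(2.0 * n_syn)        # Equation 6
    tau = 1.0 / math.sqrt(2.0 * math.sqrt(n_syn))   # Equation 7
    for w_row, v_row in zip(weights, variances):
        for j in range(len(w_row)):
            w_row[j] += rng.gauss(0.0, 1.0) * v_row[j]            # Equation 4
            v_row[j] *= math.exp(tau_prime * rng.gauss(0.0, 1.0)
                                 + tau * rng.gauss(0.0, 1.0))     # Equation 5


def duplicate_neuron(w_in, w_out, j):
    """Duplicate hidden neuron j in place (Equations 9-10).

    w_in:  each row holds the incoming weights of one neuron of the layer.
    w_out: each column holds the outgoing weights of one neuron of the layer.
    """
    w_in.append(list(w_in[j]))   # the copy receives the same inputs
    for row in w_out:
        row[j] *= 0.5            # halve the outgoing weights of neuron j
        row.append(row[j])       # the copy gets the same halved weights


def next_layer_input(w_in, w_out, x):
    """Weighted input reaching the next layer, with tanh hidden neurons."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w_in]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_out]


# Duplicating a neuron leaves the signal sent to the next layer unchanged.
w_in = [[0.5, -1.0], [0.25, 0.75]]
w_out = [[1.0, -2.0]]
x = [0.3, -0.7]
before = next_layer_input(w_in, w_out, x)
duplicate_neuron(w_in, w_out, 1)
after = next_layer_input(w_in, w_out, x)
```

Comparing `before` and `after` shows the neutrality argued in the text: the two values agree up to floating-point rounding, because the duplicated neuron and its copy each contribute half of the original signal.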
[Figure: parents ‘a’ and ‘b’ with their cut points, and the offspring obtained from ‘a’ and ‘b’.]
Fig. 1. Single-Point Crossover Representation
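The single-point crossover just described can be sketched in Python as follows. The genotype layout (a dictionary holding the layer sizes with the input layer first, plus one weight and one variance matrix per pair of consecutive layers), the helper names, and the omission of biases are assumptions made for illustration only.

```python
import random


def random_matrix(rows, cols, rng):
    """Matrix of N(0, 1) samples, stored as a list of rows."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def ones(rows, cols):
    return [[1.0] * cols for _ in range(rows)]


def random_individual(sizes, rng):
    """Hypothetical genotype: weights[i] connects layer i to layer i + 1."""
    pairs = list(zip(sizes[1:], sizes))  # (rows, cols) of each matrix
    return {
        "sizes": list(sizes),
        "weights": [random_matrix(r, c, rng) for r, c in pairs],
        "variances": [ones(r, c) for r, c in pairs],
    }


def copy_matrices(matrices):
    return [[row[:] for row in m] for m in matrices]


def single_point_crossover(parent_a, parent_b, rng=random):
    """Join the head of parent_a to the tail of parent_b."""
    # Each parent gets its own cut point, always between two layers.
    cut_a = rng.randrange(1, len(parent_a["sizes"]))
    cut_b = rng.randrange(1, len(parent_b["sizes"]))
    sizes = parent_a["sizes"][:cut_a] + parent_b["sizes"][cut_b:]
    # New connection at the junction: normal weights, all-ones variances.
    rows, cols = sizes[cut_a], sizes[cut_a - 1]
    child = {"sizes": sizes}
    for key, junction in (("weights", random_matrix(rows, cols, rng)),
                          ("variances", ones(rows, cols))):
        child[key] = (copy_matrices(parent_a[key][:cut_a - 1]) + [junction]
                      + copy_matrices(parent_b[key][cut_b:]))
    return child


rng = random.Random(42)
parent_a = random_individual([8, 4, 3, 1], rng)
parent_b = random_individual([8, 5, 2, 6, 1], rng)
child = single_point_crossover(parent_a, parent_b, rng)
```

Because the head always contains the input layer and the tail always contains the output layer, the offspring keeps the number of inputs and outputs of its parents, while its number of hidden layers may differ from both.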
Vertical Crossover
The second type of crossover operator is a kind of ‘vertical’ crossover and it is implemented as shown in Figure 2. Once the new population has been created by the selection operator described in Section 3.4, two individuals are chosen for coupling and their neural structures are compared. If their topology lengths l differ, the hidden layer insertion mutation operator is applied to the shorter neural topology in order to obtain individuals with the same number of layers. Then a new individual is created, the child of the two selected parents. The neural structure of the new individual is created by adding the numbers of neurons in each hidden layer of the two parents, except for the input and output layers (which are the same for every neural network). The new input-weights matrix $W^{(0)}$ and the corresponding variance matrix $\mathrm{Var}^{(0)}$ are obtained by appending the matrix of the second parent to the matrix of the first parent. Then, the new weight matrix $W^{(i)}$ and the corresponding variance matrix $\mathrm{Var}^{(i)}$ for each hidden layer of the new individual are defined as the block diagonal matrix of the matrix of the first parent and the matrix of the second parent. Bias values and corresponding variance matrices of the two parents are concatenated in order to obtain the new biases $b_{ij}$ and variances $\mathrm{Var}(b_{ij})$. All the weights of the inputs to the new output layer are set to half of the corresponding weights in the parents. The rationale of this choice
[Figure: parents ‘a’ and ‘b’ and the offspring obtained from them; the layer weight matrix of the offspring is the block diagonal matrix built from the layer weight matrices lw^(a) and lw^(b) of the two parents.]
Fig. 2. Merge-Crossover Representation.
is that, if both parents were ‘good’ networks, they would both supply the appropriate input to the output layer; without halving it, the contributions from the two subnetworks would add up and yield an approximately double input to the output layer. Therefore, halving the weights helps to make the operator as little disruptive as possible.
The main problem in the crossover implementation is to maintain, in a neural network, a structural alignment between the neurons of each parent when a new offspring is created. Without alignment, the neural model could suffer disruptive effects. Another important open issue, which also arises in the two approaches implemented in this work, regards the initialization of connection weight values at the merging point between the two selected parents. In this particular approach, in the first crossover operator, which is single-point, new weight matrices at the merging point of the offspring are initialized from a normal distribution, while the corresponding variance matrices are set to matrices of all ones. In the vertical crossover, the new weight matrices are defined, at the merging point, as the block diagonal matrix of the matrix of
the first parent and the matrix of the second parent. Also in this case, the new connections will be initialized from a normal distribution. In any case, the frequency and the size of crossover changes must be carefully selected in order to assure the balance between exploring and exploiting abilities.
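The vertical (merge) crossover can be illustrated with the following Python sketch, which also checks the rationale for halving the output weights. It assumes that both parents already have the same number of layers and at least one hidden layer, ignores biases and variance matrices, and fills the off-diagonal blocks with zeros; these simplifications and the function names are illustrative assumptions rather than the method as implemented by the authors.

```python
import math


def block_diagonal(a, b):
    """Block diagonal matrix of a and b, stored as a list of rows."""
    cols_a, cols_b = len(a[0]), len(b[0])
    top = [row + [0.0] * cols_b for row in a]
    bottom = [[0.0] * cols_a + row for row in b]
    return top + bottom


def vertical_crossover(wa, wb):
    """Merge two parents given as lists of weight matrices.

    wa[i] and wb[i] connect layer i to layer i + 1; each row holds the
    incoming weights of one destination neuron.
    """
    assert len(wa) == len(wb) >= 2
    # W(0): the second parent's matrix is appended to the first parent's.
    child = [[row[:] for row in wa[0]] + [row[:] for row in wb[0]]]
    # Hidden-to-hidden matrices: block diagonal of the two parents.
    for a, b in zip(wa[1:-1], wb[1:-1]):
        child.append(block_diagonal(a, b))
    # Output layer: concatenate the parents' weights and halve them.
    child.append([[0.5 * w for w in ra + rb] for ra, rb in zip(wa[-1], wb[-1])])
    return child


def output_input(weights, x):
    """Weighted input of the output layer, with tanh hidden neurons."""
    for w in weights[:-1]:
        x = [math.tanh(sum(wi * xi for wi, xi in zip(row, x))) for row in w]
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in weights[-1]]


# Parent a has hidden layers of sizes 2 and 2; parent b of sizes 2 and 1.
wa = [[[0.1, 0.2], [0.3, 0.4]], [[0.5, -0.6], [0.7, 0.8]], [[1.0, -1.0]]]
wb = [[[-0.2, 0.1], [0.4, 0.3]], [[0.9, -0.5]], [[2.0]]]
child = vertical_crossover(wa, wb)
x = [0.3, -0.7]
average = 0.5 * (output_input(wa, x)[0] + output_input(wb, x)[0])
merged = output_input(child, x)[0]
```

In this sketch `merged` equals `average` up to rounding: the two subnetworks stay independent inside the offspring, and halving the output weights turns their summed contributions into the mean of the parents' signals.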
4 Real-World Applications
As indicated in previous sections, this approach has been applied to three different real-world problems. Section 4.1 describes an industrial application for the design of neural engine controllers. The second application, presented in Section 4.2, concerns a neural classification algorithm for brain wave signal processing, in particular the analysis of the P300 Evoked Potential. Finally, the third application, presented in Section 4.3, considers a financial modeling problem, whereby a factor model capturing the mutual relationships among several financial instruments is sought.
4.1 Fault Diagnosis Problem
The algorithm described in Section 3 was first tested by applying it to a real-world diagnosis application. Every industrial application requires a suitable monitoring system for its processes in order to identify any decrease in efficiency and any loss. Generic information from an electric power measurement system, which monitors the power consumption of an electric component, can be usefully exploited for sensorless monitoring of an AC motor drive. Having in mind the recent trend toward more and more integrated systems, where the drive can be considered as a “black-box”, the only points of the system assumed to be accessible are the AC input terminals.
The experimental setup is realized as shown in Figure 3, where a three-phase PWM inverter with a switching frequency of 4 kHz was used. The DC-link between the Rectifier and the PWM-Inverter performs a filtering action with respect to the AC input, theoretically eliminating most of the
[Figure: block diagram with the labels AC, Hall-effect Transducer, Rectifier, DC-Link, PWM-Inverter, M, A/D Acquisition board, Wavelet processing stage, GA-Tuned ANN, Diagnosis.]
Fig. 3. The experimental setup used for the fault diagnosis application
information about the output circuits of the drive and the motor. Instead, it was proved that the operating condition of the AC motor will appear on the AC side as a transient phenomenon or a sudden variation in the load power. The presence of this electrical transient in the current suggests an approach based on time-frequency or, better, time-scale analysis. In particular, the Discrete Wavelet Transform (DWT) [52] can be used efficiently to separate the information. The genetic approach involves the analysis of the signal (the load current) through wavelet series decomposition. The decomposition results in a set of coefficients, each carrying local time-frequency information. An orthogonal basis function is chosen, thus avoiding redundancy of information and allowing for easy computation. The computation of the wavelet series coefficients can be efficiently performed with the Mallat algorithm [31]. The coefficients are computed by processing the samples of the signal with a filter bank where the coefficients of the filters are specific to the chosen family of wavelets. Figure 4 shows the bandpass filtering, which is implemented as a pair of filters, a lowpass filter $g_i(n)$ and a highpass filter $h_i(n)$, with mirrored characteristics. In particular, in this application the 6-coefficient Daubechies wavelet [15] was used. Table 3 reports the filter coefficients of this wavelet. The wavelet coefficients allow a compact representation of the signal, and the features of the normal operating or faulty conditions are condensed in the wavelet coefficients. Conversely, the features of given operating modes can be
[Figure: two-level filter bank applied to the current $i_s(n)$. Legend: $h_i(n)$ = highpass filter, $g_i(n)$ = lowpass filter, 2 = downsample by 2, $d_1$ = detail, $a_1$ = approximation; a max block follows each detail output $d_1$, $d_2$.]
Fig. 4. Bandpass filtering

Table 3. Filter coefficients of the 6-coefficient Daubechies wavelet
Coefficient   Low-pass decomposition filter   High-pass decomposition filter
1              0.035226                       −0.33267
2             −0.085441                        0.80689
3             −0.13501                        −0.45988
4              0.45988                        −0.13501
5              0.80689                         0.085441
6              0.33267                         0.035226
Fig. 5. A depiction of the logical structure of a case study in the fault diagnosis problem. The elements w1 to w8 are the maximum coefficients for each level of wavelet decomposition; C indicates whether the case is faulty or not
Fig. 6. Conventional approach to the fault diagnosis problem by means of a neurofuzzy network with pre-defined topology trained by BP
recognized in the wavelet coefficients of the signal and the operating mode can be identified. By employing wavelet analysis, both the identification of the drive operating conditions (faulty or normal operation) and the identification of significant parameters for the specific condition have been obtained. Figure 5 depicts the logical structure of the data describing a case study. Each vector is known to have originated in fault or non-fault conditions, so it can be associated with a fault index C equal to 1 (fault condition) or 0 (normal condition). This problem has already been approached with a neuro-fuzzy network, whose structure was defined a priori, trained with BP [14], as indicated in Figure 6.
Experiments
In this approach, both the network structure (topology) and the weights have to be determined through evolution at the same time, as depicted in Figure 7. The proposed model has to look for networks with 8 input attributes (the features w1 to w8 of Figure 5, corresponding to the maximum coefficients for each level of wavelet decomposition) and 1 output, the diagnosis C, which is interpreted as an estimate of the fault probability: zero thus means a fault is not expected at all, whereas one is the certainty that a fault is about to occur.
A. Azzini and A.G.B. Tettamanzi
Fig. 7. Neuro-genetic approach to the fault diagnosis problem
The data used for learning have been obtained from a Virtual Test Bed (VTB) model simulator of a real engine. Several settings of five parameters, namely the backpropagation flag bp, the population size n, and the three structural mutation probabilities $p^+_{layer}$, $p^-_{layer}$, and $p^+_{neuron}$, have been explored in order to assess the robustness of the approach and to determine an optimal set-up. The crossover probability $p_{cross}$ is set to 0 for all runs, because neither single-point crossover nor merge crossover gives satisfactory results for this problem. All the other parameters are set to the default values shown in Table 2. All runs were allocated the same fixed number of neural network executions, to allow for a fair comparison between the cases with and without backpropagation. The results are summarized in Table 4 (bp = 1) and Table 5 (bp = 0). Ten runs are executed for each setting, and the average and standard deviation of the best solutions found are reported.

Results

A first comment can be made regarding the size of the population. In most cases, the solutions found with the larger population (n = 60) are better than those found with the smaller one (n = 30). With bp = 1, 15 settings out of 27 give better results with n = 60, while with bp = 0, 19 settings out of 27 give better results with the larger population. The best solutions, on average, have been found with $p^+_{layer}$ = 0.1, $p^-_{layer}$ = 0.2, and $p^+_{neuron}$ = 0.05, and there is a clear tendency for the runs using backpropagation (bp = 1) to consistently obtain better quality solutions. The best model over all runs performed is a multi-layer perceptron with a phenotype of type [3, 1], here specified without input layer.
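The parameter sweep described above (three levels for each of the three mutation probabilities, two population sizes, with and without backpropagation, ten runs each) can be enumerated as in the following sketch. The dictionary keys and function names are hypothetical; only the parameter values and the numbering of the settings follow Tables 4 and 5.

```python
from itertools import product

MUTATION_LEVELS = (0.05, 0.1, 0.2)   # values explored for each probability
POPULATION_SIZES = (30, 60)          # n
BP_FLAGS = (0, 1)                    # without / with backpropagation
RUNS_PER_SETTING = 10
P_CROSS = 0.0                        # crossover disabled in all runs


def mutation_settings():
    """Yield (setting number, p_layer_plus, p_layer_minus, p_neuron_plus)."""
    # product() varies the last factor fastest, matching the table order.
    for number, probs in enumerate(product(MUTATION_LEVELS, repeat=3), start=1):
        yield (number, *probs)


def experiment_plan():
    """List every configuration in the sweep as a plain dictionary."""
    plan = []
    for bp, n in product(BP_FLAGS, POPULATION_SIZES):
        for number, p_lp, p_lm, p_np in mutation_settings():
            plan.append({
                "setting": number, "bp": bp, "n": n, "p_cross": P_CROSS,
                "p_layer_plus": p_lp, "p_layer_minus": p_lm,
                "p_neuron_plus": p_np, "runs": RUNS_PER_SETTING,
            })
    return plan


settings = list(mutation_settings())
plan = experiment_plan()
total_runs = sum(cfg["runs"] for cfg in plan)
```

Under this enumeration, setting 16 corresponds to the combination (0.1, 0.2, 0.05) reported as best on average for this problem.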
Table 4. Experimental results for the engine fault diagnosis problem with bp = 1

Setting  $p^+_{layer}$  $p^-_{layer}$  $p^+_{neuron}$  avg (n = 30)  stdev (n = 30)  avg (n = 60)  stdev (n = 60)
1   0.05  0.05  0.05  0.11114  0.0070719  0.106    0.0027268
2   0.05  0.05  0.1   0.10676  0.003172   0.10606  0.0029861
3   0.05  0.05  0.2   0.10776  0.0066295  0.10513  0.0044829
4   0.05  0.1   0.05  0.10974  0.0076066  0.10339  0.0036281
5   0.05  0.1   0.1   0.1079   0.0067423  0.10696  0.0050514
6   0.05  0.1   0.2   0.10595  0.0035799  0.10634  0.0058783
7   0.05  0.2   0.05  0.10332  0.0051391  0.10423  0.0030827
8   0.05  0.2   0.1   0.10723  0.0097194  0.10496  0.0050782
9   0.05  0.2   0.2   0.10684  0.007072   0.1033   0.0031087
10  0.1   0.05  0.05  0.10637  0.0041459  0.10552  0.0031851
11  0.1   0.05  0.1   0.10579  0.0050796  0.10322  0.0035797
12  0.1   0.05  0.2   0.10635  0.0049606  0.10642  0.0042313
13  0.1   0.1   0.05  0.10592  0.0065002  0.10889  0.0038811
14  0.1   0.1   0.1   0.10814  0.0064667  0.10719  0.0054168
15  0.1   0.1   0.2   0.10851  0.0051502  0.11015  0.0055841
16  0.1   0.2   0.05  0.10267  0.005589   0.10318  0.0085395
17  0.1   0.2   0.1   0.10644  0.0045312  0.10431  0.0041649
18  0.1   0.2   0.2   0.10428  0.004367   0.10613  0.0052063
19  0.2   0.05  0.05  0.10985  0.0059448  0.10757  0.0045103
20  0.2   0.05  0.1   0.10593  0.0048254  0.10643  0.0056578
21  0.2   0.05  0.2   0.10714  0.0043861  0.10884  0.0049295
22  0.2   0.1   0.05  0.10441  0.0051143  0.10789  0.0046945
23  0.2   0.1   0.1   0.1035   0.0030094  0.1083   0.0031669
24  0.2   0.1   0.2   0.10722  0.0048851  0.1069   0.0050953
25  0.2   0.2   0.05  0.10285  0.0039064  0.1079   0.0045474
26  0.2   0.2   0.1   0.10785  0.008699   0.10768  0.0061734
27  0.2   0.2   0.2   0.10694  0.0052523  0.10652  0.0050768
In all cases, the relative standard deviation is small enough that a good solution can be expected within a few runs. A comparison with the results obtained in [14] for a hand-crafted neuro-fuzzy network did not reveal any significant difference. This is an extremely positive outcome, given the expert time and effort spent in hand-crafting the neuro-fuzzy network, as compared to the practically null effort required to set up these experiments. On the other hand, the amount of computing resources required was substantially greater with this neuro-genetic approach. The experiments showed that the algorithm is somewhat robust w.r.t. the setting of its parameters, i.e., its performance is not very sensitive to their fine tuning.
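The relative standard deviation mentioned above is simply the standard deviation divided by the average. As a small worked illustration, the following sketch computes it for setting 16 of Table 4; the helper name `relative_std` is hypothetical.

```python
def relative_std(avg, stdev):
    """Relative standard deviation: the stdev divided by the mean."""
    return stdev / avg


# (avg, stdev) pairs copied from Table 4 for setting 16 (bp = 1),
# keyed by population size n.
SETTING_16 = {30: (0.10267, 0.005589), 60: (0.10318, 0.0085395)}

for n, (avg, stdev) in SETTING_16.items():
    print(f"n = {n}: relative stdev = {relative_std(avg, stdev):.1%}")
```

For these two entries the ratio is roughly 5% and 8%, respectively.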
Table 5. Experimental results for the engine fault diagnosis problem with bp = 0

Setting  $p^+_{layer}$  $p^-_{layer}$  $p^+_{neuron}$  avg (n = 30)  stdev (n = 30)  avg (n = 60)  stdev (n = 60)
1   0.05  0.05  0.05  0.14578  0.013878   0.13911  0.0086825
2   0.05  0.05  0.1   0.1434   0.011187   0.13573  0.013579
3   0.05  0.05  0.2   0.13977  0.014003   0.13574  0.010239
4   0.05  0.1   0.05  0.14713  0.0095158  0.13559  0.011214
5   0.05  0.1   0.1   0.14877  0.010932   0.13759  0.014255
6   0.05  0.1   0.2   0.14321  0.0095505  0.1309   0.012189
7   0.05  0.2   0.05  0.14304  0.014855   0.13855  0.0089141
8   0.05  0.2   0.1   0.13495  0.015099   0.13655  0.0079848
9   0.05  0.2   0.2   0.14613  0.010733   0.14165  0.013385
10  0.1   0.05  0.05  0.13939  0.013532   0.13473  0.0085242
11  0.1   0.05  0.1   0.13781  0.0094961  0.13991  0.012132
12  0.1   0.05  0.2   0.13692  0.017408   0.13143  0.012919
13  0.1   0.1   0.05  0.13348  0.009155   0.1363   0.013102
14  0.1   0.1   0.1   0.13785  0.013465   0.13836  0.0094587
15  0.1   0.1   0.2   0.14076  0.01551    0.13994  0.011786
16  0.1   0.2   0.05  0.1396   0.0098416  0.13719  0.016372
17  0.1   0.2   0.1   0.13597  0.012948   0.14091  0.014344
18  0.1   0.2   0.2   0.14049  0.013535   0.13665  0.011426
19  0.2   0.05  0.05  0.13486  0.0079435  0.14068  0.013874
20  0.2   0.05  0.1   0.13536  0.0112     0.12998  0.013489
21  0.2   0.05  0.2   0.13328  0.0087402  0.1314   0.0088282
22  0.2   0.1   0.05  0.13693  0.0096481  0.13456  0.012431
23  0.2   0.1   0.1   0.13771  0.015971   0.13939  0.0092643
24  0.2   0.1   0.2   0.13204  0.010325   0.1378   0.01028
25  0.2   0.2   0.05  0.14062  0.012129   0.14005  0.011195
26  0.2   0.2   0.1   0.14171  0.008802   0.13877  0.0094973
27  0.2   0.2   0.2   0.14216  0.015659   0.13965  0.015732
The results obtained on this real-world problem compared well against alternative approaches based on the conventional training of a predefined neuro-fuzzy network with BP.

4.2 Brain Wave Analysis

Brain-Computer Interfaces

Brain-Computer Interfaces (BCI) represent a new communication option for those suffering from neuromuscular impairment that prevents them from using conventional input devices, such as mice, keyboards, joysticks, etc. This new approach has been developing quickly during the last few years, thanks to the increase of computational power and the availability of new algorithms for
signal processing that can be used to analyze brain waves. During the first international meeting on BCI technology, Jonathan R. Wolpaw formalized the definition of BCI systems as follows:

A brain-computer interface (BCI) is a communication or control system in which the user's messages or commands do not depend on the brain's normal output channels. That is, the message is not carried by nerves and muscles, and, furthermore, neuromuscular activity is not needed to produce the activity that does carry the message [57].

According to this definition, BCI systems appear as a possible, and sometimes the only, mode of communication for people with severe neuromuscular disorders like spinal cord injury or cerebral paralysis. Exploiting the residual functions of the brain may allow those patients to communicate. The human brain has an intense chemical and electrical activity, partially characterized by peculiar electrical patterns, which occur at specific times and at well-localized brain sites. All of that is observable with a certain level of repeatability under well-defined environmental conditions. These simple physiological facts can lead to the development of new communication systems.

Problem Description

One of the most widely used electrical activities of the brain for BCI is the so-called P300 Evoked Potential. This wave is a late-appearing component of an Event Related Potential (ERP), which can be auditory, visual or somatosensory. It has a latency of about 300 ms and is elicited by rare or significant stimuli, when these are interspersed with frequent or routine stimuli. Its amplitude is strongly related to the unpredictability of the stimulus: the more unexpected the stimulus, the higher the amplitude. This particular wave has been used to make a subject choose between different stimuli [17, 18]. The general idea of Donchin's solution is that the patient is able to generate this signal without any training.
This is due to the fact that the P300 is the brain's response to an unexpected or surprising event and is generated naturally. Donchin developed a BCI system able to detect an elicited P300 by signal averaging techniques (to reduce the noise) and used a specific method to speed up the overall performance. Donchin's idea has been adopted and further developed by Beverina and colleagues of ST Microelectronics [10]. In this application, the neuro-genetic approach described in Section 3 has been applied to the same dataset of P300 evoked potentials used by Beverina and colleagues for their approach to brain signal analysis based on support vector machines.

Experiments

The dataset provided by Beverina and colleagues consists of 700 negative cases and 295 positive cases. The features are based on wavelets, morphological
criteria and power in different time windows, for a total of 78 real-valued input attributes and 1 binary output attribute, indicating the class (positive or negative) of the relevant case. A positive case is one for which the response to the stimulus is correct; a negative case is one for which the response is incorrect. In order to create a balanced dataset of the same cardinality as the one used by Beverina and colleagues, for each run of the evolutionary algorithm 218 positive cases are extracted from the 295 positive cases of the original set, and 218 negative cases from the 700 negative cases, to create a 436-case training dataset; for each run, a 40-case test set is also created by randomly extracting 20 positive cases and 20 negative cases from the remainder of the original dataset, so that there is no overlap between the training and the test sets. This is the same protocol followed by Beverina and colleagues. For each run of the evolutionary algorithm, up to 25,000 network evaluations (i.e., simulations of the network on the whole training set) are allowed, including those performed by the backpropagation algorithm. One hundred runs of the neuro-genetic approach with parameters set to the values listed in Table 1 were executed with bp = 0 and bp = 1, i.e., both without and with backpropagation. The $p_{cross}$ parameter is kept at 0 for all runs, because neither single-point crossover nor merge crossover gives satisfactory results in these simulations either.

Results

The results obtained with the settings defined above are presented in Table 6. Due to the way the training set and the test set are used, it is not surprising that error rates on the test sets look better than error rates on the training sets. That happens because, in the case of bp = 1, the performance of a network on the test set is used to calculate its fitness, which is used by the evolutionary algorithm to perform selection.
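The sampling protocol described in the Experiments paragraph (218 + 218 cases for training, 20 + 20 disjoint cases for testing, redrawn at each run) can be sketched as follows. The function name `balanced_split`, the use of Python's `random` module, and the placeholder case identifiers are assumptions made for illustration.

```python
import random


def balanced_split(positives, negatives, n_train=218, n_test=20, rng=None):
    """Draw balanced, non-overlapping training and test sets for one run."""
    rng = rng or random.Random()
    pos, neg = list(positives), list(negatives)
    rng.shuffle(pos)
    rng.shuffle(neg)
    # Training cases come first; test cases are taken from the remainder.
    train = pos[:n_train] + neg[:n_train]
    test = pos[n_train:n_train + n_test] + neg[n_train:n_train + n_test]
    return train, test


# Placeholder identifiers standing in for the 295 positive and 700 negative cases.
positives = [("pos", i) for i in range(295)]
negatives = [("neg", i) for i in range(700)]
train_set, test_set = balanced_split(positives, negatives, rng=random.Random(0))
```

Shuffling each class once and slicing guarantees that the two sets cannot share a case, which is the property the protocol requires.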
As a consequence of selecting on test-set fitness, only networks whose performance on the test set is better than average are selected for reproduction. The best solution has been found by the algorithm using backpropagation and is a multi-layer perceptron with one hidden layer of 4 neurons, which gives 22 false positives and 29 false negatives on the training set, while it commits no classification error on the test set.

Table 6. Error rates of the best solutions found by the neuro-genetic approach with and without the use of backpropagation, averaged over 100 runs

Set       bp  False positives (avg)  False positives (stdev)  False negatives (avg)  False negatives (stdev)
training  0   93.28                  38.668                   86.14                  38.289
training  1   29.42                  14.329                   36.47                  12.716
test      0   7.62                   3.9817                   7.39                   3.9026
test      1   1.96                   1.4697                   2.07                   1.4924

The results obtained by the neuro-genetic approach, without any specific tuning of the parameters, appear promising. To provide a reference, the average number of false positives obtained by Beverina and colleagues with support vector machines is 9.62 on the training set and 3.26 on the test set, whereas the average number of false negatives is 21.34 on the training set and 4.45 on the test set [43].

4.3 Financial Modeling

The last application of this neural evolutionary approach presented in this chapter concerns the building of factor models of financial instruments. Factor models are statistical models (in this case ANNs) that represent the returns of a financial instrument as a function of the returns of other financial instruments [24]. Factor models are used primarily for statistical arbitrage. A statistical arbitrageur builds a hedge portfolio consisting of one or more long positions and one or more short positions in various correlated instruments. When the price of one of the instruments diverges from the value predicted by the model, the arbitrageur puts on the arbitrage, by going long that instrument and short the others, if the price is lower than predicted, or short that instrument and long the others, if the price is higher. If the model is correct, the price will tend to revert to the value predicted by the model, and the arbitrageur will profit. To study its capabilities, the approach was tried on a factor modeling problem whereby the Dow Jones Industrial Average (DJIA) is modeled against a number of other market indices, including foreign exchange rates, stocks of individual companies taken as representatives of entire market segments, and commodity prices, as shown in Table 7.
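The arbitrage rule described above (long the instrument when its price is below the model's prediction, short when it is above, with the opposite position in the other instruments) can be written as a very small decision function. This is only an illustrative sketch: the function name `arbitrage_signal` and the `threshold` parameter, which suppresses trades on small divergences, are assumptions and are not part of the strategy as described in the chapter.

```python
def arbitrage_signal(actual_price, predicted_price, threshold=0.0):
    """Return the position to take in the modeled instrument.

    "long"  : price is below the prediction (hedge legs go short)
    "short" : price is above the prediction (hedge legs go long)
    "flat"  : divergence within the (hypothetical) threshold, no trade
    """
    gap = actual_price - predicted_price
    if gap < -threshold:
        return "long"
    if gap > threshold:
        return "short"
    return "flat"


# Hypothetical prices, for illustration only.
decisions = [
    arbitrage_signal(98.0, 100.0, threshold=1.0),
    arbitrage_signal(103.0, 100.0, threshold=1.0),
    arbitrage_signal(100.5, 100.0, threshold=1.0),
]
```

The profit of such a rule depends entirely on the price reverting toward the predicted value, which is why the quality of the factor model matters.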
Experiments

In this application the training and test sets are created by considering daily returns for the period from the 2nd of January, 2001 to the 30th of November, 2005. All data are divided into two datasets, with 1000 cases for the training set and 231 cases for the test set, respectively. The validation set consists of the daily returns for the period from the 1st of December, 2005 to the 13th of January, 2006. All the parameters are set to the default values shown in Table 2, while the $p_{cross}$ parameter is set to 0, because the two types of crossover defined give unsatisfactory results in this application. This is probably due to the fact that, whereas evolutionary algorithms are known to be quite effective in exploring the search space, they are in general quite poor at closing in on a local optimum; backpropagation, which is essentially a local optimization algorithm, appears to complement the evolutionary approach well.
Table 7. Input Market Indices

Class                      Ticker  Description
Foreign Exchange Rates     EURGBP  1 EUR = x GBP
                           GBPEUR  1 GBP = x EUR
                           EURJPY  1 EUR = x JPY
                           JPYEUR  1 JPY = x EUR
                           GBPJPY  1 GBP = x JPY
                           JPYGBP  1 JPY = x GBP
                           USDEUR  1 USD = x EUR
                           EURUSD  1 EUR = x USD
                           USDGBP  1 USD = x GBP
                           GBPUSD  1 GBP = x USD
                           USDJPY  1 USD = x JPY
                           JPYUSD  1 JPY = x USD
Industry Representatives*  DNA     Biotechnologies
                           TM      Motors
                           DOW     Chemicals
                           NOK     Communications
                           JNJ     Drug Manufacturers
                           UN      Food
                           BAB     Airlines
                           XOM     Oil & Gas
                           BHP     Metal & Mineral
                           AIG     Insurance
                           INTC    Semiconductors
                           VZ      Telecom
                           GE      Conglomerates
Commodities                OIL     Crude Oil, $/barrel
                           AU      Gold, $/Troy ounce
                           AG      Silver, $/Troy ounce
US Treasury Bonds          TYX     30-year bond
                           TNX     10-year note
                           FVX     5-year note
                           IRX     13-week bill

* Representatives are, as a rule, the companies with largest market capitalization for their sector.
Time series of the training, test and validation sets are preprocessed by subtracting the 20-day moving average from all the components. Several runs of this approach have been carried out in order to find optimal settings of the genetic parameters $p^+_{layer}$, $p^-_{layer}$, and $p^+_{neuron}$. For each run of the evolutionary algorithm, up to 100,000 network evaluations (i.e., simulations of the network on the whole training set) have been allowed, including those performed by the backpropagation algorithm. All simulations have been carried out with bp = 1; not all cases with bp = 0 have been considered, since no optimal results are obtained from these simulations. The results obtained are presented in Table 8, which reports the average and the standard deviation of the test fitness values of the best solutions found for each parameter setting over all runs.

Table 8. Financial Modeling Experimental Results (bp = 1)

Setting  $p^+_{layer}$  $p^-_{layer}$  $p^+_{neuron}$  avg     stdev
1   0.05  0.05  0.05  0.2988  0.0464
2   0.05  0.05  0.1   0.2980  0.0362
3   0.05  0.05  0.2   0.3013  0.0330
4   0.05  0.1   0.05  0.2865  0.0368
5   0.05  0.1   0.1   0.2813  0.0435
6   0.05  0.1   0.2   0.3040  0.0232
7   0.05  0.2   0.05  0.2845  0.0321
8   0.05  0.2   0.1   0.2908  0.0252
9   0.05  0.2   0.2   0.3059  0.0208
10  0.1   0.05  0.05  0.2987  0.0290
11  0.1   0.05  0.1   0.3039  0.0341
12  0.1   0.05  0.2   0.3155  0.0396
13  0.1   0.1   0.05  0.3011  0.0395
14  0.1   0.1   0.1   0.2957  0.0201
15  0.1   0.1   0.2   0.3083  0.0354
16  0.1   0.2   0.05  0.2785  0.0325
17  0.1   0.2   0.1   0.2911  0.0340
18  0.1   0.2   0.2   0.2835  0.0219
19  0.2   0.05  0.05  0.2852  0.0292
20  0.2   0.05  0.1   0.2983  0.0309
21  0.2   0.05  0.2   0.2892  0.0374
22  0.2   0.1   0.05  0.3006  0.0322
23  0.2   0.1   0.1   0.2791  0.0261
24  0.2   0.1   0.2   0.2894  0.0260
25  0.2   0.2   0.05  0.2892  0.0230
26  0.2   0.2   0.1   0.2797  0.0360
27  0.2   0.2   0.2   0.2783  0.0369

The best solutions, on average, have been found with $p^+_{layer}$ = 0.2, $p^-_{layer}$ = 0.2, and $p^+_{neuron}$ = 0.2, although they do not differ significantly from other solutions found with bp = 1. The best model over all runs performed has been found by the algorithm using backpropagation, and it is a multi-layer perceptron with a phenotype of type [2, 1], specified without input layer, which obtained a mean square error of 0.39 on the test set.

Results

One observation is that the approach is substantially robust with respect to the setting of parameters other than bp.
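The preprocessing step used in these experiments, namely removing the 20-day moving average from each series, could look like the sketch below. The chapter does not say whether the window is trailing or centered, nor how the first days are handled, so the trailing window and the shorter window used for the first 19 samples are assumptions, as is the function name `subtract_moving_average`.

```python
def subtract_moving_average(series, window=20):
    """Subtract a trailing moving average from each sample of `series`.

    Assumption: the window covers the current sample and the previous
    `window - 1` samples; early samples use all data available so far.
    """
    detrended = []
    running_sum = 0.0
    for i, value in enumerate(series):
        running_sum += value
        if i >= window:
            # Drop the sample that just left the window.
            running_sum -= series[i - window]
        count = min(i + 1, window)
        detrended.append(value - running_sum / count)
    return detrended


# Hypothetical series: a pure linear trend, for illustration only.
trend = [float(t) for t in range(40)]
detrended_trend = subtract_moving_average(trend)
```

On a pure linear trend with unit slope, the detrended value settles at a constant offset (9.5 for a 20-sample trailing window), which shows that the slow component is removed while the series length is preserved.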
Fig. 8. Comparison between the daily closing prices predicted by the best model (dashed line) and actual daily closing prices (solid line) of the DJIA on the validation set. (Plot axes: Returns versus Time.)
Figure 8 shows a satisfactory agreement between the output of the best model and the actual closing values of the DJIA on the validation set. The results obtained with few parameter settings appear promising with respect to further simulations with different parameter tuning. A comparison with simple linear regression on the same data has been carried out, in order to assess the quality of the model. The linear regression yields a linear model

$$y = \sum_{i=1}^{32} w_i x_i, \qquad (11)$$

where the $w_i$ are the same weight values as the weights of the best solution. The predictions obtained by the linear regression model are compared with those of the best solution found, as shown in Figure 9. The neuro-genetic solution obtained with our approach has an MSE of 1291.7, a better result than the MSE of 1320.5 of the prediction based on linear regression on the same validation dataset. The usefulness of such a model is evaluated with a paper simulation of a very simple statistical arbitrage strategy, carried out on the same validation set as the financial modeling. The strategy is described in detail in [6], and the results show that the information given by a neural network obtained by the approach would enable an arbitrageur to gain significant profit.
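The comparison above relies on two simple computations: evaluating the linear model of Eq. (11) and measuring the mean square error of a prediction series against the actual values. The following sketch shows both; the function names and the toy numbers are hypothetical and are not taken from the experiments.

```python
def linear_prediction(weights, inputs):
    """Evaluate y = sum_i w_i * x_i for one input vector, as in Eq. (11)."""
    return sum(w * x for w, x in zip(weights, inputs))


def mean_squared_error(predicted, actual):
    """Average of the squared differences between two equal-length series."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


# Toy data, for illustration only (two inputs instead of 32).
weights = [0.5, -0.25]
input_rows = [[2.0, 4.0], [4.0, 0.0], [0.0, -8.0]]
actual = [0.0, 2.5, 2.0]

linear_outputs = [linear_prediction(weights, row) for row in input_rows]
mse_linear = mean_squared_error(linear_outputs, actual)
```

The same `mean_squared_error` function would be applied to the outputs of the evolved network, and the model with the lower value on the validation set would be preferred.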
Fig. 9. Comparison between the daily closing prices predicted by the best model (dashed line), those predicted by the linear regression (dash-dotted line), and the actual daily closing prices (solid line) of the DJIA on the validation set. (Plot axes: Returns versus Time.)

5 Conclusion and Future Work

Research on the genetic evolution of ANNs for different applications has been a key issue in the ANN field. Neuro-genetic systems are coming of age, and they consider different evolutionary solutions, like architecture optimization of neural network models, connection weight optimization, and suitable combinations of these solutions. This chapter has reviewed different approaches presented in the literature, highlighting the distinctive aspects of neuro-genetic evolution for each technique.

A new evolutionary approach to the joint optimization of neural network weights and structure has been presented in this chapter. This approach can take advantage of both evolutionary algorithms and the backpropagation algorithm as a specialized decoder. This neuro-genetic approach has been validated and then successfully applied to three different real-world applications. The results obtained on the first application, fault diagnosis, compared well against alternative approaches based on the conventional training of a predefined neuro-fuzzy network with BP, and they showed that the algorithm is somewhat robust w.r.t. the setting of its parameters, i.e., its performance is not very sensitive to their fine tuning. In the second application, brain-wave analysis, the results obtained by the neuro-genetic approach, without any specific tuning of the parameters, appear promising after a comparison with a mature approach based on support vector machines. Finally, an application to financial modeling has been implemented and successfully validated.

In each real-world application implemented, we have considered the same parameters, keeping some of them constant, as indicated in the chapter, and we carried out several simulations in order to find the values of the other parameters that give the best solution for the problem considered. For a better explanation of the results, some input and output parameter values for the three real-world applications are shown in Table 9. Note that a comparison between these results cannot be carried out, since they refer to different real-world problems with different conditions.
Table 9. Summary of some parameter values of real-world applications.

Parameter             Fault Diagnosis Problem  Brain Wave Analysis  Financial Modeling
N-input               8                        78                   32
N-output              1                        1                    1
BP                    1                        1                    1
p-cross               0                        0                    0
$p^+_{layer}$         0.1                      0.1                  0.2
$p^-_{layer}$         0.2                      0.2                  0.2
$p^+_{neuron}$        0.05                     0.05                 0.2
Population Dimension  60                       60                   60
Network Simulations   25,000                   25,000               100,000
Best Topology         [3 1]                    [4 1]                [2 1]
Future work will study the efficiency and the robustness of this approach when the input data are affected by uncertainty due to errors introduced by measurement instruments. A further improvement could come from the elimination of algorithm parameters, even though this approach has been shown to be robust with respect to parameter tuning. Further studies on the design of new crossover operators that are as little disruptive as possible could improve the genetic algorithm implementation. The new merge crossover implemented in this work seems to be a promising step in that direction, even though, in its present form, its use did not boost the performance of the algorithm significantly.
References

1. Aarts EHL, Eiben AE, van Hee KM (1989) A general theory of genetic algorithms. Computing Science Notes. Eindhoven University of Technology, Eindhoven
2. Abraham A (2004) Meta learning evolutionary artificial neural networks. Neurocomputing 56:1–38
3. Angeline PJ, Saunders GM, Pollack JB (1994) An evolutionary algorithm that constructs recurrent neural networks. IEEE Trans Neural Netw 5:54–65
4. Azzini A, Cristaldi L, Lazzaroni M, Monti A, Ponci F, Tettamanzi AGB (2006) Incipient fault diagnosis in electrical drives by tuned neural networks. In: Proceedings of the IEEE instrumentation and measurement technology conference, IMTC 2006, Sorrento, Italy. IEEE, April, 24–27
5. Azzini A, Lazzaroni M, Tettamanzi AGB (2005) A neuro-genetic approach to neural network design. In: Sartori F, Manzoni S, Palmonari M (eds) AI*IA 2005 workshop on evolutionary computation. AI*IA, Italian Association for Artificial Intelligence, September 20, 2005
6. Azzini A, Tettamanzi AGB (2006) A neural evolutionary approach to financial modeling. In: Sigevo (ed) Proceedings of the genetic and evolutionary computation conference, GECCO 2006, Seattle, WA, July 8–12, 2006
7. Azzini A, Tettamanzi AGB (2006) A neural evolutionary classification method for brain-wave analysis. In: Proceedings of the European workshop on evolutionary computation in image analysis and signal processing, EvoIASP 2006, April 2006
8. Baeck T, Fogel DB, Michalewicz Z (2000) Evolutionary computation 1–2. Institute of Physics Publishing, Bristol and Philadelphia, Dirac House, Temple Back, Bristol, UK
9. Bersini H, Seront G (1992) In search of a good crossover between evolution and optimization. Parallel problem solving from Nature 2:479–488
10. Beverina F, Palmas G, Silvoni S, Piccione F, Giove S (2003) User adaptive BCIs: SSVEP and P300 based interfaces. PsychNology J 1(4):331–354
11. Castillo PA, Carpio J, Merelo JJ, Prieto A, Rivas V, Romero G (2000) Evolving multilayer perceptrons. Neural Process Lett 12(2):115–127
12. Castillo PA, González J, Merelo JJ, Rivas V, Romero G, Prieto A (1998) SA-Prop: optimization of multilayer perceptron parameters using simulated annealing. Neural Process Lett
13. Chalmers DJ (1990) The evolution of learning: an experiment in genetic connectionism. In: Touretzky DS, Elman JL, Hinton GE (eds) Connectionist models: proceedings of the 1990 summer school. Morgan Kaufmann, San Mateo, CA, pp 81–90
14. Cristaldi L, Lazzaroni M, Monti A, Ponci F (2004) A neurofuzzy application for AC motor drives monitoring system. IEEE Trans Instrum Measurement 53(4):1020–1027
15. Daubechies I (1992) Ten lectures on wavelets. Society for Industrial and Applied Mathematics, Philadelphia, Pennsylvania
16. Davis L (1989) Adapting operator probabilities in genetic algorithms. In: Schaffer J (ed) Proceedings of the third international conference on genetic algorithms. Morgan Kaufmann, San Mateo, CA, pp 61–69
17. Donchin E, Spencer KM, Wijesinghe R (2000) The mental prosthesis: assessing the speed of a P300-based brain–computer interface. IEEE Trans Rehabil Eng 8(2):174–179
18. Farwell LA, Donchin E (1988) Talking off the top of your head: toward a mental prosthesis utilizing event-related brain potentials. Electroencephalogr Clin Neurophysiol 70(6):510–523
19. Filho EFM, de Carvalho A (1997) Evolutionary design of MLP neural network architectures. In: IEEE Proceedings of the fourth Brazilian symposium on neural networks, pp 58–65
20. Fogel LJ, Owens AJ, Walsh MJ (1966) Artificial intelligence through simulated evolution. Wiley, New York
21. Goldberg DE (1992) Genetic algorithms in search, optimization & machine learning. Addison-Wesley, Reading, MA
22. Hancock PJB (1992) Genetic algorithms and permutation problems: a comparison of recombination operators for neural net structure specification. In: Whitley LD, Schaffer JD (eds) Proceedings of the third international workshop on combinations genetic algorithms neural networks, 1992, pp 108–122
23. Harp S, Samad T, Guha A (1991) Towards the genetic synthesis of neural networks. Fourth international conference on genetic algorithms, pp 360–369
24. Harris L (2003) Trading and exchanges: market microstructure for practitioners. Oxford University Press, Oxford
25. Holland JH (1975) Adaptation in natural and artificial systems. The University of Michigan Press, Ann Arbor, MI
26. De Jong KA (1993) Evol Comput. MIT, Cambridge, MA
27. Keesing R, Stork DG (1991) Evolution and learning in neural networks: the number and distribution of learning trials affect the rate of evolution. Adv Neural Inf Process Syst 3:805–810
28. Knerr S, Personnaz L, Dreyfus G (1992) Handwritten digit recognition by neural networks with single-layer training. IEEE Trans Neural Netw 3:962–968
29. Koza JR (1994) Genetic programming. The MIT, Cambridge, MA
30. Leung EHF, Lam HF, Ling SH, Tam PKS (2003) Tuning of the structure and parameters of a neural network using an improved genetic algorithm. IEEE Trans Neural Netw 14(1):54–65
31. Mallat S (1999) A wavelet tour of signal processing. Academic, San Diego, CA
32. Maniezzo V (1993) Granularity evolution. In: Proceedings of the fifth international conference on genetic algorithm and their applications, p 644
33. Maniezzo V (1993) Searching among search spaces: hastening the genetic evolution of feedforward neural networks. In: International conference on neural networks and genetic algorithms, GA-ANN'93, pp 635–642
34. Maniezzo V (1994) Genetic evolution of the topology and weight distribution of neural networks. IEEE Trans Neural Netw 5(1):39–53
35. Merelo JJ, Patón M, Canas A, Prieto A, Morán F (1993) Optimization of a competitive learning neural network by genetic algorithms. IWANN93. Lect Notes Comp Sci 686:185–192
36. Michalewicz Z (1996) Genetic algorithms + data structures = evolution programs. Springer, Berlin Heidelberg New York
37. Miller GF, Todd PM, Hegde SU (1989) Designing neural networks using genetic algorithms. In: Schaffer JD (ed) Proceedings of the third international conference on genetic algorithms, pp 379–384
38. Moller MF (1993) A scaled conjugate gradient algorithm for fast supervised learning. Neural Netw 6(4):525–533
39. Montana D, Davis L (1989) Training feedforward neural networks using genetic algorithms. In: Proceedings of the eleventh international conference on artificial intelligence. Morgan Kaufmann, Los Altos, CA, pp 762–767
40. Mordaunt P, Zalzala AMS (2002) Towards an evolutionary neural network for gait analysis. In: Proceedings of the 2002 congress on evolutionary computation, vol 2. pp 1238–1243
41. Mozer MC, Smolensky P (1989) Using relevance to reduce network size automatically. Connect Sci 1(1):3–16
42. Muhlenbein H, Schlierkamp-Voosen D (1993) The science of breeding and its application to the breeder genetic algorithm (BGA). Evol Comput 1(4):335–360
43. Palmas G (2005) Personal communication, November 2005
44. Palmes PP, Hayasaka T, Usui S (2005) Mutation-based genetic neural network. IEEE Trans Neural Netw 16(3):587–600
45. Pedrajas NG, Martinez CH, Pérez JM (2003) COVNET: a cooperative coevolutionary model for evolving artificial neural networks. IEEE Trans Neural Netw 14(3):575–596
46. Rechenberg I (1973) Evolutionsstrategie: Optimierung technischer Systeme nach Prinzipien der biologischen Evolution. Frommann-Holzboog, Stuttgart
47. Riedmiller M, Braun H (1993) A direct adaptive method for faster backpropagation learning. In: Proceedings of the international conference on neural networks, pp 586–591
48. Rudolph G (1994) Convergence analysis of canonical genetic algorithms. IEEE Trans Neural Netw 5(1):96–101
49. Rumelhart DE, McClelland JL, Williams RJ (1986) Learning representations by back-propagating errors. Nature 323:533–536
50. Rumelhart DE, Hinton G, Williams R (1986) Parallel distributed processing. MIT, Cambridge, MA
51. Schaffer JD, Whitley LD, Eshelman LJ (1992) Combinations of genetic algorithms and neural networks: a survey of the state of the art. In: Whitley LD, Schaffer JD (eds) Proceedings of the third international workshop on combinations genetic algorithms neural networks, pp 1–37
52. Burrus CS, Gopinath RA, Guo H (1998) Introduction to wavelets and wavelet transforms: a primer. Prentice Hall, Upper Saddle River, NJ
53. Stanley K, Miikkulainen R (2002) Evolving neural networks through augmenting topologies. Evol Comput 10(2):99–127
54. Weymaere N, Martens J (1994) On the initialization and optimization of multilayer perceptrons. IEEE Trans Neural Netw 5:738–751
55. Whitley D, Hanson T (1989) Optimizing neural networks using faster, more accurate genetic search. In: Schaffer JD (ed) Proceedings of the third international conference on genetic algorithms, pp 391–396
56. Whitley D, Starkweather T, Bogart C (1993) Genetic algorithms and neural networks: optimizing connections and connectivity. Parallel Comput 14:347–361
57. Wolpaw JR, Birbaumer N, Heetderks WJ, McFarland DJ, Peckham PH, Schalk G, Donchin E, Quatrano LA, Robinson JC, Vaughan TM (2000) Brain computer interface technology: a review of the first international meeting. IEEE Trans Rehab Eng 8(2):164–173
58. Yang B, Su XH, Wang YD (2002) BP neural network optimization based on an improved genetic algorithm. In: Proceedings of the IEEE first international conference on machine learning and cybernetics, November 2002, pp 64–68
59. Yao X (1999) Evolving artificial neural networks. In: Proceedings of the IEEE, pp 1423–1447
60. Yao X, Xu Y (2006) Recent advances in evolutionary computation. Comput Sci Technol 21(1):1–18
61. Yao X, Liu Y (1997) A new evolutionary system for evolving artificial neural networks. IEEE Trans Neural Netw 8(3):694–713
62. Yao X, Liu Y (1998) Towards designing artificial neural networks by evolution. Appl Math Comput 91(1):83–90
A Grammatical Genetic Programming Representation for Radial Basis Function Networks Ian Dempsey, Anthony Brabazon, and Michael O’Neill
Summary. We present a hybrid algorithm where evolutionary computation, in the form of grammatical genetic programming, is used to generate Radial Basis Function Networks. An introduction to the underlying algorithms of the hybrid approach is outlined, followed by a description of a grammatical representation for Radial Basis Function networks. The hybrid algorithm is tested on five benchmark classification problem instances, and its performance is found to be encouraging.
I. Dempsey et al.: A Grammatical Genetic Programming Representation for Radial Basis Function Networks, Studies in Computational Intelligence (SCI) 82, 325–335 (2008). © Springer-Verlag Berlin Heidelberg 2008, www.springerlink.com

1 Introduction

General purpose neural network (NN) models such as multi-layer perceptrons (MLPs) and radial basis function networks (RBFNs) have been applied to many real-world problems. Although these models have very general utility, the construction of a quality network can be time consuming. Practical problems faced by the modeller include the selection of model inputs, the selection of model form, and the selection of appropriate parameters for the model such as weights. The use of evolutionary algorithms (EAs) such as the genetic algorithm provides scope to automate one or more of these decisions.

Traditional methods of combining EA and NN methodologies typically entail the encoding of aspects of the NN model using a fixed-length binary or real-valued chromosome. The EA is then applied to a population of chromosomes, where each member of this population encodes a specific NN structure. The population of chromosomes is evolved over time so that better NN structures are uncovered. A drawback of this method is that the use of a fixed-length chromosome places a restriction on the nature of the NN models that can be evolved by the EA.

This study adopts an alternative approach using a novel hybrid algorithm where evolutionary computation, in the form of grammatical genetic programming, is used to generate an RBFN. This approach employs a variable-length chromosome, which implies that the structure of the RBFN is not determined a priori but rather is uncovered by means of an evolutionary process. This study represents the first application of a grammar-based genetic programming algorithm, namely Grammatical Evolution, to generate RBFNs.

In the remainder of this chapter the two components of the hybrid methodology are initially outlined (sections 2 and 3), followed by a description of how they are combined to form the hybrid algorithm (section 4). The results of the application of the hybrid algorithm to five benchmark classification problem instances are provided in section 5. Conclusions and suggestions for future work are detailed in section 6.
2 Grammatical Evolution

Grammatical Evolution (GE) is an evolutionary algorithm that can evolve computer programs in any language [12–16] and it can be considered a form of grammar-based genetic programming. GE has enjoyed particular success in the domain of Financial Modelling [2] amongst numerous other applications including Bioinformatics, Systems Biology, Combinatorial Optimisation and Design [3, 4, 9, 11]. Rather than representing the programs as parse trees, as in GP [1, 5–8], a linear genome representation is used. A genotype-phenotype mapping is employed such that each individual’s variable length binary string contains in its codons (groups of 8 bits) the information to select production rules from a Backus Naur Form (BNF) grammar. The grammar allows the generation of programs (or in this study, RBFN forms) in an arbitrary language that are guaranteed to be syntactically correct. As such, it is used as a generative grammar, as opposed to the classical use of grammars in compilers to check syntactic correctness of sentences. The user can tailor the grammar to produce solutions that are purely syntactically constrained, or they may incorporate domain knowledge by biasing the grammar to produce very specific forms of sentences.

BNF is a notation that represents a language in the form of production rules. It comprises a set of non-terminals that can be mapped to elements of the set of terminals (the primitive symbols that can be used to construct the output program or sentence(s)), according to the production rules. A simple example of a BNF grammar is given below, where <expr> is the start symbol from which all programs are generated. These productions state that <expr> can be replaced with either one of <expr><op><expr> or <var>. An <op> can become either +, -, or *, and a <var> can become either x or y.

    <expr> ::= <expr><op><expr>   (0)
             | <var>              (1)
    <op>   ::= +                  (0)
             | -                  (1)
             | *                  (2)
    <var>  ::= x                  (0)
             | y                  (1)
[Figure: a genome drawn as a row of integer codon values, beginning 220, 240, 220, 203]
Fig. 1. An example GE individual’s genome represented as integers for ease of reading
The grammar is used in a developmental process to construct a program by applying production rules, selected sequentially using the genome, beginning from the start symbol of the grammar. In order to select a production rule in GE, the next codon value on the genome is read, interpreted, and placed in the following formula:

    Rule = c % r

where c is the codon value, r is the number of choices for the current non-terminal, and % represents the modulus operator. Fig. 1 provides an example of an individual genome (where each 8-bit codon is represented as an integer for ease of reading). The first codon integer value is 220, and given that we have 2 rules to select from for <expr> in the above grammar, we get 220 % 2 = 0. The expression <expr> will therefore be replaced with <expr><op><expr>. Beginning from the left hand side of the genome, codon integer values are generated and used to select appropriate rules for the left-most non-terminal in the developing program from the BNF grammar, until one of the following situations arises:

(a) A complete program is generated. This occurs when all the non-terminals in the expression being mapped are transformed into elements from the terminal set of the BNF grammar.
(b) The end of the genome is reached before the complete program is generated, in which case the wrapping operator is invoked. This results in the return of the genome reading frame to the left hand side of the genome once again. The reading of codons will then continue unless an upper threshold representing the maximum number of wrapping events has occurred during this individual’s mapping process.
(c) In the event that a threshold on the number of wrapping events has occurred and the individual is still incompletely mapped, the mapping process is halted, and the individual is assigned the lowest possible fitness value.
Returning to the example individual, the left-most <expr> in <expr><op><expr> is mapped by reading the next codon integer value, 240, which is used in 240 % 2 = 0 to become another <expr><op><expr>. The developing program now looks like <expr><op><expr><op><expr>. Continuing to read subsequent codons and always mapping the left-most non-terminal, the individual finally generates the expression y*x-x-x+x, leaving a number of unused codons at the end of the individual, which are deemed to be introns and simply ignored. A full description of GE can be found in O’Neill & Ryan (2003) [12]. Some more recent developments are covered in Brabazon & O’Neill (2005) [2].
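The mapping process described in this section can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `ge_map`, the `max_wraps` default and the genome below are assumptions made for this sketch. The genome was chosen so that it yields the expression quoted in the text; it is not a transcription of Fig. 1.

```python
# Example grammar from the text: each non-terminal maps to its list of productions.
GRAMMAR = {
    "<expr>": [["<expr>", "<op>", "<expr>"], ["<var>"]],
    "<op>": [["+"], ["-"], ["*"]],
    "<var>": [["x"], ["y"]],
}

# Hypothetical genome (integer codons), chosen for illustration only.
GENOME = [220, 240, 220, 203, 101, 53, 202, 203, 102, 241,
          133, 30, 55, 221, 202, 204, 39, 140, 74, 102]


def ge_map(genome, grammar=GRAMMAR, start="<expr>", max_wraps=2):
    """Map integer codons to a sentence using Rule = c % r.

    Returns (phenotype, codons_read); phenotype is None when the wrapping
    threshold is exceeded before the mapping completes.
    """
    if not genome:
        return None, 0
    symbols = [start]   # developing program, left-most symbol first
    output = []         # terminals emitted so far
    idx = wraps = reads = 0
    while symbols:
        sym = symbols.pop(0)
        if sym not in grammar:          # terminal: emit it
            output.append(sym)
            continue
        if idx == len(genome):          # end of genome: wrap or give up
            if wraps == max_wraps:
                return None, reads
            wraps += 1
            idx = 0
        choices = grammar[sym]
        rule = genome[idx] % len(choices)   # Rule = c % r
        idx += 1
        reads += 1
        symbols = list(choices[rule]) + symbols
    return "".join(output), reads


if __name__ == "__main__":
    phenotype, reads = ge_map(GENOME)
    print(phenotype, "using", reads, "of", len(GENOME), "codons")
```

The two trailing codons of this genome are never read, which corresponds to the introns mentioned above. A genome such as `[0]` always selects the recursive production and therefore exhausts the wrapping threshold, returning `None`.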
3 Radial Basis Function Networks

A radial basis function network (RBFN) generally consists of a three-layer feedforward network. Like an MLP, a RBFN can be used for prediction and classification purposes, but RBFNs differ from MLPs in that the activation functions of the hidden layer nodes are radial basis functions. The training of RBFNs typically consists of a combination of unsupervised and supervised learning. Initially, a number of hidden layer nodes (or centres) must be positioned in the input data space. This can be performed by following a simple rule, or in a more sophisticated application by using unsupervised learning. Methods for choosing the locations of centres include distributing the centres in a regular grid over the input space, selecting a random subset of the training data vectors to serve as centres, or using an algorithm to cluster the input data (e.g. SOMs can be used for this) and then selecting a centre location to represent each cluster. Each of these centres forms a hidden node in the RBFN’s structure. Input data vectors are typically standardised before training. When each input vector is presented to the network a value is calculated at each centre using a radial basis function. This value represents the quality of the match between the input vector and the location of that centre in the input space. Each hidden node, therefore, can be considered as a local detector in the input data space. The most commonly used radial basis function is a Gaussian function. This produces an output value of one if the input and weight vectors are identical, falling towards zero as the distance between the two vectors gets large. A range of alternative radial basis functions exists, including the inverse multiquadric function and the spline function.

[Figure: inputs I1 to I3 feed the hidden nodes H1 to H5; the hidden nodes and the bias node H0 connect to a single output node through the weights w0 to w5]
Fig. 2. A radial basis function network. The output from each hidden node (H0 is a bias node, with a fixed input value of 1) is obtained by measuring the distance between each input pattern and the location of the hidden node, and applying the radial basis function to that distance. The final output from the network is obtained by taking the weighted sum (using w0, w1 and w5) of the outputs from the hidden layer and from H0

The second phase of the model construction process is the determination of the value of the weights on the connections between the hidden layer and the output layer. In training these weights, the output value for each input vector will be known, as will the activation values for that input vector at each hidden layer node, so a supervised learning method can be used. The simplest transfer function for the node(s) in the output layer is a linear function where the network’s output is a linearly weighted sum of the outputs from the hidden nodes. In this case, the weights on the arcs to the output node(s) can be found using linear regression, with the weight values being the regression coefficients. Sometimes it may be preferred to implement a non-linear transfer function at the output node(s). For example, when the RBFN is acting as a binary classifier it would be useful to use a sigmoid transfer function to limit outputs to the range 0 to 1. In this case, the weights between the hidden and output layer could be determined using the backpropagation algorithm. Once the RBFN has been constructed using a training set of input-output data vectors it can then be used to classify or to predict outputs for new input data vectors, for which an output value is not known. The new input data vector is presented to the network, and an activation value is calculated for each hidden node.
Assuming that a linear transfer function is used in the output node(s), the final output produced by the network is the weighted sum of the activation values from the hidden layer, where these weights are the coefficient values obtained in the linear regression step during training. The basic algorithm for the canonical RBFN is as follows:

i. Select the initial number of centres (m).
ii. Select the initial location of each of the centres in the data space.
iii. For each input data vector/centre pairing calculate the activation value φ(||x − y||), where φ is a radial basis function and ||...|| is a distance measure between input vector x and a centre y in the data space. As an example, let $d = \|x - y\|$. The value of a Gaussian RBF is then given by $\phi(d) = \exp\left(-\frac{d^{2}}{2\sigma^{2}}\right)$, where σ is a modeller-selected parameter which determines the size of the region of input space a given centre will respond to.
iv. Once all the activation values for each input vector have been obtained, calculate the weights for the connections between the hidden and output layers using linear regression.
v. Go to step (iii) and repeat until a stopping condition is reached.
vi. Improve the fit of the RBFN to the training data by adjusting some or all of the following: the number of centres, their location, or the width of the radial basis functions.

As the number of centres increases, the predictive ability of the RBFN on the training data will tend to increase, possibly leading to overfit and poor out-of-sample generalisation. Hence, the object is to choose a sufficient number of hidden layer nodes to capture the essential features in the training data, without overfitting it.
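Steps (iii) and (iv) of the algorithm above can be illustrated with a small pure-Python sketch. It assumes a Gaussian RBF with a single shared σ, centres fixed by hand, and a linear output node whose weights are found by solving the least-squares normal equations. The function names, the tiny XOR-style data set and σ = 0.5 are assumptions for illustration, not part of the original study.

```python
import math


def gaussian_rbf(x, centre, sigma):
    """Activation exp(-d^2 / (2 sigma^2)) with d the Euclidean distance."""
    d2 = sum((xi - ci) ** 2 for xi, ci in zip(x, centre))
    return math.exp(-d2 / (2.0 * sigma ** 2))


def hidden_activations(x, centres, sigma):
    """Hidden layer outputs; the leading 1.0 plays the role of bias node H0."""
    return [1.0] + [gaussian_rbf(x, c, sigma) for c in centres]


def solve(a, b):
    """Solve the square system a w = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][k] * w[k] for k in range(r + 1, n))
        w[r] = (m[r][n] - s) / m[r][r]
    return w


def fit_output_weights(inputs, targets, centres, sigma):
    """Linear regression of targets on hidden activations (normal equations)."""
    phi = [hidden_activations(x, centres, sigma) for x in inputs]
    k = len(phi[0])
    a = [[sum(row[i] * row[j] for row in phi) for j in range(k)] for i in range(k)]
    b = [sum(row[i] * t for row, t in zip(phi, targets)) for i in range(k)]
    return solve(a, b)


def predict(x, centres, sigma, weights):
    """Weighted sum of the hidden activations (linear output node)."""
    return sum(w * h for w, h in zip(weights, hidden_activations(x, centres, sigma)))


if __name__ == "__main__":
    inputs = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
    targets = [0.0, 1.0, 1.0, 0.0]
    centres = [(0.0, 0.0), (1.0, 1.0)]   # hand-picked for illustration
    weights = fit_output_weights(inputs, targets, centres, sigma=0.5)
    for x in inputs:
        print(x, round(predict(x, centres, 0.5, weights), 4))
```

The normal equations are adequate for a small, well-conditioned example like this one; with many overlapping centres a more robust regression routine would be the safer choice.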
4 GE-RBFN Hybrid

Despite the apparent dissimilarities between GE and RBFN methodologies, the methods can complement each other. A practical problem in utilising RBFNs is the selection of model inputs and model form. By defining an appropriate grammar, GE is capable of automatically generating a range of RBFN forms. Hence, a combined GE-RBFN hybrid can be considered as embedding both hypothesis generation and hypothesis optimisation components. The basic operation of the GE-RBFN methodology is as follows. Initially, a population of binary strings is randomly created. In turn, each of these is mapped to a RBFN structure using a grammar which has been constructed specifically for the task of generating RBFNs (see next subsection). The quality of each resulting RBFN is then assessed using the training data. Based on this information, the binary strings resulting in higher quality networks are preferentially selected for survival and reproduction. Over successive iterations, the quality of the networks encoded in the population of binary strings improves.

4.1 Grammar

There are multiple grammars that could be defined in order to generate RBFNs depending on exactly what the modeller wishes to evolve. For example, if little was known about which inputs would be useful for the RBFN, the grammar could be written so that GE selected which inputs to use, in addition to selecting the form of the RBFN itself (the number of hidden layer nodes, their associated weight vectors, the form of their associated radial basis functions and so on). In this study we define a grammar which permits GE to construct RBFNs with differing numbers of centres. GE is also used to decide where to locate those centres in the input space. The Backus Naur Form grammar for this is as follows.

    <RBFN>   ::= 1 / (1 + exp(- <HL>))
    <HL>     ::= <weight> * <HN>
               | <weight> * <HN> + <HL>
    <HN>     ::= <RBF>
    <centre> ::= <real>, <real>   (one item for each input dimension)
    <radius> ::= <real>
    <weight> ::= <real>
    <real>   ::= your constant generation method of choice

where the non-terminal

$$\texttt{<RBF>} ::= \exp\left( -\frac{\sum_{i=1}^{V} \left(\mathrm{input}[I][i] - \mathrm{center}[HN][i]\right)^{2}}{2 \ast \left(\texttt{<radius>}^{2}\right)} \right).$$
Under the above grammar, the generation of a RBFN starts from the root <RBFN>. This can only be mapped to one choice, hence it gives rise to the expression 1/(1 + exp(−<HL>)). Next, the non-terminal <HL> in this expression is mapped into either <weight> * <HN> or <weight> * <HN> + <HL>, depending on the value of the next codon on the binary genome. Suppose the next codon on the genome gives rise to an integer value of 34. Taking 34 mod 2 (the number of choices available for <HL>) gives 0, hence <HL> becomes the first choice, <weight> * <HN>. At this point, the RBFN consists of a network with a single hidden layer node. In subsequent derivation steps, the real numbers corresponding to the location of this centre, and the real number corresponding to the radius of the centre are derived, eventually giving rise to a complete RBFN form.

4.2 Example Individuals

Fig. 3 provides a graphical illustration of the possible derivation trees which the grammar could create. Tree A illustrates the basic form that all the RBFNs will take. The non-terminal <HL> is then expanded and can result in a RBFN which has one or more hidden layer nodes. The RBFN generation process iterates until all the non-terminals are mapped to terminals.
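The derivation just described can be sketched as a small decoder. The grammar leaves the constant generation method open, so this sketch assumes a hypothetical one: each real number is obtained by scaling a single 8-bit codon into a fixed range. The function names, the ranges for weights, centres and radii, and the wrapping limit are all assumptions for illustration.

```python
import math


def codon_to_real(codon, low=-1.0, high=1.0):
    """Hypothetical constant generation: scale an 8-bit codon into [low, high]."""
    return low + (high - low) * codon / 255.0


def decode_rbfn(genome, n_inputs, max_wraps=2):
    """Derive hidden nodes (weight, centre, radius) from integer codons.

    Mirrors <HL> ::= <weight> * <HN> (choice 0) | <weight> * <HN> + <HL> (choice 1).
    Raises ValueError if the wrapping threshold is exceeded.
    """
    reads = 0
    limit = len(genome) * (max_wraps + 1)

    def next_codon():
        nonlocal reads
        if reads >= limit:
            raise ValueError("mapping incomplete after maximum wrapping")
        codon = genome[reads % len(genome)]   # modulo index implements wrapping
        reads += 1
        return codon

    nodes = []
    more = True
    while more:
        more = next_codon() % 2 == 1          # choice 1 appends a further <HL>
        weight = codon_to_real(next_codon(), -5.0, 5.0)
        centre = [codon_to_real(next_codon()) for _ in range(n_inputs)]
        radius = codon_to_real(next_codon(), 0.1, 2.0)
        nodes.append((weight, centre, radius))
    return nodes


def rbfn_output(nodes, x):
    """Evaluate 1 / (1 + exp(-<HL>)) for the input vector x."""
    hl = 0.0
    for weight, centre, radius in nodes:
        d2 = sum((xi - ci) ** 2 for xi, ci in zip(x, centre))
        hl += weight * math.exp(-d2 / (2.0 * radius ** 2))
    return 1.0 / (1.0 + math.exp(-hl))


if __name__ == "__main__":
    # First codon 34 gives 34 % 2 = 0, so a single hidden node is derived.
    nodes = decode_rbfn([34, 255, 0, 255, 128], n_inputs=2)
    print(len(nodes), rbfn_output(nodes, [-1.0, 1.0]))
```

In a full GE system the decoded network would be scored on the training data, and a genome that raises `ValueError` would be assigned the lowest possible fitness.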
[Figure: three derivation trees. (A) the root expression 1 / (1 + exp(−<HL>)); (B) the sub-tree <weight> * <HN>; (C) the sub-tree <weight> * <HN> + <HL>]
Fig. 3. An output radial basis function network in the form of a derivation tree. Tree (A) represents the common structure of all RBFNs generated by the example grammar. Trees (B) and (C) represent the two possible sub-trees that can replace the non-terminal <HL> in (A). (B) represents the case where a <HL> becomes a single node, and (C) represents the case where <HL> becomes at least two nodes

5 Experimental Setup & Results

Five benchmark classification problem instances from the UCI Machine Learning Repository [17] are tackled. Summary statistics on each problem instance are provided in Table 1. Each dataset was recut between training and test data ten times, with 80% of the dataset being used for training and 20% for out of sample testing in each case. In assessing the quality of the developed RBFNs, the number of correct classifications produced was used.

The Wisconsin problem is a data set of malignant and benign breast cancer cases. Pima includes data on Pima Indians Diabetes from the National Institute of Diabetes and Digestive and Kidney Diseases. The Thyroid data set is made up of thyroid patient records classified into disjoint disease classes. The Australian data set is made up of cases of credit card applications from the Credit Screening Database, and the Bupa data set is from Bupa Medical Research Ltd. and has data on liver disorders.

Table 1. Problem instance statistics and the training and test set partition sizes in each case

    Dataset      Training   Test   #variables   #classes
    Wisconsin    559        140    9            2
    Pima         614        154    8            2
    Thyroid      172        43     5            3
    Australian   552        138    6            2
    Bupa         276        69     6            2

The results obtained by the hybrid system over the ten recuts are reported in Table 2. Overall the results are encouraging. Comparing them against previously published results from [18] on four of the same datasets (see Table 3), it can be seen that the evolved RBFNs outperform on two of the datasets, and underperform on the other two. In assessing the results it should be noted that there is considerable room to fine-tune the parameters of the GE-RBFN hybrid, and this provides scope to further improve the results.

Table 2. Results for GE/RBFN including average fitnesses for both in and out of sample data sets along with standard deviation for the out of sample data

    Dataset      Mean best in sample   Mean best out of sample   Std. dev.
    Australian   70.52                 71.53                     4.059
    Bupa         60.26                 57.11                     4.504
    Thyroid      62.40                 75.78                     4.559
    Wisconsin    88.92                 95.20                     2.643
    Pima         68.82                 67.53                     3.647

Table 3. Comparative out of sample results

    Dataset      Mean best out of sample   Std. dev.
    Bupa         65.97                     11.27
    Thyroid      96.27                     4.17
    Wisconsin    95.63                     1.58
    Pima         73.50                     4.23

In this proof of concept study, typical off-the-shelf parameter settings were adopted for GE. These settings included a population size of 500 individuals, 100 generations of training, and a generational rank replacement strategy wherein the weakest performing 25% of the population are replaced with newly generated individuals on each generation. For each dataset, a total of 30 runs were conducted with a crossover rate of 0.9 and a mutation rate of 0.1 as in [19]. All reported results are averaged over the 30 runs.
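The parameter settings listed above can be read as the following generational loop. It is a sketch under stated assumptions rather than the authors' code: genomes are lists of integer codons, new individuals are bred from the survivors with a variable-length one-point crossover, mutation is applied per codon, and the `fitness` callable stands in for the classification score of the decoded RBFN. All function and parameter names are hypothetical.

```python
import random


def one_point_crossover(a, b, rng):
    """Variable-length one-point crossover: head of a joined to a tail of b."""
    cut_a = rng.randint(1, len(a))
    cut_b = rng.randint(0, len(b) - 1)
    return a[:cut_a] + b[cut_b:]


def evolve(fitness, pop_size=500, generations=100, replace_frac=0.25,
           p_cross=0.9, p_mut=0.1, rng=None):
    """Rank replacement: the weakest replace_frac of the population is replaced."""
    rng = rng or random.Random(0)
    pop = [[rng.randrange(256) for _ in range(rng.randint(10, 30))]
           for _ in range(pop_size)]
    history = []                                  # best fitness per generation
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)       # rank, best first
        history.append(fitness(pop[0]))
        n_new = int(replace_frac * pop_size)
        survivors = pop[:pop_size - n_new]
        children = []
        while len(children) < n_new:
            a, b = rng.sample(survivors, 2)
            child = one_point_crossover(a, b, rng) if rng.random() < p_cross else list(a)
            # Assumed per-codon mutation: redraw the codon with probability p_mut.
            child = [rng.randrange(256) if rng.random() < p_mut else c for c in child]
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness), history


def toy_fitness(genome):
    """Stand-in objective: prefer genomes whose mean codon value is near 100."""
    return -abs(sum(genome) / len(genome) - 100.0)


if __name__ == "__main__":
    best, history = evolve(toy_fitness, pop_size=40, generations=20)
    print(history[0], history[-1])
```

Because the best-ranked individuals always survive, the best fitness recorded in `history` can never decrease from one generation to the next.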
6 Conclusions & Future Work

This study presents a novel approach, based on a form of grammatical genetic programming (grammatical evolution), for the automatic generation of RBFNs. A particular feature of this methodology is that the structure of the resulting RBFN is not defined a priori, but is evolved during the construction process. The developed GE-RBFN hybrid was applied to five benchmark instances from the UCI Machine Learning Repository with encouraging results.

Substantial scope exists to further develop the GE-RBFN methodology outlined in this chapter. In this initial study we did not include the selection of inputs, or the selection of the form of the RBFs, in the evolutionary process. However, the RBFN grammar could be easily adapted in order to incorporate these steps if required. The use of the GE methodology also opens up a variety of other research avenues. The methodology applied in this study is based on the canonical form of the GE algorithm. As already noted, a substantial literature exists on GE covering such issues as the use of alternative search engines for the algorithm, and the use of alternatives to the strict left-to-right mapping of the genome (piGE). Future work could usefully examine the utility of these GE variants for the purposes of evolving RBFNs.
References

1. Banzhaf W, Nordin P, Keller RE, Francone FD (1998) Genetic programming – an introduction. On the automatic evolution of computer programs and its applications. Morgan Kaufmann, Los Altos, CA
2. Brabazon A, O’Neill M (2005) Biologically inspired algorithms for financial modelling. Springer, Berlin Heidelberg New York
3. Cleary R, O’Neill M (2005) An attribute grammar decoder for the 01 multiConstrained knapsack problem. In: LNCS 3448 Proceedings of evolutionary computation in combinatorial optimization EvoCOP 2005, Lausanne, Switzerland. Springer, Berlin Heidelberg New York, pp 34–45
4. Hemberg M, O’Reilly U-M (2002) GENR8 – using grammatical evolution in a surface design tool. In: Proceedings of the first grammatical evolution workshop GEWS2002, New York City, New York, US. ISGEC, pp 120–123
5. Koza JR (1992) Genetic programming. MIT, Cambridge, MA
6. Koza JR (1994) Genetic programming II: automatic discovery of reusable programs. MIT, Cambridge, MA
7. Koza JR, Andre D, Bennett III FH, Keane M (1999) Genetic programming 3: Darwinian invention and problem solving. Morgan Kaufmann, Los Altos, CA
8. Koza JR, Keane M, Streeter MJ, Mydlowec W, Yu J, Lanza G (2003) Genetic programming IV: routine human-competitive machine intelligence. Kluwer Academic, Dordrecht
9. Moore JH, Hahn LW (2004) Systems biology modeling in human genetics using petri nets and grammatical evolution. In: LNCS 3102 Proceedings of the genetic and evolutionary computation conference GECCO 2004, Seattle, WA, USA. Springer, Berlin Heidelberg New York, pp 392–401
10. O’Neill M, Brabazon A (2004) Grammatical swarm. In: LNCS 3102 Proceedings of the genetic and evolutionary computation conference GECCO 2004, Seattle, WA, USA. Springer, Berlin Heidelberg New York, pp 163–174
11. O’Neill M, Adley C, Brabazon A (2005) A grammatical evolution approach to eukaryotic promoter recognition. In: Proceedings of Bioinformatics INFORM 2005, Dublin City University, Dublin, Ireland
12. O’Neill M, Ryan C (2003) Grammatical evolution: evolutionary automatic programming in an arbitrary language. Kluwer Academic, Dordrecht
13. O’Neill M (2001) Automatic programming in an arbitrary language: evolving programs in grammatical evolution. PhD thesis, University of Limerick
14. O’Neill M, Ryan C (2001) Grammatical evolution. IEEE Trans Evol Comput 5(4):349–358
15. O’Neill M, Ryan C, Keijzer M, Cattolico M (2003) Crossover in grammatical evolution. Genetic programming and evolvable machines, vol 4, no 1. Kluwer Academic, Dordrecht
16. Ryan C, Collins JJ, O’Neill M (1998) Grammatical evolution: evolving programs for an arbitrary language. Proceedings of the first European workshop on GP. Springer, Berlin Heidelberg New York, pp 83–95
17. Hettich S, Blake CL, Merz CJ (1998) UCI repository of machine learning databases [http://www.ics.uci.edu/mlearn/MLRepository.html]. University of California, Department of Information and Computer Science, Irvine, CA
18. Smith M, Bull L (2005) Genetic programming with a genetic algorithm for feature construction and selection. Genet Program Evolvable Machines 6(3):265–281
19. Dempsey I, O’Neill M, Brabazon A (2002) Investigations into market index trading models using evolutionary automatic programming. In: O’Neill M, Sutcliffe R, Ryan C, Eaton M, Griffith N (eds) LNAI 2464, Proceedings of the 13th Irish conference in artificial intelligence and cognitive science. Springer, Berlin Heidelberg New York, pp 165–170
A Neural-Genetic Technique for Coastal Engineering: Determining Wave-induced Seabed Liquefaction Depth Daeho Cha, Michael Blumenstein, Hong Zhang, and Dong-Sheng Jeng
Summary. In the past decade, computational intelligence (CI) techniques have been widely adopted in various fields such as business, science and engineering, as well as information technology. Specifically, hybrid techniques using artificial neural networks (ANNs) and genetic algorithms (GAs) are becoming an important alternative for solving problems in the field of engineering in comparison to traditional solutions, which ordinarily use complicated mathematical theories. The wave-induced seabed liquefaction problem is one of the most critical issues for analysing and designing marine structures such as caissons, oil platforms and harbours. In the past, various investigations into wave-induced seabed liquefaction have been carried out including numerical models, analytical solutions and some laboratory experiments. However, most previous numerical studies are based on solving complicated partial differential equations. In this study, the proposed neural-genetic model is applied to wave-induced liquefaction, which provides a better prediction of liquefaction potential. The neural-genetic simulation results illustrate the applicability of the hybrid technique for the accurate prediction of wave-induced liquefaction depth, which can also provide coastal engineers with alternative tools to analyse the stability of marine sediments.
D. Cha et al.: A Neural-Genetic Technique for Coastal Engineering: Determining Wave-induced Seabed Liquefaction Depth, Studies in Computational Intelligence (SCI) 82, 337–351 (2008). © Springer-Verlag Berlin Heidelberg 2008, www.springerlink.com

1 Introduction

1.1 Artificial Neural Networks in Engineering

Artificial Neural Networks (ANNs) are amongst the most successful empirical processing technologies to be used in engineering applications. ANNs serve an important function for engineering purposes such as modelling and predicting the evolution of dynamic systems. Hagan et al. [1] note that the pioneering work in neural networks commenced in 1943 when McCulloch and Pitts [2] postulated a simple mathematical model to explain how biological neurons work. This is generally considered to be the first significant publication on the theory of artificial neural networks. ANN models have been widely applied to various engineering problems, for example the prediction of water quality parameters [3], generation of wave equations based on hydraulic data [4], soil dynamic amplification analysis [5], tide forecasting [6], prediction of settlement of shallow foundations [7], earthquake-induced liquefaction [8], and ground settlement by deep excavation [9]. Unlike conventional approaches based on engineering mechanics, the main requirement for accurate prediction using ANN models is an appropriate database. With sufficient quality training data, ANNs can provide accurate predictions for various engineering problems.

1.2 Genetic Algorithms

Generally speaking, Genetic Algorithms (GAs) are one of the various Computational Intelligence (CI) technologies, which also include ANNs and Fuzzy Logic. Fundamental theories of GAs were established by Holland [10] in the early 1970s; Holland was amongst the first to put computational evolution on a firm theoretical footing. The GA’s main role is numerical optimisation inspired by natural evolution. GAs can be applied to an extremely wide range of problems. The basic component of GAs is strings of binary values (sometimes real values) called chromosomes. A GA operates on a population of individuals (chromosomes), each representing a possible solution to a given problem. Each chromosome is assigned a fitness value based on a fitness function, and the operation of the GA is based on crossover, selection and mutation.

1.3 ANN models trained by GAs (Evolutionary Algorithms)

Generally, it is time-consuming to configure and adjust the settings of ANN models during the supervised training procedure, e.g. those using the Backpropagation (BP) algorithm. Even though its results may be acceptable for some engineering applications, ANN training algorithms such as BP may suffer from problems inherent in gradient descent-based techniques, such as being trapped in local minima and being incapable of finding a global minimum if the error function is multi-modal and/or non-differentiable [11–13].
An ANN model trained using GAs can deal with large, complicated search spaces, which are on occasion non-differentiable and multi-modal, as is common in real world problems. Hence, the use of a hybrid technique combining GAs in conjunction with ANNs is advantageous in complex engineering problems. The configuration of an ANN model using GAs is shown in Fig. 1(b). ANN models predict data based on the relationship between input and output values. This relationship is based on the iterative modification of network weights via the training procedure. Clearly, the training procedure is a key factor of an ANN model. However, aside from the inherent problems associated with gradient descent-based techniques (as described above), the training procedure can be time consuming in terms of selecting the most appropriate settings for ANN training; therefore it is beneficial to employ GAs to optimise an ANN model’s weights in a hybrid technique for an application such as wave-induced seabed liquefaction. In this study, we executed just one epoch of the ANN model and saved the weights to form the initial chromosome of the GA model. Subsequently, the GA operations optimised the chromosome using crossover and mutation. Crossover was varied from one-point up to ten-point crossover, depending on chromosome size, and mutation was varied from 2% up to 30%. Based on the results of the GA output, the ANN weights were optimised.

Fig. 1. Comparison of a traditional ANN model (a) and an ANN model trained by GAs (b) (MSE: Mean Squared Error)

1.4 Wave-induced seabed liquefaction

In the last few decades, various investigations of wave-induced seabed liquefaction have been carried out. Although the protection of marine structures has been extensively studied in recent years, understanding of their interaction with waves and the seabed is far from complete. Damage to marine structures still occurs from time to time, with two general failure modes evident. The first mode is that of structural failure, caused by wave forces acting on and damaging the structure itself. The second mode is that of foundation failure, caused by liquefaction or erosion of the seabed in the vicinity of the structure, resulting in collapse of the structure as a whole. Numerous research studies have been carried out on this topic in the last decade [14,15]. Fig. 2 illustrates the change of a seabed due to wave action (Fig. 2(a): without liquefaction, Fig. 2(b): with liquefaction).
D. Cha et al.
Fig. 2. Phenomenon of wave-induced seabed liquefaction. (a) Stable structure without liquefaction. (b) Failed structure due to liquefaction
Bjerrum [16] was possibly the first author who considered wave-induced liquefaction occurring in saturated seabed sediments. Later, Nataraja et al. [17] suggested a simplified procedure for ocean wave-induced liquefaction analysis. More recently, Rahman [18] established the relationship between liquefaction and the characteristics of wave and soil. He concluded that liquefaction potential increases with the degree of saturation and with an increase of wave period. Jeng [19] examined the wave-induced liquefied state for several different cases, together with Zen and Yamazaki’s [20] field data. He found that no liquefaction occurs in a saturated seabed, except in very shallow water, for large waves and a seabed with very low permeability. For more advanced poro-elastic models of the wave-induced liquefaction potential, readers can refer to Sassa and Sekiguchi [21] and Sassa et al. [22]. All the aforementioned investigations have been reviewed by Jeng [23]. However, most previous investigations of the wave-induced liquefaction potential in a porous seabed have been based on various assumptions of engineering mechanics, which limit the application of the models to realistic engineering problems.

CI models for the estimation of wave-induced liquefaction apply different techniques to investigate coastal engineering problems, as compared to the traditional engineering approach. Traditional engineering methods for wave-induced liquefaction prediction always use deterministic models, which involve
complicated partial differential equations that depend on physical variables such as shear modulus, degree of saturation and Poisson’s ratio. CI models, in contrast, employ statistical theory: they are data-driven and can be built from the knowledge contained in a quality database. The data should comprise all characteristic phenomena. Therefore, physical understanding of the application problem is essential in CI model development. In this study, we adopt a neural-genetic model for the wave-induced maximum seabed liquefaction depth, based on a pre-built poro-elastic model [24].
2 A neural-genetic technique for wave-induced liquefaction

Besides the use of standard ANNs for wave-induced liquefaction depth prediction, we will discuss a neural-genetic approach for this problem. Generally, GAs have been used in two areas of ANN modelling: optimising the weights of network connections (training) [32] and optimising the neural network structure design [25, 26]. In this study we adopted the option of optimising network weights, using a single-hidden-layer feedforward ANN with a fixed network structure. A general multi-layer neural network is presented in Fig. 3. As shown in Fig. 3, a multi-layer network is an expansion of the single-layer network, and it can be used to solve more difficult and complicated problems. It consists of an input layer, one or more hidden layers of neurons and an
Fig. 3. A typical multi-layer feedforward network architecture
output layer of neurons. In the present study, GAs are utilised to optimise the weights of the network and adjust the interconnections to minimise its output error. The approach can be applied to a network which has at least one hidden layer and is fully connected to all units in each layer. The goal of this procedure is to obtain a desired output when certain inputs are given. The general network error $E$ is shown in (1), where $D_x$ and $O_x$ are the desired and actual output values, respectively.

\[ E(x) = \frac{1}{n}\sum_{x=1}^{n}(D_x - O_x)^2 \tag{1} \]

Since the error is the difference between the actual output and the target output, the error depends on the weights. We therefore employed this error function in the GA fitness function (2) for optimising the weights, instead of a standard, iterative, gradient descent-based training method:

\[ f(x) = E_{\max} - E(x) \tag{2} \]

where $f(x)$ is the corresponding fitness and $E_{\max}$ is the maximum performance error. For the GA selection function, we used the roulette wheel method developed by Holland [10]. It is defined as

\[ P_i = \frac{F_i}{\sum_{i=1}^{S} F_i} \tag{3} \]

where $P_i$ is the selection probability of each individual chromosome, $S$ is the population size, and $F_i$ is the fitness value of each chromosome. In the use of GAs, two basic operators are employed and modified in the search mechanism: crossover and mutation. For example, the basic two-point crossover operator takes two individuals (parent chromosomes) and produces two new individuals, whilst mutation alters one individual to produce a single new solution. Further discussion and details pertaining to the crossover and mutation settings in this research are presented in the results section.

2.1 Data preparation

To ensure the accurate prediction of an ANN model trained using GAs, we needed to build a reliable database of training and test sets. In this section, we describe the establishment of the database using an existing poro-elastic model developed by the author [27].

Poro-elastic model

In this model, we consider an ocean wave propagating over a porous seabed of infinite thickness. A two-dimensional wave-seabed interaction problem is considered, treating the porous seabed as hydraulically isotropic with a uniform
permeability. Biot [28] presented a general set of equations governing the behaviour of a linear elastic porous solid under dynamic conditions. They are summarized in tensor form below:

\[ \sigma_{ij,j} = \rho\,\ddot{u}_i + \rho_f\,\ddot{w}_i \tag{4} \]

\[ -p_{,i} = \rho_f\,\ddot{u}_i + \frac{\rho_f}{n}\,\ddot{w}_i + \frac{\rho_f g}{k_z}\,\dot{w}_i \tag{5} \]

\[ \dot{\varepsilon}_{ii} + \dot{w}_{i,i} = -\frac{n}{K_f}\,\dot{p} \tag{6} \]

where $p$ is the pore pressure, $n$ is the porosity, $\rho$ is the combined density, $\rho_f$ is the fluid density, $u$ and $w$ are the displacement of the solid and the relative displacement of the pore fluid with respect to the solid, and $\sigma_{ij}$ is the total stress. $1/K_f$ is the compressibility of the pore fluid, which is defined by

\[ \frac{1}{K_f} = \frac{1}{2 \times 10^{9}} + \frac{1 - S}{P_{wo}} \tag{7} \]

where $S$ is the degree of saturation and $P_{wo}$ is the absolute water pressure.

The effective stresses $\sigma'_{ij}$, which are assumed to control the deformation of the soil skeleton, are defined in terms of the total stress ($\sigma_{ij}$) and the pore pressure ($p$) as

\[ \sigma_{ij} = \sigma'_{ij} - \delta_{ij}\,p \tag{8} \]

where $\delta_{ij}$ is the Kronecker delta. Therefore, equation (6) becomes

\[ \left(-\frac{n}{K_f}\,\dot{p}\right)_{,i} = \left(\dot{\varepsilon}_{ii} + \dot{w}_{i,i}\right)_{,i} \;\Rightarrow\; -p_{,i} = \frac{K_f}{n}\left(\varepsilon_{ii} + w_{i,i}\right)_{,i} \tag{9} \]

Then, substituting (9) into (5), the governing equation can be rewritten as

\[ \frac{K_f}{n}\left(\varepsilon_{ii} + w_{i,i}\right)_{,i} = \rho_f\,\ddot{u}_i + \frac{\rho_f}{n}\,\ddot{w}_i + \frac{\rho_f g}{k_z}\,\dot{w}_i \tag{10} \]

If the acceleration terms are neglected in the above equation, it becomes the consolidation equation, which has been used in previous work [19]. Based on the wave-induced soil and fluid displacements, we can obtain the wave-induced pore pressure, effective stresses and shear stresses. Detailed information on the above solution can be found in [27].

Estimation of liquefaction

It has generally been accepted that when the vertical effective stress vanishes, the soil will be liquefied. The soil matrix then loses its strength to carry any load, which consequently causes seabed instability. Based on the concept of excess pore pressure, Zen and Yamazaki [20] proposed a criterion of
liquefaction, which has been further extended by considering the effects of lateral loading [19]:

\[ -\frac{1}{3}(1 + K_0)(\gamma_s - \gamma_w)\,z + (P_b - p) \le 0 \tag{11} \]

where $K_0$ is the coefficient of earth pressure at rest, which normally varies from 0.4 to 1.0 (0.5 is commonly used for marine sediments [30]), $\gamma_s$ is the unit weight of soil, $\gamma_w$ is the unit weight of water, $p$ is the pore pressure, $z$ is the maximum liquefaction depth and $P_b$ is the wave pressure at the seabed surface, which is given by

\[ P_b = \frac{\gamma_w H}{\cosh kd}\cos(kx - \omega t) \tag{12} \]

where “cos(kx − ωt)” denotes the spatial and temporal variations in wave pressure within the two-dimensional progressive wave described above.
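As a minimal sketch of how criterion (11) might be evaluated at a single point, the Python below computes the left-hand side of the inequality exactly as it is printed in the text. The function names are hypothetical, and the sign convention for z is an assumption that the text does not state: z is taken here as a vertical coordinate measured upward from the seabed surface, so points inside the seabed have z < 0. The pore pressure p and the surface wave pressure P_b are supplied by the caller.

```python
def liquefaction_margin(z, p, p_b, gamma_s, gamma_w, k0=0.5):
    """Left-hand side of criterion (11) as printed in the text.

    Assumption (not stated in the source): z is measured upward from the
    seabed surface, so z < 0 inside the seabed.
    """
    return -(1.0 + k0) * (gamma_s - gamma_w) * z / 3.0 + (p_b - p)


def is_liquefied(z, p, p_b, gamma_s, gamma_w, k0=0.5):
    """The soil at this point is taken as liquefied when the margin is <= 0."""
    return liquefaction_margin(z, p, p_b, gamma_s, gamma_w, k0) <= 0.0


# Illustrative values only (unit weights in N/m^3, pressures in Pa).
margin_no_excess = liquefaction_margin(-1.0, 0.0, 0.0, 19000.0, 9810.0)
liquefied_with_excess = is_liquefied(-1.0, 5000.0, 0.0, 19000.0, 9810.0)
```

With no excess pore pressure the margin stays positive (no liquefaction); once p − P_b exceeds the overburden term, the criterion is met.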
3 Results and discussion

3.1 Neural-genetic model configuration for wave-induced liquefaction

In general, wave-induced seabed liquefaction is calculated by solving complicated mathematical equations, such as those of poro-elastic models. An ANN-based model, however, does not need to solve complicated nonlinear equations; rather, it requires high quality input data and accurate values for the output data. As pointed out in previous work [19], the most important factors that significantly affect wave-induced soil liquefaction include the degree of saturation, seabed thickness, soil permeability, wave period, water depth and wave height. Since Cha [27] found that the occurrence of wave-induced liquefaction is very sensitive to the soil permeability, we use a fixed value of permeability in addition to the other parameters as inputs to the ANN model, as shown in Table 2. The wave-induced maximum liquefaction depth is the output value. Fig. 4 illustrates the structure of the ANN model for this study. The key components of this ANN model, whose inputs are the degree of saturation, seabed thickness, wave period, wave height and water depth, are displayed in Table 1.

The database generated by the poro-elastic model described in Section 2.1 is used to establish the ANN model. The range of the variables is shown in Table 2. As seen in the table, we built the database based on the most likely ranges of wave and soil conditions. Amongst these input data, we chose the most influential components of wave-induced liquefaction for the neural-genetic model: soil permeability, seabed thickness, degree of saturation, wave period, wave height and water depth. In the current study, we had approximately 20,000 outputs of maximum liquefaction depth from each simulation. Among these data we used 80% for
Fig. 4. Structure of ANN model for wave-induced liquefaction

Table 1. Key components of the neural network model

Training model
  Number of inputs            5
  Number of output neurons    1
  Number of hidden neurons    1 to 5
  Learning rate               0.5
  Momentum factor             0.2

Table 2. Input data for the poro-elastic model

Wave characteristics
  Wave period (T)             8 sec to 12.5 sec
  Wave height (H)             7.5 m to 10.5 m
  Water depth (d)             50 m to 100 m
Soil characteristics
  Soil permeability (Kz)      10^-4, 5 × 10^-4 m/sec
  Seabed thickness (h)        10 to 80 m
  Degree of saturation (S)    0.95 to 0.99
the training procedure, and the remaining data for validating the prediction capability using the best run of each case. In this paper, we not only use the correlation value ($R^2$) for comparison between the database and the neural-genetic prediction, but also the root mean squared error (RMSE), which is more meaningful in engineering applications. The RMSE is defined as

\[ \mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(L_{Ai} - L_{Pi}\right)^2} \tag{13} \]

where $L_{Ai}$ and $L_{Pi}$ are the liquefaction depths from the ANN model and the poro-elastic model, respectively, and $N$ is the total number of liquefaction depth data.

3.2 ANN model training using GAs for wave-induced liquefaction

Generally, it is time-consuming to configure and adjust the settings of an ANN model during the supervised training procedure. Even though the results of a conventionally trained ANN model are acceptable from the engineering viewpoint, an ANN model trained using GAs can reduce the complexity of the procedure; hence, it is an advantage to use GAs in conjunction with ANNs. As discussed in the previous sections, ANN models predict data based on the relationship between input and output values. This relationship is based on the iterative modification of network weights during the training procedure. Clearly, the training procedure is a key factor of an ANN model. However, the authors found that the training procedure can be time-consuming in terms of selecting the most appropriate settings for ANN training; therefore, the authors propose to use GAs to optimise the ANN model’s weights, as described in the previous section. The use of GAs is also advantageous in this research due to the problems inherent in regular gradient descent-based learning techniques (as discussed in Section 1.3).

Hence, to initiate the training procedure, we save the first weight configuration of the existing ANN model (small random values) for the initial GA population. Fig. 5 illustrates the concept of the chromosome, which we encoded from the ANN model. As seen in Fig. 5, the size of the chromosome depends on the number of weights in the ANN model. After we store the initial population from the ANN model, GA operations such as selection, crossover and mutation are performed to optimise the weights.
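The encoding and selection steps described so far can be sketched as follows. This is an illustrative Python outline under stated assumptions, not the authors' implementation: the weights of a single-hidden-layer network are assumed to be stored as lists of rows without bias terms, $E_{\max}$ is assumed to be the largest error in the current population, and all function names are hypothetical.

```python
import random


def encode(w_hidden, w_output):
    """Flatten both weight matrices (lists of rows) into one chromosome."""
    return [w for row in w_hidden for w in row] + [w for row in w_output for w in row]


def decode(chromosome, n_in, n_hidden, n_out):
    """Inverse of encode for a single-hidden-layer network (no bias terms assumed)."""
    cut = n_hidden * n_in
    w_hidden = [chromosome[r * n_in:(r + 1) * n_in] for r in range(n_hidden)]
    w_output = [chromosome[cut + r * n_hidden:cut + (r + 1) * n_hidden]
                for r in range(n_out)]
    return w_hidden, w_output


def fitness_values(errors):
    """f = E_max - E, Eq. (2); E_max is assumed to be the worst error in the population."""
    e_max = max(errors)
    return [e_max - e for e in errors]


def roulette_select(fitness, rng):
    """Return an index chosen with probability F_i / sum(F), Eq. (3)."""
    total = sum(fitness)
    if total == 0:
        return rng.randrange(len(fitness))  # all equal: pick uniformly
    pick = rng.random() * total             # pick lies in [0, total)
    running = 0.0
    for i, f in enumerate(fitness):
        running += f
        if pick < running:
            return i
    return len(fitness) - 1
```

A chromosome with the largest error receives zero fitness under this assumption and is therefore never selected by the roulette wheel.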
In this study, we adopted three crossover functions (simple, arithmetic and heuristic crossover) and three mutation functions (uniform, non-uniform and boundary mutation), which were developed by Michalewicz [31]. For example, Arithmetic Crossover produces two complementary linear combinations of the
Fig. 5. The concept of the chromosome used in the approach
parents, and Heuristic Crossover produces a linear extrapolation of the two individuals; it is the only operator that utilises fitness information. With regard to mutation, the first type (uniform mutation) may be explained as follows. Let $a_i$ and $b_i$ be the lower and upper bounds, respectively, of each variable $i$; the mutated value $x_i'$ is then

\[ x_i' = \begin{cases} U(a_i, b_i), & \text{if } i = j \\ x_i, & \text{if } i \neq j \end{cases} \tag{14} \]

where $U(a_i, b_i)$ is a uniform random number and $j$ is a randomly selected variable. Similarly, non-uniform mutation is described as

\[ x_i' = \begin{cases} x_i + (b_i - x_i)\,f(G), & \text{if } r_1 < 0.5 \\ x_i - (x_i - a_i)\,f(G), & \text{if } r_1 \ge 0.5 \end{cases} \tag{15} \]

where $f(G) = \left(r_2\left(1 - \frac{G}{G_{\max}}\right)\right)^{s}$, $r_1$ and $r_2$ are uniform random numbers in (0, 1), $G$ is the current generation, $G_{\max}$ is the maximum number of generations, and $s$ is a shape parameter. During GA experimentation, we altered the settings from one-point crossover up to ten-point crossover, depending on the chromosome size. Mutation was varied from 2% up to 30%.

3.3 Results for determining wave-induced liquefaction

Numerous experiments were conducted to determine the maximum liquefaction depth for two conditions of soil permeability. In this sub-section, the top results obtained using the neural-genetic approach are presented in Figs. 6 and 7, with accompanying discussion. Fig. 6 shows the predicted maximum liquefaction depth obtained from an ANN model trained by GAs versus the database maximum liquefaction depth (soil permeability $10^{-4}$ m/s). As seen in the figures, the prediction of maximum liquefaction depth agrees overall with the numerical calculation data. The correlation between the ANN model and the poro-elastic model is over 96%. The figure illustrates that the RMSE value lies in the 10 to 30% range, which is acceptable for engineering applications. The figures also indicate that the correlation of the neural-genetic model with the poro-elastic model and the RMSE values can improve depending on the GA settings used (e.g. Fig. 6(a) 6000 total generations, Fig. 6(b) 8000 total generations). Fig. 7 illustrates the predicted maximum liquefaction depth obtained using the neural-genetic model versus the poro-elastic numerical maximum liquefaction depth (soil permeability $5 \times 10^{-4}$ m/s). As shown in the figures, the prediction of maximum liquefaction depth using the neural-genetic model agrees well with the poro-elastic model, in that the correlation values are greater than 96% and the RMSE values are less than 25% in both cases. Fig. 7 results are
Fig. 6. Comparisons of the wave-induced liquefaction depth by the approach versus the poro-elastic model (Soil Permeability: 10^-4 m/sec). (a) 6000 Generations. (b) 8000 Generations
slightly better than those shown in Fig. 6 because the GA operation settings were based on those in Fig. 6(b). It is clearly shown that better results may be produced by varying the GA settings, with specific attention to increasing the number of generations and varying the crossover and mutation parameters.
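The two comparison measures used in this discussion can be computed as in the short Python sketch below. It follows the RMSE definition of Eq. (13) and takes the correlation value as the squared Pearson correlation, which is an assumption since the text does not spell out how $R^2$ was calculated. The text reports RMSE as a percentage without giving the normalisation, so none is applied here. Function names are illustrative.

```python
import math


def rmse(predicted, reference):
    """Root mean squared error between two equally long sequences, Eq. (13)."""
    n = len(predicted)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(predicted, reference)) / n)


def r_squared(predicted, reference):
    """Squared Pearson correlation (assumed meaning of the correlation value)."""
    n = len(predicted)
    mean_a = sum(predicted) / n
    mean_p = sum(reference) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(predicted, reference))
    var_a = sum((a - mean_a) ** 2 for a in predicted)
    var_p = sum((p - mean_p) ** 2 for p in reference)
    return cov * cov / (var_a * var_p)
```

Both functions expect the neural-genetic predictions and the poro-elastic reference depths in the same order.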
Fig. 7. Comparisons of the wave-induced liquefaction depth by the approach versus the poro-elastic model (Soil Permeability: 5 × 10^-4 m/sec). (a) 6000 Generations. (b) 8000 Generations
These results indicate that the performance of the neural-genetic model for the prediction of maximum wave-induced seabed liquefaction compares favourably with the authors’ previous results [29]. In this study, three crossover and three mutation functions were adopted and used in the neural-genetic model.
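To make the operators named in Section 3.2 concrete, the following Python sketch gives one possible reading of arithmetic crossover, heuristic crossover, uniform mutation and non-uniform mutation on real-valued chromosomes. It is a hedged illustration rather than the code used in the study: the function names are hypothetical, the non-uniform mutation is assumed to act on one randomly selected variable, the bound-respecting form $x_i - (x_i - a_i) f(G)$ is used for the downward step, and the feasibility check that usually accompanies heuristic crossover is omitted.

```python
import random


def arithmetic_crossover(p1, p2, rng):
    """Two complementary linear combinations of the parents."""
    a = rng.random()
    c1 = [a * x + (1.0 - a) * y for x, y in zip(p1, p2)]
    c2 = [(1.0 - a) * x + a * y for x, y in zip(p1, p2)]
    return c1, c2


def heuristic_crossover(better, worse, rng):
    """Linear extrapolation from the worse parent through the fitter one."""
    r = rng.random()
    return [b + r * (b - w) for b, w in zip(better, worse)]


def uniform_mutation(x, lower, upper, rng):
    """Eq. (14): redraw one randomly selected variable uniformly within its bounds."""
    j = rng.randrange(len(x))
    child = list(x)
    child[j] = rng.uniform(lower[j], upper[j])
    return child


def non_uniform_mutation(x, lower, upper, gen, gen_max, rng, shape=2.0):
    """Eq. (15): a step towards a bound that shrinks as gen approaches gen_max."""
    j = rng.randrange(len(x))      # assumption: one variable is mutated
    r1, r2 = rng.random(), rng.random()
    f = (r2 * (1.0 - gen / gen_max)) ** shape
    child = list(x)
    if r1 < 0.5:
        child[j] = x[j] + (upper[j] - x[j]) * f
    else:
        child[j] = x[j] - (x[j] - lower[j]) * f
    return child
```

Because f(G) falls to zero at the final generation, the non-uniform operator searches widely early on and only fine-tunes late in the run.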
4 Conclusions

In this study, we adopted the concept of GA-based training of ANN models in an effort to overcome the problems inherent in some ANN training procedures (i.e. gradient-based techniques) whilst providing accurate results for determining maximum liquefaction depth in a real-world application.

Unlike the conventional engineering mechanics approaches, the neural-genetic techniques are based on statistical theory, which is a data-driven technique and can be built with the knowledge of a quality database, and can save time in configuring and adjusting the settings of an ANN model during the supervised training process.

In the proposed neural-genetic model, based on a physical understanding of wave-induced seabed liquefaction, several important parameters, including wave period, water depth, wave height, seabed thickness and the degree of saturation, were used as the input parameters with constant soil permeability, whilst the maximum liquefaction depth was the output parameter. Experimental results demonstrate that the neural-genetic model is a successful technique in predicting the wave-induced maximum liquefaction depth.
References

1. Hagan MT, Demuth HB, Beale M (1996) Neural network design. PWS, Boston, MA
2. McCulloch WS, Pitts W (1943) A logical calculus of the ideas immanent in nervous activity. Bull Math Biophys 5:115–133
3. Maier HR, Dandy HR (1997) Modeling cyanobacteria (blue-green algae) in the River Murray using artificial neural networks. Math Comput Simulation 43:377–386
4. Dibike YB, Minns AW, Abbott MB (1999) Applications of artificial neural networks to the generation of wave equations from hydraulic data. J Hydraulic Res 37(1):81–97
5. Hurtado JE, Londono JE, Meza JE (2001) On the applicability of neural networks for soil dynamic amplification analysis. Soil Dyn Earthquake Eng 21(7):579–591
6. Lee TL, Jeng DS (2002) Application of artificial neural networks in tide forecasting. Ocean Eng 29(9):1003–1022
7. Mohamed AS, Holger RM, Mark BJ (2002) Predicting settlement of shallow foundations using neural networks. J Geotech Geo Envir Eng 128(9):785–793
8. Jeng DS, Lee TL, Lin C (2003) Application of artificial neural networks in assessment of Chi–Chi earthquake-induced liquefaction. Asian J Inf Technol 2(3):190–198
9. Leo SS, Lo HS (2004) Neural network based regression model of ground surface settlement induced by deep excavation. Automation in Construction 13:279–289
10. Holland J (1975) Adaptation in natural and artificial systems. University of Michigan Press. (Second edition: MIT, Cambridge, MA, 1999)
11. Yao X (1993) Evolving artificial neural networks. Int J Neural Syst 4(3):203–222
12. Gruau F, Whitley D, Pyeatt L (1996) A comparison between cellular encoding and direct encoding for genetic neural networks. In: Genetic programming 1996: proceedings of the first annual conference. MIT, Cambridge, MA, pp 81–89
13. Stanley KO, Miikkulainen R (2002) Evolving neural networks through augmenting topologies. Evol Comput 10(2):99–127
14. Zen K, Umehara Y, Finn WDL (1985) A case study of the wave-induced liquefaction of sand layers under damaged breakwater. In: Proceedings of the third Canadian conference on marine geotechnical engineering, pp 505–520
15. Silvester R, Hsu JRC (1989) Sines Revisited. J Waterways, Port, Coastal Ocean Eng, ASCE 115(3):327–344
16. Bjerrum J (1973) Geotechnical problem involved in foundations of structures in the North Sea. Geotechnique 23(3):319–358
17. Nataraja MS, Singh H, Maloney D (1980) Ocean wave-induced liquefaction analysis: a simplified procedure. In: Proceedings of an international symposium on soils under cyclic and transient loadings, pp 509–516
18. Rahman MS (1997) Instability and movement of ocean floor sediments – a review. Int J Offshore Polar Eng 7(3):220–225
19. Jeng DS (1997) Wave-induced seabed instability in front of a breakwater. Ocean Eng 24(10):887–917
20. Zen K, Yamazaki H (1991) Field observation and analysis of wave-induced liquefaction in seabed. Soils Found 31(4):161–179
21. Sassa S, Sekiguchi H (2001) Analysis of wave-induced liquefaction of sand beds. Geotechnique 51(2):115–126
22. Sassa S, Sekiguchi H, Miyamoto J (2001) Analysis of progressive liquefaction as moving-boundary problem. Geotechnique 51(10):847–857
23. Jeng DS (2003) Wave-induced seafloor dynamics. Appl Mech Rev 56(4):407–429
24. Jeng DS, Cha DH (2003) Effects of dynamic soil behaviour and wave nonlinearity on the wave-induced pore pressure and effective stresses in porous seabed. Ocean Eng 30(16):2065–2089
25. Montana DJ, Davis L (1989) Training feedforward neural networks using genetic algorithms. In: Proceedings of the international joint conference on artificial intelligence, pp 762–767
26. Miller GF, Todd PM, Hegde PM (1989) Designing neural networks using genetic algorithms. In: Proceedings of the third international conference on genetic algorithms. Morgan Kaufmann, San Francisco
27. Cha DH (2003) Mechanism of ocean waves propagating over a porous seabed. MPhil Thesis, Griffith University, Australia
28. Biot MA (1956) Theory of propagation of elastic waves in a fluid-saturated porous solid. Part I: low frequency range; part II: high frequency analysis. J Acoustics Soc 28:168–191
29. Jeng DS, Cha DH, Michael B (2004) Neural network model for the prediction of wave-induced liquefaction potential. Ocean Eng 31(17–18):2073–2086
30. Scott RF (1968) Principle of soil mechanics. Addison, MA
31. Michalewicz Z (1994) Genetic algorithms + data structures = evolution programs. AI Series, Springer, Berlin Heidelberg New York
32. Rooij van AJF, Jain LC, Johnson RP (1996) Neural network training using genetic algorithms. World Scientific, London
On the Design of Large-scale Cellular Mobile Networks Using Multi-population Memetic Algorithms

Alejandro Quintero and Samuel Pierre
Summary. This chapter proposes a multi-population memetic algorithm (MA) with migration and elitism to solve the problem of assigning cells to switches as a design step of large-scale mobile networks. Well known in the literature as an NP-hard combinatorial optimization problem, this problem requires recourse to heuristic methods, which in practice lead to good feasible solutions that are not necessarily optimal, the objective being rather to reduce the convergence time toward these solutions. Computational results obtained from extensive tests confirm the efficiency and the effectiveness of the MA in providing good solutions in comparison with other heuristic methods well known in the literature, especially for large-scale cellular mobile networks with a number of cells varying between 100 and 1000 and a number of switches varying between 5 and 10, which means that the search space size is between 5^100 and 10^1000.
1 Introduction

A Personal Communication Network is a wireless communication network which integrates various services such as voice, video and electronic mail, accessible from a single mobile terminal and for which the subscriber receives a single invoice. These various services are offered in an area called the cover zone, which is divided into cells. A base station installed in each cell manages all the communications within the cell. In the cover zone, cells are connected to special units called switches, which are located in mobile switching centers (MSC). When a user in communication goes from one cell to another, the base station of the new cell has the responsibility to relay this communication by allotting a new radio channel to the user. Supporting the transfer of the communication from one base station to another is called handoff. This mechanism, which primarily involves the switches, occurs when the level of signal received by the user reaches a certain threshold. We distinguish two types of handoffs. In the case of Figure 1, for example, when a user moves from cell A to cell B, this is referred to as a simple (soft) handoff because these two cells are connected to the same switch. The MSC which supervises the two cells remains the same and the induced cost is low. On the other hand, when the user moves from cell A to cell C, there is a complex handoff. The induced cost is high because both switches 1 and 2 remain active during the handoff procedure and the database containing information on subscribers must be updated.

A. Quintero and S. Pierre: On the Design of Large-scale Cellular Mobile Networks Using Multi-population Memetic Algorithms, Studies in Computational Intelligence (SCI) 82, 353–377 (2008) © Springer-Verlag Berlin Heidelberg 2008 www.springerlink.com

Fig. 1. Geographic division in a cellular network (panels: assignment of cells to switches; simple handoff between cell A and cell B under switch 1; complex handoff between cell A and cell C, involving switches 1 and 2)

The total operating cost of a cellular network includes two components: the cost of the links between the cells (base stations) and the switches to which they are joined, and the cost generated by the handoffs between cells. It therefore appears intuitively more judicious to join cells A and C to the same switch if the frequency of the handoffs between them is high. The problem of assigning cells to switches essentially consists of finding the configuration that minimizes the total operating cost of the network.

The resolution of this problem by an exhaustive search method would entail a combinatorial explosion, and therefore an exponential growth of execution times. Since assigning cells to switches in cellular mobile networks is an NP-hard problem, enumerative search methods are practically inappropriate for solving large-sized instances of this problem [1, 40]. Because they exhaustively examine the entire search space in order to find the optimal solution, they are only efficient for small search spaces corresponding to small-sized instances of the problem. For example, for a network with m switches and n cells, m^n solutions should be examined.

Merchant and Sengupta [27] have proposed the first heuristic to solve this problem. Their algorithm starts from an initial solution, which they attempt to improve through a series of greedy moves, while avoiding being blocked in a local minimum. The moves used to escape a local minimum explore a very limited set of options. These moves depend on the initial solution and do not necessarily lead to a good final solution. Other heuristic approaches have been developed for this kind of problem [1, 2, 7, 23, 39, 49].
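The growth of the search space can be made tangible with a few lines of Python. The sketch below is purely illustrative (the function names are not from the chapter): it counts the m^n candidate assignments and enumerates them for a toy instance, which is only practical when n is very small.

```python
from itertools import product


def search_space_size(n_cells, n_switches):
    """Number of ways to assign each of n cells to one of m switches: m ** n."""
    return n_switches ** n_cells


def enumerate_assignments(n_cells, n_switches):
    """Lazily yield every assignment as a tuple (switch of cell 0, switch of cell 1, ...)."""
    return product(range(n_switches), repeat=n_cells)


toy = list(enumerate_assignments(3, 2))           # 2 ** 3 candidate assignments
digits_large = len(str(search_space_size(100, 5)))  # decimal digits of 5 ** 100
```

Even the smallest large-scale case mentioned in the summary (100 cells, 5 switches) already yields a count with dozens of decimal digits, which rules out enumeration.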
In the most general terms, evolution can be described as a two-step iterative process, consisting of random variation followed by selection [11, 12]. In the real world, an evolutionary approach to solving engineering problems offers considerable advantages. One such advantage is adaptability to changing situations [11, 12]. Evolutionary algorithms have been applied successfully in various domains of search, optimization, and artificial intelligence [6, 19, 21, 31, 38, 47, 50, 53, 58].

Genetic Algorithms (GAs) are robust search techniques based on Darwin’s concepts of natural selection and genetic mechanisms [18, 22, 33]. They consist of creating a population of candidate solutions and applying probabilistic rules to simulate the evolution of the population [18]. They are used to solve extremely complex search and optimization problems which are difficult to handle using analytic or simple enumeration methods, by combining the exploration of the solution space with an adequate selection of the best results.

Memetic Algorithms (MAs) are inspired by Dawkins’ notion of a meme, defined as a unit of information that reproduces itself while people exchange ideas. In contrast to genes, memes are typically adapted by the people who transmit them before they are passed on to the next generation [35]. Informally, the idea exploited is that if a local optimiser is added to a genetic algorithm and applied to every child before it is inserted into the population, then a memetic algorithm can be thought of simply as a special kind of genetic search over the subspace of local optima. Recombination and mutation will usually produce solutions that are outside this space of local optima, but a local optimiser can then “repair” such solutions to produce final children that lie within this subspace, yielding a memetic algorithm [43]. Memetic algorithms have been applied with success to several other combinatorial optimization problems [29–32].
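The idea of applying a local optimiser to every child before it enters the population can be sketched in Python as follows. This is a deliberately small, single-population illustration on integer-coded solutions; it does not include the multi-population, migration or elitism mechanisms proposed in this chapter, and all names, parameter values and the steady-state replacement rule are assumptions made for the example.

```python
import random


def local_search(solution, cost, n_values):
    """Hill climbing: change one gene at a time until no single change lowers the cost."""
    best = list(solution)
    best_cost = cost(best)
    improved = True
    while improved:
        improved = False
        for i in range(len(best)):
            for v in range(n_values):
                if v == best[i]:
                    continue
                trial = best[:i] + [v] + best[i + 1:]
                trial_cost = cost(trial)
                if trial_cost < best_cost:
                    best, best_cost, improved = trial, trial_cost, True
    return best


def memetic_search(cost, n_genes, n_values, pop_size=10, generations=20,
                   p_mut=0.1, seed=0):
    """Genetic search in which every individual is first moved to a local optimum."""
    rng = random.Random(seed)
    pop = [local_search([rng.randrange(n_values) for _ in range(n_genes)], cost, n_values)
           for _ in range(pop_size)]
    for _ in range(generations):
        a, b = rng.sample(pop, 2)
        cut = rng.randrange(1, n_genes)                  # one-point crossover
        child = a[:cut] + b[cut:]
        child = [rng.randrange(n_values) if rng.random() < p_mut else g
                 for g in child]                         # gene-wise mutation
        child = local_search(child, cost, n_values)      # the "repair" (memetic) step
        worst = max(range(pop_size), key=lambda i: cost(pop[i]))
        if cost(child) < cost(pop[worst]):
            pop[worst] = child
    return min(pop, key=cost)
```

In this form the population only ever contains local optima, which is the "genetic search over the subspace of local optima" described above.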
In previous papers, we have studied the problem of assigning cells to switches in moderate-sized mobile networks with a number of cells varying between 100 and 200 and a number of switches varying between 5 and 7, which means that the search space size is between 5^100 and 7^200. In [42], a short letter, we presented preliminary results obtained from a memetic algorithm without the concepts of multi-population, elitism and migration. In [41], we have presented a comparative study between canonical genetic algorithm, tabu search and simulated annealing. In [41], we have presented a hybrid genetic algorithm to solve the problem of assigning cells to switches in moderate-sized mobile networks. In those papers, we did not study large-scale cellular mobile networks with a number of cells varying between 500 and 1000 and a number of switches varying between 8 and 10, because the average improvement rates are very small in comparison with well-known heuristic methods in the literature.

This chapter proposes a multi-population memetic algorithm with migrations (MA) to efficiently solve the problem of assigning cells to switches in cellular mobile networks. Section 2 presents background and related work. Section 3 first describes the genetic and memetic algorithms, then presents
a multi-population approach. Section 4 presents some adaptation and implementation details of the memetic algorithm and the local search strategy. Finally, Section 5 presents an analysis of the results and compares them to other methods well studied in the literature.
2 Background and related work

This assignment problem consists of determining a cell assignment pattern that minimizes a certain cost function while respecting certain constraints, especially those related to the limited capacity of the switches. An assignment of cells can be carried out according to a single or a double homing of cells. Single homing corresponds to the situation where a cell can only be assigned to a single switch. When a cell is related to two switches, this is referred to as double homing. In this chapter, only single homing is considered.

Let $n$ be the number of cells to be assigned to $m$ switches. The location of cells and switches is fixed and known. Let $H_{ij}$ be the cost per unit of time for a simple handoff between cells $i$ and $j$ involving only one switch, and $H'_{ij}$ the cost per unit of time for a complex handoff between cells $i$ and $j$ ($i, j = 1, \ldots, n$ with $i \neq j$) involving two different switches. $H_{ij}$ and $H'_{ij}$ are proportional to the handoff frequency between cells $i$ and $j$. Let $c_{ik}$ be the amortization cost of the link between cell $i$ and switch $k$ ($i = 1, \ldots, n$; $k = 1, \ldots, m$). Let $x_{ik}$ be a binary variable, equal to 1 if cell $i$ is related to switch $k$, and 0 otherwise.

The assignment of cells to switches is subject to a number of constraints. Each cell must be assigned to exactly one switch, which can be expressed as follows:

$$\sum_{k=1}^{m} x_{ik} = 1 \quad \text{for } i = 1, \ldots, n. \tag{1}$$

Let $z_{ijk}$ and $y_{ij}$ be defined as:

$$z_{ijk} = x_{ik}\, x_{jk} \quad \text{for } i, j = 1, \ldots, n \text{ and } k = 1, \ldots, m, \text{ with } i \neq j,$$

$$y_{ij} = \sum_{k=1}^{m} z_{ijk} \quad \text{for } i, j = 1, \ldots, n \text{ and } i \neq j.$$
$z_{ijk}$ is equal to 1 if cells $i$ and $j$, with $i \neq j$, are both connected to the same switch $k$; otherwise $z_{ijk}$ is equal to 0. Thus $y_{ij}$ takes the value 1 if cells $i$ and $j$ are both connected to the same switch and the value 0 if they are connected to different switches. The cost per time unit $f$ of the assignment is expressed as follows:

$$f = \sum_{i=1}^{n} \sum_{k=1}^{m} c_{ik}\, x_{ik} + \sum_{i=1}^{n} \sum_{\substack{j=1 \\ j \neq i}}^{n} H'_{ij}\, (1 - y_{ij}) + \sum_{i=1}^{n} \sum_{\substack{j=1 \\ j \neq i}}^{n} H_{ij}\, y_{ij} \tag{2}$$
On the Design of Large-scale Cellular Mobile Networks
The first term of the equation represents the link or cabling cost. The second term takes into account the cost of complex handoffs and the third, the cost of simple handoffs. Note that the cost function is quadratic in $x_{ik}$, because $y_{ij}$ is a quadratic function of $x_{ik}$. Any weighting could be taken into account directly in the definitions of the link and handoff costs. The capacity of a switch $k$ is denoted $M_k$. If $\lambda_i$ denotes the number of calls per unit of time directed to cell $i$, the limited capacity of the switches imposes the following constraint:

$$\sum_{i=1}^{n} \lambda_i\, x_{ik} \leq M_k \quad \text{for } k = 1, \ldots, m \tag{3}$$
according to which the total load of all cells assigned to switch $k$ must not exceed the capacity $M_k$ of the switch. Finally, the constraints of the problem are completed by:

$$x_{ik} = 0 \text{ or } 1 \quad \text{for } i = 1, \ldots, n \text{ and } k = 1, \ldots, m \tag{4}$$

$$z_{ijk} = x_{ik}\, x_{jk} \quad \text{for } i, j = 1, \ldots, n \text{ and } k = 1, \ldots, m \tag{5}$$

$$y_{ij} = \sum_{k=1}^{m} z_{ijk} \quad \text{for } i, j = 1, \ldots, n \text{ and } i \neq j \tag{6}$$
Constraints (1), (3) and (4) are those of a transportation problem. In fact, each cell $i$ can be likened to a factory that produces a call volume $\lambda_i$. The switches are then considered as warehouses of capacity $M_k$ where the production of the cells can be stored. Therefore, the problem is to minimize (2) subject to (1) and (3) to (6). Formulated in this way, the problem cannot be solved with a polynomial-time algorithm such as linear programming, because constraint (5) is not linear. Merchant and Sengupta [27, 28] replaced it by the following equivalent set of constraints:

$$z_{ijk} \leq x_{ik} \tag{7}$$

$$z_{ijk} \leq x_{jk} \tag{8}$$

$$z_{ijk} \geq x_{ik} + x_{jk} - 1 \tag{9}$$

$$z_{ijk} \geq 0 \tag{10}$$
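The equivalence between the product constraint (5) and the linear constraints (7) to (10) can be checked by brute force over the four binary combinations of $x_{ik}$ and $x_{jk}$. The following is a minimal sketch; the function names `linearized_range` and `check_equivalence` are illustrative and do not come from the chapter.

```python
from itertools import product


def linearized_range(x_ik: int, x_jk: int) -> list[int]:
    """Return every binary z that satisfies constraints (7)-(10) for fixed x values."""
    return [
        z
        for z in (0, 1)
        if z <= x_ik and z <= x_jk and z >= x_ik + x_jk - 1 and z >= 0
    ]


def check_equivalence() -> bool:
    """True if (7)-(10) leave exactly z = x_ik * x_jk feasible in all four cases."""
    return all(
        linearized_range(x_ik, x_jk) == [x_ik * x_jk]
        for x_ik, x_jk in product((0, 1), repeat=2)
    )


print(check_equivalence())  # True
```

For binary $x$ values, the upper bounds (7) and (8) force $z_{ijk}$ to 0 unless both cells use switch $k$, and the lower bound (9) forces it to 1 when they do.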
Thus, the problem can be reformulated as follows: minimize (2) subject to constraints (1), (3), (4) and (6) to (10). We can further simplify the problem by defining:

$$h_{ij} = H'_{ij} - H_{ij}$$
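$h_{ij}$ refers to the reduced cost per time unit of a complex handoff between cells $i$ and $j$. Relation (2) is then rewritten as follows:

$$f = \sum_{i=1}^{n} \sum_{k=1}^{m} c_{ik}\, x_{ik} + \sum_{i=1}^{n} \sum_{\substack{j=1 \\ j \neq i}}^{n} h_{ij}\, (1 - y_{ij}) + \underbrace{\sum_{i=1}^{n} \sum_{\substack{j=1 \\ j \neq i}}^{n} H_{ij}}_{\text{constant}}$$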
The assignment problem then takes the following form: Minimize:

$$f = \sum_{i=1}^{n} \sum_{k=1}^{m} c_{ik}\, x_{ik} + \sum_{i=1}^{n} \sum_{\substack{j=1 \\ j \neq i}}^{n} h_{ij}\, (1 - y_{ij}) \tag{11}$$
subject to (1), (3), (4) and (7) to (10). In this form, the assignment problem can be solved by standard methods such as integer programming. The total cost includes two types of cost, namely the cost of handoffs between two adjacent cells and the cost of cabling between cells and switches. The design is to be optimized subject to the constraint that the call volume of each switch must not exceed its call handling capacity. This kind of problem is NP-hard, so enumerative searches are impractical for moderate- and large-sized cellular mobile networks [27].

Merchant and Sengupta [27, 28] studied this assignment problem. Their algorithm starts from an initial solution, which it attempts to improve through a series of greedy moves while avoiding being trapped in a local minimum. The moves used to escape a local minimum explore only a very limited set of options. These moves depend on the initial solution and do not necessarily lead to a good final solution.

In [46], an engineering cost model was proposed to estimate the cost of providing personal communications services (PCS) in a new residential development. The cost model estimated the costs of building and operating a new PCS using existing infrastructure such as the telephone, cable television and cellular networks. In [14], economic aspects of configuring cellular networks are presented. The major components of costs and revenues as well as the major stakeholders were identified, and a model was developed to determine the system configuration (e.g., cell size, number of channels, link cost, etc.). For example, in a large cellular network, a cell located in eastern America cannot be assigned to a switch located in western America. In this case, the variable link cost is $\infty$. The geographical relationships between cells and switches are reflected in the link cost, so that the base station of a cell is generally assigned to a neighbouring switch and not to distant switches [60].
In [15], different methods have been proposed to estimate the handoff rate in PCS and the economic impacts of mobility on system configuration decisions (e.g., annual maintenance and operations, channel cost, etc.). The cost model used in this chapter is based on [14, 15, 46].
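To make the formulation of this section concrete, the sketch below evaluates the objective of Eq. (11) and the capacity constraint (3) for a single-homing assignment. It is only an illustration: the assignment is stored as a list `assign` with `assign[i]` equal to the (0-based) switch of cell `i`, and the function names and the small data set are assumptions made for this example, not values from the chapter.

```python
def assignment_cost(assign, link_cost, handoff_cost):
    """Cost f of Eq. (11) for a single-homing assignment.

    assign[i]          -> switch serving cell i (x_ik = 1 iff assign[i] == k)
    link_cost[i][k]    -> c_ik
    handoff_cost[i][j] -> h_ij, reduced cost of a complex handoff
    """
    n = len(assign)
    cabling = sum(link_cost[i][assign[i]] for i in range(n))
    handoffs = sum(
        handoff_cost[i][j]
        for i in range(n)
        for j in range(n)
        if i != j and assign[i] != assign[j]  # y_ij = 0: different switches
    )
    return cabling + handoffs


def is_feasible(assign, call_volume, capacity):
    """Check the capacity constraint (3) for every switch."""
    load = [0.0] * len(capacity)
    for cell, switch in enumerate(assign):
        load[switch] += call_volume[cell]
    return all(load[k] <= capacity[k] for k in range(len(capacity)))


# Hypothetical instance: 3 cells, 2 switches.
c = [[1, 4], [2, 1], [3, 1]]
h = [[0, 2, 1], [2, 0, 5], [1, 5, 0]]
lam = [1, 1, 1]
cap = [2, 2]

print(assignment_cost([0, 1, 1], c, h), is_feasible([0, 1, 1], lam, cap))  # 9 True
print(assignment_cost([1, 1, 1], c, h), is_feasible([1, 1, 1], lam, cap))  # 6 False
```

In this toy instance, the cheapest pattern puts all cells on one switch and removes every complex handoff, but it violates the capacity constraint, which is the tension the optimization has to resolve.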
3 Memetic Approach

In the field of combinatorial optimization, it has been shown that combining evolutionary algorithms with problem-specific heuristics can lead to highly effective approaches [10, 48]. These hybrid evolutionary algorithms combine the advantages of efficient heuristics incorporating domain knowledge and population-based search approaches [32, 34]. For further details on evolutionary algorithms, see [55, 56].

3.1 Basic Principles of Canonical Genetic Algorithms

Genetic algorithms (GA) are robust search techniques based on natural selection and genetic production mechanisms. GAs perform a search by evolving a population of candidate solutions through non-deterministic operators and by incrementally improving the individual solutions forming the population using mechanisms inspired by natural genetics and heredity (e.g., selection, crossover and mutation). In many cases, especially with problems characterized by many local optima (graph coloring, travelling salesman, network design problems, etc.), traditional optimization techniques fail to find high-quality solutions. GAs can be considered an efficient and interesting option [22, 52].

GAs [18] are composed of a first step (initialization of the population) and a loop. In each loop step, called a generation, the population is altered through selection and variation operators, then the resulting individuals are evaluated. It is hoped that the last generation will contain a good solution, but this solution is not necessarily the optimum [9].

Crossover is a process by which genes of two chosen strings are interchanged. To execute the crossover, strings of the mating pool are coupled at random. The crossover of a string pair of length $l$ is performed as follows: a position $i$ is chosen uniformly between 1 and $(l - 1)$, then two new strings are created by exchanging all values between positions $(i + 1)$ and $l$ of each string of the pair considered.
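The one-point crossover just described can be sketched as follows. The function name and the optional `cut` argument (useful for reproducible examples) are assumptions of this sketch; the two parent strings are those of Figure 2(b).

```python
import random


def one_point_crossover(parent_a, parent_b, cut=None, rng=random):
    """Exchange the tails of two equal-length strings after position `cut`.

    `cut` is the 1-based position i of the text, drawn uniformly from
    1..(l - 1) when not supplied; values at positions (i + 1)..l are swapped.
    """
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same length")
    length = len(parent_a)
    if length < 2:
        raise ValueError("crossover needs strings of length at least 2")
    if cut is None:
        cut = rng.randint(1, length - 1)  # randint is inclusive on both ends
    child_a = parent_a[:cut] + parent_b[cut:]
    child_b = parent_b[:cut] + parent_a[cut:]
    return child_a, child_b


a = [2, 1, 1, 1, 3, 2, 1]
b = [1, 3, 2, 1, 2, 1, 2]
print(one_point_crossover(a, b, cut=3))
# ([2, 1, 1, 1, 2, 1, 2], [1, 3, 2, 1, 3, 2, 1])
```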
Mutation is the process by which a randomly chosen bit in a chromosome is flipped. It is employed to introduce new information into the population and also to prevent the population from becoming saturated with similar chromosomes (premature convergence). Large mutation rates increase the probability that good schemata are destroyed, but they also increase population diversity. A schema is a subset of chromosomes that are identical in certain fixed positions [18, 22]. The next generation of chromosomes is generated from the present population by selection and reproduction. The selection process is based on the fitness of the present population, such that the fitter chromosomes contribute more to the reproductive pool; typically this is also done probabilistically. Selection requires inheritance: the offspring must retain at least some of the features that made their parents fitter than average [18].
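Probabilistic, fitness-based selection of this kind is commonly implemented with a roulette wheel, the method used later in this chapter. The sketch below adapts it to a minimization problem by weighting each individual by the inverse of its cost; this particular inversion, the requirement of strictly positive costs and the function names are assumptions of the example, not details given in the chapter.

```python
import random


def selection_probabilities(costs):
    """Selection probability of each individual under the inverted wheel."""
    if any(c <= 0 for c in costs):
        raise ValueError("costs must be strictly positive")
    weights = [1.0 / c for c in costs]
    total = sum(weights)
    return [w / total for w in weights]


def roulette_wheel_min(costs, n_select, rng=random):
    """Draw `n_select` indices (with replacement); cheaper individuals are likelier."""
    probabilities = selection_probabilities(costs)
    return rng.choices(range(len(costs)), weights=probabilities, k=n_select)


# Costs 1, 2 and 4 give selection probabilities 4/7, 2/7 and 1/7.
print(selection_probabilities([1, 2, 4]))
print(roulette_wheel_min([1, 2, 4], n_select=5, rng=random.Random(0)))
```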
The steady-state genetic algorithm differs from the canonical genetic algorithm in that typically a single new member is inserted into the new population at any time. A replacement strategy defines which member of the population will be replaced by the new offspring [59]. Sayoud et al. [51] present an application of steady-state genetic algorithms to minimize the total installation cost of a communication network by optimally designing the topology layout and assigning the corresponding capacities.

Natural selection is not an independent force of nature; it is the result of the competition of natural organisms for resources. In contrast, in the science of breeding this problem does not exist. The selection is done by human breeders. Their strategies are based on the assumption that mating two individuals with high fitness is more likely to produce an offspring of high fitness than mating two random individuals [36].

3.2 Basic Principles of Memetic Algorithms

In problems characterized by many local optima, canonical genetic algorithms (CGA) can suffer from excessively slow convergence before finding an accurate solution, because they use minimal a priori knowledge and fail to exploit local information [4, 13, 38, 44, 53]. This may prevent them from being of real practical interest for many large-scale constrained applications. Memetic algorithms (MA) are population-based heuristic search approaches for combinatorial optimization problems based on cultural evolution [35, 47]. The memetic approach takes the concept of evolution as employed in genetic algorithms and combines it with an element of local search. It can be seen as a genetic algorithm in which one variation operator is a local search operator, providing either the local minimum closest to the starting point or a point on the path leading to this closest local optimum.
This local search in the neighbourhood is applied to each newly created offspring before its insertion into the new population. In the context of evolutionary computation, a hybrid evolutionary algorithm is called memetic if the individuals representing solutions to a given problem are improved by a local search or another improvement technique [29]. Kado et al. [24] compare different implementations of hybrid genetic algorithms. In this chapter, we propose a memetic algorithm with a local refinement strategy that combines the strengths of both, bringing global and local search aspects to the problem of assigning cells to switches in cellular mobile networks. The local refinement strategy used with our memetic algorithm is tabu search. Tabu search is an adaptive technique used in combinatorial optimization to solve difficult problems [17, 39, 40]. Tabu search can indeed be applied to different problems and different instances of problems, but mainly the local search neighborhood and the way the tabu list is built and exploited
are subject to many variations, which gives tabu search its meta-heuristic nature. The tabu list is not always a list of solutions, but can be a list of forbidden moves/perturbations [16, 45]. Tabu search is a hill-climber endowed with a tabu list (list of solutions or moves) [54]. Let $X_i$ denote the current point and let $N(X_i)$ denote all admissible neighbors of $X_i$, where $Y$ is an admissible neighbor of $X_i$ if $Y$ is obtained from $X_i$ through a single move not in the tabu list, and $Y$ does not belong to the tabu list. The tabu list is updated as $X_i$ is replaced with the best point in $N(X_i)$; the search stops after nbmax steps or if $N(X_i)$ is empty. Other mechanisms of tabu search are intensification and diversification: through intensification, the algorithm explores more thoroughly the attractive regions that may lead to a local optimum; through diversification, on the other hand, the search is moved to previously unvisited regions, which is important in order to avoid getting trapped in local minima [16].

3.3 Multi-population Approach

Canonical genetic algorithms are powerful and perform well on a broad class of problems. However, part of the biological and cultural analogies used to motivate a genetic algorithm search are inherently parallel. One approach is the partitioning of the population into several subpopulations (multi-population approach) [57]. Each subpopulation evolves independently of the others, which helps maintain genetic diversity. Diversity is the term used to describe the relative uniqueness of each individual in the population. From time to time, however, there is some interchange of genetic material between different subpopulations. This exchange of individuals is called migration [37]. Sometimes a topology is introduced on the population, so that individuals can only interact with nearby chromosomes in their neighborhood [20].
The parallel implementation of the migration model not only shows a speedup in computation time, but also needs fewer objective function evaluations than a single-population algorithm for some classes of problems [55]. Cohoon et al. [5] present results in which parallel algorithms with migration found better solutions than a sequential GA for optimization problems, and Lienig [26] indicates that parallel genetic algorithms in isolated evolving subpopulations with migrations may offer advantages over sequential approaches. The migration algorithm is controlled by many parameters that affect its efficiency and accuracy. Among other things, one must decide the number and the size of the populations, the rate of the migration, the migration interval and the destination of the migrants. The migration interval is the number of generations between each migration, and the migration rate is the number of individuals selected for migration. An important property of the architecture used between the demes is its degree, which is the number of neighbors
of each subpopulation or deme (a separately evolving subset of the whole population) [26]. In this chapter, all the demes have the same degree, denoted $\delta$. The degree completely determines the cost of communications; it also influences the size of the demes and consequently the computation time. The execution time of the parallel algorithm is the sum of the communication and computation times, $T_p = g\, n_d\, T_f + \delta\, T_c$, whereas the execution time of the sequential algorithm only comprises the computation time, $T_s = g\, n_d\, T_f$, where $g$ is the domain-dependent number of generations until convergence, $n_d$ is the population size, $T_f$ is the time of one fitness evaluation, and $T_c$ is the time required to communicate with one neighbour. For further details on the parameters associated with the migration algorithm, see [3, 57].
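As an illustration of how the migration parameters interact, the sketch below performs one migration step between fully connected demes. It is a simplified reading, not the authors' implementation: the probability `p_migrate` plays the role of a randomized migration interval, `max_emigrants` bounds the migration rate, and `max_accept` (a hypothetical parameter) bounds the number of immigrants a deme accepts. Individuals may be any objects for which `cost` returns a value to minimize.

```python
import copy
import random


def migrate(demes, p_migrate, max_emigrants, max_accept, cost, rng=random):
    """One migration step; returns new demes and leaves the inputs untouched."""
    # Each deme offers its best individuals with probability p_migrate.
    pools = []
    for deme in demes:
        if rng.random() < p_migrate:
            n_send = rng.randint(1, max_emigrants)
            pools.append(sorted(deme, key=cost)[:n_send])
        else:
            pools.append([])

    new_demes = []
    for d, deme in enumerate(demes):
        # Candidates come from every other deme, never from the deme itself.
        incoming = [ind for s, pool in enumerate(pools) if s != d for ind in pool]
        limit = min(max_accept, len(incoming), len(deme))
        if limit == 0:
            new_demes.append(list(deme))
            continue
        n_accept = rng.randint(1, limit)
        # Clones are inserted; emigrants also stay in their original deme.
        immigrants = [copy.deepcopy(ind) for ind in sorted(incoming, key=cost)[:n_accept]]
        survivors = sorted(deme, key=cost)[: len(deme) - n_accept]
        new_demes.append(survivors + immigrants)
    return new_demes


# Toy demes whose individuals are simply their own costs.
demes = [[5, 9], [1, 8], [3, 7]]
print(migrate(demes, 1.0, 1, 1, cost=lambda ind: ind, rng=random.Random(0)))
# [[5, 1], [1, 3], [3, 1]]
```

With `max_accept` kept below the deme size, the best resident of each deme always survives the exchange, which fits the elitist selection used elsewhere in the chapter.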
4 Implementation Details

4.1 Memetic Algorithm Implementation

We have introduced a simple notation to represent cells and switches, and to encode chromosomes and genes. We opted for a non-binary representation of the chromosomes [19]. In this representation, the genes (squares) represent the cells, and the integers they contain represent the switch to which the cell of row $i$ (gene of the $i$th position) is assigned. Our chromosomes therefore have a length equal to the number of cells in the network, $n$, and the maximal value that a gene can take is equal to the number of switches, $m$. A chromosome represents the set of cells in the cellular mobile network. A particular value of the string is called a gene and the possible values are called alleles, taken from the alphabet $V = \{1, 2, \ldots, m\}$. A chromosome of the proposed MA thus consists of a sequence of positive integers that represent the IDs of the switches to which the cells are assigned.

The first individual of the initial population is the one obtained when all cells are assigned to the nearest switch. This first chromosome is therefore created in a deterministic way. The creation of the other chromosomes of the population is probabilistic and follows the strategy of a population without duplicates: we test equality between individuals and remove duplicates. This strategy ensures the diversity of the population and a good coverage of the search space. All chromosomes of the population satisfy the unique assignment constraint, but not necessarily the capacity constraint of the switches.

Figure 2(a) shows the overall procedure of the mutation operator. This operator is uniform random. For example, the gene of the 4th position is changed from switch 1 to switch 3. The crossover operator used in the proposed MA is the conventional one-point crossover (Figure 2(b)).
(a) Mutation
A  = 2 1 1 1 3 2 1
A' = 2 1 1 3 3 2 1

(b) Crossover
A  = 2 1 1 ↓ 1 3 2 1
B  = 1 3 2 ↓ 1 2 1 2
A' = 2 1 1 1 2 1 2
B' = 1 3 2 1 3 2 1

Fig. 2. An example of mutation and crossover
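The integer encoding, the duplicate-free initial population and the uniform mutation of Figure 2(a) can be sketched as below. The helper names are illustrative, switches are numbered 1 to `n_switches` as in the alphabet $V$, and forcing the mutated gene to take a different switch is an assumption of this sketch rather than a detail stated in the chapter.

```python
import random


def mutate(chromosome, n_switches, rng=random):
    """Reassign one uniformly chosen cell to a different, uniformly chosen switch."""
    if n_switches < 2:
        raise ValueError("mutation needs at least two switches")
    child = list(chromosome)
    pos = rng.randrange(len(child))
    choices = [s for s in range(1, n_switches + 1) if s != child[pos]]
    child[pos] = rng.choice(choices)
    return child


def initial_population(nearest_switch, n_switches, size, rng=random):
    """Nearest-switch individual first, then distinct random chromosomes."""
    if size > n_switches ** len(nearest_switch):
        raise ValueError("not enough distinct chromosomes for this size")
    population = [list(nearest_switch)]
    seen = {tuple(nearest_switch)}
    while len(population) < size:
        candidate = [rng.randint(1, n_switches) for _ in nearest_switch]
        if tuple(candidate) not in seen:  # population without duplicates
            seen.add(tuple(candidate))
            population.append(candidate)
    return population


rng = random.Random(1)
population = initial_population([1, 1, 2], n_switches=3, size=10, rng=rng)
print(population[0])  # [1, 1, 2]
print(mutate([2, 1, 1, 1, 3, 2, 1], n_switches=3, rng=rng))
```

Every chromosome built this way satisfies the unique assignment constraint by construction, since each gene holds exactly one switch ID; the capacity constraint still has to be checked separately.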
The choice of the candidates is based on the evaluation function given by:

$$f = \sum_{i=1}^{n} \sum_{k=1}^{m} c_{ik}\, x_{ik} + \sum_{i=1}^{n} \sum_{\substack{j=1 \\ j \neq i}}^{n} h_{ij}\, (1 - y_{ij}) \tag{12}$$
In our adaptation, every chromosome is first evaluated according to the cost criterion. Sorting the chromosomes in ascending order of objective value places the best potential chromosomes at the head of the population. The second stage of evaluation consists of checking the chromosomes against the capacity constraint on the switches and determining the best chromosome that satisfies this constraint. To select the offspring of the new generation, we used the concept of elitism: only the lowest-ranked string is deleted, the best string is automatically kept in the population, and for the others we used the roulette wheel method. Because the problem to be solved is a minimization problem, we applied the roulette wheel to the inverted objective values of the chromosomes. The new selected population then contains both chromosomes that satisfy the capacity constraint of the switches and chromosomes that violate it. The number of generations is fixed at the beginning of the execution.

For the migration algorithm used in this chapter, subpopulations are arranged in a fully-meshed topology. Here, individuals may migrate from any subpopulation to another. For each subpopulation, a pool of potential emigrants (clones) is constructed from the other subpopulations, and they are not deleted from their original populations. The migration interval is incorporated into the parallel algorithm as a probability $P_m$, and the migration rate is incorporated as a maximum value $S_m$. For each subpopulation in the parallel algorithm, migration is achieved as follows. At the end of a generation, a uniformly distributed random number $x$ is generated. If $x < P_m$, then migration is initiated. During migration, a uniform random number determines the number $n_s$ of individuals to send, between 1 and $S_m$. The selection of the individuals for migration is a fitness-based process. The best $n_s$ individuals in the
[Figure 3: subpopulations 2, ..., i, ..., k each contribute their $S_m$ best individuals to a pool; the $P_r$ best individuals of the pool are selected and replace the $P_r$ worst individuals of subpopulation 1 (before and after exchange).]

Fig. 3. Scheme for migration of individuals between subpopulations
subpopulation are sent to the other subpopulations. Whether or not emigrants are sent to the other subpopulations, each subpopulation then checks whether emigrants are arriving from its neighbours. If so, a uniform random number $p_r$ determines the number of accepted individuals; the best $p_r$ individuals are then received into the subpopulation and replace the $p_r$ least fit individuals. Figure 3 gives a detailed description of the unrestricted migration scheme of $k$ subpopulations with fitness-based selection. Subpopulations $2, 3, \ldots, k$ construct a pool of their best individuals (fitness-based migration). Then the $p_r$ best individuals are chosen from this pool and replace the worst $p_r$ individuals in subpopulation 1. This cycle is performed for every subpopulation. Thus, it is ensured that no subpopulation receives individuals from itself. Figure 4 shows the proposed multi-population memetic algorithm with migration.

Migration intervals are typically specified as a fixed number of generations, known as an epoch. The problem with using a fixed epoch value is that migration is globally synchronized across all subpopulations. Using a random interval allows the subpopulations to evolve asynchronously [34].

4.2 Local Search Strategy

This section presents the implementation details of tabu search, the local refinement strategy used to improve the individuals representing solutions provided by the genetic algorithm. To solve the assignment problem with tabu search, we have chosen a search domain free from capacity constraints on the switches, but respecting the constraints of unique assignment of cells to switches. We associate two values with each solution: the first is the intrinsic cost of the solution, which is calculated from the objective function; the second is the evaluation of the solution, which takes into account the cost and the penalty for not respecting the capacity constraints. At each step, the solution that has the best evaluation is chosen.
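The two values attached to each solution can be illustrated as follows. The chapter does not specify the form of the penalty, so this sketch assumes, purely for illustration, a penalty proportional to the total overload of the switches with a hypothetical weight `penalty_weight`; the function names are also illustrative.

```python
def switch_loads(assign, call_volume, n_switches):
    """Total call volume carried by each switch (assign[i] is the 0-based switch of cell i)."""
    loads = [0.0] * n_switches
    for cell, switch in enumerate(assign):
        loads[switch] += call_volume[cell]
    return loads


def evaluation(cost, assign, call_volume, capacity, penalty_weight=10.0):
    """Intrinsic cost plus an assumed linear penalty for capacity overload."""
    loads = switch_loads(assign, call_volume, len(capacity))
    overload = sum(max(0.0, load - cap) for load, cap in zip(loads, capacity))
    return cost + penalty_weight * overload


# A feasible solution keeps its intrinsic cost; an overloaded one is penalized.
print(evaluation(9.0, [0, 1, 1], [1, 1, 1], [2, 2]))  # 9.0
print(evaluation(6.0, [1, 1, 1], [1, 1, 1], [2, 2]))  # 16.0
```

Under this assumption, an infeasible solution with a low intrinsic cost can still be visited by the search, but it is ranked behind feasible solutions of comparable cost.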
Once an initial solution is built from the problem
Initialize population(gen)
Evaluate population(gen)
for each individual i ∈ gen do
    i = tabu_search(i)
end for
while not terminated do
    repeat
        select two individuals i, j ∈ gen
        apply Crossover(i, j) giving children(c)
        c = tabu_search(c)
        add children(c) to newgen
    until crossover = false
    for each individual i ∈ gen do
        if probability-mutation then
            i = apply tabu_search(Mutation(i))
            add (i) to newgen
        end if
    end for
    gen = Select_elitist(newgen)
    begin migration
        if migration appropriate
            Choose emigrants(population(gen))
            Send clones of emigrants
        end if
        if immigrants available
            Im = Receive immigrants
        end if
    end migration
    gen = Select_elitist(Im ∪ gen)
end while

Fig. 4. Multi-population memetic algorithm with migration and elitism
data, the short-term memory component attempts to improve it while avoiding cycles. The middle-term memory component seeks to intensify the search in specified neighbourhoods, while the long-term memory component aims at diversifying the exploration area. The neighbourhood $N(S)$ of a solution $S$ is defined by all the solutions that are accessible from $S$ by applying a move $a \to b$ to $S$, where $a \to b$ is defined as the reassignment of cell $a$ to switch $b$. To evaluate the solutions in the neighbourhood
$N(S)$, we define the gain $G_S(a, b)$ associated with the move $a \to b$ and the solution $S$ by:

$$G_S(a, b) = \begin{cases} \displaystyle \sum_{\substack{i=1 \\ i \neq a}}^{n} (h_{ai} + h_{ia})\, x_{i b_0} - \sum_{\substack{i=1 \\ i \neq a}}^{n} (h_{ai} + h_{ia})\, x_{ib} + c_{ab} - c_{a b_0} & \text{if } b \neq b_0 \\ M & \text{otherwise} \end{cases} \tag{13}$$

where:
• $h_{ij}$ refers to the handoff cost between cells $i$ and $j$;
• $b_0$ is the switch of cell $a$ in solution $S$, that is, before the application of move $a \to b$;
• $x_{ik}$ takes value 1 if cell $i$ is assigned to switch $k$, 0 otherwise;
• $c_{ik}$ is the cost of linking cell $i$ to switch $k$;
• $M$ is an arbitrarily large number.

The short-term memory moves iteratively from one solution to another, by applying moves, while prohibiting a return to the $k$ latest visited solutions. It starts with an initial solution, obtained simply by assigning each cell to the closest switch, according to a Euclidean distance metric. The objective of this memory component is to improve the current solution, either by diminishing its cost or by diminishing the penalties.

The middle-term memory component tries to intensify the search in promising regions. It is introduced after the end of the short-term memory component and allows a return to solutions we may have omitted. It mainly consists in defining the regions of intensified search, and then choosing the types of move to be applied.

To diversify the search, we use a long-term memory structure in order to guide the search towards regions that have not been explored. This is often done by generating new initial solutions. In this case, an $n \times m$ table (where $n$ is the number of cells and $m$ the number of switches) counts, for each link $(a, b)$, the number of times this link appears in the visited solutions. A new initial solution is generated by choosing, for each cell $a$, the least visited link $(a, b)$. Solutions visited during the intensification phase are not taken into account because they result from a different type of move than those applied in the short- and long-term memory components.

4.3 Experimental Setting

We submitted our approach to a series of tests in order to determine its efficiency and sensitivity to different parameters.
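Returning to the gain of Eq. (13) used by the tabu search of Section 4.2, the sketch below computes $G_S(a, b)$ and cross-checks it against a full recomputation of the objective of Eq. (11). The function names, the 0-based switch indices and the small data set are assumptions of this example; `float("inf")` stands in for the arbitrarily large number $M$.

```python
BIG_M = float("inf")  # stands in for the arbitrarily large number M


def total_cost(assign, link_cost, handoff_cost):
    """Objective of Eq. (11); used here only to cross-check the gain."""
    n = len(assign)
    cabling = sum(link_cost[i][assign[i]] for i in range(n))
    handoffs = sum(
        handoff_cost[i][j]
        for i in range(n)
        for j in range(n)
        if i != j and assign[i] != assign[j]
    )
    return cabling + handoffs


def move_gain(assign, a, b, link_cost, handoff_cost):
    """Gain G_S(a, b) of Eq. (13) for reassigning cell a to switch b."""
    b0 = assign[a]
    if b == b0:
        return BIG_M
    # Handoffs with former neighbours on b0 become complex handoffs.
    leaving = sum(
        handoff_cost[a][i] + handoff_cost[i][a]
        for i in range(len(assign))
        if i != a and assign[i] == b0
    )
    # Handoffs with cells already on b become simple handoffs.
    joining = sum(
        handoff_cost[a][i] + handoff_cost[i][a]
        for i in range(len(assign))
        if i != a and assign[i] == b
    )
    return leaving - joining + link_cost[a][b] - link_cost[a][b0]


c = [[1, 4], [2, 1], [3, 1]]
h = [[0, 2, 1], [2, 0, 5], [1, 5, 0]]
before, after = [0, 1, 1], [1, 1, 1]
print(move_gain(before, 0, 1, c, h))                       # -3
print(total_cost(after, c, h) - total_cost(before, c, h))  # -3
```

With the sign convention of Eq. (13), the gain equals the change in $f$ caused by the move, so the most negative value identifies the move that lowers the cost the most, and it can be obtained in $O(n)$ time without re-evaluating the whole objective.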
In the first step, the experiments were run assuming that the cells are arranged on a hexagonal grid of almost equal length and width. The antennas are located at the center of the cells and distributed evenly on the grid. The cost of cabling between a cell and a switch is proportional to the
distance separating them [40]. The call rate $\gamma_i$ of a cell $i$ follows a gamma law with mean and variance equal to one. The call durations inside the cells are distributed according to an exponential law with parameter equal to 1 [8]. If a cell $j$ has $k$ neighbors, the $[0, 1]$ interval is divided into $k + 1$ subintervals by choosing $k$ random numbers distributed evenly between 0 and 1. At the end of the service period in cell $j$, the call is either transferred to the $i$th neighbour ($i = 1, \ldots, k$) with a handoff probability $r_{ij}$ equal to the length of the $i$th interval, or ended with a probability equal to the length of the $(k + 1)$th interval. To find the call volumes and consistent handoff rates, the cells are considered as M/M/1 queues forming a Jackson network [25]. The incoming rates $\alpha_i$ in the cells are obtained by solving the following system:

$$\alpha_i - \sum_{j=1}^{n} \alpha_j\, r_{ij} = \gamma_i \quad \text{with } i = 1, \ldots, n$$
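The generation of handoff probabilities and the solution of the traffic equations above can be sketched as follows. This is an illustration only: the function names are hypothetical, and the linear system is solved here by a simple fixed-point iteration, which is a choice made for this sketch and not necessarily the method used by the authors.

```python
import random


def split_unit_interval(k, rng=random):
    """Cut [0, 1] at k uniform random points and return the k + 1 lengths."""
    cuts = sorted(rng.random() for _ in range(k))
    bounds = [0.0] + cuts + [1.0]
    return [hi - lo for lo, hi in zip(bounds, bounds[1:])]


def incoming_rates(gamma, r, tol=1e-12, max_iter=10_000):
    """Solve alpha_i - sum_j alpha_j * r[i][j] = gamma_i by fixed-point iteration.

    r[i][j] is the probability that a call leaving cell j is handed off to
    cell i. The iteration converges when every cell ends calls with a
    positive probability, i.e. every column of r sums to less than 1.
    """
    n = len(gamma)
    alpha = list(gamma)
    for _ in range(max_iter):
        new = [gamma[i] + sum(alpha[j] * r[i][j] for j in range(n)) for i in range(n)]
        if max(abs(x - y) for x, y in zip(new, alpha)) < tol:
            return new
        alpha = new
    raise RuntimeError("fixed-point iteration did not converge")


# Two cells that hand off half of their calls to each other.
alpha = incoming_rates([1.0, 1.0], [[0.0, 0.5], [0.5, 0.0]])
print([round(a, 6) for a in alpha])  # [2.0, 2.0]
print(len(split_unit_interval(3, random.Random(0))))  # 4
```

In the two-cell example, each cell receives its own fresh traffic of 1 plus half of the other cell's traffic, which gives $\alpha = 1 + 0.5\,\alpha$ and hence $\alpha = 2$.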
If the incoming rate $\alpha_i$ is greater than the service rate, the distribution is rejected and drawn again. The handoff rate $h_{ij}$ is defined by:

$$h_{ij} = \lambda_i\, r_{ij}$$

All the switches have the same capacity $M$, calculated as follows:

$$M = \frac{1}{m} \left(1 + \frac{K}{100}\right) \sum_{i=1}^{n} \lambda_i$$
where $K$ is uniformly chosen between 10 and 50, which ensures a global excess of 10 to 50% of the switches' capacity compared to the cells' call volume. In the second step, we generate an initial population of 100 chromosomes. In the third step, we evaluate each chromosome with the objective function, which also makes it possible to check it against the capacity constraint. Finally, in the last step, the cycle of generations begins, each new population replacing the previous one. The number of generations, 400, is fixed at the outset.

To determine the number of subpopulations in parallel, MA was executed over a set of 600 cases with 3 instances of the problem in series of 20 runs for each assignment pattern, with a number of populations varying between 1 and 10. This experiment shows that MA converges to good solutions with a number of populations varying between 7 and 10, as shown in Figure 5. To define the population size, MA was executed over a set of 600 cases with 3 instances of the problem in series of 20 runs for each assignment pattern with 8 populations. This experiment shows that MA converges to good solutions with a population size varying between 80 and 140. The values used by MA are: the number of generations is 400; the population size is 100; the number of populations is 8; the crossover probability is 0.9; the mutation probability is 0.08; the migration interval ($P_m$) is 0.1; the migration rate ($S_m$) is 0.4 and the emigrants accepted ($P_r$)
[Figure 5: evaluation cost versus number of populations in parallel (1 to 10) for three instances: 5 switches, 100 cells; 6 switches, 150 cells; 7 switches, 200 cells.]

Fig. 5. MA convergence results

[Figure 6: fitness value versus number of generations (1 to 401) for 0, 20, 40, 60 and 80 migrants.]

Fig. 6. Number of immigrants and MA convergence
is 0.2. A larger migration interval is normally used in connection with a larger migration rate (Figure 6). Whereas the migration algorithm seeks to improve the normalized cost and reliability of the CGA, it is also important to ensure that migration does not introduce an unacceptable time overhead (Table 1). In order to analyze the performance of the multi-population memetic algorithm without migration (MPW), migration is turned off ($P_m = 0.0$). Turning off migration for the analysis of the multi-population memetic algorithm ensures that the evaluation cost measurements indicate the effects of the parallel algorithm rather than the effects of migration. In order to compare the performance of MA and MPW,
Table 1. CPU time for MA execution

Element                                                   CPU time (%)
Generate an initial population                            1
Evaluation of the chromosomes                             49
Crossover, mutation, local search, selection procedure    41
Immigration procedure                                     1
Communication time                                        7
a set of experiments was performed to evaluate the quality of the solutions in terms of their costs. In these experiments, the results obtained by MPW are compared directly with those of MA. The two algorithms always provide feasible solutions. However, MA yields an improvement in evaluation cost in comparison with MPW, because MA converges faster than MPW to good solutions with a small number of generations (Figure 6). All the test runs described in this section were performed in a networked workstation environment operating at 100 Mbps with 10 PCs (Pentium 500 MHz).
5 Performance Evaluation and Numerical Results

In order to compare the performance of memetic algorithms with that of the other heuristics, two types of experiments were performed: a set of experiments to evaluate the quality of the solutions in terms of their costs, and another set to evaluate the performance of MA in terms of CPU times.

5.1 Comparison with Other Heuristics

Merchant and Sengupta [27] have designed a heuristic, which we call H, for solving the cell assignment problem. Pierre and Houéto [40] used tabu search (TS) for solving the same problem. We compare TS and heuristic H with MA. For the experiments, we used a number of cells varying between 100 and 1000, and a number of switches varying between 5 and 10, which means that the search space size is between $5^{100}$ and $10^{1000}$. In all our tests, the total number of evaluations remained the same. The three heuristics always find feasible solutions. However, this only establishes the feasibility of the solutions obtained, without showing whether these solutions are among the best. Figure 7 shows the results obtained for 5 different instances of the problem: 5 switches and 100 cells, 6 switches and 150 cells, 7 switches and 200 cells, 8 switches and 500 cells, and 10 switches and 1000 cells. For each instance, we tested 16 different cases whose evaluation costs represent the average over 100 runs of each algorithm.
Fig. 7. Comparison between MA, tabu search and heuristic H. Each panel plots the fitness value (evaluation cost) obtained by the memetic algorithm, tabu search and heuristic H for test cases 1 to 16 of one mobile network instance.
(a) Evaluation cost comparison for a problem instance with 5 switches, 100 cells
(b) Evaluation cost comparison for a problem instance with 6 switches, 150 cells
(c) Evaluation cost comparison for a problem instance with 7 switches, 200 cells
(d) Evaluation cost comparison for a problem instance with 8 switches, 500 cells
(e) Evaluation cost comparison for a problem instance with 10 switches, 1000 cells (Fig. 7, continued)
Table 2. MA average improvement rates
Instance                    Tabu search   Heuristic H
5 switches, 100 cells       1.35%         1.77%
6 switches, 150 cells       1.94%         2.01%
7 switches, 200 cells       2.19%         4.16%
8 switches, 500 cells       3.97%         6.51%
10 switches, 1000 cells     4.84%         7.98%
The three heuristics always find feasible solutions with objective values close to the optimum. In every series of tests considered, MA yields an improvement in the cost function in comparison with the other two heuristics; in terms of evaluation fitness, MA provides better results than tabu search and heuristic H. Table 2 summarizes the results. Although these improvements look small, given the initial link, handoff and annual maintenance costs of large-sized cellular mobile networks (on the order of hundreds of millions of dollars), they represent a large reduction in costs, on the order of millions of dollars over a 10-year period. For example, in a cellular network composed of 300 cells, with an initial link and handoff cost of $350,000 for each cell (about $105M in total), an improvement of 2% in the cost function represents an approximate saving of $2M over 10 years.
In terms of CPU times, for a large number of cells, TS is a bit faster than heuristic H. Conversely, for problems of smaller size, TS is a bit slower. MA is slower than both heuristic H and TS. However, this is not an important drawback, because the heuristic is used in the design and planning phase of cellular mobile networks.
5.2 Quality of the Solutions
An MA solution does not necessarily correspond to a global minimum. An intuitive lower bound for the problem is:
$$LB_1 = \sum_{i=1}^{n} \min_{k} (c_{ik})$$
which is the link cost of the solution obtained by assigning each cell $i$ to the nearest switch $k$. This lower bound does not take handoff costs into account. In fact, we suppose that the capacity constraint is relaxed and that all cells could be assigned to a single switch. Thus, we have a lower bound whatever the values of $M_k$ and $\lambda_i$. Table 3 summarizes the results. MA gives good solutions in comparison with the lower bound. Note that the lower bound does not include handoff costs and therefore no solution could equal the lower bound.
Table 3. Relative distances between the MA solutions and the lower bound
Instance of problem         MA's mean distance (%)
5 switches, 100 cells       1.35
6 switches, 150 cells       1.94
7 switches, 200 cells       2.19
8 switches, 500 cells       3.97
10 switches, 1000 cells     4.84
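As a hedged illustration of the lower bound LB1 of Sect. 5.2, the following minimal Python sketch sums, for every cell, the cheapest link cost to any switch. The small cost matrix, the MA cost of 12.6 and the helper names are hypothetical, and the relative distance is assumed here to be (cost - LB1) / LB1, which the chapter does not state explicitly.

```python
def lower_bound_lb1(link_cost):
    """LB1: for each cell (row), take the cheapest link cost to any switch."""
    return sum(min(row) for row in link_cost)


def relative_distance(solution_cost, lower_bound):
    """Assumed definition: gap to the lower bound, in percent."""
    return 100.0 * (solution_cost - lower_bound) / lower_bound


# Hypothetical instance: 4 cells (rows) and 2 switches (columns).
link_cost = [
    [3.0, 5.0],
    [4.0, 2.0],
    [6.0, 6.5],
    [1.0, 7.0],
]

lb1 = lower_bound_lb1(link_cost)      # 3 + 2 + 6 + 1
gap = relative_distance(12.6, lb1)    # 12.6 is a made-up MA solution cost
print(lb1, round(gap, 2))
```

Because capacity constraints and handoff costs are ignored, any feasible assignment costs at least LB1, so the gap is never negative.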
6 Conclusion
In this chapter, we proposed a multi-population memetic algorithm with elitism (MA) to design large-scale cellular mobile networks, and specifically to solve the problem of assigning cells to switches in cellular mobile networks. To select the offspring of the new generation, we have used the concept of elitism: only the lowest-ranked string is deleted and the best string is automatically kept in the population. Also, the migrants are clones, and they are not deleted from their original populations. The local refinement strategy used with our memetic algorithm is tabu search. Experiments have been conducted to measure the quality of the solutions provided by this algorithm. To evaluate the performance of this approach, we defined two lower bounds for the global optimum, which are used as references to judge the quality of the obtained solutions. Generally, the results are sufficiently close to the global optimum.
Computational results confirm the efficiency and effectiveness of MA in providing good solutions in comparison with tabu search [40] and Merchant and Sengupta’s heuristic [27], especially for large-scale cellular mobile networks. With a number of cells varying between 100 and 1000 and a number of switches varying between 5 and 10, the search space size is between $5^{100}$ and $10^{1000}$; for the largest instance, the average improvement rates over tabu search and heuristic H are on the order of 5% and 8%, respectively. We have also improved on the results reported in [43], where a memetic algorithm without elitism was used. This improvement represents a large reduction in maintenance and operation costs, on the order of millions of dollars over a 10-year period. This heuristic can be used for designing next-generation mobile networks.
References
[1] Beaubrun R, Pierre S, Conan J (1999) An efficient method for optimizing the assignment of cells to MSCs in PCS networks. In: Proceedings of the eleventh international conference on wireless communication, wireless 99, vol 1. Calgary (AB), July 1999, pp 259–265
[2] Bhattacharjee P, Saha D, Mukherjee A (1999) Heuristics for assignment of cells to switches in a PCSN: a comparative study. In: International conference on personal wireless communications, Jaipur, India, February 1999, pp 331–334
[3] Cantu-Paz E (2000) Efficient and accurate parallel genetic algorithms, Kluwer Academic, Dordrecht
[4] Ching-Hung W, Tzung-Pei H, Shian-Shyong T (1998) Integrating fuzzy knowledge by genetic algorithms. IEEE Trans Evol Comput 2(4):138–149
[5] Cohoon J, Martin W, Richards D (1991) A multi-population genetic algorithm for solving the K-partition problem on hyper-cubes. In: Proceedings of the fourth international conference on genetic algorithms, pp 244–248
[6] Costa D (1995) An evolutionary Tabu Search algorithm and the NHL scheduling problem. INFOR 33(3):161–178
[7] Demirkol I, Ersoy C, Caglayan MU, Delic H (2001) Location area planning in cellular networks using simulated annealing. In: Proceedings of IEEE INFOCOM 2001, vol 1, 2001, pp 13–20
[8] Fang Y, Chlamtac I, Lin Y (1997) Modeling PCS networks under general call holding time and cell residence time distributions. IEEE/ACM Trans Network 5(6):893–905
[9] Fogel D (1995) Evolutionary computation. Piscataway, NJ
[10] Fogel D (1995) Evolutionary computation: toward a new philosophy of machine intelligence. IEEE, New York
[11] Fogel D (1999) An overview of evolutionary programming. Springer-Verlag, Berlin Heidelberg New York, pp 89–109
[12] Fogel D (1999) An introduction to evolutionary computation and some applications. Wiley, Chichester, UK
[13] Forrest S, Mitchell M (1999) What makes a problem hard for a genetic algorithm? Some anomalous results and their explanation. Machine Learning 13(2):285–319
[14] Gavish B, Sridhar S (1995) Economic aspects of configuring cellular networks. Wireless Netw 1(1):115–128
[15] Gavish B, Sridhar S (2001) The impact of mobility on cellular network configuration. Wireless Netw 7(1):173–185
[16] Glover F, Laguna M (1993) Tabu search. Kluwer, Boston
[17] Glover F, Taillard E, Werra D (1993) A user’s guide to tabu search. Ann Oper Res 41(3):3–28
[18] Goldberg DE (1989) Genetic algorithms in search, optimization and machine learning. Addison-Wesley, Reading, MA
[19] Gondim RLP (1996) Genetic algorithms and the location area partitioning problem in cellular networks. In: Proceedings of the vehicular technology conference 1996, Atlanta, VA, April 1996, pp 1835–1838
[20] Gorges-Schleuter M (1989) ASPARAGOS: an asynchronous parallel genetic optimization strategy. In: Proceedings third international conference on genetic algorithms, pp 422–427
[21] He L, Mort N (2000) Hybrid genetic algorithms for telecommunications network back-up routing. BT Tech J 18(4):42–50
[22] Holland J (1975) Adaptation in natural and artificial systems. The University of Michigan Press, Ann Arbor
[23] Hurley S (2002) Planning effective cellular mobile radio networks. IEEE Transactions on Vehicular Technology 51(2):243–253
[24] Kado K, Ross P, Corne D (1995) A study of genetic algorithms hybrids for facility layout problems. In: Eshelman LJ (ed). Proceedings of the sixth international conference genetic algorithms, San Mateo, CA. Morgan Kaufmann, Los Altos, CA, pp 498–505
[25] Kleinrock L (1975) Queuing systems I: theory. Wiley, New York
[26] Lienig (1997) A parallel genetic algorithm for performance-driven VLSI routing. IEEE Transactions on Evolutionary Computation 1(1):29–39
[27] Merchant A, Sengupta B (1995) Assignment of cells to switches in PCS networks. IEEE/ACM Transactions on Networking 3(5):521–526
[28] Merchant A, Sengupta B (1994) Multiway graph partitioning with applications to PCS networks. 13th Proceedings of IEEE Networking for Global Communications, INFOCOM ’94 2:593–600
[29] Merz P, Freisleben B (2000) Fitness landscape analysis and memetic algorithms for the quadratic assignment problem. IEEE Trans Evol Comput 4(4):337–352
[30] Merz P, Freisleben B (1997) Genetic local search for the TSP: new results. In: Proceedings of the IEEE international conference evolutionary computation, Piscataway, NJ, pp 159–164
[31] Merz P, Freisleben B (1998) Memetic algorithms and the fitness landscape of the graph bi-partitioning problem. In: Eiben AE, Bäck T, Schoenauer M, Schwefel HP (eds) Proceedings of the fifth international conference on parallel problem solving from nature PPSN V. Springer, Berlin Heidelberg New York, pp 765–774
[32] Merz P, Freisleben B (1999) A comparison of memetic algorithms, tabu search, and ant colonies for the quadratic assignment problem. In: Proceedings of the 1999 international congress of evolutionary computation (CEC’99). IEEE, New York
[33] Michalewicz M (1996) Genetic algorithms + data structures = evolution programs. Springer, Berlin Heidelberg New York
[34] Moscato P (1993) An introduction to population approaches for optimization and hierarchical objective functions: a discussion on the role of tabu search, vol 41, pp 85–121
[35] Moscato P, Norman MG (1993) A memetic approach for the traveling salesman problem implementation of a computational ecology for combinatorial optimization on message passing systems. IOS, pp 177–186
[36] Muhlenbein H, Schlierkamp-Voosen D (1993) Predictive models for the breeder genetic algorithm I. Continuous parameter optimization. Trans Evol Comput 1(1):25–49
[37] Munetomo M, Takai Y, Sato Y (1993) An efficient migration scheme for subpopulations-based asynchronously parallel genetic algorithms. In: Proceedings of the fifth international conference on genetic algorithms. Morgan Kaufmann, Los Altos, CA, p 649
[38] Olivier F (1998) An evolutionary strategy for global minimization and its Markov chain analysis. IEEE Trans Evol Comput 2(3):77–90
[39] Pierre S, Elgibaoui A (1997) A tabu-search approach for designing computer-network topologies with unreliable components. IEEE Trans Reliab 46(3):350–359
[40] Pierre S, Houéto F (2002) A tabu search approach for assigning cells to switches in cellular mobile networks. Comput Commun 25:464–477
[41] Quintero A, Pierre S (2003) Assigning cells to switches in cellular mobile networks: a comparative study. Comput Commun 26(9):950–960
[42] Quintero A, Pierre S (2002) A memetic algorithm for assigning cells to switches in cellular mobile networks. IEEE Commun Lett 6(11):484–486
[43] Radcliffe NJ, Surry PD (1994) Formal memetic algorithms. Springer Verlag LNCS 865, Berlin Heidelberg New York, pp 1–16
[44] Rankin R, Wilkerson R, Harris G, Spring J (1993) A hybrid genetic algorithm for an NP-complete problem with an expensive evaluation function. In: Proceedings of the 1993 ACM/SIGAPP symposium on applied computing: states of the art and practice, Indianapolis, USA, pp 251–256
[45] Rayward-Smith V, Osman I, Reeves C, Smith G (1996) Modern heuristic search methods. Wiley, New York
[46] Reed DP (1993) The cost structure of personal communication services. IEEE Commun Mag 31(4):102–108
[47] Reynolds RG, Sverdlik W (1994) Problem solving using cultural algorithms. In: IEEE world congress on computational intelligence, Proceedings of the first IEEE conference on evolutionary computation, vol 2, pp 645–650
[48] Reynolds RG, Zhu S (2001) Knowledge-based function optimization using fuzzy cultural algorithms with evolutionary programming. IEEE Trans Syst Man Cybernet, Part B 31(1):1–18
[49] Saha D, Mukherjee A, Bhattacharjee P (2000) A simple heuristic for assignment of cell to switches in a PCS network. Wireless Personal Commun 12:209–224
[50] Salomon R (1998) Evolutionary algorithms and gradient search: similarities and differences. IEEE Trans Evol Comput 2(2):45–55
[51] Sayoud H, Takahashi K, Vaillant B (2001) Designing communication network topologies using steady-state genetic algorithms. IEEE Commun Lett 5(3):113–115
[52] Schaffer J (1987) Some effects of selection procedures on hyperplane sampling by genetic algorithms. Pitman, London, pp 89–99
[53] Schenecke V, Vornberger V (1997) Hybrid genetic algorithms for constrained placement problems. IEEE Trans Evol Comput 1(4):266–277
[54] Sebag M, Schoenauer M (1997) A society of hill-climbers. In: Proceedings of the fourth IEEE international conference on evolutionary computation, pp 319–324
[55] Bäck T (1996) Evolutionary algorithms in theory and practice. Oxford University Press, New York
[56] Bäck T, Schwefel H (1993) An overview of evolutionary algorithms for parameter optimization. Evol Comput 1(1):1–23
[57] Tanese R (1989) Distributed genetic algorithms. In: Schaffer JD (ed) Proceedings of the third international conference on genetic algorithms. Morgan Kaufmann, San Mateo CA, pp 434–439
[58] Turney P (1995) Cost-sensitive classification: empirical evaluation of a hybrid genetic decision tree induction algorithm. J Artif Intell Res 2:369–409
[59] Vavak F, Fogarty T (1996) Comparison of steady state and generational genetic algorithms for use in non stationary environments. In: Proceedings of IEEE international conference on evolutionary computation, pp 192–195
[60] Wheatly C (1995) Trading coverage for capacity in cellular systems: a system perspective. Microwave J 38(7):62–76
A Hybrid Cellular Genetic Algorithm for the Capacitated Vehicle Routing Problem
Enrique Alba and Bernabé Dorronsoro
Summary. Cellular genetic algorithms (cGAs) are a kind of genetic algorithm (GA) – population based heuristic – with a structured population so that each individual can only interact with its neighbors. The existence of small overlapped neighborhoods in this decentralized population provides both diversity and opportunities for exploration, while the exploitation of the search space is strengthened inside each neighborhood. This balance between intensification and diversification makes cGAs naturally suitable for solving complex problems. In this chapter, we solve a large benchmark (composed of 160 instances) of the Capacitated Vehicle Routing Problem (CVRP) with a cGA hybridized with a problem customized recombination operation, an advanced mutation operator integrating three mutation methods, and two well-known local search algorithms for routing problems. The studied test-suite contains almost every existing instance for CVRP in the literature. In this work, the best-so-far solution is found (or even improved) in 80% of the tested instances (126 out of 160), and in the other cases (20%, i.e. 34 out of 160) the deviation between our best solution and the best-known one is always very low (under 2.90%). Moreover, 9 new best-known solutions have been found.
E. Alba and B. Dorronsoro: A Hybrid Cellular Genetic Algorithm for the Capacitated Vehicle Routing Problem, Studies in Computational Intelligence (SCI) 82, 379–422 (2008). © Springer-Verlag Berlin Heidelberg 2008, www.springerlink.com
1 Introduction
Transportation plays an important role in the logistics of many companies since it usually accounts for a high percentage of the value added to goods. Therefore, the use of computerized methods in transportation often results in significant savings of up to 20% of the total costs (see Chap. 1 in [1]). A prominent problem in the field of transportation consists in finding the optimal routes for a fleet of vehicles that serve a set of clients. In this problem, an arbitrary set of clients must receive goods from a central depot. This general scenario allows many related problems to be defined: determining the optimal number of vehicles, finding the shortest routes, and so on, all of them subject to restrictions such as vehicle capacity, time windows for deliveries, etc. This variety of scenarios leads to a plethora of problem variants in practice.
Fig. 1. The Vehicle Routing Problem consists in serving a set of geographically distributed customers (points) from a depot (a) using the minimum cost routes (b)
Some reference case studies where the
application of vehicle routing algorithms has led to substantial cost savings can be found in Chaps. 10 and 14 in [1].
As stated before, the Vehicle Routing Problem (VRP) [2] consists in delivering goods to a set of customers with known demands through minimum-cost vehicle routes originating and terminating at the depot, as can be seen in Figs. 1a and 1b (a detailed description of the problem is available in Sect. 2). The VRP is a very important source of problems, since solving it is equivalent to solving multiple TSP problems at once. Due to the difficulty of this problem (NP-hard) and because of its many industrial applications, it has been extensively studied both theoretically and in practice [3]. There is a large number of extensions to the canonical VRP. One basic extension is known as the Capacitated VRP (CVRP), in which vehicles have fixed capacities of a single commodity. Many different variants can be constructed from CVRP; some of the most important ones [1] are those including Time Windows restrictions (VRPTW: customers must be supplied following a certain time schedule), Pick-up and Delivery (VRPPD: customers will require goods to be either delivered or picked up), or Backhauls (VRPB: like VRPPD, but deliveries must be completed before any pickups are made). In fact, there are many more extensions for this problem, like the use of Multiple Depots (MDVRP), Split Deliveries (SDVRP), Stochastic variables (SVRP), or Periodic scheduling (PVRP). The reader can find all of them, together with the latest best solutions, chapters and related material, on our public web site [4].
We consider in this chapter the Capacitated Vehicle Routing Problem (CVRP), in which a fixed fleet of delivery vehicles of the same capacity must service known customer demands for a single commodity from a common depot at minimum transit costs.
The CVRP has been studied in a large number of separate works in the literature, but (to our knowledge) no work addresses such a huge benchmark, since it means solving 160 different instances. We use such a large set of instances to test the behavior of our algorithm in many different scenarios in order to give a deep analysis of our algorithm and a general
view of this problem not biased by any ad hoc selection of individual instances. The included instances are characterized by many different features: instances from the real world, theoretical ones, clustered, not clustered, with homogeneous and heterogeneous demands on customers, with the existence of drop times or not, etc.
In recent VRP history, there has been a steady evolution in the quality of the solution methodologies used for this problem, borrowed both from the exact and the heuristic fields of research. However, due to the difficulty of the problem, no known exact method is capable of consistently solving to optimality instances involving more than 50 customers [1, 5]. In fact, it is also clear that, as for most complex problems, non-customized heuristics would not compete in terms of solution quality with present techniques like the ones described in [1, 6]. Moreover, the power of exploration of some modern techniques like Genetic Algorithms or Ant Systems is not yet fully explored, especially when combined with an effective local search step. All these considerations could allow us to refine solutions to optimality.
Particularly, we present in this chapter a Cellular Genetic Algorithm (cGA) [7] hybridized with specialized recombination and mutation operators and with a local search algorithm for solving CVRP. Genetic Algorithms (GAs) are heuristics based on an analogy with biological evolution. A population of individuals representing tentative solutions is maintained. New individuals are produced by combining members of the population, and they replace existing individuals according to some given policy. In order to induce a lower selection pressure (larger exploration) compared to that of panmictic GAs (here panmictic means that an individual may mate with any other in the population; see Fig. 2a), the population can be decentralized by structuring it in some way [8] (see Figs. 2b and 2c).
Fig. 2. (a) A panmictic GA has all its individuals (black points) in the same population. Structuring the population usually leads to distinguishing between (b) distributed and (c) cellular GAs
Cellular GAs are a subclass of GAs in which the population is structured in a specified topology (usually a toroidal mesh of dimensions d = 1, 2 or 3), so that individuals may only interact with their neighbors (see Fig. 2c). The pursued effect is to improve on the diversity and exploration capabilities of the algorithm (due to the presence of overlapped small neighborhoods) while still admitting an easy
combination with local search at the level of each individual to improve on exploitation [9]. These techniques, also called diffusion or fine-grained models, have been popularized, among others, by early works of Gorges-Schleuter [10], Manderick and Spiessens [7], and more [11–13].
The contribution of this work is then to define a powerful yet simple hybrid cGA capable of competing with the best known approaches in solving the CVRP in terms of accuracy (final cost) and effort (the number of evaluations made). For that goal, we test our algorithm over a large selection of instances (160), which allows us to draw deep and meaningful conclusions. Besides, we compare our results against the best existing ones in the literature, even improving them in some instances. In [14] the reader can find an earlier work with a comparison between our algorithm and some other known heuristics for a reduced set of 8 instances. In that work, we showed the advantages of embedding local search techniques into a cGA for solving CVRP, since our hybrid cGA was the best algorithm out of all those compared in terms of accuracy and time. Cellular GAs represent a paradigm much simpler in terms of customization and comprehension than others such as Tabu Search (TS) [15, 16] and similar (very specialized or very abstract) algorithms [17, 18]. This is an important point too, since a strong emphasis on simplicity and flexibility is nowadays a must in research to achieve widely useful contributions [6].
The chapter is organized in the following manner. In Sect. 2 we define CVRP. The proposed hybrid cGA is thoroughly described in Sect. 3. The objective of Sect. 4 is to justify the choices we have made for the design of our highly hybridized algorithm. Sect. 5 presents the results of our tests, comparing them with the best-known values in the literature. Finally, our conclusions and future lines of research are presented in Sect. 6.
2 The Vehicle Routing Problem
The VRP can be defined as an integer programming problem which falls into the category of NP-hard problems [19]. Among the different variants of the VRP, we work here with the Capacitated VRP (CVRP), in which every vehicle has a uniform capacity of a single commodity. The CVRP is defined on an undirected graph $G = (V, E)$, where $V = \{v_0, v_1, \ldots, v_n\}$ is a vertex set and $E = \{(v_i, v_j) \mid v_i, v_j \in V, i < j\}$ is an edge set. Vertex $v_0$ stands for the depot, from which $m$ identical vehicles of capacity $Q$ must serve all the cities or customers, represented by the set of $n$ vertices $\{v_1, \ldots, v_n\}$. We define on $E$ a non-negative cost, distance or travel time matrix $C = (c_{ij})$ between customers $v_i$ and $v_j$. Each customer $v_i$ has a non-negative demand of goods $q_i$ and a drop time $\delta_i$ (the time needed to unload all goods). Let $V_1, \ldots, V_m$ be a partition of $V$; a route $R_i$ is a permutation of the customers in $V_i$ specifying the order of visiting them, starting and finishing at the depot $v_0$. The cost of a given route $R_i = \{v_0, v_1, \ldots, v_{k+1}\}$, where $v_j \in V$ and $v_0 = v_{k+1} = 0$ (0 denotes the depot), is given by:
$$\mathrm{Cost}(R_i) = \sum_{j=0}^{k} c_{j,j+1} + \sum_{j=0}^{k} \delta_j, \qquad (1)$$
and the cost of the problem solution ($S$) is:
$$F_{\mathrm{CVRP}}(S) = \sum_{i=1}^{m} \mathrm{Cost}(R_i). \qquad (2)$$
The CVRP consists in determining a set of $m$ routes (i) of minimum total cost, see (2); (ii) starting and ending at the depot $v_0$; and such that (iii) each customer is visited exactly once by exactly one vehicle; subject to the restrictions that (iv) the total demand of any route does not exceed $Q$ ($\sum_{v_j \in R_i} q_j \le Q$); and (v) the total duration of any route is not larger than a preset bound $D$ ($\mathrm{Cost}(R_i) \le D$). All the vehicles have the same capacity and carry a single kind of commodity. The number of vehicles is either an input value or a decision variable. In this study, the length of routes is minimized independently of the number of vehicles used.
It is clear from our description that the VRP is closely related to two difficult combinatorial problems. On the one hand, we can get an instance of the Multiple Travelling Salesman Problem (MTSP) just by setting the vehicle capacity to infinity ($Q = \infty$). An MTSP instance with $m$ vehicles can be transformed into a TSP instance by adjoining to the graph $m - 1$ additional copies of node 0 (depot) and its incident edges (there are no edges among the $m$ depot nodes). On the other hand, the question of whether there exists a feasible solution for a given instance of the VRP is an instance of the Bin Packing Problem (BPP). So the VRP is extremely difficult to solve in practice because of the interplay of these two underlying difficult models (TSP and BPP). In fact, the largest solvable instances of the VRP are two orders of magnitude smaller than those of the TSP [20].
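The definitions above can be made concrete with a small sketch. The Python code below evaluates Eq. (1) and Eq. (2) and checks conditions (iii) to (v) for a candidate solution. It is only an illustration under stated assumptions: the function names and the 4-node instance are hypothetical, the depot is index 0, and the depot drop time is taken to be 0.

```python
def route_cost(route, cost, drop_time):
    """Eq. (1): travel cost along depot -> customers -> depot plus drop times."""
    path = [0] + list(route) + [0]                 # v_0, v_1, ..., v_k, v_{k+1}
    travel = sum(cost[a][b] for a, b in zip(path, path[1:]))
    drops = sum(drop_time[v] for v in path[:-1])   # j = 0, ..., k
    return travel + drops


def solution_cost(routes, cost, drop_time):
    """Eq. (2): total cost of all routes."""
    return sum(route_cost(r, cost, drop_time) for r in routes)


def is_feasible(routes, cost, demand, drop_time, capacity, max_duration):
    """Conditions (iii)-(v): each customer once, capacity Q, duration D."""
    visited = sorted(v for r in routes for v in r)
    if visited != list(range(1, len(demand))):
        return False
    return all(
        sum(demand[v] for v in r) <= capacity
        and route_cost(r, cost, drop_time) <= max_duration
        for r in routes
    )


# Hypothetical instance: depot 0 and customers 1..3.
cost = [[0, 2, 3, 4],
        [2, 0, 1, 5],
        [3, 1, 0, 2],
        [4, 5, 2, 0]]
demand = [0, 3, 4, 2]
drop_time = [0, 1, 1, 1]
routes = [[1, 2], [3]]

total = solution_cost(routes, cost, drop_time)
print(total, is_feasible(routes, cost, demand, drop_time, 7, 10))
```

Here the first route costs 2 + 1 + 3 for travel plus 2 for drop times, and the second costs 4 + 4 plus 1, so the solution is feasible for Q = 7 and D = 10 but not for Q = 6.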
3 The Proposed Algorithm
In this section we present a detailed description of the algorithm we have developed. Basically, we use a simple cGA highly hybridized with specific recombination and mutation operators, and also with an added local post-optimization step. Algorithm 3.1 shows a pseudocode of JCell, our basic hybrid cGA.
Algorithm 3.1 Pseudocode of JCell
 1: proc Steps_Up(cga) // Algorithm parameters in 'cga'
 2:   while not Termination_Condition() do
 3:     for x ← 1 to WIDTH do
 4:       for y ← 1 to HEIGHT do
 5:         n_list ← Get_Neighborhood(cga, position(x,y));
 6:         parent1 ← Binary_Tournament(n_list);
 7:         parent2 ← Individual_At(cga, position(x,y));
 8:         aux_indiv ← Recombination(cga.Pc, parent1, parent2);
 9:         aux_indiv ← Mutation(cga.Pm, aux_indiv);
10:         aux_indiv ← Local_Search(cga.Pl, aux_indiv);
11:         Evaluate_Fitness(aux_indiv);
12:         Insert_If_Better(position(x,y), aux_indiv, cga, aux_pop);
13:       end for
14:     end for
15:     cga.pop ← aux_pop;
16:     Update_Statistics(cga);
17:   end while
18: end proc Steps_Up;
In JCell, the population is structured in a 2D toroidal grid, and the neighborhood defined on it (line 5) always contains 5 individuals: the one being considered (position(x,y)) plus the North, East, West, and South individuals (called NEWS, linear5, or Von Neumann neighborhood [7, 9, 21]). The first parent for a recombination (operator of arity 2) is chosen by using binary
tournament selection (BT) inside this neighborhood (line 6), while the other parent is the current individual itself (line 7). The two parents can be the same individual (replacement) or not. The algorithm iteratively considers as “current” each individual in the grid. Genetic operators are applied to the individuals in lines 8 and 9 to progressively improve the average fitness of individuals in the grid, also on a neighborhood basis (explained in Sects. 3.2 and 3.3). We add to this basic cGA a local search technique in line 10 consisting in applying 2-Opt and 1-Interchange, which are well-known local optimization methods (see Sect. 3.4). After applying these operators, we evaluate the fitness value of the new individual (line 11), and insert it in the new (auxiliary) population (line 12) only if its fitness value is larger than that of the parent located at that position in the current population (elitist replacement). After applying the above-mentioned operators to all the individuals, we replace the old population with the new one at once (line 15), and we then calculate some statistics (line 16). Note that new individuals replace the old ones en bloc (synchronously) and not incrementally (see [22] for other replacement techniques). The algorithm stops when an optimal solution is found or when a predetermined maximum number of generations is reached.
The fitness value assigned to every individual is computed as follows [14, 15]:
$$f(S) = F_{\mathrm{CVRP}}(S) + \lambda \cdot \mathrm{overcap}(S) + \mu \cdot \mathrm{overtm}(S), \qquad (3)$$
$$f_{\mathrm{eval}}(S) = f_{\max} - f(S). \qquad (4)$$
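A minimal Python sketch of Eqs. (3) and (4) follows. It rests on several assumptions that the chapter does not spell out: overcap(S) and overtm(S) are taken to be the excess load and excess duration summed over all routes, the helper names and the tiny instance are hypothetical, and the value passed as f_max is an arbitrary constant chosen for the example. The weights use the value 1000 that the chapter reports for λ and μ.

```python
LAMBDA = 1000.0   # weight of the capacity violation, as reported in the text
MU = 1000.0       # weight of the duration violation, as reported in the text


def route_cost(route, cost, drop_time):
    """Eq. (1), with the depot at index 0 added at both ends of the route."""
    path = [0] + list(route) + [0]
    travel = sum(cost[a][b] for a, b in zip(path, path[1:]))
    return travel + sum(drop_time[v] for v in path[:-1])


def penalized_cost(routes, cost, demand, drop_time, capacity, max_duration):
    """Eq. (3): total route cost plus weighted constraint violations."""
    total = overcap = overtm = 0.0
    for r in routes:
        c = route_cost(r, cost, drop_time)
        total += c
        # Assumption: violations are accumulated over the routes.
        overcap += max(0.0, sum(demand[v] for v in r) - capacity)
        overtm += max(0.0, c - max_duration)
    return total + LAMBDA * overcap + MU * overtm


def eval_fitness(routes, cost, demand, drop_time, capacity, max_duration, f_max):
    """Eq. (4): the value the algorithm maximizes."""
    return f_max - penalized_cost(routes, cost, demand, drop_time,
                                  capacity, max_duration)


# Hypothetical instance: depot 0 and customers 1..3.
cost = [[0, 2, 3, 4], [2, 0, 1, 5], [3, 1, 0, 2], [4, 5, 2, 0]]
demand = [0, 3, 4, 2]
drop_time = [0, 1, 1, 1]
routes = [[1, 2], [3]]

f_ok = penalized_cost(routes, cost, demand, drop_time, 7, 10)    # no penalty
f_bad = penalized_cost(routes, cost, demand, drop_time, 6, 10)   # 1 unit over capacity
print(f_ok, f_bad, eval_fitness(routes, cost, demand, drop_time, 6, 10, 2000.0))
```

With such large weights, a single unit of excess load outweighs the whole routing cost of this toy instance, so infeasible individuals are ranked below feasible ones without being discarded.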
The objective of our algorithm is to maximize $f_{\mathrm{eval}}(S)$ (4) by minimizing $f(S)$ (3). The value $f_{\max}$ must be greater than or equal to the cost of the worst feasible solution for the problem. Function $f(S)$ is computed by adding the total costs of all the routes ($F_{\mathrm{CVRP}}(S)$, see (2)), and penalizes the fitness value only in the case that the capacity of any vehicle and/or
the time of any route are exceeded. Functions ‘overcap(S)’ and ‘overtm(S)’ return the excess in capacity and time of the solution (respectively) with respect to the maximum allowed value of each route. The values returned by ‘overcap(S)’ and ‘overtm(S)’ are weighted by multiplying them by factors λ and μ, respectively. In this work we have used λ = μ = 1000 [23]. In Sects. 3.1 to 3.4 we explain in detail the main features that characterize our algorithm (JCell). The algorithm can be applied with all the mentioned operations, or with only some of them in order to analyze their separate contribution to the performance of the search.
3.1 Problem Representation
In a GA, individuals represent candidate solutions. A candidate solution to an instance of the CVRP must specify the number of vehicles required, the allocation of the demands to all these vehicles, and also the delivery order of each route. We adopted a representation consisting of a permutation of integer numbers. Each permutation contains both customers and route splitters (delimiting different routes), so we use a permutation of the numbers [0 ... n − 1] with length n = c + k to represent a solution for the CVRP with c customers and a maximum of k + 1 possible routes. Customers are represented with numbers [0 ... c − 1], while route splitters belong to the range [c ... n − 1]. Note that, due to the nature of the chromosome (a permutation of integer numbers), route splitters must be different numbers, although the same number could designate every route splitter under a different chromosome configuration. Each route is composed of the customers between two route splitters in the individual. For example, in Fig. 3 we plot an individual representing a possible solution for a hypothetical CVRP instance with 10 customers using at most 4 vehicles. Values [0, ..., 9] represent the customers while [10, ..., 12] are the route splitters.
Route 1 begins at the depot, visits customers 4–5–2 (in that order), and returns to the depot. Route 2 goes from the depot to customers 0–3–1 and returns. The vehicle of Route 3 starts at the depot and visits customers 7–8–9. Finally, in Route 4, only customer 6 is visited from the depot. Empty routes are allowed in this representation simply by placing two route splitters contiguously without clients between them.
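The decoding of such a permutation into routes can be sketched as follows. This is only an illustrative sketch: the function name `decode_routes` and the concrete gene order of the example (in particular, which splitter value sits between which routes) are assumptions, chosen so that the decoded routes match the four routes described above.

```python
def decode_routes(individual, num_customers):
    """Split a permutation into routes.

    Genes smaller than num_customers are customers; any other gene is a
    route splitter. k splitters always yield k + 1 routes, and two
    contiguous splitters yield an empty route.
    """
    routes, current = [], []
    for gene in individual:
        if gene >= num_customers:   # route splitter: close the current route
            routes.append(current)
            current = []
        else:                       # customer: extend the current route
            current.append(gene)
    routes.append(current)          # the last route has no trailing splitter
    return routes


# Hypothetical gene order reproducing the routes described in the text.
example = [4, 5, 2, 10, 0, 3, 1, 11, 7, 8, 9, 12, 6]
routes = decode_routes(example, num_customers=10)
```

Each returned list holds the customers of one route in visiting order; the depot is implicit at both ends of every route.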
Fig. 3. Individual representing a solution for 10 customers and 4 vehicles
E. Alba and B. Dorronsoro
Fig. 4. Edge recombination operator (panels STEP1, STEP2 with a random choice, STEP3 and END show the offspring as it is built)
3.2 Recombination

Recombination is used in GAs as an operator for combining parts of two (or more) patterns in order to transmit (hopefully) good building blocks to their offspring. The recombination operator we use is the edge recombination operator (ERX) [24], since it has been widely reported as the most appropriate for permutations compared to other general operators like order crossover (OX) or partially matched crossover (PMX). ERX builds an offspring by preserving edges from its two parents. For that, an edge list is used. This list contains, for each city, the edges of the parents that start or finish in it (see Fig. 4). After constructing the edge list of the two parents, the ERX algorithm builds one child solution as follows. The first gene of the offspring is chosen from between the first genes of the two parents. Specifically, the gene having the lower number of edges is selected. In the case of a tie, the first gene of the first parent is chosen. The other genes are chosen by taking, from among the neighbors of the last selected gene, the one with the shortest edge list. Ties are broken by choosing the first city found that fulfills this shortest-list criterion. Once a gene is selected, it is removed from the edge list.

3.3 Mutation

The mutation operator we use in our algorithms plays an important role during the evolution, since it is in charge of introducing a considerable degree of diversity in each generation, thus counteracting the strong selective pressure that results from the local search method we plan to use. The mutation consists in applying Insertion, Swap or Inversion operations to each gene with equal probability (see Algorithm 3.2). These three mutation operators (see Fig. 5) are well-known methods found in the literature, and typically applied sooner than later in routing
A Hybrid cGA for the CVRP
Algorithm 3.2 The mutation algorithm
proc Mutation(pm, ind) // ‘pm’ is the mutation probability, and ‘ind’ is the individual to mutate
  for i ← 1 to ind.Length() do
    if Rand0To1() < pm then
      r = Rand0To1();
      if r < 0.33 then
        ind.Inversion(i, RandomInt(ind.Length()));
      else if r > 0.66 then
        ind.Insertion(i, RandomInt(ind.Length()));
      else
        ind.Swap(i, RandomInt(ind.Length()));
      end if
    end if
  end for
end proc Mutation;
Fig. 5. The three different mutations used (Insertion, Swap and Inversion, each shown as a parent and its offspring)
problems. Our idea here has been to merge these three in a new combined operator. The Insertion operator [25] selects a gene (either customer or route splitter) and inserts it in another randomly selected place of the same individual. Swap [26] consists in randomly selecting two genes in a solution and exchanging them. Finally, Inversion [27] reverses the visiting order of the genes between two randomly selected points of the permutation. Note that the induced changes might occur in an intra- or inter-route way in all the three operators. Formally stated, given a potential solution $S = \{s_1, \ldots, s_{p-1}, s_p, s_{p+1}, \ldots, s_{q-1}, s_q, s_{q+1}, \ldots, s_n\}$, where p and q are randomly selected indexes, and n is the sum of the number of customers plus the number of route splitters (n = c + k), then the new solution $S'$ obtained after applying each of the different proposed mechanisms is shown below:

Insertion: $S' = \{s_1, \ldots, s_{p-1}, s_{p+1}, \ldots, s_{q-1}, s_q, s_p, s_{q+1}, \ldots, s_n\}$, (5)

Swap: $S' = \{s_1, \ldots, s_{p-1}, s_q, s_{p+1}, \ldots, s_{q-1}, s_p, s_{q+1}, \ldots, s_n\}$, (6)

Inversion: $S' = \{s_1, \ldots, s_{p-1}, s_q, s_{q-1}, \ldots, s_{p+1}, s_p, s_{q+1}, \ldots, s_n\}$. (7)
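The three operators of Eqs. (5) to (7) and their combination can be illustrated with the following minimal sketch. It uses 0-based indexes, and the function names, the use of a `random.Random` generator and the way the second position is drawn are assumptions made for illustration only; the sketch is not the JCell implementation.

```python
import random


def insertion(s, p, q):
    """Eq. (5): remove the gene at index p and re-insert it at index q."""
    out = list(s)
    gene = out.pop(p)
    out.insert(q, gene)
    return out


def swap(s, p, q):
    """Eq. (6): exchange the genes at indexes p and q."""
    out = list(s)
    out[p], out[q] = out[q], out[p]
    return out


def inversion(s, p, q):
    """Eq. (7): reverse the genes between indexes p and q (inclusive)."""
    lo, hi = min(p, q), max(p, q)
    return s[:lo] + s[lo:hi + 1][::-1] + s[hi + 1:]


def combined_mutation(ind, pm, rng):
    """Visit every gene; with probability pm apply one of the three
    operators, chosen with equal probability (assumed reading of
    Algorithm 3.2)."""
    out = list(ind)
    for i in range(len(out)):
        if rng.random() < pm:
            j = rng.randrange(len(out))   # second, randomly selected position
            r = rng.random()
            if r < 1 / 3:
                out = inversion(out, i, j)
            elif r < 2 / 3:
                out = swap(out, i, j)
            else:
                out = insertion(out, i, j)
    return out
```

Since all three operators only rearrange genes, the result is always a valid permutation, so customers and route splitters are never lost or duplicated.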
3.4 Local Search

The existing literature on VRP makes it very clear that using a local search method is almost mandatory to achieve high-quality results [14, 17, 28]. This is why we envisioned from the beginning the application of two of the most successful techniques in recent years. Specifically, we add a local refining step to some of our algorithms, which consists in applying 2-Opt [29] and 1-Interchange [30] local optimization to every individual.
Fig. 6. 2-Opt works within a route (a), while λ-Interchange affects two routes (b)
On the one hand, the 2-Opt simple local search method works inside each route. It randomly selects two non-adjacent edges (e.g., (a, b) and (c, d)) of a single route, deletes them, thus breaking the tour into two parts, and then reconnects those parts in the other possible way: (a, c) and (b, d) (Fig. 6a). Hence, given a route $R = \{r_1, \ldots, r_a, r_b, \ldots, r_c, r_d, \ldots, r_n\}$, where $(r_a, r_b)$ and $(r_c, r_d)$ are two randomly selected non-adjacent edges, the new route $R'$ obtained after applying the 2-Opt method to the two considered edges is $R' = \{r_1, \ldots, r_a, r_c, \ldots, r_b, r_d, \ldots, r_n\}$.

On the other hand, the λ-Interchange local optimization method that we use is based on the analysis of all the possible combinations for up to λ customers between sets of routes (Fig. 6b). Hence, this method results in customers either being shifted from one route to another, or being exchanged between routes. The mechanism can be described as follows. A solution to the problem is represented by a set of routes $S = \{R_1, \ldots, R_p, \ldots, R_q, \ldots, R_k\}$, where $R_i$ is the set of customers serviced in route i. New neighboring solutions can be obtained after applying λ-Interchange between a pair of routes $R_p$ and $R_q$; to do so we replace each subset of customers $S_1 \subseteq R_p$ of size $|S_1| \le \lambda$ with any other one $S_2 \subseteq R_q$ of size $|S_2| \le \lambda$. This way, we obtain two new routes $R'_p = (R_p - S_1) \cup S_2$ and $R'_q = (R_q - S_2) \cup S_1$, which are part of the new solution $S' = \{R_1, \ldots, R'_p, \ldots, R'_q, \ldots, R_k\}$. In short, 2-Opt searches for better solutions by modifying the order in which customers are visited inside a route, while λ-Interchange moves customers between routes.
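A single 2-Opt move and an exhaustive scan over all such moves in one route can be sketched as follows. This is a minimal sketch under stated assumptions: node 0 is taken to be the depot, routes list only customers (the depot is implicit at both ends), distances are Euclidean, and the names `route_length`, `two_opt_move` and `best_two_opt` are hypothetical.

```python
import math


def route_length(route, coords, depot=0):
    """Length of the tour depot -> route[0] -> ... -> route[-1] -> depot."""
    path = [depot] + list(route) + [depot]
    return sum(math.dist(coords[a], coords[b]) for a, b in zip(path, path[1:]))


def two_opt_move(route, i, j):
    """Delete the edge entering route[i] and the edge leaving route[j]
    (i < j), then reconnect the tour by reversing route[i..j]."""
    return route[:i] + route[i:j + 1][::-1] + route[j + 1:]


def best_two_opt(route, coords, depot=0):
    """Try every pair of non-adjacent edges and keep the shortest result."""
    best, best_len = list(route), route_length(route, coords, depot)
    for i in range(len(route) - 1):
        for j in range(i + 1, len(route)):
            cand = two_opt_move(route, i, j)
            cand_len = route_length(cand, coords, depot)
            if cand_len < best_len - 1e-12:
                best, best_len = cand, cand_len
    return best, best_len


# Depot and three customers on the corners of a unit square.
coords = [(0, 0), (0, 1), (1, 1), (1, 0)]
improved, improved_len = best_two_opt([1, 3, 2], coords)
```

In the notation of the text, `route[i]` plays the role of $r_b$ and `route[j]` that of $r_c$, so reversing the segment between them yields $R'$.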
This local search step is applied to an individual after the recombination and mutation operators, and returns the best solution between the best ones found by 2-Opt and 1-Interchange, or the current one if it is better (see a pseudocode in Algorithm 3.3). In the local search step, the algorithm applies 2-Opt to all the pairs of non-adjacent edges in every route and 1-Interchange to all the subsets of up to 1 customer between every pair of routes. In summary, the functioning of JCell is quite simple: in each generation, an offspring is obtained for every individual after applying the recombination operator (ERX) to its two parents (selected from its neighborhood).
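The 1-Interchange neighborhood of a pair of routes used in the local search step (subsets of up to one customer on each side) can be enumerated as in the sketch below. The λ-Interchange definition treats routes as sets and does not fix where an incoming customer is placed, so the placement used here (appended at the end for a shift, in the position of the outgoing customer for an exchange) is an assumption, as is the function name.

```python
def one_interchange_neighbors(route_p, route_q):
    """Yield (new_p, new_q) for every 1-Interchange move between two routes."""
    # Shift one customer from route p to route q (|S1| = 1, |S2| = 0).
    for i, c in enumerate(route_p):
        yield route_p[:i] + route_p[i + 1:], route_q + [c]
    # Shift one customer from route q to route p (|S1| = 0, |S2| = 1).
    for j, c in enumerate(route_q):
        yield route_p + [c], route_q[:j] + route_q[j + 1:]
    # Exchange one customer of route p with one of route q (|S1| = |S2| = 1).
    for i, a in enumerate(route_p):
        for j, b in enumerate(route_q):
            yield (route_p[:i] + [b] + route_p[i + 1:],
                   route_q[:j] + [a] + route_q[j + 1:])


neighbors_1i = list(one_interchange_neighbors([1, 2], [3]))
```

For routes with |Rp| and |Rq| customers this yields |Rp| + |Rq| + |Rp|·|Rq| candidate pairs; a full local search step would evaluate each candidate (including capacity and route-length penalties) and keep the best one.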
Algorithm 3.3 The local search step
proc Local_Search(ind) // ‘ind’ is the individual to be improved
  // MAX_STEPS = 20
  for s ← 0 to MAX_STEPS do
    // First: 2-Opt. K stands for the number of routes
    best2-Opt = ind;
    for r ← 0 to K do
      sol = 2-Opt(ind, r);
      if Better(sol, best2-Opt) then
        best2-Opt = sol;
      end if
    end for
  end for
  // Second: 1-Interchange.
  best1-Interchange = ind;
  for s ← 0 to MAX_STEPS do
    for i ← 0 to Length(ind) do
      for j ← i+1 to Length(ind) do
        sol = 1-Interchange(i, j);
        if Better(sol, best1-Interchange) then
          best1-Interchange = sol;
        end if
      end for
    end for
  end for
  Better(best2-Opt, best1-Interchange)? return best2-Opt : return best1-Interchange;
end proc Local_Search;
The offspring are mutated with a special combined mutation, and a local post-optimization step is then applied to the mutated individuals. This local search step applies two different local search methods (2-Opt and 1-Interchange) to the individual, and returns the best individual from among the input individual and the outputs of 2-Opt and 1-Interchange. The population of the next generation is composed of the current one after replacing its individuals with their offspring whenever the offspring are better.
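One generation of the scheme summarized above can be sketched as follows. Several details are assumptions made only for illustration: the neighborhood shape (north, east, south, west on a toroidal grid), a binary tournament inside the neighborhood to pick the mate of the current individual, a synchronous update into an auxiliary population, and the operators being passed in as callables. The recombination probability is omitted for brevity.

```python
import random


def neighbors(idx, rows, cols):
    """Indexes of the north, east, south and west cells of a toroidal grid
    (assumed neighborhood shape)."""
    r, c = divmod(idx, cols)
    return [((r - 1) % rows) * cols + c,
            r * cols + (c + 1) % cols,
            ((r + 1) % rows) * cols + c,
            r * cols + (c - 1) % cols]


def generation(pop, rows, cols, cost, recombine, mutate, local_search, rng):
    """Breed one offspring per cell and keep it only if it is better."""
    new_pop = list(pop)                       # auxiliary population
    for idx, current in enumerate(pop):
        # Binary tournament among the neighbors selects the second parent.
        a, b = rng.sample(neighbors(idx, rows, cols), 2)
        mate = pop[a] if cost(pop[a]) <= cost(pop[b]) else pop[b]
        child = local_search(mutate(recombine(current, mate)))
        if cost(child) < cost(current):       # replace if better
            new_pop[idx] = child
    return new_pop
```

With the operators of Sects. 3.2 to 3.4 plugged in as `recombine`, `mutate` and `local_search`, repeated calls to `generation` would give the overall loop; by construction no cell ever gets worse from one generation to the next.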
4 Looking for a New Algorithm: the Way to JCell2o1i

In this section we present a preliminary study in order to justify some choices made in the final design of our algorithm, JCell2o1i. Specifically, in Sect. 4.1 we exemplify the kind of advantage obtained from using a decentralized (cellular) population instead of a panmictic (non-structured) one. In Sect. 4.2 we analyze the behavior of three algorithms, each using one of the three mutation methods separately (all very commonly used in routing problems), which will later justify our proposed combined mutation. Finally, in Sect. 4.3 we justify the choice of the local search method used (composed of two well-known methods). The performances observed in this initial study will guide us to the final proposal of our hybrid cGA. For the present studies, some instances of the hard benchmark of Christofides, Mingozzi and Toth [3] have been chosen. Specifically, we use
Table 1. Parameterization used in the algorithm

Population Size: 100 Individuals (10 × 10)
Selection of Parents: Binary Tournament + Current Individual
Recombination: ERX, pc = 0.65
Individual Mutation: Insertion, Swap or Inversion (equal prob.), pm = 0.85
Probability of Bit Mutation: pbm = 1.2/L
Individual Length: L
Replacement: Rep if Better
Local Search (LS): 2-Opt + 1-Interchange, 20 optimization steps
the three smallest instances of the benchmark with and without drop times and maximum route length restrictions (CMT-n51-k5, CMT-n76-k10 and CMT-n101-k8, and CMTd-n51-k6, CMTd-n76-k11 and CMTd-n101-k9). In order to obtain a more complete benchmark (and thus, more reliable results), we have additionally selected the clustered instances CMT-n101-k10 and CMTd-n101-k11. This last one (CMTd-n101-k11) is obtained after adding drop times and maximum route lengths to CMT-n101-k10. The algorithms studied in this section are compared by means of the average total travelled length of the solutions found, the average number of evaluations needed to get those solutions, and the hit rate (percentage of successful runs) over 30 independent runs. Hence, we are comparing them in terms of their accuracy, efficiency, and efficacy. In order to check the statistical reliability of the comparisons, ANOVA tests are performed. The parameters used for all the algorithms in the current work are essentially those of Table 1, with some exceptions mentioned in each case. The average values in the tables are shown along with their standard deviation, and the best ones are shown in bold.

4.1 Cellular vs. Generational Genetic Algorithms

We compare in this section the behavior of our cellular GA, JCell2o1i, with a generational panmictic one (genGA) in order to illustrate our decision of using a structured cellular model for solving the CVRP. The same algorithm is run in the two cases, the only difference being the neighborhood structure of the cGA. As we will see, structuring the population has been useful for improving the algorithm in many domains, since population diversity is better maintained than in a non-structured population. The results of this comparison are shown in Table 2.
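The three comparison metrics can be computed from the raw runs as in the following sketch. Two points are assumptions rather than statements from the text: a run counts as a hit when its solution length reaches the best-known value within a small tolerance, and the average number of evaluations is taken over successful runs only (which would explain why no evaluation count is reported when the hit rate is zero). The function name and the sample data are hypothetical.

```python
from statistics import mean


def summarize_runs(runs, best_known, tol=1e-6):
    """runs: one (solution_length, evaluations) pair per independent run."""
    lengths = [length for length, _ in runs]
    hit_evals = [evals for length, evals in runs if length <= best_known + tol]
    return {
        "avg_length": mean(lengths),                          # accuracy
        "avg_evals": mean(hit_evals) if hit_evals else None,  # efficiency
        "hit_rate": 100.0 * len(hit_evals) / len(runs),       # efficacy (%)
    }


# Hypothetical data: 16 successful runs out of 30.
sample_runs = [(524.61, 1000)] * 16 + [(530.0, 5000)] * 14
summary = summarize_runs(sample_runs, best_known=524.61)
```

With 30 runs the hit rate can only take multiples of 3.33%, which matches the granularity of the hit rates reported in the tables of this section.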
As can be seen, the genGA is not able to find the optimum for half of the tested instances (CMT-n76-k10, CMT-n101-k8, CMTd-n51-k6, and CMTd-n101-k9), while JCell2o1i always finds it, and with larger hit rates and lower effort. In Fig. 7 we graphically compare the two algorithms by plotting the difference of the values obtained (genGA minus cGA) in terms of the accuracy –average solution length– (Fig. 7a), efficiency –average evaluations– (Fig. 7b),
Table 2. Cellular and generational genetic algorithms

Instance | Generational GA: Avg. Sol. Length | Avg. Evals. | Hit (%) | JCell2o1i: Avg. Sol. Length | Avg. Evals. | Hit (%)
CMT-n51-k5 | 527.88 ±3.70 | 98158.34 ±23550.01 | 40.00 | 525.19 ±1.52 | 26374.31 ±6355.15 | 53.33
CMT-n76-k10 | 845.00 ±6.22 | - | 0.00 | 842.77 ±4.58 | 72102.00 ±0.00 | 3.33
CMT-n101-k8 | 833.50 ±4.07 | - | 0.00 | 832.43 ±0.00 | 177248.00 ±0.00 | 3.33
CMT-n101-k10 | 828.98 ±14.65 | 321194.74 ±122200.58 | 63.33 | 820.89 ±4.88 | 70637.41 ±13104.67 | 90.00
CMTd-n51-k6 | 561.37 ±3.03 | - | 0.00 | 558.28 ±2.05 | 27969.15 ±6050.44 | 23.33
CMTd-n76-k11 | 923.47 ±9.43 | 344500.00 ±62044.26 | 10.00 | 918.50 ±7.42 | 64964.50 ±25595.15 | 6.67
CMTd-n101-k9 | 879.20 ±5.78 | - | 0.00 | 876.94 ±5.89 | 91882.00 ±0.00 | 3.33
CMTd-n101-k11 | 868.34 ±4.50 | 342742.86 ±125050.93 | 63.33 | 867.35 ±2.04 | 333926.32 ±125839.62 | 70.00
Fig. 7. Comparison between the generational GA and JCell2o1i (values of genGA minus JCell2o1i): (a) average distance difference, (b) average number of evaluations difference, (c) hit rate difference (%)
and efficacy –hit rate– (Fig. 7c). As can be seen, JCell2o1i outperforms the generational GA in all three compared parameters (8 × 3 = 24 tests), except for the hit rate in CMTd-n76-k11. The histogram of Fig. 7b is incomplete because no solutions were found by the genGA for some instances (see second column of Table 2). After applying t-tests to the results obtained for the number of evaluations made and the distance of the solutions, we conclude that there is always statistical confidence (p-values ≤ 0.05) for these claims between the two algorithms.

4.2 On the Importance of the Mutation Operator

Having clearly shown the higher performance of JCell over a generational GA, we now analyze its behavior in more depth. Hence, in this section we study the behavior of JCell when using each of the three proposed mutation operators separately, and we compare the results with those of JCell2o1i, which uses a combination of the three mutations. In Table 3 we can see the results obtained by the four algorithms (along with their standard deviations).
Table 3. Analysis of mutation JCell2o1iinv Inst.
CMT-n51-k5
Avg. Sol.
Avg. Evals.
527.38
1.52E5
±3 .48 ±8 .23E4
CMT-n76-k10 CMT-n101-k8 CMT-n101-k10
2.09E5
±3 .32 ±6 .53E5
JCell2o1isw Hit (%)
Avg. Sol.
Avg. Evals.
±1 .52 ±6 .36E3
6.19E5
3 844.81
—–
±5 .48
±0 .00
±5 .71
835.99
0 832.98
5.25E5
3 835.25
—– —–
±4 .54
—–
±4 .20
±0 .00
±5 .71
—–
824.91
3.05E6
73 820.27
2.71E5
90 826.98
3.48E5
9.85E4
921.57
4.01E5
876.08
7.59E5
±6 .51
±0 .00
CMTd-n101-k11 867.50
3.50E5
±3 .68 ±1 .31E5
17 558.61
2.60E5
±2 .37 ±1 .88E5
10 917.71
5.06E5
±7 .42 ±5 .51E4
3 873.26
5.55E5
±5 .77 ±1 .54E5
90 867.83
3.86E5
±3 .46 ±1 .38E5
Avg. Evals.
80 525.19 2.64E4
0 843.13
558.86
Avg. Sol.
1.48E5
—– —–
±3 .60 ±7 .49E4
JCell2o1i Hit (%)
±2 .48 ±7 .59E4
60 525.68
—–
±10 .21 ±1 .90E5
CMTd-n101-k9
57 527.02
Avg. Evals.
±4 .75
±2 .62 ±2 .15E4
CMTd-n76-k11
Avg. Sol.
844.79
±9 .15 ±1 .30E5
CMTd-n51-k6
JCell2o1iins Hit (%)
±9 .28 ±1 .56E5
0 842.77 7.21E4 ±4 .58
57
820.89 7.06E4
10 558.28 2.80E4 ±2 .05 ±6 .05E3
3.83E5
10 918.50 6.50E4
±8 .40 ±1 .03E4
±7 .42 ±2 .56E4
13 876.15
6.03E5
±6 .74 ±4 .12E5
7 876.94 9.19E4 ±5 .89
90 23 7 3
±0 .00
4.51E5
83 867.35 3.34E5
±4 .24 ±1 .73E5
±2 .04 ±1 .26E5
77 867.40
3
±4 .88 ±1 .31E4
1.04E5
7 918.76
3
±0 .00
±3 .06 ±3 .49E4
20 559.89
53
±0 .00
0 832.43 1.77E5 ±0 .00
Hit (%)
70
These algorithms differ from each other just in the mutation method applied: inversion (JCell2o1iinv), insertion (JCell2o1iins), swap (JCell2o1isw), or all three (JCell2o1i); for details on the mutation operators refer to Sect. 3.3. An interesting observation is that using a single mutation operator does not seem to produce a clear winner, i.e., no overall best performance of any of the three basic cellular variants can be concluded. For example, using the inversion mutation (JCell2o1iinv) we obtain very accurate solutions for most instances, although it is less accurate for some others (i.e., instances CMT-n51-k5, CMT-n101-k8, and CMTd-n76-k11). The same behavior is observed in terms of the average evaluations made and the hit rate, so it is clear that the best algorithm depends on the instance being solved when a single mutation operator is used. Hence, as in the work of Chellapilla [31], wherein his algorithm (evolutionary programming) is improved by combining two different mutations, we decided to develop the new mutation composed of the three proposed ones (called combined mutation) in order to exploit the behavior of the best of them for each instance. As we can see in Table 3 this idea works: JCell2o1i stands out as the best of the four compared methods. In terms of the average evaluations made, it is the best algorithm for all the tested instances. Moreover, in terms of accuracy and hit rate, JCell2o1i is the algorithm that finds the best values for the largest number of instances (see the bolded values for the three metrics). In Fig. 8 we show a graphical evaluation of the four algorithms in terms of accuracy, efficiency, and efficacy. JCell2o1i is, in general, more accurate than the other three algorithms, although differences are very small (Fig. 8a). In terms of the average evaluations made (efficiency), the algorithm using the combined mutation largely outperforms the other three (Fig. 8b), while the obtained values are slightly worse in the case of the hit rate only for a few instances (Fig. 8c). After applying the ANOVA test to the results of
Fig. 8. Comparison of the behavior of the cGA using the three different mutations independently and together in terms of (a) accuracy, (b) effort, and (c) efficacy (legend: JCellinv, JCellins, JCellsw, JCell2o1i)
the algorithms, no statistical differences were found, in general, in terms of accuracy, but the new combined mutation method is better than the other three with statistical confidence for the eight tested instances if we pay attention to the number of evaluations made (effort). This means that we have an algorithm of competitive accuracy but with greater proven efficiency.

4.3 Tuning the Local Search Step

Mutation has been shown to be an important ingredient for reducing the effort of the algorithm, but local search is also an influential operator, so let us analyze how deep its impact is. In the local search step, our proposal includes 2-Opt and 1-Interchange, and the natural question is whether other (more sophisticated) variants could have been used instead. To answer this question we test in this section the behavior of JCell (no local search) against several possibilities for hybridizing it. All the studied hybrid options are based on 2-Opt and λ-Interchange (two well-known local search methods for VRP): JCell with 2-Opt (JCell2o), with 1-Interchange (JCell1i), and two different combinations of 2-Opt and λ-Interchange, those with λ = 1 (JCell2o1i) and λ = 2 (JCell2o2i). All these algorithms implement the combined mutation. The results are shown in Table 4. From Table 4 we can conclude that it is possible to improve the performance of a cGA with the combined mutation for VRP just by including a local search step. JCell2o obtains better results than JCell (without local search) in all the studied instances (algorithm columns 1 and 2), but the improvement especially stands out when the λ-Interchange method is considered, with or without 2-Opt. The importance of this local search operator is substantial, to such an extent that the resulting algorithm is unable to find the best solution to the problem for any tested instance when λ-Interchange is not used.
Applying the ANOVA test to the solutions obtained by the algorithms we concluded that the three algorithms using λ-Interchange are better than the others with statistical confidence in terms of accuracy. The
Table 4. Analysis of the effects of using local search on the JCell with combined mutation

Inst. | JCell (no LS): Avg. Dist. | Avg. Evals. | Hit (%) | JCell2o: Avg. Dist. | Avg. Evals. | Hit (%) | JCell1i: Avg. Dist. | Avg. Evals. | Hit (%) | JCell2o1i: Avg. Dist. | Avg. Evals. | Hit (%) | JCell2o2i: Avg. Dist. | Avg. Evals. | Hit (%)
CMT-n51-k5 | 576.1 ±14.4 | - | 0 | 551.6 ±9.4 | - | 0 | 529.9 ±6.7 | 1.0E5 ±3.7E4 | 50 | 525.2 ±1.5 | 2.6E4 ±6.4E3 | 53 | 526.1 ±2.9 | 1.6E5 ±6.8E4 | 77
CMT-n76-k10 | 956.4 ±20.5 | - | 0 | 901.8 ±15.8 | - | 0 | 851.5 ±6.3 | - | 0 | 842.8 ±4.6 | 7.2E4 ±0.0 | 3 | 844.7 ±5.3 | 7.5E5 ±0.0 | 3
CMT-n101-k8 | 994.4 ±29.9 | - | 0 | 867.2 ±14.5 | - | 0 | 840.6 ±6.5 | - | 0 | 832.4 ±0.0 | 1.8E5 ±0.0 | 3 | 831.9 ±7.2 | 8.2E5 ±0.0 | 3
CMT-n101-k10 | 1053.6 ±43.5 | - | 0 | 949.6 ±29.0 | - | 0 | 822.8 ±5.9 | 2.3E5 ±4.2E5 | 10 | 820.9 ±4.9 | 7.1E4 ±1.3E4 | 90 | 823.3 ±7.6 | 1.2E5 ±2.7E4 | 17
CMTd-n51-k6 | 611.5 ±15.2 | - | 0 | 571.9 ±13.4 | - | 0 | 561.7 ±4.0 | 6.9E4 ±5.9E3 | 3 | 558.3 ±2.1 | 2.8E4 ±6.1E3 | 23 | 558.6 ±2.5 | 3.7E5 ±1.8E5 | 13
CMTd-n76-k11 | 1099.1 ±26.1 | - | 0 | 1002.6 ±21.7 | - | 0 | 924.0 ±8.2 | 2.0E5 ±0.0 | 3 | 918.50 ±7.4 | 6.5E4 ±2.6E4 | 7 | 917.7 ±9.1 | 7.1E5 ±3.5E5 | 13
CMTd-n101-k9 | 1117.5 ±34.8 | - | 0 | 936.5 ±15.1 | - | 0 | 882.2 ±10.4 | 3.8E5 ±0.0 | 13 | 876.94 ±5.9 | 9.2E4 ±0.0 | 3 | 873.7 ±4.9 | 2.87E5 ±1.2E5 | 80
CMTd-n101-k11 | 1110.1 ±58.8 | - | 0 | 1002.8 ±40.4 | - | 0 | 869.5 ±3.5 | 1.8E5 ±6.7E4 | 13 | 867.4 ±2.0 | 3.3E5 ±1.3E5 | 70 | 867.9 ±3.9 | 3.5E5 ±1.4E5 | 83
differences among these three algorithms in the average evaluations made are also statistically significant (with minor caveats). Regarding the two algorithms with the overall best results (those using 2-Opt and λ-Interchange), JCell2o1i always obtains better values than JCell2o2i (see bolded values) in terms of the average evaluations made (with statistical significance). Hence, applying 1-Interchange is faster than applying 2-Interchange (as can be expected, since it is a lighter operation), with greater accuracy and efficiency (fewer computational resources). We summarize in Fig. 9 the results of the compared algorithms. JCell and JCell2o do not appear in Figs. 9b and 9c because these two algorithms were not able to find the best-known solution for any of the tested instances. In terms of the average distance of the solutions found (Fig. 9a), it can be seen that the three algorithms using the λ-Interchange method have similar results, outperforming the rest in all the cases. JCell2o1i is the algorithm which needs the lowest number of evaluations to get the solution in almost all the cases (Fig. 9b). In terms of the hit rate (Fig. 9c), JCell1i is, in general, the worst algorithm (among the three using λ-Interchange) across the tested instances. To summarize in one example the studies made in this subsection, we plot in Fig. 10 the evolution of the best (10a) and the average (10b) solutions found during a run with the five algorithms studied in this section for CMT-n51-k5. All the algorithms implement the combined mutation, composed of the three methods studied in Sect. 3.3: insertion, inversion and swap. In Figs. 10a and 10b we can see a similar behavior, in general, for all the algorithms: the population quickly evolves towards solutions close to the best-known one (value 524.61) in all the cases. After zooming in on the graphics we can notice more clearly that both JCell and JCell2o converge more slowly than the others, and finally get stuck at local optima. JCell2o maintains the diversity for longer than the other algorithms. Although the algorithms with λ-Interchange converge faster than JCell and JCell2o, they are able to find
Fig. 9. Comparison of the behavior of the cGA without local search and four cGAs using different local search techniques in terms of (a) accuracy (Avg. Distance), (b) effort (Avg. Number of Evaluations), and (c) efficacy (Hit Rate (%)) (legend: JCell, JCell2o, JCell1i, JCell2o1i, JCell2o2i)
the optimal value, escaping from the local optima thanks to the λ-Interchange local search method.
5 Solving CVRP with JCell2o1i

In this section we describe the results of the experiments we have made for testing our algorithm on CVRP. JCell2o1i has been implemented in Java and tested on a 2.6 GHz PC under Linux with a very large test suite, composed of an extensive set of benchmarks drawn from the literature: (i) Augerat et al. (sets A, B and P) [32], (ii) Van Breedam [33], (iii) Christofides and Eilon [34], (iv) Christofides, Mingozzi and Toth (CMT) [3], (v) Fisher [35], (vi) Golden et al. [5], (vii) Taillard [36], and (viii) a benchmark generated from TSP instances [37]. All these benchmarks, as well as the best-known solutions for their instances, are publicly available at [4]. Due to the heterogeneity of this huge test suite, the studied instances have many different features, which represents a hard test for JCell2o1i; the suite is also much larger than those usually studied in a single chapter in the VRP literature. The parameters used by JCell2o1i in all tests are those previously listed in Table 1. In Sects. 5.1 to 5.8 we analyze in detail the results obtained with JCell2o1i for the 160 instances composing our test suite. The numerical results of the
Fig. 10. Evolution of the best (a) and average (b) solutions for CMT-n51-k5 with some algorithms differing only in the local search applied (axes: Distance vs. Evaluations; legend: JCell, JCell2o, JCell1i, JCell2o1i, JCell2o2i)
algorithms themselves (optimal routes, costs, etc.) are shown in Appendix B, where the figures in the tables are calculated after making 100 independent runs (for statistical significance), except for Table 21, where 30 runs were made due to the computational difficulty of the benchmark. The new best-known solutions found here for the first time are shown in Appendix A. We use the same nomenclature for all the instances. It is composed of an identifier of the benchmark to which the instance belongs, followed by an ‘n’ and the number of nodes (customers to serve plus depot), and a ‘k’ with the number of vehicles used in the best-known solution (i.e., bench-nXX-kXX). In this section, the metrics for the performance of the algorithm are (see tables of results in Appendix B) the value of the best solution found for each instance, the average number of evaluations made until that best solution was found, the success rate, the average value of the solutions found in all the independent runs, the deviation between our best solution and the best-known
one so far for the instance (in percentage), and the previously best-known solution for each instance. The deviation between our best solution (sol) and the best-so-far one (best) is calculated by Equation 8:

$\Delta(\mathit{sol}) = \left( \frac{\mathit{best} - \mathit{sol}}{\mathit{best}} \right) \times 100$. (8)

In Sects. 5.1, 5.2, 5.3, 5.5, and 5.8, distances between customers have been rounded to the closest integer value, following the TSPLIB convention [37]. In the other sections, no rounding has been done. We discuss each benchmark in a separate section.

5.1 Benchmark by Augerat et al.

This benchmark, proposed in 1995, is composed of three sets of instances (Sets A, B, and P). The instances of each set are characterized by some distinct features, like the distribution of the customer locations. The best-known solutions have been proved to be the optimal ones for every instance of this benchmark [38]. In the remainder of this section we study the behavior of JCell2o1i when solving the three sets of this benchmark. The deviations of the solutions found are not plotted in this section because JCell2o1i finds the best-known ones for all the instances in this benchmark (except for B-n68-k9).

Set A. This set is made up of instances wherein both customer locations and demands are randomly generated by a uniform distribution. The size of the instances ranges from 31 to 79 customers. We can see in Table 14 that JCell2o1i solves all the instances to optimality. In Fig. 11 we plot the percentage of runs in which the algorithm was able to find the optimal solution for every instance (hit rate). The reader can notice that, in general, it is harder for the algorithm
Fig. 11. Success rate (hit rate, in %, per instance) for the benchmark of Augerat et al., set A
Fig. 12. Success rate (hit rate, in %, per instance) for the benchmark of Augerat et al., set B
to find the optimal solution as the number of customers grows, since the hit rate decreases as the size of the instance increases. Indeed, the hit rate is under 8% for instances involving more than 59 customers, although it is also low for smaller instances like A-n39-k5 and A-n45-k6.

Set B. The instances in this set are all mainly characterized by being clustered. This set does not seem to pose a real challenge to our algorithm, since it is able to find the optimal values for all of them (Table 15). There is just one exception (instance B-n68-k9), but the deviation between our best solution and the best-known one is really low (0.08%). In Fig. 12 we show the success rate obtained for every instance. This hit rate is lower than 5% for only a few instances (B-n50-k8, B-n63-k10, and B-n66-k9).

Set P. The instances in class ‘P’ are modified versions of other instances taken from the literature. In Table 16 we summarize our results. The reader can see that our algorithm is also able to find all the optimal solutions for this benchmark. The success rates are shown in Fig. 13, and the same trend as in the two previous sets appears: low hit rates for the largest instances (more than 60 customers). To end this section we just point out the remarkable properties of the algorithm, which is competitive with many other algorithms on all three sets.

5.2 Benchmark by Van Breedam

Van Breedam proposed in his thesis [33] a large set of 420 instances with many different features, like different distributions of the customers (single,
Fig. 13. Success rate for the benchmark of Augerat et al., set P (vertical axis: hit rate in %; one bar per instance)
clusters, cones, ...) and the depot (central, inside, outside) in the space, time windows, pick-up and delivery, or heterogeneous demands. Van Breedam also proposed a reduced set of instances (composed of 15 problems) from the initial one, and solved it with many different heuristics [33] (more recent works use this benchmark, but they report no new best solutions [39,40]). This reduced set of instances is the one we study in this chapter.
All the instances of this benchmark have the same number of customers (n = 100), and the total demand of the customers is always the same, 1000 units. If we adopted the nomenclature used in the rest of this chapter, instances using the same vehicle capacity would have the same name; to avoid this repetition, we use a special nomenclature for this benchmark, numbering the problems from Bre-1 to Bre-15.
Problems Bre-1 to Bre-6 are constrained only by vehicle capacity. The demand for each stop is 10 units. The vehicle capacity equals 100 units for Bre-1 and Bre-2, 50 units for Bre-3 and Bre-4, and 200 units for Bre-5 and Bre-6. Problems Bre-7 and Bre-8 are not studied in this work because they include pick-up and delivery constraints. A specific feature of this benchmark is the use of homogeneous demands at stops, i.e., all the customers of an instance require the same amount of goods. The exceptions are problems Bre-9 to Bre-11, for which the demands of customers are heterogeneous. The remaining four problems, Bre-12 to Bre-15, are not studied in this work because they include time window constraints (out of scope).
Our first conclusion is that JCell2o1i outperforms the best-so-far solutions for eight out of nine instances.
Furthermore, the average solution found over all the runs is better than the previously best-known solution in all eight improved problems (see Table 17), which indicates the high accuracy and stability of our algorithm. As we can see in Fig. 14, the algorithm finds the new best solutions in a high percentage of the runs for some instances (Bre-1, Bre-2, Bre-4, and Bre-6), while for other instances the new best solutions were found in several runs (Bre-5, Bre-9, Bre-10, and Bre-11). In Fig. 15 we show the deviations of the solutions found (the symbols mark
Fig. 14. Success rate for the benchmark of Van Breedam (vertical axis: hit rate in %; instances Bre-1 to Bre-6 and Bre-9 to Bre-11)
Fig. 15. Deviation rates for the benchmark of Van Breedam (vertical axis: deviation in %; same instances as Fig. 14; an asterisk marks an improved instance)
the improved instances). The new best solutions found for this benchmark are shown in Tables 5 to 12 in Appendix A. We have not found common patterns in the different instances (neither in the distribution of customers nor in the location of the depot) that would allow us to predict the accuracy of the algorithm when solving them. We have noticed that the four instances with the lowest hit rates (Bre-5, Bre-9, Bre-10, and Bre-11) have common features, such as the non-centered location of the depot and the presence of a single customer located far away from the rest, but these two features can also be found in instances Bre-2 and Bre-6. However, a common aspect of instances Bre-9 to Bre-11 (not present in the other instances of this benchmark) is the heterogeneous demands at the stops. Hence, we can conclude that the use of heterogeneous demands in this benchmark reduces the number of runs in which the algorithm obtains the best found solution (hit rate). Finally, in the case of Bre-3 (uniformly circle-distributed clusters and non-centered depot), the best-so-far solution could not be improved, but it was found in every run.
5.3 Benchmark by Christofides and Eilon
The instances that compose this benchmark range from easy problems of just 21 customers to more difficult ones of 100 customers. We can distinguish two different kinds of instances in the benchmark. In the smaller instances (up to
Fig. 16. Success rate for the benchmark of Christofides and Eilon (vertical axis: hit rate in %; one bar per instance)
32 customers), the depot and all the customers (except one) are in a small region of the plane, and only one customer is located far away from that region (as in some instances of the benchmark of Van Breedam). In the rest of the instances of this benchmark, the customers are randomly distributed in the plane and the depot is either in the center or near it. In order to compare our results with those found in the literature, distances between customers have been rounded to the closest integer value, as we previously did for the three sets of instances of Augerat et al. in Sect. 5.1. We show the percentage of runs in which the optimal solution is found in Fig. 16. The figure shows that the algorithm has some difficulty in solving the largest instances (over 50 customers, distributed over the whole problem area), for which the hit rate ranges from 1% to 8%. In the case of the smaller instances (32 customers or fewer), the solution is found in 100% of the runs, except for E-n31-k7 (67%). The deviations of the solutions found (∆) are not plotted because JCell2o1i is able to find the optimal solutions (∆ = 0.0) for all the instances in this benchmark (see Table 18).
5.4 Benchmark by Christofides, Mingozzi and Toth
The benchmark we present in this section has been well explored by many researchers. It is composed of ten instances in which the customers are located randomly in the plane, plus four clustered instances (CMT-n101-k10, CMT-n121-k7, CMTd-n101-k11, and CMTd-n121-k11). Among the fourteen problems composing this benchmark, there are seven basic VRP problems that do not include route length restrictions (those called CMT-nXX-kXX).
The other seven instances (named CMTd-nXX-kXX) use the same customer locations as the previous seven, but they are subject to additional constraints, such as limited route distances and drop times (time needed to unload goods) for visiting each customer. These instances are unique among the benchmarks studied in this work because they have drop times associated with the customers. These drop times are homogeneous for all the customers of an instance, i.e., all the customers spend the same time unloading goods. Conversely, the demands at the stops are heterogeneous, so different customers can require distinct amounts of goods. The four clustered instances have a non-centered depot, while in the other ones the depot is centered (with the exceptions of CMT-n51-k5 and CMTd-n51-k6). The reader can see in Table 19 that these instances are, in general, harder than all the previously studied ones, since for the first time in our study JCell2o1i is not able to find the optimal solution for some of them (see Fig. 17). Specifically, the best-known solution cannot be found by our algorithm for six instances (the largest ones). Nevertheless, it can be seen in Fig. 18 that the
Fig. 17. Success rate for the benchmark of Christofides, Mingozzi and Toth (vertical axis: hit rate in %; one bar per instance)
Fig. 18. Deviation rates for the benchmark of Christofides, Mingozzi and Toth (vertical axis: deviation in %; one bar per instance)
difference (∆) between our best solution and the best-known one is always lower than 2%. The best-known solution has been found for all the clustered instances, except for CMTd-n121-k11 (∆ = 0.16). Notice that the two non-clustered instances with a non-centered depot were solved in 51% (CMT-n51-k5) and 23% (CMTd-n51-k6) of the runs. Hence, the best-known solution was found for all instances having a non-centered depot, with the exception of CMTd-n121-k11.
5.5 Benchmark by Fisher
This benchmark is composed of just three instances, which are taken from real-life vehicle routing applications. Problems F-n45-k4 and F-n135-k7 represent a day of grocery deliveries from the Peterboro and Bramalea, Ontario terminals, respectively, of National Grocers Limited. The other problem (F-n72-k4) is concerned with the delivery of tires, batteries and accessories to gasoline service stations and is based on data obtained from Exxon (USA). The depot is not centered in any of the three instances. Since this benchmark also belongs to the TSPLIB, distances between customers are rounded to the nearest integer value. Once again, all the instances have been solved to optimality by our algorithm (see Table 20), at high hit rates for two out of the three instances (the exception is F-n135-k7, with a hit rate of 3%). In Fig. 19 we show the obtained hit rates. This figure clearly shows how the difficulty of finding the optimal solution grows with the size of the instance. Once more, the deviations (∆) have not been plotted because they are 0.00 for the three instances.
5.6 Benchmark by Golden et al.
In this subsection we deal with a difficult set of still larger-scale instances, ranging from 200 to 483 customers. The instances composing the benchmark are theoretical, and the customers are spread in the plane forming geometric figures, such as concentric circles (8 instances), squares (4 instances), rhombuses (4 instances) and stars (4 instances).
The depot is centered in all the instances except for the four having a rhombus shape, for which the depot is located at one vertex. Maximal route distances are considered only in instances named gold-nXXX-kXX. In the other ones (gol-nXXX-kXX), route distances are unlimited. No drop times are considered in this benchmark.
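The integer rounding of distances mentioned for the benchmarks compared against the literature (Sects. 5.3 and 5.5) can be illustrated with a minimal sketch. It assumes planar coordinates, Euclidean distances, and rounding to the nearest integer by adding 0.5 and truncating; the function names and the sample coordinates are hypothetical and are not taken from the chapter.

```python
import math


def rounded_distance(a, b):
    """Euclidean distance between points a and b, rounded to the nearest integer.

    Assumption: rounding is done by adding 0.5 and truncating, which is valid
    for non-negative values such as distances.
    """
    return int(math.hypot(a[0] - b[0], a[1] - b[1]) + 0.5)


def route_length(depot, stops):
    """Length of a route that leaves the depot, visits the stops in order, and returns."""
    path = [depot] + list(stops) + [depot]
    return sum(rounded_distance(p, q) for p, q in zip(path, path[1:]))


# Hypothetical coordinates, used only to show the effect of rounding each leg.
depot = (0, 0)
stops = [(3, 4), (6, 8)]
print(route_length(depot, stops))
```

Because every leg is rounded before summing, the length of a route can differ slightly from the rounded value of its exact Euclidean length, which is why results are only comparable when all algorithms use the same convention.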
Fig. 19. Success rate for the benchmark of Fisher (vertical axis: hit rate in %; instances F-n45-k4, F-n72-k4, and F-n135-k7)
Fig. 20. Deviation rates for the benchmark of Golden et al. (vertical axis: deviation in %; one bar per instance)
The results are given in Table 21 of Appendix B. As can be seen, the instances of this benchmark are especially hard to solve: the best-known solution was obtained only for instance gold-n201-k5 (with a hit rate of 2%). However, the deviation between the best-known value and the best found one is always very low (as in the previous sections): under 2.60% (see Fig. 20). We have also observed that the run time of our algorithm grows with the size of the instances. This is due to the exhaustive local search step we apply to every individual in each generation. In this local optimization step, the algorithm tests all the neighbors of a given individual obtained after applying the 2-Opt and 1-Interchange methods to it. Therefore, the number of individuals to test grows rapidly with the number of customers of the instance. Conversely, the accuracy of JCell2o1i does not necessarily decrease as the size of the instance increases, as shown in Fig. 20. For example, the accuracy when solving gol-n484-k19 is higher (lower deviation with respect to the best-known solution) than in the case of smaller instances such as gol-n481-k38, gol-n421-k41, gol-n400-k18, or even instances with more than 100 customers fewer (gol-n361-k33 and gol-n324-k16).
5.7 Benchmark by Taillard
We present in this subsection a hard VRP benchmark proposed by Taillard. It is composed of 13 nonuniform problems. The customers are located on the plane, spread out into several clusters, and both the number of such clusters and their density are quite variable. The quantities ordered by the customers are exponentially distributed (so the demand of one customer may require the entire capacity of one vehicle). There are four instances for each of three sizes: 75 customers (9 or 10 vehicles), 100 customers (10 or 11 vehicles), and 150 customers (14 or 15 vehicles). There
is also one pseudo-real problem involving 385 cities, which was generated as follows: each city is the most important town or village of the smallest political entity (commune) of the canton of Vaud, in Switzerland. The census of inhabitants of each commune was taken at the beginning of 1990. A demand of 1 unit per 100 inhabitants (but at least 1 for each commune) was considered, and vehicles have a capacity of 65 units. As we can see in Table 22, JCell2o1i is able to obtain the best-known solutions for the four smaller instances (those of 75 customers) and also for ta-n101-k11b (see Fig. 21 for the hit rates). Moreover, for instance ta-n76-k10b our algorithm has improved the best-known solution so far by 0.0015% (new solution = 1344.62); this solution is shown in Table 13 of Appendix A. Note that this new best solution is quite close in value to the previously existing one, but the two are very different in terms of the resulting routes. JCell2o1i has found the previous best solution for this instance in 30% of the runs made. In the instances for which the known best solution was found but not improved, the hit rate is over 60%. The deviations in the other instances, as in the case of the previous section, are very low (under 2.90%), as can be seen in Fig. 22, i.e., the algorithm is very stable.
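The demand rule described above for the 385-city instance (1 unit per 100 inhabitants, with a minimum of 1 unit per commune) can be written as a one-line formula. The sketch below is only illustrative: the text does not say how fractional values are rounded, so rounding down is an assumption, and the population figures are hypothetical.

```python
VEHICLE_CAPACITY = 65  # units, as stated for the pseudo-real instance


def commune_demand(inhabitants):
    """Demand of a commune: 1 unit per 100 inhabitants, but at least 1 unit.

    Assumption: fractional demands are rounded down (the rounding rule is
    not specified in the text).
    """
    return max(1, inhabitants // 100)


# Hypothetical populations, for illustration only.
for population in (50, 1234, 6500):
    demand = commune_demand(population)
    print(population, demand, demand <= VEHICLE_CAPACITY)
```

Under this rule a commune of 6500 inhabitants would already fill a whole vehicle, which is in line with the remark that a single customer may require the entire capacity of one vehicle in this benchmark.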
Fig. 21. Success rate for the benchmark of Taillard (vertical axis: hit rate in %; one bar per instance)
Fig. 22. Deviation rates for the benchmark of Taillard (vertical axis: deviation in %; * marks the new best solution found by JCell2o1i)
5.8 Benchmark of Translated Instances from TSP
This benchmark is made up of small instances (with sizes ranging from 15 to 47 customers) whose data have been reused from well-known instances for the Travelling Salesman Problem (TSP). As in the case of some instances of the benchmark of Van Breedam, the demands of customers are homogeneous in this benchmark, i.e., all the customers require the same amount of goods. Once more, JCell2o1i solves all the instances to optimality (see Table 23), with a (very) high percentage of runs finding the optimal solution for every instance. Indeed, the optimal solution was found in 100% of the runs for every instance except att-n48-k4 (85%) and gr-n48-k3 (39%), as shown in Fig. 23. Hence, the average solution found in the 100 runs made for each instance is very close to the optimal one, with a difference ranging from 0.00% in the cases with a 100% hit rate to 0.00071% for att-n48-k4. These results suggest that translations from TSP yield instances of just average difficulty compared to problems proposed directly as CVRP instances. Due to the large distances appearing in instance att-n48-k4, we have used values λ = µ = 10000 to prevent non-correct solutions from being evaluated as better than the optimal one (see Sect. 3).
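The role of the large penalty values can be illustrated with a generic penalized cost. The exact fitness function is the one defined in Sect. 3 and is not reproduced here; the sketch below merely assumes a common form in which the total route length is increased by λ times the excess load and µ times the excess route length. The function name, parameters, and numbers are hypothetical.

```python
def penalized_cost(route_lengths, route_loads, capacity,
                   max_route_length=None, lam=10000.0, mu=10000.0):
    """Total length plus penalties for capacity and route-length violations.

    Assumed generic form (not necessarily the chapter's exact function):
    cost = sum(lengths) + lam * excess load + mu * excess length.
    """
    total = sum(route_lengths)
    excess_load = sum(max(0.0, load - capacity) for load in route_loads)
    excess_length = 0.0
    if max_route_length is not None:
        excess_length = sum(max(0.0, length - max_route_length)
                            for length in route_lengths)
    return total + lam * excess_load + mu * excess_length


# Hypothetical example: overloading one vehicle by a single unit saves 500
# units of distance, but the penalty makes the overloaded solution worse.
feasible = penalized_cost([20000.0, 20500.0], [100, 100], capacity=100)
overloaded = penalized_cost([20000.0, 20000.0], [99, 101], capacity=100)
print(feasible, overloaded, feasible < overloaded)
```

With small penalty weights the saving in distance could outweigh the penalty when distances are large, which is the situation the values λ = µ = 10000 are meant to avoid.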
6 Conclusions and Further Work
In this chapter we have developed a single algorithm which is able to compete with the many different best-known optimization techniques for solving the CVRP. Our algorithm has been tested on a very large set of 160 instances with different features, e.g., uniformly and non-uniformly dispersed customers, clustered and non-clustered, with a centered or non-centered depot, with or without maximal route distances, with or without drop times, with homogeneous or heterogeneous demands, etc. We consider the behavior of our cellular algorithm with merged mutation plus local search to be very satisfactory, since it obtains the best-known solution in most cases, or values very close to it, for all of the test suite. Moreover,
Fig. 23. Success rate for the benchmark of translated instances from TSP (vertical axis: hit rate in %; one bar per instance)
it has been able to improve the best solution known so far for nine of the tested instances, which represents an important record in present research. Hence, we can say that the performance of JCell2o1i is similar to, or even better than, that of the best algorithm for each instance. Besides, our algorithm is quite simple, since we have designed a canonical cGA with three mutations widely used in the literature for this problem, plus two well-known local search methods. As future work, it may be interesting to test the behavior of the algorithm with other local search methods, e.g., 3-Opt. Another step is to adapt the algorithm to other variants of the studied problem, like VRP with time windows (VRPTW), multiple depots (MDVRP), or backhauls (VRPB). Finally, it will also be interesting to study the behavior of JCell2o1i after applying the local search step in a less exhaustive form, thus developing a faster algorithm.
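To give an idea of why an exhaustive local search step is costly, the following sketch enumerates a 2-Opt neighborhood for a single route. It is a simplified illustration and not the implementation used in JCell2o1i: it assumes that a 2-Opt move reverses one contiguous segment of one route, and it ignores moves between routes and the 1-Interchange method. The function names are hypothetical.

```python
def two_opt_neighbors(route):
    """Yield every route obtained by reversing one contiguous segment of `route`.

    Assumption: the depot is implicit at both ends of the route, and only
    intra-route moves are considered.
    """
    n = len(route)
    for i in range(n - 1):
        for j in range(i + 1, n):
            yield route[:i] + route[i:j + 1][::-1] + route[j + 1:]


def count_two_opt_neighbors(n):
    """Number of segment reversals for a route of n customers: n(n - 1)/2."""
    return n * (n - 1) // 2


route = [4, 9, 14, 60]
neighbors = list(two_opt_neighbors(route))
print(len(neighbors), count_two_opt_neighbors(len(route)))
for size in (10, 100, 480):
    print(size, count_two_opt_neighbors(size))
```

Even this single neighborhood grows quadratically with the route length, and it is evaluated for every individual in every generation, so a less exhaustive variant (for example, stopping at the first improving move) is a natural way to reduce the run time.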
7 Acknowledgement
This work has been partially funded by MCYT and FEDER under contract TIN2005-08818-C04-01 (the OPLINK project: http://oplink.lcc.uma.es).
References
1. Toth P, Vigo D (2001) The vehicle routing problem. Monographs on discrete mathematics and applications. SIAM, Philadelphia
2. Dantzig G, Ramser J (1959) The truck dispatching problem. Manag Sci 6:80–91
3. Christofides N, Mingozzi A, Toth P (1979) The vehicle routing problem. In: Combinatorial optimization. Wiley, New York, pp 315–338
4. http://neo.lcc.uma.es/radi-aeb/WebVRP/index.html
5. Golden B, Wasil E, Kelly J, Chao IM (1998) The impact of metaheuristics on solving the vehicle routing problem: algorithms, problem sets, and computational results. In: Fleet management and logistics. Kluwer, Boston, pp 33–56
6. Cordeau JF, Gendreau M, Hertz A, Laporte G, Sormany JS (2005) New heuristics for the vehicle routing problem. In: Langevin A, Riopel D (eds.) Logistics systems: design and optimization. Kluwer Academic, Dordrecht, Springer Verlag NY, pp 279–297
7. Manderick B, Spiessens P (1989) Fine-grained parallel genetic algorithm. In: Schaffer J (ed.) Proceedings of the third international conference on genetic algorithms – ICGA89. Morgan-Kaufmann, Los Altos, CA, pp 428–433
8. Alba E, Tomassini M (2002) Parallelism and evolutionary algorithms. IEEE Trans Evol Comput 6:443–462
9. Sarma J, Jong KD (1996) An analysis of the effect of the neighborhood size and shape on local selection algorithms. In: Voigt H, Ebeling W, Rechenberg I, Schwefel H (eds.) Parallel problem solving from nature (PPSN IV). Volume 1141 of lecture notes in computer science. Springer, Berlin Heidelberg New York, pp 236–244
10. Gorges-Schleuter M (1989) ASPARAGOS an asynchronous parallel genetic optimisation strategy. In: Schaffer JD (ed.) Proceedings of the third international conference on genetic algorithms – ICGA89. Morgan Kaufmann, Los Altos, CA, pp 422–427
11. Alba E, Dorronsoro B (2005) The exploration/exploitation tradeoff in dynamic cellular evolutionary algorithms. IEEE Trans Evol Comput 9:126–142
12. Giacobini M, Alba E, Tomassini M (2003) Selection intensity in asynchronous cellular evolutionary algorithms. In: E. Cantú-Paz et al. (ed.) Proceedings of the genetic and evolutionary computation conference, GECCO03. Springer, Berlin Heidelberg New York, pp 955–966
13. Giacobini M, Alba E, Tettamanzi A, Tomassini M (2004) Modeling selection intensity for toroidal cellular evolutionary algorithms. In: Deb K (ed.) Proceedings of the genetic and evolutionary computation conference, GECCO04. LNCS 3102, Seattle, Washington. Springer, Berlin Heidelberg New York, pp 1138–1149
14. Alba E, Dorronsoro B (2004) Solving the vehicle routing problem by using cellular genetic algorithms. In: Gottlieb J, Raidl GR (eds.) Evolutionary computation in combinatorial optimization – EvoCOP 2004. Volume 3004 of LNCS, Coimbra, Portugal. Springer, Berlin Heidelberg New York, pp 11–20
15. Gendreau M, Hertz A, Laporte G (1994) A tabu search heuristic for the vehicle routing problem. Manag Sci 40:1276–1290
16. Toth P, Vigo D (2003) The granular tabu search and its application to the vehicle routing problem. INFORMS J Comput 15:333–346
17. Berger J, Barkaoui M (2003) A hybrid genetic algorithm for the capacitated vehicle routing problem. In: Cantú-Paz E (ed.) Proceedings of the international genetic and evolutionary computation conference – GECCO03. LNCS 2723, Illinois, Chicago, USA. Springer-Verlag, Berlin, pp 646–656
18. Prins C (2004) A simple and effective evolutionary algorithm for the vehicle routing problem. Computers and Operations Research 31:1985–2002
19. Lenstra J, Kan AR (1981) Complexity of vehicle routing and scheduling problems. Networks 11:221–227
20. Ralphs T, Kopman L, Pulleyblank W, Trotter L Jr (2003) On the capacitated vehicle routing problem. Math Prog Ser B 94:343–359
21. Whitley D (1993) Cellular genetic algorithms. In: Forrest S (ed.) Proceedings of the fifth international conference on genetic algorithms. Morgan Kaufmann, Los Altos, CA, p 658
22. Alba E, Giacobini M, Tomassini M, Romero S (2002) Comparing synchronous and asynchronous cellular genetic algorithms. In: JJ Merelo et al. (ed.) Parallel problem solving from nature – PPSN VII. Volume 2439 of lecture notes in computer science, Granada, Spain. Springer-Verlag, Heidelberg, pp 601–610
23. Duncan T (1995) Experiments in the use of neighbourhood search techniques for vehicle routing. Technical report AIAI-TR-176, Artificial Intelligence Applications Institute, University of Edinburgh, Edinburgh
24. Whitley D, Starkweather T, Fuquay D (1989) Scheduling problems and traveling salesman: the genetic edge recombination operator. In: Schaffer J (ed.) Proceedings of the third international conference on genetic algorithms – ICGA89. Morgan-Kaufmann, Los Altos, CA, pp 133–140
25. Fogel D (1988) An evolutionary approach to the traveling salesman problem. Biol Cybernetics 60:139–144
26. Banzhaf W (1990) The molecular traveling salesman. Biol Cybernetics 64:7–14
27. Holland J (1975) Adaptation in natural and artificial systems. University of Michigan Press, Ann Arbor, MI
28. Rochat Y, Taillard E (1995) Probabilistic diversification and intensification in local search for vehicle routing. J Heuristics 1:147–167
29. Croes G (1958) A method for solving traveling salesman problems. Oper Res 6:791–812
30. Osman I (1993) Metastrategy simulated annealing and tabu search algorithms for the vehicle routing problems. Ann Oper Res 41:421–451
31. Chellapilla K (1998) Combining mutation operators in evolutionary programming. IEEE Trans Evol Comput 2:91–96
32. Augerat P, Belenguer J, Benavent E, Corberán A, Naddef D, Rinaldi G (1995) Computational results with a branch and cut code for the capacitated vehicle routing problem. Research Report 949-M, Université Joseph Fourier, Grenoble, France
33. Van Breedam A (1994) An analysis of the behavior of heuristics for the vehicle routing problem for a selection of problems with vehicle-related, customer-related, and time-related constraints. PhD thesis, University of Antwerp – RUCA, Belgium
34. Christofides N, Eilon S (1969) An algorithm for the vehicle dispatching problem. Oper Res Quart 20:309–318
35. Fisher M (1994) Optimal solution of vehicle routing problems using minimum k-trees. Oper Res 42–44:626–642
36. Taillard E (1993) Parallel iterative search methods for vehicle-routing problems. Networks 23:661–673
37. Reinelt G (1991) TSPLIB: A travelling salesman problem library. ORSA J Comput 3:376–384. URL: http://www.iwr.uni-heidelberg.de/groups/comopt/software/TSPLIB95/
38. Lysgaard J, Letchford A, Eglese R (2004) A new branch-and-cut algorithm for capacitated vehicle routing problems. Math Program 100:423–445
39. Van Breedam A (2001) Comparing descent heuristics and metaheuristics for the vehicle routing problem. Comput Oper Res 28:289–315
40. Van Breedam A (2002) A parametric analysis of heuristics for the vehicle routing problem with side-constraints. Eur J Oper Res 137:348–370
41. Fukasawa R et al (2004) Robust branch-and-cut-and-price for the capacitated vehicle routing problem. In: Integer programming and combinatorial optimization (IPCO). Volume 3064 of LNCS, New York, USA. Springer-Verlag, Berlin, pp 1–15
42. Gendreau M, Hertz A, Laporte G (1991) A tabu search heuristic for the vehicle routing problem. Technical Report CRT-777, Centre de Recherche sur les Transports – Université de Montréal, Montréal, Canada
43. Mester D, Bräysy O (2005) Active guided evolution strategies for large-scale vehicle routing problems with time windows. Comput & Oper Res 32:1593–1614
44. Li F, Golden B, Wasil E (2005) Very large-scale vehicle routing: new test problems, algorithms, and results. Comput & Oper Res 32:1165–1179
45. Tarantilis C, Kiranoudis C (2002) Boneroute: an adaptive memory-based method for effective fleet management. Ann Oper Res 115:227–241
46. Gambardella L, Taillard E, Agazzi G (1999) MACS-VRPTW: a multiple ant colony system for vehicle routing problems with time windows. In: New ideas in optimization. McGraw-Hill, New York, pp 63–76
47. http://branchandcut.org/VRP/data/
A Best Found Solutions
In this appendix we include all the details of the new best solutions found by JCell2o1i, to support precise future studies. The improved results belong to the benchmarks of Van Breedam (Tables 5 to 12) and Taillard (Table 13). In these tables, we show the number of routes composing the solution, its global length (sum of the distances of all routes), and the composition of each route (vehicle capacity used, distance of the route, and visiting order of the customers). Note that, due to the problem representation we use (see Sect. 3.1), the customers are numbered from 0 to c − 1 (0 stands for the first customer, and not for the depot), where c is the number of customers of the instance at hand. The appendix is included to make the work self-contained and meaningful for future extensions and comparisons.
Table 5. New best solution for instance Bre-1
Number of routes: 10
Total solution length: 1106.0
Route 1: capacity 100, distance 121.0, visited customers 69, 89, 99, 34, 39, 44, 49, 90, 70, 50
Route 2: capacity 100, distance 98.0, visited customers 1, 6, 11, 64, 84, 26, 21, 16, 73, 53
Route 3: capacity 100, distance 96.0, visited customers 0, 5, 10, 62, 82, 25, 20, 15, 71, 51
Route 4: capacity 100, distance 124.0, visited customers 56, 76, 96, 47, 42, 37, 32, 95, 85, 65
Route 5: capacity 100, distance 101.0, visited customers 2, 7, 12, 66, 86, 27, 22, 17, 75, 55
Route 6: capacity 100, distance 122.0, visited customers 58, 78, 98, 48, 43, 38, 33, 97, 87, 67
Route 7: capacity 100, distance 98.0, visited customers 4, 9, 14, 60, 80, 29, 24, 19, 79, 59
Route 8: capacity 100, distance 123.0, visited customers 54, 74, 94, 46, 41, 36, 31, 93, 83, 63
Route 9: capacity 100, distance 124.0, visited customers 52, 72, 92, 45, 40, 35, 30, 91, 81, 61
Route 10: capacity 100, distance 99.0, visited customers 57, 77, 18, 23, 28, 88, 68, 13, 8, 3
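A reader who wants to reuse the solutions of this appendix may first check them for internal consistency. The following minimal sketch does so for the Bre-1 solution of Table 5, using only facts stated in the chapter (100 customers numbered from 0, a demand of 10 units per stop, a vehicle capacity of 100 units, and the reported total length). The function name and the data layout are hypothetical, and the route distances are taken from the table rather than recomputed, since the customer coordinates are not given here.

```python
# Routes of the Bre-1 solution as listed in Table 5: (distance, visited customers).
BRE1_ROUTES = [
    (121.0, [69, 89, 99, 34, 39, 44, 49, 90, 70, 50]),
    (98.0, [1, 6, 11, 64, 84, 26, 21, 16, 73, 53]),
    (96.0, [0, 5, 10, 62, 82, 25, 20, 15, 71, 51]),
    (124.0, [56, 76, 96, 47, 42, 37, 32, 95, 85, 65]),
    (101.0, [2, 7, 12, 66, 86, 27, 22, 17, 75, 55]),
    (122.0, [58, 78, 98, 48, 43, 38, 33, 97, 87, 67]),
    (98.0, [4, 9, 14, 60, 80, 29, 24, 19, 79, 59]),
    (123.0, [54, 74, 94, 46, 41, 36, 31, 93, 83, 63]),
    (124.0, [52, 72, 92, 45, 40, 35, 30, 91, 81, 61]),
    (99.0, [57, 77, 18, 23, 28, 88, 68, 13, 8, 3]),
]


def check_solution(routes, n_customers, demand_per_stop, capacity, reported_total):
    """Return a list of detected problems; an empty list means all checks pass."""
    problems = []
    visited = sorted(c for _, customers in routes for c in customers)
    if visited != list(range(n_customers)):
        problems.append("customers are not visited exactly once")
    for index, (_, customers) in enumerate(routes, start=1):
        load = demand_per_stop * len(customers)
        if load > capacity:
            problems.append(f"route {index} exceeds capacity: {load} > {capacity}")
    total = sum(distance for distance, _ in routes)
    if abs(total - reported_total) > 1e-6:
        problems.append(f"route distances sum to {total}, not {reported_total}")
    return problems


print(check_solution(BRE1_ROUTES, 100, 10, 100, 1106.0))
```

The same function can be applied to the other homogeneous-demand instances by changing the capacity (50 or 200 units); for the heterogeneous instances the per-customer demands would be needed, which are not listed in this appendix.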
Table 6. New best solution for instance Bre-2
Number of routes: 10
Total solution length: 1506.0
Route 1: capacity 100, distance 121.0, visited customers 97, 87, 77, 67, 57, 56, 66, 76, 86, 96
Route 2: capacity 100, distance 121.0, visited customers 93, 83, 73, 63, 53, 52, 62, 72, 82, 92
Route 3: capacity 100, distance 184.0, visited customers 48, 38, 28, 18, 8, 9, 19, 29, 39, 49
Route 4: capacity 100, distance 149.0, visited customers 91, 81, 71, 61, 51, 50, 60, 70, 80, 90
Route 5: capacity 100, distance 184.0, visited customers 41, 31, 21, 11, 1, 0, 10, 20, 30, 40
Route 6: capacity 100, distance 158.0, visited customers 45, 35, 25, 15, 5, 4, 14, 24, 34, 44
Route 7: capacity 100, distance 108.0, visited customers 94, 84, 74, 64, 54, 55, 65, 75, 85, 95
Route 8: capacity 100, distance 149.0, visited customers 99, 89, 79, 69, 59, 58, 68, 78, 88, 98
Route 9: capacity 100, distance 166.0, visited customers 47, 37, 27, 17, 7, 6, 16, 26, 36, 46
Route 10: capacity 100, distance 166.0, visited customers 43, 33, 23, 13, 3, 2, 12, 22, 32, 42
B Results
The tables containing the results obtained by JCell2o1i when solving all the instances proposed in this work are shown in this appendix. The values included in Tables 14 to 23 are obtained after making 100 independent runs (for statistical significance), except for the benchmark of Golden et al. (Table 21), where only 30 runs were made due to its difficulty. The values in the tables belonging to this section correspond to the best solution found in our runs for each instance, the average number of evaluations made to find that solution, the success rate, the average value of the solutions found in all the independent runs, the deviation (∆) between our best solution and the previously best-known one (in percentage), and the known
Table 7. New best solution for instance Bre-4
Number of routes: 20
Total solution length: 1470.00
Route 1: capacity 50, distance 82.0, visited customers 46, 47, 48, 49, 41
Route 2: capacity 50, distance 51.0, visited customers 69, 66, 61, 62, 67
Route 3: capacity 50, distance 60.0, visited customers 99, 98, 97, 96, 90
Route 4: capacity 50, distance 74.0, visited customers 78, 81, 76, 74, 71
Route 5: capacity 50, distance 111.0, visited customers 88, 95, 93, 86, 82
Route 6: capacity 50, distance 39.0, visited customers 83, 91, 92, 84, 73
Route 7: capacity 50, distance 111.0, visited customers 20, 14, 10, 5, 12
Route 8: capacity 50, distance 82.0, visited customers 58, 50, 51, 52, 53
Route 9: capacity 50, distance 74.0, visited customers 29, 22, 19, 24, 27
Route 10: capacity 50, distance 51.0, visited customers 64, 59, 60, 65, 68
Route 11: capacity 50, distance 51.0, visited customers 35, 40, 39, 34, 31
Route 12: capacity 50, distance 111.0, visited customers 11, 4, 6, 13, 17
Route 13: capacity 50, distance 111.0, visited customers 87, 94, 89, 85, 79
Route 14: capacity 50, distance 74.0, visited customers 28, 21, 18, 23, 25
Route 15: capacity 50, distance 82.0, visited customers 54, 55, 56, 57, 63
Route 16: capacity 50, distance 74.0, visited customers 70, 77, 80, 75, 72
Route 17: capacity 50, distance 60.0, visited customers 9, 3, 2, 1, 0
Route 18: capacity 50, distance 39.0, visited customers 26, 15, 7, 8, 16
Route 19: capacity 50, distance 82.0, visited customers 36, 42, 43, 44, 45
Route 20: capacity 50, distance 51.0, visited customers 30, 33, 38, 37, 32
best solution for each instance. The execution times of our tests range from 153.50 hours for the most complex instance to 0.32 seconds in the case of the simplest one.
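The deviation column (∆) of the result tables can be reproduced with a short calculation. The sketch below assumes that ∆ is the absolute difference between our best solution and the previously best-known one, relative to the previously best-known value; this formula is not stated explicitly in the text but is consistent with the tabulated values (e.g., Bre-1 in Table 17 and gol-n241-k22 in Table 21). The function name is illustrative only.

```python
def deviation_percent(our_best, previous_best):
    """Deviation (in percent) between our best solution and the previously
    best-known one.

    Assumption: the difference is taken in absolute value and normalised by
    the previously best-known value, which matches the tabulated figures.
    """
    return abs(our_best - previous_best) / previous_best * 100.0


# Bre-1 (Table 17): our best 1106.00, previously best-known 1143.0
print(round(deviation_percent(1106.0, 1143.0), 2))
# gol-n241-k22 (Table 21): our best 714.73, previously best-known 707.79
print(round(deviation_percent(714.73, 707.79), 2))
```

Because the difference is taken in absolute value, the ∆ column alone does not say whether a solution improves on or falls short of the previously best-known one; that has to be read from the two solution columns.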
A Hybrid cGA for the CVRP
Table 8. New best solution for instance Bre-5
Number of routes: 5
Total solution length: 950.0
Route    Capacity  Distance  Visited customers
Route 1  200       180.0     97, 88, 98, 78, 89, 69, 99, 79, 59, 40, 30, 20, 39, 29, 49, 38, 28, 48, 57, 77
Route 2  200       228.0     45, 34, 44, 63, 83, 93, 73, 53, 64, 84, 94, 74, 54, 65, 85, 95, 75, 55, 66, 86
Route 3  200       135.0     96, 76, 56, 46, 35, 25, 14, 24, 13, 23, 12, 3, 4, 5, 15, 6, 16, 36, 67, 87
Route 4  200       153.0     37, 17, 8, 18, 9, 19, 0, 10, 21, 41, 31, 42, 32, 22, 11, 2, 1, 7, 27, 47
Route 5  200       254.0     26, 33, 43, 52, 72, 92, 82, 62, 51, 71, 91, 81, 61, 50, 70, 90, 80, 60, 58, 68
Table 9. New best solution for instance Bre-6 Number of routes Total solution length Capacity Route 1 Distance Visited customers
Route 2
Route 3
Route 4
Route 5
Capacity Distance Visited customers Capacity Distance Visited customers Capacity Distance Visited customers Capacity Distance Visited customers
5 969.0 200 151.0 97, 96, 49, 73, 200 217.0 78, 74, 27, 34, 200 218.0 42, 94, 52, 53, 200 196.0 99, 92, 22, 29, 200 187.0 46, 39, 26, 33,
37, 30, 70, 23, 16, 64, 9, 65, 68, 31, 83, 38, 90, 91, 45, 98
19, 12, 5, 54, 55, 56, 57, 6, 13, 20, 41, 82, 86, 93, 95, 48
89, 85, 79, 35, 28, 21, 14, 7, 0, 50, 51, 1, 8, 72, 77, 44
84, 24, 69, 17, 10, 3, 2, 60, 59, 58, 15, 75, 80, 87, 36, 43
32, 71, 25, 18, 66, 67, 11, 61, 4, 62, 63, 76, 81, 88, 40, 47
Table 10. New best solution for instance Bre-9
Number of routes Total solution length Capacity Route 1 Distance Visited customers Capacity Route 2 Distance Visited customers Capacity Route 3 Distance Visited customers Capacity Route 4 Distance Visited customers Capacity Route 5 Distance Visited customers Capacity Route 6 Distance Visited customers Capacity Route 7 Distance Visited customers Capacity Route 8 Distance Visited customers Capacity Route 9 Distance Visited customers Capacity Route 10 Distance Visited customers
10 1690.0 100 143.0 47, 48, 100 146.0 32, 78, 100 208.0 72, 22, 100 232.0 20, 13, 100 64.0 97, 45, 100 146.0 96, 37, 100 140.0 79, 83, 100 234.0 16, 64, 100 157.0 93, 31, 100 220.0 19, 12,
41, 90, 88, 84, 82, 34, 86, 40, 46
73, 26, 27, 75, 81, 33, 39
15, 8, 58, 59, 60, 2, 9, 65, 68
6, 57, 56, 55, 54, 5, 11, 25, 92
94, 91, 38, 95, 99, 98
30, 70, 23, 77, 29, 74, 80, 85, 36, 44
87, 89, 35, 42, 43
1, 53, 52, 51, 50, 0, 7, 14, 21, 28
76, 49, 69, 66, 67, 18, 71
63, 62, 61, 4, 3, 10, 17, 24
Table 11. New best solution for instance Bre-10 Number of routes Total solution length
10 1026.0
Route 1
Capacity Distance Visited customers
100 80.0 83, 81, 67, 63, 65, 64, 69, 80, 82, 1
Route 2
Capacity Distance Visited customers
100 63.0 6, 2, 27, 23, 20, 22, 24, 26, 28, 29
Route 3
Capacity Distance Visited customers
100 135.0 35, 34, 31, 30, 32, 36, 38, 39, 13
Route 4
Capacity Distance Visited customers
100 135.0 49, 44, 41, 58, 54, 59, 60, 61, 62, 66, 68
Route 5
Capacity Distance Visited customers
100 206.0 37, 33, 56, 52, 50, 51, 53, 55, 57, 70, 72
Route 6
Capacity Distance Visited customers
100 68.0 17, 19, 18, 16, 14, 15, 12, 10, 11
Route 7
Capacity Distance Visited customers
100 55.0 7, 88, 89, 87, 85, 84, 86, 3, 5, 9
Route 8
Capacity Distance Visited customers
100 136.0 93, 78, 79, 77, 75, 73, 71, 74, 76
Route 9
Capacity Distance Visited customers
100 67.0 92, 90, 91, 94, 95, 97, 99, 98, 96
Route 10
Capacity Distance Visited customers
100 81.0 8, 4, 0, 47, 45, 43, 40, 42, 46, 48, 21, 25
Table 12. New best solution for instance Bre-11
Number of routes Total solution length
10 1128.0
Route 1
Capacity Distance Visited customers
100 117.0 36, 26, 16, 6, 7, 17, 27, 46, 56, 66
Route 2
Capacity Distance Visited customers
100 105.0 93, 82, 91, 90, 80, 70, 71, 72, 83
Route 3
Capacity Distance Visited customers
100 60.0 95, 85, 75, 55, 54, 64, 74, 84, 94
Route 4
Capacity Distance Visited customers
100 105.0 73, 62, 61, 51, 41, 31, 42, 52, 63
Route 5
Capacity Distance Visited customers
100 107.0 86, 77, 58, 48, 38, 37, 47, 57, 67, 76
Route 6
Capacity Distance Visited customers
100 152.0 32, 21, 11, 1, 0, 10, 20, 30, 40, 50, 60, 81, 92
Route 7
Capacity Distance Visited customers
100 105.0 44, 34, 24, 14, 4, 5, 15, 25, 35, 45, 65
Route 8
Capacity Distance Visited customers
100 116.0 43, 33, 23, 13, 3, 2, 12, 22, 53
Route 9
Capacity Distance Visited customers
100 112.0 87, 88, 78, 68, 79, 89, 99, 98, 97, 96
Route 10
Capacity Distance Visited customers
100 149.0 69, 59, 49, 39, 29, 19, 9, 8, 18, 28
Table 13. New best solution for instance ta-n76-k10b Number of routes Total solution length
10 1344.6186241386285
Route 1
Capacity Distance Visited customers
1654 258.5993825897365 73, 9, 11, 68, 54, 55, 60, 57, 56, 59, 61, 58, 67, 69, 35
Route 2
Capacity Distance Visited customers
1659 174.46520984752948 39, 22, 15, 17, 31, 46, 45, 13
Route 3
Capacity Distance Visited customers
1590 72.761072402546 12, 42, 49, 47
Route 4
Capacity Distance Visited customers
1594 321.3083147478987 7, 72, 62, 65, 66, 63, 64, 52
Route 5
Capacity Distance Visited customers
1219 13.508270432752166 6, 10, 0
Route 6
Capacity Distance Visited customers
1677 214.36282712025178 25, 33, 23, 16, 30, 27, 34, 24, 28, 19, 26, 20, 21, 29, 32
Route 7
Capacity Distance Visited customers
1617 90.52827230772104 50, 38, 41, 53, 48, 36, 43, 5
Route 8
Capacity Distance Visited customers
1547 32.385178886685395 1, 4, 74, 71, 14, 8
Route 9
Capacity Distance Visited customers
687 6.0 3
Route 10
Capacity Distance Visited customers
1662 160.7000958035075 2, 51, 37, 18, 44, 40, 70
Table 14. Computational Facts. Benchmark of Augerat et al. Set A
Instance A-n32-k5 A-n33-k5 A-n33-k6 A-n34-k5 A-n36-k5 A-n37-k5 A-n37-k6 A-n38-k5 A-n39-k5 A-n39-k6 A-n44-k7 A-n45-k6 A-n45-k7 A-n46-k7 A-n48-k7 A-n53-k7 A-n54-k7 A-n55-k9 A-n60-k9 A-n61-k9 A-n62-k8 A-n63-k9 A-n63-k10 A-n64-k9 A-n65-k9 A-n69-k9 A-n80-k10
Our Best Best-Known Solution Found Avg. Evaluations Hit (%) 784.00 10859.36 ± 3754.35 100.00 661.00 10524.08 ± 2773.86 100.00 742.00 9044.89 ± 3087.92 100.00 778.00 13172.95 ± 4095.90 100.00 799.00 21699.29 ± 5927.13 86.00 669.00 15101.04 ± 4227.04 100.00 949.00 20575.88 ± 6226.19 95.00 730.00 17513.31 ± 5943.05 96.00 822.00 25332.52 ± 7269.13 21.00 831.00 22947.67 ± 4153.08 12.00 937.00 39208.71 ± 11841.16 34.00 944.00 52634.00 ± 10484.88 6.00 1146.00 46516.30 ± 13193.60 76.00 914.00 32952.40 ± 8327.32 100.00 1073.00 41234.69 ± 11076.23 55.00 1010.00 56302.26 ± 15296.18 19.00 1167.00 58062.13 ± 12520.94 52.00 1073.00 50973.67 ± 10628.18 48.00 1354.00 97131.75 ± 13568.88 8.00 1034.00 87642.33 ± 8468.60 6.00 1288.00 166265.86 ± 27672.54 7.00 1616.00 131902.00 ± 12010.92 2.00 1314.00 90994.00 ± 0.0 1.00 1401.00 100446.00 ± 17063.90 2.00 1174.00 88292.50 ± 20815.10 2.00 1159.00 90258.00 ± 0.0 1.00 1763.00 154976.00 ± 33517.14 4.00
∆ (%) 784.00 ± 0.00 0.00 661.00 ± 0.00 0.00 742.00 ± 0.00 0.00 778.00 ± 0.00 0.00 800.39 ± 3.62 0.00 669.00 ± 0.00 0.00 949.12 ± 0.67 0.00 730.07 ± 0.36 0.00 824.77 ± 1.71 0.00 833.16 ± 1.35 0.00 939.53 ± 2.69 0.00 957.88 ± 7.85 0.00 1147.17 ± 2.23 0.00 914.00 ± 0.00 0.00 1078.68 ± 6.89 0.00 1016.10 ± 4.17 0.00 1170.68 ± 4.84 0.00 1073.87 ± 2.15 0.00 1359.82 ± 4.53 0.00 1040.50 ± 7.43 0.00 1300.92 ± 7.12 0.00 1630.59 ± 5.65 0.00 1320.97 ± 3.69 0.00 1416.12 ± 6.38 0.00 1181.60 ± 3.51 0.00 1168.56 ± 3.64 0.00 1785.78 ± 10.88 0.00 Avg. Solution
Previously Best-Known 784.0 [41] 661.0 [41] 742.0 [41] 778.0 [41] 799.0 [41] 669.0 [41] 949.0 [41] 730.0 [41] 822.0 [41] 831.0 [41] 937.0 [41] 944.0 [41] 1146.0 [41] 914.0 [41] 1073.0 [41] 1010.0 [41] 1167.0 [41] 1073.0 [41] 1354.0 [41] 1034.0 [41] 1288.0 [41] 1616.0 [41] 1314.0 [41] 1401.0 [41] 1174.0 [41] 1159.0 [41] 1763.0 [41]
Table 15. Computational Facts. Benchmark of Augerat et al. Set B Instance B-n31-k5 B-n34-k5 B-n35-k5 B-n38-k6 B-n39-k5 B-n41-k6 B-n43-k6 B-n44-k7 B-n45-k5 B-n45-k6 B-n50-k7 B-n50-k8 B-n51-k7 B-n52-k7 B-n56-k7 B-n57-k7 B-n57-k9 B-n63-k10 B-n64-k9 B-n66-k9 B-n67-k10 B-n68-k9 B-n78-k10
Our Best Best-Known Solution Found Avg. Evaluations Hit (%) 672.00 9692.02 ± 3932.91 100.00 788.00 11146.94 ± 4039.59 100.00 955.00 12959.81 ± 4316.64 100.00 805.00 23227.87 ± 4969.47 92.00 549.00 21540.88 ± 6813.76 85.00 829.00 31148.89 ± 5638.03 79.00 742.00 27894.05 ± 5097.48 60.00 909.00 31732.34 ± 5402.31 85.00 751.00 36634.68 ± 4430.40 28.00 678.00 58428.21 ± 9709.70 33.00 741.00 16455.25 ± 3159.24 100.00 1312.00 59813.00 ± 0.0 1.00 1032.00 59401.30 ± 12736.87 79.00 747.00 32188.17 ± 5886.91 70.00 707.00 45605.11 ± 7118.89 17.00 1153.00 106824.57 ± 9896.32 7.00 1598.00 71030.38 ± 7646.15 16.00 1496.00 67313.50 ± 10540.84 2.00 861.00 97043.32 ± 15433.00 19.00 1316.00 94125.50 ± 15895.05 2.00 1032.00 130628.60 ± 26747.01 5.00 1273.00 —– 0.00 1221.00 145702.71 ± 16412.80 14.00
Avg. Solution 672.00 ± 0.00 788.00 ± 0.00 955.00 ± 0.00 805.08 ± 0.27 549.20 ± 0.57 829.21 ± 0.41 742.86 ± 1.22 910.34 ± 4.17 754.52 ± 3.88 680.94 ± 3.51 741.00 ± 0.00 1317.22 ± 4.32 1032.26 ± 0.56 747.56 ± 1.18 710.56 ± 2.22 1168.97 ± 15.52 1601.64 ± 3.66 1530.51 ± 14.07 866.69 ± 6.04 1321.36 ± 2.94 1042.48 ± 11.63 1286.19 ± 3.88 1239.86 ± 13.05
∆ (%) 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.08 0.00
Previously Best-Known 672.0 [41] 788.0 [41] 955.0 [41] 805.0 [41] 549.0 [41] 829.0 [41] 742.0 [41] 909.0 [41] 751.0 [41] 678.0 [41] 741.0 [41] 1312.0 [41] 1032.0 [41] 747.0 [41] 707.0 [41] 1153.0 [41] 1598.0 [41] 1496.0 [41] 861.0 [41] 1316.0 [41] 1032.0 [41] 1272.0 [41] 1221.0 [41]
Table 16. Computational Facts. Benchmark of Augerat et al. Set P Instance
Our Best
P-n16-k8 P-n19-k2 P-n20-k2 P-n21-k2 P-n22-k2 P-n22-k8 P-n23-k8 P-n40-k5 P-n45-k5 P-n50-k7 P-n50-k8 P-n50-k10 P-n51-k10 P-n55-k7 P-n55-k8 P-n55-k10 P-n55-k15 P-n60-k10 P-n60-k15 P-n65-k10 P-n70-k10 P-n76-k4 P-n76-k5 P-n101-k4
450.00 212.00 216.00 211.00 216.00 603.00 529.00 458.00 510.00 554.00 631.00 696.00 741.00 568.00 576.00 694.00 989.00 744.00 968.00 792.00 827.00 593.00 627.00 681.00
Best-Known Solution Avg. Evaluations 1213.92 ± 325.60 2338.79 ± 840.45 3122.74 ± 1136.45 1704.60 ± 278.63 2347.77 ± 505.00 2708.12 ± 877.52 3422.29 ± 1275.23 16460.94 ± 4492.84 25434.43 ± 6480.60 45280.10 ± 9004.78 62514.50 ± 8506.60 49350.00 ± 0.0 55512.21 ± 9290.20 57367.27 ± 11916.64 48696.50 ± 12961.64 47529.00 ± 0.0 108412.75 ± 8508.23 77925.62 ± 12377.87 87628.35 ± 20296.24 66024.53 ± 12430.83 131991.00 ± 11429.62 89428.14 ± 17586.60 150548.60 ± 38449.91 107245.00 ± 26834.63
Found Hit (%) 100.00 100.00 100.00 100.00 100.00 100.00 100.00 99.00 72.00 29.00 4.00 1.00 19.00 22.00 78.00 1.00 4.00 8.00 23.00 30.00 4.00 7.00 5.00 7.00
Avg. Solution 450.00 ± 0.00 212.00 ± 0.00 216.00 ± 0.00 211.00 ± 0.00 216.00 ± 0.00 603.00 ± 0.00 529.00 ± 0.00 458.01 ± 0.10 510.89 ± 1.68 556.08 ± 1.94 638.80 ± 5.58 698.58 ± 2.65 747.25 ± 6.65 573.03 ± 3.09 576.42 ± 1.15 698.22 ± 1.81 1002.96 ± 9.82 749.35 ± 4.95 972.42 ± 3.12 798.44 ± 5.03 836.28 ± 5.70 599.73 ± 5.07 632.89 ± 4.51 687.36 ± 5.22
∆ Previously (%) Best-Known 0.00 450.0 [41] 0.00 212.0 [41] 0.00 216.0 [41] 0.00 211.0 [41] 0.00 216.0 [41] 0.00 603.0 [41] 0.00 529.0 [41] 0.00 458.0 [41] 0.00 510.0 [41] 0.00 554.0 [41] 0.00 631.0 [41] 0.00 696.0 [41] 0.00 741.0 [41] 0.00 568.0 [41] 0.00 576.0 [41] 0.00 694.0 [41] 0.00 989.0 [41] 0.00 744.0 [41] 0.00 968.0 [41] 0.00 792.0 [41] 0.00 827.0 [41] 0.00 593.0 [41] 0.00 627.0 [41] 0.00 681.0 [41]
Table 17. Computational Facts. Benchmark of Van Breedam
(Avg. Evaluations and Hit (%) belong to the column group "Best-Known Solution Found".)
Instance  Our Best  Avg. Evaluations        Hit (%)  Avg. Solution    ∆ (%)  Previously Best-Known
Bre-1     1106.00   428896.92 ± 365958.66   65.00    1113.03 ± 10.80  3.24   1143.0 [40]
Bre-2     1506.00   365154.72 ± 196112.12   53.00    1513.68 ± 11.07  4.68   1580.0 [40]
Bre-3     1751.00   146429.00 ± 29379.40    100.00   1751.00 ± 0.00   0.00   1751.0 [40]
Bre-4     1470.00   210676.00 ± 82554.15    100.00   1470.00 ± 0.00   0.41   1476.0 [40]
Bre-5     950.00    977540.00 ± 554430.07   5.00     955.36 ± 4.16    2.56   975.0 [40]
Bre-6     969.00    820180.39 ± 436314.36   51.00    970.79 ± 2.46    1.02   979.0 [40]
Bre-9     1690.00   1566300.00 ± 0.0        1.00     1708.48 ± 10.48  5.59   1790.0 [40]
Bre-10    1026.00   1126575.00 ± 428358.44  4.00     1033.34 ± 6.22   1.82   1045.0 [40]
Bre-11    1128.00   1742600.00 ± 840749.96  2.00     1153.35 ± 10.81  2.76   1160.0 [40]
Table 18. Computational Facts. Benchmark of Christofides and Eilon
(Avg. Evaluations and Hit (%) belong to the column group "Best-Known Solution Found".)
Instance    Our Best  Avg. Evaluations       Hit (%)  Avg. Solution   ∆ (%)  Previously Best-Known
E-n13-k4    247.00    4200.00 ± 0.00         100.00   247.00 ± 0.00   0.00   247.0 [41]
E-n22-k4    375.00    3831.60 ± 1497.55      100.00   375.00 ± 0.00   0.00   375.0 [41]
E-n23-k3    569.00    2068.78 ± 414.81       100.00   569.00 ± 0.00   0.00   569.0 [41]
E-n30-k3    534.00    6400.86 ± 2844.69      99.00    534.03 ± 0.30   0.00   534.0 [41]
E-n30-k4    503.00    4644.87 ± 1256.38      100.00   503.00 ± 0.00   --     --
E-n31-k7    379.00    274394.03 ± 162620.36  67.00    380.67 ± 3.36   0.00   379.0 [41]
E-n33-k4    835.00    14634.89 ± 4463.32     100.00   835.00 ± 0.00   0.00   835.0 [41]
E-n51-k5    521.00    40197.43 ± 10576.82    30.00    526.54 ± 4.75   0.00   521.0 [41]
E-n76-k7    682.00    80594.00 ± 20025.23    3.00     688.19 ± 3.42   0.00   682.0 [41]
E-n76-k8    735.00    86700.33 ± 26512.46    3.00     741.25 ± 4.14   0.00   735.0 [41]
E-n76-k10   830.00    166568.33 ± 11138.72   3.00     841.09 ± 6.16   0.00   830.0 [41]
E-n76-k14   1021.00   235643.50 ± 55012.20   2.00     1033.05 ± 6.14  0.00   1021.0 [41]
E-n101-k8   817.00    192223.62 ± 43205.58   8.00     822.84 ± 3.92   0.00   815.0 [41]
E-n101-k14  1071.00   1832800.00 ± 0.00      1.00     1089.60 ± 7.59  0.00   1071.0 [41]
Table 19. Computational Facts. Benchmark of Christofides, Mingozzi and Toth Instance CMT-n51-k5 CMT-n76-k10 CMT-n101-k8 CMT-n101-k10 CMT-n121-k7 CMT-n151-k12 CMT-n200-k17 CMTd-n51-k6 CMTd-n76-k11 CMTd-n101-k9 CMTd-n101-k11 CMTd-n121-k11 CMTd-n151-k14 CMTd-n200-k18
Our Best
Best-Known Solution Found
Avg. Solution
524.61 835.26 826.14 819.56 1042.12 1035.22 1315.67 555.43 909.68 865.94 866.37 1543.63 1165.10 1410.92
Avg. Evaluations Hit (%) 27778.16 ± 6408.68 51.00 118140.50 ± 65108.27 2.00 146887.00 ± 42936.94 2.00 70820.85 ± 14411.87 89.00 352548.00 ± 0.00 1.00 —– 0.00 —– 0.00 27413.30 ± 6431.00 23.00 69638.50 ± 12818.35 6.00 91882.00 ± 0.00 1.00 87576.85 ± 13436.17 68.00 —– 0.00 —– 0.00 —– 0.00
525.70 ± 2.31 842.90 ± 5.49 833.21 ± 4.22 821.01 ± 4.89 1136.95 ± 31.36 1051.74 ± 7.27 1339.90 ± 11.30 558.10 ± 2.15 919.30 ± 7.22 877.01 ± 5.92 867.53 ± 3.65 1568.53 ± 19.57 1182.42 ± 10.50 1432.94 ± 11.41
∆ Previously (%) Best-Known 0.00 0.00 0.00 0.00 0.00 0.66 1.88 0.00 0.00 0.00 0.00 0.16 0.22 1.08
524.61 835.26 826.14 819.56 1042.11 1028.42 1291.45 555.43 909.68 865.94 866.37 1541.14 1162.55 1395.85
[42] [36] [36] [36] [36] [36] [43] [36] [36] [36] [30] [36] [36] [28]
Table 20. Computational Facts. Benchmark of Fisher
(Avg. Evaluations and Hit (%) belong to the column group "Best-Known Solution Found".)
Instance   Our Best  Avg. Evaluations      Hit (%)  Avg. Solution    ∆ (%)  Previously Best-Known
F-n45-k4   724.00    14537.45 ± 4546.13    96.00    724.16 ± 0.93    0.00   724.0 [41]
F-n72-k4   237.00    25572.12 ± 5614.46    84.00    237.30 ± 0.83    0.00   237.0 [41]
F-n135-k7  1162.00   167466.33 ± 30768.37  3.00     1184.76 ± 20.04  0.00   1162.0 [41]
Table 21. Computational Facts. Benchmark of Golden et al. Instance gol-n241-k22 gol-n253-k27 gol-n256-k14 gol-n301-k28 gol-n321-k30 gol-n324-k16 gol-n361-k33 gol-n397-k34 gol-n400-k18 gol-n421-k41 gol-n481-k38 gol-n484-k19 gold-n201-k5 gold-n241-k10 gold-n281-k8 gold-n321-k10 gold-n361-k9 gold-n401-k10 gold-n441-k11 gold-n481-k12
Our Best 714.73 872.59 589.20 1019.19 1107.24 754.95 1391.27 1367.47 935.91 1867.30 1663.87 1126.38 6460.98 5669.82 8428.18 8488.29 10227.51 11164.11 11946.05 13846.94
Best-Known Solution Found Avg. Evaluations —– —– —– —– —– —– —– —– —– —– —– —– 1146050.00 ± 89873.27 —– —– —– —– —– —– —–
Hit (%) 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Avg. Solution 724.96 ± 7.11 884.05 ± 5.98 601.10 ± 5.43 1037.31 ± 18.59 1122.32 ± 9.71 766.31 ± 6.29 1409.42 ± 7.74 1388.93 ± 8.52 950.07 ± 8.57 1890.16 ± 12.28 1679.55 ± 9.64 1140.03 ± 7.11 6567.35 ± 104.68 5788.72 ± 44.79 8576.61 ± 89.38 8549.17 ± 47.61 10369.83 ± 69.64 11293.47 ± 69.95 12035.17 ± 63.45 14052.71 ± 140.07
∆ (%)
Previously Best-Known
0.98 1.57 0.99 2.04 2.43 1.74 1.79 1.65 1.90 2.53 2.54 1.73 0.00 0.75 0.18 0.48 0.31 1.16 2.42 1.63
707.79 859.11 583.39 998.73 1081.31 742.03 1366.86 1345.23 918.45 1821.15 1622.69 1107.19 6460.98 5627.54 8412.88 8447.92 10195.56 11036.23 11663.55 13624.52
[43] [43] [43] [43] [43] [44] [43] [43] [43] [43] [43] [43] [45] [43] [45] [45] [45] [45] [43] [45]
Table 22. Computational Facts. Benchmark of Taillard Instance
Our Best
ta-n76-k10a 1618.36 ta-n76-k10b 1344.62 ta-n76-k9a 1291.01 ta-n76-k9b 1365.42 ta-n101-k10a 1411.66 ta-n101-k10b 1584.20 ta-n101-k11a 2047.90 ta-n101-k11b 1940.36 ta-n151-k14a 2364.08 ta-n151-k14b 2654.69 ta-n151-k15a 3056.41 ta-n151-k15b 2732.75 ta-n386-k47 24941.71
Best-Known Solution Found
Avg. Solution
Avg. Evaluations Hit (%) 107401.97 ± 51052.17 65.00 1619.79 ± 2.18 75143.36 ± 20848.50 11.00 1345.10 ± 0.63 116404.75 ± 60902.73 72.00 1294.81 ± 10.02 98309.19 ± 36107.58 94.00 1365.53 ± 0.88 —– 0.00 1419.41 ± 4.05 —– 0.00 1599.57 ± 4.69 —– 0.00 2070.39 ± 9.39 —– 0.00 1943.60 ± 4.68 —– 0.00 2407.75 ± 25.57 —– 0.00 2688.06 ± 14.12 —– 0.00 3124.28 ± 51.65 —– 0.00 2782.21 ± 28.61 —– 0.00 25450.87 ± 165.41
∆ Previously (%) Best-Known 0.00 1618.36 0.00 1344.64 0.00 1291.01 0.00 1365.42 0.39 1406.20 0.19 1581.25 0.32 2041.34 0.02 1939.90 0.95 2341.84 0.35 2645.39 0.04 3055.23 2.87 2656.47 2.09 24431.44
[36] [36] [36] [36] [46] [43] [46] [46] [36] [46] [36] [36] [28]
Table 23. Computational Facts. Translated instances from TSP
Instance
Our Best
att-n48-k4 40002.00 bayg-n29-k4 2050.00 bays-n29-k5 2963.00 dantzig-n42-k4 1142.00 fri-n26-k3 1353.00 gr-n17-k3 2685.00 gr-n21-k3 3704.00 gr-n24-k4 2053.00 gr-n48-k3 5985.00 hk-n48-k4 14749.00 swiss-n42-k5 1668.00 ulysses-n16-k3 7965.00 ulysses-n22-k4 9179.00
Best-Known Solution Found
Avg. Solution
Avg. Evaluations Hit (%) 121267.06 ± 69462.74 85.00 40030.28 ± 69.04 32326.00 ± 20669.26 100.00 2050.00 ± 0.00 18345.00 ± 6232.28 100.00 2963.00 ± 0.00 71399.00 ± 45792.68 100.00 1142.00 ± 0.00 12728.00 ± 6130.78 100.00 1353.00 ± 0.00 4200.00 ± 0.00 100.00 2685.00 ± 0.00 5881.00 ± 3249.57 100.00 3704.00 ± 0.00 13835.00 ± 6313.48 100.00 2053.00 ± 0.00 175558.97 ± 123635.58 39.00 5986.96 ± 1.75 117319.00 ± 71898.28 100.00 14749.00 ± 0.00 64962.00 ± 28161.78 100.00 1668.00 ± 0.00 4200.00 ± 0.00 100.00 7965.00 ± 0.00 7603.00 ± 4401.78 100.00 9179.00 ± 0.00
∆ Previously (%) Best-Known 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
40002.0 2050.0 2963.0 1142.0 1353.0 2685.0 3704.0 2053.0 5985.0 14749.0 1668.0 7965.0 9179.0
[47] [47] [47] [47] [47] [47] [47] [47] [47] [47] [47] [47] [47]
Particle Swarm Optimization with Mutation for High Dimensional Problems
Jeff Achtnig
Summary. Particle Swarm Optimization (PSO) is an effective algorithm for solving many global optimization problems. Much of the current research, however, focusses on optimizing functions involving relatively few dimensions (typically around 20 to 30, and generally fewer than 100). Recently PSO has been applied to a different class of problems – such as neural network training – that can involve hundreds or thousands of variables. High dimensional problems such as these pose their own unique set of challenges for any optimization algorithm. This chapter investigates the use of PSO for very high dimensional problems. It has been found that combining PSO with some of the concepts from evolutionary algorithms, such as a mutation operator, can in many cases significantly improve the performance of PSO on very high dimensional problems. Further improvements have been found with the addition of a random constriction coefficient.
1 Introduction
1.1 Outline
This chapter is organized as follows. The remainder of this section describes the basic particle swarm algorithm along with some of the difficulties encountered with high dimensional optimization problems. Section 2 describes the modifications made to the basic PSO algorithm. Section 3 outlines the experimental settings used in our tests, followed by a discussion of the test results in section 4. Finally, section 5 offers some conclusions.
1.2 Particle Swarm Optimization
The PSO algorithm was first proposed by Kennedy and Eberhart [1] as a new method for function optimization. The inspiration for PSO came from observing the social behavior found in nature in such things as a flock of birds, a swarm of bees, or a school of fish.
J. Achtnig: Particle Swarm Optimization with Mutation for High Dimensional Problems, Studies in Computational Intelligence (SCI) 82, 423–439 (2008) © Springer-Verlag Berlin Heidelberg 2008 www.springerlink.com
PSO consists of a population of particles – called a swarm – that moves in an n-dimensional, real-valued problem space. Each particle keeps track of its current position in the problem space, along with its current velocity and its best position found so far. The basic PSO algorithm starts by randomly positioning a number of particles uniformly throughout the problem space. At each time step, a particle updates its position, x, and velocity, v, according to the following equations:
$$v_{t+1} = \chi \left( v_t + \phi_1 (p - x_t) + \phi_2 (g - x_t) \right) \tag{1}$$
$$x_{t+1} = x_t + v_{t+1} \tag{2}$$
The constriction coefficient, χ = 0.729844, is due to Clerc and Kennedy [2], who introduced it as a way of keeping the particles’ velocities from exploding. A particle’s best position is denoted by p, and the global best position amongst all particles is denoted by g. ϕ1 and ϕ2 are scalar values whose purpose is to add some randomness to the particle’s velocity. They are randomly drawn from a uniform distribution over [0, 2.05] at each time step and for each dimension.
One of the problems with the basic PSO algorithm is that of premature convergence [8, 11]. PSO has a tendency to converge quite rapidly, which often means that it will converge to a suboptimal solution. A number of researchers have investigated various solutions to this problem, which include adding a spatial extension to PSO [10], changing the neighborhood topology [3], and increasing the diversity of the particles [11].
1.3 Curse of Dimensionality
The “curse of dimensionality” [5] refers to the exponential growth of volume associated with adding extra dimensions to a problem space. For population based optimization algorithms such as PSO, this rapid growth in volume means that as the dimensionality of the problem increases, each of the particles has to potentially search a larger and larger area in the problem space.
Consider a one dimensional problem. We might assign two particles to search for a solution to this problem along a line segment −1.0 ≤ x ≤ +1.0. In this case, one particle might search the negative half of x, while the other searches the positive half of x. If we move to two dimensions, each particle might now be assigned to searching half of a square – a considerably larger area. Moving to three dimensions increases the search space even more. Every additional dimension potentially requires that each particle search a larger and larger area.
Some of our initial tests with the basic PSO algorithm in 400 dimensions indicated that PSO might have difficulties in higher dimensions.
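To make this growth concrete, the following worked calculation extends the two-particle example. It assumes, purely for illustration, that the search space is the hypercube with each coordinate in [−1.0, +1.0] and that it is shared evenly among N particles; the symbols V_n and N are not part of the original notation.

$$V_n = \bigl(1.0 - (-1.0)\bigr)^n = 2^n, \qquad \frac{V_n}{N} = \frac{2^n}{N}$$

$$n = 1,\ N = 2:\ \frac{V_1}{N} = 1, \qquad n = 2,\ N = 2:\ \frac{V_2}{N} = 2, \qquad n = 400,\ N = 100:\ \frac{V_{400}}{N} = \frac{2^{400}}{100} \approx 2.6 \times 10^{118}$$

Under these assumptions, even a swarm of 100 particles leaves each particle with an astronomically large share of a 400-dimensional space.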
For many of the problems we looked at, PSO’s predisposition towards premature convergence worsened as the dimensionality of the problem grew. The increased problem space, along with the sparse population of particles within that problem space, often caused the swarm to quickly converge on relatively poor solutions. PSO’s difficulty with high dimensional problems can be understood as follows. A problem with 1000 variables, for example, would require 2000 iterations in order for a given particle to explore the effects of changing just a single variable at a time (i.e. two iterations per variable would be needed to explore both a positive and a negative change in direction from its current position). Exploring the effects of simultaneous variable changes would require even more iterations. In that time, the swarm would have already begun converging on one of the particles. Once the swarm converges the search becomes much more local, and the ability to explore large-scale changes in each of the variables diminishes.
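For reference, the basic update rule of equations 1 and 2 (Section 1.2) can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the experiments: the function and parameter names are hypothetical, positions and velocities are plain lists, and ϕ1 and ϕ2 are redrawn for every dimension as described in the text.

```python
import random

CHI = 0.729844   # constriction coefficient given in the text
PHI_MAX = 2.05   # upper bound of the uniform range for phi_1 and phi_2


def update_particle(x, v, p, g, chi=CHI, rng=random):
    """Apply equations 1 and 2 to one particle, dimension by dimension.

    x, v: current position and velocity; p: the particle's best position;
    g: the global best position. Returns the new (position, velocity).
    """
    new_x, new_v = [], []
    for xi, vi, pi, gi in zip(x, v, p, g):
        phi1 = rng.uniform(0.0, PHI_MAX)  # fresh draw for every dimension
        phi2 = rng.uniform(0.0, PHI_MAX)
        vi_next = chi * (vi + phi1 * (pi - xi) + phi2 * (gi - xi))  # equation 1
        new_v.append(vi_next)
        new_x.append(xi + vi_next)                                  # equation 2
    return new_x, new_v
```

When a particle already sits on both its own best and the global best position, the two attraction terms vanish and the velocity simply shrinks by the factor χ at each step, which illustrates how a converged swarm loses its ability to make large moves.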
2 PSO Modifications
Our initial goal was to examine the ability of PSO in high dimensional problem spaces, and see if there might be a simple way of improving its performance on those types of problems. To that end, our research was narrowed to two techniques that did not require any large or computationally expensive changes to the PSO algorithm. The first was the inclusion of a mutation operator. The second involved using a random constriction coefficient instead of a fixed one.
2.1 Mutation
The basic PSO algorithm is modelled around the concept of swarming, and therefore does not make use of an explicit mutation operator. In typical evolutionary algorithms, however, mutation is an important operation that helps to maintain the diversity of the population. It was therefore hypothesized that combining the basic PSO algorithm with a mutation operator would help the swarm avoid premature convergence in high dimensional problems.
One possible approach for creating a mutation operator for PSO is to give each problem dimension of every particle a random chance of getting mutated. For a 1,600 dimensional problem with 100 particles, however, this would result in 160,000 random numbers generated in each iteration of the swarm. To reduce the number of random number calculations required, it was decided instead to give each particle a chance of getting selected for mutation. If selected, then a second random variable would be generated to determine which single dimension/variable for that particle would get mutated. This approach considerably reduces the number of random number calculations required.
Note that limiting the mutation to a single dimension/variable does not preclude the ability to explore the effects of multiple variable changes. A particle can still be selected for mutation multiple times in succession, thus allowing potentially more than one variable in each particle to be affected by mutation.
Once a dimension/variable of a particle has been selected for mutation, it is replaced by a uniformly distributed random number drawn over the entire range of the problem. The standard velocity and position update formulas of equations 1 and 2 are then applied, as usual, to all of the particle’s dimensions. After some preliminary testing, a mutation rate of 2.5% was chosen for our experiments.
2.2 Random Constriction Coefficient
A common variant of PSO utilizes a linearly decreasing inertial weight term [4] instead of the constriction coefficient, χ, of equation 1. However, using the inertial weight term in this way requires advance knowledge of how many iterations the PSO algorithm will run for, so that the weight term can be decreased accordingly. Instead of deciding on a fixed number of iterations arbitrarily or through trial and error, we used Clerc’s version of PSO (which utilizes a static constriction coefficient) for our experiments. However, in an attempt to improve some of our initial results with the mutation operator, we also experimented with a number of other PSO modifications in combination with it. After a few initial tests, we settled on utilizing a random constriction coefficient in place of a static one. In this approach, we generated a uniform random number in the range [0.2, 0.9] for each particle in the swarm, for every iteration. This random number was then used as the value of the constriction coefficient for that particle, and was applied to every dimension/variable of that particle.
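The two modifications described in this section can be sketched as follows. This is an illustrative reading of the text rather than the code used in our experiments: the names `mutate_swarm` and `random_constriction` are hypothetical, the swarm is assumed to be a list of position lists, and a single lower and upper bound is assumed to apply to every dimension.

```python
import random

MUTATION_RATE = 0.025    # 2.5% chance that a particle is selected per iteration
CHI_RANGE = (0.2, 0.9)   # range of the random constriction coefficient


def mutate_swarm(positions, lower, upper, rate=MUTATION_RATE, rng=random):
    """Mutate one randomly chosen dimension of each selected particle in place.

    Returns the indices of the particles that were mutated.
    """
    mutated = []
    for k, x in enumerate(positions):
        if rng.random() < rate:               # is this particle selected?
            d = rng.randrange(len(x))         # which single dimension to mutate
            x[d] = rng.uniform(lower, upper)  # new value from the full problem range
            mutated.append(k)
    return mutated


def random_constriction(rng=random):
    """Draw the constriction coefficient for one particle and one iteration."""
    return rng.uniform(*CHI_RANGE)
```

In such a sketch the mutation step would run once per iteration before the usual velocity and position updates, and the value returned by `random_constriction` would replace the fixed χ for all dimensions of the particle in that iteration.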
3 Experimental Settings
Four common numerical optimization problems and two neural-network training problems were chosen to compare the performance of the standard PSO algorithm (with a constriction coefficient) against the modifications presented in this chapter. For each test function we compare: 1) the standard PSO algorithm with constant constriction coefficient χ = 0.729844; 2) PSO with mutation and a constant constriction coefficient (as above); 3) PSO with a random constriction coefficient [0.2, 0.9], but no mutation; and 4) PSO with a combination of both mutation and a random constriction coefficient. For all of the trials, ϕ1 and ϕ2 were randomly chosen from the range [0, 2.05]. Also, in addition to a constriction coefficient, each particle was constrained by a vmax parameter. If a particle’s velocity exceeded vmax, then the velocity for that particle at that time step was set equal to vmax, where vmax was determined according to the following equation:
$$v_{max} = 0.50 \cdot \frac{\mathit{FunctionUpperLimit} - \mathit{FunctionLowerLimit}}{2} \tag{3}$$
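Equation 3 and the velocity limit can be illustrated with a short sketch. The function names are hypothetical, and limiting negative velocities at −vmax is an assumption: the text only states that velocities exceeding vmax are set equal to vmax.

```python
def v_max(lower, upper, scale=0.50):
    """Equation 3: a fixed fraction of half the function range."""
    return scale * (upper - lower) / 2.0


def clamp_velocity(v, vmax):
    """Limit every velocity component to the interval [-vmax, +vmax].

    Assumption: the limit is applied symmetrically, per dimension.
    """
    return [max(-vmax, min(vmax, vi)) for vi in v]


# For a function range of [-100, +100], equation 3 gives vmax = 50.
print(v_max(-100.0, 100.0))
print(clamp_velocity([60.0, -75.0, 10.0], 50.0))
```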
The performance of each algorithm was compared on each test function in 400, 800, and 1600 dimensions (the dimensionality of the neural network problems is roughly the same). With the exception of the neural network problems of 800 and 1600 dimensions, each test ran for 50 runs and the average fitness was used in our comparisons. For the previously mentioned neural network problems, 25 runs were averaged due to time constraints. For each run, each of the four PSO algorithms being tested was initialized with the same random seed, ensuring that the initial positions of the particles would be the same.
The population size for all tests was fixed at 100 particles, while the number of iterations varied depending on the test function and its dimensionality. Larger populations of particles – equal to the number of dimensions of the problem – were also experimented with, but the few tests we did seemed to suggest that running 100 particles for more iterations resulted in equal or better solutions. We were also looking at reducing the computation time involved, and decided that using fewer particles would best meet that goal. All particles in our tests were connected in the standard “star” (fully connected) topology.
Finally, we decided to compare the PSO algorithms with another evolutionary algorithm: Differential Evolution (DE) [12]. After some preliminary tests with a few DE variants and settings, we chose the DE/rand/1/best strategy with C = 0.80 and F = 0.50 for our comparisons. As in the PSO variants, the population size was fixed at 100 particles.
3.1 Standard Test Functions
The four numerical optimization problems used in our tests are the unimodal Sphere and Rosenbrock functions, and the multimodal Rastrigin and Schwefel functions.
$$\mathrm{Sphere} = \sum_{i=1}^{n} x_i^2 \tag{4}$$
$$\mathrm{Rosenbrock} = \sum_{i=1}^{n-1} \left( 100 \cdot \left( x_{i+1} - x_i^2 \right)^2 + (1 - x_i)^2 \right) \tag{5}$$
$$\mathrm{Rastrigin} = 10n + \sum_{i=1}^{n} \left( x_i^2 - 10 \cos (2 \pi x_i) \right) \tag{6}$$
$$\mathrm{Schwefel} = 418.9829 \cdot n + \sum_{i=1}^{n} \left( -x_i \sin \left( \sqrt{|x_i|} \right) \right) \tag{7}$$
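For readers who wish to reproduce the test functions, equations 4 to 7 translate directly into code. The sketch below uses only the Python standard library; the function names are illustrative, and each function takes a list of n real values.

```python
import math


def sphere(x):
    """Equation 4."""
    return sum(xi * xi for xi in x)


def rosenbrock(x):
    """Equation 5: the sum runs over consecutive pairs (x_i, x_{i+1})."""
    return sum(100.0 * (x[i + 1] - x[i] ** 2) ** 2 + (1.0 - x[i]) ** 2
               for i in range(len(x) - 1))


def rastrigin(x):
    """Equation 6."""
    return 10.0 * len(x) + sum(xi * xi - 10.0 * math.cos(2.0 * math.pi * xi)
                               for xi in x)


def schwefel(x):
    """Equation 7."""
    return 418.9829 * len(x) + sum(-xi * math.sin(math.sqrt(abs(xi)))
                                   for xi in x)
```

As a quick sanity check, Sphere and Rastrigin evaluate to zero at the origin, Rosenbrock evaluates to zero when every coordinate equals 1, and Schwefel is close to zero when every coordinate is about 420.9687.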
Figure 1, below, depicts each of the standard test functions in two dimensions (i.e., n = 2, in the above equations). For problems with a global minimum at or near the origin, symmetrically positioning the particles in the search space can give a misleading indication of performance [6, 7]. Following the advice of Angeline, for those problems with a global minimum at or near the origin (Sphere, Rastrigin, and Rosenbrock), the initial population of the swarm was asymmetrically positioned in the problem space. The Schwefel function is a deceptive function, in that it’s
Fig. 1. 2D Test Function Plots: (a) Sphere Function, (b) Rosenbrock's Valley, (c) Rastrigin's Function, (d) Schwefel's Function

Table 1. Function Ranges

              Initialization Range      Function Range
Function      Min         Max           Min         Max
Sphere        −100        −50           −100        +100
Rosenbrock    −10         −5            −10         +10
Rastrigin     −5.12       −2.56         −5.12       +5.12
Schwefel      −500        +500          −500        +500
PSO with Mutation
global minimum (at about [420.9687, 420.9687, ...]) is positioned far away from the next best local minimum. For this function we chose to initialize the particles over the entire range of the function. Table 1 lists the full ranges of each of the four standard test functions, along with the initialization range used.

3.2 Neural Network Test Functions

Both a feed-forward neural network and a recurrent neural network architecture were used as test problems. In both cases, the performance of each algorithm was determined solely by the error between the outputs of the neural network and the expected output values as given by the training data. We did not take into consideration the generalization abilities of the networks. Also, the design of the neural networks, including the number of hidden nodes, was based solely on creating an approximate number of weights/variables for the PSO algorithms to optimize (specifically, we were trying to create networks with approximately 400, 800, and 1600 weights). As such, the networks themselves may not be optimal for the problems they are trying to solve.

Feed-forward Neural Network

For the feed-forward network, Kramer's nonlinear PCA neural network [9] was used. As shown in Figure 2, this is a five-layer auto-associative neural network with a bottleneck layer in the middle to reduce the dimensionality of the input variables (the number of nodes in the bottleneck layer is less than the number of nodes in the input/output layers). The following network configurations (nodes per layer) were tested: 1) 12–11–5–11–12, which, including the bias weights, results in 413 weights to optimize; 2) 18–15–8–15–18, which results in 836 weights to optimize; and 3) 24–24–11–24–24, which results in 1763 weights to optimize. Each network was trained on 200 test cases, and the average error of all of the test cases between the input and output values was used as our
Fig. 2. Feedforward PCA Neural Network
fitness function. The test cases were created such that there would be some correlation between the input variables: a non-linear correlation and a linear one. The following scheme was used to generate the inputs for each test case: 1) input[1] = a random variable uniformly chosen over the range [0.0, 1.0]; 2) input[2] = input[1]; and 3) input[3] = 0.75 · input[1]. This pattern was repeated for the remainder of the inputs (4..6, 7..9, 10..12, etc.) for each test case. As such, the number of inputs was always chosen to be a multiple of three.

Recurrent Neural Network

The recurrent neural network consisted of a three-layer network with the previous values of both the middle and output layers fed back into the input layer, similar to the one shown in Figure 3. Three external inputs were connected to the network. The first two of these inputs were generated from a chaotic time series (the logistic map with r = 4), while the third input was a nonlinear combination of the first two inputs. The outputs of the network were the future values of each of the three input values (ranging from two future
Fig. 3. Recurrent Neural Network Architecture
values per input, to four future values per input depending on the network problem). The three inputs were: 1) input[1] = logistic map function, with initial value of 0.3; 2) input[2] = logistic map function, with initial value of 0.5; and 3) input[3] = (input[1])^2 + (input[2])^2. The logistic map function, with r = 4, is: f(x_{t+1}) = 4x_t · (1 − x_t). The following network configurations (nodes per layer) were tested: 1) (3 + 20)–14–6, which, including the bias weights, results in 426 weights to optimize; 2) (3 + 29)–20–9, which results in 849 weights to optimize; and 3) (3 + 40)–28–12, which results in 1580 weights to optimize. The number of input nodes was always 3 plus the number of middle and output layer nodes. Each network was trained on 200 successive values of the input series.
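The weight counts quoted for both network types follow from counting one weight per connection plus one bias per non-input node. The sketch below reproduces that arithmetic and iterates the logistic map; the helper names are hypothetical, and treating the initial value as the first element of the series is an assumption.

```python
def count_weights(layers):
    """Weights in a fully connected layered network, with one bias per non-input node."""
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layers, layers[1:]))


def recurrent_layers(n_external, n_hidden, n_output):
    """Layer sizes when hidden and output values are fed back into the input layer."""
    return [n_external + n_hidden + n_output, n_hidden, n_output]


def logistic_series(x0, length, r=4.0):
    """Iterate x <- r * x * (1 - x), returning `length` values starting with x0."""
    xs = [x0]
    for _ in range(length - 1):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


# Feed-forward configurations: 413, 836, and 1763 weights.
ff_counts = [count_weights(cfg) for cfg in
             ([12, 11, 5, 11, 12], [18, 15, 8, 15, 18], [24, 24, 11, 24, 24])]

# Recurrent configurations: 426, 849, and 1580 weights.
rnn_counts = [count_weights(recurrent_layers(3, h, o))
              for h, o in ((14, 6), (20, 9), (28, 12))]
```

For example, the 12–11–5–11–12 network has 13·11 + 12·5 + 6·11 + 12·12 = 413 weights, and the (3 + 20)–14–6 network has 24·14 + 15·6 = 426.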
4 Results and Discussion

Tables 2, 3, and 4 present the final results of the optimization problems defined earlier. In all of the functions tested, the combination of a mutation operator together with a random constriction coefficient (PSO.cmb) resulted in the best performance. With respect to the PSO variants only, the mutation operator by itself (PSO.mut) resulted in the next best performance, while the random constriction coefficient by itself (PSO.rnd) performed similarly to the basic PSO algorithm (bPSO). The bPSO algorithm had a tendency to converge prematurely to relatively poor solutions. The graphs on the following pages (Figures 4, 5, and 6) plot the results of the test functions, over all iterations, for each of the PSO variants. The Sphere function (the first plot shown in each of the 400, 800, and 1600 dimensional groupings) is useful for testing the efficiency of each of the algorithms being

Table 2. Fitness Results: 400 Dimensions
(400)              bPSO           DE             PSO.mut        PSO.rnd        PSO.cmb
Sphere             539,918        101.8          7.163          389,551        3.82×10^−6
5,000 iterations   σ=66,782       σ=48.9         σ=1.947        σ=83,469       σ=2.57×10^−6
Rastrigin          4,455          3,600          674            4,391          395
10,000 iterations  σ=295          σ=307          σ=105          σ=311          σ=57
Rosenbrock         46,719,745     843            918            27,606,006     413
10,000 iterations  σ=8,543,113    σ=108          σ=122          σ=5,388,505    σ=139
Schwefel           81,092         133,310        22,715         90,801         21,825
15,000 iterations  σ=5,543        σ=4,605        σ=1,070        σ=6,555        σ=1,317
NN-PCA*            14.85          17.08          10.80          14.08          6.56
20,000 iterations  σ=1.99         σ=0.57         σ=1.41         σ=1.92         σ=2.63
RNN*               13.42          21.86          14.91          18.32          13.76
20,000 iterations  σ=3.65         σ=4.43         σ=3.82         σ=4.04         σ=5.03
Table 3. Fitness Results: 800 Dimensions

(800)              bPSO           DE             PSO.mut        PSO.rnd        PSO.cmb
Sphere             1,213,534      3,897          54.8           958,082        1.09×10^−5
10,000 iterations  σ=189,567      σ=2,446        σ=10.4         σ=143,965      σ=3.82×10^−6
Rastrigin          9,453          3,491          2,200          9,608          1,307
10,000 iterations  σ=679          σ=920          σ=153          σ=564          σ=156
Rosenbrock         1.05×10^8      59,367         5,340          69,771,572     1,235
10,000 iterations  σ=12,675,401   σ=31,251       σ=456          σ=6,679,159    σ=152
Schwefel           173,138        274,675        47,668         199,200        44,917
25,000 iterations  σ=14,659       σ=17,594       σ=1,963        σ=11,560       σ=1,430
NN-PCA*            14.41          14.85          10.89          14.35          9.80
25,000 iterations  σ=1.47         σ=1.98         σ=0.95         σ=1.56         σ=1.12
RNN*               12.14          16.56          11.62          13.84          10.74
25,000 iterations  σ=2.61         σ=0.61         σ=2.79         σ=2.92         σ=3.25
Table 4. Fitness Results: 1600 Dimensions

(1600)             bPSO           DE             PSO.mut        PSO.rnd        PSO.cmb
Sphere             2,897,640      165,626        6,762          2,682,835      0.073
12,500 iterations  σ=289,354      σ=21,200       σ=829          σ=304,273      σ=0.075
Rastrigin          19,871         5,836          6,609          21,262         3,262
15,000 iterations  σ=1,222        σ=224          σ=312          σ=1,009        σ=345
Rosenbrock         2.34×10^8      3,230,033      55,786         1.90×10^8      3,539
15,000 iterations  σ=21,149,219   σ=822,702      σ=6,176        σ=14,150,730   σ=252
Schwefel           361,471        556,345        119,508        441,430        102,464
30,000 iterations  σ=24,915       σ=83,514       σ=5,119        σ=18,460       σ=4,279
NN-PCA*            13.97          12.36          11.08          13.54          9.44
20,000 iterations  σ=1.17         σ=0.31         σ=1.15         σ=1.40         σ=0.45
RNN*               17.45          13.19          10.44          11.73          10.20
20,000 iterations  σ=2.04         σ=0.34         σ=2.82         σ=2.66         σ=2.94
compared. There are no local minima to get stuck in; there is only one global optimal solution centered at the origin, and, regardless of where a particle is positioned in the search space, following the gradient information will lead directly to that optimal solution. Yet, despite these favorable conditions, the bPSO algorithm still gets stuck in a non-optimal solution. PSO's problem of premature convergence appears to have a much more noticeable effect in higher dimensions. Adding a small mutation to the PSO algorithm, however, seems to give the swarm a needed push that helps to keep it from getting stuck. While the bPSO algorithm converges prematurely in each of the Sphere tests, the PSO.mut algorithm continues to improve its fitness throughout each of the iterations. The random constriction coefficient (PSO.rnd) by itself did not offer much improvement over the bPSO algorithm, but combining it with the mutation operator (PSO.cmb) resulted in the best performance.

The relative performance of the algorithms was found to be similar on all of the other test problems, including the two neural network problems. In each case the bPSO algorithm performed relatively poorly, while the algorithms that performed the best were those with the mutation operator. The combination of the mutation operator along with the random constriction coefficient performed the best in all instances.
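To make the interplay of the two modifications concrete, the following is a rough sketch of one iteration of a fully connected swarm with an optional random constriction coefficient and an optional mutation. The exact operators are defined outside this section, so everything specific here is an assumption rather than the chapter's method: the Clerc–Kennedy style velocity update with c1 = c2 = 2.05, drawing the coefficient uniformly from `chi_range` once per particle, and the Gaussian per-coordinate mutation controlled by the hypothetical parameters `mutation_rate` and `mutation_scale`.

```python
import random


def pso_step(pos, vel, pbest, pbest_fit, objective, bounds,
             chi_range=(0.7298, 0.7298), mutation_rate=0.0,
             mutation_scale=0.01, c1=2.05, c2=2.05, rng=random):
    """Advance a fully connected ("star") swarm by one iteration, in place.

    Returns the best personal-best fitness after the update (minimisation).
    """
    lo, hi = bounds
    g = min(range(len(pos)), key=lambda k: pbest_fit[k])
    gbest = list(pbest[g])  # copy, so the target stays fixed during this step
    for i in range(len(pos)):
        # Assumed random constriction: one uniform draw per particle.
        # With equal end points this reduces to a fixed coefficient.
        chi = rng.uniform(chi_range[0], chi_range[1])
        for j in range(len(pos[i])):
            pull = (c1 * rng.random() * (pbest[i][j] - pos[i][j])
                    + c2 * rng.random() * (gbest[j] - pos[i][j]))
            vel[i][j] = chi * (vel[i][j] + pull)
            pos[i][j] += vel[i][j]
            # Assumed mutation: small Gaussian perturbation of the coordinate.
            if rng.random() < mutation_rate:
                pos[i][j] += rng.gauss(0.0, mutation_scale * (hi - lo))
        f = objective(pos[i])
        if f < pbest_fit[i]:
            pbest[i], pbest_fit[i] = list(pos[i]), f
    return min(pbest_fit)
```

With the default arguments the step behaves like a plain constricted PSO; passing a wider `chi_range` or a non-zero `mutation_rate` switches on the corresponding modification, so the four variants compared in the text map onto four argument combinations of this one function.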
Fig. 4. Test Functions in 400 Dimensions: (a) Sphere, (b) Rastrigin, (c) Rosenbrock, (d) Schwefel, (e) NN-PCA (413), (f) RNN (426)
Fig. 5. Test Functions in 800 Dimensions: (a) Sphere, (b) Rastrigin, (c) Rosenbrock, (d) Schwefel, (e) NN-PCA (836), (f) RNN (849)
Fig. 6. Test Functions in 1600 Dimensions: (a) Sphere, (b) Rastrigin, (c) Rosenbrock, (d) Schwefel, (e) NN-PCA (1763), (f) RNN (1580)
Fig. 7. Diversity Plots (400 Dimensions): (a) Sphere, (b) Rastrigin, (c) Rosenbrock, (d) Schwefel
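The swarm diversity plotted in Fig. 7 (Eq. 8) can be computed along the following lines. This is a sketch under two assumptions: the distance of each particle to the average point is the Euclidean distance, as in the arPSO diversity measure [11], and D is taken as the diagonal of the box spanned by the per-dimension lower and upper bounds. The function and argument names are illustrative.

```python
import math


def swarm_diversity(positions, lower, upper):
    """Mean distance of the particles to the swarm's average point,
    divided by the length of the search-space diagonal."""
    n_particles = len(positions)
    n_dims = len(positions[0])
    # Average point of the swarm, one value per dimension.
    mean = [sum(p[j] for p in positions) / n_particles for j in range(n_dims)]
    # Length of the longest diagonal of the box [lower, upper].
    diagonal = math.sqrt(sum((hi - lo) ** 2 for lo, hi in zip(lower, upper)))
    total = 0.0
    for p in positions:
        total += math.sqrt(sum((p[j] - mean[j]) ** 2 for j in range(n_dims)))
    return total / (n_particles * diagonal)
```

A swarm that has collapsed onto a single point has diversity 0, while two particles at (0, 0) and (2, 0) in a box with diagonal 5 give a diversity of 0.2.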
All of the PSO algorithms being tested still had a tendency, albeit to varying degrees, to converge to non-optimal solutions. This was most noticeable for the Schwefel function in all dimensions, and in general for all functions it became more noticeable as the dimensionality of the problem increased. Even the PSO.cmb algorithm appeared to have moments of difficulty on the Sphere function in 1600 dimensions (although it did manage to recover).

The final set of graphs plots the diversity of the swarm for the Sphere, Rastrigin, Rosenbrock, and Schwefel functions in 400 dimensions. The diversity of the swarm is given by:

\[ \mathrm{Diversity} = \frac{1}{|S| \cdot D} \sum_{i=1}^{|S|} \sqrt{\sum_{j=1}^{N} \left( p_{ij} - \bar{p}_j \right)^2} \qquad (8) \]

where |S| is the swarm size, D is the length of the largest diagonal in the problem, N is the number of dimensions of the problem, p_{ij} is the j-th value of the i-th particle, and \bar{p}_j is the j-th value of the average point \bar{p}. In all cases, the PSO.rnd algorithm quickly achieves the lowest diversity compared with all of the other algorithms, suggesting that the random constriction coefficient has the effect of speeding up the convergence of the swarm. The PSO.mut algorithm, on the other hand, has a diversity greater than or roughly equal to the bPSO algorithm in all cases. This is to be expected, as the random mutations should increase the diversity of the swarm.

The best performing variant, the PSO.cmb algorithm, was subjected both to forces trying to increase its diversity (i.e., the mutation operator) and to those trying to decrease its diversity (i.e., the random constriction coefficient), with results somewhere in between depending on the problem. In all cases the diversity of the PSO.cmb algorithm decreased very quickly in the first few iterations, much more so than the bPSO algorithm, and generally matching the decreased diversity of the PSO.rnd algorithm. After the first few iterations of the Rastrigin and Schwefel functions, however, the PSO.cmb algorithm's diversity started to match that of the PSO.mut algorithm. Conversely, on the Sphere and Rosenbrock functions the PSO.cmb algorithm continued to decrease its diversity, similar to PSO.rnd, for an extended period of time before levelling off (eventually, near the end of the iterations for both of these latter cases, the bPSO algorithm's diversity continued to decrease until it fell below that of the PSO.cmb algorithm).

4.1 Comparison with Differential Evolution

The DE algorithm performed quite well on its own, beating the "standard" bPSO algorithm on most, but not all, test functions.
Note especially the Rastrigin function, where the DE algorithm in 800 dimensions performed better than it did on the same problem in 400 dimensions. However, the PSO.cmb algorithm still performed the best out of all of the algorithms on all of the test problems. The PSO.mut algorithm was also better than the DE algorithm on all but two problems (Rosenbrock in 400 dimensions, and the Rastrigin function in 1600 dimensions). In general, the two PSO variants with a mutation operator (PSO.mut and PSO.cmb) performed well across all of the problems tested, whereas the DE algorithm performed relatively well on some problems, but not so well on others (such as the Schwefel function). Also, on problems such as the Rosenbrock function, the DE algorithm performed noticeably worse as the dimensionality of the problem increased, whereas the PSO.mut and PSO.cmb algorithms appear to handle the increasing problem dimensions better.
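For readers unfamiliar with DE, one generation of a common textbook variant is sketched below. It uses F = 0.5 and C = 0.8 as in the comparison, but the scheme itself (classic DE/rand/1 with binomial crossover and greedy selection) is an assumption and may differ in detail from the strategy actually used in the experiments. The function name and the `rng` argument are illustrative.

```python
import random


def de_generation(population, fitness, objective, F=0.5, C=0.8, rng=random):
    """One generation of DE/rand/1 with binomial crossover (minimisation)."""
    n = len(population)
    dims = len(population[0])
    new_pop, new_fit = [], []
    for i in range(n):
        # Pick three distinct individuals, all different from the target i.
        a, b, c = rng.sample([k for k in range(n) if k != i], 3)
        j_rand = rng.randrange(dims)  # at least one coordinate comes from the mutant
        trial = []
        for j in range(dims):
            if j == j_rand or rng.random() < C:
                trial.append(population[a][j]
                             + F * (population[b][j] - population[c][j]))
            else:
                trial.append(population[i][j])
        f_trial = objective(trial)
        # Greedy selection: the trial replaces the target only if it is no worse.
        if f_trial <= fitness[i]:
            new_pop.append(trial)
            new_fit.append(f_trial)
        else:
            new_pop.append(population[i])
            new_fit.append(fitness[i])
    return new_pop, new_fit
```

Because of the greedy selection step, the fitness of every slot in the population is non-increasing from one generation to the next.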
5 Conclusions and Future Work

This chapter investigated the use of Particle Swarm Optimization on very high dimensional problems. PSO has been known to have a problem with premature convergence; in higher dimensions this issue becomes much more noticeable. However, adding a mutation operator to the basic PSO algorithm noticeably improves PSO's performance on these types of problems. Further improvement was found with the addition of a random constriction coefficient. The two PSO algorithms with a mutation operator (especially the PSO.cmb algorithm) also seemed to perform relatively well across all of the problem types tested. This is in contrast to the variant of DE that we experimented with, which performed well on some problems but not so well on others.

Although the addition of a mutation operator improved the performance of the basic PSO algorithm, in many cases significantly, the modified PSO algorithm is still prone to premature convergence. This problem becomes more noticeable as the dimensionality of the function increases. As previously mentioned, other techniques that address premature convergence in PSO, such as Adaptive-Repulsive PSO, have been suggested by others. It would be interesting to try these techniques on these higher dimensional problems, both with and without the mutation operator, to see if further improvement could be obtained. Additional experiments could also be carried out to investigate various rates of mutation, as well as different values for the constriction coefficient (including various random ranges such as [0.2, 1.1]), or even alternatives to the random constriction coefficient. We are also currently investigating the use of a mutation operator on PSO in lower dimensional problems, with promising results.
References

1. Kennedy J, Eberhart R (2001) Swarm intelligence. Morgan Kaufmann, San Francisco, CA
2. Clerc M, Kennedy J (2002) The particle swarm: explosion, stability, and convergence in a multidimensional complex space. IEEE Trans Evol Comput 6:58–73
3. Kennedy J (1999) Small worlds and mega-minds: effects of neighborhood topology on particle swarm performance. In: Proceedings of the 1999 Congress of Evolutionary Computation, vol 3. IEEE, pp 1931–1938
4. Eberhart R, Shi Y (1998) A modified particle swarm optimizer. IEEE Int Conf Evol Comput. Anchorage, Alaska
5. Bellman R (1961) Adaptive control processes: a guided tour. Princeton University Press
6. Fogel D, Beyer H-G (1996) A note on the empirical evaluation of intermediate recombination. Evol Comput 3(4):491–495
7. Angeline PJ (1998) Evolutionary optimization versus particle swarm optimization. In: Porto VW, Saravanan N, Waagen D, Eiben AE (eds) Evolutionary programming VII. Lecture notes in computer science, vol 1447. Springer, Berlin Heidelberg New York, pp 601–610
8. Vesterstroem J, Riget J, Krink T (2002) Division of labor in particle swarm optimisation. In: Proceedings of the IEEE congress on evolutionary computation (CEC 2002), Honolulu, Hawaii
9. Kramer MA (1991) Nonlinear principal component analysis using autoassociative neural networks. AIChE J 37(2):233–243
10. Riget J, Vesterstroem J, Krink T (2002) Proceedings of the Congress on Evolutionary Computation, CEC '02 2:1474–1479
11. Riget J, Vesterstroem J (2002) A diversity-guided particle swarm optimizer – the arPSO. In: EVALife, Department of Computer Science, University of Aarhus, Denmark
12. Lampinen J, Storn R (2004) Differential evolution. In: Onwubolu GC, Babu BV (eds) New optimization techniques in engineering. Studies in fuzziness and soft computing, vol 141. Springer, Berlin Heidelberg, New York, pp 123–166
Index
λ-Interchange, 388, 393 2-Opt, 388, 393
back-propagation, 180, 329, 338 back-propagation learning algorithm, 23 backpropagation algorithm, 290 brain-computer interface, 313 breeder genetic algorithm, 302
switch, 353–356 chromosome, 338, 339, 342, 346, 347 classification, 180 Clustering, 215, 217–219, 222, 223, 225, 236 coastal engineering, 340 codon, 253 combined Anti Retroviral Therapy (cART), 254 competing convention problem, 295 comprehensibility, 186 connection weights, 291 Constructive algorithms, 7, 295 Constructive backpropagation, 138, 139, 147–152, 154 cooperative coevolutionary models, 296 correlation value, 346, 347 crossover, 338, 339, 342, 346–349 curse of dimensionality, 424
Capacitated Vehicle Routing Problem (CVRP), 380, 382 CD4+ T cell, 255 Cellular Genetic Algorithm (cGA), 381 cellular mobile networks, 354, 355, 358, 360, 362, 372–374 cellular networks, 353, 358, 372 base station, 353 handoff, see handoff cell assignment, 354–356, 358, 364, 365, 369, 373 problem definition, 354 costs, 354
Data clustering, 158, 159, 163, 164, 172, 173, 175 Data decomposition, 157, 159, 165, 166, 172, 173 degree of saturation, 340, 341, 343–345, 350 destructive algorithms, 7, 295 direct encoding, 294 Divide-and-conquer, 130, 131 drug resistance, 253 drug susceptibility, 257 dynamic recurrent neural networks, 109
accuracy, 186 activation function, 178, 301 Akaike Information Criterion (AIC), 272 amino acid, 252 analytical solutions, 337 ANN model, 337–339, 342, 344, 346, 347, 350 ANN training, 338, 346, 350 architecture, 292 artificial neural networks, 290, 337
edge recombination operator (ERX), 386 elitist model, 302 engineering applications, 337, 338, 346, 347 Estimation of Distribution Algorithms, 1 evolution of learning rules, 8 evolution programs, 292 Evolution Strategies, 1, 292 Evolutionary algorithms, 130, 131, 135, 136, 138, 140, 144, 149, 150, 154, 158, 184, 289 Evolutionary architecture adaptation, 7 Evolutionary Artificial Neural Network, 5, 290 Evolutionary Clustering, 14 evolutionary computation, 289 Evolutionary Design of Complex Paradigms, 15 Evolutionary Intelligent Systems, 2 Evolutionary Programming, 1, 292 Evolutionary search for connection weights, 6 Evolutionary Search of Fuzzy Membership Functions, 12 Evolutionary Search of Fuzzy Rule Base, 12 evolved inductive self-organizing networks, 109 evolving ANN behaviors, 296 exponential distributions, 298 feature selection, 271 feedforward, 328, 341 fitness, 293, 300 fitness function, 338, 342 Fuzzy Connective Based (FCB) crossover, 267 fuzzy feature selection, 272 Fuzzy Genetic Algorithms (FGA), 267 Fuzzy Inference System (FIS), 283 Fuzzy logic controller, 211, 213, 226 fuzzy medical diagnosis, 258 fuzzy polynomial neurons, 59 fuzzy relational composition, 258 fuzzy rule-based computing, 59
GA operation, 339, 346, 348 GAs, 337, 338, 341, 342, 346, 347 Gaussian function, 329 gene, 252 Generalization, 131, 138, 140, 144–146, 149, 150, 152–154, 330 generation, 338, 347, 348 Genetic algorithms, 1, 213, 214, 216, 267, 292, 337, 338, 355, 359–361, 381 crossover, 359 evolution, 360 mutation, 359 genetic design approach, 61 Genetic Programming, 1, 292 genetic tuning, 12 genetically optimized Hybrid Fuzzy Neural Networks, 24 genetically optimized multilayer perceptron, 59 genotype, 252, 291 GEX method, 197 global methods, 189 Gradient descent, 129–131, 133, 135–138, 149, 154, 290 gradient descent-based techniques, 338 Group Method of Data Handling, 59 Hamacher norm, 264 handoff, 353, 356, 357 complex, 354, 356–358 cost, 354, 357, 358, 366, 372, 373 frequency, rate, 354, 358, 367 frequency,rate, 367 probability, 367 soft, 353 hidden layer, 341 Highly Active Anti Retroviral Therapy (HAART), 254 HIV-1 dynamics, 265 HIV-1 replication, 252 HIV-1 treatments, 253 Hybrid Fuzzy Neural Networks, 23 hybrid technique, 337–339 Hybrid training, 129, 132, 137, 149 in-vitro cultures, 254 indirect encoding, 294 input data, 344 input space, 328
Index JCell, 383 joint optimization of structure and weights, 291 Learning Classifier Systems, 1 learning rate, 294 learning rules, 291 Linear regression, 211, 215, 219, 220, 222–224, 231, 243, 248 Local optima, 129, 131, 136, 137, 144 loss function, 270 marine structures, 337, 339 maximum liquefaction depth, 341, 344, 347, 350 Messy genetic algorithms, 134 meta-learning evolutionary artificial neural network, 10 modelling nonlinear systems, 109 momentum, 294 multi-layer perceptrons, 109, 297 Multiobjective Evolutionary Design, 16 mutations, 253, 302, 338, 339, 342, 346–349 Neural networks, 129, 131–134, 137, 140–143, 149, 157–160, 162, 170, 178 feedforward, 179 recurrent, 179 neural-genetic, 337, 341, 344, 346–350 neuro-genetic systems, 289 neurons, 178, 337, 342 normal distribution, 298 optimization algorithms, 266 output data, 344 Output parallelism, 130, 131, 144, 147, 149–151, 153, 154, 158, 161, 170, 172 parametric optimization, 23 partial descriptions, 110 Particle Swarm Optimization, 423 Pattern distributor, 129, 130, 139, 142, 146, 147, 149–152, 154 permutation problem, 295 perspective clinical trials, 255 phenotype, 254, 291
Polynomial Neural Networks, 23 polynomial neurons, 59 pore pressure, 343, 344 poro-elastic model, 340–342, 344, 347 prediction, 337, 338, 340–342, 345–347, 349 prepositional rule, 181 fuzzy, 181 Pseudo global optima, 129, 130, 137, 138, 144, 149 radial basis function, 109 radial basis function network (RBFN), 328, 329, 332 Random Search (RS), 270 resilient backpropagation algorithm, 299 retrospective clinical cohorts, 255 REX methods, 190 root mean square error, 346 roulette wheel, 342 rule extraction, 181 global methods, 183 local methods, 182, 186 Michigan approach, 190, 197 Pitt approach, 190 seabed liquefaction, 337, 339, 344, 349, 350 selection, 302, 338, 342, 346 self-organising map (SOM), 328 self-organizing neural networks, 59 Self-organizing Polynomial Neural Networks, 110 soil permeability, 344, 345, 347, 350 Supervised learning, 157 System modelling, 109 tabu search, 355, 360, 364, 369, 372–374 list, 360 properties, 361 Takagi and Sugeno approach, 211, 213–215, 220, 238, 248 tangent sigmoid transfer function, 301 Task decomposition, 129, 130, 154 training, 338, 341, 342, 345, 346, 350 truncation, 302
variance, 298 Vehicle Routing Problem (VRP), 380, 382 viral load, 255
Weak learners, 157, 158 weights, 338, 339, 341, 342, 346 wild type HIV-1, 253