Adaptive and Natural Computing Algorithms, Part I - ICANNGA 2011


Author: Andrej Dobnikar | Uros Lotric | Branko Ster (Editors)


Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

Editorial Board

David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Alfred Kobsa, University of California, Irvine, CA, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, TU Dortmund University, Germany
Madhu Sudan, Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max Planck Institute for Informatics, Saarbruecken, Germany

6593

Andrej Dobnikar, Uroš Lotrič, Branko Šter (Eds.)

Adaptive and Natural Computing Algorithms 10th International Conference, ICANNGA 2011 Ljubljana, Slovenia, April 14-16, 2011 Proceedings, Part I


Volume Editors

Andrej Dobnikar
Uroš Lotrič
Branko Šter

University of Ljubljana, Faculty of Computer and Information Science
Tržaška 25, 1000 Ljubljana, Slovenia
E-mail: {andrej.dobnikar, uros.lotric, branko.ster}@fri.uni-lj.si

ISSN 0302-9743 e-ISSN 1611-3349 ISBN 978-3-642-20281-0 e-ISBN 978-3-642-20282-7 DOI 10.1007/978-3-642-20282-7 Springer Heidelberg Dordrecht London New York Library of Congress Control Number: 2011923992 CR Subject Classification (1998): F.1-2, I.2, D.2.2, D.4.7, D.1, I.4 LNCS Sublibrary: SL 1 – Theoretical Computer Science and General Issues

© Springer-Verlag Berlin Heidelberg 2011 This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper Springer is part of Springer Science+Business Media (www.springer.com)

Preface

The 2011 edition of ICANNGA marked the 10th anniversary of the conference series, started in 1993 in Innsbruck, Austria, where it was decided to have a similar scientific meeting organized biennially. Since then, and with considerable success, the conference has taken place in Ales in France (1995), Norwich in the UK (1997), Portorož in Slovenia (1999), Prague in the Czech Republic (2001), Roanne in France (2003), Coimbra in Portugal (2005), Warsaw in Poland (2007), and Kuopio in Finland (2009), while this year, for the second time, in Slovenia, in its capital Ljubljana (2011).

The Faculty of Computer and Information Science of the University of Ljubljana was pleased and honored to host this conference. We chose the old university palace as the conference site in order to keep the traditionally good academic atmosphere of the meeting. It is located in the very centre of the capital and is surrounded by many cultural and touristic sights.

The ICANNGA conference was originally limited to neural networks and genetic algorithms, and was named after this primary orientation: International Conference on Artificial Neural Networks and Genetic Algorithms. Very soon the conference broadened its outlook and in Coimbra (2005) the same abbreviation got a new meaning: International Conference on Adaptive and Natural computiNG Algorithms. Thereby the popular short name remained and yet the conference is widely open to many new disciplines related to adaptive and natural algorithms.

This year we received 144 papers from 33 countries. After a peer-review process by at least two reviewers per paper, 83 papers were accepted and included in the proceedings. The papers were divided into seven groups: neural networks, evolutionary computation, pattern recognition, soft computing, system theory, support vector machines, and bio-informatics. The submissions were recommended for oral and for poster presentation.
The ICANNGA 2011 plenary lectures were planned to combine several compatible disciplines like adaptive computation (Rudolf Albrecht), artificial intelligence (Ivan Bratko), synthetic biology and biomolecular modelling of new biological systems (Roman Jerala), computational neurogenetic modelling (Nikola Kasabov), and robots with biological brains (Kevin Warwick). We believe these discussions served as an inspiration for future contributions.

One of the traditions of all ICANNGA conferences so far has been to combine pleasantness and usefulness. The cultural and culinary traditions of the organizing country helped to create an atmosphere for a successful and friendly meeting.

We would like to thank the Advisory Committee for their guidance, advice and discussions. Furthermore, we wish to express our gratitude to the Program Committee, the reviewers and sub-reviewers for their substantial work in revising the papers. Our recognition also goes to Springer, our publisher, and especially to Alfred Hofmann, Editor-in-Chief of LNCS, for his support and collaboration. Many thanks go to the agency Go-mice and its representative Natalija Bah Čad for her help and effort. And last but not least, on behalf of the Organizing Committee of ICANNGA 2011, we want to express our special recognition to all the participants, who contributed enormously to the success of the conference.

We hope that you will enjoy reading this volume and that you will find it inspiring and stimulating for your future work and research.

April 2011

Andrej Dobnikar
Uroš Lotrič
Branko Šter

Organization

ICANNGA 2011 was organized by the Faculty of Computer and Information Science, University of Ljubljana, Slovenia.

Advisory Committee

Rudolf Albrecht, University of Innsbruck, Austria
Bartlomiej Beliczynski, Warsaw University of Technology, Poland
Andrej Dobnikar, University of Ljubljana, Slovenia
Mikko Kolehmainen, University of Eastern Finland, Finland
Vera Kurkova, Academy of Sciences of the Czech Republic, Czech Republic
David Pearson, University Jean Monnet of Saint-Etienne, France
Bernardete Ribeiro, University of Coimbra, Portugal
Nigel Steele, Coventry University, UK

Program Committee

Andrej Dobnikar, Slovenia (Chair)
Jarmo Alander, Finland
Rudolf Albrecht, Austria
Rubén Armañanzas, Spain
Bartlomiej Beliczynski, Poland
Ernesto Costa, Portugal
Janez Demšar, Slovenia
Antonio Dourado, Portugal
Stefan Figedy, Slovakia
Alexandru Floares, Romania
Juan A. Gomez-Pulido, Spain
Barbara Hammer, Germany
Honggui Han, China
Osamu Hoshino, Japan
Marcin Iwanowski, Poland
Martti Juhola, Finland
Paul C. Kainen, USA
Helen Karatza, Greece
Kostas D. Karatzas, Greece
Nikola Kasabov, New Zealand
Mikko Kolehmainen, Finland
Igor Kononenko, Slovenia
Jozef Korbicz, Poland
Vera Kurkova, Czech Republic
Kauko Leiviska, Finland
Aleš Leonardis, Slovenia
Uroš Lotrič, Slovenia
Danilo P. Mandic, UK
Francesco Masulli, Italy
Roman Neruda, Czech Republic
Stanislaw Osowski, Poland
David Pearson, France
Jan Peters, Germany
Bernardete B. Ribeiro, Portugal
Juan M. Sanchez-Perez, Spain
Catarina Silva, Portugal
Nigel Steele, UK
Branko Šter, Slovenia
Miroslaw Swiercz, Poland
Ryszard Tadeusiewicz, Poland
Tatiana Tambouratzis, Greece
Miguel A. Vega-Rodriguez, Spain
Kevin Warwick, UK
Blaž Zupan, Slovenia


Organizing Committee

Andrej Dobnikar
Uroš Lotrič
Branko Šter
Nejc Ilc
Davor Sluga
Jernej Zupanc
Natalija Bah Čad

Reviewers

Jarmo Alander, Rudolf Albrecht, Ana de Almeida, Mário Joao Antunes, Rubén Armañanzas, Iztok Lebar Bajec, Bartlomiej Beliczynski, Zoran Bosnić, Ernesto Costa, Janez Demšar, Andrej Dobnikar, Antonio Dourado, Stefan Figedy, Alexandru Floares, Juan A. Gomez-Pulido, Črtomir Gorup, Barbara Hammer, Honggui Han, Jorge Henriques, Osamu Hoshino, Marcin Iwanowski, Martti Juhola, Paul C. Kainen, Helen Karatza, Kostas D. Karatzas, Nikola Kasabov, Mikko Kolehmainen, Igor Kononenko, Jozef Korbicz, Vera Kurkova, Kauko Leiviska, Aleš Leonardis, Pedro Luis López-Cruz, Uroš Lotrič, Danilo P. Mandic, Francesco Masulli, Neža Mramor Kosta, Miha Mraz, Roman Neruda, Dominik Olszewski, Stanislaw Osowski, David Pearson, Jan Peters, Matija Polajnar, Mengyu Qiao, Bernardete B. Ribeiro, Marko Robnik Šikonja, Mauno Rönkkö, Gregor Rot, Aleksander Sadikov, Juan M. Sanchez-Perez, Catarina Silva, Danijel Skočaj, Nigel Steele, Miroslaw Swiercz, Miha Štajdohar, Branko Šter, Ryszard Tadeusiewicz, Tatiana Tambouratzis, Marko Toplak, Miguel A. Vega-Rodriguez, Alen Vrečko, Kevin Warwick, Blaž Zupan, Jure Žabkar, Lan Žagar, Jure Žbontar

Table of Contents – Part I

Plenary Session

Autonomous Discovery of Abstract Concepts by a Robot
    Ivan Bratko

Neural Networks

Kernel Networks with Fixed and Variable Widths
    Věra Kůrková and Paul C. Kainen
Evaluating Reliability of Single Classifications of Neural Networks
    Darko Pevec, Erik Štrumbelj, and Igor Kononenko
Nonlinear Predictive Control Based on Multivariable Neural Wiener Models
    Maciej Ławryńczuk
Methods of Integration of Ensemble of Neural Predictors of Time Series - Comparative Analysis
    Stanislaw Osowski and Krzysztof Siwek
A Rejection Option for the Multilayer Perceptron Using Hyperplanes
    Eduardo Gasca A., Sergio Saldaña T., José S. Sánchez G., Valentín Velásquez G., Eréndira Rendón L., Itzel M. Abundez B., Rosa M. Valdovinos R., and Rafael Cruz R.
Parallelization of Algorithms with Recurrent Neural Networks
    João Pedro Neto and Fernando Silva
Parallel Training of Artificial Neural Networks Using Multithreaded and Multicore CPUs
    Olena Schuessler and Diego Loyola
Supporting Diagnostics of Coronary Artery Disease with Neural Networks
    Matjaž Kukar and Ciril Grošelj
The Right Delay: Detecting Specific Spike Patterns with STDP and Axonal Conduction Delays
    Arvind Datadien, Pim Haselager, and Ida Sprinkhuizen-Kuyper
New Measure of Boolean Factor Analysis Quality
    Alexander A. Frolov, Dusan Husek, and Pavel Yu. Polyakov
Mechanisms of Adaptive Spatial Integration in a Neural Model of Cortical Motion Processing
    Stefan Ringbauer, Stephan Tschechne, and Heiko Neumann
Self-organized Short-Term Memory Mechanism in Spiking Neural Network
    Mikhail Kiselev
Approximation of Functions by Multivariable Hermite Basis: A Hybrid Method
    Bartlomiej Beliczynski
Using Pattern Recognition to Predict Driver Intent
    Firas Lethaus, Martin R.K. Baumann, Frank Köster, and Karsten Lemmer
Neural Networks Committee for Improvement of Metal’s Mechanical Properties Estimates
    Olga A. Mishulina, Igor A. Kruglov, and Murat B. Bakirov
Logarithmic Multiplier in Hardware Implementation of Neural Networks
    Uroš Lotrič and Patricio Bulić
Efficiently Explaining Decisions of Probabilistic RBF Classification Networks
    Marko Robnik-Šikonja, Aristidis Likas, Constantinos Constantinopoulos, Igor Kononenko, and Erik Štrumbelj
Evolving Sum and Composite Kernel Functions for Regularization Networks
    Petra Vidnerová and Roman Neruda
Optimisation of Concentrating Solar Thermal Power Plants with Neural Networks
    Pascal Richter, Erika Ábrahám, and Gabriel Morin
Emergence of Attention Focus in a Biologically-Based Bidirectionally-Connected Hierarchical Network
    Mohammad Saifullah and Rita Kovordányi
Visualizing Multidimensional Data through Multilayer Perceptron Maps
    Antonio Neme and Antonio Nido
Input Separability in Living Liquid State Machines
    Robert L. Ortman, Kumar Venayagamoorthy, and Steve M. Potter
Predictive Control of a Distillation Column Using a Control-Oriented Neural Model
    Maciej Ławryńczuk
Neural Prediction of Product Quality Based on Pilot Paper Machine Process Measurements
    Paavo Nieminen, Tommi Kärkkäinen, Kari Luostarinen, and Jukka Muhonen
A Robotic Scenario for Programmable Fixed-Weight Neural Networks Exhibiting Multiple Behaviors
    Guglielmo Montone, Francesco Donnarumma, and Roberto Prevete
Self-Organising Maps in Document Classification: A Comparison with Six Machine Learning Methods
    Jyri Saarikoski, Jorma Laurikkala, Kalervo Järvelin, and Martti Juhola
Analysis and Short-Term Forecasting of Highway Traffic Flow in Slovenia
    Primož Potočnik and Edvard Govekar

Evolutionary Computation

A New Method of EEG Classification for BCI with Feature Extraction Based on Higher Order Statistics of Wavelet Components and Selection with Genetic Algorithms
    Marcin Kolodziej, Andrzej Majkowski, and Remigiusz J. Rak
Regressor Survival Rate Estimation for Enhanced Crossover Configuration
    Alina Patelli and Lavinia Ferariu
A Study on Population’s Diversity for Dynamic Environments
    Anabela Simões, Rui Carvalho, João Campos, and Ernesto Costa
Effect of the Block Occupancy in GPGPU over the Performance of Particle Swarm Algorithm
    Miguel Cárdenas-Montes, Miguel A. Vega-Rodríguez, Juan José Rodríguez-Vázquez, and Antonio Gómez-Iglesias
Two Improvement Strategies for Logistic Dynamic Particle Swarm Optimization
    Qingjian Ni and Jianming Deng
Digital Watermarking Enhancement Using Wavelet Filter Parametrization
    Piotr Lipiński and Jan Stolarek
CellularDE: A Cellular Based Differential Evolution for Dynamic Optimization Problems
    Vahid Noroozi, Ali B. Hashemi, and Mohammad Reza Meybodi
Optimization of Topological Active Nets with Differential Evolution
    Jorge Novo, José Santos, and Manuel G. Penedo
Study on the Effects of Pseudorandom Generation Quality on the Performance of Differential Evolution
    Ville Tirronen, Sami Äyrämö, and Matthieu Weber
Sensitiveness of Evolutionary Algorithms to the Random Number Generator
    Miguel Cárdenas-Montes, Miguel A. Vega-Rodríguez, and Antonio Gómez-Iglesias
New Efficient Techniques for Dynamic Detection of Likely Invariants
    Saeed Parsa, Behrouz Minaei, Mojtaba Daryabari, and Hamid Parvin
Classification Ensemble by Genetic Algorithms
    Hamid Parvin, Behrouz Minaei, Akram Beigi, and Hoda Helmi
Simulated Evolution (SimE) Based Embedded System Synthesis Algorithm for Electric Circuit Units (ECUs)
    Umair F. Siddiqi, Yoichi Shiraishi, Mona A. El-Dahb, and Sadiq M. Sait
Taxi Pick-Ups Route Optimization Using Genetic Algorithms
    Jorge Nunes, Luís Matos, and António Trigo
Optimization of Gaussian Process Models with Evolutionary Algorithms
    Dejan Petelin, Bogdan Filipič, and Juš Kocijan

Author Index

Table of Contents – Part II

Pattern Recognition and Learning

Asymmetric k-Means Algorithm
    Dominik Olszewski
Gravitational Clustering of the Self-Organizing Map
    Nejc Ilc and Andrej Dobnikar
A General Method for Visualizing and Explaining Black-Box Regression Models
    Erik Štrumbelj and Igor Kononenko
An Experimental Study on Electrical Signature Identification of Non-Intrusive Load Monitoring (NILM) Systems
    Marisa B. Figueiredo, Ana de Almeida, and Bernardete Ribeiro
Evaluation of a Resource Allocating Network with Long Term Memory Using GPU
    Bernardete Ribeiro, Ricardo Quintas, and Noel Lopes
Gabor Descriptors for Aerial Image Classification
    Vladimir Risojević, Snježana Momić, and Zdenka Babić
Text Representation in Multi-label Classification: Two New Input Representations
    Rodrigo Alfaro and Héctor Allende
Fraud Detection in Telecommunications Using Kullback-Leibler Divergence and Latent Dirichlet Allocation
    Dominik Olszewski
Classification of EEG in a Steady State Visual Evoked Potential Based Brain Computer Interface Experiment
    Zafer İşcan, Özen Özkaya, and Zümray Dokur
Fast Projection Pursuit Based on Quality of Projected Clusters
    Marek Grochowski and Wlodzislaw Duch
A New N-gram Feature Extraction-Selection Method for Malicious Code
    Hamid Parvin, Behrouz Minaei, Hossein Karshenas, and Akram Beigi
A Robust Learning Model for Dealing with Missing Values in Many-Core Architectures
    Noel Lopes and Bernardete Ribeiro
A Model of Saliency-Based Selective Attention for Machine Vision Inspection Application
    Xiao-Feng Ding, Li-Zhong Xu, Xue-Wu Zhang, Fang Gong, Ai-Ye Shi, and Hui-Bin Wang
Grapheme-Phoneme Translator for Brazilian Portuguese
    Danilo Picagli Shibata and Ricardo Luis de Azevedo da Rocha

Soft Computing

Improvement of Inventory Control under Parametric Uncertainty and Constraints
    Nicholas Nechval, Konstantin Nechval, Maris Purgailis, and Uldis Rozevskis
Modified Jakubowski Shape Transducer for Detecting Osteophytes and Erosions in Finger Joints
    Marzena Bielecka, Andrzej Bielecki, Mariusz Korkosz, Marek Skomorowski, Wadim Wojciechowski, and Bartosz Zieliński
Using CMAC for Mobile Robot Motion Control
    Kristóf Gáti and Gábor Horváth
Optimizing the Robustness of Scale-Free Networks with Simulated Annealing
    Pierre Buesser, Fabio Daolio, and Marco Tomassini
Numerically Efficient Analytical MPC Algorithm Based on Fuzzy Hammerstein Models
    Piotr M. Marusak
Online Adaptation of Path Formation in UAV Search-and-Identify Missions
    Willem H. van Willigen, Martijn C. Schut, A.E. Eiben, and Leon J.H.M. Kester
Reconstruction of Causal Networks by Set Covering
    Nick Fyson, Tijl De Bie, and Nello Cristianini
The Noise Identification Method Based on Divergence Analysis in Ensemble Methods Context
    Ryszard Szupiluk, Piotr Wojewnik, and Tomasz Zabkowski
Efficient Predictive Control and Set–Point Optimization Based on a Single Fuzzy Model
    Piotr M. Marusak
Wind Turbines States Classification by a Fuzzy-ART Neural Network with a Stereographic Projection as a Signal Normalization
    Tomasz Barszcz, Marzena Bielecka, Andrzej Bielecki, and Mateusz Wójcik
Binding and Cross-Modal Learning in Markov Logic Networks
    Alen Vrečko, Danijel Skočaj, and Aleš Leonardis
Chaotic Exploration Generator for Evolutionary Reinforcement Learning Agents in Nondeterministic Environments
    Akram Beigi, Nasser Mozayani, and Hamid Parvin
Parallel Graph Transformations Supported by Replicated Complementary Graphs
    Leszek Kotulski and Adam Sędziwy
Diagnosis of Cardiac Arrhythmia Using Fuzzy Immune Approach
    Olgierd Unold

Systems Theory

Adaptive Finite Automaton: A New Algebraic Approach
    Reginaldo Inojosa Silva Filho and Ricardo Luis de Azevedo da Rocha
Cryptanalytic Attack on the Self-Shrinking Sequence Generator
    Maria Eugenia Pazo-Robles and Amparo Fúster-Sabater
About Nonnegative Matrix Factorization: On the posrank Approximation
    Ana de Almeida
Stability of Positive Fractional Continuous-Time Linear Systems with Delays
    Tadeusz Kaczorek
Output-Error Model Training for Gaussian Process Models
    Juš Kocijan and Dejan Petelin

Support Vector Machines

Learning Readers’ News Preferences with Support Vector Machines
    Elena Hensinger, Ilias Flaounas, and Nello Cristianini
Incorporating a Priori Knowledge from Detractor Points into Support Vector Classification
    Marcin Orchel
A Hybrid AIS-SVM Ensemble Approach for Text Classification
    Mário Antunes, Catarina Silva, Bernardete Ribeiro, and Manuel Correia
Regression Based on Support Vector Classification
    Marcin Orchel
Two One-Pass Algorithms for Data Stream Classification Using Approximate MEBs
    Ricardo Ñanculef, Héctor Allende, Stefano Lodi, and Claudio Sartori

Bioinformatics

X-ORCA - A Biologically Inspired Low-Cost Localization System
    Enrico Heinrich, Marian Lüder, Ralf Joost, and Ralf Salomon
On the Origin and Features of an Evolved Boolean Model for Subcellular Signal Transduction Systems
    Branko Šter, Monika Avbelj, Roman Jerala, and Andrej Dobnikar
Similarity of Transcription Profiles for Genes in Gene Sets
    Marko Toplak, Tomaž Curk, and Blaž Zupan

Author Index

Autonomous Discovery of Abstract Concepts by a Robot

Ivan Bratko

University of Ljubljana, Faculty of Computer and Information Sc., Tržaška 25, 1000 Ljubljana, Slovenia

Abstract. In this paper we look at the discovery of abstract concepts by a robot autonomously exploring its environment and learning the laws of the environment. By abstract concepts we mean concepts that are not explicitly observable in the measured data, such as the notions of obstacle, stability or a tool. We consider mechanisms of machine learning that enable the discovery of abstract concepts. Such mechanisms are provided by the logic based approach to machine learning called Inductive Logic Programming (ILP). The feature of predicate invention in ILP is particularly relevant. Examples of actually discovered abstract concepts in experiments are described.

Keywords: autonomous discovery, robot learning, discovery of abstract concepts, inductive logic programming, predicate invention.

A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part I, LNCS 6593, pp. 1–11, 2011. © Springer-Verlag Berlin Heidelberg 2011

1 Introduction

Robot programming can be done at various levels of generality and abstraction. The most direct approach to programming a robot to carry out a given task is to explicitly tell the robot a complete sequence of actions that accomplish the task. These actions may of course include actions for acquiring information about the current state of the world through sensors (camera, proximity sensors, etc.). This may also be needed to check whether the actual effects of actions were as expected. The remaining actions may depend on the actual state of the world, which is reflected in the observations. This approach is rather limited in that each new task requires a new program for the robot.

A more general approach to programming a robot is to define a model of the robot’s environment and the effects of the robot’s actions. This can then be used by a general planning program to automatically construct plans that accomplish given tasks. This approach is limited by a fixed model of the world. If the model is not quite adequate, or the robot’s environment is not sufficiently known in advance, or it changes over time, then human intervention is needed to update the model.

Another step towards increased robot autonomy and program generality is to include a learning capability. This enables the robot to learn the laws of the world from observations. Such a robot is able to function in an unknown or partially known environment. The use of machine learning (ML) to automatically induce models of a planning domain to be used for task planning is a traditional topic in AI. Zimmerman and Kambhampati [15] give an overview of this research; [13,14] are representative papers in this field.

However, even this approach has limitations. Most machine learning methods assume a fixed language in which the learned theory of the robot’s environment can be stated. Such a language is also called a hypothesis language, because hypothesized theories of the world are expressed in it. Even if such a language is general in principle, it typically imposes practical limitations on what can be learned. The representations of theories in such a language are likely to become too awkward and complex to allow plausible theories to be stated effectively. A desirable feature of a learning system is therefore the ability to extend its hypothesis language with new language constructs that enable elegant statements of theories for a particular robot’s environment.

In this paper we look at how the robot’s hypothesis language can be extended during learning. We consider a setting for automatic modelling of the robot’s world by autonomous discovery through experiments in the environment. The subjects of discovery are various quantitative or qualitative laws in the robot’s world. Discovery of such laws of (possibly naïve, or qualitative) physics aims at enabling the robot to make predictions about the results of its actions, and thus to construct plans to achieve its goals. We look at mechanisms that enable extensions of the robot’s hypothesis language. Such language extensions can be made through the discovery of abstract concepts, which can be added to the hypothesis language. Such abstract concepts may be called “insights”, the term introduced in this context in the European research project XPERO (www.xpero.org).
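The model-based approach mentioned earlier in this introduction, in which a general planner derives action sequences from a declarative model of the effects of actions, can be illustrated with a minimal sketch. It assumes a STRIPS-style representation (preconditions, add effects, delete effects) and a plain breadth-first search; the toy domain and all names (`Action`, `plan`, `go_A_to_B`, and so on) are hypothetical and are not taken from the paper.

```python
from collections import deque
from typing import NamedTuple


class Action(NamedTuple):
    """A hypothetical STRIPS-style action over sets of ground facts."""
    name: str
    preconditions: frozenset
    add_effects: frozenset
    delete_effects: frozenset


def plan(initial, goal, actions):
    """Breadth-first search over world states.

    Returns a shortest list of action names that reaches the goal,
    or None when the goal is unreachable under the given model.
    """
    start, goal = frozenset(initial), frozenset(goal)
    frontier = deque([(start, [])])
    visited = {start}
    while frontier:
        state, steps = frontier.popleft()
        if goal <= state:  # every goal fact holds in this state
            return steps
        for action in actions:
            if action.preconditions <= state:
                successor = (state - action.delete_effects) | action.add_effects
                if successor not in visited:
                    visited.add(successor)
                    frontier.append((successor, steps + [action.name]))
    return None


# Toy world model (an assumption for illustration): two rooms and one box.
ACTIONS = [
    Action("go_A_to_B", frozenset({"robot_in_A"}),
           frozenset({"robot_in_B"}), frozenset({"robot_in_A"})),
    Action("go_B_to_A", frozenset({"robot_in_B"}),
           frozenset({"robot_in_A"}), frozenset({"robot_in_B"})),
    Action("push_box_A_to_B", frozenset({"robot_in_A", "box_in_A"}),
           frozenset({"robot_in_B", "box_in_B"}),
           frozenset({"robot_in_A", "box_in_A"})),
]

example_plan = plan({"robot_in_B", "box_in_A"}, {"box_in_B"}, ACTIONS)
print(example_plan)
```

The same planner serves any task expressible in the model, which is the generality the text attributes to this approach. Its limitation is equally visible: if `ACTIONS` describes the world wrongly, the planner cannot notice or repair that, which is where learning comes in.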
By an insight we mean something conceptually more abstract than a typical law in physics. A law usually states the dependences between observed variables. An insight, on the other hand, may introduce a new variable or quantity that was never explicitly observed in the measurements. In this sense, an insight is a new piece of knowledge that makes it possible to simplify the agent’s current theory about its environment. So an insight may enhance the agent’s description language, and thus it should also make further discovery easier, because the hypothesis language becomes more powerful and better suited to the domain of application. Also, we would like discovered concepts to be used by the agent in reasoning about the domain. For example, a robot may use these concepts in task planning, and not only to improve further learning.

What typically counts as an insight in this sense? Suppose the robot is exploring its physical environment and trying to make sense of the measured data. Assume the robot has no prior theory of the physical world, nor any knowledge of relevant mathematics. Then, examples of insights would be the discovery of notions such as an absolute coordinate system, arithmetic operations, gravity, and support between objects. These concepts were never explicitly observed in the robot’s measured data. They are made up by the robot as useful abstract concepts that cannot be directly observed. An insight would thus ideally be a new concept that makes the current domain theory more flexible and enables more efficient reasoning about the domain.
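How a new, unobserved concept can simplify a theory may be illustrated with a small sketch. It assumes a theory is a list of clauses (a head literal plus a set of body literals) and measures theory size as the total number of literals. The folding step below is a purely syntactic toy, far simpler than predicate invention in an actual ILP system, and every predicate name in it (`moves(X)`, `light(X)`, `movable(X)`, and so on) is a hypothetical illustration rather than a result from the paper.

```python
from functools import reduce

# A clause is (head, body): the head literal holds if all body literals hold.
# All predicate names below are hypothetical illustrations.


def theory_size(theory):
    """Number of literals in the theory (each head plus every body literal)."""
    return sum(1 + len(body) for _, body in theory)


def fold_shared_conditions(theory, new_literal):
    """Replace the conditions common to every clause with one invented literal.

    Returns the theory unchanged when folding would not make it smaller.
    """
    if not theory:
        return theory
    shared = reduce(frozenset.intersection, (body for _, body in theory))
    folded = [(head, (body - shared) | {new_literal}) for head, body in theory]
    folded.append((new_literal, shared))  # definition of the invented concept
    return folded if theory_size(folded) < theory_size(theory) else theory


before = [
    ("moves(X)", frozenset({"pushed(X)", "light(X)", "on_floor(X)", "not_fixed(X)"})),
    ("slides(X)", frozenset({"pulled(X)", "light(X)", "on_floor(X)", "not_fixed(X)"})),
    ("rolls(X)", frozenset({"tilted(X)", "round(X)", "light(X)", "on_floor(X)", "not_fixed(X)"})),
]
after = fold_shared_conditions(before, "movable(X)")
print(theory_size(before), "->", theory_size(after))  # prints: 16 -> 14
```

The invented literal `movable(X)` never appears in any observation; it exists only because it makes the theory more compact, and each further law that depends on the same conditions becomes cheaper to state once it is available.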

Autonomous Discovery of Abstract Concepts by a Robot


The expected effect of an insight is illustrated in Figure 1. Initially, the robot starts with a small theory, which will become better, and also larger, after first experiments and learning steps. Newly discovered relations and laws of the domain are added to the theory, so the theory keeps growing. Then, when an insight occurs, the insight may enable a simplification of the theory, so the theory shrinks. A simplification is possible because an insight gives rise to a better representation language which facilitates a more compact representation of the current knowledge. Then the theory may start growing again. This process is reminiscent of the evolution of scientific theories. In this paper an approach to the discovery of abstract concepts is discussed. The approach employs a logic-based framework to machine learning, called Inductive Logic Programming (ILP). Examples of actually discovered abstract concepts in experiments are given. These include the concepts of movable object, an obstacle, and a tool. A comment regarding the goals of this research is in order here. It should be noted that the scientific goals of discovering abstract concepts in this paper are considerably different from typical goals in robotics research. In a typical robotics project, the goal may be to improve the robot’s performance at carrying out some physical task. To this end, any relevant methods, as powerful as possible, will be applied. In contrast to this, here we are less interested in improving the robot’s performance at some specific task, but in making the robot improve its theory and “understanding” of the world. We are interested in finding mechanisms, as generic as possible, that enable the gaining of insights. For such a mechanism to be generic, it has to make only a few rather basic assumptions about the agent’s prior knowledge. 
We are interested in minimizing such “innate knowledge” because we would like to demonstrate how discovery and gaining insights may come about from only a minimal set of “first principles”. Not all machine learning methods are appropriate. Our aim requires that the induced insights can be interpreted and understood, flexibly used in robot’s reasoning, and are not only useful for making direct predictions.

2 Discovery of Abstract Notions with ILP

2.1 Experimental Loop

We assume that the robot’s discovery process takes the form of an indefinite “experimental loop”. The robot starts with some initial knowledge (possibly zero). This is the robot’s initial theory of the domain. Then it repeats the steps:

1. Perform experiments and collect observation data.
2. Apply an ML method to the data, which results in a new theory.
3. Design new experiments aiming at collecting the most informative new data.
4. Plan the execution of these experiments using the current theory of the domain.
5. Go to step 1 to repeat the loop.
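The five steps above can be sketched as a small driver function. This is only an illustration under stated assumptions: the names `run_experimental_loop`, `perform`, `learn`, `design` and `plan` are hypothetical, the toy world (a robot measuring y = 2x and learning a slope) is invented for the example, and the loop is bounded by `max_iterations` although the loop in the text is indefinite.

```python
def run_experimental_loop(theory, experiments, perform, learn, design, plan,
                          max_iterations):
    """Hypothetical driver for the experimental loop (bounded for illustration)."""
    data = []
    for _ in range(max_iterations):
        data.extend(perform(experiments))      # step 1: experiment, collect data
        theory = learn(data)                   # step 2: induce a new theory
        proposed = design(theory, data)        # step 3: design new experiments
        experiments = plan(proposed, theory)   # step 4: plan their execution
        # step 5: go back to step 1
    return theory, data


# Toy stand-ins (assumptions, not from the paper): the "world" is y = 2 * x
# and the "theory" is a single slope estimate.
def perform(xs):
    return [(x, 2.0 * x) for x in xs]


def learn(data):
    # Least-squares slope of a line through the origin.
    den = sum(x * x for x, _ in data)
    return sum(x * y for x, y in data) / den if den else 0.0


def design(theory, data):
    # Probe just beyond the largest input tried so far.
    return [max(x for x, _ in data) + 1.0]


def plan(proposed, theory):
    # Planning is trivial in this toy world: run the experiments as proposed.
    return list(proposed)


theory, data = run_experimental_loop(0.0, [1.0], perform, learn, design, plan, 3)
```

In a real setting each of the four callables would be a substantial component; the sketch only fixes the order in which they feed each other.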

Experiments with this experimental scenario, using a number of learning methods at step 2 of the loop, are described in [1]. [2] is a comparison of the used ML techniques


I. Bratko

Fig. 1. Evolution of a theory during execution of experimental loop

w.r.t. a collection of learning tasks and criteria relevant for autonomous robot discovery. ML methods used in these experiments include regression trees, decision trees (implementations in Orange [3]), induction of qualitative trees with QUIN [4], induction of equations with Goldhorn [5], Inductive Logic Programming with Aleph [6] and HYPER [7], and statistical relational learning with Alchemy [8]. Many aspects of this experimental loop require further research. Here we consider one key question: how to discover abstract concepts in this process.

2.2 Discovering Abstract Concepts by Predicate Invention in ILP

I will here describe a concrete technical approach to the discovery of abstract concepts. The approach employs Inductive Logic Programming (ILP), an approach to machine learning that uses first-order logic to represent theories induced from data. Usually, the syntax in which new theories are represented is that of the logic programming language Prolog, see e.g. [7]. A logic program consists of definitions of a number of predicates.

First we look at an example to illustrate the notation and the need for inventing new predicates. Let us consider a robot that would like to learn when an object can be safely grasped in a blocks world. For simplicity, assume all the blocks are cubes of the same size. Blocks are arranged to form stacks of different heights (Figure 2). The robot can grasp blocks with fingers attached to its very big arm. If the arm gets too far down towards an object surrounded by taller stacks, the arm will knock down the blocks in these stacks. So the rule of safe grasping is: a block B can be grasped only if there is no other block in the scene positioned higher than B. Suppose the robot does not know anything about this, but would like to learn when a block is safe to grasp.

To this end, the robot has made a number of experiments and collected the results of these experiments: whether the grasping was successful or not. We say that the robot has collected positive and negative examples of grasping, and now wants to induce general rules to predict the outcome. Now, assume that the robot has no information about the heights of the objects, that is, their Z-coordinates. The robot only knows the X and Y coordinates (top-view camera); in addition, it can only sense the proximity of a block by a proximity sensor attached to the arm. Instead of the height information, the robot keeps track, for each block, of what other object the block is standing on: either another block or the floor. This information will be represented by the “on” and “onfloor” relations (see Figure 2). Now suppose also that the robot has no knowledge of arithmetic that would allow


[Figure 2: two stacks of blocks, a on b, and c on d on e, with the relations on(a,b), on(c,d), on(d,e), onfloor(b), onfloor(e).]

Fig. 2. A state of a blocks world and some relations in this state

the robot to work out the heights of blocks. Therefore the robot needs to discover something new to work to the same effect as numbers and arithmetic. For this example, such a new useful concept is the predicate above(B1,B2). That is, block B1 is positioned higher than block B2, where B1 and B2 are either in the same stack or in different stacks. Here is a precise statement about grasping: A block B is graspable if there is no other block above B. This can be written formally in Prolog as:

    graspable( Block) :-                    % Block is graspable if
        not above( AnotherBlock, Block).    % no other block is above Block

This is read as: Block is graspable if it is not true that there is AnotherBlock above Block. Text following “%” is a program comment. Of course, this definition of graspable requires in turn a definition of the predicate above(B1,B2). Here is a possible definition of this predicate in English, which considers two possible cases: (1) Block B1 is above block B2 if B2 is on the floor, and B1 is on any other block. (2) Block B1 is above B2 if B1 is on some block B1a, and B2 is on some block B2a, and B1a is above B2a. These two possibilities about B1 being above B2 can be written in Prolog as:

    above( B1, B2) :-               % B1 is above B2 if
        onfloor( B2),               % B2 is on floor, and
        on( B1, AnotherBlock).      % B1 is on AnotherBlock

    above( B1, B2) :-               % B1 is above B2 if
        on( B1, B1a),               % B1 is on B1a, and
        on( B2, B2a),               % B2 is on B2a, and
        above( B1a, B2a).           % B1a is above B2a


This is now an executable definition of “graspable”. Notice that “above” is defined recursively and that it works for stacks of any height. Given the relation “on” as in Figure 2, this Prolog program will logically derive, for example, that block c is graspable, and that blocks a, b, d and e are not.

Let us now consider how an ILP learning system would learn the above definitions from examples. In general, the problem of ILP is defined as follows:

Given: some initial Prolog program, called B (for “background knowledge”), a set of positive examples E, and a set of negative examples N,
Find: a hypothesis H (also in the form of a Prolog program), such that:
(1) B and H |-- E, and
(2) for all n in N: B and H |-- not(n)

The symbol “|--” means derivation; e.g. E can be logically derived from B and H. The formal definition above says: all the positive examples E can be derived from hypothesis H and background knowledge B, and no negative example n in N can be derived from H and B.

ILP algorithms find such hypotheses H by searching, in one way or another, among possible hypotheses. Possible hypotheses are in principle all logical formulas that can be constructed in the hypothesis language the system uses. This makes the ILP approach to machine learning computationally very difficult because of its combinatorial complexity. On the other hand, the advantage is that the hypothesis language is very powerful and flexible. Also, the learning system can use the knowledge that the robot has before the learning starts, so the robot does not have to start practically from nothing as in most other approaches to machine learning. There are several ILP learning methods available, see e.g. [17]. In the experiments described in this paper, the ILP learning system HYPER [7] was used.

Now, how can we apply an ILP system to our exercise of learning about “graspable”? Background knowledge may consist of the definition of the predicates “on” and “onfloor”, for example as in Figure 2:

    on( a, b).      on( c, d).      on( d, e).
    onfloor( b).    onfloor( e).
    ...

The learning examples would be:

    positive_example( graspable(c)).
    negative_example( graspable(a)).
    negative_example( graspable(b)).
    negative_example( graspable(d)).
    ...
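The two ILP conditions can be checked mechanically once B, H, E and N are fixed. The sketch below is a Python stand-in for the Prolog program, not output of any ILP system: the facts are those of Figure 2, `above` and `graspable` mirror the two definitions given earlier, and the helper name `consistent` is hypothetical. Only the learning examples listed explicitly in the text are used.

```python
# Background knowledge B: the facts of Figure 2.
ON = {("a", "b"), ("c", "d"), ("d", "e")}   # on(Upper, Lower)
ONFLOOR = {"b", "e"}
BLOCKS = {"a", "b", "c", "d", "e"}


def above(b1, b2):
    """Python mirror of the two Prolog clauses for above/2."""
    # Clause 1: B2 is on the floor and B1 is on some other block.
    if b2 in ONFLOOR and any(upper == b1 for upper, _ in ON):
        return True
    # Clause 2: B1 is on B1a, B2 is on B2a, and B1a is above B2a.
    return any(above(b1a, b2a)
               for upper1, b1a in ON if upper1 == b1
               for upper2, b2a in ON if upper2 == b2)


def graspable(block):
    """Negation as failure: no block in the scene is above `block`."""
    return not any(above(other, block) for other in BLOCKS)


# Learning examples listed explicitly in the text.
POSITIVE = {"c"}
NEGATIVE = {"a", "b", "d"}


def consistent(hypothesis, positive, negative):
    """Conditions (1) and (2): every positive example is derived, no negative one is."""
    return (all(hypothesis(e) for e in positive)
            and not any(hypothesis(n) for n in negative))


graspable_blocks = {b for b in BLOCKS if graspable(b)}
hypothesis_ok = consistent(graspable, POSITIVE, NEGATIVE)
```

An ILP system has to search for a hypothesis that passes such a check; here the hypothesis is written by hand, so the sketch shows only the acceptance test, not the search.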


However, applying ILP to our grasping example has another dimension of difficulty. Namely, the target predicate “graspable” cannot be formulated without introducing an auxiliary predicate “above”. This predicate then enables an elegant recursive formulation of the target hypothesis as given above. Such an introduction of a new predicate (that is mentioned neither in the examples nor in the background knowledge) is in ILP called predicate invention. Predicate invention is combinatorially even more demanding than the usual ILP because there are so many ways of introducing new predicates and it is hard to see in advance which ways are promising and which are not. On the other hand, predicate invention is of great importance in discovery from data because it enhances the learner’s hypothesis language, and enables the statement of theories that would otherwise not be possible. From the point of view of automated discovery, predicate invention is, among other things, so attractive because it is a most natural form of automated discovery. This holds in particular for the discovery of abstract concepts. With the invention of new predicates, the automated discovery goes beyond “shallow” laws that directly state relations between the variables observed in experiments. Predicate invention is a way of finding new theoretical terms, or abstract new concepts that are not directly seen in the measured data. As Russell and Norvig (2009) observe: “Some of the deepest revolutions in science come from the invention of new predicates and functions – for example, Galileo’s invention of acceleration, or Joule’s invention of thermal energy. Once these terms are available, the discovery of new laws becomes (relatively) easy.”

3 Experimental Results

In this section I describe experiments with the discovery of abstract concepts using the mechanism of predicate invention in ILP. The experimental setting consisted of real or simulated mobile robot(s) manipulating blocks in a plane. In these experiments, the concepts of a movable object, an obstacle, and a tool were discovered from measured data. These notions were expressed as new predicates.

First, let us consider how the concept of a movable object emerged. When given commands to move specified objects by given distances, the robot was able to actually move some of the blocks, but some of the blocks could not be moved because they were too heavy. After some time, the robot had collected a body of experimental data recorded as Prolog facts about the predicates:

• at( Obj,T,P), meaning object Obj was observed at position P at time T;
• move( Obj,P1,D,P2), meaning command "move Obj from P1 by distance D" resulted in Obj at P2.

Here all positions are two-dimensional vectors. The robot's prior knowledge (communicated to the ILP program as “background knowledge”) consisted of the predicates:

• approx_equal( X, Y), meaning X ≈ Y;
• different( X, Y), meaning X and Y are not approximately equal;
• add( X, Y, Z), meaning Z ≈ X + Y.
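The three background predicates can be sketched as tolerance-based tests. The source does not state how the approximation is defined, so the relative tolerance of 1% below is purely an assumption, and the sketch works on scalars rather than on the two-dimensional position vectors used in the experiments.

```python
import math

TOLERANCE = 0.01  # hypothetical tolerance; the paper does not give a value


def approx_equal(x, y, tol=TOLERANCE):
    """X ≈ Y within a relative (and small absolute) tolerance."""
    return math.isclose(x, y, rel_tol=tol, abs_tol=tol)


def different(x, y, tol=TOLERANCE):
    """X and Y are not approximately equal."""
    return not approx_equal(x, y, tol)


def add(x, y, z, tol=TOLERANCE):
    """Z ≈ X + Y."""
    return approx_equal(x + y, z, tol)


noisy_sum_accepted = add(3.3, 4.4, 7.72)
```

With this choice of tolerance a noisy measurement such as add( 3.3, 4.4, 7.72) is accepted, while clearly wrong sums are rejected; a different tolerance would move that boundary.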


These relations are defined as approximations so that they remain useful in spite of noise in the numerical data. For example, according to these “fuzzy” definitions, add( 3.3, 4.4, 7.72) also holds. It should be noted that neither the observations nor the prior knowledge contain the concept of mobility of objects, or any mention of it. No examples of movable and immovable objects are given. The ILP program HYPER [7] was used on this learning problem to induce a theory of moving in this world, that is, to learn the predicate move/4 which, for a given command “move Obj from position P1 by distance D”, predicts the position P2 of Obj after the command has been executed. The theory induced by HYPER from the observed data was stated in logic by the following three Prolog clauses:

    move( Obj, Pos1, Dist, Pos2) :-
        approx_equal( Pos1, Pos2),
        not p( Obj).

    move( Obj, Pos1, Dist, Pos2) :-
        add( Pos1, Dist, Pos2),
        p( Obj).

    p( Obj) :-
        at( Obj, T1, Pos1),
        at( Obj, T2, Pos2),
        different( Pos1, Pos2).

In the clauses above, the variables were renamed for easier reading. The first clause deals with immovable objects (after the move command, the position of the object remains unchanged). The second clause handles movable objects. The point of interest is that HYPER invented a new predicate, p(Object), which is true for objects that can be moved by the robot. At least the “intention” is to define the movability property of an object. The definition above says that an object is movable if it has been observed at two different positions in time. This is of course not exactly equal to the common understanding of movable. In the common understanding, an object may be movable even if it was never moved in the available history. But the invented predicate also makes good sense. It defines the property that the object has actually been observed to move (“confirmed movable”). The robot has come up with a new concept never mentioned in the data or problem definition.
The new concept p(Object) enabled the learning system to divide the problem in the two cases. A meaningful name of the newly invented predicate p(Object) would be confirmed_movable(Object). In another experiment where many objects were present in the scene, the robot invented another predicate which corresponds to the notion of obstacle. If an immovable object appears on the trajectory of a moving object then the stationary object impedes the movement. The ILP learner found it useful to introduce a new predicate whose definition corresponded to such an object – an obstacle. Again, the notion of obstacle was never explicitly mentioned in the problem definition, nor in the learning data. The learner just figured it out that such a new notion was useful for explaining the behavior of objects in the robot’s world. One induced definition that includes the new predicate which corresponds to obstacle was:


    p( Start, Dist, Obj) :-
        at( Obj, Time, Pos),
        pointBetween( Start, Dist, Pos),
        not movable( Obj).

    move( Start, Dist, End) :-
        add( Start, Dist, End),
        not p( Start, Dist, Obj).

    move( Start, Dist, End) :-
        p( Start, Dist, Obj),
        at( Obj, Time, Pos),
        approxEqual( Pos, End).

The invented new predicate p(Start,Dist,Obj) can be interpreted as: an object Obj is an obstacle with respect to the location Start and distance Dist if Obj is immobile and occupies a location that is between the locations Start and Start+Dist (both Start and Dist are two-dimensional vectors). The predicate move(Start,Dist,End) predicts the end position End as follows: if there is no obstacle on the way then End is equal to Start+Dist, otherwise End is approximately equal to the location of the obstacle. Details of these experiments are described by Leban et al. [10]. That paper also explains a method for generating the negative examples needed by HYPER. This is a problem that we have not mentioned until now. HYPER, like most other ILP systems, needs negative examples in addition to positive examples. In experiments with a robot, the observations usually produce only positive examples, that is, those outcomes that can actually happen in nature. A negative example in our case would consist of a robot moving to its intended location regardless of an immovable block on the way. As this cannot happen, such a negative example cannot be observed. A technique for generating negative examples from positive examples is described in [10].

Further, more complex experiments led to the discovery of the notion of a tool. In these experiments, a robot was carrying out block moving tasks whose goals were of the form: at(Block,Pos). These tasks required the robot to plan sequences of actions that resulted in the specified goals. Occasionally, the robot could not directly push a block through a narrow passage, because the robot could not fit into the passage.
Therefore the robot had to use another block as a tool with which the first block could be pushed by the robot indirectly. The plans were also “explained” by a means-ends planner in terms of the goals that each action is supposed to achieve. Macro operators were then learned from this collection of concrete plans as generalized subsequences of actions. The generalization was accomplished by replacing block names by variables whenever possible. The robot then induced a classification of macro operators in terms of logic definitions that discriminated between macro operators. At this stage two new concepts were invented that can be interpreted as definitions of the concept of a tool and of an obstacle. For example, the following definition states that an object Obj has the role of a tool in a macro operator MacroOp:


    tool( MacroOp, Obj) :-
        object( Obj),
        member( Action --> Goals, MacroOp),    % Purpose of Action is Goals
        argument( Obj, Action),                % Obj appears as an argument in Action
        not argument( Obj, Goals).             % Obj does not appear as argument in Goals

Essentially, this definition says that Obj has the role of a tool in a macro operator if there is an action Action in the macro operator such that the purpose of Action is Goals, and Obj appears as one of the arguments that describe Action, and it does not appear as an argument of Goals that Action achieves.
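The same test can be written over a simple data representation. The encoding below is an assumption made only for illustration: a macro operator is a list of (action, goals) pairs, each action and each goal is a tuple whose first element is a name and whose remaining elements are arguments, and the action name `push_with`, the block names and the position name are all hypothetical.

```python
def is_argument(obj, term):
    """True if obj occurs among the arguments of term = (name, arg1, arg2, ...)."""
    return obj in term[1:]


def tool(macro_op, obj, objects):
    """obj is a tool in macro_op if some action mentions it but that action's goals do not."""
    return obj in objects and any(
        is_argument(obj, action)
        and not any(is_argument(obj, goal) for goal in goals)
        for action, goals in macro_op
    )


# Hypothetical macro operator: push block b1 to position p1 by means of block b2.
OBJECTS = {"b1", "b2"}
MACRO_OP = [(("push_with", "b2", "b1"), [("at", "b1", "p1")])]

tools_found = {obj for obj in OBJECTS if tool(MACRO_OP, obj, OBJECTS)}
```

In this toy macro operator only b2 qualifies: it is an argument of the action, but the goal speaks only about b1.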

4 Conclusions

In this paper, approaches to robot programming were considered, the most general being the robot’s autonomous learning of a model of the robot’s domain and using this model for task planning. Even this approach may be limited by a fixed hypothesis language used by the learning program. We focussed on the question of alleviating this limitation through automatic discovery of abstract new concepts that can be used as extensions of the hypothesis language. An approach to discovery of abstract concepts was discussed based on predicate invention in ILP. Some experiments in the discovery of abstract concepts through predicate invention in ILP in some robotic domains were reviewed. We showed examples where new predicates were automatically invented that correspond to the notions of the object’s movability, an obstacle and a tool. In these experiments, the number of examples sufficient to induce sensible definitions of such concepts was typically small, in the order of tens or hundreds. The data that was used for learning was collected using either a simulated or a real robot and contained noise.

For the task of abstract discovery considered in this paper, in experiments with a number of ML methods, ILP was the only approach with potential of success. This was analysed in [2]. Predicate logic, the hypothesis language of ILP, is sufficiently expressive to enable, at least in principle, the applicability also to tasks where the other approaches seem to be insufficient. In particular, such tasks are the discovery of aggregate and functional notions. For example, the key mechanism in discovery of aggregate and functional notions is that of predicate invention. In the discovery and general handling of aggregate notions, the recursion facility is essential. Also, the use of logic theories as background knowledge allows very natural transition between the learning of theories in increasingly complex worlds.
The general problem with ILP is its high computational complexity. Predicate invention in ILP [12] further increases complexity very critically (as also commented by Domingos in [9]). Further progress with the discussed approach to robot learning therefore critically depends on the progress in making predicate invention more efficient.

Acknowledgements

Research described in this paper was supported by the European Commission, 6th Framework project XPERO, and the Slovenian research agency ARRS, Research program Artificial Intelligence and Intelligent Systems. A number of people contributed to the related experimental work, including G. Leban and J. Žabkar.

References

1. Bratko, I., Šuc, D., Awaad, I., Demšar, J., Gemeiner, P., Guid, M., Leon, B., Mestnik, M., Prankl, J., Prassler, E., Vincze, M., Žabkar, J.: Initial experiments in robot discovery in XPERO. In: ICRA 2007 Workshop Concept Learning for Embodied Agents, Rome (2007)
2. Bratko, I.: An Assessment of Machine Learning Methods for Robotic Discovery. Journal of Computing and Information Technology – CIT 16, 247–254 (2008)
3. Demšar, J., Zupan, B.: Orange: Data Mining Fruitful & Fun - From Experimental Machine Learning to Interactive Data Mining (2006), http://www.ailab.si/orange
4. Šuc, D.: Machine Reconstruction of Human Control Strategies. In: Frontiers Artificial Intelligence Appl., vol. 99, IOS Press, Amsterdam (2003)
5. Križman, V.: Automatic Discovery of the Structure of Dynamic System Models. PhD thesis, Faculty of Computer and Information Sciences, University of Ljubljana (1998)
6. Srinivasan, A.: The Aleph Manual. Technical Report, Computing Laboratory, Oxford University (2000), http://web.comlab.ox.ac.uk/oucl/research/areas/machlearn/Aleph/
7. Bratko, I.: Prolog Programming for Artificial Intelligence, 3rd edn. Addison-Wesley / Pearson (2001)
8. Richardson, M., Domingos, P.: Markov Logic Networks. Machine Learning 62, 107–136 (2006)
9. Dietterich, T.G., Domingos, P., Getoor, L., Muggleton, S., Tadepalli, P.: Structured machine learning: the next ten years. Machine Learning 73, 3–23 (2008)
10. Leban, G., Žabkar, J., Bratko, I.: An experiment in robot discovery with ILP. In: Železný, F., Lavrač, N. (eds.) ILP 2008. LNCS (LNAI), vol. 5194, pp. 77–90. Springer, Heidelberg (2008)
11. Stahl, I.: Predicate invention in Inductive Logic Programming. In: De Raedt, L. (ed.) Advances in Inductive Logic Programming, pp. 34–47. IOS Press, Amsterdam (1996)
12. Garcia-Martinez, R., Borrajo, D.: An integrated approach of learning, planning and execution. Journal of Intelligent and Robotic Systems 29, 47–78
13. Veloso, M., Carbonell, J., Perez, A., Borrajo, D., Fink, E., Blythe, J.: Integrating planning and learning. J. of Experimental and Theoretical AI 7(1) (1995)
14. Zimmerman, T.L., Kambhampati, S.: Learning-assisted automated planning: Looking back, taking stock, going forward. AI Magazine 24(2), 73–96 (2003)
15. Russell, S., Norvig, P.: Artificial Intelligence: A Modern Approach, 3rd edn. Pearson, London (2009)
16. De Raedt, L.: Logical and Relational Learning. Springer, Heidelberg (2008)

Kernel Networks with Fixed and Variable Widths

Věra Kůrková¹ and Paul C. Kainen²

¹ Institute of Computer Science, Academy of Sciences of the Czech Republic, Pod Vodárenskou věží 2, Prague 8, Czech Republic
² Department of Mathematics, Georgetown University, Washington, D.C. 20057-1233, USA
[email protected], [email protected]

Abstract. The role of width in kernel models and radial-basis function networks is investigated with a special emphasis on the Gaussian case. Quantitative bounds are given on kernel-based regularization showing the effect of changing the width. These bounds are shown to be d-th powers of width ratios, and so they are exponential in the dimension of input data.

Keywords: Kernel models, Gaussian kernel networks, Minimization of error functionals, Regularization.

1 Introduction

Radial-basis-function (RBF) networks compute as their input-output functions linear combinations of radial functions (typically Gaussians) with both centers and widths being optimized during learning. In contrast to RBF, in kernel models merely centers of computational units are adjustable. The width is given a priori by the choice of a kernel. Both computational models have been successfully applied in a variety of classification and regression tasks (see, e.g., [1,2]).

RBF with rather general radial functions are known to be universal approximators [3,4]. In particular, Gaussian RBF are dense in the space $(C(X), \|\cdot\|_{\sup})$ of continuous functions on a compact $X$ with the supremum norm as well as in $(L^p(\mathbb{R}^d), \|\cdot\|_{L^p})$. However, variability of widths is not necessary for the universal approximation property of Gaussian kernel models. It was proven in [5] that for any fixed width, sets of linear combinations of translations of Gaussians with this width are dense in $(C(X), \|\cdot\|_{\sup})$. Yet, it is obvious that higher flexibility in the choice of parameters leads to smaller model complexity in function approximation (see [6] for some estimates).

On the other hand, a fixed width of a kernel provides a useful framework for theoretical investigation of learning in Hilbert spaces generated by kernels. Good properties of these spaces allow one to apply methods of regularization to model generalization, to describe minima of error functionals and to design new learning algorithms [7,8,9]. Thus both types of computational models, the one with a fixed width and the one with variable widths, have their advantages.

In this paper, the role of width in kernel models and radial-basis function networks is investigated theoretically with special emphasis on Gaussian kernels.

A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part I, LNCS 6593, pp. 12–21, 2011. © Springer-Verlag Berlin Heidelberg 2011

Kernel Networks


We use properties of operators defined by kernels to show that function spaces defined by kernels with monotonically changing widths are nested, their embeddings are continuous, and the ratio between their norms grows exponentially with the number of variables d. It is shown that error functionals on spaces of continuous functions have many argminima composed from Gaussians of all widths and that these argminima are linearly independent. The paper is organized as follows. In Section 2, notation is introduced and basic concepts on kernel models and reproducing kernel Hilbert spaces are reviewed. In Section 3, properties of minima of error functionals over kernel models are described. In Section 4, properties of spaces induced by families of kernels with growing widths are derived. In Section 5, the results are applied to Gaussian networks with different widths.

2 Radial and Kernel Models

Radial-basis function networks as well as kernel models belong to a class of one-hidden-layer networks with one linear unit. Such networks compute functions from the set

$$\mathrm{span}_n\, G := \Big\{ \sum_{i=1}^{n} w_i g_i \;\Big|\; w_i \in \mathbb{R},\ g_i \in G \Big\},$$

where the set $G$ is called a dictionary [10] and $n$ is the number of hidden units. The set of input-output functions of networks with an arbitrary number of units is denoted

$$\mathrm{span}\, G := \Big\{ \sum_{i=1}^{n} w_i g_i \;\Big|\; w_i \in \mathbb{R},\ g_i \in G,\ n \in \mathbb{N}_+ \Big\},$$

where $\mathbb{N}_+$ denotes the set of positive integers. Often, dictionaries are parameterized families of functions modelling computational units, i.e., they are of the form

$$G_\phi(X, Y) := \{\phi(\cdot, y) : X \to \mathbb{R} \mid y \in Y\},$$

where $\phi : X \times Y \to \mathbb{R}$ is a function of two variables, an input vector $x \in X \subseteq \mathbb{R}^d$ and a parameter $y \in Y \subseteq \mathbb{R}^s$. When $X = Y$, we write briefly $G_\phi(X)$, and when $X = Y = \mathbb{R}^d$, $G_\phi$. We investigate two classes of such families.

The first class is formed by dictionaries induced by kernels $K : X \times Y \to \mathbb{R}$, where $X \subseteq \mathbb{R}^d$ and $Y \subseteq \mathbb{R}^r$. We denote

$$G_K(X, Y) := \{K_y : X \to \mathbb{R} \mid y \in Y\},$$

where for $y \in Y$, $K_y : X \to \mathbb{R}$ is the function defined as $K_y(x) = K(x, y)$. Often $X = Y$, and $K$ is symmetric and positive semidefinite, i.e., for any positive integer $m$, any $x_1, \dots, x_m \in X$ and any $a_1, \dots, a_m \in \mathbb{R}$,

$$\sum_{i=1}^{m} \sum_{j=1}^{m} a_i a_j K(x_i, x_j) \ge 0.$$


V. Kůrková and P.C. Kainen

For a convolution kernel $K(x, y) = k(x - y)$, the dictionary $G_K(X)$ contains translations of the function $k$.

The second class of parameterized families contains dictionaries which have, in addition to translations, also scaling parameters. For $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$, we denote by $K^a : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ the kernel defined as $K^a(x, y) = K(ax, ay)$. When it is clear from the context, we also use $K^a$ to denote the restriction of $K^a$ to $X \times X$, where $X \subseteq \mathbb{R}^d$. A kernel $K$ induces a dictionary with varying widths

$$F_K(X, Y) := \{K^a_y : X \to \mathbb{R} \mid a > 0,\ y \in Y\} = \bigcup_{a > 0} G_{K^a}(X, Y).$$

So the set $\mathrm{span}_n F_K(X, Y)$ consists of all input-output functions of networks with $n$ units with varying width parameters $a > 0$ and varying parameters $y \in Y$. Note that RBF networks form a subclass of kernel models with varying widths. When $X = Y$, we write $F_K(X)$ and when $X = Y = \mathbb{R}^d$, we write merely $F_K$.

For $K$ symmetric positive semidefinite, the sets $\mathrm{span}\, G_K(X)$ are contained in Hilbert spaces defined by kernels called reproducing kernel Hilbert spaces (RKHS). Such spaces were defined by Aronszajn [11] as Hilbert spaces of pointwise defined functions with all evaluation functionals continuous. Each such space is induced by a symmetric positive semidefinite kernel $K : X \times X \to \mathbb{R}$. The space is denoted $H_K(X)$, and it is formed by functions from $\mathrm{span}\, G_K(X)$ together with limits of their Cauchy sequences in the norm $\|\cdot\|_K$. The norm $\|\cdot\|_K$ is induced by the inner product $\langle \cdot, \cdot \rangle_K$, which is defined on $G_K(X)$ as $\langle K_x, K_y \rangle_K := K(x, y)$.

For convolution kernels $K(x, y) = k(x - y)$, where $k$ has a positive Fourier transform, spaces $H_K(\mathbb{R}^d)$ can be characterized in terms of weighted Fourier transforms. The d-dimensional Fourier transform is the operator $F$ defined on $L^2 \cap L^1$ as

$$F(f)(s) = \hat{f}(s) = \frac{1}{(2\pi)^{d/2}} \int_{\mathbb{R}^d} e^{i x \cdot s} f(x)\, dx$$

and extended as an isometry to $L^2$ [12]. The next theorem is from [13] (see also [14] for an earlier less rigorous formulation).

Theorem 1. Let $k \in L^2(\mathbb{R}^d) \cap L^1(\mathbb{R}^d)$ be such that $\hat{k}$ is integrable, for all $s \in \mathbb{R}^d$, $\hat{k}(s) \ge 0$, $S = \{s \in \mathbb{R}^d \mid \hat{k}(s) \neq 0\}$, and let $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ be the convolution kernel induced by $k$, i.e., $K(x, y) = k(x - y)$ for all $x, y \in \mathbb{R}^d$. Then $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is positive semidefinite, $H_K(\mathbb{R}^d) = \{f \in L^2(\mathbb{R}^d) \mid \|f\|_K < \infty\}$, and

$$\|f\|_K^2 = \frac{1}{(2\pi)^{d/2}} \int_S \frac{\hat{f}(s)^2}{\hat{k}(s)}\, ds. \qquad (1)$$
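As an illustration of formula (1), consider the unit-width Gaussian. This choice of $k$ is an assumption made only for the example; the section has not fixed a particular kernel at this point. With the Fourier transform convention stated above, the computation would go as follows.

```latex
% Worked example (assumption: k is the unit-width Gaussian).
\[
  k(x) = e^{-\|x\|^2}
  \quad\Longrightarrow\quad
  \hat{k}(s) = \frac{1}{(2\pi)^{d/2}} \int_{\mathbb{R}^d} e^{i x \cdot s}\, e^{-\|x\|^2}\, dx
             = \frac{\pi^{d/2}}{(2\pi)^{d/2}}\, e^{-\|s\|^2/4}
             = 2^{-d/2}\, e^{-\|s\|^2/4}.
\]
% Since \hat{k}(s) > 0 for every s, the set S is all of R^d, and (1) becomes
\[
  \|f\|_K^2
  = \frac{1}{(2\pi)^{d/2}} \int_{\mathbb{R}^d} \frac{\hat{f}(s)^2}{2^{-d/2}\, e^{-\|s\|^2/4}}\, ds
  = \frac{1}{\pi^{d/2}} \int_{\mathbb{R}^d} \hat{f}(s)^2\, e^{\|s\|^2/4}\, ds .
\]
```

Under this assumption the norm penalizes high-frequency components of $f$ with a weight that grows like $e^{\|s\|^2/4}$, which is why only very smooth functions have a finite norm in the Gaussian case.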

3 Minimization of Error Functionals

Various learning algorithms (see, e.g., [1,2]) aim to minimize error functionals over kernel and RBF networks. An empirical error functional is determined by a training sample $z = \{(u_i, v_i) \in X \times Y \mid i = 1, \dots, m\}$ of input-output pairs of data and a loss function. Denote by $E_z$ the empirical error with quadratic loss

$$E_z(f) := \frac{1}{m} \sum_{i=1}^{m} (f(u_i) - v_i)^2. \qquad (2)$$

To model generalization, Girosi and Poggio [15] introduced into learning theory Tikhonov regularization, which adds to the empirical error a functional called the stabilizer, penalizing undesired properties for solutions of learning tasks. Originally, they considered as stabilizers weighted Fourier transforms of the form $\frac{1}{(2\pi)^{d/2}} \int_S \frac{\hat{f}(s)^2}{\hat{k}(s)}\, ds$ [16]; later Girosi [14] realized that such stabilizers are squares of norms on RKHSs, as stated in Theorem 1. We denote by

$$E_{z,\alpha,K} := E_z + \alpha \|\cdot\|_K^2 \qquad (3)$$

the regularized empirical error with the stabilizer $\|.\|_K^2$ and the regularization parameter $\alpha$. The next theorem characterizes argminima of $E_z$ and its regularization. The theorem was proven in [17] by methods from the theory of inverse problems. Several authors [18,7,8] proved part (ii) earlier using Fréchet derivatives. By $K[u]$ is denoted the matrix $K[u]_{i,j} = K(u_i, u_j)$, $K_m[u] = \frac{1}{m} K[u]$, and $K[u]^+$ is the Moore-Penrose pseudoinverse of the matrix $K[u]$.

Theorem 2. Let $X \subseteq \mathbb{R}^d$, $K : X \times X \to \mathbb{R}$ be a symmetric positive semidefinite kernel, $m$ be a positive integer, $z = (u, v)$ with $u = (u_1, \dots, u_m) \in X^m$, $v = (v_1, \dots, v_m) \in \mathbb{R}^m$, then
(i) there exists an argminimum $f^+$ of $E_z$ over $H_K(X)$, which satisfies

$$f^+ = \sum_{i=1}^{m} c_i K_{u_i}, \quad \text{where} \quad c = (c_1, \dots, c_m) = K[u]^+ v,$$

and for all $f^o \in \mathrm{argmin}(H_K(X), E_z)$, $\|f^+\|_K \le \|f^o\|_K$;
(ii) for all $\alpha > 0$, there exists a unique argminimum $f^\alpha$ of $E_{z,\alpha,K}$ over $H_K(X)$, which satisfies

$$f^\alpha = \sum_{i=1}^{m} c_i^\alpha K_{u_i}, \quad \text{where} \quad c^\alpha = (c_1^\alpha, \dots, c_m^\alpha) = (K_m[u] + \alpha I_m)^{-1} v;$$

(iii) $\lim_{\alpha \to 0} \|f^\alpha - f^+\|_K = 0$.

Note that both argminima, $f^+$ and $f^\alpha$, are elements of the set $\mathrm{span}_m G_K(X)$ and thus they belong to sets of functions computable by one-hidden-layer networks with units from the dictionary $G_K(X)$. It is easy to show that for any sample $z$, the empirical error $E_z$ is continuous on $(C(X), \|.\|_{\sup})$ and also on any RKHS $(H_K(X), \|.\|_K)$.
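Part (i) of Theorem 2 can be illustrated numerically. The sketch below makes several assumptions that are not part of the text: a one-dimensional Gaussian kernel, three distinct hypothetical data points, and an invertible Gram matrix, in which case the pseudoinverse coincides with the ordinary inverse and the coefficients can be found by solving the linear system K[u] c = v. The helper `solve` is a plain Gaussian elimination written for this example.

```python
import math


def gaussian_kernel(x, y):
    """One-dimensional Gaussian kernel K(x, y) = exp(-(x - y)**2)."""
    return math.exp(-(x - y) ** 2)


def gram_matrix(kernel, u):
    """Matrix K[u] with entries K[u]_{i,j} = K(u_i, u_j)."""
    return [[kernel(ui, uj) for uj in u] for ui in u]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


# Hypothetical sample: distinct inputs u and outputs v.
u = [0.0, 1.0, 2.5]
v = [1.0, -1.0, 0.5]

# Assumption: K[u] is invertible here, so K[u]^+ v equals the solution of K[u] c = v.
c = solve(gram_matrix(gaussian_kernel, u), v)


def f_plus(x):
    """Kernel expansion f(x) = sum_i c_i K(x, u_i) centered at the inputs."""
    return sum(ci * gaussian_kernel(x, ui) for ci, ui in zip(c, u))


residuals = [f_plus(ui) - vi for ui, vi in zip(u, v)]
```

In this invertible case the expansion interpolates the data, so the residuals vanish up to rounding and the empirical error is zero. For a singular Gram matrix a true pseudoinverse routine would be needed instead of `solve`.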


V. Kůrková and P.C. Kainen

Proposition 1. Let $X \subseteq \mathbb{R}^d$, $z = (u, v)$, where $u = (u_1, \dots, u_m) \in X^m$ and $v = (v_1, \dots, v_m) \in \mathbb{R}^m$. Then
(i) $E_z$ is continuous on $(C(X), \|.\|_{\sup})$;
(ii) for any symmetric positive semidefinite kernel $K : X \times X \to \mathbb{R}$, $E_z$ is continuous on $(H_K(X), \|.\|_K)$.

Proof. (i) Let $f, h \in C(X)$ be such that $\|f - h\|_{\sup} < \delta$. Then

$$|E_z(f) - E_z(h)| = \frac{1}{m} \left| \sum_{i=1}^{m} (f(u_i) - h(u_i))(f(u_i) + h(u_i) - 2 v_i) \right| \le \delta (C + m\delta),$$

where $C = \max_{i=1,\dots,m} 2 f(u_i)$. So $E_z$ is continuous at $f$.
(ii) Let $F_u : H_K(X) \to \mathbb{R}^m$ be an evaluation operator defined for every $f \in H_K(X)$ as $F_u(f) = (f(u_1), \dots, f(u_m))$. Then $E_z(f) = \|F_u(f) - v\|_{2,m}^2$, where the norm $\|.\|_{2,m}$ on $\mathbb{R}^m$ is defined as $\|x\|_{2,m}^2 = \frac{1}{m} \sum_{i=1}^{m} x_i^2$. By the definition of RKHS, $F_u$ is continuous and so its composition with two continuous operators, the translation by $v$ and the norm $\|.\|_{2,m}$, is continuous, too.

The next proposition shows that any argminimum of a continuous functional over a dense subset is also an argminimum over the whole space.

Proposition 2. Let $\mathcal{Y}$ be a dense subset of a normed linear space $(\mathcal{X}, \|.\|_{\mathcal{X}})$, $T : (\mathcal{X}, \|.\|_{\mathcal{X}}) \to \mathbb{R}$ be a continuous functional, and $f \in \mathcal{Y}$ be an argminimum of $T$ over $\mathcal{Y}$. Then $f$ is an argminimum of $T$ over $\mathcal{X}$.

Proof. Assume by contradiction that there exists $g \in \mathcal{X}$ such that $T(g) < T(f)$. Let $\eta > 0$ be such that $T(g) + \eta < T(f)$. By continuity of $T$ at $g$, there exists $\delta > 0$ such that for all $g' \in \mathcal{X}$, $\|g - g'\|_{\mathcal{X}} < \delta$ implies $|T(g) - T(g')| < \eta$. By density of $\mathcal{Y}$, there exists $g' \in \mathcal{Y}$ with $\|g - g'\|_{\mathcal{X}} < \delta$ and so $|T(g) - T(g')| < \eta$. As $g' \in \mathcal{Y}$, we have $T(g) + \eta < T(f) \le T(g')$, which gives a contradiction.

4 Scaling of Kernels

In this section, we investigate relationships between RKHSs induced by different scalings of the same kernel. To describe such relationships we take advantage of properties of orthogonal bases of RKHSs formed by eigenfunctions of integral operators induced by these kernels.

For $X \subseteq \mathbb{R}^d$ and $\mu$ a σ-finite measure on $X$, $L^2_\mu(X)$ denotes the space of real-valued functions on $X$ satisfying $\int_X f(x)^2\, d\mu(x) < \infty$ with the norm $\|f\|_{L^2_\mu} = \left( \int_X |f(x)|^2\, d\mu(x) \right)^{1/2}$. When $\mu$ is Lebesgue measure, we omit it in the notation. For $X \subseteq \mathbb{R}^d$, a kernel $K : X \times X \to \mathbb{R}$, and a σ-finite measure $\mu$ on $X$, define an integral operator $L_{K,\mu} = L_K$ on the subspace of $L^2_\mu(X)$ formed by those $g$ for which for every $x \in X$ the integral

$$L_K(g)(x) := \int_X g(y)\, K(x, y)\, d\mu(y)$$

is finite. The following proposition from [17] gives a condition on a kernel $K$ which implies that the RKHS $H_K(X)$ induced by $K$ is actually a linear subspace of $L^2_\mu(X)$ with a continuous inclusion operator $J_K : (H_K(X), \|.\|_K) \to (L^2_\mu(X), \|.\|_{L^2_\mu})$. Recall that every bounded linear operator $T : (\mathcal{X}, \|\cdot\|_{\mathcal{X}}) \to (\mathcal{Y}, \|\cdot\|_{\mathcal{Y}})$ between two Hilbert spaces has an adjoint operator $T^* : (\mathcal{Y}, \|\cdot\|_{\mathcal{Y}}) \to (\mathcal{X}, \|\cdot\|_{\mathcal{X}})$ [19]. An operator $T$ on a Hilbert space is called a Hilbert-Schmidt operator if for any orthonormal basis $\{e_j \mid j \in I\}$ of $(\mathcal{X}, \|.\|_{\mathcal{X}})$, $\sum_{j \in I} \|T(e_j)\|_{\mathcal{Y}}^2 < \infty$. Let $T_K := J_K L_K : (L^2_\mu(X), \|.\|_{L^2_\mu}) \to (L^2_\mu(X), \|.\|_{L^2_\mu})$.

Proposition 3. Let $X \subseteq \mathbb{R}^d$, $\mu$ be a σ-finite measure on $X$, $K : X \times X \to \mathbb{R}$ be a symmetric positive semidefinite kernel such that $\int_X K(x, x)\, d\mu(x) < \infty$. Then
(i) $H_K(X) \subseteq L^2_\mu(X)$ and $J_K : (H_K(X), \|.\|_K) \to (L^2_\mu(X), \|.\|_{L^2_\mu})$ is continuous;
(ii) $L_K = J_K^* : (L^2_\mu(X), \|.\|_{L^2_\mu}) \to (H_K(X), \|.\|_K)$ and so $L_K$ is continuous;
(iii) $T_K$ is a Hilbert-Schmidt operator and both $J_K$ and $L_K$ are compact.

To describe a relationship between RKHSs induced by two different scalings, $K^a$ and $K^b$, of the same kernel $K$ we apply the spectral theorem to the operator $T_K := J_K L_K$ obtained by composing $L_K$ with $J_K$. The next theorem from [17] summarizes some properties of eigenfunctions and eigenvalues of these operators.

Theorem 3. Let $X \subseteq \mathbb{R}^d$ be measurable, $\mu$ a σ-finite measure on $X$, $K : X \times X \to \mathbb{R}$ symmetric positive semidefinite with $\int_X K(x, x)\, d\mu(x) < \infty$. Then
(i) $T_K : (L^2_\mu(X), \|.\|_{L^2_\mu}) \to (L^2_\mu(X), \|.\|_{L^2_\mu})$ is compact, self-adjoint, and positive;
(ii) there exists an at most countable orthonormal family $\{\psi_j \mid j \in I\}$ in $(L^2_\mu(X), \|.\|_{L^2_\mu})$ formed by eigenfunctions of $T_K$ with the corresponding family of non-negative eigenvalues $\{\lambda_j \mid j \in I\}$ ordered non-increasingly, which in the case of $I$ infinite converges to zero, such that for every $f \in L^2_\mu(X)$,

$$T_K(f) = \sum_{j \in I} \lambda_j \langle f, \psi_j \rangle_{L^2_\mu} \psi_j \quad \text{and} \quad K(x, y) = \sum_{j \in I} \lambda_j \psi_j(x) \psi_j(y); \qquad (4)$$

(iii) $\{\sqrt{\lambda_j}\, \psi_j \mid j \in I\}$ is an orthonormal basis of $(H_K(X), \|.\|_K)$, $\sum_{j \in I} \lambda_j < \infty$.

Note that for convolution kernels with $K(x, y) = k(x - y)$, $\int_X K(x, x)\, d\mu(x) = \int_X k(0)\, d\mu(x) = k(0)\, \mu(X)$, so the assumption of Theorem 3 holds if and only if $\mu(X) < \infty$. For Lebesgue measure $\mu$, it holds when $X$ is bounded. Using Theorem 3 one can easily prove that RKHSs induced by a sequence of scalings of the same kernel are nested and the ratio between their norms grows exponentially with the number of variables $d$. For $a > 0$ and $X \subseteq \mathbb{R}^d$ denote by $\frac{1}{a} X = \{ \frac{x}{a} \mid x \in X \}$, and for a function $\psi : X \to \mathbb{R}$, $\psi^a : \frac{1}{a} X \to \mathbb{R}$ denotes the function defined by $\psi^a(x) = \psi(ax)$.


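Lemma 1. Let $X \subseteq \mathbb{R}^d$ be Lebesgue measurable, $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ be a symmetric positive semidefinite kernel, and $a > 0$. Then for every eigenfunction $\psi$ of $T_K : L^2(X) \to L^2(X)$ and its eigenvalue $\lambda$, $\psi^a$ is an eigenfunction of $T_{K^a} : L^2(\frac{1}{a} X) \to L^2(\frac{1}{a} X)$ and its eigenvalue is $\frac{\lambda}{a^d}$.

Proof. For $y' = ay$, we have

$$T_{K^a}(\psi^a)(x) = \int_{\frac{1}{a} X} \psi(ay) K(ax, ay)\, dy = \frac{1}{a^d} \int_X \psi(y') K(ax, y')\, dy' = \frac{1}{a^d} T_K(\psi)(ax) = \frac{\lambda}{a^d} \psi(ax) = \frac{\lambda}{a^d} \psi^a(x).$$

Theorem 4. Let $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ be a symmetric positive semidefinite kernel such that $\int_X K(x, x)\, dx < \infty$, and $0 < a \le b$. Then there exists a one-to-one continuous mapping $J_{b,a} : H_{K^b}(\frac{1}{b} X) \to H_{K^a}(\frac{1}{a} X)$ such that for all $f \in H_{K^b}(\frac{1}{b} X)$

$$\|J_{b,a}(f)\|_{K^a} = \left( \frac{a}{b} \right)^{d/2} \|f\|_{K^b}.$$

Proof. By Theorem 3 and Lemma 1, there exists an orthonormal family $\{\psi_j\}$ of eigenfunctions of $T_K : L^2(X) \to L^2(X)$ with eigenvalues $\{\lambda_j\}$ such that $\{\sqrt{\lambda_j / a^d}\, \psi_j^a\}$ and $\{\sqrt{\lambda_j / b^d}\, \psi_j^b\}$ are orthonormal bases of $H_{K^a}(\frac{1}{a} X)$ and $H_{K^b}(\frac{1}{b} X)$, resp. Thus every $f \in H_{K^b}(\frac{1}{b} X)$ can be represented as $f = \sum_{j \in I} c_j \psi_j^b$, where $\sum_{j \in I} \frac{c_j^2 b^d}{\lambda_j} < \infty$. As $a \le b$ implies $\sum_{j \in I} \frac{c_j^2 a^d}{\lambda_j} \le \sum_{j \in I} \frac{c_j^2 b^d}{\lambda_j}$, the mapping $J_{b,a}$ defined by $J_{b,a}(\psi_j^b) = \psi_j^a$ maps $H_{K^b}(\frac{1}{b} X)$ to $H_{K^a}(\frac{1}{a} X)$. As $\|J_{b,a}(f)\|_{K^a}^2 = \sum_{j \in I} \frac{c_j^2 a^d}{\lambda_j}$ and $\|f\|_{K^b}^2 = \sum_{j \in I} \frac{c_j^2 b^d}{\lambda_j}$, we have $\frac{\|J_{b,a}(f)\|_{K^a}}{\|f\|_{K^b}} = \left( \frac{a}{b} \right)^{d/2}$. Hence $J_{b,a}$ is continuous.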

Theorem 4 shows that spaces induced by "sharper" modifications of a kernel $K$ are embedded into spaces induced by "flatter" modifications of $K$. Thus we have a nested family of RKHSs with continuous contractive embeddings. Theorem 4 also shows that "flattening" of a kernel increases the penalty represented by the stabilizer $\|.\|_{K^b}^2$ in the Tikhonov regularization (3). Moreover, the impact of such modification of the kernel norm depends on the input dimension $d$ exponentially.

In practical applications, instead of $\|.\|_K^2$, simpler stabilizers, such as the $\ell_1$ or the $\ell_2$-norm of output weights, are used [1]. If $G_{K^a}(X)$ is linearly independent (which holds for any strictly positive definite kernel $K$), each $f \in \mathrm{span}\, G_{K^a}(X)$ has a unique representation $f = \sum_{i=1}^{n} w_i K^a_{x_i}$. Hence, one can define a functional $W : \mathrm{span}\, G_{K^a}(X) \to \mathbb{R}$ by $W(f) = \sum_{i=1}^{n} |w_i|$. When $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is bounded with $c_K = \sup_{x \in \mathbb{R}^d} |K(x, x)|$, we have for all $a > 0$ and all $X \subseteq \mathbb{R}^d$, $\sup_{x \in X} |K^a(x, x)| \le c_K$. Thus,

$$\|f\|_{K^a} \le \sum_{i=1}^{n} |w_i|\, \|K^a_{x_i}\|_{K^a} \le W(f)\, c_K^{1/2}.$$

Therefore, by decreasing the $\ell_1$-norms of the output weights, one also decreases $\|.\|_{K^a}$-norms for all $a > 0$.
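The bound of the kernel norm by the sum of absolute output weights can be checked numerically. The sketch below is an illustration under stated assumptions: it uses the one-dimensional Gaussian kernel (for which the supremum of K(x, x) is 1), the scaling K^a(x, y) = K(ax, ay), hypothetical centers and weights, and the standard expression of the squared RKHS norm of a finite kernel expansion as the quadratic form of its weights with the Gram matrix.

```python
import math


def scaled_gaussian(a):
    """Scaled kernel K^a(x, y) = K(a x, a y) for K(x, y) = exp(-(x - y)**2)."""
    return lambda x, y: math.exp(-(a * (x - y)) ** 2)


def rkhs_norm(kernel, centers, weights):
    """Norm of f = sum_i w_i K_{x_i}: sqrt(sum_{i,j} w_i w_j K(x_i, x_j))."""
    total = sum(wi * wj * kernel(xi, xj)
                for wi, xi in zip(weights, centers)
                for wj, xj in zip(weights, centers))
    return math.sqrt(max(total, 0.0))  # guard against tiny negative rounding


# Hypothetical centers and output weights.
centers = [0.0, 0.3, 1.0]
weights = [1.5, -2.0, 0.5]

l1_norm = sum(abs(w) for w in weights)  # the functional W(f)
c_K = 1.0                               # sup of K(x, x) for the Gaussian kernel

norms = {a: rkhs_norm(scaled_gaussian(a), centers, weights)
         for a in (0.5, 1.0, 4.0)}
```

For every width in `norms` the computed kernel norm stays below `l1_norm * math.sqrt(c_K)`, in line with the inequality above.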

5 Gaussian Kernels

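The paradigmatic example of a kernel is the Gaussian kernel. It is the convolution kernel induced by the Gaussian function. Let $\gamma_d : \mathbb{R}^d \to \mathbb{R}$ denote the $d$-dimensional Gaussian,

$$\gamma_d(x) := e^{-\|x\|^2},$$

and $K_{\gamma_d} : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ the Gaussian kernel,

$$K_{\gamma_d}(x, y) := e^{-\|x - y\|^2}.$$

Note that for the Gaussian kernel $K_{\gamma_d}$ on $\mathbb{R}^d \times \mathbb{R}^d$ and the Lebesgue measure, $\int_{\mathbb{R}^d} K_{\gamma_d}(x, x)\, dx = \infty$. Thus we cannot derive the ratio between the norms $\|.\|_{K^a}$ and $\|.\|_{K^b}$ from Theorem 4 because its assumption is not satisfied. Instead, we can take advantage of the characterization of kernel norms in terms of Fourier transforms from Theorem 1. The next corollary gives a characterization of spaces $H_{K_{\gamma_d}}(\mathbb{R}^d)$ induced by Gaussian kernels on $\mathbb{R}^d$.

Corollary 1. For all $a > 0$, $H_{K_{\gamma_d^a}}(\mathbb{R}^d) = \{f \in L^2(\mathbb{R}^d) \mid \|f\|_{K_{\gamma_d^a}} < \infty\}$, where

$$\|f\|_{K_{\gamma_d^a}}^2 = \left( \frac{a}{\sqrt{\pi}} \right)^d \int_{\mathbb{R}^d} \frac{\hat{f}(s)^2}{e^{-\|s/(2a)\|^2}}\, ds,$$

and for all $b \ge a$,

$$\|f\|_{K_{\gamma_d^b}} \le \left( \frac{b}{a} \right)^{d/2} \|f\|_{K_{\gamma_d^a}}.$$

Proof. By Theorem 1 and the formula

$$\widehat{\gamma_d^a}(s) = (\sqrt{2}\, a)^{-d}\, \gamma_d\!\left( \frac{s}{2a} \right),$$

one gets

$$\frac{\|f\|_{K_{\gamma_d^b}}^2}{\|f\|_{K_{\gamma_d^a}}^2} = \left( \frac{b}{a} \right)^d \frac{\int_{\mathbb{R}^d} \hat{f}(s)^2\, e^{\|s/(2b)\|^2}\, ds}{\int_{\mathbb{R}^d} \hat{f}(s)^2\, e^{\|s/(2a)\|^2}\, ds}.$$

As $a \le b$ implies $e^{\|s/(2b)\|^2} \le e^{\|s/(2a)\|^2}$, we have

$$\frac{\|f\|_{K_{\gamma_d^b}}}{\|f\|_{K_{\gamma_d^a}}} \le \left( \frac{b}{a} \right)^{d/2}.$$

So Corollary 1 shows that with sharpening of the Gaussian, the norm on the induced RKHS is decreasing exponentially fast. The next theorem summarizes some properties of Gaussian kernel models; for (i), see Mhaskar [5]; for (ii), Kůrková and Neruda [20]; (iii) follows from (ii).

Theorem 5. Let $d$ be a positive integer and $X \subset \mathbb{R}^d$ be compact. Then
(i) for all $a > 0$, $\mathrm{span}\, G_{K_{\gamma_d^a}}(X)$ and $H_{K_{\gamma_d^a}}(X)$ are dense subspaces of $(C(X), \|.\|_{\sup})$;
(ii) the set $F_{K_{\gamma_d}}(\mathbb{R}^d) = \bigcup_{a > 0} G_{K_{\gamma_d^a}}(\mathbb{R}^d)$ is linearly independent;
(iii) for all $a, b > 0$ such that $a \ne b$, $\mathrm{span}\, G_{K_{\gamma_d^a}}(\mathbb{R}^d) \cap \mathrm{span}\, G_{K_{\gamma_d^b}}(\mathbb{R}^d) = \emptyset$.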
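The Fourier transform formula for the scaled Gaussian used in the proof of Corollary 1 can be verified numerically in dimension d = 1. The sketch below is only a sanity check under stated assumptions: it approximates the integral by the trapezoidal rule on a truncated interval (the parameters `half_width` and `steps` are arbitrary choices), and it uses the fact that for an even real function the transform reduces to a cosine integral.

```python
import math


def gaussian_ft_numeric(a, s, half_width=12.0, steps=20000):
    """Trapezoidal value of (2*pi)**-0.5 * integral of cos(x*s) * exp(-(a*x)**2) dx."""
    lo, hi = -half_width / a, half_width / a
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * math.cos(x * s) * math.exp(-(a * x) ** 2)
    return total * h / math.sqrt(2.0 * math.pi)


def gaussian_ft_closed_form(a, s):
    """(sqrt(2) * a)**-1 * exp(-(s / (2a))**2), the d = 1 case of the formula."""
    return math.exp(-(s / (2.0 * a)) ** 2) / (math.sqrt(2.0) * a)


# Compare both values for a few hypothetical widths a and frequencies s.
checks = [(a, s, gaussian_ft_numeric(a, s), gaussian_ft_closed_form(a, s))
          for a in (0.5, 1.0, 2.0)
          for s in (0.0, 1.0, 3.0)]
```

Each tuple in `checks` holds the width, the frequency, the numerical value, and the closed-form value; the last two agree to high precision, which is consistent with the factor $(\sqrt{2}\,a)^{-d}$ for d = 1.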


By Proposition 2, for any width $a > 0$, all argminima of $E_z$ over the RKHS $H_{K_{\gamma_d^a}}(X)$ induced by the Gaussian kernel with the width $a$ are also argminima of $E_z$ over the whole space $C(X)$. The next corollary describes a class of argminima of the empirical error in the space $C(X)$ obtained as linear combinations of Gaussians with various widths. For a set $A$, $\mathrm{conv}\, A$ denotes the convex hull of $A$.

Corollary 2. Let $X$ be a compact subset of $\mathbb{R}^d$, $z = (u, v)$, where $u = (u_1, \dots, u_m) \in X^m$, $v = (v_1, \dots, v_m) \in \mathbb{R}^m$. Then the set of argminima of the empirical error $E_z$ over $C(X)$ contains the set $\mathrm{conv}\{f_a^+ \mid a > 0\}$, where $f_a^+ = \sum_{i=1}^{m} c_i^a (K_{\gamma_d^a})_{u_i}$ with $c^a = K_{\gamma_d^a}[u]^+ v$.

Proof. By Theorem 2 (i), the argminimum of $E_z$ over $H_{K_{\gamma_d^a}}(X)$ has the form $f_a^+ = \sum_{i=1}^{m} c_i^a (K_{\gamma_d^a})_{u_i}$ and thus $f_a^+ \in \mathrm{span}\, G_{K_{\gamma_d^a}}(X)$. By Theorem 5 (i), $\mathrm{span}\, G_{K_{\gamma_d^a}}(X)$ is dense in $(C(X), \|.\|_{\sup})$ and by Proposition 1, $E_z$ is continuous on $(C(X), \|.\|_{\sup})$. Thus by Proposition 2, $f_a^+$ is an argminimum of $E_z$ over $C(X)$. As the set of argminima is convex, the statement holds.

Corollary 2 shows that in the space of continuous functions $C(X)$, for each width $a > 0$ there is an argminimum of the empirical error formed by a linear combination of Gaussians with the width $a$. All these Gaussians have the same centers given by the input data. The set of all these argminima is linearly independent and all their convex combinations are also argminima. By Corollary 1, the ratio of sizes of stabilizers in the form of filters $\frac{1}{(2\pi)^{d/2}} \int_S \frac{\hat{f}(s)^2}{\hat{k}(s)}\, ds$ with $k$ being the Gaussian with widths $b$ and $a$, resp., where $a < b$, grows with increasing dimension exponentially as $\left( \frac{b}{a} \right)^d$.

Acknowledgements. V. K. was partially supported by MŠMT program COST grant INTELLIOC10047 and the Institutional Research Plan AV0Z10300504. Collaboration of V. K. and P. C. K. was partially supported by MŠMT program KONTAKT grant ALNN ME10023.

References

1. Fine, T.L.: Feedforward Neural Network Methodology. Springer, Heidelberg (1999)
2. Kecman, V.: Learning and Soft Computing. MIT Press, Cambridge (2001)
3. Park, J., Sandberg, I.: Universal approximation using radial-basis-function networks. Neural Computation 3, 246–257 (1991)
4. Park, J., Sandberg, I.: Approximation and radial basis function networks. Neural Computation 5, 305–316 (1993)
5. Mhaskar, H.N.: Versatile Gaussian networks. In: Proceedings of IEEE Workshop of Nonlinear Image Processing, pp. 70–73 (1995)
6. Kainen, P.C., Kůrková, V., Sanguineti, M.: Complexity of Gaussian radial basis networks approximating smooth functions. J. of Complexity 25, 63–74 (2009)
7. Cucker, F., Smale, S.: On the mathematical foundations of learning. Bulletin of AMS 39, 1–49 (2002)
8. Poggio, T., Smale, S.: The mathematics of learning: dealing with data. Notices of AMS 50, 537–544 (2003)
9. Kůrková, V.: Neural network learning as an inverse problem. Logic Journal of IGPL 13, 551–559 (2005)
10. Gribonval, R., Vandergheynst, P.: On the exponential convergence of matching pursuits in quasi-incoherent dictionaries. IEEE Trans. on Information Theory 52, 255–261 (2006)
11. Aronszajn, N.: Theory of reproducing kernels. Transactions of AMS 68, 337–404 (1950)
12. Strichartz, R.: A Guide to Distribution Theory and Fourier Transforms. World Scientific, NJ (2003)
13. Loustau, S.: Aggregation of SVM classifiers using Sobolev spaces. Journal of Machine Learning Research 9, 1559–1582 (2008)
14. Girosi, F.: An equivalence between sparse approximation and support vector machines. Neural Computation (AI memo 1606) 10, 1455–1480 (1998)
15. Girosi, F., Poggio, T.: Regularization algorithms for learning that are equivalent to multilayer networks. Science 247(4945), 978–982 (1990)
16. Girosi, F., Jones, M., Poggio, T.: Regularization theory and neural networks architectures. Neural Computation 7, 219–269 (1995)
17. Kůrková, V.: Learning from data as an inverse problem in reproducing kernel Hilbert spaces. Inverse Problems in Science and Engineering (2010) (submitted)
18. Wahba, G.: Spline Models for Observational Data. SIAM, Philadelphia (1990)
19. Friedman, A.: Modern Analysis. Dover, New York (1982)
20. Kůrková, V., Neruda, R.: Uniqueness of functional representations by Gaussian basis function networks. In: Proceedings of ICANN 1994, pp. 471–474. Springer, London (1994)

Evaluating Reliability of Single Classifications of Neural Networks

Darko Pevec, Erik Štrumbelj, and Igor Kononenko

University of Ljubljana, Faculty of Computer and Information Science, Tržaška 25, 1000 Ljubljana, Slovenia
{darko.pevec,erik.strumbelj,igor.kononenko}@fri.uni-lj.si

Abstract. Current machine learning algorithms perform well on many problem domains, but in risk-sensitive decision making, for example in medicine and ﬁnance, common evaluation methods that give overall assessments of models fail to gain trust among experts, as they do not provide any information about single predictions. We continue the previous work on approaches for evaluating the reliability of single classiﬁcations where we focus on methods that are model independent. These methods have been shown to be successful in their narrow ﬁelds of application, so we constructed a testing methodology to evaluate these methods in straightforward, general-use test cases. For the evaluation, we had to derive a statistical reference function, which enables comparison between the reliability estimators and the model’s own predictions. We compare ﬁve diﬀerent approaches and evaluate them on a simple neural network with several artiﬁcial and real-world domains. The results indicate that reliability estimators CNK and LCV can be used to improve the model’s predictions. Keywords: Reliability estimation, Classiﬁcation, Prediction accuracy, Prediction error.

1 Introduction

In supervised learning, one of the goals is to get the best possible prediction accuracy on new and unknown examples. Common evaluation methods like the mean square error give an averaged accuracy assessment of models; however, in cases where predictions may have significant consequences, such methods become insufficient, as we want to back individual predictions up with a more credible reliability statement. In risk-sensitive decision making, for example in medicine or finance, having information on single prediction reliability could be of great benefit.

Various methods have been developed to enable the users of classification and regression models to gain more insight into the reliability of individual predictions [2,4]. We take the model-independent black box approach because of its generality and exploit the fact that it is possible to compute class probability distributions for every classification model. We adopted four approaches to reliability estimation of individual examples from [2] and a transduction-based method from [5]. They were evaluated with a simple neural network on twenty domains gathered from the UCI Machine Learning Repository [1].

A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part I, LNCS 6593, pp. 22–30, 2011. © Springer-Verlag Berlin Heidelberg 2011

1.1 Related Work

The most relevant related work is in the area of reliability estimation for individual examples. An appropriate criterion for differentiating between various approaches is whether they target a specific predictive model or they are model-independent. There exists a broad community developing methods specific to the neural network model, but in this paper, we focus on model-independent black box approaches.

The idea of reliability estimation for individual predictions originated in statistics, where confidence values and intervals are used to express the reliability of estimates. In machine learning, statistical properties of predictive models were used to extend predictions with reliability estimations [4]. Because the model-independent approaches are general, they do not exploit parameters that are specific to a given predictive model. They rather focus on influencing the parameters that are available in the standard supervised learning framework (e.g. the learning set and attributes) [2].

The paper is organized as follows. Section 2 presents a general framework in which we perform the reliability estimation, and describes the reliability estimators we used. The testing methodology, typical test cases and our main results follow in Section 3. Lastly, Section 4 provides conclusions and some ideas for further work.

2 Reliability Estimation

In general, reliability is the ability of a person or system to perform and maintain its functions in routine circumstances. Even the engineering definition is not of great help, as the engineering reliability is the ability of a system or component to perform its required functions under stated conditions for a specified period of time. We expect the reliability estimators to give insight into the prediction accuracy and we first expect to see some positive correlation between the two [2].

To get a more formal definition of reliability, we need to start with defining what we consider to be the error of a single prediction. Let $x$ be a single example that has a true class $y$. Let the true conditional probabilities of the $i$-th class be $p_i(x) = P(Y = i \mid X = x)$ and let $f_i(x)$ be the predicted probability of the $i$-th class. In the case of a hypothetical optimal model, it would hold that $\forall i : f_i(x) = p_i(x)$. We write the single classification (Laplace) error as

$$e(x) = |y - f(x)|. \qquad (1)$$

When we let $\hat{y} = \max_i f_i(x)$ be the model's prediction, the above equation equals $1 - \hat{y}$ in the case of a correct classification and $e(x) = \hat{y}$ in the case of a misclassification. The expected value of our error function is

$$E[e(x)] = p_y(x)(1 - f_y(x)) + (1 - p_y(x)) f_y(x).$$

Because the true conditional probabilities are inaccessible, for approximation we can substitute them with the model's predicted probabilities. Substituting $p_y(x)$ with $f_y(x)$ and further $f_y(x)$ with the prediction $\hat{y}$, we get a reference estimator of the expected error:

$$O_{ref} = 2(\hat{y} - \hat{y}^2). \qquad (2)$$

This reference reliability Eq. (2) has two desirable properties. First, we can compute it for every prediction, because $\hat{y}$ is easily accessible. Second, in case of an optimal (or near-optimal) model, this reference becomes optimal as well (that is, equals the error function). Our use of the reference estimator is analogous to using the relative frequency of the majority class as a reference for evaluating the overall accuracy achieved by a classifier. For example, 90% accuracy might appear good. However, in cases where the relative frequency of the majority class is over 90%, it is not. Similarly, if a reliability estimator does not outperform the reference reliability, it is not considered to be useful.

We consider four additional estimators of single prediction reliability in the following. When measuring the distance between two probability distributions, the Hellinger distance was used.
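The substitution that leads to the reference estimator can be traced in a few lines of code. This is a minimal sketch; the function names and the predicted probability vector are hypothetical.

```python
def expected_error(p_y, f_y):
    """Expected error p_y * (1 - f_y) + (1 - p_y) * f_y."""
    return p_y * (1.0 - f_y) + (1.0 - p_y) * f_y


def reference_estimate(y_hat):
    """Reference estimator 2 * (y_hat - y_hat**2) of the expected error."""
    return 2.0 * (y_hat - y_hat ** 2)


# Hypothetical predicted class probabilities for one example.
predicted = [0.7, 0.2, 0.1]
y_hat = max(predicted)               # the model's prediction
o_ref = reference_estimate(y_hat)    # 2 * (0.7 - 0.49) = 0.42
```

Replacing the inaccessible true probability by the predicted one, `expected_error(y_hat, y_hat)`, gives the same value as `reference_estimate(y_hat)`. The estimate is 0 for a fully confident prediction and reaches its maximum of 0.5 at a predicted probability of 0.5.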

2.1 Local Modeling of Prediction Error

Let $K$ be the predictor's class probability distribution for a given unlabeled example $(x, \_)$. This approach to local estimation of prediction reliability is based on the nearest neighbors' labels. Given a set of nearest neighbors $N = [(x_1, C_1), \dots, (x_k, C_k)]$, where $C_i$ is the true label of the $i$-th nearest neighbor, the estimate CNK ($C_{Neighbors} - K$) is defined for the unlabeled example from the average distance between the labels of the $k$ nearest neighbors and the example's prediction $K$:

$$CNK = 1 - \frac{\sum_{i=1}^{k} d(C_i, K)}{k}, \qquad (3)$$

where $d$ denotes the distance between two probability distributions. CNK is obviously not a suitable reliability estimate for the k-nearest neighbors algorithm, as they both work by the same principle.
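The CNK estimate can be sketched as follows. The sketch rests on assumptions that are not spelled out in the text: the true label of each neighbor is represented as a one-hot distribution, and the Hellinger distance is normalized to the interval [0, 1]. The function names and the example values are hypothetical.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions, scaled to [0, 1]."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2
                         for a, b in zip(p, q)) / 2.0)


def one_hot(label, n_classes):
    """Assumed encoding of a true label as a degenerate distribution."""
    return [1.0 if i == label else 0.0 for i in range(n_classes)]


def cnk(prediction, neighbor_labels):
    """1 minus the average distance between neighbors' labels and the prediction."""
    n_classes = len(prediction)
    distances = [hellinger(one_hot(c, n_classes), prediction)
                 for c in neighbor_labels]
    return 1.0 - sum(distances) / len(distances)


# Hypothetical prediction K and two neighborhoods of size k = 3.
prediction = [0.9, 0.1]
agree = cnk(prediction, [0, 0, 0])      # neighbors share the predicted class
disagree = cnk(prediction, [1, 1, 1])   # neighbors contradict the prediction
```

With these values `agree` is clearly larger than `disagree`, so neighbors whose labels support the prediction raise the estimate.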

2.2 Local Cross-Validation Reliability Estimator

The LCV (Local Cross-Validation) reliability estimate is computed using the local leave-one-out procedure. Suppose that we are given an unlabeled example for which we wish to compute the prediction and the LCV estimate. Focusing on the subspace defined by $k$ nearest neighbors (parameter $k$ is selected in advance), we then generate $k$ local models, each of them excluding one of the $k$ nearest neighbors. Using the generated models, we compute the leave-one-out predictions $K_i$, $i = 1, \dots, k$ for each of the nearest neighbors. Since the labels $C_i$, $i = 1, \dots, k$ of the nearest neighbors are known, we are able to calculate the absolute local leave-one-out prediction error $E_i = d(C_i, K_i)$, where $d$ denotes the distance between the label and the prediction. The LCV estimate is then computed as 1 minus the average of the nearest neighbors' local errors $E_i$. The procedure is schematically illustrated and accompanied by a pseudo-code algorithm in [2].

In experimental work, the algorithm was implemented to be adaptive with respect to the size of the neighborhood, that is, to the number of examples in the learning set. The parameter $k$ was therefore set to one tenth of the size of the learning set.

2.3 Variance of a Bagged Model

The variance of predictions in bagged aggregates was first used to indirectly estimate the reliability of the aggregated prediction with artificial neural networks. Since an arbitrary regression model can be used with the bagging technique, the technique was generalized and used as a reliability estimate for use with other regression models [2]. Given a bagged aggregate of $m$ predictive models, where each of the models yields a prediction $B_k$, $k = 1, \dots, m$, the reliability estimator BAGV is defined as 1 minus the variance of predictions' class probability distribution:

$$BAGV = 1 - \frac{1}{m} \sum_{k=1}^{m} \sum_{i} d(B_{k,i}, K_i)^2, \qquad (4)$$

where $d$ denotes the distance between its two arguments.

2.4 Density-Based Reliability Estimator

The density-based estimation of prediction error assumes that error is lower for predictions which are made in denser training problem subspaces, and higher for predictions which are made in sparser subspaces. Based on this assumption, we trust the prediction with respect to the quantity of information available for its computation. A typical use is with decision and regression trees, where we trust each prediction according to the number of learning examples that fall in the same leaf of a tree as the predicted example.

The reliability estimator DENS is a value of the estimated probability density function for a given unlabeled example. To estimate the density, Parzen windows were used, taking the Gaussian kernel. The problem of computing the multidimensional Gaussian kernel was reduced to computing the two-dimensional kernel by using a distance function applied to pairs of example vectors. Given the learning set $L = [(x_1, c_1), \dots, (x_l, c_l)]$, the density estimate for an unlabeled example $(x, \_)$ is therefore defined as

$$p(x) = \frac{\sum_{e \in L} \kappa(D(x, e))}{l},$$

where $D$ denotes the distance function and $\kappa$ denotes a kernel function (in our case Gaussian). Therefore the reliability estimate is given by:

$$DENS(x) = \max_{e \in L}(p(e)) - p(x). \qquad (5)$$
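A density-based estimate of this kind can be sketched with a Parzen window. The sketch makes assumptions beyond the text: the Euclidean distance as the distance function, a standard Gaussian window with unit bandwidth, and a small hypothetical learning set given only by its input vectors.

```python
import math


def euclidean(x, e):
    """Assumed distance function between two example vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, e)))


def gaussian(t):
    """Standard Gaussian window applied to a distance t (unit bandwidth assumed)."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def density(x, learning_inputs):
    """Parzen estimate: average of the kernel over distances to all learning examples."""
    return (sum(gaussian(euclidean(x, e)) for e in learning_inputs)
            / len(learning_inputs))


def dens(x, learning_inputs):
    """Largest density over the learning examples minus the density at x."""
    peak = max(density(e, learning_inputs) for e in learning_inputs)
    return peak - density(x, learning_inputs)


# Hypothetical learning inputs: a tight cluster and one isolated point.
learning_inputs = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (3.0, 3.0)]
near = dens((0.05, 0.05), learning_inputs)   # inside the cluster
far = dens((10.0, 10.0), learning_inputs)    # far from all learning examples
```

The value `far` is larger than `near`, since the second query point lies in a much sparser region of the learning set.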

3 Results

First, we describe the testing methodology, then we examine three main types of cases. We proceed and examine whether there are any performance differences between the estimators.

3.1 Testing Methodology

For testing, 20 benchmark data sets were used, available from the UCI Machine Learning Repository [1]. Each data set is a classification problem; the application domains vary. A brief summary of the data sets is given in Table 1, where we see that there are 5 domains with only discrete attributes, 9 with only continuous attributes and the remaining 6 have a mixture of both.

Table 1. Brief summary of the testing data sets

dataset          #instances   #discrete   #continuous
housevotes       435          16          0
wine             178          0           13
parkinsons       195          0           22
zoo              101          16          0
tic-tac-toe      958          8           0
postoperative    90           7           1
monks-3          432          5           0
irisset          150          0           4
glass            214          0           9
hungarian        294          7           6
ecoli            336          0           7
heart            303          7           6
haberman         306          0           3
flag             194          18          10
wdbc             569          0           30
breast-cancer    369          0           9
sonar            111          0           60
hepatitis        155          13          6
lungcancer       32           56          0

Testing is performed using the leave-one-out cross-validation procedure. For each learning example that is left out in the current iteration, we compute the prediction and all the reliability estimates. The performance of reliability estimators is measured by computing the Spearman's rank correlation coefficient between each set of reliability estimates and the real prediction error $e(x) = 1 - f_y(x)$. We also compute the Spearman's coefficients for the reference function (Eq. 2) and our error function (Eq. 1) as control.

Next, we compare the coefficient of each estimator with the coefficient of the reference function. If one of the two coefficients is not significantly different from the control, we count that as an insignificant case. If both coefficients are significant, we further test their difference with a Z-test [3]. If the Z-test confirms that the two coefficients are significantly different, we measure whether the coefficient of the reliability estimator is greater (better) or smaller (worse) than that of the reference function.

The performance of the reliability estimates was tested using a three-layered perceptron with five hidden neurons implemented in a package for R [6,7].
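The rank correlation at the core of this methodology can be sketched in plain Python. The sketch assumes the common definition of Spearman's coefficient as the Pearson correlation of average ranks; the significance tests and the Z-test of the text are not reproduced here, and the two value lists are hypothetical.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical reliability estimates and prediction errors for five examples.
reliability = [0.9, 0.8, 0.4, 0.3, 0.6]
error = [0.1, 0.2, 0.7, 0.9, 0.3]
rho = spearman(reliability, error)
```

In this constructed example the ordering of the errors is exactly the reverse of the ordering of the reliability estimates, so `rho` equals -1 up to rounding.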

3.2 Typical Cases

Here we present three typical cases we found during testing. We present plots of achieved prediction accuracy against the calculated reliability estimates, from which it is possible to examine the correlation between the two. The second plots show the separation power of correct and false predictions by means of the reliability estimates.

The first example in Fig. 1 shows positive performance of the CNK estimator on the dataset wine. We see some linear correlation and good separation of correctly and falsely classified examples. In the majority of experiments, reliability estimators did not perform as well at predicting accuracy as the underlying learning algorithms did. An example of such a negative test case for the estimator LCV is presented in Fig. 2. We see there is some correlation; however, we also see that there is no good separation of correct classifications and misclassifications. The third example, shown in Fig. 3, presents the typical response of the DENS reliability estimator. The learning spaces were uniformly dense and there is no evident correlation between the estimates and the prediction accuracy.

[Figure: two panels. First panel: prediction accuracy against reliability estimates. Second panel: relative frequency of false and correct classifications against reliability estimates.]

Fig. 1. Example of 86% positive correlation between reliability estimates and the prediction accuracy (estimator CNK and dataset wine)

[Figure: two panels. First panel: prediction accuracy against reliability estimates. Second panel: relative frequency of false and correct classifications against reliability estimates.]

Fig. 2. Example of average underperformance (29% correlation) of the reliability estimator LCV on the dataset glass

[Figure: two panels. First panel: prediction accuracy against reliability estimates. Second panel: relative frequency of false and correct classifications against reliability estimates.]

Fig. 3. Example of the insignificant behavior of the DENS estimator (dataset wine)

3.3 Main Results

We read Table 2 for the estimator CNK (ﬁrst row) in the following way: the estimator CNK had a signiﬁcantly higher correlation coeﬃcient with the prediction error than the reference function in three datasets and in nine datasets the estimator had signiﬁcantly lower correlation than the reference function. In three datasets both coeﬃcients were signiﬁcant, but not diﬀerent according to a Z-test and in ﬁve datasets, at least one of the two correlation coeﬃcients was not signiﬁcant.


Table 2. Experimental evaluation of reliability estimators. The correlation of reliability estimates with the prediction accuracy is compared to the correlation of the reference function with the prediction accuracy. Each column gives the number of datasets with the achieved difference in correlations.

estimator       better   worse   equal   insignificant
cnk             3        9       3       5
lcv             2        11      3       4
dens            0        8       0       12
lcv.dist        0        7       2       11
trans.last      0        9       0       11
trans.second    0        9       0       11
bagv.all        0        12      0       8
bagv.class      0        13      0       7
trans.first     0        15      1       4
prediction      1        1       16      2
reference       -        -       18      2

The last row indicates in how many tests the reference function had insignificant correlation with the prediction error and the notion of better or worse performance is inapplicable for this row. The second to last row presents the control test, where we compare the correlation of the model’s predictions with the prediction error and that of the reference function. The rows of this table sum to 20, the number of datasets. These results are visualized in Fig. 4. In 10% of experiments, the reference function did not produce signiﬁcant results (the last row of Table 2), so we should interpret this as the initial error (or insigniﬁcance) of the results. The control test (the second last row of

cnk lcv dens lcv.dist trans.last trans.second bagv.all bagv.class trans.first prediction reference 0% positive

25% insignificant

50%

75% equal

100% negative

Fig. 4. Visualization of the experimental evaluation of reliability estimators. Correlation with prediction accuracy is compared to the reference function.


Table 2) shows that the model's predictions differ from the reference function in accordance with the previously chosen α = 0.05 (5% positive, 5% negative cases). In the sorted results, the estimators CNK and LCV stand apart from the other estimators. In Fig. 1 we saw the best results for CNK; together with its other positive cases they represent 15%. For the estimator LCV from Fig. 2, the positive cases sum to 10%.

4 Conclusions

We wanted to know whether we could gain new insight into the reliability assessment of single classifications, so we evaluated existing methods. We derived a reference function, which enables comparison between the reliability estimators and the model's own predictions. With the reference function as a reference point, we evaluated five methods on twenty general-use datasets. Our main results show that in the majority of experiments, reliability estimators did not perform as well at predicting accuracy as did the reference function. The results suggest that the methods CNK and LCV are in some cases able to better separate the model's results. In further work, we will focus on interval estimates rather than on point-wise estimates and we will continue the exploration of similar reliability concepts in regression modeling.

Acknowledgements. This work was supported by a grant from the Slovenian Research Agency (P2-0209).

References

1. Asuncion, A., Newman, D.J.: UCI Machine Learning Repository (2007), http://archive.ics.uci.edu/ml/
2. Bosnić, Z., Kononenko, I.: Comparison of approaches for estimating reliability of individual regression predictions. Data Knowl. Eng. 67(3), 504–516 (2008)
3. Kanji, G.K.: 100 statistical tests. SAGE Publications, Thousand Oaks (2006)
4. Kukar, M., Kononenko, I.: Reliable classifications with machine learning. In: Elomaa, T., Mannila, H., Toivonen, H. (eds.) ECML 2002. LNCS (LNAI), vol. 2430, pp. 1–8. Springer, Heidelberg (2002)
5. Kukar, M.: Quality assessment of individual classifications in machine learning and data mining. Knowledge and Information Systems 9(3), 364–384 (2006)
6. Ripley, B.D.: Pattern Recognition and Neural Networks, Cambridge (1996)
7. R Development Core Team: A Language and Environment for Statistical Computing. In: R Foundation for Statistical Computing, Vienna (2006)

Nonlinear Predictive Control Based on Multivariable Neural Wiener Models

Maciej Ławryńczuk

Institute of Control and Computation Engineering, Warsaw University of Technology
ul. Nowowiejska 15/19, 00-665 Warsaw, Poland
Tel.: +48 22 234-76-73

Abstract. This paper describes a nonlinear Model Predictive Control (MPC) scheme in which a neural Wiener model of a multivariable process is used. The model consists of a linear dynamic part in series with a steady-state nonlinear part represented by neural networks. A linear approximation of the model is calculated on-line and used for prediction. As a result, the control policy is calculated from a quadratic programming problem. Good control accuracy and computational efficiency of the discussed algorithm are shown in the control system of a chemical reactor for which the classical MPC strategy based on a linear model is unstable.

Keywords: Process control, Model Predictive Control, Wiener systems, neural networks, optimisation, soft computing.

1 Introduction

In Model Predictive Control (MPC) algorithms an explicit dynamic model of the process is used on-line to predict its future behaviour and to optimise the future control policy [1,6,11]. The MPC technique has a few important advantages. First, because the model is used for prediction, constraints can be easily imposed on process inputs (manipulated variables) and outputs (controlled variables). Secondly, MPC can be efficiently used for multivariable processes, with many inputs and outputs, and for processes with difficult dynamic properties (e.g. with significant time-delays or with the inverse response). As a result, MPC algorithms have been successfully used for years in numerous advanced applications, ranging from chemical engineering to aerospace [10].

Classical MPC algorithms use linear models. Although such an approach may lead to quite good control quality in many cases, for strongly nonlinear processes the classical linear MPC technique is likely to result in unacceptable system behaviour, e.g. instability. Current research and practical applications concentrate on nonlinear MPC algorithms which use nonlinear models [11]. In particular, MPC algorithms based on neural models attract attention [5,8,11]. Neural models are worth considering not only because they offer good accuracy, but also because they have a reasonably small number of parameters and a simple structure. As a result, neural models can be efficiently used on-line in MPC.

A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part I, LNCS 6593, pp. 31–40, 2011. © Springer-Verlag Berlin Heidelberg 2011


Neural models are purely black-box ones. The model structure has nothing to do with the technological nature of the process and its parameters have no physical interpretation. An alternative is to use block-oriented models which consist of a linear dynamic part and a nonlinear steady-state part. In particular, Hammerstein and Wiener models with neural steady-state parts can be efficiently used for modelling, fault detection and control of various technological processes [3].

This paper details a nonlinear MPC algorithm for multivariable processes. The algorithm uses a neural Wiener model. A linear approximation of the model is successively calculated on-line and then used for prediction. Thanks to such an approach, the control policy is found from an easy to solve (convex) quadratic programming problem. To show good control accuracy and computational efficiency of the algorithm, the control system of a chemical reactor is considered. It is demonstrated that the classical MPC algorithm based on a linear model is unstable, whereas the discussed algorithm is stable and precise.

2 Model Predictive Control Algorithms

In MPC algorithms at each consecutive sampling instant $k$, $k = 0, 1, 2, \ldots$, a set of future control increments

$$\Delta\boldsymbol{u}(k) = \begin{bmatrix} \Delta u(k|k) \\ \vdots \\ \Delta u(k+N_u-1|k) \end{bmatrix} \qquad (1)$$

is calculated [1,6,11]. It is assumed that $\Delta u(k+p|k) = 0$ for $p \ge N_u$, where $N_u$ is the control horizon. The objective is to minimise differences between the reference trajectory $y^{\mathrm{ref}}(k+p|k)$ and predicted outputs $\hat{y}(k+p|k)$ (i.e. predicted control errors) over the prediction horizon $N$. The MPC optimisation task is (hard output constraints [6,11] are used for simplicity)

$$\min_{\Delta\boldsymbol{u}(k)} \left\{ \sum_{p=1}^{N} \left\| y^{\mathrm{ref}}(k+p|k) - \hat{y}(k+p|k) \right\|^2_{\boldsymbol{M}_p} + \sum_{p=0}^{N_u-1} \left\| \Delta u(k+p|k) \right\|^2_{\boldsymbol{\Lambda}_p} \right\}$$

subject to

$$\begin{aligned} u^{\min} \le u(k+p|k) \le u^{\max}, &\quad p = 0, \ldots, N_u-1 \\ -\Delta u^{\max} \le \Delta u(k+p|k) \le \Delta u^{\max}, &\quad p = 0, \ldots, N_u-1 \\ y^{\min} \le \hat{y}(k+p|k) \le y^{\max}, &\quad p = 1, \ldots, N \end{aligned} \qquad (2)$$

The second part of the cost function penalises excessive control increments. A multivariable process is considered with $n_u$ inputs (manipulated variables) and $n_y$ outputs (controlled variables), i.e. $u(k) \in \mathbb{R}^{n_u}$, $y(k) \in \mathbb{R}^{n_y}$ (consequently, $\Delta u(k+p|k) \in \mathbb{R}^{n_u}$, $y^{\mathrm{ref}}(k+p|k), \hat{y}(k+p|k) \in \mathbb{R}^{n_y}$). $\boldsymbol{M}_p \ge 0$ and $\boldsymbol{\Lambda}_p > 0$ are tuning matrices of dimensionality $n_y \times n_y$ and $n_u \times n_u$, respectively. Vectors $u^{\min}, u^{\max}, \Delta u^{\max} \in \mathbb{R}^{n_u}$, $y^{\min}, y^{\max} \in \mathbb{R}^{n_y}$ define constraints. Although the whole optimal future control policy (1) over the control horizon is calculated,


only its first $n_u$ elements (current control increments) are actually applied to the process, i.e. $u(k) = \Delta u(k|k) + u(k-1)$. At the next sampling instant, $k+1$, output measurements are updated, the prediction is shifted one step forward and the whole procedure is repeated. Predicted values of process outputs, $\hat{y}(k+p|k)$, over the prediction horizon are calculated using a dynamic model of the process.
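As a minimal sketch of the cost function in (2) and of the receding-horizon rule, the following Python fragment evaluates the cost for given trajectories and applies the first control move. It assumes diagonal tuning matrices that are constant over the horizons (as in the simulation section later on); the function names and the list-based data layout are illustrative only.

```python
def weighted_sq_norm(v, weights):
    """||v||^2 under a diagonal weighting matrix given by its diagonal."""
    return sum(w * x * x for w, x in zip(weights, v))

def mpc_cost(y_ref, y_hat, du, m_diag, lam_diag):
    """Cost in (2): tracking errors over N steps plus increment penalties over Nu steps.

    y_ref, y_hat: N lists of length ny; du: Nu lists of length nu.
    m_diag, lam_diag: diagonals of M_p and Lambda_p (held constant over p here).
    """
    tracking = sum(
        weighted_sq_norm([r - y for r, y in zip(yr, yh)], m_diag)
        for yr, yh in zip(y_ref, y_hat)
    )
    effort = sum(weighted_sq_norm(d, lam_diag) for d in du)
    return tracking + effort

def apply_first_move(du, u_prev):
    """Receding horizon: only the first increment is applied, u(k) = du(k|k) + u(k-1)."""
    return [d + u for d, u in zip(du[0], u_prev)]
```

The constraints in (2) are not checked here; a solver would enforce them while minimising this cost.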

3 Multivariable Neural Wiener Models

The structure of the considered neural Wiener model is depicted in Fig. 1. In general, the model consists of a linear dynamic part in series with a nonlinear steady-state part; $x(k) = [x_1(k) \ldots x_{n_y}(k)]^T$ are auxiliary signals. The linear part is described by the following discrete-time difference equation

$$\boldsymbol{A}(q^{-1})x(k) = \boldsymbol{B}(q^{-1})u(k) \qquad (3)$$

where polynomial matrices are

$$\boldsymbol{A}(q^{-1}) = \begin{bmatrix} 1 + a^{1}_{1}q^{-1} + \ldots + a^{1}_{n_A}q^{-n_A} & \ldots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \ldots & 1 + a^{n_y}_{1}q^{-1} + \ldots + a^{n_y}_{n_A}q^{-n_A} \end{bmatrix}$$

$$\boldsymbol{B}(q^{-1}) = \begin{bmatrix} b^{1,1}_{1}q^{-1} + \ldots + b^{1,1}_{n_B}q^{-n_B} & \ldots & b^{1,n_u}_{1}q^{-1} + \ldots + b^{1,n_u}_{n_B}q^{-n_B} \\ \vdots & \ddots & \vdots \\ b^{n_y,1}_{1}q^{-1} + \ldots + b^{n_y,1}_{n_B}q^{-n_B} & \ldots & b^{n_y,n_u}_{1}q^{-1} + \ldots + b^{n_y,n_u}_{n_B}q^{-n_B} \end{bmatrix}$$

The backward shift operator is denoted by $q^{-1}$; integers $n_A$, $n_B$, $\tau$ define the order of dynamics, $\tau \le n_B$. Outputs of the dynamic part are

$$x(k) = \begin{bmatrix} x_1(k) \\ \vdots \\ x_{n_y}(k) \end{bmatrix} = \begin{bmatrix} \sum_{r=1}^{n_u}\sum_{l=1}^{n_B} b^{1,r}_{l}u_r(k-l) - \sum_{l=1}^{n_A} a^{1}_{l}x_1(k-l) \\ \vdots \\ \sum_{r=1}^{n_u}\sum_{l=1}^{n_B} b^{n_y,r}_{l}u_r(k-l) - \sum_{l=1}^{n_A} a^{n_y}_{l}x_{n_y}(k-l) \end{bmatrix} \qquad (4)$$

The nonlinear steady-state part of the model is described by the equation $y(k) = g(x(k))$, where the function $g : \mathbb{R}^{n_y} \to \mathbb{R}^{n_y}$ is represented by $n_y$ MultiLayer Perceptron (MLP) feedforward neural networks with one hidden layer [2]. Each network has one input and one linear output. Consecutive outputs of networks are

$$y(k) = \begin{bmatrix} y_1(k) \\ \vdots \\ y_{n_y}(k) \end{bmatrix} = \begin{bmatrix} w^{2,1}_{0} + \sum_{i=1}^{K^1} w^{2,1}_{i}\varphi\left(w^{1,1}_{i,0} + w^{1,1}_{i,1}x_1(k)\right) \\ \vdots \\ w^{2,n_y}_{0} + \sum_{i=1}^{K^{n_y}} w^{2,n_y}_{i}\varphi\left(w^{1,n_y}_{i,0} + w^{1,n_y}_{i,1}x_{n_y}(k)\right) \end{bmatrix} \qquad (5)$$


Fig. 1. The structure of the multivariable neural Wiener model

where $\varphi : \mathbb{R} \to \mathbb{R}$ is the nonlinear transfer function (e.g. hyperbolic tangent). Weights of consecutive networks ($m = 1, \ldots, n_y$) are denoted by $w^{1,m}_{i,j}$, $i = 1, \ldots, K^m$, $j = 0, 1$ and $w^{2,m}_{i}$, $i = 0, \ldots, K^m$, for the first and the second layers, respectively; $K^m$ is the number of hidden nodes of the $m$th network. The $m$th output of the neural Wiener model can be expressed as a function of inputs and auxiliary signals at previous sampling instants

$$y_m(k) = f_m(u_1(k-1), \ldots, u_1(k-n_B), \ldots, u_{n_u}(k-1), \ldots, u_{n_u}(k-n_B), x_m(k-1), \ldots, x_m(k-n_A))$$

From (4) and (5) one has

$$y_m(k) = w^{2,m}_{0} + \sum_{i=1}^{K^m} w^{2,m}_{i}\,\varphi\left(w^{1,m}_{i,0} + w^{1,m}_{i,1}\left(\sum_{r=1}^{n_u}\sum_{l=1}^{n_B} b^{m,r}_{l}u_r(k-l) - \sum_{l=1}^{n_A} a^{m}_{l}x_m(k-l)\right)\right) \qquad (6)$$

4 Model Predictive Control Based on Multivariable Neural Wiener Models

4.1 Quadratic Programming MPC Optimisation Problem

There are two general problems if one wants to use the neural Wiener model (6) in MPC. First, its outputs depend on auxiliary signals $x_1(k), \ldots, x_{n_y}(k)$, whereas the following Nonlinear Auto Regressive with eXternal input (NARX) model

$$y_m(k) = \tilde{f}_m(u_1(k-1), \ldots, u_1(k-n_B), \ldots, u_{n_u}(k-1), \ldots, u_{n_u}(k-n_B), y_m(k-1), \ldots, y_m(k-n_A))$$

is typically used for prediction in MPC. In the NARX model outputs depend explicitly on previous inputs and outputs. In order to eliminate auxiliary signals the inverse steady-state model

$$x(k) = g^{\mathrm{inv}}(y(k))$$


can be used, $g^{\mathrm{inv}} : \mathbb{R}^{n_y} \to \mathbb{R}^{n_y}$. Such an approach is discussed in [4]. Unfortunately, the class of processes for which the inverse model exists is limited.

The second problem is the nonlinear nature of the model. As a result, output predictions are nonlinear functions of the future control policy (1), which means that the MPC optimisation problem (2) is a nonlinear one. A solution to this problem is successive on-line linearisation of the nonlinear model. The obtained linear approximation is then used for prediction, which yields a quadratic programming MPC task [4,5]. The discussed MPC algorithm uses the technique from [7], in which an MPC algorithm based on fuzzy Wiener models for single-input single-output processes is described. The exact linear approximation of the neural Wiener model is not calculated; instead, the gain of the steady-state part of the model is estimated for the current operating point. For the multivariable process the gain vector is

$$K(k) = \begin{bmatrix} k_1(k) \\ \vdots \\ k_{n_y}(k) \end{bmatrix} = \begin{bmatrix} \dfrac{\partial y_1(k)}{\partial x_1(k)} \\ \vdots \\ \dfrac{\partial y_{n_y}(k)}{\partial x_{n_y}(k)} \end{bmatrix} \qquad (7)$$

The gain vector defines the relation between model outputs and auxiliary signals

$$y(k) = \begin{bmatrix} k_1(k)x_1(k) \\ \vdots \\ k_{n_y}(k)x_{n_y}(k) \end{bmatrix}$$

Using vector-matrix notation, $y(k) = \boldsymbol{K}(k)x(k)$, where $\boldsymbol{K}(k) = \mathrm{diag}(k_1(k), \ldots, k_{n_y}(k))$. Taking into account the discrete-time difference equation (3) which defines the linear part of the model, one obtains a linear approximation of the whole nonlinear Wiener model for the current operating point

$$\boldsymbol{A}(q^{-1})y(k) = \boldsymbol{B}(k, q^{-1})u(k) \qquad (8)$$

where $\boldsymbol{B}(k, q^{-1}) = \boldsymbol{K}(k)\boldsymbol{B}(q^{-1})$. Predictions calculated from the approximate model (8) can be compactly expressed as functions of future control increments (the influence of the past is not shown)

$$\begin{aligned} \hat{y}(k+1|k) &= \boldsymbol{S}_1(k)\Delta u(k|k) + \ldots \\ \hat{y}(k+2|k) &= \boldsymbol{S}_2(k)\Delta u(k|k) + \boldsymbol{S}_1(k)\Delta u(k+1|k) + \ldots \\ \hat{y}(k+3|k) &= \boldsymbol{S}_3(k)\Delta u(k|k) + \boldsymbol{S}_2(k)\Delta u(k+1|k) + \boldsymbol{S}_1(k)\Delta u(k+2|k) + \ldots \\ &\;\;\vdots \end{aligned} \qquad (9)$$

where step-response matrices are

$$\boldsymbol{S}_j(k) = \begin{bmatrix} k_1(k)s^{1,1}_{j} & \ldots & k_1(k)s^{1,n_u}_{j} \\ \vdots & \ddots & \vdots \\ k_{n_y}(k)s^{n_y,1}_{j} & \ldots & k_{n_y}(k)s^{n_y,n_u}_{j} \end{bmatrix} \qquad (10)$$


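for $j = 1, \ldots, N$. Step-response coefficients of the linear part of the model are denoted by $s^{m,n}_{j}$ for all $j = 1, \ldots, N$, $n = 1, \ldots, n_u$, $m = 1, \ldots, n_y$ [11]. Using (9), the output prediction vector can be expressed as a sum of a forced trajectory, which depends only on the future (on future control moves $\Delta\boldsymbol{u}(k)$), and a free trajectory $\boldsymbol{y}^0(k)$, which depends only on the past

$$\hat{\boldsymbol{y}}(k) = \boldsymbol{G}(k)\Delta\boldsymbol{u}(k) + \boldsymbol{y}^0(k) \qquad (11)$$

where

$$\hat{\boldsymbol{y}}(k) = \begin{bmatrix} \hat{y}(k+1|k) \\ \vdots \\ \hat{y}(k+N|k) \end{bmatrix}, \quad \boldsymbol{y}^0(k) = \begin{bmatrix} y^0(k+1|k) \\ \vdots \\ y^0(k+N|k) \end{bmatrix} \qquad (12)$$

are vectors of length $n_y N$. The dynamic matrix $\boldsymbol{G}(k)$ of dimensionality $n_y N \times n_u N_u$ consists of step responses of the approximate linear model (8)

$$\boldsymbol{G}(k) = \begin{bmatrix} \boldsymbol{S}_1(k) & \boldsymbol{0} & \ldots & \boldsymbol{0} \\ \boldsymbol{S}_2(k) & \boldsymbol{S}_1(k) & \ldots & \boldsymbol{0} \\ \vdots & \vdots & \ddots & \vdots \\ \boldsymbol{S}_N(k) & \boldsymbol{S}_{N-1}(k) & \ldots & \boldsymbol{S}_{N-N_u+1}(k) \end{bmatrix} \qquad (13)$$

Because a linear approximation of the original neural Wiener model is used for prediction, i.e. the prediction equation (11) is used, the MPC optimisation problem (2) becomes an easy to solve quadratic programming task

$$\min_{\Delta\boldsymbol{u}(k)} \left\{ \left\| \boldsymbol{y}^{\mathrm{ref}}(k) - \boldsymbol{G}(k)\Delta\boldsymbol{u}(k) - \boldsymbol{y}^0(k) \right\|^2_{\boldsymbol{M}} + \left\| \Delta\boldsymbol{u}(k) \right\|^2_{\boldsymbol{\Lambda}} \right\}$$

subject to

$$\begin{aligned} &\boldsymbol{u}^{\min} \le \boldsymbol{J}\Delta\boldsymbol{u}(k) + \boldsymbol{u}(k-1) \le \boldsymbol{u}^{\max} \\ &-\Delta\boldsymbol{u}^{\max} \le \Delta\boldsymbol{u}(k) \le \Delta\boldsymbol{u}^{\max} \\ &\boldsymbol{y}^{\min} \le \boldsymbol{G}(k)\Delta\boldsymbol{u}(k) + \boldsymbol{y}^0(k) \le \boldsymbol{y}^{\max} \end{aligned} \qquad (14)$$

Definitions of all vectors and matrices are given in [5]. The discussed MPC algorithm is named MPC with Nonlinear Prediction and Approximate Linearisation (MPC-NPAL), in contrast to the MPC-NPL algorithm with an inverse steady-state model and exact linearisation [4].

4.2 Implementation Details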

From (5) one obtains elements of the gain vector (7)

$$\frac{\partial y_m(k)}{\partial x_m(k)} = \sum_{i=1}^{K^m} w^{2,m}_{i}\,\frac{\partial\varphi(z_m(k))}{\partial z_m(k)}\,w^{1,m}_{i,1} \qquad (15)$$

where $m = 1, \ldots, n_y$, $z_m(k) = w^{1,m}_{i,0} + w^{1,m}_{i,1}x_m(k)$. If hyperbolic tangent is used as the nonlinear transfer function $\varphi$, $\dfrac{\partial\varphi(z_m(k))}{\partial z_m(k)} = 1 - \tanh^2(z_m(k))$.
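The gain formula (15) can be checked numerically with a small sketch. The code below evaluates one single-input network from (5) and its analytic gain for a hyperbolic tangent transfer function. The weight values `W1` and `W2` are made-up numbers for a hypothetical network with two hidden nodes, and the function names are illustrative.

```python
import math

def mlp_output(x, w1, w2):
    """Eq. (5) for one network: w2[0] + sum_i w2[i] * tanh(bias_i + weight_i * x)."""
    return w2[0] + sum(
        w2[i + 1] * math.tanh(b + a * x) for i, (b, a) in enumerate(w1)
    )

def mlp_gain(x, w1, w2):
    """Eq. (15) with phi = tanh: sum_i w2[i] * (1 - tanh^2(z_i)) * weight_i."""
    return sum(
        w2[i + 1] * (1.0 - math.tanh(b + a * x) ** 2) * a
        for i, (b, a) in enumerate(w1)
    )

# Hypothetical weights for a network with K = 2 hidden nodes.
W1 = [(0.1, 0.8), (-0.3, 1.5)]   # (bias w_{i,0}, input weight w_{i,1}) per hidden node
W2 = [0.2, 1.0, -0.5]            # output bias w_0 followed by w_1, w_2
```

Comparing `mlp_gain` with a central finite difference of `mlp_output` at the same operating point is a simple way to validate an implementation of (15).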


The nonlinear free trajectory $y^0_m(k+p|k)$, $m = 1, \ldots, n_y$, is calculated on-line recurrently over the prediction horizon (for $p = 1, \ldots, N$) using the neural Wiener model (6). It depends only on the past

$$y^0_m(k+p|k) = w^{2,m}_{0} + \sum_{i=1}^{K^m} w^{2,m}_{i}\,\varphi\Bigg(w^{1,m}_{i,0} + w^{1,m}_{i,1}\Bigg(\sum_{r=1}^{n_u}\Bigg[\sum_{l=1}^{I_{uf}(p)} b^{m,r}_{l}u_r(k-1) + \sum_{l=I_{uf}(p)+1}^{n_B} b^{m,r}_{l}u_r(k-l+p)\Bigg] - \sum_{l=1}^{I_{yp}(p)} a^{m}_{l}x^0_m(k-l+p|k) - \sum_{l=I_{yp}(p)+1}^{n_A} a^{m}_{l}x_m(k-l+p)\Bigg)\Bigg) + d_m(k) \qquad (16)$$

where $I_{uf}(p) = \max(\min(p, n_B), 0)$, $I_{yp}(p) = \min(p-1, n_A)$ and unmeasured disturbances $d_m(k)$ are estimated as differences between measured process outputs and outputs calculated from the neural Wiener model (6)

$$d_m(k) = y_m(k) - y_m(k|k-1)$$

4.3 Algorithm Summary

Steps repeated at each sampling instant $k$ of the MPC-NPAL algorithm are:

1. Approximate linearisation of the neural Wiener model is carried out: the gain vector $K(k)$ for the current operating point is calculated from (15). Step-response coefficients of the linearised model are updated according to (10), and the dynamic matrix $\boldsymbol{G}(k)$ (13) is formed.
2. Elements of the nonlinear free trajectory $\boldsymbol{y}^0(k)$ are calculated from (16) using the neural Wiener model.
3. The future control policy $\Delta\boldsymbol{u}(k)$ is found from the quadratic programming problem (14).
4. The first $n_u$ elements of the calculated vector $\Delta\boldsymbol{u}(k)$ are applied to the process, i.e. $u(k) = \Delta u(k|k) + u(k-1)$.
5. The iteration of the algorithm is increased, i.e. set $k := k + 1$, go to step 1.
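Step 1 of the summary above can be sketched in a few lines of Python. The fragment scales the step-response matrices of the linear part by the current gains, as in (10), and assembles the block lower-triangular dynamic matrix of (13). Nested lists stand in for matrices, and the function and argument names are illustrative assumptions rather than part of the original algorithm description.

```python
def scaled_step_response(s_j, gains):
    """Eq. (10): multiply row m of the linear step-response matrix by k_m(k)."""
    return [[gains[m] * s for s in row] for m, row in enumerate(s_j)]

def dynamic_matrix(step_responses, gains, control_horizon):
    """Eq. (13): block lower-triangular matrix of size (ny*N) x (nu*Nu).

    step_responses[j-1] is the ny x nu matrix of coefficients s_j of the linear part.
    """
    n = len(step_responses)
    ny, nu = len(step_responses[0]), len(step_responses[0][0])
    g = [[0.0] * (nu * control_horizon) for _ in range(ny * n)]
    for row_block in range(n):  # prediction step p = row_block + 1
        for col_block in range(min(row_block + 1, control_horizon)):
            s = scaled_step_response(step_responses[row_block - col_block], gains)
            for m in range(ny):
                for c in range(nu):
                    g[row_block * ny + m][col_block * nu + c] = s[m][c]
    return g
```

The remaining steps (free trajectory, quadratic programming, applying the first move) would build on this matrix; they are left out because they depend on the chosen solver.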

5 Simulation Results

The considered process is a chemical reactor depicted in Fig. 2. It has two inputs: $u_1$, the feed flow rate, and $u_2$, the cooling substance flow rate, and two outputs: $y_1$, the product concentration, and $y_2$, the product temperature. As a simulated process the Wiener system is used. Its linear part is described by matrices [1]

$$\boldsymbol{A}(q^{-1}) = \begin{bmatrix} 1 - 1.862885q^{-1} + 0.866877q^{-2} & 0 \\ 0 & 1 - 1.869508q^{-1} + 0.873715q^{-2} \end{bmatrix}$$

$$\boldsymbol{B}(q^{-1}) = \begin{bmatrix} 0.041951q^{-1} - 0.037959q^{-2} & 0.475812q^{-1} - 0.455851q^{-2} \\ 0.058235q^{-1} - 0.054027q^{-2} & 0.144513q^{-1} - 0.136097q^{-2} \end{bmatrix}$$
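To illustrate how the linear part behaves, the difference equation (4) can be simulated directly with the coefficients given above. This is a minimal sketch with zero initial conditions; the function names and the list layout of the coefficients are illustrative choices.

```python
A_COEFFS = [[-1.862885, 0.866877], [-1.869508, 0.873715]]   # a^m_l, l = 1, 2
B_COEFFS = [                                                # b^{m,r}_l, l = 1, 2
    [[0.041951, -0.037959], [0.475812, -0.455851]],
    [[0.058235, -0.054027], [0.144513, -0.136097]],
]

def simulate_linear_part(u_seq, a=A_COEFFS, b=B_COEFFS):
    """Return x(k) for k = 0..len(u_seq)-1 from Eq. (4), with zero initial conditions."""
    ny, nu = len(a), len(b[0])
    x_hist = []
    for k in range(len(u_seq)):
        x_k = []
        for m in range(ny):
            value = 0.0
            for r in range(nu):
                for l, coeff in enumerate(b[m][r], start=1):
                    if k - l >= 0:
                        value += coeff * u_seq[k - l][r]
            for l, coeff in enumerate(a[m], start=1):
                if k - l >= 0:
                    value -= coeff * x_hist[k - l][m]
            x_k.append(value)
        x_hist.append(x_k)
    return x_hist

def steady_state_gain(m, r, a=A_COEFFS, b=B_COEFFS):
    """B_{m,r}(1) / A_m(1): steady-state gain from input r to auxiliary signal m."""
    return sum(b[m][r]) / (1.0 + sum(a[m]))
```

With these coefficients the steady-state gains come out close to 1 and 5 for the first auxiliary signal and close to 1 and 2 for the second, so a long unit step on one input should settle near the corresponding gain.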


Fig. 2. The reactor

The nonlinear steady-state part of the system is described by functions shown in Fig. 3 (valves with saturation for which inverse functions do not exist). Fig. 3 also shows neural approximations of the steady-state part (two networks with $K^1 = K^2 = 5$ hidden nodes are used). The following MPC algorithms are compared: a) the classical MPC algorithm based on the linear model, b) the discussed MPC-NPAL algorithm based on the neural Wiener model and quadratic programming, c) the MPC-NO algorithm with on-line nonlinear optimisation, which uses the same neural Wiener model. Parameters of all algorithms are: $N = 10$, $N_u = 2$, $\boldsymbol{M}_p = \mathrm{diag}(1, 1)$, $\boldsymbol{\Lambda}_p = \mathrm{diag}(0.15, 0.15)$, $u^{\min}_1 = u^{\min}_2 = -20$, $u^{\max}_1 = u^{\max}_2 = 20$, $\Delta u^{\max}_1 = \Delta u^{\max}_2 = 5$.

Owing to the significantly nonlinear nature of the system, the linear MPC algorithm is unstable, as demonstrated in Fig. 4. Conversely, both nonlinear control strategies are stable and precise, as depicted in Fig. 5. Performance of the MPC-NPAL algorithm with quadratic programming is quite similar to that obtained in the computationally demanding MPC-NO approach with nonlinear optimisation. In terms of the Sum of Squared Errors, SSE = 14.5199 for MPC-NPAL and SSE = 13.5809 for MPC-NO. At the same time, the difference in computational burden is significant: for MPC-NPAL the computational cost is 1.2460 MFLOPS, while for MPC-NO it soars to 14.4477 MFLOPS (almost 11.6 times more).

Fig. 3. Characteristics y1 (k) = g1 (x1 (k)) and y2 (k) = g2 (x2 (k)) of the steady-state part of the process (solid line) and their neural approximations (dashed line)


Fig. 4. Simulation results of the MPC algorithm based on the linear model

Fig. 5. Simulation results: the MPC-NO algorithm with nonlinear optimisation based on the neural Wiener model (solid line) and the MPC-NPAL algorithm with quadratic programming based on the same model (dashed line)

6 Conclusions

In contrast to MPC algorithms in which an exact linearisation of the nonlinear model is used for prediction [4,5], the described MPC-NPAL algorithm is based on approximate linearisation. Nevertheless, control quality is quite good, similar to that obtained in the MPC-NO approach with nonlinear on-line optimisation. Unlike existing MPC approaches based on Wiener models, e.g. [4,9], the inverse model is not used. Hence, the algorithm can be applied when inverse steady-state functions do not exist. Acknowledgement. The work presented in this paper was supported by Polish national budget funds for science.

References

1. Camacho, E.F., Bordons, C.: Model predictive control. Springer, London (1999)
2. Haykin, S.: Neural networks – a comprehensive foundation. Prentice Hall, Englewood Cliffs (1999)
3. Janczak, A.: Identification of nonlinear systems using neural networks and polynomial models: block oriented approach. Springer, London (2004)
4. Ławryńczuk, M.: Computationally efficient nonlinear predictive control based on neural Wiener models. Neurocomputing 74, 401–417 (2010)
5. Ławryńczuk, M.: A family of model predictive control algorithms with artificial neural networks. International Journal of Applied Mathematics and Computer Science 17, 217–232 (2007)
6. Maciejowski, J.M.: Predictive control with constraints. Prentice Hall, Harlow (2002)
7. Marusak, P.: Application of fuzzy Wiener models in efficient MPC algorithms. In: Szczuka, M., Kryszkiewicz, M., Ramanna, S., Jensen, R., Hu, Q. (eds.) RSCTC 2010. LNCS (LNAI), vol. 6086, pp. 669–677. Springer, Heidelberg (2010)
8. Nørgaard, M., Ravn, O., Poulsen, N.K., Hansen, L.K.: Neural networks for modelling and control of dynamic systems. Springer, London (2000)
9. Norquay, S.J., Palazoğlu, A., Romagnoli, J.A.: Model predictive control based on Wiener models. Chemical Engineering Science 53, 75–84 (1998)
10. Qin, S.J., Badgwell, T.A.: A survey of industrial model predictive control technology. Control Engineering Practice 11, 733–764 (2003)
11. Tatjewski, P.: Advanced control of industrial processes, Structures and algorithms. Springer, London (2007)

Methods of Integration of Ensemble of Neural Predictors of Time Series - Comparative Analysis

Stanislaw Osowski1,2 and Krzysztof Siwek1

1 Warsaw University of Technology
2 Military University of Technology
00-661 Warsaw, Poland
{sto,ksiwek}@iem.pw.edu.pl

Abstract. It is a well-known fact that organizing different predictors in an ensemble increases the accuracy of time series prediction. This paper discusses different methods of integrating predictors cooperating in an ensemble. The considered methods include ordinary averaging, weighted averaging, application of principal component analysis to the data, blind source separation, and application of an additional neural predictor as an integrator. The proposed methods are verified on the example of prediction of the 24-hour ahead load pattern in the power system, as well as prediction of environmental pollution for the next day.

Keywords: time series prediction, ensemble of predictors, neural networks.

1 Introduction

Prediction of time series is an important task in everyday life and engineering. Exact forecasting of energy consumption (fuel, gas, electrical energy) for each hour of the next day is an example of its application in engineering. Another example is forecasting of pollution for the next day, which makes it possible to counteract its negative consequences for the health of the inhabitants. There are also many other important examples of time series prediction in our everyday life.

A common practice for obtaining good prediction accuracy is the application of many predictors working on the same input data and organized in the form of an ensemble [3],[11]. Each method of prediction stresses a different aspect of the problem. Assume that each predictor is independent of the others. All of them are burdened by some errors, which may be treated as noise. Combining their individual results makes it possible to compensate some errors and reduce the average level of prediction errors (noise). In this way the final accuracy of prediction of the time series is increased.

The important problem that arises is the integration of these results in an optimal way, leading to the most accurate prediction. This paper discusses different methods of integration of the ensemble. The considered methods include ordinary averaging, weighted averaging, application of principal component analysis, blind source separation, and application of an additional neural predictor working as an integrator. The proposed methods are verified on the example of prediction of the 24-hour ahead load pattern in the power system and prediction of environmental pollution of PM10 for the next day.

A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part I, LNCS 6593, pp. 41–50, 2011. © Springer-Verlag Berlin Heidelberg 2011


2 The Theoretical Basis of Integration of Predictors

Prediction of a time series means predicting the present value x(n), or a set of such values, given past values of the process and other external signals influencing it. The prediction task may be viewed as a form of model building, in the sense that the smaller we make the prediction error in a statistical sense, the better the network serves as a model of the underlying physical process responsible for generating the data. It is well known that neural networks, as universal nonlinear models, are very powerful tools for solving this kind of problem [3],[12]. Different forms of neural solutions are applied in practice. The most often used are: multilayer perceptron (MLP), radial basis function network (RBF), support vector machine in regression mode (SVR), recurrent networks (Elman, Jordan, etc.), self-organizing Kohonen network and neuro-fuzzy networks [1],[4],[5],[6],[11]. The most common practice is to train different neural predictors and then accept the one which guarantees the best prediction results on the validation data set. However, a better solution is to use all trained networks combined in an ensemble and integrate their results into the final prediction. The general scheme of an ensemble of predictors is presented in Fig. 1.

Fig. 1. The general structure of ensemble system for prediction

The important condition for including a predictor in the ensemble is that it operates independently of the others and has a similar level of prediction error. Then the problem of integration of the partial results arises. This problem is well solved for classification tasks [8] but needs some additional study for prediction problems. Here we present and compare some chosen methods of integration of a predictor ensemble.

2.1 The Averaging

In this approach the final forecast is defined as the average of the results produced by all M predictors organized in an ensemble. Two kinds of averaging techniques are used in practice. The simplest one is the ordinary mean of the partial results. In such a case the final prediction vector x̂ of the time series is defined as

$$\hat{x} = \frac{1}{M}\sum_{i=1}^{M} x_i \qquad (1)$$

This formula makes use of the stochastic distribution of the predictive errors. The process of averaging reduces the final error of forecasting. It works quite well if all predictive networks are of comparable accuracy. If this is not the case, the final results may be inferior to those of the best individual predictor. In such a case better results may be obtained by applying weighted averaging, that is, by summing the terms in (1) with different weights following from the estimated accuracy of each predictor. This accuracy may be measured on the basis of the particular predictor's performance on data from the past. The most reliable predictor should be given the highest weight, and the least accurate one the lowest. The forecasted jth term of the time series (j = 1, 2, …, N) can now be defined in the following form

$$\hat{x}_j = \sum_{i=1}^{M} w_j^{(i)} x_j^{(i)} \qquad (2)$$

where the upper index (i) denotes the ith neural predictor. The weights $w_j^{(i)}$ are adjusted individually for each element of the time series and should take into account the accuracy of the ith predictor obtained on the learning data for this particular element of the series.
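Both averaging rules can be sketched in a few lines of Python. The functions `ordinary_average` and `weighted_average` follow (1) and (2) directly. The helper `inverse_error_weights` is a hypothetical weighting rule (weights proportional to the inverse of a past error measure, normalised per element) included only to make the example complete; it is an assumption and not necessarily the relation used by the authors.

```python
def ordinary_average(predictions):
    """Eq. (1): element-wise mean of M prediction vectors of equal length N."""
    m = len(predictions)
    return [sum(column) / m for column in zip(*predictions)]

def weighted_average(predictions, weights):
    """Eq. (2): x_hat_j = sum_i w_j^(i) * x_j^(i), one weight per predictor and element."""
    return [
        sum(w[j] * x[j] for w, x in zip(weights, predictions))
        for j in range(len(predictions[0]))
    ]

def inverse_error_weights(errors):
    """Hypothetical rule: weights proportional to 1/error, normalised per element j."""
    n = len(errors[0])
    totals = [sum(1.0 / e[j] for e in errors) for j in range(n)]
    return [[(1.0 / e[j]) / totals[j] for j in range(n)] for e in errors]
```

When all predictors have the same past error, the weighted average reduces to the ordinary mean, which is a convenient sanity check.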

2.2 Principal Component Analysis

In this solution the weighted voting of the individual predictors is substituted by the linear transformation of the data provided by PCA. PCA is a classical statistical technique for analyzing the covariance structure of multivariate statistical observations, enhancing the most important elements of information [2],[4],[13]. Assuming that N-dimensional input vector x is transformed into K-dimensional output vector z (K

predictors. The size of Rxx is nx × nx. Then we perform the eigenvalue analysis of this matrix, which delivers the eigenvalues λi and the same number of associated eigenvectors wi. The K eigenvectors associated with the largest eigenvalues of this matrix (K
2.3 Blind Source Separation

In this method of integration, the prognosis results (vectors xi) generated by each predictive network for the period used in training form time series that are fed in parallel to the blind source separation (BSS) system [2]. The number of inputs to BSS is equal to M, the number of applied prognosis networks. The BSS system decomposes the original stream of signals of length q, forming the matrix $X \in \mathbb{R}^{M \times q}$ (q is the number of prognosis vectors x used in learning, q = Np), into independent components using the matrix $W \in \mathbb{R}^{M \times M}$. The independent component signals generated by BSS form the matrix Y of M rows and q columns [2]. This is the linear transformation described by Y = WX. Each row of the matrix Y represents an independent component series. Some of these series represent the essential information and some represent the noise. Reconstructing the original time series back into the real prognosis on the basis of the essential independent components only will provide a prognosis deprived of the noise, that is, of presumably better quality. The problem is that we do not know in advance which component is the noise and which represents the useful information. It is possible to solve the problem by reconstructing all combinations of independent components and accepting the one which provides the best prediction results on the learning data. The other approach is to find out which component is of noisy character using statistical tests [11], for example correlation analysis. The signals of noisy channels are then replaced by zeros in the reconstruction phase. The reconstruction of the original data matrix X is done by using the inverse operation, called deflation [2]

$$\hat{X} = W^{-1}\hat{Y} \qquad (4)$$

In this equation $\hat{X}$ denotes the reconstructed time series matrix and $\hat{Y}$ the independent component matrix built from the original matrix Y by zeroing the row or rows corresponding to the noise. In recovering the signals we may try all sensible combinations of independent components, substituting the rejected components (appropriate


rows of Y) by zeros. The combination corresponding to the best result of prediction on the learning data is assumed as the final solution. In the reconstruction phase on the testing data only this combination will be used.

2.4 Application of PCA and Additional Neural Network as an Integrator

In this approach the first step is to concatenate the vectors xi generated by the individual predictors into one larger vector of size MN. The next step is to reduce its size by applying the PCA transformation. As a result we get one vector y of the chosen dimension K<MN. The set of p available low-dimensional vectors y is then used as the training data for the final neural predictor, whose output signals represent the final forecast of the time series. To get high-quality prediction results we have to use a predictor of the highest possible accuracy. In our experience, MLP and SVR are among the best predictors [3],[12]. The general scheme of this type of integration is presented in Fig. 2. As the learning data for training the final predictor we use the pairs (yi, ti) for i=1, 2, ..., p. Vectors yi result from the PCA analysis and ti are the known time series patterns that were also used in training the individual neural predictors in the first stage of our approach.

Fig. 2. The diagram of the proposed 2-stage integration system

3 Forecasting the 24-Hour Load Pattern in Power System

The first example is devoted to forecasting the 24-hour load pattern in the Polish Power System. The numerical experiments have been performed for the data of three years (over 26280 hours). The same data set has been used in learning and testing each individual predictor. To get objective results we have applied a cross-validation approach, in which 2/3 of the data has been used in learning and the other 1/3 in testing. The experiments of learning and testing have been repeated 10 times with random selection of the training and testing sets. The data samples have been normalized by dividing the real load by the mean value of the data base of the Polish

46

S. Osowski and K. Siwek

Power System of 3 years taking part in the experiments. All experiments have been performed on Matlab platform [10]. 3.1 The Results of Individual Predictors

In the experiments we have applied 4 types of individual predictors: MLP, SVR, Elman network (EN) and the self-organizing approach (SO). To represent the generally unknown function of the next-day load pattern, we map the past loads of the system into the forecasted load at the dth day and hth hour. Our supervised model of the load was assumed in the following mathematical form [11]

$$\hat{P}(d,h) = f\big(w, t, s, P(d,h-1), \ldots, P(d,h-H), P(d-1,h), \ldots, P(d-D,h-H)\big) \qquad (5)$$

where w represents the vector of parameters of the model, H and D the number of past hours and days, respectively, influencing the prediction process, t the type of the day (workday or holiday) and s the season of the year (autumn, winter, spring or summer). The value $\hat{P}(d,h)$ represents the predicted load and the values P(d−i, h−j), written without the hat, the known values of the load from the past. The detailed description of these methods is given in [11]. Table 1 presents the statistical results on the testing data in the form of mean values and standard deviations of the mean absolute percentage error (MAPE) and the maximum percentage error (MAXPE).

Table 1. The testing errors of the load forecasting for 8760 hours of the Polish Power System by using individual predictors (mean value±std) obtained at 10-fold cross validation approach

Predictor   MAPE [%]    MAXPE [%]
MLP         2.06±0.15   16.95
SVR         2.24±0.11   28.30
EN          2.26±0.09   24.97
SO          2.38±0.04   18.12
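
The two error measures of Table 1 can be illustrated with a minimal sketch. It assumes the usual textbook definitions (absolute error divided by the actual value, in percent); the paper does not spell out its formulas, and the sample numbers below are hypothetical.

```python
def percentage_errors(actual, predicted):
    """Absolute percentage error of every sample: |y - y_hat| / |y| * 100."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return [abs(a - p) / abs(a) * 100.0 for a, p in zip(actual, predicted)]


def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    errors = percentage_errors(actual, predicted)
    return sum(errors) / len(errors)


def maxpe(actual, predicted):
    """Maximum percentage error, in percent."""
    return max(percentage_errors(actual, predicted))


# Hypothetical hourly loads and forecasts, for illustration only.
actual_load = [100.0, 200.0, 400.0]
forecast = [110.0, 190.0, 400.0]
print(mape(actual_load, forecast), maxpe(actual_load, forecast))
```

MAPE summarizes the typical error, while MAXPE exposes the single worst hour, which is why two predictors with similar MAPE can differ widely in MAXPE.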

As can be seen, the MAPE results of the individual neural predictors are at a comparable level, although the maximum errors differ considerably. This provides a good basis for integrating them into one final forecast in an ensemble.

3.2 The Results of Application of an Ensemble

Integration of the obtained results has been performed by applying all the methods presented above: ordinary averaging, weighted averaging, application of PCA, BSS, and an additional neural network as an integrator. The weights in the weighted averaging approach have been determined on the basis of the prediction accuracy of each individual predictor, calculated separately for each particular hour. The weight $w_j^{(i)}$, representing the weighting coefficient of the ith predictor at the jth hour, has been calculated using the relation

$$w_j^{(i)} = \frac{\eta_j^{(i)}}{\sum_{k=1}^{M} \eta_j^{(k)}} \qquad (6)$$

where η represents the accuracy of the appropriate predictor, η=1-MAPE. In the PCA method of integration we have tried different numbers K of principal components. The best results have been obtained at K=11. The final forecasted vector was taken as the mean of the reconstructed data, each corresponding to a particular channel. The BSS integration has been tried with different numbers of independent components used in reconstruction. The best results have been obtained with 3 independent components. In the application of an additional neural network as an integrator we have tried 2 final predictors: SVR and MLP. In the first stage of transformation we used 9 principal components, which have been applied as the input signals to the final predictor. Table 2 compares the results (MAPE and MAXPE) of the discussed methods of integration. It is evident that every form of integration improves the MAPE with respect to the best individual predictor. The highest improvement has been obtained with the combination of PCA and an additional neural predictor, with SVR as the final predictor.

Table 2. The comparison of results of 24-hour power pattern forecasting at application of different methods of integration

Integration method   MAPE [%]    MAXPE [%]
Ordinary mean        1.89±0.09   16.98
Weighted average     1.86±0.08   16.97
PCA                  1.84±0.08   16.23
BSS                  1.77±0.07   16.28
MLP integration      1.51±0.09   14.29
SVR integration      1.38±0.05   10.69

The average MAPE=2.06% of the best individual predictor (MLP) was reduced to 1.38%. This means that the relative MAPE improvement with respect to the best individual predictor (MLP) is about 33%. A great improvement has also been observed for the maximum percentage error, where the best result of an individual predictor (16.95% of MLP) was reduced to 10.69% (SVR integration).
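
The weighted averaging of Eq. (6), with η = 1 − MAPE, can be sketched for a single hour j as follows. This is only an illustration: the function names are hypothetical, MAPE is assumed to be expressed as a fraction rather than in percent, and the overall MAPE values of Table 1 stand in for the per-hour accuracies that the paper actually uses.

```python
def accuracy_weights(mape_by_predictor):
    """Eq. (6) for one hour: w_i = eta_i / sum_k eta_k, with eta = 1 - MAPE."""
    etas = [1.0 - m for m in mape_by_predictor]
    total = sum(etas)
    return [eta / total for eta in etas]


def weighted_forecast(forecasts, weights):
    """Weighted average of the M individual forecasts for one hour."""
    return sum(w * f for w, f in zip(weights, forecasts))


# Assumed inputs: MAPE as fractions for MLP, SVR, EN, SO (overall values
# of Table 1 used as stand-ins for per-hour values).
mapes = [0.0206, 0.0224, 0.0226, 0.0238]
weights = accuracy_weights(mapes)

# Hypothetical normalized load forecasts of the four predictors at one hour.
forecasts = [0.98, 1.01, 1.00, 1.03]
print(weights)
print(weighted_forecast(forecasts, weights))
```

Because the individual MAPE values are all small and close to each other, the resulting weights are nearly uniform, which is consistent with the weighted average in Table 2 being only slightly better than the ordinary mean.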

4 Forecasting PM10 Pollution for the Next Day

The next example illustrating the usefulness of the ensemble methods in improving the accuracy of forecasting concerns the prediction of the next-day mean concentration of the environmental pollution formed by particulate matter of diameter up to 10μm (PM10). The prediction is done on the basis of the actual measured values of the environmental parameters, such as: $w_x$ – the x component of the wind vector, $w_y$ – the y component of the wind vector, t – temperature, h – humidity, r – the type of the day and s – the season of the year, and the known pollution of the previous days [5],[7],[14]. Denoting by w the vector of adjusted parameters of the predictive model, the general supervised model of PM10 prediction for the dth day is described by [14]

$$\hat{P}(d) = f\big(w, w_x, w_y, t, h, r, s, P(d-1)\big) \qquad (7)$$

The symbol $\hat{P}(d)$ represents the predicted PM10 pollution and P(d−1), written without the hat, the known value of the pollution of the previous day. In solving this problem we have applied different structures of neural predictors combined with the wavelet transformation. The time series under prediction is first decomposed by the wavelet transform, and the prediction process is applied to the wavelet coefficients on different levels. The detailed description of this approach is given in [14]. Similarly to the previous application, we have trained different neural predictors responsible for forecasting the pollution of the next day. Four neural-type predictors have been applied: SVR, MLP, RBF and EN. The numerical experiments have been performed on the meteorological data of the last 3 years, measured in the Ursynow suburb of Warsaw. The data have been normalized and pre-processed using the wavelet transformation according to the procedure presented in [14]. Two thirds of the data have been used for learning and one third left for testing purposes only, in the cross validation approach. The learning and testing experiments have been repeated 10 times with random selection of the training and testing sets. The number of hidden neurons in the MLP, RBF and EN networks as well as the hyperparameters of SVR have been adjusted using validation data extracted from the learning set (20% of the learning data). The results of the individual predictors, in the form of the most important quality measures used in environmental engineering: mean absolute percentage error (MAPE), root mean squared error (RMSE) and correlation coefficient R [11], are presented in Table 3. As can be seen, all applied prediction methods have generated results of similar accuracy, with SVR being the best with respect to MAPE.

Table 3. The testing errors of PM10 prediction by using individual predictors (mean value±std) as a result of 10-fold cross validation experiments

Predictor   MAPE [%]     RMSE [μg/m3]   R
MLP         16.57±0.18   6.18±0.14      0.917±0.0014
RBF         16.46±0.58   6.19±0.24      0.922±0.0014
SVR         16.23±0.19   6.22±0.18      0.909±0.0016
EN          17.34±0.53   7.61±0.15      0.923±0.0035
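
The data partitioning described above (two thirds for learning, one third for testing, 10 random repetitions, and 20% of the learning data set aside for validation) might be organized as in the following sketch. The function name, the seed, and the rounding of the subset sizes are assumptions made for illustration; the paper does not describe its implementation.

```python
import random


def repeated_splits(n_samples, n_repeats=10, train_fraction=2 / 3,
                    val_fraction=0.2, seed=0):
    """Yield (train, validation, test) index lists for each random repetition."""
    rng = random.Random(seed)                 # fixed seed: reproducible splits
    n_learn = round(n_samples * train_fraction)
    n_val = round(n_learn * val_fraction)     # validation part of the learning set
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        learn, test = indices[:n_learn], indices[n_learn:]
        validation, train = learn[:n_val], learn[n_val:]
        yield train, validation, test


for train, validation, test in repeated_splits(30, n_repeats=2):
    print(len(train), len(validation), len(test))
```

The test indices never overlap with the learning indices, so hyperparameters tuned on the validation part do not see the data used for the reported errors.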

The next step is the integration of their results in the ensemble. All methods discussed in the previous section have been tried. The results of these experiments in the form of different measures of quality of prediction (MAPE, RMSE and R) are presented in Table 4.


Table 4. The comparison of results of different methods of integration of predictors – the 10-fold cross validation approach

Integration method   MAPE [%]      RMSE [μg/m3]   R
Ordinary mean        16.64±0.28    6.51±0.21      0.934±0.0056
Weighted average     16.29±0.22    6.23±0.22      0.935±0.0049
PCA (4->3)           14.73±0.19    5.77±0.29      0.924±0.0014
BSS                  14.26±0.099   5.68±0.19      0.928±0.0012
SVR integration      15.17±0.045   6.18±0.22      0.933±0.0071
MLP integration      15.28±0.028   6.15±0.16      0.919±0.0021

This time the best integration method with respect to MAPE was the BSS integrator. Good results have also been observed for the PCA application with the vector dimension reduced from the original value of 4 to 3.

5 Conclusions

The paper has investigated different methods of integrating neural predictors organized in the form of an ensemble. It was shown that the application of many independent predictors cooperating in the ensemble is profitable and improves the accuracy of the final forecast. An important advantage of the proposed neural-based approaches to time series prediction is that they do not require very exhaustive information about the time series under investigation and that they are able to capture nonlinear relationships between very different predictor variables. These facts and the good quality of the results make them very attractive in predictive applications.

Acknowledgment

This research activity was financed by the Polish Ministry of Science and Higher Education as the research grant within the years 2010-2012.

References

1. Chen, B.J., Chang, M.W., Lin, C.J.: Load forecasting using support vector machines: a study on EUNITE competition. IEEE Trans. Power Systems 19, 1245–1248 (2004)
2. Cichocki, A., Amari, S.I.: Adaptive blind signal and image processing. Wiley, N.Y (2003)
3. Haykin, S.: Neural networks, a comprehensive foundation. Macmillan, N.Y (2002)
4. Hong, W.C.: Hybrid evolutionary algorithms in a SVR-based electric load forecasting model. Intern. Journal of Electrical Power & Energy Systems 31, 409–417 (2009)
5. Hooyberghs, J., Mensink, C., Dumont, G., Fierens, F., Brasseur, O.: A neural network forecast for daily average PM10 concentrations in Belgium. Atmospheric Environ. 39(18), 3279–3289 (2005)
6. Kandil, N., Wamkeue, R., Saad, M., Georges, S.: An efficient approach for short term load forecasting using ANN. Electr. Power & Energy Systems 28, 525–530 (2006)
7. Kukkonen, T., et al.: Extensive evaluation of neural networks models for the prediction of NO2 and PM10 concentrations, in central Helsinki. Atmospheric Environment 37, 4539–4550 (2003)
8. Kuncheva, L.: Combining pattern classifiers - methods and algorithms. Wiley, N.J (2004)
9. Mandal, P., Senjyu, T., Urasaki, N., Funabashi, T.: A neural network based several hours ahead electric load forecasting using similar days approach. Electrical Power and Energy Systems 28, 367–373 (2006)
10. Matlab user manual, user’s guide, MathWorks, Natick (2009)
11. Osowski, S., Siwek, K., Szupiluk, R.: Ensemble neural network approach for accurate load forecasting in the power system. Applied Math. & Computer Sci. 19, 303–315 (2009)
12. Schölkopf, B., Smola, A.: Learning with kernels. MIT Press, Cambridge (2002)
13. Siwek, K., Osowski, S.: Two-stage neural network approach to precise 24-hour load pattern prediction. In: Corchado, E., Wu, X., Oja, E., Herrero, Á., Baruque, B. (eds.) HAIS 2009. LNCS (LNAI), vol. 5572, pp. 327–335. Springer, Heidelberg (2009)
14. Siwek, K., Osowski, S., Sowiński, M.: Neural predictor ensemble for accurate forecasting of PM10 pollution. In: IJCNN, Barcelona, pp. 1–7 (2010)
15. Voukantsis, D., Niska, H., Karatzas, K., Riga, M., Damialis, A., Vokou, D.: Forecasting daily pollen concentrations using data-driven modeling methods. Atmospheric Environment 44(39), 5101–5111 (2010)

A Rejection Option for the Multilayer Perceptron Using Hyperplanes

Eduardo Gasca A.¹, Sergio Saldaña T.¹, José S. Sánchez G.², Valentín Velásquez G.¹, Eréndira Rendón L.¹, Itzel M. Abundez B.¹, Rosa M. Valdovinos R.², and Rafael Cruz R.¹

¹ Technological Institute of Toluca, State of Mexico, Mexico
² University Jaume I, Castello, Spain
[email protected]

Abstract. Currently, a growing number of Artificial Intelligence tasks demand high efficiency from classification systems (classifiers); making an error in the classification of an object or event can cause serious problems. This is especially worrying when the classifiers confront tasks in which the classes are not linearly separable, where classifier efficiency diminishes considerably. One way of reducing this problem is the Rejection Option. In several circumstances it is advantageous to withhold a decision and wait to obtain additional information instead of making an error. This work describes a novel reject procedure whose purpose is to identify elements with a high risk of being misclassified, such as those in an overlap zone. For this, the location of the object in evaluation is calculated with regard to two hyperplanes that emulate the classifier's decision boundary. The area between these hyperplanes is named an overlap region. If the element is located in this area, it is rejected. Experiments conducted with the Multilayer Perceptron artificial neural network, trained with the Backpropagation algorithm, show that between 12.0% and 91.4% of the objects in question would have been misclassified if they had not been rejected.

Keywords: Reject option, Multilayer Perceptron, Backpropagation, hyperplane, overlap.

A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part I, LNCS 6593, pp. 51–60, 2011. © Springer-Verlag Berlin Heidelberg 2011

1 Introduction

In pattern recognition systems a common task is to categorize an unknown object (pattern) as an element of a class that belongs to a previously defined finite set. In these systems, the procedure responsible for assigning the class label (the classifier) always makes a decision. The classifier's efficiency is characterized mainly by its classification accuracy. Among such classifiers are Artificial Neural Networks, which generate decision boundaries using hyperplanes to separate classes in the observation space. Under this scheme, the patterns with a high probability of being incorrectly labeled are usually of one of two types: ambiguous data, which create confusion among different classes because of their position in an overlap zone; or outliers, patterns that do not belong to any class included in the initial group.

Currently a growing number of real applications require a higher reliability level from classifiers, mainly in tasks where making an error can be very expensive. These applications need systems with a classification error as low as possible, which can be impeded by the presence of ambiguous patterns and/or outliers, among other issues. The number of such objects strongly influences the classifier's efficiency, even when its design is appropriate. One way of decreasing the negative effects of ambiguous patterns and/or outliers is to implement the Rejection Option (RO) procedure. The reject concept admits the classifier's inability to formulate a correct decision in the given circumstances: it is preferable to postpone the classification of the pattern and wait for more information than to risk making a mistake. This paper presents a novel procedure for implementing the RO in an artificial neural network, the Multilayer Perceptron, trained with the Backpropagation algorithm.

The remaining sections are organized as follows: Section 2 describes the Multilayer Perceptron; research on the RO is reviewed in Section 3; the novel reject procedure is presented in Section 4; Section 5 explains the experimental development; and the conclusions are given in Section 6.

Fig. 1. Topology of the Multilayer Perceptron of three layers

2 Multilayer Perceptron

Figure 1 shows the structure of the Multilayer Perceptron (MLP) with one hidden layer. The insertion of this layer (there can be one or more) gives the network the capacity to confront tasks where the classes are not linearly separable. Different papers ([3], [4] and [5]) have mentioned the advantages of using Perceptrons of three layers (TLP), with an input layer, a hidden layer, and an output layer, instead of Perceptrons with a larger number of layers. This is due to their smaller computational burden and their capacity to approximate, as closely as desired, any relationship between the input patterns and their classes [14] and, moreover, to their ability to separate non-convex and disconnected areas in the observation space [7]. Also, if the activation function of the hidden layer units (nodes) is sigmoid, these units are able to generate a basis in which the network outputs are located [10]. There are different algorithms for training the MLP. One of them is Backpropagation (BP), whose goal is to determine the values of the inner parameters (the weights, i.e., the connection values between units of different layers). For this, the mean squared error between the network output and its desired value is minimized.
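
As a rough illustration of the three-layer structure described above, the forward pass of such a network can be sketched as follows. The sketch assumes sigmoid units in both the hidden and the output layer and uses hypothetical, untrained weights; training by Backpropagation is not shown.

```python
import math


def sigmoid(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs, weights, biases):
    """One fully connected layer: each row of `weights` feeds one unit."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def tlp_forward(x, w_hidden, b_hidden, w_out, b_out):
    """Input layer -> hidden layer -> output layer (one value per class)."""
    hidden = layer(x, w_hidden, b_hidden)
    return layer(hidden, w_out, b_out)


# Hypothetical network: 2 features, 3 hidden nodes, 2 classes.
w_hidden = [[0.5, -0.4], [0.3, 0.8], [-0.6, 0.1]]
b_hidden = [0.0, 0.1, -0.1]
w_out = [[1.0, -1.0, 0.5], [-1.0, 1.0, -0.5]]
b_out = [0.0, 0.0]
print(tlp_forward([0.2, 0.7], w_hidden, b_hidden, w_out, b_out))
```

Each hidden unit computes a sigmoid of a linear function of the inputs, so its decision surface is a hyperplane in the observation space.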

3 Rejection Option

Some efforts have been made to endow artificial neural networks with an RO. For the MLP, these efforts have focused on the analysis of the responses produced by the units in the output layer. The class label is allocated by the MLP through the winner-take-all rule: the node with the highest value (winning node) determines the class. However, this rule generates uncertainty when the winner has a low value, or when more than one unit has a high value. Under this scheme, in [2] and [12] a cost function is defined to evaluate the classifier performance with respect to the proportions of correctly classified, misclassified, and rejected patterns. This is supplemented with an analysis of the network output values obtained when classifying the training sample. As a result of this analysis two thresholds are determined, with which it is possible to identify the outliers and/or ambiguous patterns. On the other hand, Singh and Markou [11] include a reject class during the network training. First, they prepare the training sample by separating the elements whose distance from the centroid of their class is bigger than a threshold. The centroid represents all the patterns of the class. Next, they create atypical artificial elements by generating random values around the average values, with the standard deviation, of each feature of the separated elements. The artificial elements are incorporated into the training sample and labeled with the reject class label. A different rejection procedure was presented by Tsujitani and Koshimizu [13]; they quantified the influence of the atypical elements on the discrepancy between the desired values and the output of the MLP. For this, they used a residual analysis technique based on a classical regression model.
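
The two sources of uncertainty in the winner-take-all rule (a low winning value, or two units with similarly high values) suggest a simple two-threshold rejection test, sketched below. This is an illustration of the general idea only, not the method of [2] or [12]: the function name and the threshold values are hypothetical, whereas in those works the thresholds are derived from a cost function.

```python
def winner_take_all_with_reject(outputs, min_winner=0.5, min_margin=0.2):
    """Return the index of the winning output unit, or None to reject.

    min_winner: hypothetical threshold on the winning value (low winner,
                e.g. a possible outlier).
    min_margin: hypothetical threshold on the gap between the two highest
                values (ambiguous pattern).
    """
    ranked = sorted(range(len(outputs)), key=lambda i: outputs[i], reverse=True)
    winner, runner_up = ranked[0], ranked[1]
    if outputs[winner] < min_winner:
        return None                      # winner too weak
    if outputs[winner] - outputs[runner_up] < min_margin:
        return None                      # two classes compete
    return winner


print(winner_take_all_with_reject([0.90, 0.10, 0.05]))  # clear winner
print(winner_take_all_with_reject([0.30, 0.20, 0.10]))  # low winner
print(winner_take_all_with_reject([0.80, 0.75, 0.10]))  # ambiguous
```

Such tests look only at the network outputs, whereas the procedure of the next section also uses the position of the pattern in the observation space.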

4 Rejection Option Using Hyperplanes

This section presents a novel procedure for implementing the Rejection Option in the Multilayer Perceptron trained with the Backpropagation algorithm. During its execution, the validity of the class label allocated by the neural network is verified by considering the position the pattern occupies in the observation space. One interpretation of the BP operation is that the network allocates the class label by identifying the area in which the pattern is located.


When the network is trained, the observation space is notionally partitioned into separate areas, one for each class [6]. Clearly, the classes whose areas are close to the pattern have a higher probability of providing the class label. Therefore, if the two classes nearest to the pattern can be determined from the correctly classified patterns of the training sample, we can detect the classes with the highest probability of being chosen by the network to allocate the label, regardless of the classification result. Patterns located in the overlap zone between classes have a high probability of being misclassified and are consequently candidates for rejection. For this reason, the novel RO procedure includes a strategy to identify when a pattern lies in the approximation of the overlap area. The algorithm operates as follows:

1. Perform the BP training and identify, for each class, the patterns that BP classified correctly.
2. Determine the class label for the pattern in evaluation.
3. Calculate the k nearest neighbors of the pattern in evaluation among the correctly classified patterns of every class.
4. Determine the two classes nearest to the pattern in evaluation, using the average of the distances between this element and each of its k nearest neighbors.
5. Determine the two hyperplanes for the two nearest classes. These simulate the decision boundary of the classifier.
6. Apply the reject criteria. The class label given by BP is not allocated to the pattern in evaluation if:
   a) the class allocated by the network differs from both of the two nearest classes;
   b) the pattern is located between the hyperplanes, i.e., in the approximation of the overlap area;
   c) the pattern is located in a region occupied by a class with a different label.

The following section describes the calculation of the hyperplanes for the nearest classes.

4.1 Hyperplanes Calculation
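
The hyperplanes are fitted for the two classes nearest to the pattern in evaluation, which are selected in steps 3 and 4 of the algorithm. A minimal sketch of that selection is given below. It assumes the Euclidean distance (the paper does not name the metric) and a hypothetical data layout in which the correctly classified training patterns are grouped by class label; classes with fewer than k such patterns are skipped.

```python
import math


def nearest_two_classes(pattern, correct_by_class, k=2):
    """Return the labels of the two classes nearest to `pattern`.

    correct_by_class: dict mapping a class label to the list of training
    patterns of that class that the network classified correctly.
    The distance to a class is the mean distance to its k nearest patterns.
    """
    mean_distance = {}
    for label, patterns in correct_by_class.items():
        if len(patterns) < k:
            continue                      # not enough neighbors in this class
        distances = sorted(math.dist(pattern, p) for p in patterns)
        mean_distance[label] = sum(distances[:k]) / k
    if len(mean_distance) < 2:
        raise ValueError("fewer than two classes have k correctly classified patterns")
    ranked = sorted(mean_distance, key=mean_distance.get)
    return ranked[0], ranked[1]


# Hypothetical two-dimensional data with three classes.
correct = {
    "A": [(0.0, 0.0), (0.1, 0.0)],
    "B": [(1.0, 1.0), (1.0, 0.9)],
    "C": [(5.0, 5.0), (5.0, 6.0)],
}
print(nearest_two_classes((0.4, 0.4), correct, k=2))
```

When too few correctly classified patterns are available, the nearest classes cannot be determined, which corresponds to the situation reported for some folds in the experiments.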

To obtain an approximation of the decision border generated by BP, a procedure based on the calculation of two hyperplanes was designed. Hyperplanes are used as the separation surface because of their simplicity and because BP uses them to divide the observation space through the hidden layer nodes. There are different methods for constructing a function that closely fits a data set [1], [9]. One approach is multiple linear regression and, within it, the least squares method is one of the most widely used. Given a set of data points {p} in an N-dimensional space

$$p = (x_1, x_2, \ldots, x_N) \qquad (1)$$

and the hyperplane equation

$$a_0 + \sum_{i=1}^{N} a_i x_i = 0 \qquad (2)$$

the least squares method determines the parameters by minimizing the mean square error between the data points and the equation of the hyperplane; a system of simultaneous equations is generated, whose solutions are the values of the parameters. The main difficulty in building the system of equations lies in finding as many non-collinear data points as the dimension of the observation space. In other words, for each class we need a number of patterns at least equal to the number of attributes. Under these circumstances, the answer is to generate the required patterns artificially. As we are interested in delimiting the area closest to the pattern in evaluation, the two nearest neighbors (of each of the two nearest classes) are used to calculate the artificial patterns. Starting from the detection of the two nearest classes and their two nearest neighbors, the procedure is as follows:

1. For each feature of the pattern, determine the maximum and minimum values over the two nearest patterns with the same class label.
2. Generate as many random numbers as there are features. Each number should lie between the minimum and maximum values found in the previous step for the corresponding feature. The set of random numbers forms the artificial pattern. The pseudorandom number generator is initialized with a constant seed.
3. Process the created pattern with the Backpropagation network. If the allocated label coincides with the class of the two nearest patterns, the artificial pattern is retained; if any other label is allocated, the pattern is discarded.
4. Repeat steps 1-3 until the required number of artificial patterns is obtained.
5. Calculate the hyperplane with the complete set of patterns and Eq. (2).

With the hyperplanes calculated, we simulate the decision boundary generated by BP.
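
Steps 1 to 4 of the procedure above can be sketched as follows. The trained network is represented by a hypothetical `classify` callable, and the `max_tries` guard is an addition of this sketch (the paper does not mention one); the hyperplane fit of step 5 is not shown.

```python
import random


def generate_artificial_patterns(neighbor_a, neighbor_b, label, classify,
                                 n_required, seed=0, max_tries=10000):
    """Create artificial patterns inside the box spanned by two neighbors.

    neighbor_a, neighbor_b: the two nearest patterns of the class `label`.
    classify: hypothetical stand-in for the trained network; maps a pattern
              to a class label.
    """
    rng = random.Random(seed)                       # constant seed, as in step 2
    lows = [min(a, b) for a, b in zip(neighbor_a, neighbor_b)]     # step 1
    highs = [max(a, b) for a, b in zip(neighbor_a, neighbor_b)]
    kept = []
    for _ in range(max_tries):
        if len(kept) == n_required:                 # step 4: stop when enough
            break
        candidate = [rng.uniform(lo, hi) for lo, hi in zip(lows, highs)]  # step 2
        if classify(candidate) == label:            # step 3: keep only same label
            kept.append(candidate)
    return kept


# Hypothetical classifier with a linear boundary x1 + x2 = 1.
def toy_classifier(p):
    return "A" if p[0] + p[1] < 1.0 else "B"


patterns = generate_artificial_patterns((0.1, 0.2), (0.6, 0.7), "A",
                                        toy_classifier, n_required=3)
print(patterns)
```

Because the generator is seeded with a constant, repeated runs produce the same artificial patterns and therefore the same hyperplanes.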

Fig. 2. Hyperplanes generated by using the two nearest neighbors of each of the two classes. Area 1 represents the zone of overlap. Area 2 fulfills the conditions of an overlap zone as well, but is created by the crossing of the hyperplanes and therefore is not considered an overlap zone.

Table 1. The training parameters of the MLP

Learning rate         = 0.9, constant during the training
Momentum              = 0.7, constant during the training
Number of iterations  = 30,000 in all the experiments
Hidden layers         = 1 in all the experiments
Input nodes           = number of features, for all the databases
Hidden nodes          = number of features + 1, for all the databases
Output nodes          = number of classes, for all the databases
Activation function   = sigmoid in all the experiments

It is logical to suppose that the region between the two hyperplanes contains an overlap area; however, since in general these hyperplanes are not parallel, their intersection creates two areas that could be viewed as overlap zones. To avoid confusion, the overlap zone is defined as the space demarcated by the four neighbors (Fig. 2). Implicitly, this position gives information about the proximity of the pattern in evaluation to the four nearest neighbors. We should stress that the proper operation of the RO procedure relies on obtaining real local information about the position of the pattern with respect to the decision boundary generated by the network, which is why the correctly classified patterns are used to calculate the hyperplanes. If the intersection lies between the pattern in evaluation and the two nearest neighbors, the area delimited by the two hyperplanes (where the pattern is located) cannot be considered a region of overlap because it is created by the crossing of the hyperplanes (Fig. 2). Consequently, the pattern is relatively far away from the nearest neighbors and may lie in the exclusive area of a class. To include the intersection criterion we did the following: first, the point of the nearest hyperplane that is closest to the pattern in evaluation is calculated. Next, the locations of this point and of the neighbors (that generated the nearest hyperplane) are analyzed. If they are located on opposite sides of the hyperplane, the pattern in evaluation is located in a false overlap region. Lastly, the sample in the overlap zone is rejected if the label of the nearest hyperplane is of a different class.
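
One possible reading of the "located between the hyperplanes" criterion is sketched below. It is an interpretation, not the authors' implementation: a pattern is taken to be between the hyperplanes when, relative to the hyperplane of each class, it lies on the side facing a reference point of the other class (for example, one of that class's nearest neighbors). The intersection criterion is not included, and all names are hypothetical.

```python
def side(coeffs, point):
    """Sign of a0 + sum(a_i * x_i) from Eq. (2): +1, -1, or 0 on the hyperplane.

    coeffs = (a0, a1, ..., aN)
    """
    value = coeffs[0] + sum(a * x for a, x in zip(coeffs[1:], point))
    return (value > 0) - (value < 0)


def between_hyperplanes(pattern, plane_a, plane_b, ref_a, ref_b):
    """True if `pattern` lies between the hyperplanes of classes A and B.

    ref_a, ref_b: reference points of classes A and B (assumed here to be
    one nearest neighbor of each class).
    """
    s_a = side(plane_a, pattern)
    s_b = side(plane_b, pattern)
    faces_b = s_a != 0 and s_a == side(plane_a, ref_b)   # beyond plane A, toward B
    faces_a = s_b != 0 and s_b == side(plane_b, ref_a)   # beyond plane B, toward A
    return faces_b and faces_a


# Hypothetical 2-D example: hyperplane A is x1 = 1, hyperplane B is x1 = 2.
plane_a = (-1.0, 1.0, 0.0)
plane_b = (-2.0, 1.0, 0.0)
ref_a, ref_b = (1.0, 0.5), (2.0, 0.5)
print(between_hyperplanes((1.5, 0.3), plane_a, plane_b, ref_a, ref_b))
print(between_hyperplanes((0.5, 0.3), plane_a, plane_b, ref_a, ref_b))
```

With parallel hyperplanes this test is unambiguous; for crossing hyperplanes it also accepts the second area of Fig. 2, which is why the paper adds the intersection criterion.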

5 Experiments and Results

In all the experiments with the novel Rejection Option procedure, a Multilayer Perceptron network trained with the Backpropagation algorithm was used. The generalization power of Backpropagation was estimated using 10-fold cross validation. The applied parameters are shown in Table 1. The databases were taken from the repository of the University of California at Irvine [8], and are known as: Breast Cancer (Can), Glass (Gla), Image Segmentation (Ima), Ionosphere (Ion), Iris, Liver Disorders (Liv), Pima, Sonar (Son), Vehicle (Veh), Vowel (Vow), Wine.

Table 2. Description of the data bases used in the experiments

Database   Features   Classes   Patterns
Can        9          2         683
Gla        9          6         214
Ima        18         7         420
Ion        34         2         351
Iris       4          3         150
Liv        6          2         345
MS         2          2         690
Pima       6          2         532
Son        60         2         208
Veh        18         4         846
Vow        10         11        528
Wine       13         3         178

Table 3. The average percentage of misclassified rejected patterns and their standard deviation, with (WI) and without (WOI) the application of the intersection criterion. The 2 and 4 nearest neighbors were used to determine the nearest classes.

Database   2 nearest neighbors (WOI)   2 nearest neighbors (WI)   4 nearest neighbors (WI)
Can        33.1±35.9                   72.1±23.9                  76.6±21.2
Gla        38.5±14.4                   44.5±19.9                  45.7±16.5
Ima        59.5±38.2                   62.6±37.9                  53.8±36.9
Ion        34.5±31.8                   54.8±21.5                  65.7±23.7
Iris       0.0±0.0                     50.0±50.0                  66.7±28.9
Liv        12.0±11.1                   17.9±18.4                  13.9±18.1
MS         12.9±19.5                   32.2±20.9                  30.5±20.5
Pima       28.2±11.5                   33.2±12.0                  28.9±14.2
Son        17.7±13.7                   40.7±23.4                  37.5±23.7
Veh        39.9±10.4                   62.5±20.8                  50.1±20.8
Vow        76.6±8.9                    91.4±7.8                   87.2±12.0
Win        58.3±49.2                   58.3±7.8                   50.0±50.0

Each of them represents a different real-world application. MS is an artificial database created using two Gaussian functions. Their main characteristics are shown in Table 2. Usually, the classifier error is determined by the ratio of misclassified patterns to the total number of patterns. However, we do not consider this an appropriate way of estimating the RO performance. For instance, if all the elements are rejected, under the normal procedure the error made by the classifier + RO would take the value of zero. Clearly, such a result does not mean that the classifier operated correctly: it is impossible to make a mistake when no decision is taken. During the RO operation we face a risk: some patterns correctly classified by BP can be rejected, whereas ideally the RO should isolate only the misclassified elements. Therefore, we consider it better to estimate the RO error with:

$$error = error_{red} + \frac{-pat_{rech} + pat_{bie}}{pat_{tot}} = \frac{pat_{mal} + pat_{bie}}{pat_{tot}} \qquad (3)$$

where $error_{red}$ is the network error without applying the RO; $pat_{rech}$ is the number of misclassified rejected patterns; $pat_{bie}$ is the number of correctly classified rejected patterns; $pat_{mal}$ is the number of misclassified elements with RO; and $pat_{tot}$ is the total number of patterns.

The results of these experiments are shown in Tables 3 and 4. Specifically, Table 3 contains the average percentage of misclassified patterns rejected by the RO. It can be observed that, with the exception of the Iris database, these percentages take values between 12.0% and 91.4%. We should mention that for Iris and Win only 3 and 6 folds were used, respectively, because only these folds contain misclassified rejected patterns. Moreover, in the case of the 4 nearest neighbors, only 6, 9 and 5 folds were used for Gla, Son and Veh, respectively, because the neighbors were insufficient for determining the nearest classes. Table 4 contains the error of the RO, Eq. (3). It shows that in 33.3% of the databases the error with RO (BP+RO2 WOI) is smaller than the value generated by the classifier without RO. In other words, for these databases the number of misclassified rejected patterns is larger than the number of correctly classified rejected patterns, so the use of the RO decreased the error value. In these experiments we implicitly suppose that the two hyperplanes used are parallel. Table 4 also contains the error percentage when the intersection criterion is implemented in the RO procedure (BP+RO2 WI and BP+RO4 WI). Its positive influence can be observed in most of the databases, even when the amount of correctly classified rejected patterns increases (Table 5). However, the increase in misclassified rejected patterns is higher and therefore the total error decreases.

Table 4. The estimated average percentage of the BP error with and without the RO. Two versions of the RO were used: without (WOI) and with (WI) the intersection criterion. The k nearest neighbors, with k = 2 and 4, were used to determine the nearest classes.

Database   BP          BP+RO2 WOI   BP+RO2 WI   BP+RO4 WI
Can        11.3±4.8    16.0±6.2     13.6±6.5    8.9±5.2
Gla        54.8±11.9   44.1±11.2    40.2±9.6    41.5±10.9
Ima        7.5±5.0     13.3±7.4     8.5±3.4     7.7±3.9
Ion        11.1±6.9    20.9±6.5     21.9±6.5    21.9±5.0
Iris       3.3±5.7     7.4±6.6      10.9±7.3    8.0±6.1
Liv        34.8±8.8    41.6±6.0     42.3±7.3    41.1±5.4
MS         18.6±12.6   16.3±9.3     12.4±7.5    12.3±7.2
Pima       23.3±6.1    27.6±4.5     27.0±4.7    29.3±5.3
Son        44.4±14.9   48.2±8.2     39.9±14.5   40.1±13.2
Veh        38.1±13.5   33.4±4.6     31.1±8.9    34.3±5.6
Vow        40.2±13.9   18.7±5.4     11.5±5.0    19.2±13.8
Win        3.9±3.8     14.6±10.4    11.9±8.8    13.3±9.0
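
The error estimate of Eq. (3) can be checked numerically with a short sketch. The argument names follow the symbols of the equation; the counts in the example are hypothetical.

```python
def ro_error(pat_tot, misclassified, pat_rech, pat_bie):
    """First form of Eq. (3): network error corrected by the rejections.

    misclassified: patterns misclassified by the network without the RO.
    pat_rech:      misclassified patterns that were rejected.
    pat_bie:       correctly classified patterns that were rejected.
    """
    error_red = misclassified / pat_tot
    return error_red + (pat_bie - pat_rech) / pat_tot


def ro_error_direct(pat_tot, misclassified, pat_rech, pat_bie):
    """Second form of Eq. (3): (pat_mal + pat_bie) / pat_tot."""
    pat_mal = misclassified - pat_rech      # misclassified and not rejected
    return (pat_mal + pat_bie) / pat_tot


# Hypothetical counts: 100 patterns, 20 of them misclassified by the network.
print(ro_error(100, 20, 12, 5))     # useful RO: more errors caught than lost
print(ro_error(100, 20, 20, 80))    # everything rejected: heavily penalized
print(ro_error(100, 20, 0, 0))      # no rejection: plain network error
```

The error decreases only when the RO rejects more misclassified patterns than correctly classified ones, and rejecting every pattern no longer yields an error of zero.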

Table 5. The average percentages of correctly classified rejected patterns, without (WOI) and with (WI) the implementation of the intersection criterion

Database   BP+RO2 WOI   BP+RO2 WI   BP+RO4 WI
Can        9.3±2.9      11.8±5.9    7.0±4.3
Ima        23.9±12.7    22.2±5.8    25.4±10.2
Gla        10.1±6.4     6.2±3.0     5.4±4.4
Ion        15.0±5.8     18.8±7.1    19.7±5.4
Iris       4.1±4.9      9.1±7.4     6.8±6.3
Liv        14.6±12.2    17.9±14.1   14.6±14.9
MS         1.3±1.9      1.2±1.6     0.8±1.1
Pima       14.0±3.9     14.9±6.1    17.0±6.2
Son        20.1±13.4    24.1±19.3   24.0±17.8
Veh        20.2±4.8     25.9±15.5   26.3±13.4
Vow        15.3±8.1     14.0±5.1    25.0±17.9
Win        13.6±9.1     10.5±8.0    12.0±8.5

6 Conclusions

The described Rejection Option procedure performed well at detecting patterns that are misclassified by Backpropagation. The rejection percentage was never smaller than 12.0% and reached 91.4%. However, its effectiveness depends on the proportion between the correctly classified and the misclassified rejected patterns, which is why only 50% of the databases showed a decrease of the classification error. Two issues influence the efficiency of the Rejection Option. First, the number of nearest neighbors used for determining the two nearest classes: the training sample should contain a sufficient number of correctly classified patterns for each class. Second, the location of the hyperplanes is determined by artificially generated patterns, which consequently modify the number of rejected elements. To reduce this drawback, a pseudorandom number generator initialized with a constant seed is used, which allows the experimental results to be reproduced.

Acknowledgements. This work was partially supported by the Fondo Sectorial de Investigación para la Educación-CONACyT through the agreement SEP2003-C02-44225.


Parallelization of Algorithms with Recurrent Neural Networks

João Pedro Neto and Fernando Silva

Dept. Informatics, Faculty of Sciences, University of Lisbon, Portugal
{jpn,fsilva}@di.fc.ul.pt

Abstract. Neural networks can be used to describe symbolic algorithms like those specified in high-level programming languages. This article shows how to translate these network descriptions of algorithms into a more suitable format in order to feed an arbitrary number of parallel processors and so speed up the computation of sequential and parallel algorithms.

Keywords: Neural Networks, Parallelization, Symbolic Computing.

1 Introduction

Neural networks are typically used in learning and optimization problems. However, the initial works of McCulloch and Pitts in the 1940s presented neural networks to the scientific community as a computational model for logic operations [4]. Much later, in the 1990s, the equivalence of neural networks to Turing machines was established in [8], [9]. In the cited works, as in this paper, neural networks are not used for optimization or learning but, rather, as a way to express computations as general as those specified by Turing machines or by a typical programming language.

Herein, the concern is with neural networks that perform symbolic computation, i.e., computation where information has a defined and well-specified type (like integers or booleans). The main thesis is: given a high-level description of an algorithm A, it is possible to automatically create a neural network that computes A. Our previous works [5], [6], [7] support this thesis using a discrete-time recurrent neural network. More about symbolic processing in neural networks is found in [2], [9], [1], [3].

A related question follows: since these computations are executed over massively parallel architectures, can this feature be used to our advantage? This paper focuses on this problem. First, the paper explains the translation process from the initial program specification to an equivalent neural network (Section 2). Then a second translation transforms that network into a set of tuples more suitable for parallel computation (Section 3). Finally, Section 4 shows some simulation results, comparing the system's behavior against the use of Java threads.

A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part I, LNCS 6593, pp. 61–69, 2011. © Springer-Verlag Berlin Heidelberg 2011

2 First Translation: Programs to Networks

Herein, algorithms are described using a high-level programming language named netdef. This language was designed exclusively to ease the translation into neural networks. netdef is an imperative language whose main concepts are processes and channels (it was based on the parallel language occam-2). A program can be described as a collection of processes executing concurrently and communicating with each other through channels or shared memory. The language has assignment, conditional and loop control structures, and it supports several data types (booleans, integers, 'reals'), variable and function declarations, and other processes. It uses a modular synchronization mechanism based on handshaking for process ordering. A detailed description of netdef is found at https://docs.di.fc.ul.pt/ (report 99-5).

Programs written in netdef can be converted into neural networks through a compiler (available at www.di.fc.ul.pt/~jpn/netdef). The compiler takes a netdef program and translates it into a text description defining the neural network. Given neural hardware, an interface would translate the final description into a suitable syntax, so that the neural system may execute it.

The chosen recurrent neural network model is a discrete-time dynamical system, x(t + 1) = φ(x(t), u(t)), with initial state x(0) = x_0, where t denotes time, x_i(t) denotes the activity (firing frequency) of neuron i at time t, within a population of N interconnected neurons, and u_k(t) denotes the value of input channel k at time t, within a set of M input channels. The map φ is taken as the composition of an affine map with a piecewise linear map onto the interval [0,1], known as the piecewise linear function:

\[
\sigma(x) = \begin{cases} 1, & x \ge 1 \\ x, & 0 < x < 1 \\ 0, & x \le 0 \end{cases} \qquad (1)
\]

Function σ provides two discontinuities (the underflow at 0 and the overflow at 1) needed to assign universality to this type of network [6]. The dynamical system becomes

\[
x_j(t+1) = \sigma\left( \sum_{i=1}^{N} a_{ji}\, x_i(t) + \sum_{k=1}^{M} b_{jk}\, u_k(t) + c_j \right) \qquad (2)
\]

where a_ji, b_jk and c_j are rational weights. Because of the activation function σ, the information flow between neurons is preserved only within [0,1], implying that data types must be coded in this interval. Each type (integers, reals, booleans) has a different way to code its values. As an example, to represent values of type real within [-a,a], where a is a positive integer, the coding is α(x) = (x + a)/(2a), which is a one-to-one mapping of [-a,a] onto [0,1]. So, if a is 10, then, say, 5.0 would be represented in the network flow as the value α(5.0) = (5.0 + 10)/20 = 15/20 = 0.75.

Fig. 1 displays the graphical representation of equation 2 used throughout this paper. Usually, synaptic values equal to 1 are not displayed.

Fig. 1. Graphical notation for neurons, input channels and their interconnections
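As a minimal sketch of Eqs. (1) and (2) and of the coding of reals, the following Python fragment applies one synchronous update to a toy network. The function names (`sigma`, `step`, `encode_real`, `decode_real`) and the dense list-of-lists weight layout are assumptions made for illustration; they are not part of netdef or of its compiler. Exact rational arithmetic is used because the weights are stated to be rational.

```python
from fractions import Fraction

def sigma(x):
    """Piecewise linear activation of Eq. (1): clamp x to [0, 1]."""
    if x >= 1:
        return Fraction(1)
    if x <= 0:
        return Fraction(0)
    return Fraction(x)

def step(x, u, a, b, c):
    """One synchronous update of Eq. (2).

    x: neuron activities at time t (length N)
    u: input-channel values at time t (length M)
    a[j][i], b[j][k], c[j]: rational weights and biases
    """
    n, m = len(x), len(u)
    return [
        sigma(sum(a[j][i] * x[i] for i in range(n))
              + sum(b[j][k] * u[k] for k in range(m))
              + c[j])
        for j in range(n)
    ]

def encode_real(value, a):
    """Coding alpha(x) = (x + a) / (2a) of a real in [-a, a] into [0, 1]."""
    return (Fraction(value) + a) / (2 * a)

def decode_real(code, a):
    """Inverse of encode_real."""
    return code * 2 * a - a

# The worked example from the text: a = 10 and x = 5.0 give 15/20 = 3/4.
code = encode_real(5, 10)

# Hypothetical two-neuron network: neuron 0 copies the input channel,
# neuron 1 receives 1 + 1/4 and is therefore saturated at 1.
x_next = step(
    x=[Fraction(0), Fraction(1, 2)],
    u=[code],
    a=[[0, 0], [1, 2]],
    b=[[1], [0]],
    c=[0, Fraction(1, 4)],
)
```

Here `x_next` holds the coded value 3/4 in neuron 0 and the saturated value 1 in neuron 1, which shows why every data type has to be coded inside [0,1].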

Let's briefly illustrate the translation of an instruction, "if b then x := x-1", into the equivalent neural network (Fig. 2). In the main net, synapse in sends value 1 (from some previous neuron x_IN) into neuron x_M1, starting the computation. Module G (denoted in the main net by a square) computes the value of boolean variable 'b' and sends the 0/1 (i.e., false/true) result through synapse res. This module accesses the value 'b' and outputs it through neuron x_G3. This is achieved because the 1.0 bias of x_G3 is compensated by the value 1 sent by x_G1, allowing value 'b' to be the activation of x_G3. This result is synchronized with an output of 1 through synapse out. The next two neurons (still on the main net) decide between entering module P (if 'b' is true) or stopping the process (if 'b' is false). Module P makes an assignment to real variable 'x' with the value computed by module E. Before neuron x receives the activation value of x_P3, the module uses the output signal of E to erase its previous value. In module E the decrement of 'x' is computed (using α(1) as the coding of real 1). The 1/2 bias of neuron x_E2 for subtraction is necessary due to coding α. As an example of a neuron equation, the dynamics of neuron x is given by equation 3.

Fig. 2. Construction of "if b then x := x-1" process (main net with modules G, P and E)

\[
x(t+1) = \sigma\big( x(t) + x_{P3}(t) - x_{E3}(t) \big) \qquad (3)
\]

Notice, however, that if neuron x were used in other modules, the compiler would add further synaptic links to its equation. As seen in Fig. 2, netdef uses a handshaking mechanism to control the execution of each module. A module is connected to its immediate neighbors via in/out signal synchronization. A module only starts after receiving an in signal and ends by outputting an out signal. This allows for sequential and parallel block structures.

Fig. 3. A sequential module

The parallel module waits for the last signal of its sub-modules to arrive. Each of the out signals indicates the end of its sub-module's execution. Every signal is kept until all n neurons have value 1, due to the −(n − 1) bias of the right neuron, which is enough to compensate for up to n − 1 activated neurons. Only when all n neurons are sending values (meaning that all sub-modules have ended their execution) is the negative bias overcome; the right neuron then outputs 1 while, at the same time, resetting the left neurons (via the −1 synapses shown in Fig. 4).

Fig. 4. A parallel module
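The role of the −(n − 1) bias can be checked with a few lines of arithmetic. The sketch below only evaluates the activation of the right-hand (join) neuron for a given set of held out signals; it does not simulate the full module of Fig. 4, and the function names are illustrative.

```python
def sigma(x):
    """Piecewise linear activation: clamp x to [0, 1]."""
    return max(0.0, min(1.0, x))

def join_output(signals):
    """Activation of the join neuron fed by n held signals (each 0 or 1)
    through unit-weight synapses, with bias -(n - 1)."""
    n = len(signals)
    return sigma(sum(signals) - (n - 1))

# With n = 4 sub-modules the bias is -3: three finished sub-modules give
# a net input of 0, and only the fourth signal lifts it to 1.
partial = join_output([1, 1, 1, 0])
complete = join_output([1, 1, 1, 1])
```

Because σ clamps negative inputs to 0, any number of finished sub-modules short of n leaves the join neuron silent, so no extra counter is needed to detect completion.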

In conclusion, every neural network built this way is homogeneous, i.e., all neurons have the same activation function, and each final network is an independent module that can be reused in other contexts. Regarding time and space complexity, the compiled networks do not add to the algorithm's complexity. All modules have constant size, and most are quite small: module if consists of four neurons, as shown above; module while has three neurons; the assign module also has three neurons, etc.

3 Second Translation: Networks to Tuples

It is straightforward to simulate the execution of these nets using a regular computer (our own software can compile and execute the resulting nets). But what kind of 'neural' hardware would be adequate to compute these networks? The compilation described above produces modules which communicate via a small number of channels, but the resulting networks are nonetheless highly non-planar, with quite complex topologies. It would not be feasible to translate them into three-dimensional hardware of neurons and synapses. Besides, every algorithm produces a different network, so a fixed architecture would be useful just for a specific problem. It is theoretically possible to implement a universal algorithm, i.e., a neural network that codes and executes any algorithm, but there are easier solutions.

In this section we propose a translation of the previous network description into a set of tuples, namely triples, that are easier to handle and process. A neural network can be seen as a collection of synapses, and each synapse is totally defined by three values: the previous neuron (or 1 if the synapse refers to a neuron's bias), the synaptic value and the next neuron. Fig. 5 shows a translation example with three different types of synapse.

Fig. 5. This neural network translates to [(x,a,y), (y,b,y), (1,c,y)]
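A minimal Python sketch of this representation follows. The triple layout (source, weight, target) mirrors the ordering used in the caption of Fig. 5; the dictionary-based network description and the function name `to_triples` are assumptions for illustration and do not reproduce the text format emitted by the netdef compiler.

```python
BIAS = 1  # pseudo-source used for bias synapses, as in (1, c, y)

def to_triples(synapses, biases):
    """Flatten a network into a list of (source, weight, target) triples.

    synapses: dict mapping (source, target) -> weight
    biases:   dict mapping target -> bias value
    Zero-valued entries are skipped, which keeps the list sparse.
    """
    triples = [(src, w, dst) for (src, dst), w in synapses.items() if w != 0]
    triples += [(BIAS, c, dst) for dst, c in biases.items() if c != 0]
    return triples

# The network of Fig. 5 with illustrative numeric values for a, b and c.
a, b, c = 0.5, 1.0, -0.25
triples = to_triples(
    synapses={("x", "y"): a, ("y", "y"): b},
    biases={"y": c},
)
```

Neurons never appear as records of their own: they exist only as names in the source and target positions, which is what makes the list easy to cut into pieces.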

Given a network Φ, let's denote by L_Φ the respective triple list. In the worst case (a totally connected network) L_Φ has quadratic space complexity. However, netdef networks are highly sparse, making the size of L_Φ, in practice, proportional to the number of neurons. Notice that there is no need to keep detailed information about each neuron; neurons are implicitly defined in L_Φ. This list has a fixed size: it is possible to change the synaptic values dynamically, but it is not possible to create new neurons or delete existing ones. There is, however, the possibility of deactivating a neuron by assigning zero values to its input and output synapses.

So, given a netdef program specifying a sequential or parallel algorithm, the compiler translates it to a time-discrete recurrent network. Then a second translation parses the network description into a set of triples describing the network synapses, i.e., its topology. This set of triples can then be distributed over N different cpus to emulate the network dynamics. The network's state at a given time t is the activation of its neurons at time t. To compute the network's next state, i.e., for time t + 1, the system needs to execute the list of triples over the network state. To do this, the computing system also needs a shared memory to keep two network states: the current values of state t and the next values of state t + 1. The latter state is used to update the former when all triples have been processed, so that a new cycle can begin. More details can be found in [7]. In the next section, some simulation results of this computing system are presented and discussed.
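The update cycle described above can be sketched as follows. This is a simplified single-process model, assuming that each 'cpu' is just a slice of the triple list and that the shared memory is a pair of dictionaries; the names `split` and `run_cycle` are hypothetical, and the virtual machine of [7] may differ in its details.

```python
def sigma(x):
    """Piecewise linear activation: clamp x to [0, 1]."""
    return max(0.0, min(1.0, x))

def split(triples, n_cpus):
    """Deal the triples out to n_cpus lists in round-robin order."""
    return [triples[i::n_cpus] for i in range(n_cpus)]

def run_cycle(state, parts):
    """Compute the state at t + 1 from the state at t.

    state: dict neuron -> activation at time t (only read during the cycle)
    parts: per-cpu lists of (source, weight, target) triples; source 1 = bias
    """
    acc = {neuron: 0.0 for neuron in state}      # buffer for time t + 1
    for part in parts:                           # each part models one cpu
        for src, w, dst in part:
            acc[dst] += w * (1.0 if src == 1 else state[src])
    # The new state is published only after every triple has been processed.
    return {neuron: sigma(total) for neuron, total in acc.items()}

triples = [("x", 0.5, "y"), ("y", 1.0, "y"), (1, -0.25, "y"), ("x", 1.0, "x")]
state = {"x": 1.0, "y": 0.5}
next_state = run_cycle(state, split(triples, n_cpus=2))
```

Since every triple reads only the state at time t and adds into the buffer for t + 1, the result does not depend on how the triples are divided among the cpus; the only point of contention is the addition into a shared target neuron.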

4 Results and Discussion

This section includes two parallel test examples. The first test deals with the execution of ten sorting algorithms over partitions of ten elements (namely, the O(n^2) insertion sort is used). The second test deals with computing the product of two 4 × 4 matrices (using the standard O(n^3) algorithm). The main goal of these tests is to gather evidence of how an increasing number of cpus affects the time needed to achieve the result. We shall compare this with the use of Java threads. Notice that we will not compare absolute times but only the decrease of total processing as the number of cpus/threads increases. Absolute times would tell us nothing, since we are running a simulation, not using real parallel hardware, and our system was not optimized (all triples are processed in each cycle, although they do not need to be). Much more interesting is to check up to what point we can add processing units without strongly diminishing returns.

Also, we will not use time to measure the results. It is assumed that the number of triples will always be greater than the number of cpus, and that each cpu will have a balanced load. So, the number of triples processed per cpu will always decrease linearly if we do not apply any restrictions on triple distribution. What we show instead are results for a measure that is, arguably, the operation with the greatest impact on performance: the potential write conflicts in shared memory due to asynchronous I/O operations over shared neural states. A write conflict occurs when two processors try to update one neural state at the same time. Given a test and a number of cpus, we monitor the maximum number of conflicts among all cpus and all network states, i.e., from time t = 0 until the computation stops. This measure is a worst-case scenario: all writes over the same neuron count, in this scenario, as a writing conflict (actually, most parallel updates by different cpus will occur at different instants).

Another decision is how to distribute triples over cpus. Some possibilities were considered and tested, such as: (a) assign the next n triples to the next cpu; (b) group triples from the same network module (typically a small number of highly interconnected neurons, as seen in Fig. 2) and assign them to the least occupied cpu via a Round Robin mechanism.

For the first test (ten parallel sorting processes), the number of potential conflicting write operations grows in a stable linear way up to ≈ 70 cpus. Then the slope becomes gentler and starts to stabilize around 150 cpus, when the system is no longer slowed down by this measure (there are not that many conflicting neurons in the entire neural network). The matrix test shows a similar shape: linear growth, with a gentler slope at around 80–90 cpus and then a stabilization around 160 cpus. Method (b), as expected, achieved better results (≈ 25% fewer conflicts) in this measure. Fig. 6 shows these numbers for an increasing number of cpus in the sort test and Fig. 7 shows the same values for the product of two 4 × 4 matrices.

Fig. 6. Potential write conflicts for the insertion sort test (curves for Method (a) and Method (b))
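The two distribution strategies and a conflict count can be sketched in Python as below. The conflict measure used here, the number of neurons that receive writes from more than one cpu, is a simplified assumption and not necessarily the exact quantity plotted by the authors; grouping triples by the module of their target neuron and all function names are likewise illustrative choices.

```python
from collections import defaultdict

def method_a(triples, n_cpus):
    """Method (a): hand consecutive blocks of triples to consecutive cpus."""
    size = -(-len(triples) // n_cpus)            # ceiling division
    return [triples[i * size:(i + 1) * size] for i in range(n_cpus)]

def method_b(triples, module_of, n_cpus):
    """Method (b): keep each module's triples together and give every
    module to the cpu that currently holds the fewest triples."""
    groups = defaultdict(list)
    for t in triples:
        groups[module_of[t[2]]].append(t)        # group by target's module
    parts = [[] for _ in range(n_cpus)]
    for group in groups.values():
        min(parts, key=len).extend(group)
    return parts

def potential_conflicts(parts):
    """Assumed measure: number of neurons written by more than one cpu."""
    writers = defaultdict(set)
    for cpu, part in enumerate(parts):
        for _src, _w, dst in part:
            writers[dst].add(cpu)
    return sum(1 for cpus in writers.values() if len(cpus) > 1)

# Two hypothetical modules, G and P, with three triples each, listed in
# an interleaved order so that consecutive blocks cut across modules.
triples = [("g1", 1, "g2"), ("p1", 1, "p2"), ("g2", 1, "g3"),
           ("p2", 1, "p3"), ("g1", 1, "g3"), ("p1", 1, "p3")]
module_of = {"g2": "G", "g3": "G", "p2": "P", "p3": "P"}

conflicts_a = potential_conflicts(method_a(triples, 2))
conflicts_b = potential_conflicts(method_b(triples, module_of, 2))
```

In this toy case method (a) splits the writes to neuron g3 across both cpus, while method (b) keeps each module on one cpu and produces no shared targets, which illustrates the trade-off between load balance and write traffic.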

Some other relevant data: the neural network for the sort algorithm consists of 3268 triples, while the matrix multiplication algorithm produced 7307 triples. Using method (a) with 250 cpus, there is a maximum of 14 triples/cpu for the sort test and 30 triples/cpu for the matrix test. With method (b) there is a maximum of 101 triples/cpu with 34 cpus for the sort test (adding more processors has no impact on performance with method (b)); for the matrix test the system computes a maximum of 67 triples/cpu with 141 cpus. This means that method (a) achieves better results on this triples/cpu measure and may be a better solution, depending on the cost of increased write traffic on shared memory.

The Java threads, solving the same parallel problems, behave as expected. There is a linear decrease while the system is able to split the different processes over the available threads. When there is no way to trivially split the parallel jobs (10 sorts and 16 matrix cells), the maximum number of operations stabilizes. Table 1 shows the number of operations for the busiest thread. This problem, in our approach, is solved by the automatic and non-trivial fragmentation of a given program into a myriad of loosely related triples that can easily be distributed among the available processors.

There are possibilities for optimization in the number of triples the system really needs to compute. This will not improve the linear rate shown above, but may achieve considerable constant speedups. Notice that the network modules may not all be active at once. Except for highly parallel algorithms, there will be only a small number of modules active at any given moment. So, many triples (those from the inactive modules) are not used and, if possible, should not enter the next computation step. How can we easily compute which triples should be processed? Herein, the in/out synchronization mechanism is again helpful.
Since a certain module is only activated after its input neuron receives an activation signal (i.e., the previous synapse receives a 1), we can keep the triples of those input synapses (let's call them input triples) as guards to the triples representing the remaining module structure. Every time an input triple is activated, the cpu uploads the entire triple structure of that module (notice that this may or may not include the inner sub-modules, depending on the number of triples these sub-modules of arbitrary complexity may represent) and computes it along with all the other active triples. When an active module ends its computation, the output triple (representing the synapse that transfers the output signal to the input neuron of the next module) is activated and the system has enough information to remove those same triples from the pool of active triples. These sets and their guards are defined just once at the beginning, when triples are distributed, not during execution, which would needlessly slow down the computation. Using this mechanism, the number of triples in execution depends only on the number of active modules and not on the entire network structure. This would speed up the execution of the active modules and provide a more efficient use of the available parallel processing power.

Fig. 7. Potential write conflicts for the matrix test (curves for Method (a) and Method (b))

Table 1. Operations per Java thread

threads    insertion sort    matrix
1          1000              11617
2          500               5808
3          400               4356
4          300               2904
5          200               2904
6,7        200               2178
8,9        200               1452
10         100               1452
11–15      100               1452
16+        100               726
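The figures in Table 1 are consistent with a simple model in which indivisible jobs are dealt out to threads and the busiest thread receives ceil(jobs / threads) of them. The check below is a reading of the table, not a description of the Java code used by the authors; the per-job costs (100 operations per sort, 726 per matrix cell) are inferred from the last row, and the single-thread matrix entry (11617) exceeds 16 × 726 by one operation, which this model does not explain.

```python
import math

def busiest_thread_ops(jobs, ops_per_job, threads):
    """Operations executed by the most loaded thread when `jobs`
    indivisible jobs are spread as evenly as possible over `threads`."""
    return math.ceil(jobs / threads) * ops_per_job

# Insertion-sort column: 10 jobs, assumed to cost 100 operations each.
sort_column = {t: busiest_thread_ops(10, 100, t)
               for t in (1, 2, 3, 4, 5, 7, 9, 10, 15, 16)}

# Matrix column: 16 cells, assumed to cost 726 operations each.
matrix_column = {t: busiest_thread_ops(16, 726, t)
                 for t in (2, 3, 4, 5, 6, 8, 10, 16)}
```

The plateaus in the table (for instance 200 operations from 5 to 9 threads) appear wherever an extra thread does not lower ceil(jobs / threads), which is the stabilization described in the text once jobs can no longer be split trivially.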

5 Conclusions

We presented a way to translate an algorithm described in netdef, a high-level parallel programming language, into a collection of triples suitable to be distributed over many parallel computing units. To achieve this, we used a time-discrete recurrent neural network as an intermediate description, with a modular synchronization mechanism based on handshaking for process ordering, where each triple is the description of one of the network's synapses. We presented some parallel tests, namely sorting several vectors and multiplying two 4 × 4 matrices. Some evidence was collected showing that this algorithm representation is able to keep up the linear speedup for a large number of available parallel processors. This is, arguably, due to the loose connection that each triple has with the remaining triples.

Acknowledgements. This work was supported by LabMAg (Laboratório de Modelação de Agentes) and FCT (Fundação para a Ciência e Tecnologia).

References
1. Carnell, A., Richardson, D.: Parallel computation in spiking neural nets. Theoretical Computer Science 386(1-2), 57–72 (2007)
2. Gruau, F., Ratajszczak, J., Wibe, J.: A neural compiler. Theoretical Computer Science 141, 1–52 (1995)
3. Herz, A., Gollisch, T., Machens, C., Jaeger, D.: Modelling Single-Neuron Dynamics and Computations: A Balance of Detail and Abstraction. Science 314, 80–85 (2006)
4. McCulloch, W., Pitts, W.: A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics 5, 115–133 (1943)
5. Neto, J., Siegelmann, H., Costa, J.: On the Implementation of Programming Languages with Neural Nets. In: First International Conference on Computing Anticipatory Systems, vol. 1, pp. 201–208 (1998)
6. Neto, J., Costa, J., Siegelmann, H.: Symbolic Processing in Neural Networks. Journal of Brazilian Computer Society 8(3), 58–70 (2003)
7. Neto, J.: A Virtual Machine for Neural Computers. In: Kollias, S.D., Stafylopatis, A., Duch, W., Oja, E. (eds.) ICANN 2006. LNCS, vol. 4131, pp. 525–534. Springer, Heidelberg (2006)
8. Siegelmann, H., Sontag, E.: Analog Computation via Neural Networks. Theoretical Computer Science 131, 331–360 (1994)
9. Siegelmann, H.: Neural Networks and Analog Computation, Beyond the Turing Limit. Birkhäuser, Basel (1999)

Parallel Training of Artificial Neural Networks Using Multithreaded and Multicore CPUs

Olena Schuessler and Diego Loyola

German Aerospace Center, Institute of Remote Sensing, Münchner Straße 20, 82234 Weßling, Germany
{Olena.Schuessler,Diego.Loyola}@dlr.de

Abstract. This paper reports on methods for the parallelization of artificial neural network algorithms using multithreaded and multicore CPUs in order to speed up the training process. The developed algorithms were implemented in two common parallel programming paradigms, and their performance is assessed using four datasets with diverse numbers of patterns and with different neural network architectures. All results show a significant increase in computation speed; for problems with very large training datasets, the training time decreases nearly linearly with the number of cores.

Keywords: Neural network training, multithreading and multicore, Pthreads and OpenMP parallelization.

1 Introduction

In recent years we have observed a growing interest in artificial neural networks for solving all kinds of classification, function approximation, interpolation and forecasting problems. Neural networks, as universal approximators [1], are a very powerful tool which can reproduce extremely complicated non-linear dependencies. Neural networks learn an underlying function from input/output examples, and normally the more complicated the problem is, the more examples, also called patterns, are needed.

Training a neural network for complicated and multi-dimensional problems normally means using very large training sets with hundreds of thousands or even millions of patterns. Such training can take weeks or even months to reach the desired accuracy. In the same way, finding an optimal neural network configuration requires a certain number of cross-validation experiments, which can also be very time consuming. Therefore there is a need to speed up the training process of neural networks, especially for very large training datasets.

Nowadays, multithreaded and multicore CPUs with shared memory are a cost-effective way of obtaining significant increases in CPU performance. An exponential growth in performance is expected in the near future from more hardware threads and cores per CPU [2]. Researchers have recently focused their attention on parallelizing a variety of computational intelligence algorithms [3-6] using these new CPUs.

For neural networks two basic approaches to parallelization can be defined: parallelizing the neural network structure and parallelizing the training process. The first approach uses the parallel nature of neural networks and assigns a separate thread to each processing node (neuron) [3]. All neurons in one layer are processed simultaneously and synchronized before propagating into the next layer. The second approach is to assign a part of the training dataset to each thread and process (train) the parts simultaneously. This approach is covered in [4], where a three-layer perceptron neural network is parallelized and tested using two and eight threads. The same technique is implemented for dual-core processors in [5].

Our study is more general than the aforementioned papers, as it covers a larger diversity of training algorithms, neural network architectures and parallel training implementations. In particular we focus on speeding up the training process for very large datasets with neural networks of complicated structure containing more than three layers, which will allow us to solve complex nonlinear real-world problems.

This paper is organized as follows: section 2 gives a short summary of multilayer perceptron neural networks and the backpropagation learning algorithm; section 3 describes the parallel implementation of the neural network training; section 4 shows the results with different test problems and section 5 presents the conclusions.

A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part I, LNCS 6593, pp. 70–79, 2011. © Springer-Verlag Berlin Heidelberg 2011

2 Multilayer Perceptron Neural Network

One of the most popular neural network architectures is the multilayer perceptron network (MLP) [7]. The MLP network is composed of an input layer, one or more hidden layers and an output layer. Each layer contains a certain number of neurons, and all neurons of neighboring layers are interconnected, see Figure 1.

Fig. 1. Architecture of a 4-layer Multilayer Perceptron Neural Network with 2 input neurons, 2 hidden layers with 3 and 2 neurons, and 1 output neuron

Each connection in the network has an associated weight. The task of each neuron is to compute the weighted sum of its inputs and to transform it into the output signal. This transformation is done with an activation function; the most popular activation functions for MLP are the sigmoid and Gaussian functions:

\[
y = \frac{1}{1 + \exp(-x)}, \qquad y = \exp(-x^{2}) \qquad (1)
\]

In a feedforward MLP, computations propagate from the input layer through the hidden layers to the output layer, which calculates the output signal.
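The forward computation can be sketched in a few lines of Python. The nested-list weight layout and the function names are assumptions for illustration, and bias terms are omitted because the text above does not describe them.

```python
import math

def sigmoid(x):
    """Sigmoid activation y = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def gaussian(x):
    """Gaussian activation y = exp(-x^2)."""
    return math.exp(-x * x)

def forward(layers, inputs, activation=sigmoid):
    """Propagate `inputs` through a feedforward MLP.

    layers: list of weight matrices; layers[l][j][i] is the weight from
            neuron i of layer l to neuron j of layer l + 1
    Returns the activations of every layer, input layer first.
    """
    outputs = [list(inputs)]
    for weights in layers:
        prev = outputs[-1]
        outputs.append([activation(sum(w * o for w, o in zip(row, prev)))
                        for row in weights])
    return outputs

# The 2-3-2-1 architecture of Fig. 1 with all weights set to zero:
# every weighted sum is 0, so each sigmoid neuron outputs 0.5.
zero_net = [[[0.0] * 2 for _ in range(3)],
            [[0.0] * 3 for _ in range(2)],
            [[0.0] * 2 for _ in range(1)]]
acts = forward(zero_net, [0.3, -1.2])
```

Keeping the activations of every layer, rather than only the final output, is a deliberate choice: backpropagation needs them to compute the delta weights.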


2.1 Backpropagation Learning

Once we have computed the output of the neural network, we can compare it to a desired output value. The difference between these values is the network error. In order to reduce this error and to get a good match between network output and expected values, we can iteratively adjust the weights in the network until we reach a good agreement. One of the most common algorithms for adjusting the weights is the backpropagation algorithm [8], which propagates the error of the output layer back to the input layer and changes the network weights. There are two approaches to backpropagation training: in incremental (on-line) learning we update the weights after each pattern is presented to the network, and in batch learning we accumulate error values over all patterns and then update the weights. The classical algorithm of batch training is described in Figure 2.

Neural network training algorithm
1. Initialize the weights in the network; set the desired error value $e_{max}$ and the maximum number of iterations $itr_{max}$; initialize the delta weights and the number of iterations with zero
2. For each pattern $p$ in training set $T$ do:
3. Calculate the error of the output neuron $e^{p}_{out} = o^{p} - t^{p}$, where $o^{p}$ is the output of the network and $t^{p}$ is the expected (target) value for pattern $p$
4. Backpropagate: calculate the errors of the neurons in the hidden layers, $e^{p}_{hid} = e^{p}_{out} \cdot w_{hid} \cdot dy(p)$, where $dy$ is the derivative of the activation function $y$
5. Calculate the delta weights $\Delta w^{p}_{ij} = \delta_j \cdot o_i$, where $\delta_j = dy_j(p) \cdot e^{p}_{out}$ for neurons of the output layer, and $\delta_j = dy_j(p) \cdot \sum_{k \in outp(j)} \delta_k \cdot w_{jk}$ for neurons of a hidden layer
6. Accumulate the newly calculated delta weights: $\Delta w_{ij} = \Delta w_{ij} + \Delta w^{p}_{ij}$
7. End For
8. Update the weights in the network: $w_{ij} = w_{ij} + \eta \cdot \Delta w_{ij}$, where $\eta$ is the learning rate
9. Compute the mean square network error (MSE) on training set $T$ as $e_{out} = \frac{1}{|T|} \sum_{p \in T} (e^{p}_{out})^{2}$, where $|T|$ is the cardinality of $T$, i.e. the number of patterns in the training dataset
10. If $e_{out} > e_{max}$ and $itr < itr_{max}$, then increment the number of iterations and return to 2.

Fig. 2. Algorithm for neural network backpropagation in batch learning mode
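A compact pure-Python sketch of the batch procedure of Fig. 2 is given below. It rests on several assumptions that are not in the original: all neurons use the sigmoid, the bias is modelled as a constant extra input, the delta weights are reset at the start of every iteration, and the error is taken as target minus output so that adding η·Δw moves the output towards the target (Fig. 2 writes the error as output minus target). The function names are illustrative.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(layers, inputs):
    """Return the activations of every layer, input layer first."""
    outs = [list(inputs)]
    for weights in layers:
        outs.append([sigmoid(sum(w * o for w, o in zip(row, outs[-1])))
                     for row in weights])
    return outs

def pattern_deltas(layers, inputs, target):
    """Steps 3 to 5 for one pattern: delta weights for every layer."""
    outs = forward(layers, inputs)
    # Output layer: delta_j = dy_j * e_j, with dy = y * (1 - y) for the sigmoid
    # and e_j = target - output (sign convention assumed in this sketch).
    delta = [o * (1 - o) * (t - o) for o, t in zip(outs[-1], target)]
    grads = [None] * len(layers)
    for l in reversed(range(len(layers))):
        grads[l] = [[d * o for o in outs[l]] for d in delta]
        if l > 0:  # hidden layer: delta_i = dy_i * sum_j delta_j * w_ji
            delta = [outs[l][i] * (1 - outs[l][i])
                     * sum(d * layers[l][j][i] for j, d in enumerate(delta))
                     for i in range(len(outs[l]))]
    return grads

def mse(layers, patterns):
    """Step 9: mean squared output error over the training set."""
    return sum((t - o) ** 2
               for x, target in patterns
               for t, o in zip(target, forward(layers, x)[-1])) / len(patterns)

def train_batch(layers, patterns, eta=0.5, e_max=1e-3, itr_max=200):
    """Batch loop of Fig. 2; returns the MSE recorded after each iteration."""
    history = []
    for _ in range(itr_max):
        total = [[[0.0] * len(row) for row in w] for w in layers]
        for x, target in patterns:                      # steps 2 to 7
            for l, g in enumerate(pattern_deltas(layers, x, target)):
                for j, row in enumerate(g):
                    for i, v in enumerate(row):
                        total[l][j][i] += v             # step 6
        for l, w in enumerate(layers):                  # step 8
            for j, row in enumerate(w):
                for i in range(len(row)):
                    row[i] += eta * total[l][j][i]
        history.append(mse(layers, patterns))           # step 9
        if history[-1] <= e_max:                        # step 10
            break
    return history

# Logical OR, with a constant third input acting as the bias.
random.seed(0)
net = [[[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)],
       [[random.uniform(-1, 1) for _ in range(2)] for _ in range(1)]]
data = [([0, 0, 1], [0]), ([0, 1, 1], [1]), ([1, 0, 1], [1]), ([1, 1, 1], [1])]
history = train_batch(net, data)
```

The weights are touched only in step 8, after all patterns have contributed to `total`; this separation between accumulation and update is what distinguishes batch learning from the incremental mode.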

3 Parallelized Neural Network Training

The most common way to create parallel programs for shared memory systems is multithreading. Functional parallelism works through specially programmed thread functions, and data parallelism works through the shared virtual address space of a process, which can be accessed from every thread.

Parallel Training of Artificial Neural Networks Using Multithreaded and Multicore CPUs


3.1 Parallel Backpropagation Learning

To parallelize the backpropagation algorithm we decided to use the batch training approach, as it is relatively easy to adapt for multithreaded and multicore CPUs. First we divide the training dataset $T$ into equal parts $T_1, T_2, \ldots, T_N$, where $N$ is the number of threads. Steps 2 to 7 of the algorithm presented in Figure 2 can then run in parallel threads, each of which independently performs backpropagation for every pattern in the dataset $T_k$ assigned to it. When every thread has finished steps 2 to 7 for its training dataset $T_k$, the corresponding delta weights $\Delta w_{ij}^{T_k}$ are accumulated:

$\Delta w_{ij} = \sum_{k=1}^{N} \Delta w_{ij}^{T_k}$    (2)

After that, steps 8 and 9 of the algorithm are executed sequentially: the weights of the network are updated and the new network error is computed. If the expected network accuracy is not reached, the next iteration is started by repeating steps 2 to 7 in parallel. To avoid overfitting during training we use the early stopping method [9]: training is stopped when the network error on an independent test dataset (with patterns not included in the training dataset) increases. Figure 3 compares the training and testing steps of sequential batch learning with the corresponding steps of the proposed parallel batch learning. It is important to note that our parallelized training algorithm does not move or copy the training datasets between threads, which would take too long for large datasets; only the relatively few neural network weights are updated and copied.
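The partition-and-accumulate idea behind equation (2) can be illustrated with a short Python sketch. It is a structural illustration only, under several assumptions: a single linear neuron stands in for the full network, the names `partition_delta` and `parallel_delta` are hypothetical, and Python threads are used merely to show the data flow (in CPython such a loop gains no real speedup, unlike the Pthreads and OpenMP implementations discussed in this paper).

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def partition_delta(w, X_k, t_k):
    """Steps 2-7 on one partition T_k, for a single linear neuron."""
    delta = np.zeros_like(w)
    for x, t in zip(X_k, t_k):
        e = w @ x - t          # output error for one pattern
        delta += e * x         # accumulate the per-pattern delta weights
    return delta

def parallel_delta(w, X, t, n_threads):
    """Split T into N nearly equal parts and sum the partial deltas, as in (2)."""
    bounds = np.linspace(0, len(X), n_threads + 1).astype(int)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # Slices are views, so the patterns themselves are not copied.
        futures = [pool.submit(partition_delta, w, X[a:b], t[a:b])
                   for a, b in zip(bounds[:-1], bounds[1:])]
        return sum(f.result() for f in futures)
```

Because the accumulation in step 6 is a plain sum over patterns, the result of `parallel_delta` agrees with a single sequential pass over the whole dataset up to floating-point rounding, which is what makes the batch mode straightforward to parallelize.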

Fig. 3. Schematic representation of the neural network training and testing procedure for (a) sequential and (b) parallelized backpropagation implementation

Analogously to the parallelization of the batch training algorithm, we created parallel versions of two other popular training algorithms, which for some tasks proved to be faster and to converge better: RProp [10] and QuickProp [11].


O. Schuessler and D. Loyola

3.2 Multithreaded Training Implementation

The most common approaches to multithreaded programming are POSIX Threads (Pthreads) [12], which is library based and requires explicitly parallel code, and OpenMP [13], which is based on compiler directives and can be applied to existing serial code. An OpenMP implementation is usually more straightforward, whereas a Pthreads implementation gives more control over the parallelization at the cost of explicit thread programming [14]. The relative performance of Pthreads-based and OpenMP-based parallelization is problem and platform specific [15]. In this work we implement different neural network training algorithms using both parallelization techniques and compare their performance. Figure 4 shows in detail the parallelization scheme developed for the neural network training, including the division of work between the master thread and its slave threads.

Fig. 4. Parallel implementation of the backpropagation learning algorithm. Here the master thread divides training data into equal parts. Each slave trains the neural network using the subset of the training data assigned to it. When all threads are ready, the master thread computes the overall weights update of the neural network.


In the master thread we first create a new neural network (or load it from a file if we want to continue training an existing neural network). The master thread then reads the training dataset $T$, separates it into equally large datasets $T_1, T_2, \ldots, T_N$, initializes $N$ slaves and sends to each of them a separate part $T_k$ of the training dataset together with a copy of the original neural network. Each slave computes the weight delta updates based on its current copy of the neural network and its training dataset $T_k$. As soon as all slaves have finished their computations, they send the accumulated weight deltas to the master thread, which combines them to update the weights of the reference neural network and computes the network output error (in our implementation the mean square error, MSE) for the test dataset. This process is repeated until the MSE value of the test dataset is smaller than a given threshold or until a maximum number of training epochs is reached.

4 Simulations and Results

4.1 Datasets Used for Simulations

We test the sequential and parallelized training algorithms on four different problems. The corresponding datasets, neural network configurations and training algorithms are listed in Table 1.

Table 1. Simulation datasets and corresponding network configurations

| Dataset: name | Dataset: number of patterns | Dataset: number of inputs | Dataset: number of outputs | Training algorithms | Neural network: number of layers | Neural network: number of neurons |
|---|---|---|---|---|---|---|
| Characters recognition | 3823 | 64 | 10 | Backprop & RProp | 5 | 177 |
| Surface interpolation | 7840 | 2 | 1 | Backprop & QuickProp | 4 | 53 |
| Ozone extrapolation | 120072 | 2 | 1 | Backprop & RProp | 4 | 33 |
| O2 A-band simulation | 1998855 | 8 | 62 | Backprop & QuickProp | 5 | 126 |

The first dataset is a typical classification problem of handwritten digits [16] and has a few thousand patterns; four of them are shown in Figure 5(a). The second dataset corresponds to the surface interpolation problem shown in Figure 5(b), with a few thousand patterns computed using the function $z = \sin\left(\sqrt{x^2 + y^2}\right) / \sqrt{x^2 + y^2}$. The third dataset describes a spatial extrapolation problem for the global concentration of ozone in the atmosphere as measured by satellites [17]. The inputs are latitude and longitude and the expected total ozone is the output, see Figure 5(c). This dataset contains more than one hundred thousand patterns. The fourth dataset is from a function approximation problem with almost two million training patterns of oxygen A-band reflectivity simulations for various geophysical conditions [18]. An example of such training patterns is shown in Figure 5(d).
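As an illustration of the surface interpolation function given above, the following NumPy sketch evaluates it on a grid. The sampling range and grid size are assumptions made for the example only (the text does not state how the patterns were sampled), and the function name `surface` is hypothetical.

```python
import numpy as np

def surface(x, y):
    """z = sin(r) / r with r = sqrt(x^2 + y^2); the limit value 1 is used at r = 0."""
    r = np.sqrt(np.asarray(x, dtype=float) ** 2 + np.asarray(y, dtype=float) ** 2)
    # np.sinc(u) computes sin(pi*u) / (pi*u), so sinc(r / pi) equals sin(r) / r
    return np.sinc(r / np.pi)

# Hypothetical sampling: a 50 x 50 grid on [-10, 10] x [-10, 10]
xs, ys = np.meshgrid(np.linspace(-10.0, 10.0, 50), np.linspace(-10.0, 10.0, 50))
inputs = np.column_stack([xs.ravel(), ys.ravel()])   # two inputs per pattern
targets = surface(inputs[:, 0], inputs[:, 1])        # one output per pattern
```

Using `np.sinc` avoids the division by zero at the origin, where the function takes its maximum value of 1.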

Fig. 5. (a) Example patterns for handwritten characters 3, 5, 7 and 9 from the Character recognition dataset; (b) plot of the Surface interpolation dataset; (c) satellite data of global ozone concentration from the Ozone extrapolation dataset; (d) example of pattern from O2 A-band dataset (reflectivity as function of wavelength and cloud-top height)

4.2 Efficiency of Parallelization

For the simulations we used different network configurations as described in Table 1. A fixed learning rate of 0.7 is used in all simulations. To quantify the speedup $S$ of the parallelized training over sequential training, we use the following equation:

$S = \frac{\tau_{single}}{\tau_{parall.}} \cdot 100\%$    (3)

where $\tau_{single}$ and $\tau_{parall.}$ are the computation times of the sequential and parallelized versions, both measured over 1000 epochs of training. We also compute the ideal speedup $S_{ideal}$ for each problem according to Amdahl [19]:

$S_{ideal} = \frac{1}{\frac{f}{n} + 1 - f} \cdot 100\%$    (4)

where $f$ is the fraction of the code which is parallelizable and $n$ is the number of cores. The efficiency $E$ of the parallelized training is calculated using

$E = \frac{\tau_{single}}{n \cdot \tau_{parall.}}$    (5)
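Equations (3) to (5) translate directly into code. The short Python sketch below is illustrative only: the function names are hypothetical and the timings in the usage note are made-up values, not measurements from the experiments.

```python
def speedup_percent(t_single, t_parallel):
    """Equation (3): measured speedup in percent."""
    return t_single / t_parallel * 100.0

def ideal_speedup_percent(f, n):
    """Equation (4): Amdahl's ideal speedup in percent.

    f is the parallelizable fraction of the code, n the number of cores.
    """
    return 100.0 / (f / n + 1.0 - f)

def efficiency(t_single, t_parallel, n):
    """Equation (5): speedup per core, equal to 1.0 for perfect scaling."""
    return t_single / (n * t_parallel)
```

For example, hypothetical timings of 743 s for the sequential run and 100 s for the parallel run on 8 cores give `speedup_percent(743.0, 100.0)` of about 743 and `efficiency(743.0, 100.0, 8)` of about 0.93, while a fully parallelizable code (`f = 1`) on 8 cores has an ideal speedup of 800%.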

The sequential and parallel trainings were started using the same initial values of the network weights. Each experiment was repeated 10 times using different network configurations, and the average training time was then computed. The computers used for our simulations have 2 Quad-Core L5420 Intel CPUs (2.5 GHz) with 8 GByte RAM each. Figure 6 shows the computed ideal speedup for each problem and the measured speedup of the parallelized network training for the four simulation datasets using both parallelization techniques: POSIX threads on the left and OpenMP on the right.

[Figure 6: two panels, Pthreads (left) and OpenMP (right); x-axis: number of threads; y-axis: speedup [%]; curves: O2 A-Band spectrum simulation, Surface approximation, Characters recognition, Ozone concentration, Ideal speedup]

Fig. 6. Measured speedup (solid lines) of neural network training parallelization as a function of the number of threads used, and expected ideal speedup (dashed lines). Slightly better results for the four simulation datasets are obtained on an 8-core computer with the parallel training implemented in OpenMP (right) compared to the one implemented in Pthreads (left).

We can see that the best speedup is achieved when we use as many threads as there are available cores. When more threads than cores are used, the time spent on context switches and scheduling considerably lowers the efficiency of the parallelization. Figure 7 shows the efficiency of the parallelization for the four problems for different numbers of threads/cores. Increasing the number of threads causes some synchronization overhead, but a substantial speedup is still obtained in all cases. The best efficiency and speedup are achieved for the O2 A-Band spectrum simulation problem (740% with Pthreads and 743% with OpenMP) for 8 threads. This can be explained by the large size of the training dataset (1998855 patterns) and the relatively small network structure (126 neurons). Each core is highly loaded with the training of the assigned patterns, and a relatively short synchronization time is needed for updating the relatively few weights of the master network and copying them back to the slave threads. For smaller datasets the efficiency is lower because the time for synchronization is relatively high compared with the time for training. We also observe that the training algorithm parallelization with OpenMP performs slightly better than the one implemented with POSIX threads.

[Figure 7: two panels, Pthreads (left) and OpenMP (right); x-axis: number of threads; y-axis: efficiency; curves: O2 A-Band spectrum simulation, Surface approximation, Characters recognition, Ozone concentration]

Fig. 7. Efficiency of the neural network training parallelization on an 8-core computer as a function of the number of threads used. The efficiency is highest and stays nearly constant for the problem with the largest number of patterns (O2 A-Band spectrum) in both OpenMP (right) and Pthreads (left) implementations.

5 Conclusions

In this paper we proposed a method for parallelizing neural network training based on the backpropagation algorithm and implemented it using two different multithreading techniques (OpenMP and POSIX threads) applicable to the current and next generation of multithreaded and multicore CPUs. The main motivation for parallelizing the training process was to speed up training on very large datasets. The speedup and efficiency of the proposed parallel algorithm and its multithreaded implementations were analyzed using four different datasets. A nearly ideal speedup was reached for very large training datasets. For example, for the O2 A-Band spectrum simulation problem with almost two million patterns we obtained an increase in computation speed very close to the expected ideal, namely 743% when 8 cores were used.


References

1. Pinkus, A.: Approximation theory of the MLP model in neural networks. Acta Numerica 8, 143–195 (1999)
2. Sodan, A.C., Machina, J., Deshmeh, A., Macnaughton, K., Esbaugh, B.: Parallelism via Multithreaded and Multicore CPUs. Computer 43(3), 24–32 (2010)
3. Seiffert, U.: Artificial Neural Networks on Massively Parallel Computer Hardware. In: ESANN 2002 Proceedings - European Symposium on Artificial Neural Networks, April 24-26, pp. 319–330. Bruges, Belgium (2002)
4. Turchenko, V., Grandinetti, L.: Efficiency Analysis of Parallel Batch Pattern NN Training Algorithm on General-Purpose Supercomputer. In: Omatu, S., Rocha, M.P., Bravo, J., Fernández, F., Corchado, E., Bustillo, A., Corchado, J.M. (eds.) IWANN 2009. LNCS, vol. 5518, pp. 223–226. Springer, Heidelberg (2009)
5. Tsaregorodtsev, V.: Parallel Implementation of Back-Propagation Neural Network Software on SMP Computers. In: Malyshkin, V.E. (ed.) PaCT 2005. LNCS, vol. 3606, pp. 186–192. Springer, Heidelberg (2005)
6. Lotrič, U., Dobnikar, A.: Parallel Implementations of Recurrent Neural Network Learning. In: Kolehmainen, M., Toivanen, P., Beliczynski, B. (eds.) ICANNGA 2009. LNCS, vol. 5495, pp. 99–108. Springer, Heidelberg (2009)
7. Gallant, S.: Perceptron-based learning algorithms. IEEE Transactions on Neural Networks 1(2), 179–191 (1990)
8. Rumelhart, D., Hinton, G., Williams, R.: Learning Internal Representations by Error Propagation. In: Rumelhart, D.E., McClelland, J.L. (eds.) Parallel Distributed Processing, vol. I, pp. 318–362. MIT Press, Cambridge (1986)
9. Prechelt, L.: Automatic early stopping using cross validation: quantifying the criteria. Neural Networks 11(4), 761–767 (1998)
10. Riedmiller, M., Braun, H.: Rprop - A Fast Adaptive Learning Algorithm. In: Proceedings of the International Symposium on Computer and Information Science VII, Technical Report (1992)
11. Fahlman, S.: An Empirical Study of Learning Speed in Back-Propagation Networks. Computer Science Technical Report, CMU-CS-88-162 (1988)
12. Butenhof, D.R.: Programming with POSIX Threads. Addison-Wesley, Reading (1997) ISBN 0-201-63392-2
13. Quinn, M.J.: Parallel Programming in C with MPI and OpenMP. McGraw-Hill Inc., New York (2004) ISBN 0-07-058201-7
14. Kuhn, B., Petersen, P., O’Toole, E.: OpenMP versus threading in C/C++. Concurrency: Practice and Experience 12, 1165–1176 (2000)
15. Stamatakis, A., Ott, M.: Exploiting Fine-Grained Parallelism in the Phylogenetic Likelihood Function with MPI, Pthreads, and OpenMP: A Performance Study. In: Chetty, M., Ngom, A., Ahmad, S. (eds.) PRIB 2008. LNCS (LNBI), vol. 5265, pp. 424–435. Springer, Heidelberg (2008)
16. Alpaydin, E., Kaynak, C.: Optical Recognition of Handwritten Digits Data Set, http://archive.ics.uci.edu/ml/datasets/
17. Loyola, D., Coldewey-Egbers, M., Dameris, M., Garny, H., Stenke, A., Van Roozendael, M., Lerot, C., Balis, D., Koukouli, M.: Global long-term monitoring of the ozone layer - a prerequisite for predictions. International Journal of Remote Sensing 30(15), 4295–4318 (2009)
18. Loyola, D.: Applications of Neural Network Methods to the Processing of Earth Observation Satellite Data. Neural Networks 19(2), 168–177 (2006)
19. Tang, G., D’Azevedo, E., Zhang, F., Parker, J., Watson, B., Jardine, P.: Application of a hybrid MPI/OpenMP approach for parallel groundwater model calibration using multi-core computers. Computers & Geosciences 36(11), 1451–1460 (2010)

Supporting Diagnostics of Coronary Artery Disease with Neural Networks

Matjaž Kukar¹ and Ciril Grošelj²

¹ University of Ljubljana, Faculty of Computer and Information Science, Tržaška 25, SI-1001 Ljubljana, Slovenia
² Nuclear Medicine Department, University Medical Centre Ljubljana, Zaloška 7, SI-1001 Ljubljana, Slovenia

Abstract. Coronary artery disease is one of the most important causes of early mortality in the Western world. Therefore, clinicians seek to improve diagnostic procedures in order to reach reliable early diagnoses. In the clinical setting, coronary artery disease diagnostics is often performed in a sequential manner, where the four diagnostic steps typically consist of evaluation of (1) signs and symptoms of the disease and electrocardiogram (ECG) at rest, (2) sequential ECG testing during controlled exercise, (3) myocardial perfusion scintigraphy, and (4) finally coronary angiography, which is considered the “gold standard” reference method. Our study focuses on improving the diagnostic and probabilistic interpretation of scintigraphic images obtained from the penultimate step. We use automatic image parameterization on multiple resolutions, based on spatial association rules. Extracted image parameters are combined into more informative composite parameters by means of principal component analysis, and finally used to build automatic classifiers with neural networks and naive Bayes learning methods. Experiments show that our approach significantly increases diagnostic accuracy, specificity and sensitivity with respect to clinical results.

Keywords: multi-layered perceptron, radial basis function network, coronary artery disease, medical diagnostics, explanation.

A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part I, LNCS 6593, pp. 80–89, 2011. © Springer-Verlag Berlin Heidelberg 2011

1 Introduction

Coronary artery disease (CAD) is one of the world’s leading causes of mortality, and there is ongoing research into improving diagnostic procedures. The usual clinical process of coronary artery disease diagnostics is stepwise, consisting of four diagnostic levels: (1) evaluation of signs and symptoms of the disease and ECG (electrocardiogram) at rest, (2) ECG testing during controlled exercise, (3) myocardial scintigraphy and (4) coronary angiography. In this process, the fourth diagnostic level (coronary angiography) is considered the “gold standard” reference method. As this diagnostic procedure is invasive and rather unpleasant for the patients, as well as relatively expensive, there is a tendency to improve the diagnostic performance of the earlier diagnostic levels, especially of myocardial scintigraphy [8]. Approaches used for this purpose include applications of neural networks [14], expert systems [6], subgroup mining [5], statistical techniques [16], and rule-based approaches [10]. In our study we focus on automating and improving the diagnostic performance of myocardial scintigraphy (third step). Results of myocardial scintigraphy consist of a series of medical images that are taken both during rest and during controlled exercise. In clinical practice, expert physicians use their medical knowledge and experience as well as the image processing capabilities provided by various imaging software to manually describe and evaluate the images. We propose the use of fully automatic multi-resolution image parameterization, based on description with spatial association rules, coupled with evaluation with neural networks. Our experiments show that with this approach diagnostic performance can be significantly improved with respect to the results of clinical practice.

2 Methods

An important issue in image parameterization in general, and in our approach with association rules in particular, is the selection of appropriate resolution(s) for extracting the most informative textural features. Structural algorithms use descriptors of local relations between image pixels, where the search perimeter is bounded to a certain size, and therefore give different results at different resolutions. The resolution used for extracting parameters is important and depends on the observed domain. For finding suitable resolutions and extracting informative features we use the ARes and ArTex algorithms [17]. The obtained high-quality image parameters can be used for several purposes, among others to describe images with a relatively small number of features. They are subsequently used for neural network learning in order to build a model of the diagnostic process. Images corresponding to patients with known final diagnoses are used as learning data that, in conjunction with the applied learning methods, produce descriptive and predictive models for the diagnostic problem at hand.

2.1 Stepwise Diagnostic Process

The stepwise diagnostic process [15] is frequently used in clinical practice. Diagnostic tests are ordered in a sequence according to some pre-determined criteria, such as increasing invasiveness, cost, or diagnostic accuracy. The diagnostic process continues until some utility criteria are fulfilled (such as sufficiently high reliability of a diagnosis). Test results can be analyzed by sequential use of Bayes’ conditional probability theorem. The obtained post-test probability accounts for the pre-test probability, sensitivity and specificity of the test, and may later be used as the pre-test probability for the next test in the sequence (Figure 1). The first pre-test probability is typically estimated from tables concerning a broad population sample. The process results in a series of tests where each test is performed independently. Its results may be interpreted with or without any knowledge of the other test results. In medical diagnostics, the performance of a diagnostic test is frequently described with diagnostic accuracy (Acc), sensitivity (Se), and specificity (Sp):

Accuracy: $Acc = \frac{\#\text{true positives} + \#\text{true negatives}}{\#\text{all patients}}$

Sensitivity: $Se = \frac{\#\text{true positives}}{\#\text{all patients with the disease}}$

Specificity: $Sp = \frac{\#\text{true negatives}}{\#\text{all patients without the disease}}$

Fig. 1. Increasing the diagnostic test levels in stepwise diagnostic process of CAD

These quantities are subsequently used for post-test probability calculation with Bayes’ theorem [15]. Test results from earlier levels are used to obtain the final probability of disease. Diagnostic tests are performed until the post-test probability of the disease’s presence or absence exceeds some pre-defined threshold value (e.g., 90%). This approach may incorporate not only several test results but also data from the patient’s history [3]. Bayes’ theorem is applied to calculate the conditional probability of the disease’s presence when the result of a diagnostic test is given. For a positive or negative test result the respective post-test probabilities P(d|+) = P(disease|positive test result) or P(d|−) = P(disease|negative test result) are calculated from the pre-test probability P:

$P(d|+) = \frac{P \cdot Se}{P \cdot Se + (1 - P) \cdot (1 - Sp)}$

$P(d|-) = \frac{P \cdot (1 - Se)}{P \cdot (1 - Se) + (1 - P) \cdot Sp}$

This approach is especially useful when used in conjunction with neural networks, since the outputs of neural networks can hardly be utilized directly for probabilistic evaluation. However, by calculating a neural network’s performance in terms of sensitivity and specificity, we can apply Bayes’ theorem to produce its probabilistic evaluation.

2.2 Image Classification with Neural Networks

The ultimate goal of medical image analysis and image mining is a decision about the diagnosis. When images are described with informative numerical attributes, we can use various learning algorithms for generating a classification system (classifier) that produces diagnoses of the patients whose images are being processed. Our early work on the problem of diagnosing CAD from myocardial scintigraphy images [9] indicates that the naive¹ Bayesian classifier gives very good results. In the present paper we focus on using neural networks as our modelling paradigm.

¹ The so-called “naivety” of this particular Bayesian classifier lies in its assumption of the attributes’ conditional independence with respect to the class attribute.
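The sequential use of Bayes’ theorem described in Section 2.1 can be sketched as follows. This is a minimal illustration, not the authors’ code: the function names are hypothetical, and the sensitivity and specificity values in the usage note are made-up numbers rather than results from this study.

```python
def post_test_probability(pre, se, sp, positive):
    """Post-test probability of disease given one test result.

    pre: pre-test probability, se: sensitivity, sp: specificity,
    positive: True for a positive test result, False for a negative one.
    """
    if positive:
        return pre * se / (pre * se + (1.0 - pre) * (1.0 - sp))
    return pre * (1.0 - se) / (pre * (1.0 - se) + (1.0 - pre) * sp)

def stepwise_diagnosis(pre, tests, threshold=0.90):
    """Chain tests until presence or absence of disease is sufficiently certain.

    tests is a sequence of (se, sp, positive) tuples in the order performed;
    each post-test probability becomes the pre-test probability of the next test.
    """
    p = pre
    for se, sp, positive in tests:
        p = post_test_probability(p, se, sp, positive)
        if p >= threshold or p <= 1.0 - threshold:
            break
    return p
```

With a pre-test probability of 0.5 and a hypothetical test with Se = 0.8 and Sp = 0.9, one positive result raises the probability to about 0.89, which is still below the 90% threshold; a second positive result with the same test characteristics raises it to about 0.98, at which point the process stops.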


Neural networks have a certain appeal for expert physicians in many different fields, mostly due to their structural relatedness to biological systems and the human brain (at least in a considerably simplified manner). While in quantitative terms (diagnostic accuracy, sensitivity, specificity) they often perform well, their main drawback for clinical use is their opaqueness: it is difficult to understand a neural network’s reasoning, both for the general model and for a single prediction. We alleviate this problem by using the principle of general explanations [18]. In our experiments we use two well-known kinds of neural networks, the multilayered perceptron [2] and the radial basis function network [1], and compare them with the naive Bayes classifier.

2.3 ArTex and ARes Algorithms for Multi-resolution Image Parameterization

Images in digital form are normally described with data matrices. Such pixel-level data, however, are insufficient to uniformly distinguish between predefined image classes. Determining image features that can satisfactorily discriminate between the observed image classes is a difficult task for which several algorithms exist [13]. They transform the image from the matrix form into a set of numeric or discrete features (parameters) that convey useful high-level (compared to simple pixel intensities) information for discriminating between classes. Extracted image features are often based on structural, statistical or spectral properties of the image. For the purpose of diagnosis from medical images the structural description seems to be the most appropriate. We use the ArTex algorithm to obtain structural attributes [17], based upon spatial association rules. Association rule algorithms can be used for describing textures if an appropriate texture representation formalism is used. Association rules capture structural and statistical information and conveniently identify spatial relations that occur frequently. It is often beneficial to obtain association rules from the same image at several different resolutions, as they may convey different kinds of useful information. This means that we may get completely different image parameterization attributes for the same image at different scales. For automatic selection of relevant resolutions we use the ARes algorithm [17], which builds upon the well-known SIFT algorithm [12]. ARes proposes several most informative resolutions by counting local intensity peaks and ordering the resolutions by this count. The user (or an automatic heuristic criterion) then selects how many of the proposed resolutions are to be used.

3 Materials

In our study we use a dataset of 288 patients with suspected CAD who underwent clinical and laboratory examinations, exercise ECG, myocardial scintigraphy (including complete image sets) and coronary angiography. The features from the ECG and the scintigraphy data were extracted manually by the clinicians. Ten patients were excluded for the data pre-processing and calibration required by ArTex/ARes, so only 278 patients (66 females, 212 males, average age 60 years) were used in the actual experiments. In 149 cases the disease was angiographically confirmed and in 129 cases it was excluded. The patients were selected from a population of several thousand patients who were examined at the Nuclear Medicine Department, University Clinical Centre Ljubljana, between 2001 and 2006. We selected only the patients with complete diagnostic procedures and for whom the imaging data was readily available. Some characteristics of the dataset are shown in Table 1.

Table 1. CAD data for different diagnostic levels. Of the attributes belonging to the coronary angiography diagnostic level, only the final diagnosis (the two-valued class) was used.

| Diagnostic level | Nominal attributes | Numeric attributes | Total attributes |
|---|---|---|---|
| 1. Signs and symptoms | 22 | 5 | 27 |
| 2. Exercise ECG | 11 | 7 | 18 |
| 3. Myocardial scintigraphy | 8 | 2 | 10 (+9 image series) |
| 4. Coronary angiography | 1 | 6 | 7 |

Class distribution: 129 (46.40%) CAD negative, 149 (53.60%) CAD positive

The myocardial scintigraphy attributes consist of evaluations of myocardial defects (no defect, mild defect, well defined defect, serious defect) that could be observed in images either while resting or during controlled exercise. They are assessed for four different myocardial regions: the LAD, LCx, and RCA vascular territories, as well as the ventricular apex. Two additional attributes concern effective blood flow and volumes in the myocardium: left ventricular ejection fraction (LVEF) and end-diastolic volume (EDV).

3.1 Scintigraphic Images

For each patient a series of images was taken with the General Electric eNTEGRA SPECT camera, both at rest and after controlled exercise, producing a total of 64 grayscale images with a resolution of 64 × 64 pixels and 8 bits per pixel. Because of patients’ movements and partial obscuring of the myocardium by other internal organs, these images are not suitable for further use without heavy pre-processing. For this purpose, the ECToolbox workstation software [4] was used, and one of its outputs, a series of 9 polar map (bull’s eye) images, was taken for each patient. Polar maps were chosen since previous work in this field [11] had shown that they have useful diagnostic value. Unfortunately, in most cases (and especially in our specific population) the differences between images taken during exercise and at rest are not as clear-cut as shown in Figure 2. Interpretation and evaluation of scintigraphic images therefore requires considerable knowledge and experience of expert physicians. Although specialized tools such as the ECToolbox software can aid in this process, they still require special training and in-depth medical knowledge for the evaluation of results.

Fig. 2. Typical polar maps taken after exercise (left), and at rest (right). Shadows in the center of both images suggest inadequately perfused myocardial tissue, especially during exercise (left image). Images shown in this figure correspond to a patient with a very clear manifestation of CAD. Both images have a resolution of 64 × 64 pixels with 256 intensity levels per pixel.

4 Results Experiments were performed in the following manner. First, 10 learning examples (10 per-patient sets of nine images for CAD) were excluded for data preprocessing and calibration of ArTex/ARes. Images from the remaining examples were parameterized; only the obtained parameters were subsequently used for evaluation. Further testing was performed in the ten-fold cross-validation setting: at each step 90% of examples were used for building a classifier, and the remaining 10% of examples for testing. For neural network learning, the number of parameters (attributes) was reduced with feature extraction – by applying the principal component analysis (PCA) and retaining only the best principal components (those that together accounted for not less than 70% of data variance, amounting to 10 best components). Besides the described 10 components, an equal number of best attributes provided by physicians was used. We applied the naive Bayes classifier, as well as two types of neural networks: multilayered perceptron and RBF neural network. Aggregated results of the coronary angiography (CAD negative/CAD positive) were used as the class variable. The results of clinical practice were validated by careful blind evaluation of images by an independent expert physician. Significance of differences to clinical results was evaluated by using the McNemar’s test. 4.1 Results in CAD Diagnostics ArTex/ARes parameterization produced three resolutions: 0.95×, 0.80×, and 0.30× of the original resolution, producing together 2944 additional attributes. Since this number is too large for most practical purposes, it was reduced to 10 by applying feature extraction (with PCA). We also enriched the data representation by using the same number (10) of best physicians’ attributes and compare the results of machine learning with diagnostic accuracy, specificity and sensitivity of expert physicians after evaluation of scintigraphic images. 
It is gratifying to see that without any special tuning of learning parameters, the results are in all cases significantly better than the results of physicians in terms of classification (diagnostic) accuracy. While all applied methods significantly improve physicians’ results in all three criteria: diagnostic accuracy, sensitivity and specificity, they also perform brilliantly even without using physicians’ attributes. Especially good results are that of the multilayered perceptron and naive


M. Kukar and C. Grošelj

Table 2. Experimental results of machine learning classifiers on parameterized images obtained by selecting only the best 10 attributes from PCA on ArTex/ARes (also combined with the 10 best attributes provided by physicians). Classification accuracy results that are significantly better (p < 0.05) than clinical results are emphasized.

                          PCA on ArTex/ARes                     Physicians + PCA on ArTex/ARes
                          Accuracy  Specificity  Sensitivity    Accuracy  Specificity  Sensitivity
Naive Bayes               81.3      83.7         79.2           80.9      82.9         79.2
Multilayered Perceptron   82.4      79.8         84.6           81.3      79.1         83.2
RBF Network               79.1      78.3         79.9           79.1      79.1         79.1
Clinical                  64.0%     71.1%        55.8%          64.0%     71.1%        55.8%

Bayes classifier (Table 2). The multilayered perceptron improved overall diagnostic accuracy over physicians by 18 percentage points, specificity by 8 and sensitivity by 29. Naive Bayes, on the other hand, improved diagnostic accuracy by 17 percentage points, specificity by 12 and sensitivity by 24. Although overall accuracy is slightly better with the multilayered perceptron, in practice the naive Bayes results would be more useful, as in CAD diagnostics physicians are more interested in improving specificity. For the multilayered perceptron, emphasis on correct diagnostics of the negative class (specificity) could be produced by employing cost-sensitive learning [7]. ROC curves for the multilayered perceptron and naive Bayes are depicted in Figure 3 (using only automatically generated attributes) and Figure 4 (using both automatically generated and physicians’ attributes). It seems that automatic attributes are at least as informative as physicians’, as the ROC curves and the areas under them (AUCs) are virtually identical. Notice how the naive Bayes curves are positioned more towards the left (indicating higher specificity), while the multilayered perceptron’s curves lie more towards the top (indicating higher sensitivity).
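The improvements quoted above can be checked against Table 2 with a few lines of arithmetic. The sketch below uses only the values for automatically generated attributes and reports absolute gains over the clinical results; the function and variable names are illustrative. The text quotes these gains as whole numbers.

```python
# Values copied from Table 2 (PCA on ArTex/ARes only), in percent.
clinical = {"accuracy": 64.0, "specificity": 71.1, "sensitivity": 55.8}
classifiers = {
    "Multilayered Perceptron": {"accuracy": 82.4, "specificity": 79.8, "sensitivity": 84.6},
    "Naive Bayes": {"accuracy": 81.3, "specificity": 83.7, "sensitivity": 79.2},
}


def gains(scores, baseline):
    """Absolute gain over the baseline for every measure, rounded to one decimal."""
    return {measure: round(scores[measure] - baseline[measure], 1) for measure in baseline}


improvement = {name: gains(scores, clinical) for name, scores in classifiers.items()}
for name, gain in improvement.items():
    print(name, gain)
```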

[Figure 3: ROC curves, both axes ranging from 0 to 1. Panels: (a) Multilayered Perceptron, AUC=0.8655; (b) Naive Bayes, AUC=0.8659]

Fig. 3. ROC curves for automatically generated attributes only (ArTex/ARes + PCA). In each figure, x-axis values represent the false positive rate (1 − Specificity), whereas y-axis values represent the true positive rate (Sensitivity).

Supporting Diagnostics of Coronary Artery Disease with Neural Networks

[Figure 4: ROC curves, both axes ranging from 0 to 1. Panels: (a) Multilayered Perceptron, AUC=0.8725; (b) Naive Bayes, AUC=0.8569]

Fig. 4. ROC curves for automatically generated attributes (ArTex/ARes + PCA) and best physicians’ attributes. In each figure, x-axis values represent the false positive rate (1 − Specificity), whereas y-axis values represent the true positive rate (Sensitivity).

4.2 Explaining the Neural Network’s Predictions by Evaluation of Individual Attribute’s Contributions

To explain the otherwise incomprehensible predictions of neural networks, we utilized a general classifier explanation method described in [18]. It is built upon a game-theoretic perspective, treating attributes as players and representing their contributions in a nomogram-like manner. The main characteristics of this method are versatility, comprehensibility and generality. To achieve this, the applied methodology avoids anything model-specific, essentially treating models (machine learning classifiers) as black boxes; it is limited to changing the inputs (attribute values) and observing the outputs. An attribute’s contribution to a particular classification is defined as the average change in prediction when the attribute’s value is permuted. For each testing example (patient), each attribute’s contribution to the classification by the particular classifier is calculated. Its values are normalized between −1 and 1, where −1 indicates the strongest opposition to the classification and 1 the strongest conformance with it (Figure 5).
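The black-box idea can be illustrated with a deliberately simplified sketch: an attribute's contribution is taken as the prediction for the patient minus the average prediction obtained when that one attribute is replaced by the values it takes in a reference data set. This is only an assumption-laden approximation of the definition above; it is not the sampling-based game-theoretic algorithm of [18], it omits the normalization to the interval from −1 to 1, and `toy_model`, `data` and `patient` are hypothetical.

```python
def contribution(predict, x, j, data):
    """Prediction for x minus the mean prediction when attribute j of x
    is replaced by the values it takes in the reference data."""
    baseline = predict(x)
    total = 0.0
    for row in data:
        z = list(x)       # copy the example, leave the original untouched
        z[j] = row[j]     # perturb only attribute j
        total += predict(z)
    return baseline - total / len(data)


def toy_model(x):
    """Hypothetical black box: the output depends on attribute 0 only."""
    return 0.2 + 0.6 * x[0]


data = [[0, 0], [1, 1], [0, 1], [1, 0]]   # hypothetical reference examples
patient = [1, 0]                          # hypothetical testing example
contributions = [contribution(toy_model, patient, j, data) for j in range(len(patient))]
print(contributions)
```

In this toy setting only attribute 0 receives a nonzero contribution, which is the behaviour one would expect from a method that merely changes inputs and observes outputs.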

5 Discussion

We describe an automatic approach for coronary artery disease diagnostics, based upon classification of SPECT images utilizing multi-resolution image parameterization and neural networks. We show that all aspects of diagnostic performance can be significantly improved with respect to the results of clinical practice. Utilizing neural networks for image classification can help less experienced physicians in evaluation of medical images and thus improve their diagnostic performance (in terms of accuracy, sensitivity and specificity). By utilizing the described approaches, practical improvements of the diagnostic procedure can be expected. Higher diagnostic accuracy (up to 18%) is by itself a very considerable gain. Due to higher specificity of tests (up to 8%), fewer patients without the disease would have to be examined with coronary angiography


Fig. 5. Explanation of a single MLP’s prediction. Each attribute’s relative contribution for (positive value) or against (negative value) the given MLP’s prediction is depicted as a bar. Attributes with contributions above or below a certain threshold (say +0.1 and −0.1) are considered relevant factors in this particular prediction. The MLP’s prediction can roughly be explained as a sum of all attributes’ contributions.

which is an invasive and therefore dangerous method. Together with higher sensitivity (up to 29%), this would save money and shorten the waiting times of the truly ill patients. By using advanced evaluation techniques, such as stepwise diagnostics (for probabilistic interpretation of diagnoses) and explanation of classifications, neural networks are becoming eminently useful also in sensitive fields, where explanation and interpretation of their decisions are of utmost importance. Last but not least, we have to emphasize that the results of our study were obtained on a significantly restricted population and therefore may not be generally applicable to the normal population, i.e. to all the patients coming to the Nuclear Medicine Department, University Clinical Centre Ljubljana, Slovenia.


Acknowledgements. We thank Erik Štrumbelj for comments and cooperation regarding universal explanations. This work was supported by the Slovenian Ministry of Higher Education, Science, and Technology.

References

1. Broomhead, D.S., Lowe, D.: Multivariable functional interpolation and adaptive networks. Complex Systems 2, 321–355 (1988)
2. Rumelhart, D.E., McClelland, J.L.: Parallel Distributed Processing. Foundations, vol. 1. MIT Press, Cambridge (1986)
3. Diamond, G.A., Forester, J.S.: Analysis of probability as an aid in the clinical diagnosis of coronary artery disease. New England Journal of Medicine 300(1350) (1979)
4. General Electric. Ectoolbox protocol operator’s guide (2001)
5. Gamberger, D., Lavrac, N., Krstacic, G.: Active subgroup mining: a case study in coronary heart disease risk group detection. Artif. Intell. Med. 28(1), 27–57 (2003)
6. Garcia, E.V., Cooke, C.D., Folks, R.D., Santana, C.A., Krawczynska, E.G., De Braal, L., Ezquerra, N.F.: Diagnostic performance of an expert system for the interpretation of myocardial perfusion SPECT studies. J. Nucl. Med. 42(8), 1185–1191 (2001)
7. Kukar, M., Kononenko, I.: Cost-sensitive learning with neural networks. In: Proc. European Conference on Artificial Intelligence ECAI 1998, Brighton, UK, pp. 445–449 (1998)
8. Kukar, M., Kononenko, I., Grošelj, C., Kralj, K., Fettich, J.: Analysing and improving the diagnosis of ischaemic heart disease with machine learning. Artificial Intelligence in Medicine 16(1), 25–50 (1999)
9. Kukar, M., Šajn, L., Grošelj, C., Grošelj, J.: Multi-resolution image parametrization in stepwise diagnostics of coronary artery disease. In: Bellazzi, R., Abu-Hanna, A., Hunter, J. (eds.) AIME 2007. LNCS (LNAI), vol. 4594, pp. 119–129. Springer, Heidelberg (2007)
10. Kurgan, L.A., Cios, K.J., Tadeusiewicz, R.: Knowledge discovery approach to automated cardiac SPECT diagnosis. Artif. Intell. Med. 23(2), 149–169 (2001)
11. Lindahl, D., Palmer, J., Pettersson, J., White, T., Lundin, A., Edenbrandt, L.: Scintigraphic diagnosis of coronary artery disease: myocardial bull’s-eye images contain the important information. Clinical Physiology 6(18) (1998)
12. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision 60(2), 91–110 (2004)
13. Nixon, M., Aguado, A.S.: Feature Extraction and Image Processing, 2nd edn. Academic Press, Elsevier (2008)
14. Ohlsson, M.: WeAidU–a decision support system for myocardial perfusion images using artificial neural networks. Artificial Intelligence in Medicine 30, 49–60 (2004)
15. Olona-Cabases, M.: The probability of a correct diagnosis. In: Candell-Riera, J., Ortega-Alcalde, D. (eds.) Nuclear Cardiology in Everyday Practice, pp. 348–357. Kluwer, Dordrecht (1994)
16. Slomka, P.J., Nishina, H., Berman, D.S., Akincioglu, C., Abidov, A., Friedman, J.D., Hayes, S.W., Germano, G.: Automated quantification of myocardial perfusion SPECT using simplified normal limits. J. Nucl. Cardiol. 12(1), 66–77 (2005)
17. Šajn, L., Kononenko, I.: Multiresolution image parametrization for improving texture classification. EURASIP J. Adv. Signal Process 2008(1), 1–12 (2008)
18. Štrumbelj, E., Kononenko, I.: An efficient explanation of individual classifications using game theory. Journal of Machine Learning Research 11, 1–18 (2010)

The Right Delay
Detecting Specific Spike Patterns with STDP and Axonal Conduction Delays

Arvind Datadien1, Pim Haselager1,2, and Ida Sprinkhuizen-Kuyper1,2

1 Department of Artificial Intelligence, Radboud University Nijmegen
2 Radboud University Nijmegen, Donders Institute for Brain, Cognition and Behaviour, The Netherlands
[email protected], {w.haselager,i.kuyper}@donders.ru.nl

Abstract. Axonal conduction delays should not be ignored in simulations of spiking neural networks. Here it is shown that by using axonal conduction delays, neurons can display sensitivity to a specific spatio-temporal spike pattern. By using delays that complement the firing times in a pattern, spikes can arrive simultaneously at an output neuron, giving it a high chance of firing in response to that pattern. An unsupervised learning mechanism called spike-timing-dependent plasticity then increases the weights for connections used in the pattern, and decreases the others. This allows for an attunement of output neurons to specific activity patterns, based on temporal aspects of axonal conductivity.

Keywords: Spiking neural networks, STDP, Axonal delay, Spatiotemporal pattern.

1 Introduction

Our sensory organs convert stimuli from the outside world into electrical signals (spikes) that can be processed by neurons in the brain. When a stimulus is presented, it will result in multiple neurons firing at specific times. These neurons create what is called a spatio-temporal spike (or firing) pattern. The spatial aspect is defined by which neurons fire, and the temporal aspect is defined by the point in time at which they fire. If a neuron can learn to fire only in response to one pattern, it could (among other things) be used to trigger an appropriate response to the stimulus that caused that pattern.

Spikes in spike patterns do not arrive instantly at other neurons. There is a delay that is caused by the electrical signal traveling along an axon, called an axonal conduction delay (also referred to as transmission delay or simply conduction delay). This delay is often overlooked in simulations of spiking neural networks. However, Izhikevich [5] has shown that the presence of axonal conduction delays of various lengths might be very important to the functioning of the brain. Because these delays can be as small as 0.1 ms, or as large as 44 ms, and the brain can reproduce spike timings with sub-millisecond precision, it seems important that axonal conduction delays should not be ignored in simulations.

A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part I, LNCS 6593, pp. 90–99, 2011. © Springer-Verlag Berlin Heidelberg 2011


It was shown in [6] that a neuron can learn to detect a spatio-temporal spike pattern, by using a learning mechanism called Spike-Timing-Dependent Plasticity (STDP). The neuron will ﬁre after presentation of the spike pattern it has learned, and not when it receives randomly timed spikes or other ﬁring patterns. If the spike pattern is of considerable length (tens of milliseconds), the neuron will at ﬁrst learn to ﬁre at a random moment during the pattern, and then start to ﬁre earlier, until it ﬁres near the very start of the spike pattern. The study [6], however, did not model axonal conduction delays of varying lengths. In another study [7], it was shown that longer and/or multiple patterns can be detected by adding inhibitory connections between output neurons, thus causing neurons to compete for the right to ﬁre at a certain time. This prevents the neurons from all ﬁring together at the start of a pattern, and causes them to learn to ﬁre in response to diﬀerent parts of a pattern. The sequential ﬁrings of some neurons will then signify the occurrence of one entire pattern, and diﬀerent neurons ﬁring in a speciﬁc order will indicate the occurrence of another spike pattern. In [7], varying axonal conduction delays were not taken into account. In [7], it was up to chance which pattern a neuron would learn to respond to when multiple patterns were presented. In this paper we are interested to see if we can create a neuron that will respond only to one speciﬁc spike pattern, of which the form is known in advance. This would provide for the functionality analogous to that of an attunement mechanism, based on mechanisms that are known to exist in the brain. When trying to detect a known spike pattern in a spiking neural network, neurons ﬁring nearly simultaneously at some moment during the pattern could be selected, and the weights of their connections to the output neuron increased, so that their simultaneous activity would cause the output neuron to ﬁre. 
The more neurons are involved in this process, the less likely it becomes that they all fire together during other spike patterns, and thus fewer incorrect firings of the output neuron would occur. However, using just this method, it would be impossible to reliably detect spike patterns that do not involve a large number of neurons firing simultaneously, since the chance of false positives would increase. We suggest that the use of axonal conduction delays can help to solve this problem. By delaying spikes in the pattern by various amounts, they can be made to arrive simultaneously at the output neuron, instead of in the order in which they were originally created (see Fig. 1). Thus, a pattern becomes easier to detect for a neuron if the delays of its incoming connections match the firing times of that pattern. Weights of connections that are involved in such a delay-based detection of a pattern can be increased, and others decreased, by the unsupervised learning mechanism STDP.

In this paper we present a simulation experiment in which we attempt to allow two spiking neurons to learn to fire only in response to two specific spatio-temporal spike patterns, by using matching axonal conduction delays and STDP. In contrast to [7], here the goal is not only to detect multiple patterns, but also to predetermine which neuron will detect which pattern, i.e. to build a dedicated attunement mechanism.
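The idea of delays that complement the firing times can be made concrete with a small sketch. Given the firing time of each input neuron within a pattern, every connection receives a delay such that all spikes reach the output neuron in the same millisecond. The 1 ms minimum delay is taken from Sect. 2.2; the neuron names and firing times are hypothetical.

```python
MIN_DELAY_MS = 1  # minimum axonal conduction delay (Sect. 2.2)


def matching_delays(firing_times_ms):
    """Delay per input neuron so that all spikes of the pattern arrive together.

    The neuron that fires last gets the minimum delay; earlier neurons
    get correspondingly longer delays.
    """
    latest = max(firing_times_ms.values())
    return {neuron: latest - t + MIN_DELAY_MS for neuron, t in firing_times_ms.items()}


# Hypothetical pattern: input neuron -> firing time in ms.
pattern = {"in0": 3, "in1": 0, "in2": 12, "in3": 7}
delays = matching_delays(pattern)
arrivals = {neuron: pattern[neuron] + delays[neuron] for neuron in pattern}
print(delays)
print(arrivals)
```

With these delays a different pattern over the same neurons would generally arrive spread out in time, which is what makes the output neuron selective.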


Fig. 1. The firing times of six input neurons, and the times at which their spikes arrive at the output neuron. The lengths of the axonal conduction delays for the connections between each input neuron and the output neuron are displayed on the right axis. The delays have been chosen so that all spikes in pattern 1 arrive simultaneously at the output neuron.

The following subsection will briefly explain the workings of STDP. In the next section, details of our implementation will be described. After presenting the main results of our simulation, we will finish with a short discussion of our findings.

1.1 Spike-Timing-Dependent Plasticity

One learning method that has been observed to occur in the brain is Spike-Timing-Dependent Plasticity (STDP). It is the alteration of a synaptic weight, based on the difference between the firing times of two connected neurons or, when we take into account axonal conduction delays, the difference between the arrival time of a spike at a neuron and the firing time of that neuron. Weight changes caused by STDP remain over longer periods of time. A decrease in weight is therefore called Long Term Depression (LTD), and an increase is called Long Term Potentiation (LTP). Mechanisms for short term synaptic plasticity also exist, but will not be used in this paper.

Multiple STDP rules have been observed in biological neurons [1]. The following STDP rule (also shown in Fig. 2) is often found in connections from excitatory to excitatory neurons, and was taken from [5]:

\[
g(t) = \begin{cases} A_{+} \, e^{-t/\tau_{+}} & \text{if } t > 0 \\ A_{-} \, e^{t/\tau_{-}} & \text{if } t \le 0 \end{cases} \tag{1}
\]

Here t is defined as the last postsynaptic firing time minus the last presynaptic spike arrival time. The other variables are fixed parameters. This function increases the weight of a connection if the presynaptic neuron fires shortly before the postsynaptic neuron, and decreases the weight if the order is reversed. By doing this, it increases a neuron’s ability to make another neuron fire, if they already fired together. This effect was postulated for biological neurons by Hebb in 1949 [2]. He stated that “When an axon of cell A is near enough to excite cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A’s efficiency, as one of the cells firing B, is increased”. This is the reason why STDP is sometimes called a Hebbian learning rule [1].

Fig. 2. Two STDP functions f (t), and g(t). Interval t is the postsynaptic ﬁring time minus the presynaptic spike arrival time in ms. Δw is the resulting change in weight of the connection in mV. Function g(t) is deﬁned in (1) and has been observed in biological neurons. Function f (t) is deﬁned in (2) and was used in our simulations due to better performance.

As long as the total amount of LTD is greater than the amount of LTP in an STDP rule, it causes the weight of the connection between two neurons that ﬁre at uncorrelated times from each other, to be reduced to zero. In other words, it allows a neuron to disregard ‘random’ input.
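A minimal sketch of the exponential rule g(t) from (1) may help to see how the sign of t selects between potentiation and depression. The parameter values below are hypothetical, since none are listed here, and the amplitude for t ≤ 0 is assumed to be negative so that this branch yields a weight decrease. They are chosen so that the total amount of LTD exceeds the total amount of LTP, as required in the paragraph above.

```python
import math

# Hypothetical parameter values (assumptions, not taken from the paper).
A_PLUS = 0.10      # LTP amplitude
A_MINUS = -0.12    # LTD amplitude, assumed negative
TAU_PLUS = 20.0    # LTP time constant in ms
TAU_MINUS = 20.0   # LTD time constant in ms


def g(t):
    """Weight change for interval t = postsynaptic firing time minus spike arrival time (ms)."""
    if t > 0:
        return A_PLUS * math.exp(-t / TAU_PLUS)    # spike arrived before firing: LTP
    return A_MINUS * math.exp(t / TAU_MINUS)       # spike arrived at or after firing: LTD


for t in (-40, -5, 0, 5, 40):
    print(t, round(g(t), 4))
```

Because the area under the LTD branch (|A−|·τ−) is larger than the area under the LTP branch (A+·τ+), spikes arriving at uncorrelated times lead, on average, to a net decrease in weight.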

2 Materials and Methods

To conduct our experiment, we have constructed a Java program that simulates a spiking neural network with axonal conduction delays and STDP. It allows for easy inspection of the firing times and weight changes, as they occur during simulation. In the following subsections, we describe the neuron model, the network, and our STDP implementation.

2.1 The Neuron Model

Neurons in the network are simulated with Izhikevich’s Simple Model [3], which is shown and described in Fig. 3. This model can exhibit many characteristics of biological neurons, at a reasonably low computational cost. By simply changing parameters, different types of neurons can be simulated. Two alternative models are the simpler Leaky Integrate-and-Fire model and the more complex Hodgkin-Huxley model. For a more elaborate comparison of models of spiking neurons, see [4].

In the network used here, the output neurons were configured to behave as regular spiking neurons by using the parameter values a = 0.02, b = 0.2, c = −65, and d = 6. The input neurons were required to always fire exactly when we wanted them to. Because the regular spiking neuron type has biologically realistic properties such as a refractory period, it was not suitable for the input neurons. Instead, the parameter values a = 0, b = 0, c = −65, and d = 0 were used for the input neurons, to obtain a neuron type that will always fire instantly when it receives a large current I of 100 mV for 1 ms as input.

[Figure 3: plot of v(t) and u(t), annotated with "peak 30 mV", "reset c", "reset d", "decay with rate a" and "sensitivity b", together with the model equations:]

\[
v' = 0.04v^{2} + 5v + 140 - u + I, \qquad u' = a(bv - u)
\]
\[
\text{if } v = 30 \text{ mV, then } v \leftarrow c, \; u \leftarrow u + d
\]

Fig. 3. Izhikevich’s Simple Model: a model of a neuron’s membrane potential over time. The variable v is the membrane potential, and u is a recovery value. v′ = dv/dt and u′ = du/dt. The variables a, b, c, and d are manually chosen parameters; typical values for them are discussed in [3]. The variable I is the incoming potential, usually from spikes received from other neurons. All values are chosen so that time is in ms scale, and weight is in mV scale. The neuron is considered to have fired when its membrane potential reaches 30 mV, after which the values of v and u are reset. Electronic version of the figure and reproduction permissions are freely available at www.izhikevich.com.

2.2 The Network

In our software, connections between neurons are determined at the start of the simulation. Any neuron in the network can have a connection to any other neuron, or to itself. Two neurons have at most one connection in each direction. Connections are either excitatory, meaning that arriving spikes will increase the value of I in the receiving neuron, or inhibitory, meaning that spikes will decrease the value of I. The axonal conduction delay for each connection is determined at the start of the simulation, and does not change. The minimum axonal conduction delay is 1 ms, meaning that a spike will arrive at the target neuron 1 ms after the source neuron has fired. Connection weights are set randomly at the start, and may change due to STDP, but are kept between a maximum and minimum value.

The network algorithm used here is synchronous, or “clock-driven”, meaning that there is a simulated clock that is advanced in discrete time steps. In the software used here, each time step, or “tick”, is 1 ms long. After each tick, the new state of the network must be determined. This is done in the following steps:

1. For each input neuron that we want to fire at this time, add an incoming potential of 100 mV.
2. For each neuron, advance all the spikes that are currently traveling along each of its axons by 1 ms. If a spike arrives at a neuron, add an incoming potential to that neuron equal to the weight of the connection, and apply STDP. In this step, interval t will always be negative (see Sect. 2.3 for details).
3. For each neuron, update its membrane potential, and note if it has fired. The new membrane potential is determined by using Euler’s method to approximate the solutions to the differential equations in Fig. 3. This is done in 5 steps of 0.2 ms.
4. For each neuron that has fired during this time step, create a spike on each of its axons. If a neuron has fired, apply STDP to each of its incoming connections. In this step, interval t will always be positive, or zero (see Sect. 2.3 for details).
5. For each neuron, apply the net change to its weights, according to the previously determined STDP values.

2.3 STDP Implementation

The weight change that results from STDP in one tick is determined in steps 2 and 4 above. In step 2, a spike from neuron A arrives at neuron B at time t_spike. Assume that neuron B fired in the past, at time t_fire. Here, t_fire < t_spike, so interval t is negative. To determine the weight change for the connection, we look at the left part of f(t) in Fig. 2. In step 4, a neuron B fires at time t_fire. Assume that a spike from neuron A arrived at B in the past, at time t_spike. Here, t_fire ≥ t_spike, so interval t is positive or zero. To determine the weight change for the connection, we look at the right part of f(t) in Fig. 2. For our implementation of STDP, we used the following function, which was inspired by [8]:

\[
f(t) = \begin{cases} 0 & \text{if } t < -200 \text{ or } t > 200 \\ -0.006 & \text{if } -200 \le t < 0 \text{ or } 10 \le t \le 200 \\ 0.05 & \text{if } 0 \le t < 10 \end{cases} \tag{2}
\]

We used f(t) instead of the more biologically plausible function g(t), since tests showed better results with the former. Experimentation with the neuron model showed that a presynaptic spike causes an increased membrane potential in the postsynaptic neuron for about 8 to 10 ms. After that, the membrane potential has returned to its resting state. The size of the synaptic weight does not affect this period. So, only within the window 0 ≤ t < 10 will the arrival of a spike contribute to the firing of a neuron. It is for this reason that LTP occurs only within that window in function f(t).
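The piecewise rule (2) translates directly into code. The sketch below follows the constants given above; the clamping helper reflects the statement in Sect. 2.2 that weights are kept between a minimum and a maximum value, but the bounds used here (0 mV and 2 mV) and the name `update_weight` are assumptions for illustration.

```python
def f(t):
    """STDP weight change in mV for interval t in ms (postsynaptic firing time minus spike arrival time)."""
    if t < -200 or t > 200:
        return 0.0       # outside the window: no effect
    if 0 <= t < 10:
        return 0.05      # spike arrived shortly before firing: LTP
    return -0.006        # everywhere else inside the window: LTD


def update_weight(weight, t, w_min=0.0, w_max=2.0):
    """Apply f(t) and keep the weight inside [w_min, w_max] (bounds are assumed)."""
    return min(w_max, max(w_min, weight + f(t)))


for t in (-250, -200, -1, 0, 9, 10, 200, 201):
    print(t, f(t))
```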


Both STDP rules f (t) and g(t) have a limited window in which they have an eﬀect on weights, be it positive or negative (see Fig. 2). Outside this window, the change in weights is (nearly) zero. The limited window of eﬀect of an STDP rule is required, in order for a neuron to remember the weights it has learned for a pattern, even when the pattern is not presented for a long period of time. During that time, random spikes will arrive. Because they all arrive after the last time the neuron ﬁred (during a pattern), they would all cause LTD, were it not for this limit. Because of the window, as long as a neuron doesn’t ﬁre too often outside of its pattern, it won’t forget. The software implementation discussed here was tested by qualitatively replicating the results of [6] and [7]. We ﬁrst successfully created a neuron that learned to detect the start of a spatio-temporal spike pattern as was done in [6]. Following that, we used multiple output neurons with inhibiting connections between them, to detect multiple patterns, as was done in [7]. In both tests we set all conduction delays to 1 ms. These ﬁndings indicate that, although we use a diﬀerent STDP rule and diﬀerent neuron models, our software works well as a spiking neural network. With this in mind, we used our implementation to conduct a new experiment, taking into account axonal conduction delays.
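For readers who want to experiment, the neuron update from Sect. 2.1 and step 3 of the tick procedure can be sketched as follows. The original program is written in Java and is not reproduced here; this is an independent Python sketch. The parameters are those given for the regular spiking output neurons, and each 1 ms tick is integrated with Euler's method in 5 steps of 0.2 ms. Checking for the 30 mV peak after every sub-step, the initial state, and the constant input used in the demonstration are assumptions.

```python
# Regular spiking output neuron parameters from Sect. 2.1.
A, B, C, D = 0.02, 0.2, -65.0, 6.0
PEAK_MV = 30.0
SUBSTEPS, DT_MS = 5, 0.2   # Euler's method: 5 steps of 0.2 ms per 1 ms tick


def tick(v, u, current):
    """Advance one neuron by 1 ms and return (v, u, fired)."""
    fired = False
    for _ in range(SUBSTEPS):
        dv = 0.04 * v * v + 5.0 * v + 140.0 - u + current
        du = A * (B * v - u)
        v += DT_MS * dv
        u += DT_MS * du
        if v >= PEAK_MV:               # assumption: peak is checked after every sub-step
            v, u, fired = C, u + D, True
    return v, u, fired


def count_spikes(current, duration_ms=200):
    """Number of ticks in which the neuron fired under a constant input (hypothetical protocol)."""
    v, u = C, B * C                    # assumed initial state
    spikes = 0
    for _ in range(duration_ms):
        v, u, fired = tick(v, u, current)
        spikes += fired
    return spikes


print(count_spikes(0.0), count_spikes(10.0))
```

Without input the membrane potential settles at its resting value and the neuron stays silent, whereas a sustained input makes it fire repeatedly.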

3 Detecting Specific Spatio-Temporal Spike Patterns

For the first part of our experiment, we created two spatio-temporal spike patterns, and set the axonal conduction delays between 100 input neurons and an output neuron so that they would match the first pattern. The same was done for a second output neuron and the second pattern. Neurons that did not fire during a pattern were given the minimum delay of 1 ms. The network layout consists of two layers of neurons, as can be seen in Fig. 4. All weights were initialized to a value chosen randomly between 1 mV and a maximum weight (2 or 5 mV). The network was presented input in cycles that were 120 ms long. Each cycle consisted of 6 parts (4 random parts and 2 patterns) that were 20 ms long, and were placed in random order within the cycle. The patterns remained the same, but the random parts were recreated for every cycle. There was no separate learning phase; STDP was always active. The weights stabilized after a while, and results were recorded some time after this.

Fig. 4. The network layout for our experiment: n input neurons I (n = 100) are connected to m output neurons O (m = 2). The connections are excitatory, have a variable weight w_ij and a fixed axonal conduction delay d_ij.

The maximum weight of the excitatory connections was initially set to 5 mV. STDP caused all weights of the connections used in the patterns to be increased to this maximum value. All other connections were reduced to nearly zero. As can be seen in Table 1, with a maximum weight of 5 mV, each output neuron fired during 100% of the presentations of its matching pattern, but also during about 30% of the presentations of the non-matching pattern. With a maximum weight of 2 mV, the false positive firings were reduced to 0%. There were no firings during random input parts. An example of a period of activity after learning can be seen in Fig. 5.
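The input protocol described above (120 ms cycles made of two fixed patterns and four freshly generated random parts, each 20 ms long, in random order) can be sketched as follows. How the random parts and patterns were actually generated is not specified, so the rule used here, where each neuron fires at most once per part with a hypothetical probability `p_fire`, is an assumption.

```python
import random

N_INPUTS = 100   # input neurons
PART_MS = 20     # length of one part
N_RANDOM = 4     # random parts per cycle


def random_part(rng, p_fire=0.5):
    """Hypothetical spike generator: each neuron fires at most once, with probability p_fire."""
    return [(n, rng.randrange(PART_MS)) for n in range(N_INPUTS) if rng.random() < p_fire]


def build_cycle(patterns, rng):
    """Return (spikes, labels) for one cycle; spikes are (neuron, time in ms) pairs."""
    parts = [("pattern%d" % i, p) for i, p in enumerate(patterns, start=1)]
    parts += [("random", random_part(rng)) for _ in range(N_RANDOM)]  # recreated every cycle
    rng.shuffle(parts)                                                # random order of parts
    spikes = []
    for slot, (_, part) in enumerate(parts):
        spikes += [(n, slot * PART_MS + t) for n, t in part]
    return sorted(spikes, key=lambda s: s[1]), [label for label, _ in parts]


rng = random.Random(0)
patterns = [random_part(rng), random_part(rng)]   # two fixed patterns, created once
spikes, labels = build_cycle(patterns, rng)
print(labels, len(spikes))
```

Calling `build_cycle` repeatedly with the same `patterns` keeps the two patterns unchanged while the random parts and the order of the parts vary from cycle to cycle.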

Fig. 5. Firing times after learning. Each row represents the ﬁring times of one neuron. The top 100 are input neurons, the bottom two are output neurons. We see that each output neuron ﬁres during a diﬀerent pattern.

Following these good results, we decided to test performance in a case where only the top half of the axonal conduction delays matched a pattern. The bottom half of the connections used in the patterns were configured with delays of 1 ms. As can be seen in the last two rows of Table 1, reducing the number of matching delays increased the number of false positive firings. Reducing the maximum weight again decreased the false positives, but also reduced the number of positive firings. As could be expected, performance dropped.

Table 1. The percentages of positive and false positive firings for varying maximum weights and varying percentages of matching delays. When an output neuron fires during its matching pattern, we call it a positive, and when it fires during its non-matching pattern we call it a false positive firing.

Max weight   Matching delays   Positive firings   False positive firings
5 mV         100%              100%               30%
2 mV         100%              100%               0%
5 mV         50%               100%               50%
2 mV         50%               78%                0%


Fig. 6. Two input patterns with varying amounts of jitter (0, 1, 2, 3, and 4 ms)

Table 2. The percentages of positive firings for varying amounts of jitter. The second and third columns show the number of dead and living neurons after learning. The last two columns show the percentage of correct firings.

Jitter   Dead   Alive   Positive firings   Positive firings among living neurons
1 ms     0      12      99%                99%
2 ms     0      12      93%                93%
3 ms     2      10      65%                77.9%
4 ms     9      3       19%                72%

Lastly, we tested performance in cases where the patterns were distorted. As in the original case, the axonal conduction delays were set so that for each output neuron they fully matched one of the two original undistorted patterns. The maximum weight was set to 2 mV. The patterns were then changed by moving each spike randomly forward or backward in time, from 0 ms to a maximum jitter value (see Fig. 6). The positive ﬁring rates of 12 output neurons were recorded over 6 trials, for each jitter value between 1 and 4 ms. There were a total of 12 neurons because each of the 6 trials contained 2 neurons (one for each pattern). The results are presented in Table 2. Here we see that there were cases where output neurons died, meaning that they stopped ﬁring altogether. Those neurons obviously had a positive ﬁring rate of 0%. In the last column we ignore these dead neurons so that we can still see how often on average the other (living) neurons were able to ﬁre during the presentation of their matching pattern. It is clear that as the amount of jitter increases, the performance decreases. To conclude, we have seen that by using axonal conduction delays in combination with STDP, we can indeed create a neuron that will detect only a predetermined spatio-temporal spike pattern. Not all connections used in the pattern need matching delays, and the delays do not need to match the pattern’s spike times exactly.
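The distortion procedure can be illustrated with a short sketch in which every spike of a pattern is moved forward or backward by a random amount up to the maximum jitter. Drawing whole-millisecond offsets uniformly is an assumption that fits the 1 ms simulation clock; the actual distribution is not stated. The example pattern is hypothetical, and spike times are not clipped to the 20 ms part.

```python
import random


def jitter_pattern(pattern, max_jitter_ms, rng):
    """Shift every (neuron, time) spike by a random whole number of ms
    in the range [-max_jitter_ms, +max_jitter_ms] (assumed uniform)."""
    return [(n, t + rng.randint(-max_jitter_ms, max_jitter_ms)) for n, t in pattern]


rng = random.Random(1)
pattern = [(0, 5), (1, 12), (2, 19)]   # hypothetical (neuron, firing time in ms) pairs
jittered = jitter_pattern(pattern, 3, rng)
print(jittered)
```

The axonal delays stay matched to the undistorted pattern, so larger jitter spreads the arrival times at the output neuron over a wider window.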

4 Discussion

The results of our simulations are further evidence for the idea that delays play an important role in the functioning of the brain. They are not an irrelevant detail or even a minor nuisance that should be ignored in spiking neural networks. Although here we focused on the delay caused by axonal conduction, signals may also be delayed by synapses, dendrites, or possibly in other ways.


By essentially transforming the spike timings of a spatio-temporal spike pattern, delays allow neurons to even detect patterns that involve few simultaneous spikes. This should also allow longer patterns to be detected. However, for this to work, the conduction delays must match the pattern. This could be seen as an extra obstacle, e.g. because of the diﬃculties involved in setting the ‘right’ temporal delays, but it could also be seen as an opportunity; a way in which neurons could specialize for speciﬁc tasks, as was the case in our experiment. In this paper, the delays were set by hand. In nature, delays may simply be set by chance, or perhaps they are selected by evolution, favoring useful patterns. A learning mechanism may also modify delays. Further research may clarify the way in which delays are set by nature. We have shown that an artiﬁcial spiking neuron can learn to detect a speciﬁc spatio-temporal spike pattern, by using only STDP and axonal delays. STDP is a well known learning mechanism of the brain, and the speciﬁcs of axonal delays are also well understood. It seems plausible that the speciﬁc functionality modeled here is also exhibited by neuronal processes in the brain. Instead of a ‘nuisance’, delays might be a valuable additional source of processing capacity, assisting in the neuronal attunement to speciﬁc patterns. Delays can be useful.

References

1. Caporale, N., Dan, Y.: Spike Timing–Dependent Plasticity: A Hebbian Learning Rule. Annu. Rev. Neurosci. 31, 25–46 (2008)
2. Hebb, D.: The organization of behavior: A neuropsychological theory. John Wiley & Sons Inc. (1949)
3. Izhikevich, E.: Simple model of spiking neurons. IEEE Transactions on Neural Networks 14(6), 1569–1572 (2003)
4. Izhikevich, E.: Which model to use for cortical spiking neurons? IEEE Transactions on Neural Networks 15(5), 1063–1070 (2004)
5. Izhikevich, E.: Polychronization: Computation with spikes. Neural Computation 18(2), 245–282 (2006)
6. Masquelier, T., Guyonneau, R., Thorpe, S.: Spike timing dependent plasticity finds the start of repeating patterns in continuous spike trains. PLoS ONE 3(1), e1377 (2008)
7. Masquelier, T., Guyonneau, R., Thorpe, S.: Competitive STDP-based spike pattern learning. Neural Computation 21(5), 1259–1276 (2009)
8. Nessler, B., Pfeiffer, M., Maass, W.: STDP enables spiking neurons to detect hidden causes of their inputs. In: Bengio, Y., Schuurmans, D., Lafferty, J., Williams, C.K.I., Culotta, A. (eds.) Advances in Neural Information Processing Systems, vol. 22, pp. 1357–1365 (2009)

New Measure of Boolean Factor Analysis Quality

Alexander A. Frolov1, Dusan Husek2, and Pavel Yu. Polyakov3,4

1 Institute of Higher Nervous Activity and Neurophysiology, Russian Academy of Sciences, Butlerova st. 5a, 117865 Moscow, Russia
2 Institute of Computer Science, Academy of Sciences of the Czech Republic, Pod Vodarenskou vezi 2, 18207 Praha, Czech Republic
3 Scientific-Research Institute for System Studies, Russian Academy of Sciences, Nakhimovskii pr. 36, 117218 Moscow, Russia
4 VŠB – Technical University of Ostrava, 17. listopadu 15, 708 33 Ostrava – Poruba, Czech Republic

Abstract. Learning of objects from complex patterns is a long-term challenge in philosophy, neuroscience, machine learning, data mining, and statistics. Several approaches in the literature try to solve this difficult task, which consists in discovering the hidden structure of high-dimensional binary data; one of them is Boolean factor analysis. However, there is no expert-independent measure for evaluating this method in terms of the quality of the solutions obtained when analyzing unknown data. Here we propose information gain, a model-based measure of the rate of success of individual methods. This measure presupposes that observed signals arise as a Boolean superposition of base signals with noise. For the case where a method does not provide the parameters necessary for calculating the information gain, we introduce a procedure for their estimation. Using an extended version of the "Bars Problem" to generate typical synthetic data for such a task, we show that our measure is sensitive to all types of data model parameters and attains its maximum when the best fit is achieved.

Keywords: Boolean factor analysis, information gain, Hopfield neural network, statistics, expectation-maximization, associative memory, neural network application, Boolean matrix factorization, bars problem.

1 Introduction

Learning of objects from complex patterns is a long-standing theme in philosophy [14], neuroscience [1], physiology [11,12], the machine learning and neural network community [2,3,4,5,7,10], as well as data mining and statistics [13,15,16]. Many approaches in the literature try to solve this hitherto rather ill-defined task, which consists in discovering the hidden structure of high-dimensional binary data. It has not yet been solved effectively and still poses a major challenge for scientists. The well-known benchmark for learning of objects from complex patterns is the Bars Problem (BP) introduced by [4]. The BP in various modifications has been considered in many papers [10].

A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part I, LNCS 6593, pp. 100–109, 2011. © Springer-Verlag Berlin Heidelberg 2011

Fig. 1. (A) Sixteen vertical and horizontal bars in 8-by-8 pixel images. (B) Examples of images in the standard bars problem. Each image contains two bars on average.

In this problem, each pattern of a data set is an n-by-n binary pixel image containing several of the L = 2n possible (one-pixel wide) horizontal and vertical bars (Fig. 1). Pixels belonging and not belonging to a bar take values 1 and 0, respectively. For each image, each bar is chosen with probability C/L, where C is the mean number of bars mixed in an image. At the intersection of a vertical and a horizontal bar, the pixel takes the value 1. The Boolean summation of pixels belonging to different bars simulates the occlusion of objects. The task is to recognize all bars as individual objects on the basis of a data set containing M images consisting of bar mixtures. Such a task can typically be solved by factor analysis, which, in general, is one of the most efficient methods to reveal and reduce informational redundancy of high-dimensional signals. A special case of factor analysis, in which the components of the original signals, the factor loadings and the factor scores are all binary, is Boolean Factor Analysis (BFA). In terms of BFA, bars are factors, each image is a Boolean superposition of factors, and factor scores take values 1 or 0 depending on bar presence or absence in the image. Thus, the bars problem is a special case of BFA. Although binary data representation is typical for many fields, including social science, marketing, zoology, genetics and medicine, BFA methods are only moderately developed. In our previous papers [9,6,7] we demonstrated the efficiency of our neural network based BFA method ANNIA [7,8] in clustering the mushroom data set, in the analysis of parliamentary voting, in text analysis, and in factor search in artificial data sets.
However, whilst in the case of artificial signals we can assess the quality of the factors found by a BFA method by direct comparison with the factors used for synthetic data generation, in the case of real data sets the only way to assess the quality of the resulting factors is to compare them with those provided by experts. To make the assessment more accurate and objective we introduce a measure of quality of BFA (object learning) which requires neither knowledge of the exact solution of the problem in advance nor the involvement of experts. For this, in Sec. 3, we propose a measure, information gain, defined as the relative difference between the entropy of a signal space of observations, i.e. the entropy when nothing is known about the latent structure of these signals, and the entropy when the latent structure has been revealed by BFA or by another method providing factor scores. This measure is based on the BFA model, which is rigorously formulated in Sec. 2. A procedure for the estimation of the model parameters, when these are unknown, is introduced in Sec. 3.1. In Sec. 4 we investigate the general properties of information gain as a measure of quality of factorization results. A short summary follows.

2 Generative Model of Boolean Factor Analysis

In formulating a BFA generative model of signals, each pattern of a signal space is defined by a binary row vector X with dimensionality N that is equal to the total number of attributes. Every component of X takes value One or Zero, depending on the presence or absence of the related attribute. Each factor fi is a binary row vector of dimensionality N whose 1-valued entries correspond to highly correlated attributes of the i-th object. Although the probability that an attribute of the object appears in a pattern simultaneously with its other attributes is high, it is not necessarily equal to 1. So we denote this probability as pij, where j is the index of an attribute and i is the index of a factor. For attributes constituting the factor the probability pij is high, but it is Zero for the others. It is supposed that, in addition to the common factors fi, each signal also contains "specific noise" or, in other words, "independent errors", here modelled by specific factors represented by the binary row vector η. Each specific factor is characterized by a probability qj that the j-th component of vector η takes One. Formally, any vector X can be presented in the form

$$X = \Big[\bigvee_{i=1}^{L} S_i f'_i\Big] \vee \eta, \tag{1}$$

where S is a binary row vector of factor scores of dimensionality L, L is the total number of factors, f′i is a distorted version of factor fi and η is a specific noise defining the influence of specific factors. Factor distortion implies that some Ones of the i-th factor become Zeros with probability 1 − pij. We suppose that each component of the common factor is distorted independently of the presence of other factors in the pattern and independently of specific noise. Thus, the probability of the j-th component of X to take the value Xj is

$$P(X_j) = X_j - (2X_j - 1)(1 - q_j)\prod_{i=1}^{L}(1 - p_{ij})^{S_i}, \tag{2}$$

where scores Si are assumed to be given. We suppose that different components of X (attributes) are also statistically independent. Thus $P(X) = \prod_{j=1}^{N} P(X_j)$. BFA is performed on the set $\mathcal{X}$ of patterns Xm containing M representatives. We assume that factor distortion in each pattern of the data set does not depend on the others. Thus

$$P(\mathcal{X}) = \prod_{m=1}^{M} P(X_m). \tag{3}$$

The aim of Boolean Factor Analysis is to find the parameters of the generative model Θ = (pij, qj, πi, i = 1, . . . , L, j = 1, . . . , N), where πi is the probability that the i-th factor is present in a pattern, and the factor scores Smi, m = 1, . . . , M, for each of the M patterns of the data set. However, it is supposed that the found factors could also be detected in any arbitrary pattern X ∉ $\mathcal{X}$ if it is generated by the same model.
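As an illustration of the generative model (1)-(2), the following minimal sketch draws one pattern for the bars problem. The function names, the list-of-lists data layout and the choice to sample the scores independently with probabilities πi are assumptions made for this example.

```python
import random

def bars_factors(n):
    """Return the L = 2n bar factors of an n-by-n image as flat 0/1 lists (N = n*n)."""
    factors = []
    for r in range(n):                      # horizontal bars
        factors.append([1 if j // n == r else 0 for j in range(n * n)])
    for c in range(n):                      # vertical bars
        factors.append([1 if j % n == c else 0 for j in range(n * n)])
    return factors

def sample_pattern(pi, p, q, rng):
    """Draw factor scores S and one pattern X.

    pi[i]   : probability that factor i is present,
    p[i][j] : probability that attribute j is on when factor i is present,
    q[j]    : probability of specific noise on attribute j.
    """
    L, N = len(p), len(q)
    S = [1 if rng.random() < pi[i] else 0 for i in range(L)]
    X = []
    for j in range(N):
        on = rng.random() < q[j]            # specific noise eta_j
        for i in range(L):
            if S[i] and rng.random() < p[i][j]:
                on = True                   # contribution of the distorted factor
        X.append(1 if on else 0)
    return S, X

rng = random.Random(0)
F = bars_factors(4)
p_clean = [[float(v) for v in f] for f in F]   # p = 1 on bar pixels, 0 elsewhere
S, X = sample_pattern([0.25] * 8, p_clean, [0.0] * 16, rng)
```

With p = 1 on bar pixels and q = 0 the sampled pattern is exactly the Boolean superposition of the selected bars; lowering p or raising q produces the two kinds of noise described above.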

3 Information Gain

If the factor structure of the signal space is unknown and cannot be taken into account, then representing the j-th component of vector X requires h(pj) bits of information, where h(x) = −x log2 x − (1 − x) log2 (1 − x) is the Shannon function and pj is the probability of the j-th component to take One. Representing the whole data set requires

$$H_0 = M\sum_{j=1}^{N} h(p_j) \tag{4}$$

bits of information. If the hidden factor structure of the signal space is detected and all factor scores and generative model parameters are found, then representing the j-th component of vector Xm requires hmj = h(Pmj) bits of information, where Pmj is given by (2). Representing the whole data set requires

$$H = M\sum_{i=1}^{L} h(\pi_i) + \sum_{m=1}^{M}\sum_{j=1}^{N} h_{mj} \tag{5}$$

bits of information. The two terms in (5) define the information required to represent the factor scores and, once the factor scores are given, all patterns of the data set. The information gain is determined by the difference between H0 and H. We define the relative information gain as

$$G = (H_0 - H)/H_0. \tag{6}$$
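Equations (4)-(6) translate almost directly into code. The sketch below assumes that the data set X, the scores S and the parameters p and q are given as nested Python lists, and that πi and pj are estimated as frequencies; the function names are illustrative only, and the sketch assumes H0 > 0.

```python
import math

def h(x):
    """Shannon function in bits, with h(0) = h(1) = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log2(x) - (1.0 - x) * math.log2(1.0 - x)

def prob_one(S_m, p, q, j):
    """P(X_mj = 1) from Eq. (2) for the score vector S_m."""
    prod = 1.0 - q[j]
    for i, s in enumerate(S_m):
        if s:
            prod *= 1.0 - p[i][j]
    return 1.0 - prod

def information_gain(X, S, p, q):
    """Relative information gain G = (H0 - H) / H0, Eqs. (4)-(6)."""
    M, N, L = len(X), len(X[0]), len(S[0])
    p_j = [sum(x[j] for x in X) / M for j in range(N)]   # attribute frequencies
    pi = [sum(s[i] for s in S) / M for i in range(L)]    # score frequencies
    H0 = M * sum(h(v) for v in p_j)
    H = M * sum(h(v) for v in pi)
    H += sum(h(prob_one(S[m], p, q, j)) for m in range(M) for j in range(N))
    return (H0 - H) / H0

# Toy data: one noise-free factor covering both attributes, present in half the patterns.
X = [[1, 1], [1, 1], [0, 0], [0, 0]]
S = [[1], [1], [0], [0]]
G = information_gain(X, S, p=[[1.0, 1.0]], q=[0.0, 0.0])
```

In this toy case H0 = 8 bits and H = 4 bits, so G = 0.5: knowing the single factor halves the description length.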

From a practical point of view BFA is meaningful only if G > 0. According to (5) and (6), to calculate the BFA information gain one needs to know the factor scores Smi assigned to the patterns of the data set and the parameters pij and qj of the generative model. However, many methods capable of finding factor scores [10,4,5,7] do not provide the probabilities pij and qj explicitly. Our quality measure therefore has to rely only on a method's ability to assign proper factor scores to the patterns of the data set, and we need a way to find the parameters that the BFA generative model requires for the information gain calculation. For this we suggest a procedure based on maximization of the data set likelihood under the given factor scores provided by any of the considered BFA methods.

3.1 Probabilities Estimation

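For the BFA generative model the data set log-likelihood function takes the form

$$\mathcal{L} = \log P(\mathcal{X}) = \sum_{m=1}^{M}\sum_{j=1}^{N} \log P(X_{mj}), \tag{7}$$

where P(Xmj) is given by (2). The maximum of $\mathcal{L}$ is defined by the following system of L × N + N equations for pij and qj (i = 1, . . . , L, j = 1, . . . , N):

$$\frac{\partial \mathcal{L}}{\partial p_{ij}} = \sum_{m=1}^{M} P(X_{mj})^{-1}\frac{\partial P(X_{mj})}{\partial p_{ij}} = 0, \qquad \frac{\partial \mathcal{L}}{\partial q_{j}} = \sum_{m=1}^{M} P(X_{mj})^{-1}\frac{\partial P(X_{mj})}{\partial q_{j}} = 0, \tag{8}$$

where

$$\frac{\partial P(X_{mj})}{\partial p_{ij}} = \frac{S_{mi}(2X_{mj}-1)(1-q_j)\prod_{l=1}^{L}(1-p_{lj})^{S_{ml}}}{1-p_{ij}}, \qquad \frac{\partial P(X_{mj})}{\partial q_{j}} = (2X_{mj}-1)\prod_{l=1}^{L}(1-p_{lj})^{S_{ml}}. \tag{9}$$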

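Then (8) takes the form

$$\sum_{m=1}^{M} S_{mi} = \sum_{m=1}^{M}\frac{S_{mi}X_{mj}}{1-(1-q_j)\prod_{l=1}^{L}(1-p_{lj})^{S_{ml}}}, \qquad M = \sum_{m=1}^{M}\frac{X_{mj}}{1-(1-q_j)\prod_{l=1}^{L}(1-p_{lj})^{S_{ml}}}.$$

The obtained system can be solved by the iterative procedure

$$p_{ij}(k+1) = \frac{1}{\sum_{m=1}^{M} S_{mi}}\sum_{m=1}^{M}\frac{p_{ij}(k)\,S_{mi}X_{mj}}{1-(1-q_j(k))\prod_{l=1}^{L}(1-p_{lj}(k))^{S_{ml}}}, \qquad q_{j}(k+1) = \frac{1}{M}\sum_{m=1}^{M}\frac{q_j(k)\,X_{mj}}{1-(1-q_j(k))\prod_{l=1}^{L}(1-p_{lj}(k))^{S_{ml}}}. \tag{10}$$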
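One step of the iterative procedure (10) can be sketched as below. The data layout (nested lists), the function names and the small constant EPS, which only guards against division by zero and log(0), are assumptions of this example rather than part of the procedure itself.

```python
import math

EPS = 1e-12  # numerical guard; an assumption of this sketch

def p_one(S_m, p, q, j):
    """P(X_mj = 1) = 1 - (1 - q_j) * prod_l (1 - p_lj)^S_ml, cf. Eq. (2)."""
    prod = 1.0 - q[j]
    for l, s in enumerate(S_m):
        if s:
            prod *= 1.0 - p[l][j]
    return 1.0 - prod

def log_likelihood(X, S, p, q):
    """Eq. (7): sum of log P(X_mj) over all patterns and attributes."""
    total = 0.0
    for x, s in zip(X, S):
        for j, xj in enumerate(x):
            p1 = p_one(s, p, q, j)
            total += math.log(max(p1 if xj else 1.0 - p1, EPS))
    return total

def iterate_once(X, S, p, q):
    """One step of procedure (10); returns the updated (p, q)."""
    M, N, L = len(X), len(X[0]), len(S[0])
    # inv[m][j] = X_mj / (1 - (1 - q_j) * prod_l (1 - p_lj)^S_ml)
    inv = [[x[j] / max(p_one(s, p, q, j), EPS) for j in range(N)]
           for x, s in zip(X, S)]
    new_p = []
    for i in range(L):
        n_i = sum(s[i] for s in S)
        row = []
        for j in range(N):
            if n_i == 0:
                row.append(p[i][j])     # factor never present: leave unchanged
            else:
                acc = sum(S[m][i] * inv[m][j] for m in range(M))
                row.append(p[i][j] * acc / n_i)
        new_p.append(row)
    new_q = [q[j] * sum(inv[m][j] for m in range(M)) / M for j in range(N)]
    return new_p, new_q

# Toy check: one attribute that is on exactly when the single factor is present.
X = [[1], [1], [0], [0]]
S = [[1], [1], [0], [0]]
p, q = [[0.5]], [0.5]
history = [log_likelihood(X, S, p, q)]
for _ in range(20):
    p, q = iterate_once(X, S, p, q)
    history.append(log_likelihood(X, S, p, q))
```

On this toy data the estimates move towards p = 1 and q = 0 while the log-likelihood recorded in `history` does not decrease.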

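At each step, Δpij = pij(k + 1) − pij(k) is

$$\Delta p_{ij} = \frac{p_{ij}(k)}{\sum_{m=1}^{M} S_{mi}}\sum_{m=1}^{M}\frac{S_{mi}(2X_{mj}-1)(1-q_j(k))\prod_{l=1}^{L}(1-p_{lj}(k))^{S_{ml}}}{X_{mj}-(2X_{mj}-1)(1-q_j(k))\prod_{l=1}^{L}(1-p_{lj}(k))^{S_{ml}}} = \frac{p_{ij}(k)(1-p_{ij}(k))}{\sum_{m=1}^{M} S_{mi}}\sum_{m=1}^{M} P(X_{mj})^{-1}\frac{\partial P(X_{mj})}{\partial p_{ij}} = \frac{p_{ij}(k)(1-p_{ij}(k))}{\sum_{m=1}^{M} S_{mi}}\,\frac{\partial \mathcal{L}}{\partial p_{ij}}.$$

Similarly, Δqj = qj(k + 1) − qj(k) is

$$\Delta q_{j} = \frac{q_{j}(k)}{M}\sum_{m=1}^{M}\frac{(2X_{mj}-1)(1-q_j(k))\prod_{l=1}^{L}(1-p_{lj}(k))^{S_{ml}}}{X_{mj}-(2X_{mj}-1)(1-q_j(k))\prod_{l=1}^{L}(1-p_{lj}(k))^{S_{ml}}} = \frac{q_{j}(k)(1-q_{j}(k))}{M}\sum_{m=1}^{M} P(X_{mj})^{-1}\frac{\partial P(X_{mj})}{\partial q_{j}} = \frac{q_{j}(k)(1-q_{j}(k))}{M}\,\frac{\partial \mathcal{L}}{\partial q_{j}}.$$

Then

$$\Delta\mathcal{L} \simeq \sum_{i,j}\frac{\partial \mathcal{L}}{\partial p_{ij}}\Delta p_{ij} + \sum_{j}\frac{\partial \mathcal{L}}{\partial q_{j}}\Delta q_{j} = \sum_{i,j}\frac{p_{ij}(k)(1-p_{ij}(k))}{\sum_{m=1}^{M} S_{mi}}\left(\frac{\partial \mathcal{L}}{\partial p_{ij}}\right)^{2} + \sum_{j}\frac{q_{j}(k)(1-q_{j}(k))}{M}\left(\frac{\partial \mathcal{L}}{\partial q_{j}}\right)^{2}.$$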

Thus, at each iteration step the likelihood does not decrease, and hence the iterative procedure converges. Since we assume that for attributes constituting a factor the probabilities pij are sufficiently high but equal to zero for other attributes, at each iteration step we set pij = 0 if pij is small. In particular, we set it to Zero if

$$p_{ij} < 1 - \prod_{l\neq i}(1-\pi_l p_{lj}), \tag{11}$$

where the right side of the inequality is the probability that the j-th attribute appears in the pattern due to factors other than fi. We consider this "cleaning" of pij as factor binarization, because we treat a component with a high probability pij as constituting the i-th factor (fij = 1), and a component with a small pij as not constituting it (fij = 0). According to our experience the iterative procedure converges in 3-5 steps. As the input to the iterative procedure (10), we used pij and qj obtained from the probabilities $p^1_{ij}$ that the j-th attribute appears in the pattern when the i-th factor is present and $p^0_{ij}$ that it appears when the factor is absent. On the one hand, the probabilities $p^1_{ij}$ and $p^0_{ij}$ can be estimated as the frequencies of the j-th attribute taking One in patterns of the data set containing and not containing the i-th factor. On the other hand, these probabilities can be expressed as

$$p^0_{ij} = 1-(1-q_j)\prod_{l\neq i}(1-\pi_l p_{lj}), \qquad p^1_{ij} = 1-(1-q_j)(1-p_{ij})\prod_{l\neq i}(1-\pi_l p_{lj}).$$

This results in

$$p_{ij} = (p^1_{ij}-p^0_{ij})/(1-p^0_{ij}). \tag{12}$$
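The frequency-based initial estimate (12) and the cleaning rule (11) can be sketched as follows. Clipping negative estimates to zero and leaving pij = 0 when a factor is present in all or in none of the patterns are assumptions of this example, as are the function names.

```python
def frequency(rows, j):
    """Fraction of the given patterns in which attribute j takes One."""
    return sum(r[j] for r in rows) / len(rows)

def initial_p(X, S):
    """Initial p_ij from Eq. (12), using frequencies for p1_ij and p0_ij."""
    N, L = len(X[0]), len(S[0])
    p = [[0.0] * N for _ in range(L)]
    for i in range(L):
        present = [x for x, s in zip(X, S) if s[i]]
        absent = [x for x, s in zip(X, S) if not s[i]]
        if not present or not absent:
            continue                      # frequencies undefined (assumption: keep 0)
        for j in range(N):
            p1, p0 = frequency(present, j), frequency(absent, j)
            if p0 < 1.0:
                p[i][j] = max(0.0, (p1 - p0) / (1.0 - p0))   # clipped at 0
    return p

def clean(p, pi):
    """Rule (11): set p_ij to zero if it is below 1 - prod_{l != i} (1 - pi_l * p_lj)."""
    L, N = len(p), len(p[0])
    out = [row[:] for row in p]
    for i in range(L):
        for j in range(N):
            prod = 1.0
            for l in range(L):
                if l != i:
                    prod *= 1.0 - pi[l] * p[l][j]
            if p[i][j] < 1.0 - prod:
                out[i][j] = 0.0
    return out

# Two noise-free single-attribute factors, every score combination observed once.
X = [[0, 0], [1, 0], [0, 1], [1, 1]]
S = [[0, 0], [1, 0], [0, 1], [1, 1]]
pi = [sum(s[i] for s in S) / len(S) for i in range(2)]
p0 = clean(initial_p(X, S), pi)
```

For this toy data each factor is assigned exactly its own attribute, p0 = [[1, 0], [0, 1]].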


Fig. 2. Information gain for "theoretical" (thick lines), "ideal" (thin lines with markers) and "erroneous" solutions in dependence on the size of the data set M. Erroneous solutions: one of the factors was excluded; 16 false factors in the form of crossing bars were added; 10% of randomly chosen scores were excluded; 10% of randomly chosen scores were added. (a) Dependence on q (specific noise) for p = 1. (b) Dependence on p (factor distortion) for q = 0 and for both kinds of noise (q = 0.2, p = 0.7).

As in the procedure of likelihood maximization above, we put the probability pij to zero, if it satisﬁes (11). After ﬁnding pij , qj can be obtained from (12). Probabilities πi are estimated as the frequencies of the related scores provided by BFA. Probabilities pj required for calculation of H0 are estimated as frequencies of related components in the data set.

4 Properties of Information Gain

In this section, we illustrate the general properties of the information gain G defined by (6), using the bars problem data. In particular, we compare the values of G obtained for

– the "theoretical solution", when all scores and generative model parameters are exactly the same as those used for data set generation,
– the "ideal solution", when all scores are exactly the same as those used in the generated data set, but the parameters of the generative model are found by likelihood maximization (this case simulates the situation when BFA provides scores ideally matching those in the generated data set under analysis),
– the "erroneous solution", when some factors or scores are missed or false factors or scores are added.

Fig. 2 illustrates the dependence of information gain for the "theoretical", "ideal" and "erroneous" solutions on the probabilities qj and pij, and on the size of the data set M. Here, pij = p for components constituting factors and qj = q for any j. Recall that pij = 0 for components not constituting factors. The value of G in Fig. 2 is obtained by averaging the results of 50 trials. Each trial uses a randomly generated data set of a given size M. Patterns of the data set were 8-by-8 binary images (i.e., N = 64). L = 16 vertical and horizontal bars (one pixel wide, Fig. 1(A)) were randomly mixed in images with probabilities πi = C/L = 1/8, thus 2 bars were mixed in each image on average. Examples of the standard BP images (p = 1, q = 0) are shown in Fig. 1(B), and their noisy versions for p = 0.7, q = 0.2 are shown in Fig. 3(A).

Fig. 3. (A) Examples of noisy images for p = 0.7, q = 0.2. (B) Probabilities pij (shown by the shades of grey) of pixel activation obtained by likelihood maximization for the ideal solution (M = 100, p = 1, q = 0.3) in one of the trials. This figure in higher resolution can be found here: http://www.cs.cas.cz/dusan/ICANNGA11/Fig3.pdf.

For small M the information gain for the "ideal" solution is paradoxically higher (Fig. 2) than for the "theoretical" one. But this is the usual case for likelihood maximization: when the data set is relatively small, the procedure provides a solution for pij and qj that fits the random peculiarities of a given data set realization better than the "theoretical" solution does. That is why both G and the likelihood are higher for the "ideal" solution adjusted to those specific peculiarities. Fig. 3(B) shows the values of pij obtained for one of the trials with M = 100, p = 1 and q = 0.3. The black pixels correspond to pij = 1, the white pixels correspond to pij = 0, and the grey pixels correspond to intermediate values. For the pixels constituting bars all pij = 1. However, the factors found by likelihood maximization contain some additional pixels. For some of them, the probability of their appearance with factors is rather high. This means that in this particular data set those pixels were activated by chance simultaneously with the activation of the related bar, and this peculiarity of the data set was detected by likelihood maximization. When M increases, this effect disappears and the "ideal" solution coincides with the "theoretical" one.

As shown in Fig. 2, the maximal information gain is achieved when the bars mixed in the scenes are not distorted (p = 1, q = 0), and G decreases when noise increases due to either increasing q or decreasing p. We suppose that when the information gain G is positive, BFA is appropriate for a given data set, and when it is negative, BFA makes no sense. The smaller G is, the less explicitly the factor structure of the data set is exposed. For example, when G is small (Fig. 2(b), p = 0.7, q = 0.2), bars in images are almost invisible (Fig. 3(A)).

Information gain also decreases when BFA is not perfect. In particular, G decreases when one of the factors is missing (Fig. 2). The decrease of G occurs in this case due to increasing qj. G also decreases when false factors are added to the true factors. In the experiments, to the 16 true factors we added 16 false factors that were crosses of randomly chosen vertical and horizontal bars. As shown below, this kind of false factor is typical for some BFA methods. In the experiment whose results are depicted in Figs. 2 and 3, the scores for the false factors were given precisely as those for the true factors. Additional false factors result in decreasing G due to the increase of the first term in (5), which gives the information required to describe the scores. As shown in Fig. 2, G also decreases when true scores are excluded or false scores are added. Thus, all kinds of errors result in a decrease of the information gain. Hence, we can conclude that information gain is a reliable measure for comparing different BFA methods and also for detecting hidden BFA structure in a given data set.

5 Summary

Learning of objects from complex patterns is a problem that has yet to be satisfactorily solved. Although there have been attempts to solve this difficult task by methods which explicitly or implicitly perform BFA, nobody has thus far proposed an expert-independent measure for evaluating the quality of the solutions these methods obtain when unknown data are analyzed. Here, we define a generative model of data suitable for such a task and identify it as a model of data suitable for Boolean factor analysis (Sec. 2). Next, we propose relative information gain (Sec. 3) as a measure of the success of individual methods. When this measure is applied, it is supposed that observed signals come from the generative model, i.e. they arise as a Boolean superposition of base signals with noise. As some methods do not provide the user with the probabilities necessary for calculating information gain, we have introduced a procedure for their estimation (Sec. 3.1). In Sec. 4 we investigate the general properties of the information gain G using the bars problem. The extended model for the generation of artificial signals used there generalizes the classical bars problem (Sec. 1), because it allows noise of various types to be generated according to the generative model (Sec. 2). We show (Sec. 4) that G decreases both due to errors in the BFA result and due to an increase of noise in the signal, i.e. when its factorial nature becomes less pronounced. Thus, G is both a proper measure of the accuracy of BFA and a proper measure of the fitness of signals to BFA. The analysis presented here allows us to conclude that information gain is a reliable basis for the comparison of different BFA methods and for detecting the presence of hidden factor structure in given data.

Acknowledgement. This work was supported by the grants AV0Z10300504, GACR P202/10/0262, 205/09/1079, MSM6138439910 and RFBR 09-07-00159-a.


References

1. Barlow, H.B.: Cerebral cortex as model builder. In: Rose, D., Dodson, V.G. (eds.) Models of the Visual Cortex, pp. 37–46. Wiley, Chichester (1985)
2. Belohlavek, R., Vychodil, V.: On Boolean factor analysis with formal concepts as factors, pp. 20–24 (2006)
3. Belohlavek, R., Vychodil, V.: Discovery of optimal factors in binary data via a novel method of matrix decomposition. Journal of Computer and System Sciences 76(1), 3–20 (2010)
4. Foldiak, P.: Forming sparse representations by local anti-hebbian learning. Biological Cybernetics 64, 165–170 (1990)
5. Frolov, A.A., Husek, D., Muraviev, I.P., Polyakov, P.Y.: Boolean factor analysis by attractor neural network. IEEE Transactions on Neural Networks 18(3), 698–707 (2007)
6. Frolov, A.A., Husek, D., Polyakov, P., Rezankova, H.: New Neural Network Based Approach Helps to Discover Hidden Russian Parliament Voting Patterns. In: IEEE International Joint Conference on Neural Networks, pp. 6518–6523 (2006)
7. Frolov, A.A., Husek, D., Polyakov, P.Y.: Recurrent neural network based Boolean factor analysis and its application to automatic terms and documents categorization. IEEE Transactions on Neural Networks 20(7), 1073–1086 (2009)
8. Frolov, A.A., Husek, D., Polyakov, P.Y.: Origin and Elimination of Two Global Spurious Attractors in Hopfield-like Neural Network Performing Boolean Factor Analysis. Neurocomputing (2010) (in press)
9. Frolov, A.A., Húsek, D., Rezanková, H., Snásel, V., Polyakov, P.: Clustering variables by classical approaches and neural network Boolean factor analysis. In: IEEE International Joint Conference on Neural Networks, pp. 3742–3746 (2008)
10. Lücke, J., Sahani, M.: Maximal causes for non-linear component extraction. The Journal of Machine Learning Research 9, 1227–1267 (2008)
11. Marr, D.: A Theory for Cerebral Neocortex. Proceedings of the Royal Society of London. Series B, Biological Sciences (1934-1990) 176(1043), 161–234 (1970)
12. Marr, D.: Simple Memory: A Theory for Archicortex. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences (1934-1990) 262(841), 23–81 (1971)
13. Mickey, M.R., Mundle, P., Engelman, L.: Boolean factor analysis. In: Dixon, W. (ed.) BMDP Statistical Software, pp. 538–545. University of California Press, Berkeley (1983)
14. Pelaez, J.R.: Plato's theory of ideas revisited. Neural Networks 10(7), 1269–1288 (1997)
15. Veiel, H.O.: Psychopathology and Boolean Factor Analysis: a mismatch. Psychol. Med. 15(3), 623–628 (1985)
16. Weber, A.C., Scharfetter, C.: The syndrome concept: history and statistical operationalizations. Psychol. Med. 14(2), 315–325 (1984)

Mechanisms of Adaptive Spatial Integration in a Neural Model of Cortical Motion Processing

Stefan Ringbauer, Stephan Tschechne, and Heiko Neumann

Ulm University, Faculty of Engineering and Computer Science, Institute for Neural Information Processing, 89069 Ulm, Germany
{stefan.ringbauer,stephan.tschechne,heiko.neumann}@uni-ulm.de

Abstract. In visual cortex, information is processed along a cascade of neural mechanisms that pool activations from the surround with spatially increasing receptive fields. Watching a scene with multiple moving objects leads to object boundaries on the retina defined by discontinuities in feature domains such as luminance or velocity. Spatial integration across these boundaries mixes distinct sources of input signals and leads to unreliable measurements. Previous work [6] proposed a luminance-gated motion integration mechanism, which does not account for the presence of discontinuities in other feature domains. Here, we propose a biologically inspired model that utilizes the low and intermediate stages of cortical motion processing, namely V1, MT and MSTl, to detect motion by locally adapting spatial integration fields depending on motion contrast. This mechanism generalizes the concept of bilateral filtering proposed for anisotropic smoothing in image restoration in computer vision.

Keywords: Motion Estimation, Neural Modeling, Motion Integration, Diffusion.

1 Introduction

Optic flow is perceived whenever observer motion occurs relative to the surrounding environment. The resulting projected movements are caused by object motion in the scene, by egomotion, or by combinations of both. The primate brain has developed an impressive ability to derive rich information from such data, making navigation in and interaction with the environment reliable and accurate. As many applications could profit from flow-based information, motion estimation has been subject to intense research in the past decades (e.g. [5], [3], [1]). Biological models are a promising way towards reliable motion estimation and have already demonstrated results comparable with engineering approaches against changes in complex scenes. In accordance with the structure of the visual system, biologically inspired models consist of interconnected areas with specialized functionality. Recurrent connections between such areas help to improve and stabilize the derived motion signal.

A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part I, LNCS 6593, pp. 110–119, 2011. © Springer-Verlag Berlin Heidelberg 2011


The hierarchical integration over increasingly larger neighborhoods causes the problem that information is mixed up at boundaries where different input sources might contribute to the integration field. In such cases the feedforward integration leads to an erroneous measure of surface motion. Some approaches have already suggested improvements of such integration steps, e.g. by using luminance contrast to modulate integration [6]. However, this only works if the luminance values of objects and background differ sufficiently, an assumption that holds only in a rather limited number of cases. In this paper we propose an adaptation of the integration mechanism to render it sensitive to motion discontinuities. The shapes of receptive fields for motion integration are adapted in an anisotropic fashion to integrate coherently moving regions but to prevent integration across motion discontinuities. We adapted a model of motion estimation [4] which consists of a hierarchical architecture of the dorsal pathway containing the primary visual area V1, the medial temporal area MT, as well as the medial superior temporal area MST. Our method alters the motion integration step between areas V1 and MT. This process includes the estimation of an anisotropic filter shape by applying motion contrast information from area MST to a diffusion-like estimation of the corresponding integration region of cells in MT. With this extension, the integration regions adapt to the outline of moving objects, independent of luminance or other features. This improves the quality of motion estimation at object borders and helps to disambiguate further processing.

2 Neural Model

The proposed model is based on [4], a recurrent neural model covering the main stages of the motion processing (dorsal) pathway of the primate brain. The model consists of areas V1, MT and the ventral part of MST, namely MSTv. Model area V1 is fed with image sequences and performs correlation-based initial motion estimation, while area MT performs homogeneous motion integration. MT also supports the initial estimates at the V1 level by feeding back contextual information obtained from integrating motion information over large spatial neighborhoods. The resulting motion information is represented in the form of population codes, which are finally interpreted to obtain an optical flow field. We extend this model by incorporating the lateral part of area MST, namely MSTl, with cells sensitive to motion discontinuities. This motion contrast information is needed for the actual extension, a motion-contrast-dependent integration mechanism from area V1 to area MT.

2.1 Method for Motion Discontinuity Detection

MT motion activities are further processed by contrast sensitive cells in model area MSTl [8]. The cells have approximately the same size as MT cells and are selective for both direction and speed changes of the input motion responses. This is done by looking for diﬀerent motions in the spatial neighborhood of a cell. The receptive ﬁelds of these motion contrast cells are composed of a

112

S. Ringbauer, S. Tschechne, and H. Neumann

small center and a larger surround area using Gaussian kernels Gσ as weighting function to yield center = actMT ∗ Gc and surround = actMT ∗ Gs , respectively (where ∗ denotes the convolution operation). For each spatial position motion contrast is computed by a divisive inhibition actMST l [θ|s] (x) = center[θ|s] (x)/(η + β · surround[θ|s] (x))

(1)

where $\theta$ is the motion direction co-domain with cardinality $k$, $s$ the speed co-domain with cardinality $l$, $\eta$ is a scaling constant, and $\beta$ is a constant for surround modulation. Locations with high activation in model MSTl represent motion discontinuities, that is, transitions between different motions. In order to detect such contrast locations irrespective of the composite input velocity, we sum the contrast signals from the direction and speed co-domains, namely

$$\mathrm{act}^{MSTl}(x) = \sum_{i=1}^{k} \frac{\mathrm{act}^{MSTl}_{\theta_i}(x)}{k} \cdot \sum_{j=1}^{l} \frac{\mathrm{act}^{MSTl}_{s_j}(x)}{l} \qquad (2)$$
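The center-surround computation of Equations 1 and 2 can be illustrated with a minimal Python sketch. It is a one-dimensional simplification and not the implementation used for the simulations: the kernel widths, the values of η and β, the border handling and all function names are assumptions chosen for illustration.

```python
import math

def gaussian_kernel(sigma, radius):
    """Normalized 1-D Gaussian weights on the offsets -radius..radius."""
    w = [math.exp(-(d * d) / (2.0 * sigma * sigma)) for d in range(-radius, radius + 1)]
    total = sum(w)
    return [v / total for v in w]

def convolve(signal, kernel):
    """Same-size 1-D convolution; samples beyond the border repeat the edge value."""
    r = len(kernel) // 2
    n = len(signal)
    out = []
    for x in range(n):
        acc = 0.0
        for k, wk in enumerate(kernel):
            idx = min(max(x + k - r, 0), n - 1)
            acc += wk * signal[idx]
        out.append(acc)
    return out

def motion_contrast(act_mt, sigma_c=1.0, sigma_s=3.0, eta=0.01, beta=1.0):
    """Eq. (1) for a single direction or speed channel (assumed parameter values)."""
    center = convolve(act_mt, gaussian_kernel(sigma_c, 3 * int(math.ceil(sigma_c))))
    surround = convolve(act_mt, gaussian_kernel(sigma_s, 3 * int(math.ceil(sigma_s))))
    return [c / (eta + beta * s) for c, s in zip(center, surround)]

def combined_contrast(direction_channels, speed_channels):
    """Eq. (2): product of the channel means over directions and speeds."""
    n = len(direction_channels[0])
    k, l = len(direction_channels), len(speed_channels)
    return [
        (sum(ch[x] for ch in direction_channels) / k)
        * (sum(ch[x] for ch in speed_channels) / l)
        for x in range(n)
    ]
```

For a channel that is active on one side of a boundary and silent on the other, the response of `motion_contrast` peaks close to the boundary on the active side, because the small center is still mostly covered by activity while the larger surround already extends into the silent region.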

This information will be used to limit the integration areas and thus to avoid mixing different motions during integration.

2.2 Model of Contextual Enhancement

The estimation of motion discontinuities is likely to be noisy, and contextual boundaries might be fragmented. The unenhanced signal thus does not clearly separate regions of different motion, as desired. We propose a step of modulatory enhancement that reduces the noise and mends discontinuities that belong together. This process is based on the linking principle first proposed by [13], which was used and further extended by [11], [12], [14]. In a nutshell, a driving feedforward (FF) signal is enhanced by feedback (FB) signals following the scheme $\mathrm{act}^{FF}(1 + \alpha\,\mathrm{act}^{FB})$. Our proposed context enhancement consists of two interacting layers, which are designed following the layout of the early visual cortex. The first layer models cells that are sensitive to contrasts of a specific orientation and spatial frequency. The second layer applies long-range connections to group like-oriented contrasts. The activity of the second layer is then used to enhance the signal from the first layer by applying a modulatory feedback signal. This enhances contextually meaningful contours that receive support from their surround while removing spurious activities. Please note that while the principle is adapted from the lower areas of visual cortex, it now processes signals originating from higher areas. For details, see Figure 1, which shows an illustration and examples of the process. In our experiments we used Gabor filters for the first stage and a multiplicative combination of Gaussian-shaped cells for the second stage. They are used to build a population with 16 orientations each. The feedback iterations between both layers are repeated several times until gaps in contextual contours are sufficiently closed and noise is removed.

2.3 Model of Diffusion Process

To infer the shape of the integration after the enhancement step, we incorporate a process of anisotropic diffusion which is carried out for every MT cell.


Fig. 1. Modulatory enhancement. (a) Model sketch of two interacting layers with examples (contrast orientation and grouping stage) (b) Example input (c) Result of the contextual enhancement. Undesired noise was cancelled while contextual clues have been increased.
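The modulatory scheme act^FF (1 + α act^FB) illustrated in Fig. 1 can be sketched in a few lines of Python. This is a one-dimensional toy version under stated assumptions: the grouping stage is replaced by a hypothetical product of the two flanking activities (standing in for the multiplicative combination of Gaussian-shaped cells), a single orientation is used instead of 16, and the value of α, the clipping to 1 and the function names are illustrative only.

```python
def grouping(act):
    """Hypothetical grouping stage: support requires activity on both flanks."""
    n = len(act)
    return [act[x - 1] * act[x + 1] if 0 < x < n - 1 else 0.0 for x in range(n)]

def modulatory_enhancement(act_ff, alpha=2.0, iterations=3):
    """Iterate act_FF * (1 + alpha * act_FB) on a 1-D contour signal."""
    act = list(act_ff)
    for _ in range(iterations):
        act_fb = grouping(act)
        # The feedback only modulates the driving input: where act_ff is zero,
        # the result stays zero regardless of the feedback.
        act = [min(1.0, ff * (1.0 + alpha * fb)) for ff, fb in zip(act_ff, act_fb)]
    return act
```

In this toy version a weak response flanked by two active neighbours is amplified, an isolated response of the same order is left unchanged, and silent locations remain silent, which is the qualitative behaviour expected from a modulatory feedback signal.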

During the diffusion process, the activities of the modulatory enhancement steer the diffusion. A diffusion process equilibrates concentration differences while preserving the total amount of particles, mass or energy. In our case, the energy or mass is the spatial weight with which a location within the receptive field contributes to the integration process. In formal terms, the strength and direction of the resulting diffusion (flux) $j$ depend on a diffusion tensor $D$ and the gradient $\nabla u$ of the energy field. Fick's law ([2], [7]), together with the conservation of energy, leads to the diffusion equation

$$\partial_t u = -\operatorname{div} j \quad \text{with} \quad j = -D \cdot \nabla u \qquad (3)$$

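Equation 3 realizes the heat conduction process if matrix $D$ is the unit matrix with a homogeneous scaling constant. We utilize an anisotropic scheme in which the local direction is steered by the local orientation of the grouping responses. In this case, $D$ is designed as follows. The predominant orientation $\theta$ is found with $\theta = \operatorname{argmax}_{\theta_N}(\mathrm{act}^{MSTl}_{\theta_N})$ for $N \in \{1..\mathrm{orientations}\}$. This is used to define basis vectors for the diffusion process

$$v_1 = \begin{pmatrix} \cos\theta \\ \sin\theta \end{pmatrix} \quad \text{and} \quad v_2 = \begin{pmatrix} -\sin\theta \\ \cos\theta \end{pmatrix}.$$

The diffusion matrix $D$ is then obtained after scaling using the eigenvalues

$$\lambda_1 = \begin{cases} 1 & \mathrm{act}^{MSTl}_{\theta} = 0 \\ 1 - e^{-1/\mathrm{act}^{MSTl}_{\theta}} & \mathrm{act}^{MSTl}_{\theta} > 0 \end{cases} \quad \text{and} \quad \lambda_2 = 1 \quad \text{in} \quad D = \begin{pmatrix} v_1 & v_2 \end{pmatrix} \begin{pmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{pmatrix} \begin{pmatrix} v_1^T \\ v_2^T \end{pmatrix}.$$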
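The construction of the diffusion matrix can be made concrete with a short Python sketch for a single location. It assumes that the first eigenvalue has the form 1 − exp(−1/act) for positive contrast activity and equals 1 otherwise; the function name and the plain nested-list representation of the 2×2 matrix are illustrative choices.

```python
import math

def diffusion_tensor(theta, act):
    """D = lambda1 * v1 v1^T + lambda2 * v2 v2^T for one spatial location.

    theta: predominant orientation in radians; act: contrast activity (>= 0).
    """
    # Assumed eigenvalue form: no contrast gives isotropic diffusion (lambda1 = 1),
    # strong contrast drives lambda1 towards 0 and blocks diffusion along v1.
    lam1 = 1.0 if act <= 0.0 else 1.0 - math.exp(-1.0 / act)
    lam2 = 1.0
    c, s = math.cos(theta), math.sin(theta)
    v1 = (c, s)
    v2 = (-s, c)
    return [
        [lam1 * v1[0] * v1[0] + lam2 * v2[0] * v2[0],
         lam1 * v1[0] * v1[1] + lam2 * v2[0] * v2[1]],
        [lam1 * v1[1] * v1[0] + lam2 * v2[1] * v2[0],
         lam1 * v1[1] * v1[1] + lam2 * v2[1] * v2[1]],
    ]
```

Because v1 and v2 are orthonormal, the expanded sum is identical to the matrix product given in the text. Without contrast activity the result is the unit matrix, which corresponds to the homogeneous heat conduction case of Equation 3.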


Fig. 2. Diffusion process. (a) Homogeneous integration would mix different features (illustrated by the box) when cells are close to discontinuities. (b) Anisotropic integration keeps integration regions separate. (c,d) Model simulations of an integration area located close to but on different sides of a discontinuity. In our proposal, contextually enhanced motion discontinuities steer the diffusion process.
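The effect shown in Fig. 2 can be reproduced qualitatively with a one-dimensional sketch of a diffusion process in flux form. It is not the two-dimensional tensor scheme of the model: the scalar conductances, the explicit time stepping, the step size and the function name are assumptions made for illustration. A conductance of zero plays the role of a motion discontinuity.

```python
def diffuse(weights, conductance, steps=200, dt=0.2):
    """Explicit 1-D diffusion with zero-flux borders.

    conductance[i] scales the flux between cells i and i + 1, so
    len(conductance) == len(weights) - 1. Stable for dt * max(conductance) <= 0.5.
    """
    u = list(weights)
    for _ in range(steps):
        flux = [conductance[i] * (u[i + 1] - u[i]) for i in range(len(u) - 1)]
        nxt = list(u)
        for i, f in enumerate(flux):
            # Whatever leaves one cell enters its neighbour, so the sum is preserved.
            nxt[i] += dt * f
            nxt[i + 1] -= dt * f
        u = nxt
    return u
```

Starting from a unit weight placed left of a blocked link, for example `diffuse([0, 0, 1, 0, 0, 0, 0, 0], [1, 1, 1, 0, 1, 1, 1])`, the weight spreads evenly over the four cells on the left of the barrier and never reaches the right side, while the total weight remains 1.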

2.4 Combination of Methods

The previous sections introduced all mechanisms that were added to the underlying model proposed in [4]. These mechanisms are now combined. After the initial motion detection in model area V1, the information is integrated in the spatially coarser area MT and passed on to MSTl, where the motion is analyzed by its contrast-sensitive cells (section 2.1). MSTl responses are then sharpened by the context enhancement mechanism (section 2.2). The resulting information about the motion discontinuities in MSTl is then fed back to model area MT, which incorporates it into the motion contrast dependent integration (MCDI). The MCDI uses a diffusion process (section 2.3) for each receptive field that is bounded by the motion discontinuities calculated earlier. The receptive fields in turn selectively integrate only motion information from V1 that is covered by the region determined by the diffusion mechanism.

3 Results

The proposed model was probed with several selected test input sequences to assess its capability to separate different motions during integration. This is shown using two different image sequences: first, an abstract artificial sequence containing moving noise patterns and, second, the well-known 'Yosemite with Clouds' sequence, which is close to a real-world example.

3.1 Moving Box

This image sequence is composed of a rectangular patch moving to the lower right in front of a background that is moving to the upper right. Both the patch and the background are generated from a random noise (uniformly distributed) luminance pattern. As a result, the patch can only be recognized while it is in motion, since no static form cues are available to detect region boundaries (Figure 3).

[Figure 3 panels: (a) original input frame; (b) optical flow (ground truth)]

Fig. 3. (a) The textured moving box and background. There are no static features to determine the borders of the moving box. (b) When in motion, the box shows an outline and can be segmented from the background. Here the ground truth optical flow is displayed. The color encodes the motion direction as shown in the color wheel on the upper right.

Given this input sequence of surface motion, our model generates strong responses in MSTl motion contrast cells at the transition between the box and the background (Figure 4a). There are also weak responses within the box and on the background due to noise in the motion estimation. After the stage of context enhancement, the noise within the box and the background is reduced and the boundary of the box induced by motion contrast appears less fragmented (Figure 4b). These boundaries affect the integration process by limiting the integration region within a receptive field (RF), as can be seen in Figure 4c, where the darkened areas show positions where the integration area is decreased through the motion contrast. Comparing the estimated flow of the model using homogeneous integration (HI) and motion contrast dependent integration (MCDI) reveals the advantage of the selective motion integration. The HI produces rather smooth transitions at the boundaries of different motions, whereas the MCDI leads to more differentiated and sharper boundaries, such that regions of strong differences between estimated and ground truth flow in the proximity of motion boundaries are greatly reduced (Figure 5). In the case of the noise-textured object, the projected error [10] is improved by an average of 18.7% with MCDI (measuring the first nine frames). An adaptation mechanism that utilizes luminance contrasts instead of motion would fail in this scenario. The sequence is composed of regions with the same luminance statistics, so that no form information helps to correlate with the moving patches.

[Figure 4 panels: (a) motion contrast cell response; (b) context enhanced motion contrast; (c) percentage of RF integration area]

Fig. 4. (a) The response of the motion contrast sensitive cells in MSTl (lighter means higher activity) on the moving box sequence. The highest activities are found at the borders even though the boundary appears fragmented. (b) After context enhancement the boundaries of the box are intensified and the noise is strongly reduced. (c) The percentage of the region of an RF integration is shown (reduced integration regions are darkened).

3.2 'Yosemite with Clouds' Sequence

The 'Yosemite with Clouds' image sequence simulates a flight over a virtual mountain scenery. One quarter of the image consists of sky with clouds moving to the right; the rest shows a canyon of the Yosemite National Park (hence the name), which passes under the observer as the camera moves forward and slightly rotates to the right. When the estimated flow after homogeneous integration (HI) is compared to the one with motion contrast dependent integration (MCDI), the difference is minute. The difference becomes more striking if we focus on the population code instead of its interpretation. Let us zoom in to the upper left of the sequence, where the horizon and the sky meet (Figure 6d). With HI, the population activity shows a multi-modal distribution since it represents both the movement of the clouds to the right and the movement of the mountains to the left (Figure 6e, left). This is because the RFs cover regions above and below the horizon. For the population code interpretation (e.g. for flow field representations) the mode with the highest amplitude is chosen and the other mode, even though it is only slightly smaller, is ignored. However, using the MCDI method the movements of the clouds and the mountains are not mixed up, since the RFs are limited by the motion discontinuities (Figure 6c) and therefore avoid an overlap of the regions above and below the horizon. This results in a unimodal population code representation (Figure 6e, right) that offers a better source for further computations, e.g. the motion discontinuities in model area MST or the feedback from model area MT to V1. Furthermore, the method for the interpretation of the population code can now be more general, since the MCDI method prevents the emergence of ambiguous multi-modal representations.

[Figure 5 panels: (a) error HI; (b) error MCDI]

Fig. 5. (a) Errors in estimated motion occur at quite a distance from the border of the box, since the homogeneous integration fuses the background and foreground motion. (b) Using MCDI, erroneous estimations appear only very close to the motion contrast.

[Figure 6 panels: (a) Yosemite with Clouds (ground truth); (b) motion contrast dependent integration; (c) context enhanced motion contrast; (d) Yosemite with Clouds (RF locations for (e)); (e) motion direction histogram (HI vs. MCDI)]

Fig. 6. (a) The ground truth of the flow field of the 'Yosemite with Clouds' image sequence. (b) The estimated flow field using the MCDI mechanism. After the interpretation of the population code there is almost no improvement compared to the HI method. (c) MSTl responses show the estimated motion discontinuities used as integration boundaries. (d) Sample RF locations above and below the horizon with the corresponding motion direction histograms (e). Using the HI method ((e), left), the population code represents both the cloud and the mountain movement, independent of the RF position relative to the horizon. However, the MCDI method ((e), right) leads to a much more distinct representation of the underlying movements of the sky and the mountains.
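The readout problem described for Figure 6e can be illustrated with a small Python sketch. The histograms used in the usage note are made-up values, not measured population activities, and the function names, the eight direction bins and the simple local-maximum test are assumptions for illustration.

```python
def decode_direction(histogram, directions):
    """Winner-take-all readout: return the direction of the largest bin."""
    best = max(range(len(histogram)), key=lambda i: histogram[i])
    return directions[best]

def count_modes(histogram, min_height=0.0):
    """Count local maxima on a circular direction histogram."""
    n = len(histogram)
    return sum(
        1 for i in range(n)
        if histogram[i] > min_height
        and histogram[i] > histogram[i - 1]          # index -1 wraps around
        and histogram[i] >= histogram[(i + 1) % n]
    )
```

With `directions = [0, 45, 90, 135, 180, 225, 270, 315]`, an illustrative bimodal code such as `[0.9, 0.3, 0.1, 0.2, 0.8, 0.2, 0.1, 0.3]` has two modes, yet `decode_direction` reports only 0 degrees and silently discards the nearly equal mode at 180 degrees. A unimodal code such as `[0.9, 0.3, 0.1, 0.0, 0.0, 0.0, 0.1, 0.3]` has a single mode, so the same readout loses no information.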

4 Discussion and Conclusions

We propose a biologically inspired model of motion estimation that utilizes the low and intermediate stages of cortical motion processing, namely V1, MT and MSTl, to detect motion. Our model makes a new contribution regarding the process of motion integration in area MT. We propose that incorporating motion discontinuities into the integration mechanism improves the result of motion estimation. These activities yield a more reliable and more robust signal for the segregation of moving objects. The resulting integration follows the Gestalt law of common fate. In particular, this signal is gathered exclusively from the motion (dorsal) pathway and does not depend on other features such as luminance, form or texture. Since these features might not be visible or might be complicated to detect, robust boundary estimation based on them can be impossible. These channels might be integrated in future versions of the model in order to benefit from multi-feature contrast detection and fusion of boundary information. At the moment, our scheme incorporates the response of motion contrast cells in MSTl that signal all changes in speed and direction. These noisy and fragmented estimates are improved with a methodology adapted from cortical areas V1 and V2, where lateral connections group together like-oriented contrasts and thus lead to contextually improved and less noisy signals. A diffusion process is then applied to adapt formerly circular integration areas to the outline of objects. Compared to classical homogeneous integration mechanisms, our proposed motion contrast dependent integration scheme shows fewer estimation errors at object boundaries. The lower cortical region V1 benefits from this improved integration process due to recurrent feedback connections that stabilize and disambiguate the estimates. This property is clearly shown when multiple estimates are presented in velocity space. Here, our proposed extension leads to less ambiguous activations within cell populations.
Future work will focus on improving contextual enhancement and the estimation of motion discontinuities. Furthermore, classic features like texture statistics or signals derived from the form (ventral) pathway can be included in the estimation of object boundaries.

Acknowledgments and Author Contributions

This work was supported by a grant from the European Commission within the 7th Framework Program: Smart Eyes: Attending and Recognizing Instances of Salient Events (SEARISE, Project no. 215866). This work was also supported by a grant from the German Federal Ministry of Education and Research, project 01GW0763, Brain Plasticity and Perceptual Learning. Authors S. R. and S. T. have contributed equally to the contents of this publication.

References

1. Brox, T., Papenberg, N., Weickert, J.: High Accuracy Optical Flow Estimation Based on a Theory for Warping. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3024, pp. 25–36. Springer, Heidelberg (2004)
2. Physik, F.A.: Poggendorff's Annel (1855)
3. Horn, B., Schunck, B.: Determining optical Flow. Artificial Intelligence 17(1-3), 185–203 (1981)
4. Raudies, F., Neumann, H.: A model of neural mechanisms in monocular transparent motion perception. Journal of Physiology 104(1-2), 71–83 (2010)
5. Deqing, S., Roth, S., Black, M.: Secrets of Optical Flow Estimation and Their Principles. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 2432–2439 (2010)
6. Tlapale, E., Masson, G., Kornprobst, P.: Modelling the dynamics of motion integration with a new luminance-gated diffusion mechanism. Vision Research 50(17), 1676–1692 (2010)
7. Weickert, J.: Anisotropic Diffusion in Image Processing. Teubner Verlag, Stuttgart (1998)
8. Eifuku, S., Wurtz, R.: Response to motion in extrastriate area MSTl: Disparity sensitivity. Journal of Neurophysiology 82(5), 2462 (1999)
9. Hegde, J., van Essen, D.: Selectivity for complex shapes in primate visual area V2. The Journal of Neuroscience 20, RC61 (2000)
10. Barron, J., Fleet, D., Beauchemin, S.: Performance of Optical Flow Techniques. International Journal of Computer Vision 12(1), 43–77 (1994)
11. Neumann, H., Sepp, W.: Recurrent V1-V2 interaction in early visual boundary processing. Biol. Cyber. 81, 425–444 (1999)
12. Thielscher, A., Neumann, H.: Neural mechanisms of human texture processing: texture boundary detection and visual search. Spatial Vision 18(2), 227–257 (2005)
13. Weitzel, L., Kopecz, K., Spengler, C., Eckhorn, R., Reitboeck, H.: Contour Segmentation with Recurrent Neural Networks of Pulse-Coding Neurons. In: Int. Conference on Computer Analysis of Images and Patterns (2000)
14. Hansen, T., Neumann, H.: A recurrent model of contour integration in primary visual cortex. Journal of Vision 8, 1–25 (2008)

Self-organized Short-Term Memory Mechanism in Spiking Neural Network

Mikhail Kiselev

Megaputer Intelligence Inc. 120 West 7th Street, Suite 314, Bloomington, IN 47404 USA [email protected]

Abstract. The paper is devoted to the implementation and exploration of the evolutionary development of a short-term memory mechanism in spiking neural networks (SNN) starting from an initial chaotic state. Short-term memory is defined here as the ability of a network to store information about recent stimuli in the form of specific neuron activity patterns. Stable appearance of this effect was demonstrated for the so-called stabilizing SNN, the network model proposed by the author. In order to show the desired evolutionary behavior, the network should have a specific topology determined by "horizontal" layers and "vertical" columns.

Keywords: spiking neural networks, spatio-temporal pattern recognition, short-term memory, neural network self-organization.

A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part I, LNCS 6593, pp. 120–129, 2011. © Springer-Verlag Berlin Heidelberg 2011

1 Introduction

Spiking neural networks (SNN) attract more and more attention from neural network researchers, being the most realistic models of ensembles of neurons in the mammalian brain [1] and, on the other hand, having computational power sufficient to perform complex spatio-temporal transformations of the input signal [2]. A significant number of papers have been devoted to SNN learning to recognize dynamic patterns in various domains (for example, [3]), so this aspect of SNN theory is sufficiently explored. However, for purposes of neurophysiological modeling as well as for the creation of SNN-based learning automated control systems (for example, in robotics), pattern recognition is only one of the important problems to be solved. Another crucial component to be modeled is short-term memory. Obviously, it is an important part of any realistic brain model. From the point of view of automated control systems, it is absolutely necessary when factors acting with a significant and unknown temporal delay are present. It is supposed that the short-term memory storing information about the stimuli perceived by the network in the recent past is realized in the form of stable patterns of neuronal activity, in contrast with long-term memory, which is usually associated with permanent changes of neuron parameters (for example, synaptic weights) accumulated during the learning period. For the sake of flexibility and universality, the short-term memory mechanism should be formed as a result of the evolutionary development of the network: it is evident that for big networks and non-trivial tasks this mechanism simply cannot be hard-coded. However, this problem is still unsolved for spiking neural networks; at least, the author is unaware of papers reporting its successful solution. It is the subject of the present work. The use of the so-called stabilizing spiking neural network model, first proposed in [4], has been found to be fruitful for progress in this direction. Further research showed a great potential of this SNN class for the recognition of dynamic patterns and other spatio-temporal regularities in the input signal [5-7]. The stabilizing spiking neural networks are described in the next three sections. After that, in Section 5, we define the problem to be solved more exactly. We discuss the computational experiments and their results in Sections 6 and 7.

2 Stabilizing Spiking Neuron (SS Neuron)

The discussed neuron model can be considered a generalization of the standard LIF model (leaky integrate-and-fire spiking neuron [1]). The neuron receives spikes from other neurons via its synapses, which are either excitatory or inhibitory. It is assumed that all spikes have equal amplitude and negligible duration. The spikes arriving via excitatory and inhibitory synapses contribute to, respectively, the excitatory and inhibitory components of the membrane potential. In our model, these components are not simply summed up as in the majority of neuron models but play different roles in postsynaptic spike generation. We denote them as EPP and IPP to avoid confusion with the standard acronyms EPSP and IPSP. The contribution of an excitatory synapse to the EPP just after a presynaptic spike is equal to $\frac{Ww}{W+w}$, where $w$ is the weight of this synapse and $W$ is a constant. This value is close to $w$ for small $w$ and to $W$ for large $w$. The choice of this function is explained in Section 3 and, in more detail, in [6]. The contribution of an inhibitory synapse to the IPP upon receiving a presynaptic spike is always equal to 1. If no new presynaptic spikes arrive, the contributions of all synapses as well as the EPP and IPP themselves decay exponentially to zero. If the IPP is low, the neuron behavior does not differ from the LIF model: when the EPP reaches the threshold value $U^+_{THR}$, the neuron emits a spike, the contributions of all excitatory synapses and the EPP are reset to zero, and the neuron becomes incapable of emitting new spikes for a certain period (called the absolute refractory period). The IPP has no influence on this process while its value is less than the threshold $U^-_{THR}$. If the EPP reaches $U^+_{THR}$ when the IPP exceeds $U^-_{THR}$, postsynaptic spike suppression takes place, an effect absent in the LIF model. In this case a spike is not generated, although the EPP is reset to 0.

Spike emission as well as spike suppression affects the slowly changing components of the neuron state, which are considered below. The most important of these components is called neuron stability. The presence of this neuron property is the principal distinctive feature of the discussed model. It determines synaptic weight plasticity (which is discussed below) and the degree of neuron vulnerability to inhibition by other neurons in the network. It is assumed that the part of the network which is "trained" to do some job should consist of neurons with a high stability value, which would guarantee preservation of its trained state. The stable part of the network is protected from inhibition by "untrained" (and, therefore, unstable) neurons because in my model inhibitory synapses do not pass spikes from presynaptic neurons with stability significantly lower than the stability of the postsynaptic neuron. SS neuron stability dynamics is determined by the frequency of spikes emitted by the neuron and by the ratio of emitted to suppressed spikes. By itself, the stability falls exponentially to 0, so that any "silent" neuron becomes unstable after some time. Spike emission increases neuron stability; spike suppression decreases it. As a result of this approach, the stable neurons tend to avoid inhibiting each other in an equilibrium state. If there exist two groups of highly stable neurons mutually inhibiting each other then, eventually, one of them wins, so that the other becomes unstable and the synaptic weights of its neurons are redistributed. If it gets a chance to stabilize again, it begins emitting spikes mostly when the neurons of the first group are silent, thus avoiding the mutual inhibition. This mechanism helps to support neuron diversity, i.e. the existence of neural groups with non-correlated (or anti-correlated) activity patterns, which is essential for the realization of the desired evolution process in the network.
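The membrane dynamics described at the beginning of this section can be summarized in a minimal Python sketch. It covers only the fast variables (EPP, IPP, emission and suppression); stability, plasticity and the gating of inhibitory synapses are omitted. Several details are assumptions rather than statements of the model: reading the value 3 given in Section 6 as the constant W, the value of the inhibitory threshold, a common decay constant for EPP and IPP, no refractory period after a suppression, and all names.

```python
import math

class SSNeuronSketch:
    """Fast dynamics of an SS neuron for a 1 ms time step (illustrative only)."""

    def __init__(self, W=3.0, u_plus=10.0, u_minus=1.0, tau=20.0, refractory=10):
        self.W, self.u_plus, self.u_minus = W, u_plus, u_minus
        self.decay = math.exp(-1.0 / tau)   # per-step exponential decay factor
        self.refractory = refractory        # absolute refractory period in steps
        self.epp = 0.0
        self.ipp = 0.0
        self.wait = 0                       # remaining refractory steps

    def step(self, excitatory_weights, inhibitory_spikes):
        """Advance one step.

        excitatory_weights: weights w of excitatory synapses receiving a spike now.
        inhibitory_spikes: number of inhibitory synapses receiving a spike now.
        Returns "emit", "suppress" or None.
        """
        self.epp *= self.decay
        self.ipp *= self.decay
        # Each excitatory spike contributes W*w/(W+w); each inhibitory spike adds 1.
        self.epp += sum(self.W * w / (self.W + w) for w in excitatory_weights)
        self.ipp += inhibitory_spikes
        if self.wait > 0:
            self.wait -= 1
            return None
        if self.epp >= self.u_plus:
            self.epp = 0.0                  # reset in both cases
            if self.ipp > self.u_minus:
                return "suppress"
            self.wait = self.refractory
            return "emit"
        return None
```

With these assumed values a single synapse can contribute at most W = 3 to the EPP, so at least four nearly simultaneous excitatory spikes are needed to reach the threshold of 10.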

3 Synaptic Plasticity

As in the majority of neural network models, learning of the SS neural network (or, rather, evolution of the network in interaction with its informational environment) is realized as a process of synaptic weight adjustment, supplemented in my model by a neuron stabilization/destabilization process. It should be noted that only excitatory synapses are plastic in this model. The SS neuron synaptic weight plasticity law is similar to the classic STDP [1], however, with the following substantial amendment. While in the majority of known models synaptic long-term depression is determined by a positive time difference between presynaptic and postsynaptic spikes, in the described model synaptic depression is a consequence of postsynaptic spike suppression. Besides that, my model of synaptic plasticity implies weight redistribution in the strict sense of this word: the total sum of the synaptic weights of every neuron remains constant. After a postsynaptic spike is generated, weights of the synapses with a small contribution to the EPP (which therefore have not received presynaptic spikes for a long time) are redistributed in favor of the synapses that have received presynaptic spikes recently. In the case of spike suppression, the redistribution is performed in the opposite direction. The total amount of redistributed weight depends on the neuron stability. It is very small for neurons with high stability, so that they are almost non-plastic. Constancy of the total weight is very important from the homeostatic point of view since it does not allow neurons to shift to a "dead" state with almost zero weights or to become hyperactive. Besides, it is crucial that in this model the maximum contribution of a synapse to the EPP in any case remains less than the constant $W$, whose value is several times less than $U^+_{THR}$. This guarantees that spike emission is a result of the collective contribution of several excitatory synapses receiving presynaptic spikes within a sufficiently short period of time. Only under this condition may a neuron serve as a useful operational unit. Let me also note that in my model every neuron connection may have a non-zero signal transmission delay, different for different connections but constant in time. We have considered all principal features of the SS neuron model. The formulae expressing the laws of neuron dynamics and the evolution of the slowly changing components of its state can be found in the Appendix. In the next section we discuss the architecture of the SS neural network, in which the self-organized genesis of the short-term memory mechanism can be observed.
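The idea of weight redistribution with a constant total weight can be illustrated by the following Python sketch. It is not the plasticity law of the model, whose formulae are given in the Appendix: the linear update, the way stability damps the update, the `rate` parameter and the function name are hypothetical and serve only to show the conservation property and the two directions of redistribution.

```python
def redistribute(weights, contributions, stability, emitted, rate=0.1):
    """Move weight between excitatory synapses while keeping sum(weights) constant.

    contributions: current contribution of each synapse to the EPP.
    emitted: True after spike emission, False after spike suppression.
    """
    n = len(weights)
    mean_c = sum(contributions) / n
    # Positive for recently active synapses, negative for silent ones; sums to zero.
    drive = [c - mean_c for c in contributions]
    sign = 1.0 if emitted else -1.0
    step = rate / (1.0 + stability)   # hypothetical: stable neurons are almost non-plastic
    delta = [sign * step * d for d in drive]
    # Shrink the whole update if it would push any weight below zero.
    scale = min([1.0] + [w / -d for w, d in zip(weights, delta) if d < 0])
    return [w + scale * d for w, d in zip(weights, delta)]
```

Because the update terms sum to zero, the total weight is unchanged: emission shifts weight towards the synapses that contributed recently, suppression shifts it away from them, and a high stability value makes either change small.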

4 Network

The network explored in this work has 3 layers consisting of SS neurons with similar properties but with different numbers of excitatory and inhibitory synapses (Fig. 1). Excitatory synapses of the layer 1 neurons are connected with the input nodes of the network (these will be referred to as receptors). This layer serves for spatio-temporal pattern recognition and is similar to the network described in [7]. It is assumed that information about the most recent pattern should be stored in the two other layers, which form a recurrent network: excitatory synapses of the layer 2 neurons are connected with neurons from the 1st and 3rd layers, and excitatory synapses of neurons belonging to layer 3 are connected with neurons from the 2nd layer. Besides that, neurons of every layer have a small number of excitatory connections with neurons of the same layer (these connections are not shown in Fig. 1). In our case all network layers contain an equal number of neurons. Besides the layers, there is one more crucial feature of the network topology, namely, its columnar organization. Neurons in all three layers are partitioned into equal numbers of non-intersecting groups so that each group contains the same number of neurons. Let us label these groups in every layer by consecutive numbers. We call the set of groups with the same label a column. Every column is characterized by the property that no pair of its neurons is connected by an inhibitory link: inhibitory connections exist only between neurons from different columns. Thus, columns form the "vertical" network structure supplementing its "horizontal" layers. It is hard to say to what extent my model is close to neurophysiological reality, but it is worth mentioning that columnar organization is typical for various zones of the mammalian cerebral cortex (see, for example, the review in [8]). However, it should be noted that in the considered network all columns have the same receptive field: for all layer 1 neurons, their presynaptic receptors are randomly selected from the whole set of receptors.
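The columnar constraint on inhibitory connections can be sketched in Python as follows. The default sizes (3 layers, 10 columns, 10 neurons per group) are those reported in the Experiments section; everything else is an assumption made for illustration, in particular that inhibitory presynaptic neurons may be drawn from any layer, that each presynaptic neuron is used at most once per target, and the function names.

```python
import random

def build_columns(n_layers=3, n_columns=10, group_size=10):
    """Return {neuron_id: (layer, column)} for a layered, columnar network."""
    neurons = {}
    nid = 0
    for layer in range(n_layers):
        for column in range(n_columns):
            for _ in range(group_size):
                neurons[nid] = (layer, column)
                nid += 1
    return neurons

def inhibitory_sources(neurons, target, count, rng):
    """Pick presynaptic neurons for the inhibitory synapses of `target`.

    Only neurons from other columns are eligible, so no pair of neurons
    inside one column is ever connected by an inhibitory link.
    """
    _, column = neurons[target]
    candidates = [n for n, (_, c) in neurons.items() if c != column]
    return rng.sample(candidates, count)
```

For example, `inhibitory_sources(build_columns(), 0, 64, random.Random(0))` returns 64 distinct neurons, none of which belongs to the column of neuron 0.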

5 Spatio-temporal Patterns and Short-Term Memory

Let us briefly describe the recognition problem solved by the network. As in the previous works [4-7], our network should learn to determine the subsets of receptors which generate spikes at a significantly higher rate within the same sufficiently short time window. This problem, which serves as a model problem convenient to study, is at the same time quite universal and important for practical applications. It is not very easy because the size of the receptor subsets, the length of the periods of their increased activity, and the degree of this increase are not known a priori. The network, which initially has a chaotic distribution of synaptic weights, should learn to react in a specific way to episodes of increased activity of these subsets and (this is the main subject of the present study) to keep this reaction for a sufficiently long time (for example, until another receptor subset becomes active). Demonstration of this behavior would mean self-organized formation of short-term memory in the network.


Fig. 1. Structure of excitatory (fat arrows) and inhibitory (thin arrows) connections between network layers

6 Experiments

A network consisting of 300 SS neurons was used in all described experiments. Previous studies of stabilizing neural networks showed that several hundred neurons are a good trade-off between the ability of the network to demonstrate interesting and unexpected effects and a reasonable computation time. As mentioned in Section 4, the network has 3 layers with 100 neurons per layer. In addition, the network is partitioned into 10 columns, each including 10 neurons on each layer. Every layer 1 neuron has 64 excitatory synapses and 64 inhibitory synapses; layer 2 neurons have 96 and 128, respectively; layer 3 neurons have 32 and 128. These numbers are the greatest multiples of 32 which are possible for the network connection configuration shown in Fig. 1. The computational experiments were carried out on a 240-core GPGPU Tesla C1060 (http://www.nvidia.com/docs/IO/56483/Tesla_C1060_boardSpec_v03.pdf) in such a way that the computations for each synapse were performed by one thread in a block corresponding to a neuron. Since the optimal organization of computation on this hardware requires the number of threads per block to be a multiple of 32, the number of synapses of each kind for one neuron should also be a multiple of 32. Within these constraints, the presynaptic neurons and receptors were randomly chosen for each neuron. Before the beginning of every experiment, the network was set to a chaotic state: the weights of all excitatory synapses were randomly chosen from the range [0, 2]. The maximum weight value was set to 3, $U^+_{THR} = 10$. The signal delays in interneuron connections were also random numbers, with a log-uniform distribution in the range [1 ms, 64 ms]. In addition, the mean synaptic weight of every neuron was made equal to 1 by means of multiplicative normalization.
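The initial chaotic state described above can be sketched in Python. The ranges are taken from the text; the function names, the use of a seeded `random.Random` generator and the order of the operations (draw first, normalize afterwards, no clipping to the maximum weight) are assumptions.

```python
import math
import random

def init_weights(n_synapses, rng):
    """Random weights from [0, 2], rescaled so that their mean equals 1."""
    w = [rng.uniform(0.0, 2.0) for _ in range(n_synapses)]
    mean = sum(w) / n_synapses
    return [v / mean for v in w]   # multiplicative normalization

def init_delays(n_connections, rng, lo=1.0, hi=64.0):
    """Connection delays in ms with a log-uniform distribution on [lo, hi]."""
    return [math.exp(rng.uniform(math.log(lo), math.log(hi)))
            for _ in range(n_connections)]
```

A log-uniform distribution means that the logarithm of the delay is uniformly distributed, so delays between 1 and 8 ms are as frequent as delays between 8 and 64 ms.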


The experiments were organized in series of 10 experiments so that all experiments within one series were carried out with the same input signal but with different initial chaotic network states. Of course, a greater number of experiments in one series would be desirable; however, it would take an unacceptably long time. In this study many experiment parameters had to be varied, while even a series of this size required 2-3 days of computation. In order to evaluate the results (at least roughly) in neurophysiological terms, the step of network state re-calculation was taken as 1 ms. The postsynaptic potential decay constant was set to 20 ms and the refractory period length to 10 ms. Properties of the signal artificially generated on the network input nodes were constant during one experiment. The signal consisted of a mix of noise (randomly generated spikes) and "patterns", i.e. episodes of high collective activity of certain receptor groups. Onsets of these episodes were also selected randomly. It was expected that, being stimulated by such a signal for a sufficiently long time, the network would come to a certain equilibrium state characterized, in particular, by the absence of strong changes of neuron stability. If the equilibrium was not reached after 300000 seconds of evolution, the experiment was stopped and considered unsuccessful. When the equilibrium was reached, the experiment was declared successful if after every high receptor activity episode corresponding to some pattern the firing rate in a certain neuron group specific for this pattern became sufficiently greater than the average level and, moreover, remained high until the appearance of a different pattern. An experiment series was considered successful if it included at most one unsuccessful experiment. The input signal parameters were different in different experiment series.
The set of varied parameters included: the number of receptors NR, the number of different patterns NP, the part of receptors corresponding to one pattern sP (in this study all patterns had uniform properties), the frequency of pattern emergence fP, the pattern duration TP, the background receptor spike frequency (noise intensity) FB, and the ratio I of receptor spike intensity inside and outside patterns (for the receptors corresponding to these patterns).

7 Results

The most important result of the experiments was the observation of stable and repeatable (in all experiments in a series) self-organized formation of the short-term memory mechanism in the network. Further study aimed to determine the ranges of the input signal parameters for which the effect takes place (in other words, for which the series of experiments are successful; see the previous section). These ranges are shown in Table 1. They are quite narrow but sufficient to conclude that the demonstrated effect is not a result of fine tuning of network parameters for the given receptor signal.

Table 1. Ranges of input signal parameters for which experiments on self-organized short-term memory formation were successful

Parameter   NR     NP    sP     fP        TP        FB       I
minimum     100    2*    0.1    0.01 Hz   0.1 sec   0*       30
maximum     1000   2     0.4*   1 Hz*     1 sec*    0.3 Hz   100

* Further parameter value modification is senseless or has not been tested


M. Kiselev

Fig. 2. Activity records of receptors and neurons in the three different periods of the network evolution

Usually, the network evolution to a state demonstrating the short-term memory effect includes 2 stages. The first stage leads to the formation of layer 1 neuron groups recognizing the patterns and of positive feedback loops consisting of neurons from layers 2 and 3. In the second stage the "useful" loops realizing the short-term memory mechanism are selected from all the loops. As an illustration, Fig. 2 contains 150 ms network activity records (left to right) from the very beginning of network evolution, after the first stage, and after the short-term memory formation. The time axis in Fig. 2 is horizontal. All these records correspond to situations when pattern 1 appears some time after pattern 2. The uppermost section contains receptor spikes. The other 3 sections correspond to the three network layers. Fat dots denote emitted spikes, small dots denote suppressed spikes. Column boundaries are depicted as thin horizontal lines. The leftmost record displays a chaotic reaction of layer 1 neurons to the incoming pattern and almost no reaction from neurons on layers 2 and 3. In the second record we see that the 9th column of layer 1 demonstrates recognition of pattern 1, while the 9th columns of layers 2 and 3 are involved in senseless constant activity completely independent of the input signal. Finally, in the third record we can observe a manifestation of the short-term memory effect: the activity of the 7th column, having been induced by pattern 2, is blocked by the 9th column recognizing pattern 1. After that, the neural loops belonging to column 9 of layers 2 and 3 are activated, thus carrying memory of the appearance of pattern 1.

8 Conclusion

A spiking neural network model is proposed and explored which, starting from an initial chaotic state, can under certain conditions demonstrate self-organized formation of short-term memory mechanisms, i.e. an ability to store information about patterns recognized in the recent past in the form of specific neural activity. It is a novel result for spiking neural networks. The significant condition necessary for obtaining this result is a special network topology based on a combination of 3 "horizontal" layers and several "vertical" columns. If we manage to make the discussed effect more stable and obtainable for wider ranges of signal and network properties, we plan to start work on a model of an SNN-based learnable automated control system, beginning with simple artificial tasks and possibly extending its application area to practical problems.

References

1. Gerstner, W., Kistler, W.: Spiking Neuron Models. In: Single Neurons, Populations, Plasticity. Cambridge University Press, Cambridge (2002)
2. Maass, W., Markram, H.: On the computational power of circuits of spiking neurons. Journal of Computer and System Sciences 69, 593–616 (2004)
3. Wysoski, S.G., Benuskova, L., Kasabov, N.: Adaptive Spiking Neural Networks for Audiovisual Pattern Recognition. In: Ishikawa, M., Doya, K., Miyamoto, H., Yamakawa, T. (eds.) ICONIP 2007, Part II. LNCS, vol. 4985, pp. 406–415. Springer, Heidelberg (2008)
4. Kiselev, M.: Statistical Approach to Unsupervised Recognition of Spatio-temporal Patterns by Spiking Neurons. In: Proceedings of IJCNN 2003, Portland, Oregon, pp. 2843–2847 (2003)
5. Kiselev, M.: SSNUMDL - a network of spiking neurons recognizing spatio-temporal patterns. Neurocomputer 12, 16–24 (2005) (in Russian)
6. Kiselev, M.: Self-organized Spiking Neural Network Recognizing Phase/Frequency Correlations. In: Proceedings of IJCNN 2009, Atlanta, Georgia, pp. 1633–1639 (2009)
7. Kiselev, M.: One layer self-organized spiking neural network recognizing synchrony structure in input signal. Neurocomputer 10, 3–11 (2009) (in Russian)
8. Jones, E., Rakic, P.: Radial Columns in Cortical Architecture: It Is the Composition That Counts. Cerebral Cortex 20, 2261–2264 (2010)

Appendix: One Iteration of an SS Neuron State Recalculation

In the formulae below the subscript i denotes values related to excitatory synapse number i. In a similar manner, the subscript r denotes the r-th inhibitory synapse. The presence of a presynaptic spike on the i-th (or r-th) synapse before the current iteration is indicated by the Boolean value $I_i$ (or, respectively, $I_r$). These spikes were emitted by presynaptic neurons or receptors some number of iterations ago, depending on the interneuron connection delay. In addition, if the stability of an inhibitory presynaptic neuron is less than the model constant $s_W$, called working stability, and is less than the stability of the postsynaptic neuron divided by a number of order 1 (2 in this version of the model), then the respective $I_r$ is false in any case. Besides the neuron state modification, the other result of an iteration is the Boolean value $v'$ indicating spike generation by the given neuron in the given iteration. Primed variables like $v'$ denote final values to be calculated in the current iteration, while variables without a prime correspond to constants, values calculated in the previous iteration, or intermediate values.


Neuron state includes the following components: $s$ – stability, $r$ – refractory period timer, $\tau$ – EPP decay constant, $\Delta s$ – stability increment after spike emission, and, besides these, the values describing the state of its synapses: $\nu$ – current unweighted synapse contribution to EPP or IPP, and $w$ – synaptic weight ($w_r = 1$). In addition, the computational procedure given below uses three intermediate variables: $N_T$ – number of increased excitatory synaptic weights, $R$ – part of weight taken from depressed synapses, $\Delta W$ – total amount of redistributed synaptic weight. The computations also include several constants, three of which, $W$, $U^-_{THR}$ and $U^+_{THR}$, have been considered in Section 2. The others are: $D$ – IPP decay constant, $\tau_{MIN}$ and $\tau_{MAX}$ – minimum and maximum values of the EPP decay constant, $M_\tau$ – speed of EPP decay constant variation, $I_{MIN}$ – minimum equilibrium postsynaptic spike frequency, $\nu_{THR}$ – minimum value of contribution to EPP by potentiated (in case of spike emission) or depressed (in case of spike suppression) synapses, $w_{MAX}$ – maximum synaptic weight, $s_W$ – the neurons with stability exceeding this threshold are considered stable, $p_W$ – decrease of synaptic plasticity for stable neurons, $\alpha$ – constant determining the stability decrease rate for silent neurons, $\Delta s_{MAX}$ – maximum value of the stability increment.

Iteration includes the following operations (in order of their enumeration):

1. if $I_i$ then $\nu_i \leftarrow 1$ else $\nu_i \leftarrow \left(1 - \frac{1}{\tau}\right)\nu_i$
2. if $I_r$ then $\nu_r' \leftarrow 1$ else $\nu_r' \leftarrow D\nu_r$
3. $EPP \leftarrow \sum_i \frac{W \nu_i w_i}{W + w_i}$
4. $IPP \leftarrow \sum_r \nu_r'$
5. $\upsilon \leftarrow r = 0 \;\&\; EPP > U^+_{THR}$
6. if $\upsilon$ then $\tau' \leftarrow \max(\tau M_\tau, \tau_{MIN})$ else $\tau' \leftarrow \min(\tau M_\tau^{-I_{MIN}}, \tau_{MAX})$
7. $v' \leftarrow \upsilon \;\&\; IPP < U^-_{THR}$
8. if $\neg\upsilon$ then $N_T \leftarrow 0$ else if $v'$ then $N_T \leftarrow |\{i \mid \nu_i \ge \nu_{THR} \;\&\; w_i < w_{MAX}\}|$ else $N_T \leftarrow |\{i \mid \nu_i < \nu_{THR} \;\&\; w_i < w_{MAX}\}|$
9. if $s < s_W$ then $R \leftarrow H$ else $R \leftarrow H p_W$
10. if $v'$ then $R \leftarrow R\,\Delta s$ else $R \leftarrow R\,\Delta s_{MAX}$
11. if $N_T = 0$ then $\Delta W \leftarrow 0$ else if $v'$ then $\Delta W \leftarrow R \sum_{i \in \{i \mid \nu_i < \nu_{THR}\}} w_i$ else $\Delta W \leftarrow R \sum_{i \in \{i \mid \nu_i \ge \nu_{THR}\}} w_i$
12. if $N_T = 0$ then $w_i' \leftarrow w_i$ else if $(v' \;\&\; \nu_i < \nu_{THR}) \vee (\neg v' \;\&\; \nu_i \ge \nu_{THR})$ then $w_i' \leftarrow (1 - R)\,w_i$ else if $w_i < w_{MAX}$ then $w_i' \leftarrow w_i + \frac{\Delta W}{N_T}$ else $w_i' \leftarrow w_i$
13. $s \leftarrow s - \alpha I_{MIN}\, s_W\, e^{-s/s_W}$
14. if $v'$ then $s' \leftarrow s + \Delta s$ else if $\upsilon \;\&\; \neg v'$ then $s' \leftarrow s - \Delta s_{MAX}$ else $s' \leftarrow s$
15. if $\upsilon$ then $\Delta s' \leftarrow \min(\Delta s + I_{MIN}, \Delta s_{MAX})$ else $\Delta s' \leftarrow \frac{\Delta s + I_{MIN}}{2}$
16. if $\upsilon$ then $\nu_i' \leftarrow 0$ else $\nu_i' \leftarrow \nu_i$
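The first part of the iteration (update of synaptic contributions, computation of EPP and IPP, and the spike decision in steps 1 to 5 and 7) might be sketched as below. This is only one reading of the listing: the saturating form `W * w / (W + w)` of the excitatory sum, the function name, and all numeric values in the usage lines are assumptions made for illustration, and the plasticity and stability steps are deliberately left out.

```python
import numpy as np

def potentials_and_spike(nu_exc, w_exc, spikes_exc, nu_inh, spikes_inh,
                         tau, D, W, r, u_plus, u_minus):
    """Hypothetical sketch of steps 1-5 and 7 for a single neuron."""
    # Steps 1-2: a presynaptic spike resets the contribution to 1, otherwise it decays.
    nu_exc = np.where(spikes_exc, 1.0, (1.0 - 1.0 / tau) * nu_exc)
    nu_inh = np.where(spikes_inh, 1.0, D * nu_inh)
    # Step 3 (as read here): each excitatory weight enters as W * w / (W + w).
    epp = float(np.sum(W * nu_exc * w_exc / (W + w_exc)))
    # Step 4: inhibitory weights equal 1, so IPP is a plain sum.
    ipp = float(np.sum(nu_inh))
    # Step 5: threshold crossing outside the refractory period.
    candidate = (r == 0) and (epp > u_plus)
    # Step 7: the spike is emitted only if inhibition is below its threshold.
    fired = candidate and (ipp < u_minus)
    return nu_exc, nu_inh, epp, ipp, candidate, fired

# Usage with made-up numbers: four excitatory synapses all receive a spike.
_, _, epp, ipp, candidate, fired = potentials_and_spike(
    nu_exc=np.zeros(4), w_exc=np.ones(4), spikes_exc=np.ones(4, dtype=bool),
    nu_inh=np.array([0.5]), spikes_inh=np.array([False]),
    tau=20.0, D=0.9, W=3.0, r=0, u_plus=2.0, u_minus=1.0)
```

A threshold crossing that is vetoed by inhibition (`candidate` true, `fired` false) corresponds to what the text calls a suppressed spike.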

Approximation of Functions by Multivariable Hermite Basis: A Hybrid Method

Bartlomiej Beliczynski

Warsaw University of Technology, Institute of Control and Industrial Electronics, ul. Koszykowa 75, 00-662 Warszawa, Poland
[email protected]

Abstract. In this paper an approximation of multivariable functions by a Hermite basis is presented and discussed. The basis considered here is constructed as a product of one-variable Hermite functions with adjustable scaling parameters. The approximation is calculated via a hybrid method: the expansion coefficients are obtained from explicit, non-search formulae, and the scaling parameters are determined via a search algorithm. An excessive number of Hermite functions is initially calculated. To constitute the approximation basis, only those functions are taken which ensure the fastest error decrease down to a desired level. Working examples are presented, demonstrating a very good generalization property of this method.

Keywords: Function approximation, Neural networks, Orthonormal basis.

1 Introduction

A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part I, LNCS 6593, pp. 130–139, 2011. © Springer-Verlag Berlin Heidelberg 2011

Thanks to their elegance and usefulness, Hermite polynomials and Hermite functions have for many years been attractive in various fields of science and engineering. They have proved to be useful tools in the quantum mechanics of harmonic oscillators, ultra-high-band telecommunication channels, ECG data compression and various sorts of approximation tasks. A set of Hermite functions forming an orthonormal basis is naturally suitable for approximation, classification and data compression tasks. These basis functions are defined over the set of real numbers R and can be calculated recursively. The approximating function coefficients can be determined relatively easily to achieve the best approximation property. Since Hermite functions are eigenfunctions of the Fourier transform, time and frequency spectra are approximated simultaneously. Each subsequent basis function extends the frequency bandwidth within a limited range of well-concentrated energy; see for instance [1]. By introducing a scaling parameter we may control this bandwidth, influencing at the same time the dynamic range of the input argument. As pointed out in [2], the product of time and frequency bandwidths for Hermite functions is the largest over the set of continuous functions. Hermite functions display various geometrical shapes controlled by simple parameter(s). It has been suggested to use Hermite functions as activation functions in neural schemes. In [3], a so-called "constructive" approximation scheme is used. It is a type of incremental approximation developed in [4], [5]. Every node in the hidden layer has a different activation function, so intuitively the most appropriate shape can be applied. However, in such an approach the orthogonality of Hermite functions is not really exploited. If one-variable Hermite functions are extended to two variables, the approximation retains the same useful properties and turns out to be very suitable for image compression tasks. For the n-variable case, although the main features are the same, the whole process becomes more complicated. The biggest advantage of approximation by a Hermite basis is that, due to its orthonormality, the approximation does not involve search algorithms. However, for the initial step of approximation one has to consider the time and frequency bandwidths. In the one-variable case, these two bandwidths can be controlled by a simple scaling parameter which can be selected to some extent arbitrarily. Choosing appropriate scaling parameters in the multivariable case is much more difficult. We therefore suggest using a search algorithm for that purpose, while the expansion coefficients are calculated explicitly via appropriate formulae. Because approximation by an orthonormal basis is numerically very efficient, one can take advantage of that and calculate a larger number of basis functions, then select from them only those which contribute the most to the decrease of the approximation error. It seems that this basis selection procedure is the main reason for the good generalization property of this method. This paper is organized as follows. In Section 2 basic facts about approximation needed for later use are recalled. In Section 3 one-variable Hermite functions, the basic components for the multivariable case, are briefly described. 
Then we present our results in Section 4, describing multivariable Hermite basis construction, scaling parameters selection, ﬁnal choice of basis functions and working examples. Finally in Section 5, conclusions are drawn.

2 Approximation Framework

Some selected facts on function approximation useful for this paper will be recalled. Let us consider the following function

$$f_{n+1} = \sum_{i=0}^{n} w_i g_i, \tag{1}$$

where $g_i \in G \subset H$, $H = (H, \|\cdot\|)$ is a Hilbert space, and $w_i \in \mathbb{R}$, $i = 0, \ldots, n$. For any function $f$ from a Hilbert space $H$ and a closed (finite dimensional) subspace $G \subset H$ with basis $\{g_0, \ldots, g_n\}$ there exists a unique best approximation of $f$ by elements of $G$ ([6]). Let us denote it by $g_b$. Because the error of the best approximation is orthogonal to all elements of the approximation space, $f - g_b \perp G$, the coefficients $w_i$ may be calculated from the following set of linear equations

$$\langle g_i, f - g_b \rangle = 0 \quad \text{for } i = 0, \ldots, n, \tag{2}$$

where $\langle \cdot, \cdot \rangle$ denotes the inner product.


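Formula (2) can also be written as $\langle g_i, f - \sum_{k=0}^{n} w_k g_k \rangle = \langle g_i, f \rangle - \sum_{k=0}^{n} w_k \langle g_i, g_k \rangle = 0$ for $i = 0, \ldots, n$, or in the matrix form

$$\Gamma w = G_f \tag{3}$$

where $\Gamma = [\langle g_i, g_j \rangle]$, $i, j = 0, \ldots, n$, $w = [w_0, \ldots, w_n]^T$, $G_f = [\langle g_0, f \rangle, \ldots, \langle g_n, f \rangle]^T$ and "T" denotes transposition. Because there exists a unique best approximation of $f$ in the $(n+1)$-dimensional space $G$ with basis $\{g_0, \ldots, g_n\}$, the matrix $\Gamma$ is nonsingular and $w_b = \Gamma^{-1} G_f$. For any basis $\{g_0, \ldots, g_n\}$ one can find an orthonormal basis $\{e_0, \ldots, e_n\}$, with $\langle e_i, e_j \rangle = 1$ when $i = j$ and $\langle e_i, e_j \rangle = 0$ when $i \neq j$, such that $\mathrm{span}\{g_0, \ldots, g_n\} = \mathrm{span}\{e_0, \ldots, e_n\}$. In such a case, $\Gamma$ is a unit matrix and

$$w_b = [\langle e_0, f \rangle, \langle e_1, f \rangle, \ldots, \langle e_n, f \rangle]^T. \tag{4}$$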

Finally (1) will take the form

$$f_{n+1} = \sum_{i=0}^{n} \langle e_i, f \rangle e_i. \tag{5}$$

The squared error $\|error_{n+1}\|^2 = \langle f - f_{n+1}, f - f_{n+1} \rangle$ of the best approximation of a function $f$ in the basis $\{e_0, \ldots, e_n\}$ is thus expressible by

$$\|error_{n+1}\|^2 = \|f\|^2 - \sum_{i=0}^{n} w_i^2. \tag{6}$$

In a typically stated approximation problem, a basis of n + 1 functions {e0 , e1 , .. ., en } is given and we are looking for their expansion coeﬃcients wi = ei , f , i = 0, 1, ..., n. According to formula (6) those expansion coeﬃcients are contributing directly to the error decrease, and they can be used to order the basis from the most to the least signiﬁcant as far as error decrease is concerned.
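The relations (3) to (6) can be checked numerically. The following is a minimal sketch under stated assumptions: functions are sampled on a grid, the inner product is replaced by a Riemann sum, the target `exp(t)` and the monomial basis are arbitrary choices for illustration, and the orthonormal basis is obtained by classical Gram-Schmidt.

```python
import numpy as np

# Assumption: functions are represented by samples and <x, y> by a Riemann sum.
t = np.linspace(-1.0, 1.0, 2001)
dt = t[1] - t[0]

def inner(x, y):
    return float(np.sum(x * y) * dt)

f = np.exp(t)                              # function to approximate (arbitrary choice)
G = np.stack([t**k for k in range(4)])     # basis g_0..g_3: monomials, not orthonormal

# Matrix form (3): Gamma w = G_f, solved for the best coefficients w_b.
Gamma = np.array([[inner(gi, gj) for gj in G] for gi in G])
G_f = np.array([inner(gi, f) for gi in G])
w_b = np.linalg.solve(Gamma, G_f)
g_b = w_b @ G

# Orthonormal basis spanning the same subspace (Gram-Schmidt).
E = []
for g in G:
    e = g - sum(inner(q, g) * q for q in E)
    E.append(e / np.sqrt(inner(e, e)))
E = np.stack(E)

w = np.array([inner(e, f) for e in E])     # coefficients as in (4)
f_approx = w @ E                           # expansion (5)

err_direct = inner(f - f_approx, f - f_approx)
err_formula = inner(f, f) - float(np.sum(w**2))   # right-hand side of (6)
residual = [inner(g, f - g_b) for g in G]         # orthogonality condition (2)
```

Both routes give the same approximant; the orthonormal route needs no linear solve, which is the efficiency the text refers to.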

3 One-Variable Hermite Functions

Our multivariable basis for approximation will be composed of one-variable Hermite functions, so we will briefly describe these components. Let us consider the space $L^2(-\infty, +\infty)$ with the inner product defined as $\langle x, y \rangle = \int_{-\infty}^{+\infty} x(t) y(t)\,dt$. In such a space a sequence of orthonormal functions can be defined as follows (see for instance [6]):

$$h_0(t), h_1(t), \ldots, h_n(t), \ldots \tag{7}$$

where

$$h_n(t) = c_n e^{-\frac{t^2}{2}} H_n(t); \quad H_n(t) = (-1)^n e^{t^2} \frac{d^n}{dt^n}\left(e^{-t^2}\right); \quad c_n = \frac{1}{(2^n n! \sqrt{\pi})^{1/2}} \tag{8}$$

and $H_n(t)$ is a polynomial.


The polynomials $H_n(t)$ are called Hermite polynomials and the functions $h_n(t)$ Hermite functions. According to (8), the first several Hermite functions can be calculated as

$$h_0(t) = \frac{1}{\pi^{1/4}} e^{-\frac{t^2}{2}}; \quad h_1(t) = \frac{1}{\sqrt{2}\,\pi^{1/4}} e^{-\frac{t^2}{2}}\, 2t; \tag{9}$$

$$h_2(t) = \frac{1}{2\sqrt{2}\,\pi^{1/4}} e^{-\frac{t^2}{2}} (4t^2 - 2); \quad h_3(t) = \frac{1}{4\sqrt{3}\,\pi^{1/4}} e^{-\frac{t^2}{2}} (8t^3 - 12t) \tag{10}$$

Plots of several Hermite functions are shown in Fig. 1.

Fig. 1. Hermite functions h0 , h1 , h9 (plot legend: h0, h3, h9; horizontal axis from −8 to 8)
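Since the Hermite functions can be calculated recursively, a short sketch may help. It uses the standard three-term recurrence for orthonormal Hermite functions rather than the derivative formula in (8); the function name `hermite_functions` and the grid are assumptions made for illustration.

```python
import numpy as np

def hermite_functions(t, n_max):
    """Return h[k] = h_k(t) for k = 0..n_max via the three-term recurrence."""
    t = np.asarray(t, dtype=float)
    h = np.empty((n_max + 1,) + t.shape)
    h[0] = np.pi ** (-0.25) * np.exp(-t**2 / 2)
    if n_max >= 1:
        h[1] = np.sqrt(2.0) * t * h[0]
    for n in range(1, n_max):
        # h_{n+1} = sqrt(2/(n+1)) t h_n - sqrt(n/(n+1)) h_{n-1}
        h[n + 1] = np.sqrt(2.0 / (n + 1)) * t * h[n] - np.sqrt(n / (n + 1)) * h[n - 1]
    return h

t = np.linspace(-12.0, 12.0, 4801)
dt = t[1] - t[0]
h = hermite_functions(t, 9)

# Gram matrix of h_0..h_9 under a Riemann-sum inner product; should be close to identity.
gram = h @ h.T * dt

# Closed form of h_3 from (10), for comparison with the recurrence.
h3_closed = np.exp(-t**2 / 2) * (8 * t**3 - 12 * t) / (4 * np.sqrt(3.0) * np.pi**0.25)
```

The recurrence avoids evaluating large factorials and high-degree polynomials separately, which tends to be more stable numerically for higher indices.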

One can see that increasing the indices of Hermite functions enlarges their bandwidths in time and frequency. So when approximating a function, it is reasonable to start from basis functions with lower indices and gradually move to higher ones. If the approximated function is not located within the range of a Hermite function as displayed in Fig. 1, then one can modify the basis (7) by scaling the variable $t$ via a parameter $\sigma \in (0, \infty)$. If one substitutes $t := t/\sigma$ into (8) and modifies $c_n$ to ensure orthonormality, then

$$h_n(t, \sigma) = c_{n,\sigma}\, e^{-\frac{t^2}{2\sigma^2}} H_n\!\left(\frac{t}{\sigma}\right) \quad \text{where } c_{n,\sigma} = \frac{1}{(\sigma 2^n n! \sqrt{\pi})^{1/2}} \tag{11}$$

and

$$h_n(t, \sigma) = \frac{1}{\sqrt{\sigma}}\, h_n\!\left(\frac{t}{\sigma}\right) \quad \text{and} \quad \hat{h}_n(\omega, \sigma) = \sqrt{\sigma}\, \hat{h}_n(\sigma\omega), \tag{12}$$

where the hat denotes the Fourier transform.


Note that $h_n$ as defined by (11) is a function of two arguments, whereas $h_n$ as defined by (8) has only one argument. These functions are related by (12). Thus by introducing the scaling parameter $\sigma$ into (11) one may adjust both the dynamic range of the input argument of $h_n(t, \sigma)$ and its frequency bandwidth

$$t \in \left[-\sigma\sqrt{2n+1},\ \sigma\sqrt{2n+1}\right]; \quad \omega \in \left[-\frac{1}{\sigma}\sqrt{2n+1},\ \frac{1}{\sigma}\sqrt{2n+1}\right] \tag{13}$$

Suppose that a one-variable function $f$ defined over the range of its argument $t \in [-t_{max}, t_{max}]$ has to be approximated by using Hermite expansions. Assume that the retained angular frequency of the function should be at least $\omega_r$. Then, according to (13), the following two conditions should be fulfilled

$$\sigma\sqrt{2n+1} \ge t_{max} \quad \text{and} \quad \frac{1}{\sigma}\sqrt{2n+1} \ge \omega_r \tag{14}$$

or

$$\sigma \in [\sigma_l, \sigma_h] \quad \text{where } \sigma_l = \frac{t_{max}}{\sqrt{2n+1}} \text{ and } \sigma_h = \frac{\sqrt{2n+1}}{\omega_r} \tag{15}$$

One would expect that $\sigma_l \le \sigma_h$, which is equivalent to

$$t_{max}\,\omega_r \le 2n + 1 \tag{16}$$

In order to preserve orthonormality of the set $\{h_0(t, \sigma), h_1(t, \sigma), \ldots, h_n(t, \sigma)\}$, $\sigma$ must be chosen the same for all functions $h_i(t, \sigma)$, $i = 0, \ldots, n$. The loss of basis orthonormality due to basis truncation, widely discussed in this context, is not crucial in many practical cases [7].
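As a small worked example of conditions (14) to (16), the admissible interval for the scaling parameter can be computed directly. The helper name `sigma_range` and the numbers used below are illustrative assumptions, not values taken from the paper.

```python
import math

def sigma_range(t_max, omega_r, n):
    """Return (sigma_l, sigma_h, feasible) following (15) and (16)."""
    root = math.sqrt(2 * n + 1)
    sigma_l = t_max / root          # smallest sigma that still covers [-t_max, t_max]
    sigma_h = root / omega_r        # largest sigma that still retains frequency omega_r
    feasible = t_max * omega_r <= 2 * n + 1   # condition (16)
    return sigma_l, sigma_h, feasible

# Hypothetical case: t_max = 3, omega_r = 2 rad/s, highest index n = 4.
sigma_l, sigma_h, ok = sigma_range(t_max=3.0, omega_r=2.0, n=4)
```

For these numbers the interval is [1.0, 1.5]; if `feasible` comes out false, (16) says that more basis functions (a larger n) are needed before any sigma can satisfy both conditions.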

4 Multivariable Function Approximation

4.1 Multivariable Hermite Basis

Let the function $f$ to be approximated belong to a Hilbert space, $f \in H$, $H = (H, \|\cdot\|)$, and be a function of $m$ variables. Let us denote it explicitly as $f(x_1, x_2, \ldots, x_m)$. Let the one-variable Hermite function be denoted as

$$h_i(x_j, \sigma_j), \quad \text{where } j \in \{1, \ldots, m\} \text{ and } i \in \{0, 1, \ldots, n\} \tag{17}$$

and let the multivariable basis function $h_l(x_1, x_2, \ldots, x_m, \sigma_1, \sigma_2, \ldots, \sigma_m)$ be the following

$$h_l(x_1, x_2, \ldots, x_m, \sigma_1, \sigma_2, \ldots, \sigma_m) = h_{i_1}(x_1, \sigma_1)\, h_{i_2}(x_2, \sigma_2) \cdots h_{i_m}(x_m, \sigma_m) \tag{18}$$

where $i_1, i_2, \ldots, i_m \in \{0, 1, \ldots, n\}$. Clearly, for each of the $m$ variables there are $n + 1$ indices of Hermite functions. This gives a total of $(n + 1)^m$ basis functions. They can be enumerated as

$$l = \sum_{j=1}^{m} i_j (n + 1)^{j-1} \tag{19}$$

so $l \in \{0, 1, \ldots, (n + 1)^m - 1\}$.
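The enumeration (19) is a mixed-radix (base n + 1) encoding of the index tuple. A brief sketch, with the hypothetical helper names `flat_index` and `multi_index`, shows the mapping in both directions:

```python
from itertools import product

def flat_index(indices, n):
    """l = sum_j i_j (n+1)^(j-1) for j = 1..m, as in (19); indices = (i_1, ..., i_m)."""
    return sum(i * (n + 1) ** j for j, i in enumerate(indices))

def multi_index(l, n, m):
    """Inverse mapping: recover (i_1, ..., i_m) from l."""
    out = []
    for _ in range(m):
        l, i = divmod(l, n + 1)
        out.append(i)
    return tuple(out)

# Two variables with indices 0, 1, 2 along each axis: nine basis functions in total.
n, m = 2, 2
table = {flat_index(idx, n): idx for idx in product(range(n + 1), repeat=m)}
```

With n = 2 and m = 2, the pair (i1, i2) = (1, 0) maps to l = 1 and (1, 2) maps to l = 7.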


Denoting $x = (x_1, x_2, \ldots, x_m)$ and $\sigma = (\sigma_1, \sigma_2, \ldots, \sigma_m)$, instead of $h_l(x_1, x_2, \ldots, x_m, \sigma_1, \sigma_2, \ldots, \sigma_m)$ we will write in short $h_l(x, \sigma)$ or $h_l$. Finally, the multivariable basis is the following

$$h_0, h_1, \ldots, h_{(n+1)^m - 1} \tag{20}$$

One can easily verify that the multivariable basis is orthonormal, i.e.

$$\langle h_i, h_j \rangle = \begin{cases} 1 & \text{for } i = j \\ 0 & \text{elsewhere} \end{cases}$$

The approximant $f_{(n+1)^m}$ of $f$ will be expressed as

$$f_{(n+1)^m}(x, \sigma) = \sum_{l=0}^{(n+1)^m - 1} w_l\, h_l(x, \sigma) \tag{21}$$

where

$$w_l = \langle h_l, f \rangle. \tag{22}$$

$f_{(n+1)^m}$ approaches the function $f$ if the number of elements $n$ goes to infinity, $f = f_\infty = \sum_{l=0}^{\infty} w_l h_l$. An interesting survey of mathematical research on multivariable polynomials and Hermite interpolation can be found in [8].

4.2 Scaling Parameters

Hermite functions are well localized in frequency and time. If a scaling parameter is introduced, it influences both time and frequency ranges but in opposite ways (13). If it is chosen too small, then a fragment of the function could be poorly approximated. If it is chosen too large, only part of the approximated function spectrum is preserved. If only a one-variable function is being approximated, the scaling parameter $\sigma$ can even be chosen intuitively. If, however, several variables are involved, the best choice is more complicated and must be calculated. We suggest the following criterion

$$\sigma_0 = \arg\min_{\sigma} \left\| f_{(n+1)^m}(x, \sigma) - f(x) \right\|^2$$

Usually, in order to get $\sigma_0$, a number of iterations is needed.

4.3 Basis Selection

If we approximate a function of $m$ variables and use $n + 1$ orthonormal components along each variable, then there will be $(n + 1)^m$ summation terms in (21). For instance, if we approximate a function of 3 variables with 15 Hermite components along each variable, then we have 3375 summation terms. One expects that a significant part of all components has a very small, practically negligible influence on the approximation. As is clearly visible from formula (6), the components associated with large $w_i^2$ (or $|w_i|$) contribute the most to the error decrease. So, taking advantage of the efficiency of approximation by an orthonormal basis, we initially calculate an excessive number of Hermite expansion terms and select only the most significant ones as far as error decrease is concerned. This basis selection can be interpreted as a simple pruning method, a classical neural technique improving generalization; see for instance [9].

4.4 Examples

Example 1. Let the function to be approximated be the following

$$f(x_1, x_2) = x_1 e^{-x_1^2 - x_2^2}. \tag{23}$$

Its plot is presented in Fig. 2. Let us approximate the function in the range $[-3, 3]^2$. We take 41 points along each axis, obtaining in total 1681 pairs of (argument, function value) to be processed. Along each axis the number of Hermite components was set to 3, so every one-variable Hermite function could have indices 0, 1 or 2. We obtained $3^2$ Hermite components. The expansion coefficients (weights) were calculated according to (22). Two scaling factors $\sigma_1$ and $\sigma_2$ were determined via a search-type procedure. Finally we found that $\sigma_1 = \sigma_2 = 0.7071$. The Hermite expansion components (21) were ordered by the squares of their coefficients $w_i$. The first two components are written in (24):

$$f_9(x_1, x_2, \sigma_1, \sigma_2) = w_1 h_1(x_1, \sigma_1) h_0(x_2, \sigma_2) + w_7 h_1(x_1, \sigma_1) h_2(x_2, \sigma_2) + \ldots \tag{24}$$

and their expansion coefficients were $w_1 = 0.6267$ and $w_7 = 1.5613\mathrm{e}{-18}$. It is clear that to approximate this function it is sufficient to take only one node. Finally the result is the following: $f_1(x, \sigma) = 0.6267\, h_1(x, \sigma)$, or

$$f_1(x_1, x_2, 0.7071, 0.7071) = 0.6267\, h_1(x_1, 0.7071)\, h_0(x_2, 0.7071)$$

The $h_0$ and $h_1$ functions are calculated by using (12) and (9). The Mean Squared Error (MSE) of the approximation is 5.6e−12, so the approximant is almost exactly the same as the original. The performance of this approximation is an argument in favour of a good generalization property of this Hermite function based approximation. In fact, with $\sigma_1 = \sigma_2 = \frac{1}{\sqrt{2}}$ one can write the following

$$f(x_1, x_2) = x_1 e^{-x_1^2 - x_2^2} = \left(\frac{\sqrt{\pi}}{2\sqrt{2}}\right) \left(\frac{1}{\sqrt{2}\,\pi^{1/4}\sqrt{\sigma_1}}\, e^{-\frac{x_1^2}{2\sigma_1^2}}\, \frac{2x_1}{\sigma_1}\right) \left(\frac{1}{\pi^{1/4}\sqrt{\sigma_2}}\, e^{-\frac{x_2^2}{2\sigma_2^2}}\right)$$

$$= \left(\frac{\sqrt{\pi}}{2\sqrt{2}}\right) h_1\!\left(x_1, \frac{1}{\sqrt{2}}\right) h_0\!\left(x_2, \frac{1}{\sqrt{2}}\right) = 0.6267\, h_1(x_1, 0.7071)\, h_0(x_2, 0.7071)$$

which means that the generalization from numerical data is almost perfect. We have obtained a function formula which is suitable for use anywhere, including outside the given region $[-3, 3]^2$.
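Example 1 can be re-run in a few lines. The sketch below is an illustration under assumptions, not the author's code: the inner products (22) are approximated by Riemann sums on the 41 by 41 grid, and the search for the scaling parameter is replaced by a coarse scan over a hypothetical list of candidate values with equal scaling on both axes.

```python
import numpy as np

def hermite_scaled(x, n_max, sigma):
    """h_k(x, sigma) = h_k(x / sigma) / sqrt(sigma) for k = 0..n_max, as in (12)."""
    u = np.asarray(x, dtype=float) / sigma
    h = np.empty((n_max + 1,) + u.shape)
    h[0] = np.pi ** (-0.25) * np.exp(-u**2 / 2)
    if n_max >= 1:
        h[1] = np.sqrt(2.0) * u * h[0]
    for n in range(1, n_max):
        h[n + 1] = np.sqrt(2.0 / (n + 1)) * u * h[n] - np.sqrt(n / (n + 1)) * h[n - 1]
    return h / np.sqrt(sigma)

x = np.linspace(-3.0, 3.0, 41)                 # 41 points along each axis
dx = x[1] - x[0]
X1, X2 = np.meshgrid(x, x, indexing="ij")
F = X1 * np.exp(-X1**2 - X2**2)                # function (23) sampled on the grid

def fit(sigma, n_max=2):
    """Coefficients w[i1, i2] and MSE for equal scaling sigma on both axes."""
    h = hermite_scaled(x, n_max, sigma)        # shape (n_max + 1, 41)
    w = np.einsum("ia,jb,ab->ij", h, h, F) * dx * dx   # Riemann-sum version of (22)
    approx = np.einsum("ij,ia,jb->ab", w, h, h)        # expansion (21)
    return w, float(np.mean((approx - F) ** 2))

w, mse = fit(1.0 / np.sqrt(2.0))

# Coarse scan standing in for the search procedure (candidate list is an assumption).
candidates = np.linspace(0.5, 1.0, 11)
best_sigma = float(min(candidates, key=lambda s: fit(s)[1]))
```

Here `w[1, 0]` plays the role of $w_1$ in (24); a finer or gradient-based search over separate values for each axis would be the natural next step.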

Fig. 2. The original function

A more demanding generalization experiment is the following. For every function value, a noise signal is randomly generated in the range [−0.1, 0.1] and added to the function. The noisy function is presented in Fig. 3.

Fig. 3. Random noise added to the function values to be used as an input for approximation algorithm


As in the previous case, only one expansion term was sufficient. Because of the random nature of the experiment, we ran it 5 times and averaged the numbers obtained. As a result, $w_1 = 0.6283$, $\sigma_1 = 0.7039$, $\sigma_2 = 0.7050$ were calculated. Those parameters are very close to the originals. The MSE between the original function and the approximation obtained from the noisy function was 1.42e−5, which seems to be a very good generalization result.

Example 2. In this example the function to be approximated is the following

$$f(x_1, x_2, x_3) = x_1 e^{-x_1^2 - x_2^2} \sin(x_1 + x_2 + x_3). \tag{25}$$

Let us again use the range $[-3, 3]$ along each axis, i.e. $[-3, 3]^3$. We take 21 points along each axis, obtaining in total 9261 pairs of arguments and function values to be processed. Along each axis, the number of Hermite components was again set to 3, so every one-variable Hermite function could have indices 0, 1 or 2. We obtained $3^3 = 27$ Hermite components. The squares of the expansion coefficients (weights), ordered in nonincreasing order, are plotted in Fig. 4. It is clear from this plot that 14 out of 27 Hermite expansion terms are sufficient to approximate function (25). The MSE between the original function and the approximated function is at the level of 3.4e−4. If, instead, one takes only 10 out of 27 terms, this ensures 99% of the error reduction. When, similarly to the previous example, noise generated randomly from the range [−0.1, 0.1] was added and the noisy data were used for function approximation, the difference between the original function (25) and the approximant (MSE) was again at a similar level, 3.6e−4. Again this is a good sign of the generalization ability of this type of Hermite-based approximation.

Fig. 4. Squares of wi (22) versus l from the most significant to the least (logarithmic vertical axis)
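The selection of the most significant terms, as plotted in Fig. 4, might be sketched as follows. The function name `select_terms`, the interpretation of "99% of error reduction" as a fraction of the total sum of squared coefficients, and the coefficient values are all assumptions for illustration.

```python
import numpy as np

def select_terms(w, f_norm_sq, fraction=0.99):
    """Keep the fewest terms, ordered by w_l^2, that reach the requested error reduction.

    By (6), each retained term lowers the squared error by its own w_l^2.
    """
    w = np.asarray(w, dtype=float)
    order = np.argsort(-(w**2), kind="stable")      # most significant first
    gained = np.cumsum(w[order] ** 2)               # cumulative error decrease
    k = int(np.searchsorted(gained, fraction * gained[-1])) + 1
    k = min(k, len(w))
    kept = order[:k]
    remaining_error = f_norm_sq - gained[k - 1]
    return kept, remaining_error

# Hypothetical coefficients and squared norm of f, not values from the paper.
w = np.array([0.02, 0.9, -0.3, 0.001, 0.1])
kept, remaining = select_terms(w, f_norm_sq=0.92)
```

In this toy case three of the five terms already account for more than 99% of the achievable error decrease, so the two smallest coefficients are pruned.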

5 Conclusions

We presented a hybrid method of multivariable function approximation by Hermite basis. The basis is composed from one-variable Hermite functions. Scaling parameters are determined via search algorithm, while expansion coeﬃcients are calculated explicitly from appropriate formulae. Initially we take an excessive number of expansion terms and select only those which contribute the most to the error decrease. This procedure seems to be the reason for a very good generalization property of the method.

References

1. Beliczynski, B.: Properties of the Hermite activation functions in a neural approximation scheme. In: Beliczynski, B., Dzielinski, A., Iwanowski, M., Ribeiro, B. (eds.) ICANNGA 2007, Part II. LNCS, vol. 4432, pp. 46–54. Springer, Heidelberg (2007)
2. Hlawatsch, F.: Time-Frequency Analysis and Synthesis of Linear Signal Spaces. Kluwer Academic Publishers, Dordrecht (1998)
3. Ma, L., Khorasani, K.: Constructive feedforward neural networks using Hermite polynomial activation functions. IEEE Transactions on Neural Networks 16, 821–833 (2005)
4. Kwok, T., Yeung, D.: Constructive algorithms for structure learning in feedforward neural networks for regression problems. IEEE Trans. Neural Netw. 8(3), 630–645 (1997)
5. Kwok, T., Yeung, D.: Objective functions for training new hidden units in constructive neural networks. IEEE Trans. Neural Networks 8(5), 1131–1148 (1997)
6. Kreyszig, E.: Introductory functional analysis with applications. J. Wiley, Chichester (1978)
7. Beliczynski, B., Ribeiro, B.: Some enhancement to approximation of one-variable functions by orthonormal basis. Neural Network World 19, 401–412 (2009)
8. Lorentz, R.: Multivariate hermite interpolation by algebraic polynomials: A survey. Journal of Computational and Applied Mathematics 122, 167–201 (2000)
9. Reed, R.: Pruning algorithms - a survey. IEEE Trans. on Neural Networks 4(5), 740–747 (1993)

Using Pattern Recognition to Predict Driver Intent

Firas Lethaus, Martin R.K. Baumann, Frank Köster, and Karsten Lemmer

Institute of Transportation Systems, German Aerospace Center (DLR), Germany
{firas.lethaus,martin.baumann,frank.koester,karsten.lemmer}@dlr.de

Abstract. Advanced Driver Assistance Systems (ADAS) should correctly infer the intentions of the driver from what is implied by the incoming data available to it. Gaze behaviour has been found to be an indicator of information gathering, and therefore could be used to derive information about the driver’s next planned objective in order to identify intended manoeuvres without relying solely on car data. Previous work has shown that signiﬁcantly distinct gaze patterns precede each of the driving manoeuvres analysed indicating that eye movement data might be used as input to ADAS supplementing sensors, such as CAN-Bus, laser, or radar in order to recognise intended driving manoeuvres. Drivers’ gaze behaviour was measured prior to and during the execution of diﬀerent driving manoeuvres performed in a dynamic driving simulator. The eﬃcacy of Artiﬁcial Neural Network models in learning to predict the occurrence of certain driving manoeuvres using both car and gaze data was investigated, which could successfully be demonstrated with real traﬃc data [1]. Issues considered included the amount of data prior to the manoeuvre to use, the relative diﬃculty of predicting diﬀerent manoeuvres, and the accuracy of the models at diﬀerent pre-manoeuvre times. Keywords: Pattern Recognition, Driver Intent, Driving Manoeuvres, Eye Tracking, Artiﬁcial Neural Networks, Machine Learning, Signal Detection Theory, ROC curves.

1 Introduction

A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part I, LNCS 6593, pp. 140–149, 2011. © Springer-Verlag Berlin Heidelberg 2011

Human error is the main cause of more than 90 percent of traffic accidents [2], which presents an opportunity for the automotive industry to tackle this problem by exploiting Advanced Driver Assistance Systems (ADAS). In general, ADAS should adapt to different situations and manoeuvres, thereby increasing their reliability. The use of manoeuvre recognition can help to avoid mismatches between the driver's intention and the system's reaction. In a situation where a driver intends to change lanes in order to overtake the lead car, an incipient collision warning may confuse and irritate the driver, as this would be a false alarm. The extent to which the driver perceives this confusion or action intrusion is strongly linked to the way the false warning is presented. Haptic warnings in the form of short-duration braking have been demonstrated to be more effective than acoustic warnings [3,4]. However, false haptic warnings distract the driver and can lead to additional driving errors, thus negating the benefit of such systems. Therefore, a collision warning system must be biased towards avoiding false alarms in order to work reliably. Manoeuvre identification enables assistance to be allocated to manoeuvres relevant to the driving situation. In the example described above, the assistance system fails to assist the driver appropriately because the 'collision warning' function has been designed for the manoeuvre 'car following'. Hence, the assistance system interprets the car's fast approach, i.e. its decreasing time-to-collision (TTC), towards the car it will overtake as a critical situation possibly resulting in a collision. The driver's intention to change lanes while TTC is decreasing is not recognised by the system.

1.1 Benefit of Gaze as a Data Source

The recognition of driving manoeuvres can be based on various data sources, such as CAN-Bus, where the change in the vehicle’s motion in response to the driver’s input is used. As a result, a manoeuvre can only be detected once the driver has started carrying it out. However, if manoeuvre recognition is based on the driver’s gaze behaviour, it draws on the cognitive phase of information gathering [5]. Thus, the gaze data source refers to the driver’s intent rather than to the execution of the manoeuvre [6], creating a temporal benefit for gaze-based recognition. The example of collision warning demonstrates the necessity to identify driving manoeuvres in advance in order to keep false warnings to a minimum. This temporal advantage also allows the driver to be kept in the loop while using assistance systems, ensuring that there is enough time to inform drivers and allow them to take appropriate action. Considering the collision warning example, the warning may be given earlier if the driver does not plan to overtake the lead car and if the driver’s intentions are known to the system.

Gaze data as a source for recognising driving manoeuvres could also be applied in the context of a Lane Departure Warning System (LDW), which warns the driver if the vehicle starts to leave the lane even if the driver intends to execute a lane change, or a Lane Change Assist System (LCA), which monitors the adjacent lanes, that is, the area around and behind the vehicle, to avoid lane change collisions and would give an early warning if the driver is unaware of an approaching vehicle in the adjacent lane. A system based on CAN-Bus data and vehicle sensors identifies an intended lane change as soon as the indicator is set and a change occurs in the steering wheel angle. As a result, the warning is given while the driver is already in the beginning phase of executing the lane change manoeuvre.

However, during the cognitive phase of action execution, the resistance of humans to changing their intended behaviour is higher than during the information gathering phase [7,8,9]. Therefore, the use of gaze data could lead to earlier warnings that have a higher impact on the driver.

F. Lethaus et al.

2 Manoeuvre Recognition

A number of studies focussing on recognising and identifying driving manoeuvres have been carried out in real traffic as well as in driving simulators. Hidden Markov Models (HMMs) were applied to vehicle data from real traffic in order to recognise and identify driving manoeuvres using a batch algorithm [10]. Here, entire instances of a manoeuvre were detected, as opposed to a constant stream of real-time data, and contextual information, such as gaze behaviour, lane, and surrounding traffic, was used for the models. Gaze was fed into the model as a discrete signal with six possible values (front road, rear view mirror, right mirror, left mirror, right, and left). Drivers’ gaze behaviour was identified as a relevant feature for driving manoeuvre prediction and recognition, predominantly in connection with lane changes, overtaking, and executing turns. A combination of vehicle and gaze data delivered the best results. Further results showed that the discrimination of manoeuvres such as overtaking and lane change left is relatively poor if based only on vehicle data, and that recognising turns and lane changes requires contextual information. On average, driving manoeuvres were recognised one second prior to a significant change.

A study focussed on continuously recognising lane change manoeuvres in real-time using a moving-base driving simulator was also undertaken [11]. Steering behaviour models based on HMMs were produced, aiming at recognising and characterising emergency and normal lane changes as well as lane keeping manoeuvres. Information on the surrounding situation was ignored. However, exploiting contextual information is indispensable for making solid and robust detections available.

Using a real-time system in a fixed-base driving simulator, the detection of lane change manoeuvres was studied [12,13] by implementing a cognitive driver model in the Adaptive Control of Thought-Rational (ACT-R) cognitive architecture [14]. The driver model was validated by comparing its behaviour with that of drivers. A combination of car and gaze data was used to infer information about drivers’ intentions. The detection of intended lane changes in real-time achieved an accuracy of 85 percent and detection rates of 80 percent within half a second and 90 percent within 1 second after initial behaviour leading into the execution of manoeuvres. It can be concluded from these results that the detection rate can be improved by adding gaze behaviour to recognition models for driving manoeuvres.

2.1 Patterns of Gaze Behaviour

It has been demonstrated that significantly distinct gaze patterns precede each of the driving manoeuvres analysed [15,16], indicating that eye movement data may be used as input to ADAS, supplementing sensors such as CAN-Bus, laser, lidar or radar, in order to recognise intended driving manoeuvres. The data were gathered in a field study with an instrumented vehicle, logging data from the driver, the vehicle, and the environment. Drivers were asked to drive approximately 110 kilometres (approx. 68 miles) on a three-lane and a two-lane motorway as well as on one-lane rural roads. By performing Markov analyses (zero- and first-order) it was found that the tendency to use the left wing mirror more often than the rear view mirror for manoeuvres to the left is reversed for manoeuvres to the right, as also observed by [17]. Overall, it was concluded that the number of mirror inspections increases with the number of lanes of a road, which concurs with conclusions drawn in other studies [18,19,20].

Taking the outcomes of the study described above into account, a model based on gaze data from real traffic was created in order to predict specific driving manoeuvres [1]. A feedforward neural network (FFNN) was trained using the Backpropagation algorithm [21] to predict lane changes. The study using the real-world data demonstrated that, by building models using gaze data, it was possible to discriminate lane change left and lane change right from lane keeping before the manoeuvre actually took place. However, because the data was gathered in a real-world environment, the opportunities to perform these manoeuvres during the trial were relatively few and could not be guaranteed to be the same for every driver, resulting in models based on a small pool of data. It was therefore decided to carry out trials in a simulated environment where the level of traffic and the opportunities to change lanes could be controlled, making it possible to gather a larger volume of data upon which to base predictive models of lane change.
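The zero- and first-order Markov analyses mentioned above can be illustrated with a minimal sketch. It assumes that gaze is available as a sequence of viewing-zone labels, one per glance; the zone names and the example sequence are hypothetical and are not taken from the study.

```python
from collections import Counter, defaultdict


def zero_order(sequence):
    """Zero-order probabilities: relative frequency of each gaze zone."""
    counts = Counter(sequence)
    total = len(sequence)
    return {zone: n / total for zone, n in counts.items()}


def first_order(sequence):
    """First-order transition probabilities P(next zone | current zone)."""
    pair_counts = defaultdict(Counter)
    for current, nxt in zip(sequence, sequence[1:]):
        pair_counts[current][nxt] += 1
    return {
        current: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for current, followers in pair_counts.items()
    }


# Hypothetical glance sequence recorded before a manoeuvre to the left.
glances = ["road", "left_mirror", "road", "rear_mirror",
           "road", "left_mirror", "road", "road"]
p0 = zero_order(glances)
p1 = first_order(glances)
```

Comparing `p0` and `p1` between manoeuvre types is one way to see whether, for example, the left wing mirror is inspected more often than the rear view mirror before a manoeuvre to the left.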

3 Simulator Study

The data were gathered in a dynamic driving simulator in order to provide repeatability of scenarios, that is, each driver was exposed to the same driving conditions, and to provide many safe opportunities to change lane. This was also made possible by controlling the volume of traffic, which is known to be a problematic factor in field studies.

3.1 Driving Task

The study included a total of 10 participants (5 female, 5 male) aged 23 to 36 years (M=29.8, SD=4.6). All had normal vision, had held their driving licence for at least 5 years, and drove more than 10,000 kilometres p.a. (∼6250 miles p.a.). Informed consent was obtained from each participating driver prior to testing. The driving task took place in simulated traffic and comprised a drive on a three-lane and a two-lane motorway, each 70 kilometres (∼43.5 miles) long. Drivers were instructed to drive in the right-most lane throughout the experiment and to use the centre lane (three-lane motorway) or the left-most lane (two-lane motorway) only for the purpose of overtaking lead cars. Each drive took approximately 40 minutes. It started with overtaking a group of lead cars, repeated ten times, followed by 10 kilometres (∼6.2 miles) of car following, and ended with overtaking a single lead car, also repeated ten times. Prior to the beginning of the experiment, drivers were given oral and written instructions, followed by a gaze calibration procedure for the eye tracking system.

3.2 Equipment

The driving task was performed in a dynamic driving simulator whose motion system is based on a hexapod, which allows motion with six degrees of freedom (Figure 1). The cabin hangs below the upper couplings. This allows a larger range of motion in a smaller space than would be possible with simulators whose couplings connect to the bottom of the cabin. A complete real vehicle is mounted in the simulator’s cabin. The vehicle is surrounded by the projection system of the dynamic driving simulator, which provides a wide visual field covering the front and the sides of the vehicle (270° horizontal × 40° vertical) and a high resolution presentation (approx. 9200 × 1280 pixels). The rear view mirror and TFT displays in the side mirrors allow the driver to keep an eye on the rear traffic. A large plasma screen in the back of the vehicle displays the scenery to the rear. Communication between the cockpit and the simulation system is realised via CAN-Bus, which makes it possible to transmit all driver actions and to control the instruments inside the cockpit. The system also allows all inputs and actions made by the driver to be recorded and analysed.

Fig. 1. Dynamic driving simulator at the German Aerospace Center

Eye movements were recorded using a head-mounted eye tracking system, SMI iView X™ HED, and five non-overlapping viewing zones were defined inside the vehicle (windscreen, left window/wing mirror, rear view mirror, speedometer, right-hand side) in order to analyse the driver’s gaze behaviour.

3.3 Data Processing

The gaze data was processed such that the ability to form a predictive model could be investigated in terms of:
– the amount of data available prior to the manoeuvre taking place and
– the amount of time before the manoeuvre occurred at which the prediction was made by the model.

Fig. 2. Gaze data used from a section of typical gaze patterns prior to a lane change left manoeuvre

For further modelling, two groups of data samples were selected, representing 5 second and 10 second windows of data preceding manoeuvres of interest (see Figure 2). Previous work [15,16] showed that a 10 second window of data preceding the manoeuvre was rich enough in information that distinct gaze patterns could be recognised. In order to establish whether there is redundancy within these 10 seconds of data, 5 second samples were also used here so that the predictive accuracy of models built using 10 seconds of data could be compared with that of models built using 5 seconds of data. Figure 2 shows the distribution of gaze behaviour across 5 viewing zones (areas of interest) from a 10 second window of data prior to a lane change manoeuvre to the left.

In this study, each window of data was truncated by 0, 0.5, 1.0, and 1.5 seconds prior to the beginning of a manoeuvre. For instance, the 5-second window was either used as a whole sample (a) or reduced by half a second (b) (=4.5sec data sample), by 1 second (c) (=4sec), or by 1.5 seconds (d) (=3.5sec).

The data were encoded into a format suitable for use in supervised learning, i.e. a vector of input and target data. The input section of the vector described the proportion of time spent looking in the five viewing zones (1=windscreen, 2=left window/wing mirror, 3=rear view mirror, 4=speedometer, 5=right window/wing mirror) during the selected period of time prior to execution of the manoeuvre. The target section of the vector indicated the ’class’ represented by the input vector and used a binary encoding, i.e. if the data represented lane keeping the target output was 0, and if it represented one of the manoeuvres of interest the target was 1. The result was three groups, each consisting of 284 (input, target) vectors, of instances of lane change left, lane change right and lane keeping.
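The encoding described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: gaze is assumed to be a list of zone labels sampled at a fixed rate and ending at the onset of the manoeuvre, truncation is assumed to remove the samples closest to the onset, and the function name, the sampling rate and the example data are hypothetical.

```python
ZONES = (1, 2, 3, 4, 5)  # 1=windscreen, 2=left, 3=rear view mirror, 4=speedometer, 5=right


def encode_window(samples, sample_rate_hz, window_s, truncate_s, is_manoeuvre):
    """Build one (input, target) pair from zone labels ending at manoeuvre onset.

    The input holds the proportion of time spent in each of the five zones;
    the target is 1 for a manoeuvre of interest and 0 for lane keeping.
    """
    n_window = int(round(window_s * sample_rate_hz))
    n_cut = int(round(truncate_s * sample_rate_hz))
    if n_window <= 0 or n_cut >= n_window:
        raise ValueError("window must be longer than the truncation")
    window = samples[-n_window:]             # last window_s seconds before onset
    window = window[:len(window) - n_cut]    # drop the final truncate_s seconds
    if not window:
        raise ValueError("not enough samples for the requested window")
    inputs = [window.count(zone) / len(window) for zone in ZONES]
    target = 1 if is_manoeuvre else 0
    return inputs, target


# Hypothetical 5 s of gaze labels at 2 Hz, truncated by 1 s (a 4 s data sample).
example = [1, 1, 2, 1, 3, 1, 2, 2, 1, 1]
inputs, target = encode_window(example, sample_rate_hz=2, window_s=5.0,
                               truncate_s=1.0, is_manoeuvre=True)
```

Because the inputs are proportions, they sum to 1 whenever every sample falls into one of the five zones.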

4 Artificial Neural Networks

Feedforward neural networks with two hidden nodes and a single output node were trained using the Backpropagation algorithm [21] to be binary classifiers, ideally outputting 1 when an instance of the manoeuvre of interest was detected in the input data and 0 when the inputs indicated that the car was keeping to the lane. The actual output of the ANNs was a real number in the range [0.0, 1.0], which can be taken as a measure of the probability that the manoeuvre of interest has been detected. A threshold, T, was then used in order to decide which of the two classes (lane change, lane keeping) the neural network output represented. That is, the output of the neural network is P(C|x), where C is the class lane change and x is the input vector, and the threshold T is used to decide which class the output represents as follows:

IF P(C|x) > T → lane change
ELSE → lane keeping
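The classifier and decision rule can be sketched in a few lines. The text specifies five inputs, two hidden nodes, one output in [0.0, 1.0] and the threshold rule; the logistic activation function and all weight values below are assumptions made for illustration only (in practice the weights would come from Backpropagation training).

```python
import math


def sigmoid(a):
    """Logistic activation (assumed; the text only states the output range)."""
    return 1.0 / (1.0 + math.exp(-a))


def forward(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """Forward pass of a 5-2-1 feedforward network, returning P(C|x)."""
    hidden = [
        sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)


def classify(p, threshold):
    """Decision rule from the text: lane change if P(C|x) > T, else lane keeping."""
    return "lane change" if p > threshold else "lane keeping"


# Hypothetical weights; real values would be learned from the training data.
hidden_weights = [[-2.0, 3.0, 1.5, 0.0, -1.0], [1.0, -1.0, 0.5, 0.0, 2.0]]
hidden_biases = [0.1, -0.2]
output_weights = [2.5, -1.5]
output_bias = -0.3

x = [0.5, 0.375, 0.125, 0.0, 0.0]  # proportions of time per viewing zone
p = forward(x, hidden_weights, hidden_biases, output_weights, output_bias)
label = classify(p, threshold=0.5)
```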

5 Analysis and Results

It is important to establish the ability of predictive models to correctly identify instances of the phenomenon of interest while avoiding false alarms. The results have been analysed using methods from signal detection theory so that the variation in the neural network models’ sensitivity and specificity as the discriminating threshold is changed can be measured, and the position of the threshold at which the best balance between the two measures exists can be found. Sensitivity measures the proportion of positive examples which are correctly identified (also known as the ’true positive’ rate or ’hit’ rate). Specificity measures the proportion of negative examples which are correctly identified (also known as the ’true negative’ rate). The false positive rate (FPR) is the proportion of negative examples incorrectly identified as positive examples, so that Specificity = (1-FPR). The threshold, T, was varied from 0.0 to 1.0 in order to create plots of the true positive rate (sensitivity) against the false positive rate (1-Specificity), known as ROC curves, which give a graphical representation of the tradeoff between false alarms and a higher detection rate of the phenomenon of interest with changing T.

Table 1. Lane Change Left, highest d’ values with threshold, T

      5 seconds                              10 seconds
      0s     0.5s   1.0s   1.5s   2.0s       0s     0.5s   1.0s   1.5s   2.0s
d’    3.121  2.887  2.551  2.261  2.092      2.611  2.416  2.117  1.95   1.571
T     0.5    0.4    0.45   0.3    0.3        0.4    0.45   0.4    0.35   0.45

Table 2. Lane Change Right, highest d’ values with threshold, T

      5 seconds                              10 seconds
      0s     0.5s   1.0s   1.5s   2.0s       0s     0.5s   1.0s   1.5s   2.0s
d’    2.235  1.956  1.67   1.385  1.099      1.821  1.68   1.347  1.043  0.623
T     0.3    0.35   0.4    0.5    0.55       0.4    0.4    0.45   0.5    0.45
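The threshold sweep that produces the points of a ROC curve can be sketched as follows. The network outputs and targets are made-up example values, and the step of 0.05 between thresholds is an assumption (it matches the granularity of the T values reported in Tables 1 and 2 but is not stated in the text).

```python
def rates(outputs, targets, threshold):
    """Return (true positive rate, false positive rate) at one threshold.

    Both classes must be present in targets, otherwise a rate is undefined.
    """
    tp = sum(1 for o, t in zip(outputs, targets) if t == 1 and o > threshold)
    fn = sum(1 for o, t in zip(outputs, targets) if t == 1 and o <= threshold)
    fp = sum(1 for o, t in zip(outputs, targets) if t == 0 and o > threshold)
    tn = sum(1 for o, t in zip(outputs, targets) if t == 0 and o <= threshold)
    return tp / (tp + fn), fp / (fp + tn)


def roc_points(outputs, targets, n_steps=20):
    """Sweep T from 0.0 to 1.0 and collect (T, TPR, FPR) triples."""
    points = []
    for i in range(n_steps + 1):
        threshold = i / n_steps
        tpr, fpr = rates(outputs, targets, threshold)
        points.append((threshold, tpr, fpr))
    return points


# Hypothetical network outputs and their true classes (1 = lane change).
outputs = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
targets = [1, 1, 1, 0, 0, 0]
curve = roc_points(outputs, targets)
tpr, fpr = rates(outputs, targets, 0.5)
specificity = 1.0 - fpr
```

Plotting TPR against FPR for all triples in `curve` gives the ROC curve; a low threshold yields many detections and many false alarms, a high threshold the opposite.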

Fig. 3. ROC curves for lane change left manoeuvre, 5 second window of data

Fig. 4. ROC curves for lane change left manoeuvre, 10 second window of data

Fig. 5. ROC curves for lane change right manoeuvre, 5 second window of data

Fig. 6. ROC curves for lane change right manoeuvre, 10 second window of data

The sensitivity index d’ (d prime) is a measure of the separation between the true positive rate and the false positive rate (in signal detection theory, the difference between their z-transformed values), with larger values of d’ indicating a better performing model. The d’ values [22] obtained using 5 seconds of data were found to be larger than those obtained using 10 seconds of data when predicting both lane change right and lane change left (see Tables 1 and 2). This indicates that a better model can be obtained using just 5 seconds of data, with the additional data in the 10 second windows constituting noise that the ANN has to learn to ignore. The d’ values were found to be smaller for lane change right than for lane change left for both the 10 second and 5 second windows of data, so the task of predicting lane change right appears to be more difficult than predicting lane change left. Performance of the ANN models decreases as the time before the manoeuvre is increased, as indicated by the changing d’ values and ROC curves (Figures 3-6). However, the ROC curves indicate that the predictions made by the ANNs are far better than chance and are therefore of predictive value. Visual inspection of gaze behaviour shows that planning starts earlier and occurs over a longer period of time for lane change left than for lane change right.
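As a worked illustration of the sensitivity index, the sketch below uses the standard signal detection definition d’ = z(TPR) − z(FPR), where z is the inverse of the standard normal cumulative distribution function. Whether the authors computed d’ in exactly this way (they cite [22]) is an assumption, as are the clipping constant `eps` and the example rates.

```python
from statistics import NormalDist


def d_prime(tpr, fpr, eps=1e-6):
    """Sensitivity index d' = z(TPR) - z(FPR).

    Rates of exactly 0 or 1 have no finite z-score, so they are clipped
    to [eps, 1 - eps]; eps is a hypothetical choice for this sketch.
    """
    z = NormalDist().inv_cdf
    tpr = min(max(tpr, eps), 1.0 - eps)
    fpr = min(max(fpr, eps), 1.0 - eps)
    return z(tpr) - z(fpr)


def best_threshold(points):
    """Pick the (T, TPR, FPR) triple with the highest d' and return (T, d')."""
    threshold, tpr, fpr = max(points, key=lambda p: d_prime(p[1], p[2]))
    return threshold, d_prime(tpr, fpr)


# Hypothetical (T, TPR, FPR) triples from a threshold sweep.
sweep = [(0.3, 0.9, 0.5), (0.5, 0.8, 0.2), (0.7, 0.4, 0.1)]
t_best, d_best = best_threshold(sweep)
```

A model at chance level (TPR equal to FPR) has d’ = 0, which is the baseline against which the values in Tables 1 and 2 can be read.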

6 Conclusion

This initial investigation has shown that gaze data is viable as a ’stand alone’ predictive measure of lane change manoeuvres. The size of the window of data has been shown to be important in producing good predictive models and care should be taken to use as small a window of data as possible in order to avoid the inclusion of irrelevant sections which act as additional noise. It has been concluded that it is more challenging to produce a predictive model for lane change right than lane change left and that better results may be obtained on the lane change right problem by generating a larger dataset and / or using a more complex model, which will be an objective in future work. Future work will also seek to combine gaze data with other available car data in order to produce models with improved predictive capability.

References

1. Lethaus, F.: Using eye movements to predict driving manoeuvres. In: Europe Chapter of the Human Factors and Ergonomics Society Annual Conference, Linköping, Sweden (2009)
2. German Federal Statistical Office: Unfallgeschehen im Straßenverkehr 2006 (Road Traffic Accidents 2006). Federal Statistical Office 2006-PressOffice, Wiesbaden (2007)
3. Dingus, T.A., Hulse, M.C., Barfield, W.: Human-system interface issues in the design and use of advanced traveller information systems. In: Barfield, W., Dingus, T.A. (eds.) Human Factors in Intelligent Transportation Systems, pp. 359–395. Lawrence Erlbaum Associates, London (1998)
4. Suzuki, K., Jansson, H.: An analysis of driver’s steering behaviour during auditory or haptic warnings for the designing of lane departure warning system. JSAE Review 24, 65–70 (2003)
5. Henderson, J.M., Ferreira, F.: Scene perception for psycholinguists. In: Henderson, J.M., Ferreira, F. (eds.) The Interface of Language, Vision, and Action: Eye Movements and the Visual World, pp. 1–58. Psychology Press, New York (2004)
6. Liu, A.: Towards predicting driver intentions from patterns of eye fixations. In: Gale, A.G., Brown, I.D., Haslegrave, C.M., Taylor, S.P. (eds.) Vision in Vehicles VII, pp. 205–212. Elsevier, Amsterdam (1999)
7. Vollrath, M., Totzke, I.: Möglichkeiten der Nutzung unterschiedlicher Ressourcen für die Fahrer-Fahrzeug-Interaktion. In: Proceedings of Der Fahrer im 21. Jahrhundert, Braunschweig (2003)
8. Wickens, C.D.: The structure of attentional resources. In: Nickerson, R. (ed.) Attention and Performance VIII, pp. 239–257. Lawrence Erlbaum Associates, Hillsdale (1980)
9. Wickens, C.D.: Processing resources in attention. In: Parasuraman, R., Davies, D.R. (eds.) Varieties of Attention, pp. 63–102. Academic Press, San Diego (1984)
10. Oliver, N., Pentland, A.P.: Graphical models for driver behavior recognition in a SmartCar. In: Proceedings of IEEE Conference on Intelligent Vehicles, Detroit, MI, USA (2000)
11. Kuge, N., Yamamura, T., Shimoyama, O., Liu, A.: A driver behavior recognition method based on a driver model framework. In: Proceedings of SAE World Congress 2000, Detroit, MI, USA (2000)
12. Salvucci, D.D., Boer, E.R., Liu, A.: Toward an integrated model of driver behavior in a cognitive architecture. Transportation Research Record 1779, 9–16 (2001)
13. Salvucci, D.D.: Inferring driver intent: a case study in lane-change detection. In: Proceedings of the HFES 48th Annual Meeting, Santa Monica, CA, USA (2004)
14. Anderson, J.R., Lebiere, C.: The atomic components of thought. Lawrence Erlbaum Associates, London (1998)
15. Lethaus, F., Rataj, J.: Do eye movements reflect driving manoeuvres? IET Intelligent Transport Systems 1(3), 199–204 (2007)
16. Lethaus, F., Rataj, J.: Using eye movements as a reference to identify driving manoeuvres. In: ATZ | ATZautotechnology (ed.) Proceedings of the FISITA World Automotive Congress 2008, vol. 1. Springer Automotive Media, Wiesbaden (2008)
17. Mourant, R.R., Donohue, R.J.: Acquisition of indirect vision information by novice, experienced, and mature drivers. Journal of Safety Research 9, 39–46 (1977)
18. Pastor, G., Tejero, P., Chóliz, M., Roca, J.: Rear-view mirror use, driver alertness and road type: an empirical study using EEG measures. Transportation Research: Part F 9, 286–297 (2006)
19. Underwood, G., Chapman, P., Brocklehurst, N., Underwood, J., Crundall, D.E.: Visual attention while driving: sequences of eye fixations made by experienced and novice drivers. Ergonomics 46(6), 629–646 (2003)
20. Recarte, M.A., Nunes, L.M.: Effects of verbal and spatial-imagery tasks on eye fixations while driving. Journal of Experimental Psychology: Applied 6(1), 31–43 (2000)
21. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning internal representations by error propagation. In: Rumelhart, D.E., McClelland, J.L. (eds.) Parallel Distributed Processing, vol. 1, pp. 318–362. MIT Press, Cambridge (1986)
22. Brophy, A.L.: Alternatives to a table of criterion values in signal detection theory. Behavior Research Methods, Instruments, & Computers 18, 285–286 (1986)

Neural Networks Committee for Improvement of Metal’s Mechanical Properties Estimates

Olga A. Mishulina¹, Igor A. Kruglov¹, and Murat B. Bakirov²

¹ National nuclear research university “MEPhI”, Moscow, Russia
² “Center of material science and resource”, Moscow, Russia

Abstract. In this paper we discuss the problem of estimating a metal’s mechanical characteristics on the basis of indentation curves. Solving this problem makes it possible to unify computational and experimental methods for controlling the elastic properties of materials at all stages of the equipment life cycle (manufacturing, maintenance, repair). Preliminary experiments based on data obtained by finite element analysis have shown this problem to be ill-posed and impossible to solve with a single multilayered perceptron at the required precision level. To improve the accuracy of the estimates we propose to use a special neural net structure for the neural networks committee decision making. Experimental results have shown an accuracy improvement for estimates produced by the neural networks committee and confirmed their stability.

Keywords: Neural networks committee, ill-posed problem, metal’s mechanical properties.

1 Introduction

Mechanical characteristics of a metal such as yield strength, tensile strength, rate of strain hardening, and characteristics of plasticity are fundamental for the construction and operation of every object made of metal. Traditionally, these basic mechanical properties are determined by a uniaxial tensile test. A sample of the material being tested is placed in a special machine and subjected to tension in the elastic and plastic zones in a way that imitates the metal’s stress conditions during its service life. The result of such a test is a curve showing the dependence between the metal’s stress and strain, known as the stress-strain curve. This method is classified as destructive and cannot always be applied in practice to equipment in use, for example, to control the mechanical characteristics of a nuclear reactor during its operation. In such situations only methods that do not cause significant equipment damage can be applied. One of them is the kinetic indentation method [1,2,3].

The method of kinetic indentation is based on continuous registration of the parameters of elastoplastic indentation in the sample under a stress applied at a right angle to the surface. Thus, we achieve a qualitative analogy with the uniaxial tensile test. Measuring material hardness by the indentation method attracts researchers due to its relative simplicity and the possibility of using it for a rapid test of the current state of metal equipment. Many theoretical and experimental studies have attempted to establish the relationship between the hardness properties of materials and their characteristics of another kind (yield strength, ultimate tension) or to restore the uniaxial tensile curve. However, there is still no exact theoretical approach known that would reveal the relationship between the properties of indentation and stress-strain curves. In this paper a method for stress-strain curve reconstruction is proposed which involves a neural network based [4,5,6] transformation of the indentation curve parameters.

A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part I, LNCS 6593, pp. 150–157, 2011. © Springer-Verlag Berlin Heidelberg 2011

2 Problem Statement and Data Analysis

The problem discussed in this paper is to determine the values of the metal’s tensile properties $\sigma_{iT}$ (yield stress) and $m$ (rate of strain hardening) and to restore the stress-strain curve $\sigma(\varepsilon)$ (stress $\sigma$ vs. strain $\varepsilon$) on the basis of the corresponding indentation curve $F(h)$ (indentation force $F$ vs. strain $h$ induced by the indentor). In addition to presenting indentation results in the form of an $F(h)$ curve, it is useful for further processing to present them in an alternative way, as polynomial (or spline) approximation coefficients of $F(h)$. Thus, we consider two models for the stress-strain state of the metal: $\sigma(\varepsilon)$ and $F(h)$.

The problem of estimating the metal’s tensile characteristics can be divided into two separate tasks, direct and inverse. By the direct task we mean building an $F(h)$ model from a known $\sigma(\varepsilon)$ model (known $\sigma_{iT}$ and $m$ values), and by the inverse task we mean reconstructing a $\sigma(\varepsilon)$ curve from the corresponding $F(h)$ curve. Using this terminology, the purpose of the paper is to solve the inverse task.

At the first step of the inverse task solution we analyze the indentation data set, which was obtained with the help of the Finite Element Analysis method. Each pattern in this dataset is represented in the form of a vector $(x^{(p)}, y^{(p)})^T$, where $p$ is the number of the pattern ($p = 1, 2, \dots, P$), $x^{(p)} = (x_1^{(p)}, x_2^{(p)}, \dots, x_N^{(p)})^T$ is the vector of parameters that describe one indentation curve of the dataset, and $y^{(p)} = (\sigma_{iT}^{(p)}, m^{(p)})^T$ stands for the values of the metal’s mechanical characteristics. The vector $x$ can consist of polynomial approximation coefficients of an indentation curve, absolute and/or relative indentation strain values for certain indentation force levels, or other indentation curve characteristics. Nominally, the task is to derive from the available indentation dataset a mapping $y = \varphi^{-1}(x)$ (the notation $\varphi^{-1}$ is used to show that it is the inverse task solution) that gives precise $y$ values for any allowed $x$ values.
In this paper we propose to construct such a functional mapping in the class of neural networks. Since any experimentally obtained indentation curve will have some inaccuracy in the indentation strain values, the estimates of the $\sigma_{iT}$ and $m$ values cannot be absolutely precise. The question is: “Is it possible to obtain $\sigma_{iT}$ and $m$ estimates ($y$) for a given precision value ($\delta y$) under the condition of inexact measurements?” The following computer experiment was carried out to answer this question.

Let us find out whether there are any patterns $1 \le p_i \le P$, $i = 1, \dots, k$, $k \le P$, in the dataset for which the vectors $x^{(p_i)}$, $i = 1, \dots, k$, are so close that they would not be distinguished if they were obtained from true experimental (not Finite Element Analysis) data. We assume that the vector $x^{(p_i)}$ differs from the vector $x^{(p_j)}$ by no more than $l\%$ if

$$\sum_{n=1}^{N} \frac{\left| x_n^{(p_i)} - x_n^{(p_j)} \right|}{\left| x_n^{(p_j)} \right|} \le 0.01\, l\, N. \qquad (1)$$

At this point two ways of forming the $x$ vectors are considered. The first one sets the components of the $x$ vector to the cubic approximation coefficients of the indentation curve, the second one to the indentation strain values at several fixed levels of $F$ (the force applied to the indentor). The experiment has shown that for either method, within those groups $x^{(p_i)}$, $i = 1, \dots, k$, where one vector differed from another by no more than 1% (the difference rate was calculated using formula (1)), the estimated value $y$ could differ by up to hundreds of percent (see Table 1). This allows us to state that the problem of estimating the mechanical characteristics of a metal from indentation curves is ill-conditioned.

Table 1. Discrepancies of the $\sigma_{iT}$ and $m$ values for close $x$ vectors (mean / maximum discrepancy rate in % for $x$ vectors that differ by no more than 1%)

Input type                                                      $\sigma_{iT}$    $m$
Cubic polynomial approximation coefficients                     0.8 / 13.6       2.3 / 33.3
Metal’s strain values for three force levels (50, 100, 200 N)   18.2 / 46.2      60.4 / 180.0

The results presented in Table 1 reveal peculiar data properties that prevent high-precision results from being obtained. However, the accuracy of the y estimates produced by a single neural network can be increased by using a more sophisticated structure such as a neural networks committee [7].
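The closeness test of formula (1) and the search for near-identical input vectors can be sketched as follows. The function names and the small example vectors are hypothetical and only illustrate the criterion; they are not data from the paper.

```python
def differs_by_at_most(x_i, x_j, l_percent):
    """Formula (1): sum of |x_i[n] - x_j[n]| / |x_j[n]| <= 0.01 * l * N.

    Components of x_j must be non-zero, otherwise the ratio is undefined.
    """
    n = len(x_j)
    total = sum(abs(a - b) / abs(b) for a, b in zip(x_i, x_j))
    return total <= 0.01 * l_percent * n


def close_pairs(patterns, l_percent):
    """Return index pairs (i, j), i != j, whose x vectors satisfy formula (1)."""
    pairs = []
    for i, x_i in enumerate(patterns):
        for j, x_j in enumerate(patterns):
            if i != j and differs_by_at_most(x_i, x_j, l_percent):
                pairs.append((i, j))
    return pairs


# Hypothetical indentation-curve parameter vectors.
patterns = [
    [1.00, 2.00, 3.00],
    [1.01, 2.00, 2.98],
    [1.10, 2.00, 3.00],
]
near = close_pairs(patterns, l_percent=1.0)
```

Note that the criterion is not symmetric, because the denominator uses the second vector; the sketch therefore checks both orderings of every pair. Comparing the y values of the patterns in each returned pair would then give discrepancy rates of the kind reported in Table 1.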

3 Neural Networks Committee

The following general idea for improving the estimates of the metal’s mechanical characteristics is proposed (see Fig. 1). Using different intersecting subsets of the initial dataset, a group of multilayered perceptrons MLP-1, MLP-2, ..., MLP-K is trained to provide the inverse task solution $y = (m, \sigma_{iT})^T$. Thus, we have $K$ estimates $\tilde{y}^{(k)} = (\tilde{m}, \tilde{\sigma}_{iT})^T$ ($k = 1, 2, \dots, K$) of the metal’s mechanical characteristics for a given vector $x$ of indentation curve parameters. For each estimate $\tilde{y}^{(k)}$ a confidence measure $s^{(k)}$ is calculated by an additional neural network. The estimates $\tilde{y}^{(k)}$ with the highest confidence measure values are considered to be the most plausible ones. These estimates are used by the decision making network to produce the final estimate of the metal’s mechanical characteristics.

Fig. 1. Illustration of general neural networks committee approach to get more accurate estimates of metal’s mechanical characteristics

The cornerstone of this method is how to define the confidence measure $s$. There can be multiple ways to determine its value, but in this paper we propose to use a neural network solution of the direct indentation task to calculate it. As was said before, the problem of finding a connection between an indentation curve and the corresponding stress-strain curve consists of two separate tasks: direct (produce an indentation curve for a given stress-strain curve) and inverse (produce a stress-strain curve for a given indentation curve). The inverse indentation task has significant practical meaning but turned out to be ill-posed. Fortunately, the direct task is well-posed, which can be explained as follows. Suppose that we know the true solution of the inverse task: $y = \varphi^{-1}(x)$. Then the absolute error $\Delta y$ depends on the experimental data error $\Delta x$ in the following manner:

$$\frac{\Delta y}{\Delta x} \propto \left\| (\varphi^{-1})'_x \right\|.$$

Since the inverse task is ill-posed, $\left\| (\varphi^{-1})'_x \right\|$ must have a large numerical value. This is confirmed by the data in Table 1. The direct task solution in such a case is represented by the function $x = \varphi(y)$, and the error rate for it is proportional to $1 / \left\| (\varphi^{-1})'_x \right\|$, which is lower than $\left\| (\varphi^{-1})'_x \right\|$ if $\left\| (\varphi^{-1})'_x \right\| > 1$. This leads to the conclusion that the direct task is well-posed, and experiments confirm that this is indeed so for the indentation problem. Our aim, however, is to improve the precision of the ill-posed inverse task, which can be achieved by involving the direct task solution. This is made possible by constructing a neural networks committee which implements the idea illustrated in Fig. 1.

To solve the inverse indentation task we propose to use a neural networks committee whose structure is shown in Fig. 2. It consists of two groups of neural networks.
The ﬁrst one contains K multilayered perceptrons (MLP) that are


O.A. Mishulina, I.A. Kruglov, and M.B. Bakirov

Fig. 2. Neural networks committee architecture

trained to produce an inverse task solution. Each of them provides an estimate of the mechanical characteristics, taking parameters of the indentation curve (coefficients of an approximating cubic polynomial) as its input values. Every MLP has two hidden layers that contain 3 (for $m$) or 5 (for $\sigma_{iT}$) neurons each. All the neurons, except those in the output layer, use the hyperbolic tangent as their activation function. For the output layer a linear function is used.

The second group in the committee consists of a single neural network which produces the direct task solution. We use only one network because the direct task is well-posed and there is no need for extra neural networks. The data provided by this single network is accurate enough.

For each input vector $x$ the first block of neural networks produces $K$ estimates $\tilde{y}^{(k)}$ ($k = 1, 2, \ldots, K$; one estimate per network) of the metal's mechanical characteristics. Some of the $\tilde{y}^{(k)}$ estimates can be close to the true $y$ value, and our aim is to find the closest ones. This is where the direct task solution network comes in. Each $\tilde{y}^{(k)}$ value is successively passed to its input (see Fig. 2), providing the value $\tilde{x}^{(k)}$ at the output. The inverse task networks with labels $k_i$ ($i = 1, 2, \ldots, M$, $M \le K$) for which the difference $\|\tilde{x}^{(k)} - x\|$ ($k = 1, 2, \ldots, K$) has the lowest values (which means the highest values of the confidence $s$) are considered to provide the best estimates $\tilde{y}^{(k_i)}$ of the true $y$ value. This selection method arises from the fact that for the true $\phi$ and $\phi^{-1}$ functions this difference would be equal to zero. At the

Neural Networks Committee for Estimation of Metal’s Mechanical Properties


next step the $M$ chosen estimates $\tilde{y}^{(k_i)}$ are passed to the decision making network (a multilayer perceptron with two hidden layers of 3 neurons each), which produces the final estimate of the mechanical characteristics $\sigma_{iT}$ and $m$.
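The selection step described above can be sketched in a few lines. This is only an illustration under stated assumptions: `select_estimates` and `toy_forward` are hypothetical names, the Euclidean norm is assumed as the distance, and a simple linear map stands in for the trained direct task network.

```python
import numpy as np

def select_estimates(x, y_estimates, forward_model, m):
    """Pick the m inverse-task estimates whose re-simulated inputs are closest to x."""
    y_estimates = np.asarray(y_estimates, dtype=float)
    # Pass every inverse-task estimate through the direct-task model.
    x_tilde = np.array([forward_model(y) for y in y_estimates])
    # A small distance to the measured x is read as a high confidence value.
    distances = np.linalg.norm(x_tilde - np.asarray(x, dtype=float), axis=1)
    best = np.argsort(distances)[:m]
    return best, y_estimates[best]

def toy_forward(y):
    """Hypothetical stand-in for the trained direct-task network."""
    return np.array([2.0 * y[0] + y[1], y[0] - y[1]])

y_true = np.array([1.0, 0.5])
x_measured = toy_forward(y_true)
offsets = ([0.30, -0.20], [0.01, 0.00], [-0.40, 0.50], [0.00, -0.02])
candidates = [y_true + np.array(d) for d in offsets]
idx, chosen = select_estimates(x_measured, candidates, toy_forward, m=2)
```

In this toy setting the two candidates with the smallest perturbations (indices 1 and 3) are selected; in the committee they would then be passed to the decision making network.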

4

Train Procedure

To train the committee we used a dataset containing 86 patterns of indentation curves, built using the Finite Element Analysis method for a metal of a chosen type. The initial dataset was randomly split into two parts (see Fig. 3). The

Fig. 3. Partitioning scheme of the original sample into train and test parts

Fig. 4. Mean relative error histograms for the committee (front histogram) and a single network (back histogram) for 30 diﬀerent test sets


first one contained 80% of all patterns and was used to train and validate all the networks in the committee. The remaining 20% were used to test the trained committee. Each network was trained using its own training and validation subsets, randomly selected from the general training-validation part of the original dataset. This prevented us from obtaining a committee in which all the networks behave alike because they were trained on the same training set. This sample division scheme is based on the cross-validation idea [8] and allows the committee to be trained successfully even when the original dataset contains few patterns.

Fig. 4 shows average relative error histograms for the neural network committee and a single neural network for 30 different train and test sets built according to the scheme described above. These results were obtained for a committee containing 10 neural networks. The selection block passed on the M = 2 most plausible output values, chosen on the basis of the classification results. Complementary experiments showed that the value M = 2 was optimal: if M exceeds 2, the committee gives no improvement in the estimates compared to the results of a single network.
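A minimal sketch of the partitioning scheme described above, assuming that each network re-partitions the common training-validation pool at random. The function name `committee_splits` and the validation fraction of 25% are illustrative assumptions; the text only fixes the 80/20 split, the 86 patterns and the 10 networks.

```python
import numpy as np

def committee_splits(n_patterns, n_networks, test_frac=0.2, val_frac=0.25, seed=0):
    """Hold out a test part once, then draw an individual train/validation
    split of the remaining pool for every network in the committee."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_patterns)
    n_test = int(round(test_frac * n_patterns))
    test_idx, pool = order[:n_test], order[n_test:]
    n_val = int(round(val_frac * len(pool)))  # val_frac is an assumed value
    splits = []
    for _ in range(n_networks):
        shuffled = rng.permutation(pool)
        splits.append((shuffled[n_val:], shuffled[:n_val]))  # (train, validation)
    return test_idx, splits

test_idx, splits = committee_splits(n_patterns=86, n_networks=10)
```

Because every network sees a different random partition of the same pool, the training subsets intersect but are not identical, which is what keeps the committee members from behaving alike.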

5

Stability of the Committee Decision

Since true experimental data, unlike Finite Element Analysis data, can contain random errors caused by equipment inaccuracy, the experimental environment and so on, it is necessary to test the stability of the committee decision. To perform such a test we add random noise to the initial train and test datasets and check whether the committee can still produce plausible $\sigma_{iT}$ and $m$ estimates. By a noise level of $n\%$ we mean adding to the indentation strain value $d$ at each point of the indentation curve a random quantity which is normally distributed with zero mean and standard deviation equal to $0.01\,n\,d/3$. Table 2 shows the results of this test for the case in which the $\sigma_{iT}$ and $m$ estimates are produced by a single neural network (multilayer perceptron) and by a committee. The data in Table 2 are averages over 100 tests. Each test consisted of constructing new train and test datasets (see Fig. 3), training a single neural network and a committee, and comparing their test results. As seen from Table 2, the neural networks committee provides more reliable results than a single neural network for noisy data. This gives the committee an advantage in practical use.

Table 2. Test results for 5% noised data

Accuracy figure            Single network         Committee
                           σ_iT      m            σ_iT      m
Mean relative error        14.0%     23.4%        10.8%     18.3%
Maximum relative error     47.8%     110.8%       26.4%     63.8%
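The noise model used in this test can be sketched as follows, assuming the standard deviation $0.01\,n\,d/3$ stated above. The function name `add_indentation_noise` and the example strain values are hypothetical and serve only as an illustration.

```python
import numpy as np

def add_indentation_noise(d, noise_percent, rng):
    """Add zero-mean Gaussian noise with standard deviation 0.01 * n * d / 3
    to every point d of an indentation curve."""
    d = np.asarray(d, dtype=float)
    sigma = 0.01 * noise_percent * np.abs(d) / 3.0
    return d + rng.standard_normal(d.shape) * sigma

rng = np.random.default_rng(42)
curve = np.linspace(0.01, 0.20, 50)  # hypothetical strain values
noisy = add_indentation_noise(curve, noise_percent=5, rng=rng)
```

Dividing by 3 means that a deviation of n% of the value corresponds to three standard deviations, so almost all perturbed points stay within n% of the original value.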


6


Conclusion

The problem of estimating a metal's mechanical characteristics on the basis of indentation curve parameters was considered. The proposed method for solving this task is based on neural networks. Since the experiments have shown this problem to be ill-posed, the use of a single neural network was insufficient. A neural networks committee was therefore developed in order to get more accurate estimates. The networks of the committee were trained and validated using intersecting subsets of the initial dataset, which was obtained by the Finite Element Analysis method. The final decision in the committee was made by an extra generalizing neural network. This type of committee proved experimentally to produce more precise results than a single neural network, even in the case of noisy indentation curves which were close to real experimental ones.

References

1. Bakirov, M.B.: Modifiziertes Härteprüfverfahren. Kontrolle 10, 16–18 (1994)
2. Field, J.S., Swain, M.V.: A simple predictive model for spherical indentation. J. Mater. Res. 8, 297–306 (1993)
3. Ahn, J.-H., Kwon, D.: Derivation of plastic stress-strain relationship from ball indentations: Examination of strain definition and pileup effect. J. Mater. Res. 16 (2001)
4. Haykin, S.: Neural Networks: A Comprehensive Foundation, 2nd edn. Prentice Hall, Englewood Cliffs (2005)
5. Tyulyukovskiy, E., Huber, N.: Identification of viscoplastic material parameters from spherical indentation data: Part I. Neural networks. J. Mater. Res. 21, 664–676 (2006)
6. Klötzer, D., Ullner, C., Tyulyukovskiy, E., Huber, N.: Identification of viscoplastic material parameters from spherical indentation data: Part II. Experimental validation of the method. J. Mater. Res. 21, 677–684 (2006)
7. Verikas, A., Lipnickas, A., Malmqvist, K.: Selecting neural networks for a committee decision. International Journal of Neural Systems 5(12), 351–361 (2002)
8. Efron, B., Tibshirani, R.: An Introduction to the Bootstrap. Chapman and Hall, London (1993)

Logarithmic Multiplier in Hardware Implementation of Neural Networks

Uroš Lotrič and Patricio Bulić

Faculty of Computer and Information Science, University of Ljubljana, Slovenia
{uros.lotric,patricio.bulic}@fri.uni-lj.si

Abstract. Neural networks on chip have found some niche areas of application, ranging from mass consumer products requiring low cost to real-time systems requiring real-time response. Regarding the latter, iterative logarithmic multipliers show great potential for increasing the performance of hardware neural networks. By reducing the size of the multiplication circuit, the concurrency and consequently the speed of the model can be greatly improved. The proposed hardware implementation of the multilayer perceptron with on-chip learning ability confirms the potential of the concept. The experiments performed on the Proben1 benchmark dataset show that the adaptive nature of the proposed neural network model enables it to compensate for the errors caused by inexact calculations, while simultaneously increasing its performance and reducing power consumption.

Keywords: Neural network, Iterative logarithmic multiplier, FPGA.

1

Introduction

A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part I, LNCS 6593, pp. 158–168, 2011. © Springer-Verlag Berlin Heidelberg 2011

Artificial neural networks are commonly implemented as software models running on general purpose processors. Although widely used, these systems usually operate on the von Neumann architecture, which is sequential in nature and as such cannot exploit the inherent concurrency present in artificial neural networks. On the other hand, hardware solutions, specially tailored to the architecture of neural network models, can better exploit the massive parallelism, thus achieving much higher performance and lower power consumption than ordinary systems of comparable size and cost. Therefore, hardware implementations of artificial neural network models have found their place in some niche applications like image processing, pattern recognition, speech synthesis and analysis, adaptive sensors with teach-in ability and so on.

Neural chips are available in analogue and digital hardware designs [1,2]. The analogue designs can take advantage of many interesting analogue electronic elements which can directly perform the neural networks' functionality, resulting in very compact solutions. Unfortunately, these solutions are susceptible to noise, which limits their precision, and are extremely limited for on-chip learning. On the other hand, digital solutions are noise tolerant and have no technological obstacles for on-chip learning, but result in larger circuit size. Since the design of


the application specific integrated circuits (ASIC) is time consuming and requires a lot of resources, many hardware implementations make use of programmable integrated circuit technologies, like field programmable gate array (FPGA) technology.

The implementation of neural network models in integrated circuits is still a challenging task due to the complex algorithms involving a large number of multiplications. Multiplication is a resource, power and time consuming arithmetic operation. In artificial neural network designs, where many concurrent multiplications are desired, the multiplication circuits should be as small as possible. Due to the complexity of the circuits needed for floating-point operations, the designs are constrained to fixed-point implementations, which can make use of integer adders and multipliers. The integer multiplier circuits can be further optimized. Many practical solutions like truncated and logarithmic multipliers [3,4,5] consume less space and power and are faster than ordinary multipliers, at the price of introducing small errors into the calculations. These errors can cause serious problems in neural network performance if the learning is not performed on-chip. However, if the neural network learning is performed on-chip, the erroneous calculations should be compensated in the learning phase and should not seriously degrade its performance.

All approximate multipliers discard some of the less significant partial products and introduce some kind of compensation circuit to reduce the error. The main idea of logarithmic multipliers is to approximate the operands with their logarithms, thus replacing the multiplication with one addition. Errors introduced by the approximation are usually compensated by lookup-table approaches, interpolations, or approaches based on Mitchell's algorithm [3]. The one-stage iterative logarithmic multiplier [5] follows the ideas of Mitchell but uses different error-correction circuits. The final hardware implementation involves only one adder and a few shifters, resulting in reduced usage of logic resources and power consumption.

In this paper the behaviour of a hardware implementation of a neural network using iterative logarithmic multipliers is considered. In the next section the iterative logarithmic multiplier is introduced, outlining its advantages and weaknesses. Furthermore, a highly parallel processing unit specially suited for feed-forward neural networks is proposed. Its design allows it to be used in the forward pass as well as in the backward pass during the learning phase. In section four the performance of the proposed solution is tested on many benchmark problems. The results are compared with the hardware implementation using exact matrix multipliers as well as with a floating-point implementation. The main findings are summarized at the end.

2

Iterative Logarithmic Multiplier

The iterative logarithmic multiplier (ILM) was proposed by Babic et al. in [5]. It simpliﬁes the logarithm approximation introduced in [3] and introduces an iterative algorithm with various possibilities for achieving an error as small as required and the possibility of achieving an exact result.


2.1


Mathematical Formulation

The logarithm of the product of two non-negative integer numbers $N_1$ and $N_2$ can be written as the sum of their logarithms, $\log_2(N_1 \cdot N_2) = \log_2 N_1 + \log_2 N_2$. By denoting $k_1 = \lfloor \log_2 N_1 \rfloor$ and $k_2 = \lfloor \log_2 N_2 \rfloor$, the logarithm of the product can be approximated as $k_1 + k_2$. In this case the calculation of the approximate product $2^{k_1+k_2}$ requires only one add and one shift operation, but has a large error. To decrease this error, the following procedure is proposed in [5].

A non-negative integer number $N$ can be written as

$$N = 2^{k} + N^{(1)}, \tag{1}$$

where $k$ is a characteristic number, indicating the place of the leftmost 1 (the leading 1 bit) in its binary representation, and the number $N^{(1)} = N - 2^{k}$ is the remainder of the number $N$ after removal of the leading 1. Following the notation in Eq. 1, the product of two numbers can be written as

$$P_{true} = N_1 \cdot N_2 = (2^{k_1} + N_1^{(1)}) \cdot (2^{k_2} + N_2^{(1)}) = P_{approx}^{(0)} + E^{(0)}. \tag{2}$$

While the first approximation of the product

$$P_{approx}^{(0)} = 2^{k_1+k_2} + N_1^{(1)} \cdot 2^{k_2} + N_2^{(1)} \cdot 2^{k_1} \tag{3}$$

can be calculated by applying only a few shift and add operations, the term

$$E^{(0)} = N_1^{(1)} \cdot N_2^{(1)}, \qquad E^{(0)} > 0, \tag{4}$$

representing the absolute error of the first approximation, requires multiplication. Similarly, the proposed multiplication procedure can be performed on the multiplicands from Eq. 4 such that

$$E^{(0)} = C^{(1)} + E^{(1)}, \tag{5}$$

where $C^{(1)}$ is the approximate value of $E^{(0)}$, and $E^{(1)}$ the corresponding absolute error. The combination of Eq. 2 and Eq. 5 gives

$$P_{true} = P_{approx}^{(0)} + C^{(1)} + E^{(1)} = P_{approx}^{(1)} + E^{(1)}. \tag{6}$$

By repeating the described procedure we can obtain an arbitrarily precise approximation of the product by summing up the iteratively obtained correction terms $C^{(j)}$,

$$P_{approx}^{(i)} = P_{approx}^{(0)} + \sum_{j=1}^{i} C^{(j)}. \tag{7}$$
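The procedure of Eqs. 1 to 7 can be sketched in software as follows. This is only a bit-level illustration, not the hardware design of [5]: the function name `ilm_multiply` and its parameter `corrections` (the number of correction terms $C^{(j)}$ added to the basic approximation) are hypothetical.

```python
def ilm_multiply(n1, n2, corrections=1):
    """Approximate n1 * n2 for non-negative integers using the basic block
    of Eq. 3 plus `corrections` correction terms (Eq. 7)."""
    if n1 < 0 or n2 < 0:
        raise ValueError("operands must be non-negative")
    product = 0
    for _ in range(corrections + 1):
        if n1 == 0 or n2 == 0:
            break  # the remaining error is zero, so the result is exact
        k1 = n1.bit_length() - 1   # position of the leading 1 of n1
        k2 = n2.bit_length() - 1   # position of the leading 1 of n2
        r1 = n1 - (1 << k1)        # residue N1^(1), Eq. 1
        r2 = n2 - (1 << k2)        # residue N2^(1), Eq. 1
        # Basic block, Eq. 3: only shifts and additions are needed.
        product += (1 << (k1 + k2)) + (r1 << k2) + (r2 << k1)
        # The remaining error is r1 * r2 (Eq. 4); approximate it next.
        n1, n2 = r1, r2
    return product

approximations = [ilm_multiply(7, 7, c) for c in range(3)]
```

For 7 · 7 = 49 the sketch yields 40 with the basic block alone, 48 with one correction term and the exact value 49 with two, which illustrates how each additional term shrinks the remaining error.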


Table 1. Average and maximal relative errors for the 16-bit iterative multiplier [5]

number of iterations i       0       1       2       3
average E_r^(i) [%]          9.4     0.98    0.11    0.01
max E_r^(i) [%]              25.0    6.25    1.56    0.39

The number of iterations required for an exact result is equal to the number of bits with the value of 1 in the operand with the smaller number of such bits. Babic et al. [5] showed that in the worst case the relative error introduced by the proposed multiplier, $E_r^{(i)} = E^{(i)} / (N_1 N_2)$, decays exponentially at the rate $2^{-2(i+1)}$. Table 1 presents the average and maximal relative errors with respect to the number of considered iterations.

The proposed method assumes non-negative numbers. To apply the method to signed numbers, it is most appropriate to specify them in sign and magnitude representation. In that case, the sign of the product is calculated as the EXOR operation between the sign bits of both multiplicands.

2.2

Hardware Implementation

The implementation of the proposed multiplier is described in [5]. The multiplier with one error-correction circuit, shown in Figure 1a, is composed of two pipelined basic blocks, of which the first one calculates the approximate product $P_{approx}^{(0)}$, while the second one calculates the error-correction term $C^{(1)}$. The task of the basic block is to calculate one approximate product according to Eq. 3. To decrease the maximum combinational delay, the basic block is implemented as a pipeline. The pipelined implementation of the basic block is shown in Figure 1b and has four stages. Stage 1 calculates the two characteristic numbers $k_1$, $k_2$ and the two residues $N_1^{(1)}$, $N_2^{(1)}$. The residues are output in stage 2, which also calculates $k_1 + k_2$, $N_1^{(1)} \cdot 2^{k_2}$ and $N_2^{(1)} \cdot 2^{k_1}$. Stage 3 calculates $2^{k_1+k_2}$ and $N_1^{(1)} \cdot 2^{k_2} + N_2^{(1)} \cdot 2^{k_1}$, which are summed up to the approximation of the product $P_{approx}^{(0)}$ in stage 4. After the initial latency of 5 clock periods, the proposed iterative logarithmic multiplier delivers one product in each clock period. The estimated device utilization in terms of programmable hardware components, i.e. slices and lookup tables, and the power consumption at a frequency of 25 MHz for the 16-bit pipelined implementations of the proposed multiplier and the classical matrix multiplier are compared in Table 2.

Table 2. Device utilization and power consumption of multipliers obtained on the Xilinx Spartan 3 XC3S1500-5FG676 FPGA circuit

multiplier               slices    lookup tables    power [mW]
iterative logarithmic    427       803              7.32
matrix                   477       1137             9.16
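As a quick check of what Table 2 implies, the relative savings of the iterative logarithmic multiplier with respect to the matrix multiplier can be computed directly from the tabulated values. The helper name `relative_saving` is introduced here only for illustration.

```python
# Values copied from Table 2 (16-bit pipelined multipliers, 25 MHz).
ilm = {"slices": 427, "lookup tables": 803, "power [mW]": 7.32}
matrix = {"slices": 477, "lookup tables": 1137, "power [mW]": 9.16}

def relative_saving(new, reference):
    """Saving of `new` with respect to `reference`, in percent."""
    return 100.0 * (reference - new) / reference

savings = {key: relative_saving(ilm[key], matrix[key]) for key in ilm}
for key, value in savings.items():
    print(f"{key}: {value:.1f} % lower")
```

With these numbers a single multiplier uses about 10.5 % fewer slices and 29.4 % fewer lookup tables, and consumes about 20.1 % less power.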

Fig. 1. Block diagrams of a. a pipelined iterative logarithmic multiplier with one error-correction circuit, and b. its basic block

3

Multilayer Perceptron with Highly Parallel Neural Unit

One of the most widely used neural networks is the multilayer perceptron, which gained its popularity with the development of the back-propagation learning algorithm [6]. Despite its simple idea, the learning phase is still a hard nut to crack when hardware implementations of the model are in question.

A multilayer perceptron is a feed-forward neural network consisting of a set of source nodes forming the input layer, one or more hidden layers of computation nodes, and an output layer of computation nodes. A computation node, or neuron, $n$ in a layer $l$ first computes an activation potential $v_n^l = \sum_i \omega_{ni}^l x_i^{l-1}$, a linear combination of the weights $\omega_{ni}^l$ and the outputs $x_i^{l-1}$ of the previous layer. To get the neuron output, the activation potential is passed to an activation function, $x_n^l = \varphi(v_n^l)$, for example $\varphi(v) = \tanh(v)$. The objective of a learning algorithm is to find a set of weights and biases that minimizes the performance function, usually defined as the squared error between the calculated outputs and the target values. For the back-propagation learning rule, the weight update equation in its simplest form becomes $\Delta\omega_{ni}^l = \eta\,\delta_n^l\,x_i^{l-1}$, with $\delta_n^l = \varphi'(v_n^l)(t_n - x_n^l)$ in the output layer and $\delta_n^l = \varphi'(v_n^l) \sum_o \delta_o^{l+1} \omega_{on}^{l+1}$ otherwise, where $\eta$ is a learning parameter and $t_n$ the $n$-th element of the target output.
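The forward pass and the weight update rule above can be sketched with NumPy as follows. This is a floating-point illustration only, with biases omitted for brevity; the function names `forward` and `backprop_step` and the small example network are assumptions, not part of the described hardware.

```python
import numpy as np

def forward(weights, x):
    """Return all layer outputs x^0..x^L and activation potentials v^1..v^L."""
    outputs, potentials = [x], []
    for w in weights:
        v = w @ outputs[-1]          # v_n = sum_i w_ni * x_i
        potentials.append(v)
        outputs.append(np.tanh(v))   # x_n = phi(v_n) with phi = tanh
    return outputs, potentials

def backprop_step(weights, x, target, eta):
    """One update of all weights for a single sample (no biases)."""
    outputs, potentials = forward(weights, x)
    dphi = [1.0 - np.tanh(v) ** 2 for v in potentials]  # phi'(v)
    delta = dphi[-1] * (target - outputs[-1])            # output layer delta
    updated = [None] * len(weights)
    for l in range(len(weights) - 1, -1, -1):
        updated[l] = weights[l] + eta * np.outer(delta, outputs[l])
        if l > 0:
            # Hidden layer delta: phi'(v) times the back-propagated sum.
            delta = dphi[l - 1] * (weights[l].T @ delta)
    return updated

rng = np.random.default_rng(0)
weights = [rng.normal(scale=0.5, size=(3, 2)), rng.normal(scale=0.5, size=(1, 3))]
x, target = np.array([0.5, -0.3]), np.array([0.4])
err_before = float(np.sum((target - forward(weights, x)[0][-1]) ** 2))
for _ in range(500):
    weights = backprop_step(weights, x, target, eta=2.0 ** -4)
err_after = float(np.sum((target - forward(weights, x)[0][-1]) ** 2))
```

Note that both the activation potential and the hidden layer delta are scalar products, which is why a single scalar-product unit can serve the forward and the backward pass.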


A multilayer perceptron exhibits two levels of concurrency: a fine-grained computation of each neuron's activation potential and a coarse-grained computation of the outputs of all neurons in a layer. Many existing solutions rely on the latter concept [7], which complicates the hardware implementation of the learning process. Since the calculations of the activation potential and of the delta of neurons in hidden layers are very similar, we have exploited the first concept. For that purpose we have built a highly parallel neural unit that calculates the scalar product of two vectors in only one clock cycle [8]. The inputs to the neural unit are first passed to the multipliers, from which the products are fed to adders organized in a tree-like structure. In order to gain as much as possible from the neural unit, it should be capable of calculating a scalar product of the largest vectors that appear in the computation. The hardware circuit thus becomes very complex and can only be operated at lowered frequencies. For example, a unit with 32 multipliers and consequently 31 adders was implemented in a Spartan 3 XC3S1500-5FG676 FPGA chip. While separate multiplications can run at a maximum frequency of 50 MHz, the proposed unit managed to run at a still acceptable 30 MHz [8].

To use the neural unit, a set of subsidiary units is needed: RAM for storing weights, registers for keeping inputs, outputs and partial results, multiplexers for loading proper data into the neural unit, lookup tables with stored values of the activation function (LUT) and its derivative (LUTd), and three state machines. The forward pass and the backward pass are controlled by the Learn and Execute state machines, which are supervised by the Main state machine. A simplified scheme of the implementation is shown in Fig. 2.
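The structure of the neural unit, parallel multipliers followed by a tree of adders, can be mimicked in software to see where the count of 31 adders for 32 multipliers comes from. The function `neural_unit` below is a hypothetical sequential model of that structure, not the hardware description from [8].

```python
def neural_unit(a, b):
    """Scalar product computed as element-wise products followed by a
    binary adder tree. Returns (result, number_of_adders, tree_depth)."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("vectors must be non-empty and of equal length")
    level = [x * y for x, y in zip(a, b)]  # one multiplier per element pair
    adders = depth = 0
    while len(level) > 1:
        # Every adder on this tree level sums two neighbouring values.
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        adders += len(nxt)
        if len(level) % 2:
            nxt.append(level[-1])  # an unpaired value moves up unchanged
        level = nxt
        depth += 1
    return level[0], adders, depth

result, adders, depth = neural_unit(list(range(32)), [1] * 32)
```

For 32 inputs the tree has 16 + 8 + 4 + 2 + 1 = 31 adders arranged in 5 levels, so the critical path passes through one multiplier and five adders, which is consistent with the unit running at a lower frequency than a separate multiplier.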

Fig. 2. Neural network implementation scheme [8]

164

4

U. Lotrič and P. Bulić

Experimental Work

To assess the performance of the iterative logarithmic multiplier, a set of experiments was performed on multilayer perceptron neural networks with one hidden layer. The models were compared in terms of classification or approximation accuracy, speed of convergence, and power consumption. Three types of models were evaluated: a) an ordinary software model (SM) using floating point arithmetic, b) a hardware model with exact matrix multipliers ($HM_M$), and c) the proposed hardware model using the iterative logarithmic multipliers with one error correction circuit ($HM_L$).

The models were evaluated on the Proben1 collection of freely available benchmarking problems for neural network learning [9]. This rather heterogeneous collection contains 15 data sets from 12 different domains, and all but one consist of real world data. Among them 11 data sets are from the area of pattern classification and the remaining four from the area of function approximation. The data sets, containing from a few hundred to a few thousand input-output samples, are already divided into training, validation and test sets, generally in the proportion 50 : 25 : 25. The number of attributes in input samples ranges from 9 to 125 and in output samples from 1 to 19. Before modelling, all input and output samples were rescaled to the interval [−0.8, +0.8].

The testing of models on each of the data sets mentioned above was performed in two steps. After finding the best software models, the modelling of hardware models started, keeping the same number of neurons in the hidden layer. During the software model optimization, the topology parameters as well as the learning parameter $\eta$ were varied. Since the number of inputs and outputs is predefined by the data set, the only parameter influencing the topology of the model is the number of neurons in the hidden layer. It was varied from one to a maximum value, determined in such a way that the number of model weights did not exceed the number of training samples.

The learning process in the backpropagation scheme heavily depends on the learning parameter $\eta$. Since the data sets are very heterogeneous, the values $2^{-2}, 2^{-4}, \ldots, 2^{-12}$ were used for the learning parameter $\eta$. Powers of two are very suitable for hardware implementation because the multiplications can be replaced by shift operations. While the software model uses 64-bit floating point arithmetic, both hardware models use fixed point arithmetic with weights represented with 16, 18, 20, 22, or 24 bits. For both hardware models the weights were limited to the interval [−4, +4]. The processing values, including inputs and outputs, were represented with 16 bits in the interval [−1, +1]. The values of the activation function $\varphi(v) = \tanh(1.4\,v)$ and its derivative for 256 equidistant values of $v$ from the interval [−2, 2] were stored in two separate lookup tables.

By applying the early stopping criterion, the learning phase was stopped as soon as the classification or approximation error on the validation set started to grow. The analysis on the test set was performed with the model parameters which gave the minimal value of the normalized squared error on the validation set. The normalized squared error is defined as the squared difference between the calculated and the target outputs averaged over all samples and output


Fig. 3. Performance of the models with respect to the weight precision on Hearta1 data set

attributes, and divided by the squared difference between the maximal and the minimal value of the output attributes. Results presented in the following are only given for the test set samples, which were not used during the learning phase.

In Fig. 3, a typical dependence of the normalized squared error on the weight precision is presented. The normalized squared error decreases exponentially with increasing precision of the weights. However, increasing the precision of the weights also requires more and more hardware resources. Since there is a big drop in the normalized squared error from 16-bit to 18-bit precision, and since we can make use of the numerous prefabricated 18 × 18-bit matrix multipliers in the new Xilinx FPGA programmable circuits, our further analysis is confined to 18-bit weight precision.

The model performance for some selected data sets from the Proben1 collection is given in Table 3. Average values and standard deviations for all three types of models over ten runs are given in terms of three measures: the number of epochs, the normalized squared error $E_{te}$ and the percentage of misclassified samples $p_{te}^{miss}$. The latter is only given for the data sets from the classification domain. The results obtained for software models using the backpropagation algorithm are similar to those reported in [9], where more advanced learning techniques were applied.

The most noticeable difference between software and hardware models is in the number of epochs needed to train a model. For many data sets, the number of epochs in the case of the hardware models is an order of magnitude smaller than in the case of the software models. The reason probably lies in the inability of hardware models to further optimize the weights due to their representation in limited precision. As a rule, the hardware models exhibit slightly poorer performance in terms of the normalized squared error and the percentage of misclassified samples.
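The fixed-point setup and the error measure described in this section can be sketched as follows. Several details are assumptions made only for this illustration: rounding to the nearest grid point in `quantize`, nearest-entry lookup in `lut_activation`, and taking the output range in `normalized_squared_error` from the target values. All three function names are hypothetical.

```python
import numpy as np

def quantize(values, bits, limit):
    """Round to a signed fixed-point grid of 2**bits levels on [-limit, limit)."""
    step = 2.0 * limit / (2 ** bits)
    q = np.round(np.asarray(values, dtype=float) / step) * step
    return np.clip(q, -limit, limit - step)

# Lookup table: tanh(1.4 v) at 256 equidistant values of v in [-2, 2].
LUT_V = np.linspace(-2.0, 2.0, 256)
LUT_PHI = np.tanh(1.4 * LUT_V)

def lut_activation(v):
    """Return the stored activation value nearest to v (inputs are clipped)."""
    v = np.clip(np.asarray(v, dtype=float), -2.0, 2.0)
    idx = np.rint((v + 2.0) / 4.0 * 255).astype(int)
    return LUT_PHI[idx]

def normalized_squared_error(outputs, targets):
    """Mean squared difference divided by the squared range of the targets."""
    outputs = np.asarray(outputs, dtype=float)
    targets = np.asarray(targets, dtype=float)
    span = targets.max() - targets.min()
    return float(np.mean((outputs - targets) ** 2) / span ** 2)

w18 = quantize(0.3, bits=18, limit=4.0)  # an 18-bit weight in [-4, +4]
grid = np.linspace(-2.0, 2.0, 1001)
lut_error = float(np.max(np.abs(lut_activation(grid) - np.tanh(1.4 * grid))))
nse = normalized_squared_error([-0.64, 0.8], [-0.8, 0.8])
```

With 256 entries the table spacing is 4/255, so under these assumptions the lookup error of the activation function stays at about 0.011 or below, the same order of magnitude as the differences in normalized squared error between the models.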
The discrepancy is very large for the gene1 and thyroid1 data sets, where weights represented with more than 18 bits are needed to close the gap. The comparison of the hardware models $HM_M$ and $HM_L$ reveals that the replacement of the exact matrix multipliers with the proposed approximate iterative logarithmic multipliers does not have any notable effect on the performance of the models. The reasons for the very good compensation of the errors caused by


Table 3. Performance of software and hardware models on some data sets. For each data set the results obtained with models SM, HM_M, and HM_L are given in the first, second, and third row, respectively.

data set     hidden neurons   epochs          E_te              p_te^miss [%]
cancer1      6                24.0 ± 4.6      0.111 ± 0.002     1.46 ± 0.29
                              30.7 ± 4.7      0.114 ± 0.002     1.72 ± 0.00
                              30.6 ± 4.8      0.116 ± 0.003     1.72 ± 0.00
diabetes1    7                152 ± 62        0.407 ± 0.002     23.85 ± 0.42
                              28.5 ± 3.2      0.418 ± 0.001     25.31 ± 0.44
                              27.4 ± 5.5      0.416 ± 0.001     24.74 ± 0.51
gene1        8                1230 ± 208      0.262 ± 0.005     12.03 ± 0.59
                              20.6 ± 1.3      0.337 ± 0.007     23.33 ± 1.29
                              20.6 ± 1.3      0.339 ± 0.007     23.82 ± 1.25
thyroid1     48               6830 ± 2340     0.132 ± 0.002     2.75 ± 0.13
                              24.6 ± 5.4      0.195 ± 0.003     6.09 ± 0.19
                              23.1 ± 5.2      0.195 ± 0.002     6.07 ± 0.14
building1    56               50.5 ± 6.0      0.158 ± 0.001
                              16.2 ± 0.4      0.217 ± 0.015
                              16.2 ± 0.4      0.217 ± 0.014
flare1       4                21.1 ± 3.4      0.075 ± 0.001
                              30.8 ± 14.6     0.076 ± 0.002
                              32.5 ± 15.1     0.076 ± 0.001
hearta1      3                5640 ± 928      0.330 ± 0.002
                              38 ± 13         0.342 ± 0.005
                              38 ± 12.5       0.344 ± 0.006
heartac1     4                5070 ± 371      0.250 ± 0.014
                              41 ± 19         0.272 ± 0.019
                              44.2 ± 27.2     0.271 ± 0.020

inexact multiplication can be found in the high ability of adaptation, common to all neural network models. The proposed neural unit needs to be applied many times to calculate the model output, therefore it is important to be as small and as eﬃcient as possible. The estimation of device utilization in terms of Xilinx Spartan 3 FPGA programmable circuit building blocks for a model with 32 exact 16 × 18 matrix multipliers is shown in Table 4. According to the analysis of the multipliers in Table 2, the replacement of the matrix multipliers with the iterative logarithmic

Table 4. Estimation of FPGA device utilization for a neural network model with 32 inputs, 8 hidden neurons and 10 outputs using 16 × 18-bit matrix multipliers

                 slices (×1000)    lookup tables (×1000)
whole model      27                38
neural unit      25 (92 %)         34 (89 %)
multipliers      24 (89 %)         33 (87 %)


multipliers can lead to more than 10 % smaller device utilization and more than 30 % smaller power consumption.

5

Conclusion

Neural networks offer a high degree of internal parallelism, which can be efficiently used in custom design chips. Neural network processing comprises a huge number of multiplications, i.e. arithmetic operations consuming a lot of space, time and power. In this paper we have shown that exact matrix multipliers can be replaced with approximate iterative logarithmic multipliers with one error correction circuit. Due to the highly adaptive nature of neural network models, which compensated for the erroneous calculations, the replacement of the multipliers did not have any notable impact on the models' processing and learning accuracy. Moreover, the proposed logarithmic multipliers require fewer resources on a chip, which leads to smaller designs on the one hand and to designs with more concurrent units on the same chip on the other. The consumption of fewer resources per multiplier also results in more power efficient circuits. The power consumption, reduced by roughly 20 %, makes hardware neural network models with iterative logarithmic multipliers favourable candidates for battery powered applications.

Acknowledgments

This research was supported by the Slovenian Research Agency under grants P20241 and P2-0359, and by the Slovenian Research Agency and the Ministry of Civil Affairs, Bosnia and Herzegovina, under grant BI-BA/10-11-026.

References

1. Zhu, J., Sutton, P.: FPGA implementations of neural networks - a survey of a decade of progress. In: Cheung, P.Y.K., Constantinides, G.A., de Sousa, J.T. (eds.) FPL 2003. LNCS, vol. 2778, pp. 1062–1066. Springer, Heidelberg (2003)
2. Dias, F.M., Antunes, A., Mota, A.M.: Artificial neural networks: a review of commercial hardware. Engineering Applications of Artificial Intelligence 17, 945–952 (2004)
3. Mitchell, J.N.: Computer multiplication and division using binary logarithms. IRE Transactions on Electronic Computers 11, 512–517 (1962)
4. Mahalingam, V., Ranganathan, N.: Improving Accuracy in Mitchell's Logarithmic Multiplication Using Operand Decomposition. IEEE Transactions on Computers 55, 1523–1535 (2006)
5. Babic, Z., Avramovic, A., Bulic, P.: An Iterative Logarithmic Multiplier. Microprocessors and Microsystems 35(1), 23–33 (2011) ISSN 0141-9331, doi:10.1016/j.micpro.2010.07.001
6. Haykin, S.: Neural networks: a comprehensive foundation, 2nd edn. Prentice-Hall, New Jersey (1999)
7. Pedroni, V.A.: Circuit Design With VHDL. MIT, Cambridge (2004)
8. Gutman, M., Lotrič, U.: Implementation of neural network with learning ability using FPGA programmable circuits. In: Zajc, B., Trost, A. (eds.) Proceedings of the ERK 2010 Conference, vol. B, pp. 173–176. IEEE Slovenian section, Ljubljana (2010)
9. Prechelt, L.: Proben1 – A Set of Neural Network Benchmark Problems and Rules. Technical Report 21/94, University of Karlsruhe, Karlsruhe (1994)

Efficiently Explaining Decisions of Probabilistic RBF Classification Networks

Marko Robnik-Šikonja¹, Aristidis Likas², Constantinos Constantinopoulos³, Igor Kononenko¹, and Erik Štrumbelj¹

¹ University of Ljubljana, Faculty of Computer and Information Science, Tržaška 25, 1001 Ljubljana, Slovenia {marko.robnik,igor.kononenko,erik.strumbelj}@fri.uni-lj.si
² University of Ioannina, Department of Computer Science, GR 45110 Ioannina, Greece [email protected]
³ Barcelona Media - Centre d'Innovació, Av. Diagonal, 177, planta 9, 08018 Barcelona, Spain [email protected]

Abstract. For many practical applications model transparency is an important requirement. A probabilistic radial basis function (PRBF) network is an effective non-linear classifier, but, as with most other neural network models, it is not straightforward to obtain explanations for its decisions. Recently two general methods for explaining a model's decisions for individual instances have been introduced. Both are based on the decomposition of a model's prediction into contributions of each attribute. By exploiting the marginalization property of the Gaussian distribution, we show that PRBF is especially suitable for these explanation techniques. We demonstrate the resulting methods by explaining the PRBF's decisions for new unlabeled cases, and accompany the presentation with a visualization technique that works for single instances as well as for the attributes and their values, thus providing a valuable tool for inspection of otherwise opaque models.

Keywords: classification explanation, model explanation, comprehensibility, probabilistic RBF networks, model visualization, game theory.

1 Introduction

In many areas where machine learning and data mining models are applied, their transparency is of crucial importance. For example, in medicine the practitioners are just as interested in the comprehension of the decision process, the explanation of the model's behavior for a given new case, and the importance of the diagnostic features, as in the classification accuracy of the model. The same is true for other areas where knowledge discovery dominates prediction accuracy.

A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part I, LNCS 6593, pp. 169–179, 2011. © Springer-Verlag Berlin Heidelberg 2011

Research in statistics, data mining, pattern recognition and machine learning is mostly focused on prediction accuracy. As a result we have many excellent prediction methods, which are approaching the theoretically achievable prediction accuracy. Some of the most successful and popular approaches are based on support vector machines (SVM), artificial neural networks (ANN), and on ensemble methods (e.g., boosting and random forests). Regrettably, none of these approaches offers an intrinsic introspection into its decision process or an explanation for labeling a new instance.

The probabilistic radial basis function network (PRBF) classifier [9, 10] is also a very effective black box classifier [9, 2]. PRBF is a special case of the RBF network [1] that computes at each output unit the density function of a class. It adopts a cluster interpretation of the basis functions, where each cluster can generate observations of any class. This is a generalization of a Gaussian mixture model [4, 1], where each cluster generates observations of only one class. In [2] an incremental learning method based on Expectation-Maximization (EM) for supervised learning is proposed that provides classification performance comparable to SVM classifiers. Unfortunately, like SVM, PRBF also lacks explanation ability.

Recently, in [6, 13, 12] general explanation methods have been presented that are in principle independent of the model (the model can be either transparent, e.g., decision trees and rules, or a black box, e.g., SVM, ANN and classifier ensembles) and can be used with all classification models that output probabilities. These explanation methods decompose the model's predictions into individual contributions of each attribute. Generated explanations closely follow the learned model and enable its visualization for each instance separately. The methods work by contrasting a model's output with the output obtained using only a subset of features. This demands either retraining the model for several feature subsets, which is computationally costly, or using a simulation (e.g., averaging over all possible values of a feature).
For the Naive Bayesian (NB) classifier it was shown in [6] that, due to the independence assumption of the classifier, one can simply omit attributes from classification and obtain classification with feature subsets without retraining or approximation. In this paper we present an efficient decomposition for PRBF, which is also exact and avoids the approximation techniques, thereby efficiently transforming PRBF into a white box non-linear classifier. We demonstrate how the explanation methods from [6, 13] can explain the PRBF model's decisions for new unlabeled cases and present a graphical description of the otherwise opaque model. As a result, a highly effective classifier such as PRBF becomes transparent and can explain its individual decisions as well as its general behavior. The aim of the paper is twofold: first, to present a specific exact solution needed for PRBF explanation, and second, to show how a general explanation technique can be applied to an opaque model to explain and visualize its decisions.

For a good review of neural network explanation methods we refer the reader to [3]. Techniques for extracting explicit (if-then) rules from black box models (such as ANN) are described in [11, 7]. Some less complex non-symbolic models enable the explanation of their decisions in the form of weights associated with each attribute. A weight can be interpreted as the proportion of the information contributed by the corresponding attribute value to the final prediction. Such explanations can be easily visualized. In [5] nomograms were developed for visualization of NB decisions. In [6] the NB list of information gains was generalized to a list of attribute weights for any prediction model.

Throughout the paper we use the notation where each of the $n$ learning instances is represented by an ordered pair $(x, y)$; each attribute vector $x$ consists of individual values of attributes $A_i$, $i = 1, \ldots, a$ ($a$ is the number of attributes), and is labeled with $y$ representing one of the discrete class values $y_j$, $j = 1, \ldots, c$ ($c$ is the number of class values). We write $p(y_j)$ for the probability of the class value $y_j$.

In Section 2 we present the explanation principle of [6] and [13]. In Section 3 we describe PRBF and the marginalization property of the Gaussian distribution which is exploited for explanation. In Section 4 we demonstrate the instance-based and model-based explanations possible with PRBF. Section 5 summarizes the properties of explanation and provides some directions for further work.

2 Explanation Methods

Following [6] we identify two levels of explanation: the domain level and the model level. The domain level tries to find the true causal relationship between the dependent and independent variables. Typically this level is unreachable unless we are dealing with artificial domains where all the relations as well as the probability distributions are known in advance. The model level explanation aims to make transparent the prediction process of a particular model. The prediction accuracy and the correctness of explanation at the model level are orthogonal: the correctness of the explanation is independent of the correctness of the prediction. However, empirical observation shows [13] that better models (with higher prediction accuracy) enable better explanation at the domain level.

We present two recently developed general explanation methods which are based on the decomposition of a model's predictions into contributions of each attribute. Both methods work by explaining decisions taken by the model in classifying individual instances. To get a broader picture, both methods combine these individual explanations and thereby also provide a better view of the model and the problem. We first present the simpler of the two [6] (called EXPLAIN in the remainder of the text), which computes the influence of a feature value by changing this value and observing the impact on the model's output. EXPLAIN assumes that the larger the change in the output, the more important the role the feature value plays in the model. The shortcoming of this approach is that it takes into account only a single feature at a time, therefore it cannot detect certain higher order dependencies (in particular disjunctions) and redundancies in the model. The method which solves this problem is called IME [13] and is presented in the second subsection.

2.1 Method EXPLAIN

The idea of the explanations proposed in [6] is to observe the relationship between the features and the predicted value by monitoring the effect on classification caused by the lack of knowledge of a feature's value. To monitor the effect that the attributes' values have on the prediction of an instance, the EXPLAIN method decomposes the prediction into the contributions of individual attributes' values. It defines $p(y|x \backslash A_i)$ as the model's probability of class $y$ for instance $x$ without the knowledge of event $A_i = a_k$ (marginal prediction), where $a_k$ is the value of $A_i$ for the observed instance $x$. The comparison of the values $p(y|x)$ and $p(y|x \backslash A_i)$ provides insight into the importance of event $A_i = a_k$. If the difference between $p(y|x)$ and $p(y|x \backslash A_i)$ is large, the event $A_i = a_k$ plays an important role in the model; if this difference is small, the influence of $A_i = a_k$ in the model is minor. Due to its favorable properties the weight of evidence is an appropriate way to evaluate the prediction difference, so we assume its use in this work. The odds of event $z$ is defined as the ratio of the probability $p(z)$ of event $z$ and the probability of its negation: $\mathrm{odds}(z) = \frac{p(z)}{1 - p(z)}$. The weight of evidence of attribute $A_i$ for class value $y$ is defined as the difference between the log odds of the class value $y$ with the knowledge about the value of $A_i$ and without it:

$$\mathrm{WE}_i(y|x) = \log_2(\mathrm{odds}(y|x)) - \log_2(\mathrm{odds}(y|x \backslash A_i)) \quad [\text{bit}] \tag{1}$$
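As a minimal sketch of how (1) could be evaluated once the two probabilities are available, the following Python fragment computes the odds and the weight of evidence. The function names and the clipping constant `eps` are illustrative assumptions, not part of the original method description.

```python
import math


def odds(p: float) -> float:
    """Odds of an event with probability p, i.e. p / (1 - p)."""
    return p / (1.0 - p)


def weight_of_evidence(p_full: float, p_without: float, eps: float = 1e-9) -> float:
    """Eq. (1): log2 odds with the attribute value minus log2 odds without it (bits).

    p_full    ... p(y | x), the model's prediction for the full instance
    p_without ... p(y | x without A_i), the marginal prediction
    """
    # Assumption: probabilities are clipped away from 0 and 1 so the odds stay finite.
    p_full = min(max(p_full, eps), 1.0 - eps)
    p_without = min(max(p_without, eps), 1.0 - eps)
    return math.log2(odds(p_full)) - math.log2(odds(p_without))


# Knowing the attribute value raises the probability from 0.5 to 0.8,
# so the value speaks in favour of the class (positive weight of evidence).
we = weight_of_evidence(0.8, 0.5)
```

A positive result means the attribute value supports the class, a negative one means it speaks against it.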

To get the explanation factors an evaluation of (1) is needed. To compute the factor $p(y|x)$ one just classifies the instance $x$ with the model. To compute the factors $p(y|x \backslash A_i)$ the simplest, but not always the best, option is to replace the value of attribute $A_i$ with a special unknown value (NA, don't know, don't care, etc.) which does not contain any information about $A_i$. This method is appropriate only for modeling techniques which handle unknown values naturally (e.g., the Naive Bayesian classifier just omits the attribute with unknown value from the computation). For other models we have to bear in mind that while this approach is simple and seemingly correct, each method has its own internal mechanism for handling unknown values. The techniques for handling unknown values are very different: from replacement with the most frequent value for nominal attributes and with the median for numerical attributes to complex model-based imputations. To avoid the model dependent treatment of unknown values, a technique is suggested in [6] that simulates the lack of information about $A_i$ with several predictions. For nominal attributes the actual value $A_i = a_k$ is replaced with all possible $m_i$ values of $A_i$, and each prediction is weighted by the probability of the value, resulting in the following equation

$$p(y|x \backslash A_i) = \sum_{s=1}^{m_i} p(A_i = a_s | x \backslash A_i)\, p(y | x \leftarrow A_i = a_s) \tag{2}$$

which can be approximated as

$$p(y|x \backslash A_i) \doteq \sum_{s=1}^{m_i} p(A_i = a_s)\, p(y | x \leftarrow A_i = a_s) \tag{3}$$

The term $p(y | x \leftarrow A_i = a_s)$ represents the probability for $y$ when in $x$ the value of $A_i$ is replaced with $a_s$. The simplification consists of using the prior probability $p(A_i = a_s)$ in place of the conditional one, which implies that (3) is only an approximation.
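The simulation in (3) can be sketched as a weighted average of predictions over all values of one nominal attribute. The helper below is only an illustration under stated assumptions: `predict` stands for any model that returns the probability of the class of interest, `priors` maps each value of the attribute to its prior probability, and `toy_model` is a hypothetical stand-in, not a trained classifier.

```python
from typing import Callable, Dict, Hashable, Sequence


def marginal_prediction(
    predict: Callable[[Sequence[Hashable]], float],
    x: Sequence[Hashable],
    i: int,
    priors: Dict[Hashable, float],
) -> float:
    """Approximate p(y | x without A_i) as in Eq. (3)."""
    total = 0.0
    for value, prior in priors.items():
        x_replaced = list(x)
        x_replaced[i] = value  # x <- A_i = a_s
        total += prior * predict(x_replaced)
    return total


def toy_model(x: Sequence[Hashable]) -> float:
    """Hypothetical model: the prediction depends only on the first attribute."""
    return 0.9 if x[0] == "female" else 0.2


# Prediction for a male adult when the value of attribute 0 is treated as unknown.
p_without_sex = marginal_prediction(
    toy_model, ["male", "adult"], 0, {"female": 0.25, "male": 0.75}
)
```

Each explained attribute costs $m_i$ model evaluations here, which is the overhead that an analytic marginalization avoids.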

2.2 Method IME

The main shortcoming of the EXPLAIN method described above is that it observes only a change of a single feature at a time and therefore cannot detect disjunctive or redundant concepts expressed in a model. The solution to this is the method IME (Interactions-based Method for Explanation) proposed in [13], where all attribute interactions are scanned and the difference in predictions caused by each interaction is assigned to the attributes taking part in that interaction. Such a procedure demands the generation of $2^a$ attribute subsets and is therefore limited to data sets with a relatively low number of features. Fortunately, this problem can be viewed from the point of view of coalitional game theory [12]. Within this framework the contributions assigned to individual feature values by the IME method correspond to the Shapley value [8] and are therefore fair with respect to all interactions in which they take part. Furthermore, a sampling approximation algorithm is presented in [12] which drops the requirement for all $2^a$ subsets and makes the method much more attractive in practice.

Formally, for a given instance $x$ the IME method introduces the notion of the prediction using the set of all features $\{1, 2, \ldots, a\}$, the prediction using only a subset of features $Q$, and the prediction using the empty set of features $\{\}$. Let these predictions be $h(x_{\{1,2,\ldots,a\}})$, $h(x_Q)$, and $h(x_{\{\}})$, respectively. Since the instance $x$ is fixed in the explanation, for the sake of readability we omit it from the expressions below, but remain aware that the dependence exists. The basis for explanation is therefore $\Delta_Q$, the difference in predictions using the subset of features $Q$ and the empty set

$$\Delta_Q = h(x_Q) - h(x_{\{\}}). \tag{4}$$

This difference in prediction is a result of the influence individual features may have, as well as the influence of any feature interactions $I_Q$. The interaction contribution for the subset $Q$ is defined recursively as $I_Q = \Delta_Q - \sum_{W \subset Q} I_W$, where $I_{\{\}} = 0$. This definition takes into account that due to interactions of the features in the set $Q$ the prediction may be different from the prediction using any proper subset of $Q$. The contribution $\pi_i$ of the $i$th feature in the classification of instance $x$ is therefore defined as the sum of contributions of all relevant interactions

$$\pi_i = \sum_{Q \subseteq \{1,2,\ldots,a\} \wedge i \in Q} \frac{I_Q}{|Q|} \tag{5}$$

where $|Q|$ is the cardinality of the subset $Q$. Equivalently, we can express this sum using only the differences in predictions [13]:

$$\pi_i = \sum_{Q \subseteq \{1,2,\ldots,a\} - \{i\}} \frac{1}{a \binom{a-1}{a-|Q|-1}} \left(\Delta_{Q \cup \{i\}} - \Delta_Q\right) \tag{6}$$

Both the complete algorithm and the sampling approximation described in [12] require classification with a subset of attributes. In the next section we show that an efficient solution exists for PRBF, which does not require retraining of the classifier for each feature subset, but still provides exactly the same classification.
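To make the role of subset predictions concrete, here is a small sketch of the exhaustive computation in (6). It enumerates all $2^{a-1}$ subsets per feature, so it is only practical for a handful of features. The function `h` is assumed to return the prediction for a given feature subset, and `toy_h` is a made-up example in which two features matter only jointly. Neither is taken from the source.

```python
from itertools import combinations
from math import comb
from typing import Callable, FrozenSet, List


def ime_contributions(h: Callable[[FrozenSet[int]], float], a: int) -> List[float]:
    """Exact contributions pi_i of Eq. (6) for features 0..a-1.

    h(Q) is assumed to be the prediction obtained with the feature subset Q only.
    """
    base = h(frozenset())

    def delta(Q: FrozenSet[int]) -> float:
        return h(Q) - base  # Eq. (4)

    contributions = []
    for i in range(a):
        others = [j for j in range(a) if j != i]
        pi_i = 0.0
        for size in range(a):  # |Q| ranges over 0..a-1
            weight = 1.0 / (a * comb(a - 1, a - size - 1))
            for subset in combinations(others, size):
                Q = frozenset(subset)
                pi_i += weight * (delta(Q | {i}) - delta(Q))
        contributions.append(pi_i)
    return contributions


def toy_h(Q: FrozenSet[int]) -> float:
    """Hypothetical model: features 0 and 1 matter only together, feature 2 never."""
    return 0.9 if {0, 1} <= Q else 0.1


pi = ime_contributions(toy_h, 3)
```

In this toy case a single-feature change from the empty set shows no effect for either feature 0 or feature 1, yet the two contributions split the full difference $\Delta_{\{0,1,2\}}$ equally and sum to it, which is the behaviour expected of Shapley values.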

3 Probabilistic RBF Networks

Consider a classification problem with $c$ classes $y_k$ ($k = 1, \ldots, c$) and input instances $x = (A_1, \ldots, A_a)$. For this problem, the corresponding PRBF classifier has $a$ inputs and $c$ outputs, one for each class. Each output provides an estimate of the probability density $p(x|y_k)$ of the corresponding class $y_k$. Assume that we have $M$ components (hidden units), each one computing a probability density value $f_j(x)$ of the input $x$. In the PRBF network all component density functions $f_j(x)$ are utilized for estimating the conditional densities of all classes by considering the components as a common pool [9]. Thus, for each class a conditional density function $p(x|y_k)$ is modeled as a mixture model of the form:

$$p(x|y_k) = \sum_{j=1}^{M} \pi_{jk} f_j(x), \quad k = 1, \ldots, c, \tag{7}$$

where the mixing coefficients $\pi_{jk}$ are probability vectors; they take non-negative values and satisfy the following constraint:

$$\sum_{j=1}^{M} \pi_{jk} = 1, \quad k = 1, \ldots, c. \tag{8}$$

Once the outputs $p(x|y_k)$ have been computed, the class of data point $x$ is determined using the Bayes rule, i.e. $x$ is assigned to the class with maximum posterior $p(y_k|x)$ computed by

$$p(y_k|x) = \frac{p(x|y_k) P_k}{\sum_{\ell=1}^{c} p(x|y_\ell) P_\ell} \tag{9}$$

The class priors $P_k$ are usually computed as the percentage of training instances belonging to class $y_k$. In the following, we assume Gaussian component densities of the general form:

$$f_j(x) = \frac{1}{(2\pi)^{a/2} |\Sigma_j|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_j)^T \Sigma_j^{-1} (x - \mu_j)\right) \tag{10}$$

where $\mu_j \in \mathbb{R}^a$ represents the mean of component $j$, while $\Sigma_j$ represents the corresponding $a \times a$ covariance matrix. The whole adjustable parameter vector of the model consists of the mixing coefficients $\pi_{jk}$ and the component parameters (means $\mu_j$ and covariances $\Sigma_j$). It is apparent that the PRBF model is a special case of the RBF network, where the outputs correspond to probability density functions and the output layer weights are constrained to represent the prior probabilities $\pi_{jk}$. Furthermore, the separate mixtures model [4] can be derived as a special case of PRBF by setting $\pi_{jk} = 0$ for all classes $k$, except for the class that the component $j$ belongs to. Training of the PRBF network is simple and efficient using the EM algorithm for likelihood maximization [9, 2].
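The forward computation of (7), (9) and (10) can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the authors' implementation: the parameter values are invented rather than trained with EM, the data layout (`mixing[j][k]` for $\pi_{jk}$, `priors[k]` for $P_k$) is an assumption, and the Gaussian density is evaluated through a Cholesky factorization to avoid an explicit matrix inverse.

```python
import math


def gaussian_pdf(x, mean, cov):
    """Multivariate normal density of Eq. (10), via a Cholesky factor of cov."""
    n = len(x)
    L = [[0.0] * n for _ in range(n)]  # lower triangular, cov = L L^T
    for r in range(n):
        for c in range(r + 1):
            s = cov[r][c] - sum(L[r][k] * L[c][k] for k in range(c))
            L[r][c] = math.sqrt(s) if r == c else s / L[c][c]
    # Forward substitution L z = x - mean; the quadratic form then equals z^T z.
    z = []
    for r in range(n):
        z.append((x[r] - mean[r] - sum(L[r][k] * z[k] for k in range(r))) / L[r][r])
    log_det = 2.0 * sum(math.log(L[r][r]) for r in range(n))
    quad = sum(v * v for v in z)
    return math.exp(-0.5 * (quad + log_det + n * math.log(2.0 * math.pi)))


def prbf_posteriors(x, means, covs, mixing, priors):
    """Class posteriors p(y_k | x) from Eqs. (7) and (9)."""
    f = [gaussian_pdf(x, m, S) for m, S in zip(means, covs)]
    densities = [sum(mixing[j][k] * f[j] for j in range(len(f)))
                 for k in range(len(priors))]
    joint = [d * P for d, P in zip(densities, priors)]
    total = sum(joint)
    return [v / total for v in joint]


# Invented parameters: M = 2 components, a = 2 attributes, c = 2 classes.
means = [[0.0, 0.0], [3.0, 3.0]]
covs = [[[1.0, 0.3], [0.3, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
mixing = [[0.9, 0.2], [0.1, 0.8]]  # each column sums to 1, as required by Eq. (8)
priors = [0.5, 0.5]
post = prbf_posteriors([0.0, 0.0], means, covs, mixing, priors)
```

Because both classes share the same pool of components, only the mixing coefficients distinguish the class densities.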

3.1 Marginalization of PRBF

A notably convenient characteristic of the Gaussian distribution is the marginalization property: if the joint distribution of a set of random variables $S = \{A_1, \ldots, A_a\}$ is Gaussian with mean $\mu$ and covariance matrix $\Sigma$, then for any subset $A$ of these variables, the joint distribution of the subset $Q = S - A$ of the remaining variables is also a Gaussian. The mean $\mu_{\backslash A}$ of this Gaussian is obtained by removing from $\mu$ the components corresponding to the variables in subset $A$, and the covariance matrix $\Sigma_{\backslash A}$ is obtained by removing the rows and columns of $\Sigma$ corresponding to the variables in subset $A$. Therefore, if we know the mean and covariance of the joint distribution of a set of variables, we can immediately obtain the distribution of any subset of these variables. For an input $x = (A_1 = v_1, \ldots, A_a = v_a)$ each output $p(x|y_k)$, $k = 1, \ldots, c$ of the PRBF is computed as a mixture of Gaussians:

$$p(x|y_k) = \sum_{j=1}^{M} \pi_{jk} N(x; \mu_j, \Sigma_j) \tag{11}$$

Consequently, based on the marginalization property of the Gaussian distribution, it is straightforward to analytically compute $p(x \backslash \{A\}|y_k)$ obtained by excluding the subset $A$ of the attributes:

$$p(x \backslash \{A\}|y_k) = \sum_{j=1}^{M} \pi_{jk} N(x \backslash \{A\}; \mu_{j \backslash \{A\}}, \Sigma_{j \backslash \{A\}}) \tag{12}$$

where $\mu_{j \backslash \{A\}}$ and $\Sigma_{j \backslash \{A\}}$ are obtained by removing the corresponding elements from $\mu_j$ and $\Sigma_j$. Then we can directly obtain $p(y_k|x \backslash \{A\})$ as

$$p(y_k|x \backslash \{A\}) = \frac{p(x \backslash \{A\}|y_k) P_k}{\sum_{\ell=1}^{c} p(x \backslash \{A\}|y_\ell) P_\ell} \tag{13}$$

and use it as a replacement for (2) and (3) in the EXPLAIN method. We use the same property in the IME method, and for a subset of features $Q$ we get

$$h(x_Q|y_k) = \sum_{j=1}^{M} \pi_{jk} N(x_Q; \mu_{jQ}, \Sigma_{jQ}) \tag{14}$$

where $\mu_{jQ}$ and $\Sigma_{jQ}$ are obtained by retaining only the elements from $Q$ in $\mu_j$ and $\Sigma_j$. We obtain $h(y_k|x_Q)$ as

$$h(y_k|x_Q) = \frac{h(x_Q|y_k) P_k}{\sum_{\ell=1}^{c} h(x_Q|y_\ell) P_\ell} \tag{15}$$

and use it in (4) and (6). As a consequence, with PRBF models the EXPLAIN and IME explanations become much more efficient, and they are exact, as no approximation is needed. Classification with a subset of features requires only a mask which selects the appropriate elements from $\mu$ and the appropriate rows and columns from the $\Sigma$ matrix.
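The mask operation can be illustrated with a short sketch. The function `marginalize` below (a hypothetical name) selects the retained entries of a mean vector and the retained rows and columns of a covariance matrix. The remaining code is a numerical sanity check, on invented numbers, that integrating a bivariate Gaussian over the removed attribute reproduces the Gaussian of the retained attribute.

```python
import math


def marginalize(mean, cov, keep):
    """Retain only the attributes indexed by `keep`, as used in Eqs. (12) and (14)."""
    sub_mean = [mean[i] for i in keep]
    sub_cov = [[cov[i][j] for j in keep] for i in keep]
    return sub_mean, sub_cov


def normal_pdf_1d(x, mu, var):
    """Univariate normal density."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def normal_pdf_2d(x, mean, cov):
    """Bivariate normal density written out explicitly for a 2x2 covariance."""
    (a, b), (_, d) = cov
    det = a * d - b * b
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    quad = (d * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))


mean = [1.0, -2.0]
cov = [[2.0, 0.6], [0.6, 1.0]]
sub_mean, sub_cov = marginalize(mean, cov, keep=[0])

# Riemann sum over the removed attribute on [-12, 8], compared with the
# analytic marginal of the retained attribute at x0.
x0, step = 0.5, 0.01
integral = step * sum(
    normal_pdf_2d([x0, -12.0 + k * step], mean, cov) for k in range(2001)
)
analytic = normal_pdf_1d(x0, sub_mean[0], sub_cov[0][0])
```

Applying the same selection to every component of the mixture and renormalizing with the class priors gives the subset posteriors of (13) and (15) without retraining.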

4 Instance Level and Model Level Explanation of PRBF

To show the practical utility of the proposed marginalization we demonstrate it with a visualization on the well-known Titanic data set. As there are no strong interactions in this data set we can use the EXPLAIN method. The IME method produces similar graphs, and its details can be found in [13]. The learning task is to predict the survival of a passenger in the Titanic disaster. The three attributes report the passenger's status during travel (first, second, or third class, or crew), age (adult or child), and gender. The left-hand graph in Fig. 1 shows an example of an explanation for the decision of the PRBF network for an instance concerning a first class adult male passenger.

[Figure 1. Left panel: "Data set: titanic; model: PRBF; p(titanic survived=yes|x) = 0.26; true titanic survived=yes"; vertical axis: attributes (sex, age, status) with their values for the instance (male, adult, first); horizontal axis: weight of evidence (−6 to 6). Right panel: "Data set: titanic, titanic survived=yes; model: PRBF"; vertical axis: attributes and values (sex: female, male; age: child, adult; status: crew, third, second, first); horizontal axis: average weight of evidence (−50 to 50).]

Fig. 1. Instance explanation (left-hand side) for one of the instances in the Titanic data set classified with the PRBF model. The explanation of the PRBF model is presented on the right-hand side.


The weight of evidence is shown on the horizontal axis. The vertical axis contains the names of the attributes on the left-hand side and their values for the chosen instance on the right-hand side. The probability $p(y|x)$ for class “survived=yes” returned by the PRBF model for the given instance $x$ is reported at the top (0.26). The lengths of the thicker bars correspond to the influences of the given attribute values in the model, expressed by (1) and computed using (13). Positive weight of evidence is shown on the right-hand side and negative on the left-hand side. Thinner bars above the explanation bars indicate the average value of the weight of evidence over all training instances for the corresponding attribute value. For the given instance we observe that “sex = male” speaks strongly against survival and “status = first class” is favorable for survival, while being adult has no influence. The exact probabilities $p(y_k|x \backslash A)$ obtained from the decomposition (13) are 0.73 without “sex = male”, 0.19 without “status = first class”, and 0.26 without “age = adult”. The thinner average bars mostly agree with the effects expressed for the particular instance (being male is on average less dangerous than in the selected case, and being in the first class is even more beneficial).

To get a more general view of the model we can use explanations for the training data and visualize them in a summary form, which shows the average importance of each feature and its values. An example of such a visualization for the Titanic data set is presented in the right-hand graph of Fig. 1. On the left-hand side of the vertical axis, all the attributes and their values are listed (each attribute and its values are separated by dashed lines). For each of them the average negative and the average positive weights of evidence are presented with a horizontal bar. For each attribute (a darker shade) the average positive and negative influences of its values are given.

For the Titanic problem status and sex have approximately the same effect, and age plays a less important role in the PRBF model. Particular values give an even more precise picture: first and second class are perceived as undoubtedly advantageous, being a child or female has a greater positive than negative effect, traveling in third class or being male is considered a disadvantage, while being a crew member or an adult plays only a minor negative role.

5 Conclusions

We presented a decomposition for the PRBF classifier by exploiting the marginalization property of the Gaussian distribution and applied this decomposition inside two general methods for explaining predictions for individual instances. We showed how we can explain and visualize the classifications of unlabeled cases provided by the otherwise opaque PRBF model. We demonstrated a visualization technique which uses explanations of the training instances to describe the effects of all the attributes and their values at the model level. The explanation methods EXPLAIN and IME exhibit the following properties:

– Model dependency: the decision process takes place inside the model, so if the model is wrong for a given problem, the explanation will reflect that; it will be correct for the model and therefore wrong for the problem.
– Instance dependency: different instances are predicted differently, so the explanations will also be different.
– Class dependency: explanations for different classes are different, and different attributes may have different influence on different classes (for two-class problems, the effect is complementary).
– Capability to detect strong conditional dependencies: if the model captures a strong conditional dependency, the explanations will also reflect that.
– The EXPLAIN method is unable to detect and correctly evaluate the utility of attributes' values in instances where a change in more than one attribute value at once is needed to affect the predicted value. The IME method samples the space of feature interactions and therefore avoids this problem. In PRBF it is straightforward, and comes at no extra cost, to marginalize any number of features, which is a useful property for sampling subsets of features.
– Visualization ability: the generated explanations can be graphically presented in terms of the positive or negative effect each attribute and its value have on the selected class.

While the explanation methodology presented has been successfully used in a medical application [13], in future work we plan to use it with PRBF as well and to use feedback provided by experts to further improve the visualization tool.

References
[1] Bishop, C.M.: Neural Networks for Pattern Recognition. Oxford University Press, Oxford (1995)
[2] Constantinopoulos, C., Likas, A.: An incremental training method for the probabilistic RBF network. IEEE Trans. Neural Networks 17(4), 966–974 (2006)
[3] Jacobsson, H.: Rule extraction from recurrent neural networks: A taxonomy and review. Neural Computation 17(6), 1223–1263 (2005)
[4] McLachlan, G., Peel, D.: Finite Mixture Models. John Wiley & Sons, Chichester (2000)
[5] Možina, M., Demšar, J., Kattan, M.W., Zupan, B.: Nomograms for visualization of naive bayesian classifier. In: Boulicaut, J.-F., Esposito, F., Giannotti, F., Pedreschi, D. (eds.) PKDD 2004. LNCS (LNAI), vol. 3202, pp. 337–348. Springer, Heidelberg (2004)
[6] Robnik-Šikonja, M., Kononenko, I.: Explaining classifications for individual instances. IEEE Transactions on Knowledge and Data Engineering 20(5), 589–600 (2008)
[7] Setiono, R., Liu, H.: Understanding neural networks via rule extraction. In: Proceedings of IJCAI 1995, pp. 480–487 (1995)
[8] Shapley, L.S.: A value for n-person games. In: Contributions to the Theory of Games, vol. II. Princeton University Press, Princeton (1953)
[9] Titsias, M.K., Likas, A.: Shared kernel models for class conditional density estimation. IEEE Trans. Neural Networks 12(5), 987–997 (2001)
[10] Titsias, M.K., Likas, A.: Class conditional density estimation using mixtures with constrained component sharing. IEEE Trans. Pattern Anal. and Machine Intell. 25(7), 924–928 (2003)
[11] Towell, G.G., Shavlik, J.W.: Extracting refined rules from knowledge-based neural networks. Machine Learning 13(1), 71–101 (1993)
[12] Štrumbelj, E., Kononenko, I.: An Efficient Explanation of Individual Classifications using Game Theory. Journal of Machine Learning Research 11, 1–18 (2010)
[13] Štrumbelj, E., Kononenko, I., Robnik-Šikonja, M.: Explaining instance classifications with interactions of subsets of feature values. Data & Knowledge Engineering 68(10), 886–904 (2009)

Evolving Sum and Composite Kernel Functions for Regularization Networks

Petra Vidnerová and Roman Neruda

Institute of Computer Science, Academy of Sciences of the Czech Republic, Pod vodárenskou věží 2, Praha 8, Czech Republic [email protected]

Abstract. In this paper we propose a novel evolutionary algorithm for regularization networks. The main drawback of regularization networks in practical applications is the presence of meta-parameters, including the type and parameters of kernel functions. Our learning algorithm provides a solution to this problem by searching through a space of different kernel functions, including sum and composite kernels. Thus, an optimal combination of kernel functions with parameters is evolved for a given task specified by training data. Comparisons of composite kernels, single kernels, and traditional Gaussians are provided in several experiments.

Keywords: regularization networks, kernel functions, genetic algorithms.

1 Introduction

Regularization theory presents a sound framework for solving supervised learning problems. Regularization networks (RN) benefit from a very good theoretical background [1,2,3] and a simple, yet quite efficient learning algorithm [4]. Their disadvantage is the presence of meta-parameters that are supposed to be known in advance. These meta-parameters are the type of kernel function and the regularization parameter. In addition, the kernel function typically has additional parameters, for instance the width of the Gaussian kernel.

In this paper we introduce a method for the optimization of RN meta-parameters. The method is based on minimization of the cross-validation error, which is an estimate of the generalization ability of a network, by means of genetic algorithms. Different species are employed corresponding to different kinds of kernel functions, thus a natural co-evolution is employed to solve the meta-learning process. The algorithm is also able to represent composite kernel functions. The composite kernel functions are known to possess the same theoretical properties as the single kernel units, yet they can represent the underlying geometry of the data better.

The paper is organized as follows. In the next section, the regularization network is introduced. In Section 3 sum and composite kernel functions are introduced. Section 4 describes the genetic parameter search. Section 5 contains results of our experiments. The conclusion can be found in Section 6.

A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part I, LNCS 6593, pp. 180–189, 2011. © Springer-Verlag Berlin Heidelberg 2011

2 Regularization Networks

In order to develop regularization networks we formulate the problem of supervised learning as a function approximation problem. We are given a set of examples {(xi , yi ) ∈ Rd ×R}N i=1 obtained by random sampling of some real function f , and we would like to ﬁnd this function. Since this problem is ill-posed, we have to add some a priori knowledge about the function f . We usually assume that the function is smooth, in the sense that two similar inputs correspond to two similar outputs, and that the function does not oscillate too much. This is the main idea of the regularization theory, where the solution is found by minimizing the functional (1) containing both the data and smoothness information. N 1 H[f ] = (f (xi ) − yi )2 + γΦ[f ], N

(1)

i=1

where $\Phi$ is called a stabilizer and $\gamma > 0$ is the regularization parameter controlling the trade-off between the closeness to data and the smoothness of the solution. The regularization approach has a sound theoretical background: it was shown that for a wide class of stabilizers the solution has the form of a feed-forward neural network with one hidden layer, called a regularization network, and that different types of stabilizers lead to different types of regularization networks [3,4].

Poggio and Smale in [4] proposed a learning algorithm (Alg. 1) derived from the regularization scheme (1). They choose the hypothesis space as a Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}_K$ defined by an explicitly chosen, symmetric, positive-definite kernel function $K_x(x') = K(x, x')$. The stabilizer is defined by means of the norm in $\mathcal{H}_K$, so the problem is formulated as follows:

$$\min_{f \in \mathcal{H}_K} H[f], \quad \text{where} \quad H[f] = \frac{1}{N} \sum_{i=1}^{N} (y_i - f(x_i))^2 + \gamma \|f\|_K^2. \qquad (2)$$

The solution of minimization (2) is unique and has the form

$$f(x) = \sum_{i=1}^{N} w_i K_{x_i}(x), \qquad (N \gamma I + K) w = y, \qquad (3)$$

where $I$ is the identity matrix, $K$ is the matrix $K_{i,j} = K(x_i, x_j)$, and $y = (y_1, \ldots, y_N)$. The solution (3) can be represented by a neural network with one hidden layer and a linear output layer. The most commonly used kernel function is the Gaussian $K(x, x') = e^{-\left\|\frac{x - x'}{b}\right\|^2}$.

The power of Alg. 1 is in its simplicity and effectiveness. However, the network needs fixed values of meta-parameters such as $\gamma$, the type of kernel function, and its parameters (e.g. the width of the Gaussian). With these fixed, the algorithm reduces to solving the linear system of equations (4). The real performance of the algorithm depends significantly on the choice of the meta-parameters $\gamma$ and the kernel function. However, their optimal choice depends on the particular data set, and there is no general heuristic for setting them.

P. Vidnerová and R. Neruda

Input: Data set $\{x_i, y_i\}_{i=1}^{N} \subseteq X \times Y$
Output: Function $f$.
1. Choose a symmetric, positive-definite function $K_x(x')$, continuous on $X \times X$.
2. Create $f: X \to Y$ as $f(x) = \sum_{i=1}^{N} w_i K_{x_i}(x)$ and compute $w = (w_1, \ldots, w_N)$ by solving
$$(N \gamma I + K) w = y, \qquad (4)$$
where $I$ is the identity matrix, $K_{i,j} = K(x_i, x_j)$, $y = (y_1, \ldots, y_N)$, and $\gamma > 0$.

Algorithm 1. RN learning algorithm
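The two steps of Alg. 1 can be sketched in a few lines of Python. This is a minimal illustration only, not the authors' implementation: the function names, the toy data, the value of GAMMA and the unit kernel width are assumptions, and the small Gaussian elimination routine merely stands in for a LAPACK solver.

```python
import math

GAMMA = 1e-3  # regularization parameter, chosen arbitrarily for this sketch

def gaussian_kernel(x, z, width=1.0):
    """Gaussian kernel exp(-(||x - z|| / width)^2); the width is an assumed value."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / width ** 2)

def solve_linear_system(a, b):
    """Solve a w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so that the inputs stay untouched.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * w[c] for c in range(r + 1, n))
        w[r] = (m[r][n] - tail) / m[r][r]
    return w

def train_rn(xs, ys, kernel, gamma):
    """Step 2 of Alg. 1: return the weights w solving (N*gamma*I + K) w = y."""
    n = len(xs)
    system = [[kernel(xs[i], xs[j]) + (n * gamma if i == j else 0.0)
               for j in range(n)] for i in range(n)]
    return solve_linear_system(system, ys)

def predict_rn(x, xs, w, kernel):
    """Evaluate f(x) = sum_i w_i K(x_i, x)."""
    return sum(wi * kernel(xi, x) for wi, xi in zip(w, xs))

# Toy one-dimensional data set (hypothetical values).
xs = [(0.0,), (0.5,), (1.0,), (1.5,)]
ys = [0.0, 0.48, 0.84, 1.0]
weights = train_rn(xs, ys, gaussian_kernel, GAMMA)
fitted = [predict_rn(x, xs, weights, gaussian_kernel) for x in xs]
```

Because of the regularization term, the fitted values differ from the targets by exactly N*gamma*w_i at each training point; larger values of gamma therefore trade accuracy on the training data for smoothness.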

3 Sum and Composite Kernel Functions

The kernel function used in the RN learning algorithm (Alg. 1) is traditionally supposed to be given in advance, for instance chosen by a user. In fact, the choice of a kernel function is equivalent to the choice of a prior assumption about the problem at hand. Such a choice is therefore crucial for the quality of the solution and should always be made according to the given task. However, the choice of the kernel is most often not a part of the learning algorithm. The most frequently chosen kernel is the Gaussian, a choice only partially supported by our experimental results. In [5] we experimented with genetic algorithms evolving different kernel functions, and the most successful one was the inverse multiquadric, which has also demonstrated very good performance in the experiments with composite kernels presented below.

In the following we introduce sum and composite kernel functions. Following the reasoning of Aronszajn's breakthrough paper [6], where sums of reproducing kernels are considered, it can be easily shown that these types of kernels can be used as activation functions in regularization networks (cf. [7]). Moreover, kernel functions that are created as a combination of simpler kernel functions might better reflect the character of the data. Therefore we propose sum and composite kernel functions to be utilized in the learning algorithm described below.

By a sum kernel we mean a kernel function $K$ that can be expressed as $K(x, y) = K_1(x, y) + K_2(x, y)$, where $K_1$ and $K_2$ are kernel functions. By a composite kernel we mean a kernel function $K$ that can be expressed as a linear combination of other kernel functions, $K(x, y) = \alpha K_1(x, y) + \beta K_2(x, y)$, where $K_1$ and $K_2$ are kernel functions and $\alpha, \beta \in \mathbb{R}$. We can combine different kernel functions, or two kernel functions of the same type but with different parameters, such as two Gaussians of different widths (note that in this case the Gaussians have the same center).
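As an illustration of these definitions, sum and composite kernels can be built from elementary kernels as plain Python closures. The sketch below is only one possible encoding: the helper names are hypothetical, the Gaussian is parameterized by a width in the denominator and the inverse multiquadric uses the standard form, both of which are assumptions, and the numerical parameters are example values.

```python
import math

def sq_dist(x, z):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(x, z))

def gaussian(width):
    """Gaussian kernel with the given width (assumed parameterization)."""
    return lambda x, z: math.exp(-sq_dist(x, z) / width ** 2)

def inverse_multiquadric(c):
    """Inverse multiquadric kernel (||x - z||^2 + c^2)^(-1/2)."""
    return lambda x, z: 1.0 / math.sqrt(sq_dist(x, z) + c ** 2)

def composite_kernel(alpha, k1, beta, k2):
    """Composite kernel: K = alpha * K1 + beta * K2."""
    return lambda x, z: alpha * k1(x, z) + beta * k2(x, z)

def sum_kernel(k1, k2):
    """Sum kernel: the special case alpha = beta = 1."""
    return composite_kernel(1.0, k1, 1.0, k2)

# A narrow Gaussian added to a wider inverse multiquadric (example parameters).
narrow_plus_wide = sum_kernel(gaussian(0.20), inverse_multiquadric(1.05))
# A weighted combination of the same two kernel types (example parameters).
weighted = composite_kernel(0.07, inverse_multiquadric(0.12), 0.99, gaussian(1.98))

print(narrow_plus_wide((0.0,), (0.0,)), weighted((0.0,), (0.0,)))
```

Since each combined kernel is again a function of two points, it can be passed to a solver for system (4) in place of a single elementary kernel without any other change.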


4 Genetic Parameter Search

The first step of the learning algorithm design is the selection of the criterion. An optimal RN should not only approximate the data from the training set, but also have a good generalization ability. Our estimate of the generalization ability is the cross-validation error. Thus, we search for network meta-parameters that minimize the cross-validation error.

For this optimization we use genetic algorithms (GA) [8], a technique for finding approximate solutions to optimization and search problems. GA typically work with a population of individuals embodying abstract representations of feasible solutions. Each individual is assigned a fitness value, which measures the quality of the solution it represents: the better the solution, the higher the fitness value. The evolution starts from a population of completely random individuals and iterates in generations. In every generation, the fitness of each individual is evaluated. Individuals are stochastically selected from the current population (based on their fitness) and modified by means of the mutation and crossover operators to form a new population. The new population is then used in the next iteration of the algorithm, so that the population evolves towards better solutions.

We work with individuals encoding the parameters of the RN learning algorithm (Alg. 1). In our case they represent the type of the kernel function, its additional parameters, and the regularization parameter. When the type of the kernel function is known in advance, the individual consists only of the kernel's parameter (i.e. the width in the case of the Gaussian kernel) and the regularization parameter. In the case of a simple kernel function, the individual is coded as I = {type of kernel function, kernel parameter, $\gamma$}, e.g. I = {Gaussian, width = 0.5, $\gamma$ = 0.01}. In the case of a composite kernel function, the individual looks like I = {$\alpha$, type of kernel function $K_1$, kernel parameter, $\beta$, type of kernel function $K_2$, kernel parameter, $\gamma$}.

New generations of individuals are created using the operators of selection, crossover and mutation. Mutation introduces small random perturbations to existing individuals. For floating-point parameters, mutation adds a small normally distributed random value. There are several crossover operators, depending on the type of kernel function represented by the individuals.

In order to work with individuals representing different kernel types we introduce the co-evolution principle of species, cf. [5]. Individuals with different kernel functions naturally represent different species in our case, and each species forms one subpopulation. The selection operator is performed on the whole population, and the selected individual is inserted into a subpopulation according to its kernel type. The crossover operator is then performed only among the individuals of the same subpopulation.


The first type of crossover works with individuals encoding simple kernels of the same type known in advance. Thus, only the kernel parameter and the $\gamma$ parameter are subject to crossover. New parameter values of the offspring are chosen randomly from the interval defined by the ancestor values. In the case of sum and composite kernels, the crossover is defined as a uniform crossover operating on the level of sub-kernels, randomly interchanging the sub-kernels.

The fitness of an individual reflects the objective, which is the minimization of the cross-validation error. Thus, the lower the cross-validation error, the higher the fitness value. Based on the fitness values, the selection operator is applied, which is either a standard roulette wheel selection or a standard tournament selection.
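The operators described in this section can be sketched as follows for individuals encoding a simple kernel. This is an illustrative sketch under stated assumptions rather than the implementation used in the experiments: the dictionary encoding, the mutation strength, the lower bound on parameters, the error-to-fitness map and the example cross-validation errors are all hypothetical.

```python
import random

def mutate(individual, rng, sigma=0.05):
    """Add small normally distributed noise to the floating-point genes."""
    child = dict(individual)
    for gene in ("param", "gamma"):
        # Keep parameters positive; the lower bound 1e-6 is an assumption.
        child[gene] = max(1e-6, child[gene] + rng.gauss(0.0, sigma))
    return child

def crossover(parent_a, parent_b, rng):
    """Same-species crossover: draw each gene from the parents' interval."""
    assert parent_a["kernel"] == parent_b["kernel"]
    child = {"kernel": parent_a["kernel"]}
    for gene in ("param", "gamma"):
        low, high = sorted((parent_a[gene], parent_b[gene]))
        child[gene] = rng.uniform(low, high)
    return child

def fitness(cv_error):
    """Lower cross-validation error gives higher fitness (one possible map)."""
    return 1.0 / (1.0 + cv_error)

def tournament_select(population, fitnesses, rng, size=2):
    """Pick the fittest of `size` randomly drawn individuals."""
    contestants = rng.sample(range(len(population)), size)
    best = max(contestants, key=lambda i: fitnesses[i])
    return population[best]

def split_into_species(individuals):
    """Group individuals into subpopulations by kernel type."""
    species = {}
    for ind in individuals:
        species.setdefault(ind["kernel"], []).append(ind)
    return species

rng = random.Random(0)
population = [
    {"kernel": "Gaussian", "param": 0.5, "gamma": 0.01},
    {"kernel": "Gaussian", "param": 1.5, "gamma": 0.10},
    {"kernel": "InvMq", "param": 1.0, "gamma": 0.05},
]
cv_errors = [2.0, 1.0, 3.0]  # hypothetical cross-validation errors
fitnesses = [fitness(e) for e in cv_errors]
species = split_into_species(population)
child = crossover(*species["Gaussian"], rng)
mutant = mutate(child, rng)
winner = tournament_select(population, fitnesses, rng, size=len(population))
```

Selection here draws from the whole population, while crossover is only applied within one species, mirroring the division of labour described above. In a full run, the cross-validation error of each individual would come from training an RN with the encoded kernel and $\gamma$.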

5 Experiments

A collection of benchmark problems called Proben1 [9] has been used throughout our experiments. The Proben1 tasks are listed in Table 1. Each task is present in three variants in the database, corresponding to three different partitionings into training and testing sets. The general setup of our experiments is described by Algorithm 2. The standard numerical library LAPACK [10] was used to solve the linear systems.

Input: Training set $T_T: \{x_i, y_i\}_{i=1}^{N_T} \subseteq X \times Y$, testing set $T_S: \{x_i, y_i\}_{i=1}^{N_S} \subseteq X \times Y$
Output: Function $f$ represented by the corresponding RN, testing error $E$ of the network.
1. Find the values for $\gamma$ and the optimal kernels $K$ using the genetic search described in Section 4.
2. Use the whole training set $T_T$ and the parameters found in Step 1 to estimate the outer weights of the RN by linear regression.
3. Evaluate the error $E$ of the network on the testing set $T_S$:
$$E = 100 \, \frac{1}{N_S m} \sum_{i=1}^{N_S} \|y_i - f(x_i)\|^2, \qquad (5)$$
where $\|\cdot\|$ denotes the Euclidean norm and $m$ is the number of outputs.

Algorithm 2. Testing algorithm scheme
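The error measure (5) is straightforward to compute. The following sketch assumes that targets and predictions are given as equally long lists of m-dimensional tuples; the function name and the sample values are hypothetical.

```python
def proben_error(targets, predictions):
    """E = 100 / (N * m) * sum_i ||y_i - f(x_i)||^2 for m-dimensional outputs."""
    n = len(targets)
    m = len(targets[0])
    total = sum((t - p) ** 2
                for target, pred in zip(targets, predictions)
                for t, p in zip(target, pred))
    return 100.0 * total / (n * m)

# Three samples with two outputs each (hypothetical values).
targets = [(1.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
predictions = [(0.9, 0.1), (0.2, 0.8), (0.6, 0.4)]
error = proben_error(targets, predictions)
```

For these values the squared norms are 0.02, 0.08 and 0.32, so the error is 100 * 0.42 / (3 * 2), which is approximately 7.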

Three experiments were performed. First, we evolved elementary kernel functions using the genetic algorithm with species [5]. The population had four subpopulations corresponding to four common kernel functions: Gaussian ($K(x, y) = e^{-\|x - y\|^2}$), multiquadric ($K(x, y) = (\|x - y\|^2 + c^2)^{1/2}$), inverse multiquadric ($K(x, y) = (\|x - y\|^2 + c^2)^{-1/2}$), and sigmoid ($K(x, y) = \tanh(xy - \theta)$). In the second experiment sum kernels were evolved, while the third experiment dealt with evolving composite kernels. As described before, the regularization parameter and the kernel parameters were also found by evolution.

Table 1. Overview of Proben1 tasks. Number of inputs (n), number of outputs (m), number of samples in training and testing sets (Ntrain, Ntest). Type of task: approximation or classification.

Task name   n    m   Ntrain   Ntest   Type
cancer      9    2   525      174     class
card        51   2   518      172     class
flare       24   3   800      266     approx
glass       9    6   161      53      class
heartac     35   1   228      75      approx
hearta      35   1   690      230     approx
heartc      35   2   228      75      class
heart       35   2   690      230     class
horse       58   3   273      91      class

[Figure 1: three panels plotting the winning kernels over the range -10 to 10.
Inverse multiquadric: Etrain 1.83, Etest 1.50.
Sum kernel (inverse multiquadric plus Gaussian): Etrain 0.01, Etest 1.64.
Composite kernel (inverse multiquadric combined with Gaussian): Etrain 0.14, Etest 1.53.]

Fig. 1. Winning kernels for cancer1 task

[Figure 2: three panels plotting the winning kernels over the range -10 to 10.
Inverse multiquadric: Etrain 1.41, Etest 2.92.
Sum kernel (inverse multiquadric plus Gaussian): Etrain 0.01, Etest 2.93.
Composite kernel (inverse multiquadric plus sigmoid): Etrain 1.34, Etest 2.92.]

Fig. 2. Winning kernels for cancer2 task

[Figure 3: three panels plotting the winning kernels over the range -10 to 10.
Inverse multiquadric: Etrain 2.32, Etest 6.13.
Sum kernel (inverse multiquadric plus Gaussian): Etrain 0.02, Etest 6.09.
Composite kernel (inverse multiquadric plus sigmoid): Etrain 2.19, Etest 6.05.]

Fig. 3. Winning kernels for glass1 task

Table 2. Error on training and testing set obtained by RN with Gaussian kernels and RN with meta-parameters optimized by genetic parameter search

Task       Gaussian kernel    Inv. multiquadric   Sum kernel         Composite kernel
           Etrain   Etest     Etrain   Etest      Etrain   Etest     Etrain   Etest
cancer1     2.29     1.76      1.83     1.50       0.01     1.64      0.14     1.53
cancer2     1.85     3.01      1.41     2.92       0.01     2.93      1.34     2.92
cancer3     2.05     2.78      1.74     2.54       1.59     2.53      0.51     2.85
card1       7.17    10.29      7.56    10.03       7.60    10.04      6.45    10.06
card2       5.72    13.19      6.06    12.74       5.91    12.89      4.60    12.74
card3       5.93    12.68      6.35    12.28       6.18    12.48      6.35    12.29
flare1      0.34     0.54      0.35     0.54       0.35     0.54      0.35     0.54
flare2      0.41     0.27      0.41     0.27       0.40     0.28      0.41     0.27
flare3      0.39     0.33      0.39     0.33       0.39     0.33      0.39     0.33
glass1      3.37     6.99      2.32     6.13       0.02     6.09      2.19     6.05
glass2      3.92     7.74      1.06     6.79       0.13     6.86      0.76     6.85
glass3      4.11     7.36      2.67     6.27       2.61     6.24      0.50     6.61
heartac1    3.39     3.30      3.66     3.03       3.68     3.04      3.71     3.06
heartac2    2.08     4.15      2.42     3.95       2.33     4.02      2.40     3.95
heartac3    2.52     5.10      2.85     5.13       2.70     5.14      0.64     5.10
hearta1     2.58     4.43      2.73     4.28       2.68     4.30      2.49     4.28
hearta2     2.12     4.32      2.41     4.20       2.30     4.26      2.46     4.22
hearta3     2.59     4.44      2.84     4.40       2.79     4.39      1.44     4.41
heartc1     7.73    16.03      8.23    15.93       8.21    15.92      1.39    15.59
heartc2    10.67     6.82     10.99     6.47       0.01     7.17      0.02     6.38
heartc3     6.88    13.36      7.22    12.86       7.25    12.85      7.23    12.85
heart1      8.56    13.70      9.21    13.55       0.14    13.99      0.87    13.76
heart2      7.99    14.15      8.17    13.88       8.06    14.03      8.22    13.92
heart3      5.85    17.03      5.92    16.85       5.93    16.84      5.85    16.88
horse1      2.52    13.31      4.15    11.77       0.16    12.47      0.26    11.93
horse2      1.58    16.12      3.63    15.22       3.68    15.21      0.72    15.03
horse3      2.27    14.62      3.84    13.53       3.43    13.54      0.33    13.22


Table 3. Kernel functions found by genetic algorithm. Gauss(0.15) stands for Gaussian kernel with width 0.15, Sgm stands for sigmoid, InvMq for inverse multiquadric, etc.

Task       Sum kernel                 Composite kernel
cancer1    Gauss(0.20)+InvMq(1.05)    0.07*InvMq(0.12)+0.99*Gauss(1.98)
cancer2    Gauss(0.15)+InvMq(1.05)    0.55*InvMq(0.49)+0.31*Sgm(1.62)
cancer3    Gauss(1.99)+InvMq(0.72)    0.77*Gauss(0.13)+0.22*Sgm(1.97)
card1      InvMq(1.9)+InvMq(1.99)     0.35*InvMq(1.98)+0.01*Gauss(0.54)
card2      Gauss(1.99)+InvMq(1.79)    0.04*Gauss(0.56)+0.96*InvMq(1.99)
card3      Gauss(1.99)+InvMq(1.99)    0.95*InvMq(1.98)+0.25*InvMq(1.98)
flare1     InvMq(1.99)+InvMq(1.99)    0.19*InvMq(1.97)+0.97*InvMq(1.98)
flare2     Gauss(1.98)+Gauss(1.99)    0.09*InvMq(1.95)+0.72*InvMq(1.98)
flare3     InvMq(1.99)+InvMq(1.99)    0.69*InvMq(1.99)+0.51*InvMq(1.97)
glass1     InvMq(0.21)+Gauss(0.03)    0.51*InvMq(0.16)+0.99*Sgm(0.79)
glass2     Gauss(0.05)+InvMq(0.20)    0.59*Gauss(1.10)+0.11*InvMq(0.11)
glass3     InvMq(0.19)+Sgm(0.44)      0.92*InvMq(0.35)+0.62*Gauss(0.05)
heartac1   InvMq(1.99)+InvMq(1.99)    0.50*InvMq(1.99)+0.05*InvMq(1.96)
heartac2   InvMq(1.99)+Gauss(1.99)    0.22*InvMq(1.96)+0.91*InvMq(1.99)
heartac3   Gauss(1.98)+InvMq(1.99)    0.90*InvMq(1.99)+0.17*Gauss(0.02)
hearta1    Gauss(1.99)+InvMq(1.95)    0.01*Gauss(0.13)+0.65*InvMq(1.98)
hearta2    InvMq(1.99)+Gauss(1.99)    0.02*Sgm(0.59)+0.97*InvMq(1.88)
hearta3    InvMq(1.98)+InvMq(1.99)    0.91*InvMq(1.95)+0.07*Gauss(0.05)
heartc1    InvMq(1.98)+InvMq(1.98)    0.59*InvMq(1.99)+0.13*Gauss(0.18)
heartc2    Gauss(0.15)+InvMq(1.88)    0.42*Gauss(0.15)+0.96*InvMq(1.99)
heartc3    InvMq(1.99)+InvMq(1.99)    6.54*InvMq(1.87)+0.93*InvMq(1.99)
heart1     Gauss(0.12)+InvMq(1.97)    0.33*Gauss(0.10)+0.89*InvMq(1.98)
heart2     InvMq(1.97)+Gauss(1.96)    0.12*Gauss(1.97)+0.97*InvMq(1.92)
heart3     InvMq(1.97)+InvMq(1.88)    0.99*InvMq(1.99)+0.02*Gauss(1.32)
horse1     Gauss(0.01)+InvMq(1.73)    0.16*Gauss(0.01)+0.62*InvMq(1.99)
horse2     InvMq(1.99)+InvMq(1.99)    0.86*InvMq(1.97)+0.12*Gauss(0.01)
horse3     InvMq(1.35)+InvMq(1.99)    0.24*Gauss(0.03)+0.83*InvMq(1.97)

Table 2 lists training and testing errors for networks found by the genetic parameter search, first for the Gaussian kernel (which is the most commonly used kernel function), then for the inverse multiquadric (which was the winning kernel function in all cases of the genetic search for the optimal elementary function). Finally, the errors of networks with sum and composite kernels are presented.

The Gaussian function gives the best test errors in only four cases, despite the fact that it is the most commonly used kernel. The inverse multiquadric function gives the best test errors in 17 cases. Sum kernels give the best test error in 6 cases, composite kernels in 14 cases. In terms of training errors, the Gaussian function gives the best results in 7 cases, the inverse multiquadric function in 1 case, sum kernels in 9 cases and composite kernels in 12 cases.

In some cases, sum kernels and composite kernels achieved very low training errors while preserving test errors comparable to other kernels. In these cases we get a very precise approximation of the training data and still have good generalization ability. Most of these kernels are sums or combinations of a wide kernel and a narrow kernel. The narrow member ensures a precise answer at the training points, while the wider kernel is responsible for generalization.

Evolved kernel functions for the tasks cancer1, cancer2 and glass1 are presented in Fig. 1, Fig. 2 and Fig. 3, respectively. For illustration, the winning sum and composite kernels are listed in Tab. 3.

6 Conclusion

In this paper we have explored the possibilities of using composite kernel units for regularization networks. While the subject is theoretically sound, there has so far been no suitable learning algorithm that makes use of these properties. We have proposed an evolutionary learning algorithm that considers sum and composite units and adjusts their parameters and types of kernel functions in order to capture the geometrical properties of the training data better. The learning process has been tested on several benchmark tasks with promising results. In general, the composite kernels have demonstrated superior performance in comparison to single kernel units. The most common winning architecture consists of a combination of a wide and a narrow kernel function, and the majority of the winning kernels contain inverse multiquadric or Gaussian kernels.

Although only kernels defined on real numbers are considered throughout this work, in practical applications we often meet data containing attributes of different types, such as enumerations, sets, strings, texts, etc. Such data may be converted to real numbers by suitable preprocessing, or the regularization network learning framework may be generalized to work on such types. For such a generalization, sophisticated kernel functions defined on various types have been created. Examples of kernel functions defined on objects including graphs, sets, texts, etc. can be found in [11].

Acknowledgments

This research has been supported by the Ministry of Education project no. OC10047, the GAAVČR project no. KJB100300804, and the Institutional Research Plan AV0Z10300504 "Computer Science for the Information Society: Models, Algorithms, Applications".

References

1. Girosi, F., Jones, M., Poggio, T.: Regularization theory and Neural Networks architectures. Neural Computation 2, 219–269 (1995)
2. Kůrková, V.: Learning from data as an inverse problem. In: Antoch, J. (ed.) Computational Statistics, pp. 1377–1384. Physica Verlag, Heidelberg (2004)
3. Poggio, T., Girosi, F.: A theory of networks for approximation and learning. Technical report, Cambridge, MA, USA (1989); A. I. Memo No. 1140, C.B.I.P. Paper No. 31
4. Poggio, T., Smale, S.: The mathematics of learning: Dealing with data. Notices of the AMS 50, 536–544 (2003)
5. Neruda, R., Vidnerová, P.: Genetic algorithm with species for regularization network metalearning. In: Papasratorn, B., Lavangnananda, K., Chutimaskul, W., Vanijja, V. (eds.) Advances in Information Technology. Communications in Computer and Information Science, vol. 114, pp. 192–201. Springer, Heidelberg (2010)
6. Aronszajn, N.: Theory of reproducing kernels. Transactions of the AMS 68, 337–404 (1950)
7. Kudová, P., Šámalová, T.: Sum and product kernel regularization networks. In: Rutkowski, L., Tadeusiewicz, R., Zadeh, L., Zurada, J. (eds.) ICAISC 2006. LNCS (LNAI), vol. 4029, pp. 56–65. Springer, Heidelberg (2006)
8. Mitchell, M.: An Introduction to Genetic Algorithms. MIT Press, Cambridge (1996)
9. Prechelt, L.: PROBEN1 – a set of benchmarks and benchmarking rules for neural network training algorithms. Technical Report 21/94, Universitaet Karlsruhe (1994)
10. LAPACK: Linear algebra package, http://www.netlib.org/lapack/
11. Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge (2004)

Optimisation of Concentrating Solar Thermal Power Plants with Neural Networks

Pascal Richter^{1,2}, Erika Ábrahám^{1}, and Gabriel Morin^{2}

1 Chair of Computer Science 2, RWTH Aachen University, Aachen, Germany
2 Fraunhofer Institute for Solar Energy Systems, Freiburg, Germany

Abstract. The exploitation of solar power for energy supply is of increasing importance. While technical development mainly takes place in the engineering disciplines, computer science offers adequate techniques for simulation, optimisation and controller synthesis. In this paper we describe work from this interdisciplinary area. We introduce our tool for the optimisation of parameterised solar thermal power plants, and report on the employment of genetic algorithms and neural networks for parameter synthesis. Experimental results show the applicability of our approach.

Keywords: Optimization, Solar thermal power plants, Neural networks, Genetic algorithms.

1 Introduction

The contribution of renewable energies to global energy supply has significantly increased over the past ten years. Completely new branches of industry have developed in the fields of solar, wind, and biomass energy. Among such technologies, concentrating solar thermal power (CSP) plants are a promising option for power generation in regions with high direct solar irradiation. The principle seems to be very simple: large mirrors concentrate rays of sunlight to heat water, and the emerging vapour powers a turbine to generate electricity (see Fig. 1).

In the early planning stage of commercial CSP plants, it is necessary to develop a conceptual plant design that fixes the configuration of the plant, such as the solar field size and the temperature and pressure levels in the water cycle. In the ideal case the design minimises the levelised cost of electricity (LCOE), describing the costs per generated electricity unit, for the given project and site, taking specific properties like solar conditions and cooling water availability into account.

In this paper, we describe our simulation-based techno-economical optimisation tool for the development of such project-specific plant concepts that are well designed with respect to economic criteria. The optimisation tool uses adequate computer science techniques: the global optimum of a solar plant configuration is found using a genetic algorithm, whereas the thermodynamic-energetical procedures of a solar plant are described by an artificial neural network.

This work is based on the Fraunhofer ISE project "optisim", which was funded by the German Ministry of Environment (project number FKZ 0325045).

A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part I, LNCS 6593, pp. 190–199, 2011. © Springer-Verlag Berlin Heidelberg 2011

[Figure 1: schematic of a CSP plant divided into a solar block and a power block. Component labels: collector, storage (hot), storage (cold), heat exchanger, steam turbine, generator, condenser, cooling tower, pump.]

Fig. 1. Structure of a concentrating solar thermal power plant. In the solar block, large mirrors collect rays of sunlight and concentrate them on an absorber pipe. The concentrated heat of the sun is used to heat a transfer fluid, usually thermal oil. The hot fluid is either sent to the power block, where in a heat exchanger hot steam is generated from liquid water, or its energy is stored in a molten salt storage for later use after sunset. In the power block the vapour streams through the turbine to power its blades so that the generator generates electricity. Source: Solar Millennium 2009, own modifications

The physical behaviour of CSP plants is very complex, so their analytical optimisation is in practice not possible. Instead we use genetic algorithms [3] for the optimisation. The basic idea is very simple: given an arbitrary set of CSP configurations, we simulate each of them energetically and economically to compute their LCOE averaged over a year. The LCOE serves as the objective function for the optimisation. We select the best configurations with minimal costs and combine them to get a new generation of configurations, for which we repeat the procedure. Through mutation, this method avoids getting stuck in local minima.

In order to get reasonable LCOE values for a CSP configuration, we need to simulate the plant behaviour for each hour of a year. Without further improvements (see Section 4) the optimisation would take about 1000 days due to the time-consuming thermodynamical power block simulation using Thermoflex [12]. To reduce the calculation time we approximate the thermodynamical power block simulation by bilinear interpolation: for each considered configuration, instead of simulating each hour of a year we simulate only at an experimentally determined number of interpolation points. In this way we are able to reduce the running time of the optimisation to 2 days. To further reduce the running time, we train a neural network [5] to learn the required function of the power block behaviour, and replace the simulation by the trained neural network. This is the main contribution of this paper, whereby the computation time is reduced to about 2 hours (including training).

Related work. There are several tools that simulate CSP plants (e.g. Thermoflex [12]). However, these tools are not able to optimise the configurations of CSP plants. In his PhD thesis, Morin [10] connected Thermoflex (for power block simulation) with the solar block simulation tool ColSim [14,10] and with an optimisation algorithm using genetic algorithms and bilinear interpolation (but no neural networks). To our knowledge this is the only work on global optimisation of CSP plant designs, including power block design.

There are also papers on combining genetic algorithms [3] with neural networks [5]. The NNGA approach applies neural networks to find advantageous initial sets for the genetic algorithm (see, e.g., [6]). Conversely, the so-called GANN approach uses genetic algorithms to set the parameters of the neural network. A broad variety of problems has been investigated by different GANN approaches, such as face recognition [4], Boolean function learning and robot control [9], classification of the normality of the thyroid gland [11], color recipe prediction [2], and many more. In contrast to the above approaches, in our work we use neural networks to generate the input data for the genetic algorithms.

The rest of the paper is structured as follows. Section 2 describes the simulation of CSP plants. Section 3 is devoted to the optimisation using genetic algorithms. In Section 4 we use bilinear interpolation and employ neural networks to speed up the optimisation. After presenting experimental results in Section 5, we conclude the paper in Section 6 with an outlook on future work.

2 The Simulation of CSP Plants

The aim of optimising concentrating solar thermal power plants is to generate electricity as cheaply as possible. The cost-efficiency of a power plant is generally specified by the so-called levelised cost of electricity (LCOE), which describes the costs per generated electricity unit (e.g. in Eurocent per kWh).

We consider up to 20 design parameters of a CSP plant. Examples of such parameters are solar field size, storage capacity, condenser size, distance between collector rows, as well as pressure and temperature levels. We use $p_{design}$ to denote the design parameters of the CSP plant (see Fig. 1 for the CSP plant structure). Our goal is to find a configuration of these parameters that yields a minimal LCOE.

For a fixed configuration of the CSP plant the LCOE must be calculated under consideration of the seasonal and daily variations of its site parameters: the direct normal solar irradiance (DNI) has an influence on the collected thermal power in the solar block, and the ambient temperature influences the cooling section of the power block¹. Therefore, the model for computing the LCOE is based on hourly time resolution over one year. We use $q_{DNI}^t$ and $T_{amb}^t$ to denote the DNI and the ambient temperature for the $t$th hour of the year.

The LCOE of a CSP plant for a given configuration is computed in two steps: firstly, we determine the electrical net energy $E_{el,net}$ generated over a year using an energetic model of the CSP plant. Secondly, this value is used by the economic plant model to compute the LCOE under consideration of economic assumptions.

[Figure 2: flow diagram. The site parameters ($T_{amb}^t$, $q_{DNI}^t$) and the design parameters of the CSP plant ($p_{design}$) feed the solar block $f_{solar}$, which passes $P_{therm}^t$ to the power block $f_{power}$. Its output $P_{el}^t$ is summed to $E_{el,net}$, which enters the economic model $f_{econ}$ together with the economic assumptions $c_{econ}$. The resulting LCOE is passed to the genetic algorithm $f_{ga}$ (algorithm settings $c_{ga}$), which changes the configuration parameters $p_{design}$.]

Fig. 2. General structure of the optimisation procedure. Given a configuration fixing the design parameters of the CSP plant, we use the solar and power block models to simulate the CSP behaviour and compute the generated electrical energy for each hour of the year, under consideration of the site parameters. Summing up the generated electrical energy for each hour of the year gives the total energy amount, used as input to the economic model to compute the LCOE under some economic assumptions. The LCOE is the basis for the genetic algorithm to evaluate the configuration of the CSP plant and to create new generations of configurations.

2.1 The Energetic Model of a CSP Plant

Physically, the electrical energy is defined as the time integral of the electrical power: $E = \int_t P_{el}(t)\, dt$. The electrical net energy generated over a year by a CSP plant for a given configuration is approximated with the numerical rectangle method as the sum of the electrical power generated during each hour $t$ of a year: $E_{el,net} = \sum_{t=1}^{8760} P_{el}^t$.

To compute the electrical power $P_{el}^t$ for the $t$th hour of a year, we first compute, with the help of the solar block model, the thermal power $P_{therm}^t$ gained by the solar block. This value serves as an input to the power block model that determines the generated power $P_{el}^t$ (see Fig. 2).

The solar block model provides a function $f_{solar}$ to calculate the thermal power $P_{therm}^t = f_{solar}(T_{amb}^t, q_{DNI}^t, p_{design})$ based on the available direct normal solar irradiance $q_{DNI}^t$ and the configuration $p_{design}$ of the solar block. The calculation first determines the optical collector performance and then subtracts heat loss and thermal inertia effects (when heating up or cooling down). We use the ColSim tool [14,10] for these calculations. Depending on the operation strategy, either heat is stored or hot fluid is sent to the power block.

¹ The hotter the ambient air, the less efficient the power plant.

The efficiency of converting thermal energy into electrical energy in the power block depends on the thermal power $P_{therm}^t$, the ambient temperature $T_{amb}^t$ (influencing the cooling section), and the configuration $p_{design}$ of the plant. The power block model provides a function $f_{power}$ to specify for each hour $t$ the electrical power $P_{el}^t = f_{power}(P_{therm}^t, T_{amb}^t, p_{design})$. The power block receives a thermal energy flow from the solar field and/or the storage and converts it first into mechanical and then into electrical energy. Mass and energy balances are computed for each component (e.g., steam turbine, condenser and pumps). We use the Thermoflex [12] tool for these computations.

The total electric net energy $E_{el,net} = \sum_{t=1}^{8760} P_{el}^t$ serves as an input to the economic model.
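The hourly summation can be illustrated with a short Python sketch. The two model functions below are deliberately crude, hypothetical stand-ins for ColSim and Thermoflex: their formulas, the design dictionary and the weather series are assumptions made only to show how $f_{solar}$ and $f_{power}$ are chained and summed over 8760 hours.

```python
HOURS_PER_YEAR = 8760

def f_solar(t_amb, q_dni, design):
    """Toy solar block: thermal power (kW) proportional to DNI and field size."""
    return design["field_area_m2"] * q_dni * design["optical_efficiency"]

def f_power(p_therm, t_amb, design):
    """Toy power block: conversion efficiency drops as ambient air gets hotter."""
    efficiency = design["base_efficiency"] - 0.001 * (t_amb - 20.0)
    return max(0.0, efficiency) * p_therm

def annual_net_energy(t_amb_series, q_dni_series, design):
    """Rectangle rule with a one-hour step: sum the hourly power over the year."""
    total = 0.0
    for t_amb, q_dni in zip(t_amb_series, q_dni_series):
        p_therm = f_solar(t_amb, q_dni, design)
        total += f_power(p_therm, t_amb, design)  # kW over 1 h gives kWh
    return total

# Hypothetical configuration and site data.
design = {"field_area_m2": 1000.0, "optical_efficiency": 0.5, "base_efficiency": 0.3}
t_amb_series = [20.0] * HOURS_PER_YEAR  # constant 20 degrees Celsius
q_dni_series = [0.5 if 8 <= hour % 24 < 16 else 0.0  # kW per square metre
                for hour in range(HOURS_PER_YEAR)]
energy_kwh = annual_net_energy(t_amb_series, q_dni_series, design)
```

With eight sunny hours per day, the toy plant delivers 75 kW during 2920 hours, i.e. about 219000 kWh per year. In the real tool each of the 8760 evaluations of the power block is an expensive simulation, which is what makes this loop the bottleneck of the optimisation.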

2.2 The Economic Model of a CSP Plant

The economic plant model specifies a function $f_{econ}$ to calculate the LCOE. The total investment costs for the solar block and the power block are computed depending on the economic assumptions $c_{econ}$ (e.g. investment costs of collectors, interest rate, etc.) and the configuration $p_{design}$. The investments occurring in the initial project phase are distributed over the lifetime of the plant using the annuity factor. The running costs occurring during the operation of the plant are added on top of the investment-related annuity. The annual running costs consist of staff for operation and maintenance of the plant (e.g. mirror washing), water, spare parts and plant insurance. These economic assumptions are included in $c_{econ}$. The levelised cost of electricity $LCOE = f_{econ}(c_{econ}, p_{design}, P_{el}^{total})$ equals the quotient of the annual costs and the electrical energy generated over a year. The ColSim tool also supports these computations.
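A minimal sketch of this calculation is given below. It assumes the standard textbook annuity factor $i(1+i)^n / ((1+i)^n - 1)$ for interest rate $i$ and lifetime $n$; the text does not state which exact formula the tool uses, and all monetary figures are hypothetical.

```python
def annuity_factor(interest_rate, lifetime_years):
    """Standard annuity factor i(1+i)^n / ((1+i)^n - 1); assumed, not from the paper."""
    growth = (1.0 + interest_rate) ** lifetime_years
    return interest_rate * growth / (growth - 1.0)

def lcoe(investment, annual_running_costs, annual_energy_kwh,
         interest_rate, lifetime_years):
    """Annual costs (annuity plus running costs) divided by annual energy."""
    annuity = investment * annuity_factor(interest_rate, lifetime_years)
    return (annuity + annual_running_costs) / annual_energy_kwh

# Hypothetical plant: 300 million EUR investment, 6 million EUR running costs
# per year, 150 GWh of electricity per year, 8 % interest, 25 years lifetime.
cost_per_kwh = lcoe(300e6, 6e6, 150e6, 0.08, 25)
print(f"LCOE: {100 * cost_per_kwh:.2f} Eurocent per kWh")
```

The design parameters enter this calculation twice: through the investment (for example a larger solar field costs more) and through the annual energy, which is why the optimum is a trade-off rather than simply the largest plant.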

3 Use of Genetic Algorithms to Optimise Solar Plants

As described above, the techno-economical model can be used to compute the LCOE of a configuration. However, the number of possible configurations grows exponentially in the number of parameters, so computing the LCOE for every configuration in the search space is not feasible in practice. Hence, we need an efficient heuristic approach to approximately solve such a multi-dimensional optimisation problem. Genetic algorithms, a special type of evolutionary algorithms, are well suited for this purpose because they do not require knowledge about the problem structure. Furthermore, they can easily handle discontinuities, which is of crucial importance here since technically unrealisable configurations have to be sorted out. Another kind of discontinuity comes from certain technological solutions (e.g. the integer number of collector loops).

A set of initial individuals (in our case configurations) widely spread over the whole search area forms an initial generation. Analogous to biological evolution, genetic algorithms select the best individuals (in our case the configurations with the smallest LCOEs) from the current generation and use operators such as selection, recombination and mutation to produce a new generation. Iterative application of this procedure leads to individuals close to the optimal solution. Our optimisation tool embeds such a genetic algorithm. The implementation is based on the free C++ library GAlib [13]. We treat undesirable or contradictory configurations by penalisation: they are assigned a very high LCOE and are thus discriminated against in the subsequent selection process. On average, the genetic algorithm needs about 25 iterations with 40 configurations per iteration to get close to the global minimum.
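
The selection, recombination, mutation and penalisation steps described above can be sketched as follows. This is a minimal illustration in Python and does not use GAlib; the objective `toy_lcoe`, the two design parameters, the feasibility bounds and the mutation settings are all hypothetical stand-ins for the techno-economic model.

```python
import random

PENALTY_LCOE = 1e9  # very high LCOE assigned to infeasible configurations


def toy_lcoe(config):
    """Hypothetical stand-in for the techno-economic model (not the real one)."""
    loops, storage_hours = config
    if loops < 1 or not 0.0 <= storage_hours <= 16.0:
        return PENALTY_LCOE  # penalise unrealisable configurations
    return (loops - 12) ** 2 * 0.01 + (storage_hours - 6.0) ** 2 * 0.02 + 0.15


def optimise(generations=25, population_size=40, seed=0):
    rng = random.Random(seed)
    # Initial generation spread widely, including infeasible configurations.
    population = [(rng.randint(-5, 40), rng.uniform(-4.0, 20.0))
                  for _ in range(population_size)]
    best = min(population, key=toy_lcoe)
    for _ in range(generations):
        # Selection: keep the better half as parents.
        parents = sorted(population, key=toy_lcoe)[:population_size // 2]
        children = []
        while len(children) < population_size:
            a, b = rng.sample(parents, 2)
            # Recombination: take each parameter from one of the two parents.
            child = (rng.choice((a[0], b[0])), rng.choice((a[1], b[1])))
            # Mutation: small random change; the loop count stays an integer.
            if rng.random() < 0.3:
                child = (child[0] + rng.choice((-1, 1)),
                         child[1] + rng.gauss(0.0, 0.5))
            children.append(child)
        population = children
        best = min(population + [best], key=toy_lcoe)
    return best, toy_lcoe(best)


best_config, best_value = optimise()
print(best_config, best_value)
```

The default of 25 generations with 40 configurations each mirrors the figures quoted in the text; because infeasible configurations receive `PENALTY_LCOE`, they sort last and are dropped during selection.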

4 Improvements of the Optimisation

The running time of the optimisation based on a genetic algorithm is mainly determined by the simulation of the power block (see Section 2.1) that calculates the generated electrical energy. A typical characteristic diagram for the function $f_{power}$ of the power block model is shown in Fig. 3. The computation of the function values must consider complex physical processes and is therefore very time-consuming. For a single configuration, the computation of the electrical energy $P_{el}^i$ generated during a certain hour $i$ of a year needs about 10 seconds (see footnote 2) on a standard computer. With 8760 hours per year, it therefore takes about a day to compute the electrical energy $P_{el}^{total}$ generated over a year. The genetic algorithm considers about 1000 configurations until it gets close to the optimum. Thus the optimisation would need around 1000 days without further improvements. For practical applications we need to reduce the computation time. Below we present two improvements to speed up the optimisation. The first approach involves bilinear interpolation, whereas the second one employs artificial neural networks.

4.1 Bilinear Interpolation

Assume a power block configuration $p_{power}$ is given. Instead of computing the generated electrical energy $P_{el}^t = f_{power}(P_{therm}^t, T_{amb}^t, p_{power})$ for each hour $t$ of a year, we compute it only for the grid points of a two-dimensional grid in the space of thermal power and ambient temperature. We use these grid points together with their computed function values to interpolate the function $f_{power}$ for the given configuration, i.e., to get approximations for $P_{el}^t$ for each hour of the year. We use bilinear interpolation for this purpose, which performs linear interpolation first in one dimension and then in the other dimension. Experiments have shown that it is sufficient to compute the values of $f_{power}$ for a 4×4 grid, i.e., to simulate a configuration, the modified power block model needs only 16 simulations (10 seconds each) instead of 8760 simulations. As the genetic algorithm needs about 1000 configurations to get close to the optimum, we have an approximate running time of about 2 days.

Footnote 2: See Section 5 for more details.
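
The grid-based approximation of Section 4.1 can be sketched as follows. The function `toy_power_block`, the grid values and all helper names are hypothetical; the sketch only illustrates how 16 evaluations on a 4×4 grid can stand in for hourly evaluations via bilinear interpolation.

```python
from bisect import bisect_right


def toy_power_block(p_therm, t_amb):
    """Hypothetical stand-in for the expensive power block simulation."""
    return 0.4 * p_therm - 0.002 * p_therm * t_amb


def build_grid(simulate, p_values, t_values):
    """Evaluate the simulation only at the grid points (16 calls for 4x4)."""
    return [[simulate(p, t) for t in t_values] for p in p_values]


def _cell(values, x):
    """Index i of the grid cell [values[i], values[i + 1]] containing x (clamped)."""
    i = bisect_right(values, x) - 1
    return max(0, min(i, len(values) - 2))


def bilinear(grid, p_values, t_values, p, t):
    """Interpolate linearly in thermal power, then in ambient temperature."""
    i, j = _cell(p_values, p), _cell(t_values, t)
    p0, p1 = p_values[i], p_values[i + 1]
    t0, t1 = t_values[j], t_values[j + 1]
    u = (p - p0) / (p1 - p0)
    # First dimension: interpolate along thermal power at both temperatures.
    low = (1 - u) * grid[i][j] + u * grid[i + 1][j]
    high = (1 - u) * grid[i][j + 1] + u * grid[i + 1][j + 1]
    # Second dimension: interpolate along ambient temperature.
    v = (t - t0) / (t1 - t0)
    return (1 - v) * low + v * high


# Hypothetical 4x4 grid; 16 simulation calls replace 8760 hourly ones.
P_VALUES = [0.0, 50.0, 100.0, 150.0]  # thermal power [MW]
T_VALUES = [0.0, 15.0, 30.0, 45.0]    # ambient temperature [deg C]
GRID = build_grid(toy_power_block, P_VALUES, T_VALUES)
print(bilinear(GRID, P_VALUES, T_VALUES, 73.0, 22.0))
```

The toy function is itself bilinear, so the interpolation reproduces it up to rounding; for the real power block model the interpolation is only an approximation whose accuracy depends on the grid.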

[Figure: electrical power Pel [MWel] plotted over thermal power Ptherm [MWtherm] and ambient temperature Tamb [°C]]

Fig. 3. Typical characteristic diagram of the function $f_{power}$ for a fixed power block configuration, specifying the electrical power for each pair of thermal power and ambient temperature values

4.2 Artificial Neural Networks

To further reduce the computation time, we could think of applying linear interpolation in the dimension of configurations as we did for the environmental state. However, the behaviour in this dimension is highly non-linear, which leads to an unacceptable effect on simulation accuracy. Therefore we use neural networks [5] instead of linear interpolation. Neural networks are able to learn how to approximate functions without knowing the function itself. The learning process is based on a training set consisting of points from the domain of the function as input and the corresponding function values as output. We use neural networks to learn the behaviour $f_{power}$ of the power block model, i.e., to determine the generated electrical energy for a given configuration and for some thermal power and ambient temperature values. There exists a wide range of different neural networks. We use a multilayer perceptron, a network consisting of different neuron layers. As shown in Fig. 4, information flows from the input layer, which receives the input, through a number of hidden layers to the output layer, which defines the net’s output. Fig. 5 shows the general scheme of the neurons in the layers. Each neuron weights its inputs and combines them to a net input on the basis of a transfer function (usually the sum of the weighted inputs). The activation function determines the activation output under consideration of a threshold value. During the learning process the number of hidden layers and the number of neurons in the layers do not change, but each neuron adapts the weights of its inputs.

[Figure: an input layer with four inputs, hidden layer 1, hidden layer 2, and an output layer with one output]

Fig. 4. Topology of an example multilayer neural network with two hidden layers

[Figure: the inputs x1, ..., xk with weights w1j, ..., wkj enter the transfer function, which yields the net input netj; the activation function ϕ with threshold θj yields the activation oj]

Fig. 5. Scheme of an artificial neuron j

We train the multilayer perceptron during the optimisation as follows: Before the network is trained we apply bilinear interpolation as before. The results are used as training set for the network: configuration and environmental state are used as input and the interpolated values as output. Experiments show that it is sufficient to train with the first 20 configurations. After the network has been trained we use the neural network approach instead of the bilinear interpolation. For each of the 20 configurations we need 16 simulations for the bilinear interpolation, leading to a total simulation time of about 2 hours. The remaining computation times (bilinear interpolation, training, etc.) are insignificant compared to the simulation times. The quality of the results is comparable to the results of the approach using only bilinear approximation.
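
The neuron scheme of Fig. 5 and the layered information flow of Fig. 4 can be sketched in a few lines. This is a minimal illustration and not the Flood implementation: subtracting the threshold from the net input before the activation, the linear output layer and all weights are assumptions made for the example; the hyperbolic tangent is the activation function named in Section 5.

```python
import math


def neuron_output(inputs, weights, threshold):
    """Weighted-sum transfer function followed by a tanh activation."""
    net_input = sum(w * x for w, x in zip(weights, inputs))
    # Assumption: the threshold shifts the net input before activation.
    return math.tanh(net_input - threshold)


def layer_output(inputs, weight_rows, thresholds):
    """Outputs of all neurons of one layer for the same input vector."""
    return [neuron_output(inputs, w, th) for w, th in zip(weight_rows, thresholds)]


def mlp_forward(inputs, layers):
    """layers: list of (weight_rows, thresholds); the last layer is linear."""
    activations = inputs
    for weight_rows, thresholds in layers[:-1]:
        activations = layer_output(activations, weight_rows, thresholds)
    weight_rows, thresholds = layers[-1]
    return [sum(w * a for w, a in zip(row, activations)) - th
            for row, th in zip(weight_rows, thresholds)]


# Hypothetical tiny network: 2 inputs, one hidden layer of 2 neurons, 1 output.
example_layers = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),  # hidden layer
    ([[1.0, 1.0]], [0.0]),                   # linear output layer
]
print(mlp_forward([0.5, -0.2], example_layers))
```

During training only the weights and thresholds in `layers` would change; the number of layers and neurons stays fixed, as stated in the text.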

5 Experimental Results

We use the Flood tool [7] to define, train, and use multilayer perceptrons with two hidden layers. The networks can be configured by fixing different network parameters. The configuration influences the network’s output, which in turn influences the quality of the optimisation result. To attain good results it is therefore of high importance to find an appropriate configuration. We determined a well-suited network configuration experimentally: For each parameter we trained networks with different parameter values and compared the quality of the outputs. As a measure for the quality of the training we used the relative error between the target and the predicted values for a test set with 12000 data points.

[Figure: two plots of the relative validation error [%] (left) and the maximum relative validation error [%] (right) over the number of neurons in the first hidden layer and in the second hidden layer]

Fig. 6. The relative (left) and maximum relative (right) error for multilayer perceptrons, all having 2 hidden layers but a different number of neurons in the hidden layers

During training [1] a network tries to adapt its free parameters (edge weights and neuron thresholds) such that the difference between the target values and the network output for a set of training points gets smaller. There are different training algorithms using different objective error functions. The best predictions were obtained by applying the conjugate gradient algorithm as training algorithm and the regularised Minkowski error with regularisation weight s = 0.1. As optimal parameters we determined the sum of the weighted inputs as the transfer function and the hyperbolic tangent as the activation function for the neurons. We fixed the above parameter values and varied the number of neurons in the hidden layers. Fig. 6 presents the relative validation errors. The best results were obtained with 15 neurons in the first and 17 neurons in the second hidden layer, with a relative error of 1.2%. With these settings of a neural network, an annual simulation of a CSP plant predicted the LCOE with a relative error of 0.67% and a maximum error of 1.3%. The effects of using a neural network on the optimisation should be determined in future work. It is expected that the approximated optimum determined by using a neural network is close to the optimum found by a time-consuming simulation-based optimisation.
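
The validation measures reported above can be illustrated with a short sketch. The text does not spell out how the relative error is aggregated over the test set, so the per-point definition |prediction − target| / |target| and the use of the mean over all points are assumptions; the data values are made up for the example.

```python
def relative_errors(targets, predictions):
    """Relative error |prediction - target| / |target| for each data point."""
    return [abs(p - t) / abs(t) for t, p in zip(targets, predictions)]


def summarise(targets, predictions):
    """Mean and maximum relative error, both in percent."""
    errors = relative_errors(targets, predictions)
    mean_percent = 100.0 * sum(errors) / len(errors)
    max_percent = 100.0 * max(errors)
    return mean_percent, max_percent


# Hypothetical target and predicted values.
mean_err, max_err = summarise([10.0, 20.0, 40.0], [11.0, 19.0, 40.0])
print(f"mean {mean_err:.2f}%, max {max_err:.2f}%")
```

Targets equal to zero would need special handling, which this sketch omits.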

6 Conclusion and Outlook

We described and applied an approach to optimise concentrating solar thermal power plants by determining economically optimal design parameters. The combination of simulation, genetic algorithms, bilinear interpolation and neural networks allowed us to reduce the calculation time of the optimisation procedure by around 90% compared to an approach without neural networks. Neural networks were used here to detect complex thermodynamic analogies between different plant designs. The achieved accuracy for prediction of the LCOE is a relative error of less than 1%. In this paper simple multilayer perceptrons were used. In future work there is room for improvements, e.g. using recurrent neural networks [8], which would need fewer parameters to optimise and, due to their feedback, are likely to suit the problem better.

References

1. Bishop, C.: Neural Networks for Pattern Recognition. Oxford University Press, Oxford (1995)
2. Bishop, J.M., Bushnell, M.J., Usher, A., Westland, S.: Genetic optimisation of neural network architectures for colour recipe prediction. In: International Joint Conference on Neural Networks and Genetic Algorithms, pp. 719–725 (1993)
3. Goldberg, D.E., et al.: Genetic algorithms in search, optimization, and machine learning. Addison-Wesley, Reading (1989)
4. Hancock, P., Smith, L.: GANNET: Genetic design of a neural net for face recognition. In: Schwefel, H.-P., Männer, R. (eds.) PPSN 1990. LNCS, vol. 496, pp. 292–296. Springer, Heidelberg (1991)
5. Hopfield, J.J.: Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences of the United States of America 79(8), 2554 (1982)
6. Liau, E., Schmitt-Landsiedel, D.: Automatic worst case pattern generation using neural networks & genetic algorithm for estimation of switching noise on power supply lines in CMOS circuits. In: European Test Workshop, IEEE, pp. 105–110 (2003)
7. Lopez, R.: Flood: An open source neural networks C++ library. Universitat Politècnica de Catalunya, Barcelona (2008), http://www.cimne.com/flood
8. Mandic, D.P., Chambers, J.A.: Recurrent neural networks for prediction: Learning algorithms, architectures and stability. Wiley, Chichester (2001)
9. Maniezzo, V.: Genetic evolution of the topology and weight distribution of neural networks. IEEE Transactions on Neural Networks 5(1) (1994)
10. Morin, G.: Techno-economic design optimization of solar thermal power plants. PhD thesis, Technische Universität Braunschweig (2010)
11. Schiffmann, W., Joost, M., Werner, R.: Application of genetic algorithms to the construction of topologies for multilayer perceptrons. In: International Joint Conference on Neural Networks and Genetic Algorithms, pp. 675–682 (1993)
12. Software Thermoflex: software developed and distributed by Thermoflow Inc., http://www.thermoflow.com/
13. Wall, M.: GAlib: A C++ library of genetic algorithm components (1996), http://lancet.mit.edu/ga/
14. Wittwer, C.: ColSim Simulation von Regelungssystemen in aktiven Solarthermischen Anlagen. PhD thesis, Universität Karlsruhe (1998)

Emergence of Attention Focus in a Biologically-Based Bidirectionally-Connected Hierarchical Network

Mohammad Saifullah and Rita Kovordányi

Department of Information and Computer Science, Linköping University, Linköping, Sweden
{mohammad.saifullah,rita.kovordanyi}@liu.se

Abstract. We present a computational model for visual processing where attentional focus emerges from fundamental mechanisms inherent to human vision. Through detailed analysis of activation development in the network we demonstrate how normal interaction between top-down and bottom-up processing and intrinsic mutual competition within processing units can give rise to attentional focus. The model includes both spatial and object-based attention, which are computed simultaneously and can mutually reinforce each other. We show how a non-salient location and a corresponding non-salient feature set that are at first weakly activated by visual input can be reinforced by top-down feedback signals (centrally controlled attention), and instigate a change in attentional focus to the weak object. One application of this model is to highlight a task-relevant object in a cluttered visual environment, even when this object is non-salient (non-conspicuous).

Keywords: Spatial attention, Object-based attention, Biased competition, Recurrent bidirectionally connected networks.

1 Introduction

Image processing techniques for object recognition can be made to learn relatively easily to recognize or classify a single object in the field of input. However, if more than one object is present simultaneously, it would be difficult to separate information about the target object from information about other objects in the field of input. In such cases, simpler techniques will fail to produce an output. Artificial neural techniques for image processing will on the other hand tend to produce an output that reflects a mixture of the objects, recognizing neither the target object, nor the clutter in the background. Neither solution is satisfactory. In contrast, the human visual system can function without problem in the face of potentially irrelevant objects cluttering up the visual field. Multiple visual inputs are disambiguated via top-down, knowledge-driven signals that reflect previous memory for an object, present task requirements, or goals and intentions. Hence, on the basis of top-down signals, the human visual system can focus on those objects that are relevant to the task, while inhibiting irrelevant information. As the relevancy of information changes from environment to environment and task to task, attention has to work as an adaptive filter.

A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part I, LNCS 6593, pp. 200–209, 2011. © Springer-Verlag Berlin Heidelberg 2011

In order to adapt to visual saliency as well as task requirements, attention is controlled by bottom-up sensory cues as well as by top-down, task-dependent information. In image processing, this is often divided into a two-stage process. At the first stage, a saliency map is created based on the visual conspicuity or intensity of the input. At the second stage, attention is focused on the most salient region, and an attempt is made to recognize the relevant object in the focused region. As a further development, top-down image processing approaches allow for direct, simultaneous top-down influence on the saliency map through a feedback loop, so that the position of the most salient region is determined not only on the basis of saliency, but also on whether the region contains features that (could) belong to the sought object. For example, Lee and colleagues use a feedback loop where object information is injected into the early stage of the spatial processing stream [1]. Other notable examples include [2, 3]. Truly mutual interaction between bottom-up and top-down driven computation cannot be achieved unless computation is naturally implemented in small, incremental update steps, for example, using artificial neural networks (ANN) where unit activations are calculated incrementally. In addition to this, biologically-based ANN-approaches capitalize on the fact that (1) the human visual system is intrinsically bidirectionally connected, which allows for mutual interaction between bottom-up and top-down information, and (2) there is inherent competition within groups of processing units (corresponding to ANN-layers or groups of artificial units), which indirectly creates an inhibitory effect where unwanted information outside the region of interest is suppressed [4-6].

Among the early biologically-based ANN-approaches is MAGIC, developed by Behrmann and colleagues, which uses bidirectional connections to model grouping of features into objects within the ventral what-pathway (cf. Fig. 1). MAGIC models feature-based object-driven selection, but does not combine this with selection of a spatial region of interest [7]. Hence, it is not possible to focus on a particular location in the image. Phaf and coworkers [8] present SLAM, which includes spatial attention and models the earliest stages of visual processing. Subsequently, these biologically-based models were extended to eye movement control. For example, Hamker uses top-down connections in a recurrent (bidirectional) network to mediate task information and thereby influence the control of eye movements [9]. Likewise, Tsotsos and coworkers present a partially recurrent network for eye movement control [10]. Sun and Fisher present a system where object-based and space-based attention work together to control eye movements [11].

2 Our Approach

None of the above models utilize the potential of a fully recurrent network, where top-down feedback can naturally permeate bottom-up-driven computation. In the approach we have chosen, attentional focus arises as an emergent side-effect of normal visual computation. No extra mechanisms (for example, special feedback loops) are required to achieve an attentional focus with an inhibitory fringe, as we use the normal visual feedback connections and the intrinsic competition mechanisms that are present in the visual system and are required for normal learning and processing.

Similar ideas of integrated or biased competition that requires no extra mechanisms were presented by Duncan and O’Reilly [4, 5]. We use a network (Fig. 1) where we model the interaction between the dorsal and the ventral pathway in the primate visual system using a bidirectionally-connected, hierarchically-organized artificial neural network. We use this network to study the interaction between top-down and bottom-up information flow during the emergence of attention focus. The network architecture is based on O’Reilly’s model of attention [5], where we re-modeled the dorsal pathway for spatial attention, allowed for object-based top-down information flow along the ventral pathway, and let this information interact with the normal bottom-up flow of visual information. In contrast to many image processing approaches, the proposed model does not consider spatial attention and object recognition as two separate processing stages; rather, they work in parallel and attention emerges as a natural consequence of interactions between top-down and bottom-up influences.

[Figure: schematic with the areas V1, V2, V4 and PFC, a saliency map and spatial attention]

Fig. 1. A schematic diagram showing the interactions in the ventral what-pathway and the dorsal where-pathway, as well as cross-communication between the pathways

As a result of mutual interaction between top-down and bottom-up signals, the region of interest and the object in focus emerge in parallel, mutually reinforcing each other. The bottom-up saliency of visual cues in the input image is computed in terms of the strongest activated area in the early stages of processing (V1, V2). The input-driven feedforward flow of information across the hierarchy of the network is modulated by top-down task information at each incremental update of all activations across the network. In this way, the information about which object should be focused on modulates the bottom-up driven saliency map and helps to activate the most relevant region (in the where-pathway) and the relevant features of the task-specific object (in the what-pathway). The interaction between top-down and bottom-up signals gives rise to a focus which activates the most salient and task-relevant location, as well as the corresponding object and its features (constituent line segments) at that location in the input image.

2.1 A Closer Look at the Network Used in the Simulations

The network used in our simulations (Fig. 2) is a combination of two sub-networks: ‘what’ and ‘where’. The ‘what’ network models the ventral or ‘what’ pathway and is composed of five layers: Input, V1, V2, V4 and Object_cat, with layer sizes 60x60, 60x30, 33x33, 14x14 and 5x1 respectively. The units in layers V1 and V2 were divided into groups of 2x4 and 8x8 units respectively. Each unit within the same group in V1 was looking at the same spatial part of the image, that is, all units within a group had the same receptive field. Similarly, all units within the same group in V2 received input from the same four groups in V1. These sending groups in V1 were adjacent to each other and covered a contiguous patch of the visual image. Object_cat is an output layer and its size depends on the number of categories used for the simulations. Layers V2, V4 and Object_cat are bidirectionally connected in a hierarchy, while Input, V1 and V2 are connected in a bottom-up fashion. The ‘where’ network, which simulates the functionality of the dorsal pathway and mediates spatial information, is a two-layer network. The network layers, Saliency_map and Attention, are bidirectionally connected to each other, using all-to-all connections. The Saliency_map layer identifies the salient locations within the input and the Attention layer selects the most salient location from these. For the simulations, we combined the object recognition and spatial attention networks by bidirectionally connecting the Saliency_map and Attention layers to the V1 and V2 layers respectively.

Fig. 2. Network used in the simulations. All connections were bidirectional except for the connections going from Input to V1, and from V1 to V2 and Saliency_map. The connections are displayed schematically: connections were either all-to-all or arranged in a tiled fashion, where each connection mediated information from a sending unit group to a receiving group.

The network was developed in Emergent [12], using the biologically plausible algorithm Leabra [5]. Each unit of the network had a sigmoid-like activation function:

$$y_j = \frac{\gamma [V_m - \Theta]_+}{\gamma [V_m - \Theta]_+ + 1}, \qquad [z]_+ = \begin{cases} z & \text{if } z \ge 0 \\ 0 & \text{if } z < 0 \end{cases} \qquad (1)$$

where $\gamma$ = gain, $V_m$ = membrane potential and $\Theta$ = firing threshold.

Learning was based on a combination of Conditional Principal Component Analysis (CPCA), which is a Hebbian learning algorithm, and Contrastive Hebbian learning (CHL), which is a biologically-based alternative to back propagation of error, applicable to bidirectional networks [5]:

CPCA: $\Delta_{hebb} = \varepsilon y_j (x_i - w_{ij})$, where $\varepsilon$ = learning rate, $x_i$ = activation of sending unit $i$, $y_j$ = activation of receiving unit $j$, and $w_{ij} \in [0, 1]$ = weight from unit $i$ to unit $j$

CHL: $\Delta_{err} = \varepsilon (x_i^+ y_j^+ - x_i^- y_j^-)$, where $x^-, y^-$ = activations when only the input is clamped and $x^+, y^+$ = activations when the output is also clamped

L_mix: $\Delta w_{ij} = \varepsilon [c_{hebb} \Delta_{hebb} + (1 - c_{hebb}) \Delta_{err}]$, where $c_{hebb}$ = proportion of Hebbian learning $\qquad (2)$
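
The activation function of Eq. (1) and the mixed learning rule of Eq. (2) can be sketched for a single connection as follows. This is an illustration only and not the Emergent implementation: the learning rate is applied once in the mixing step, the Hebbian term uses the plus-phase activations, and the weight is clipped to [0, 1]; these three choices, the function names and any parameter values passed in are assumptions.

```python
def rectify(z):
    """[z]+ : pass non-negative values, clip negative values to zero."""
    return z if z >= 0.0 else 0.0


def activation(v_m, theta, gain):
    """Eq. (1): y = gain*[Vm - Theta]+ / (gain*[Vm - Theta]+ + 1)."""
    x = gain * rectify(v_m - theta)
    return x / (x + 1.0)


def hebbian_term(x_i, y_j, w_ij):
    """CPCA term y_j * (x_i - w_ij), without the learning rate."""
    return y_j * (x_i - w_ij)


def error_term(x_plus, y_plus, x_minus, y_minus):
    """CHL term: plus-phase product minus minus-phase product."""
    return x_plus * y_plus - x_minus * y_minus


def weight_update(w_ij, x_plus, y_plus, x_minus, y_minus, lrate, c_hebb):
    """Mix Hebbian and error-driven terms as in Eq. (2) and apply them."""
    hebb = hebbian_term(x_plus, y_plus, w_ij)  # assumption: plus-phase values
    err = error_term(x_plus, y_plus, x_minus, y_minus)
    dw = lrate * (c_hebb * hebb + (1.0 - c_hebb) * err)
    return min(1.0, max(0.0, w_ij + dw))  # assumption: hard clipping to [0, 1]
```

Below the firing threshold the unit stays silent, and above it the output rises towards but never reaches 1, which is the sigmoid-like behaviour referred to in the text.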

We used a relatively low amount of Hebbian learning, 0.1% of the total L_mix (Eq. 2), for all connections in the two networks except for the connections from Input to V1. For Input to V1, Gabor filters were used to extract oriented bar-like features for four orientations. These values for the Hebbian learning, as well as other parameters, are based on previous work [13, 14].

2.2 Data Set

For this study we took five object categories from the Caltech-101 data set [15]. For each object category, three images were selected. Each object image was converted to gray scale before detecting the edges in the image. Each image was resized to 30 x 30 pixels (Fig. 3). The size of the input to the network is 60 x 60. This implies that the object size is approximately one fourth of the network input size, so that each object could appear in one of four locations in the input (Fig. 4). During training each object was presented to the network at all four locations, one location at a time, so that the network could learn the appearance of the object in a position-invariant manner at all locations.
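
The presentation of each 30 x 30 object at four locations of the 60 x 60 input can be sketched as follows. The text does not state the exact coordinates of the four locations, so placing them in the four quadrants without margins is an assumption, as are the function names.

```python
INPUT_SIZE = 60
OBJECT_SIZE = 30

# Assumed top-left corners (row, column) of the four object locations.
LOCATIONS = [(0, 0), (0, 30), (30, 0), (30, 30)]


def place_object(patch, location):
    """Return a 60x60 input with the 30x30 patch copied to one quadrant."""
    row0, col0 = location
    image = [[0.0] * INPUT_SIZE for _ in range(INPUT_SIZE)]
    for r in range(OBJECT_SIZE):
        for c in range(OBJECT_SIZE):
            image[row0 + r][col0 + c] = patch[r][c]
    return image


def training_inputs(patch):
    """One training input per location, so each object is seen at all four."""
    return [place_object(patch, loc) for loc in LOCATIONS]
```

A test image with several objects could be built in the same way by copying different patches into different quadrants of one image.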

Fig. 3. Top row: Examples of the five object categories that were selected for training. Bottom row: Edge representation of the images.

Fig. 4. Top: A few input images as illustration of the four positions where data were presented during training of the network. Object size was almost one fourth of the input size so that the objects could appear at one of four locations within the input image. Bottom: sample test images containing multiple objects in various locations.

2.3 Procedure

First of all, we trained the network on all five object categories. After training, we verified by testing that the network had learnt all the objects at all four locations. After that, we evaluated the network in three steps. In the first step, to get a baseline for how the network performs on several simultaneous objects, we fed the network test images containing more than one object at different locations. The network response was quite arbitrary and most of the time resulted in an error. In the next step, we connected the ‘where’ network with the object recognition network to evaluate how spatial attention interacted with the object recognition pathway to facilitate object recognition when more than one object was present in the input image. When input containing two objects was presented to the network, parallel processing of the input began along two pathways. The object recognition pathway extracted features from the input, and processing of the features moved along its hierarchy.

In the meantime, the Saliency_map layer in the dorsal pathway generated a saliency map of the input, and the next layer in the hierarchy, the Attention layer, selected one of the blob-shaped activation patches that were formed in the saliency map due to k-Winners-Take-All (kWTA) competition within the layer, which allowed only 50% of the units to become activated. As the Attention layer was connected with the V4 layer in the ‘what’ network, its interaction with the V4 layer reinforced, via feedback connections, the activation of the corresponding V2 units at the location that shared connections with the active unit of the Attention layer. This interaction caused the V2 units at a particular location, which represent a particular object, to become more active, thus reducing the activation of all other units at all other locations due to inhibition within the V2 layer. Consequently, the higher layers of the network received feedback from the one object that had higher activation due to the spatial attention mechanism. The network correctly recognized the object that received the focus of spatial attention.

In the third step, we investigated how the interaction between top-down, goal-directed effects and bottom-up, image-based effects leads to focusing of attention on a specific object (Fig. 5). For this purpose the role of the Object_cat layer was changed from output to input, and the Attention layer was set as the output layer, in order to observe at what location attention focuses as a result of the interaction between top-down and bottom-up effects. For these simulations, inputs with three and four objects at a time, at different locations within the input, were presented to the network. The Saliency_map layer indicated the salient regions in the input, and the Attention layer selected the most salient region in Saliency_map.

In the meantime, top-down effects along the object recognition pathway strengthened the relevant features of the specific object category; category information was fed in at Object_cat in a top-down fashion and interacted with the input-driven, bottom-up activations along the same pathway. This resulted in the activation of the category-specific units at all four locations in the V2 layer. However, the location where the object of the specific category was presented became more active compared to the other locations. Local inhibition also played its role and inhibited the activation of less active units, so that most of the remaining active units belonged to the true location of the specific object category. As the V2 layer is bidirectionally connected with the Attention layer, it interacts with the Attention layer, and a competition starts between the top-down effects through the V2 layer and the bottom-up effects through Saliency_map. The active unit in the Attention layer indicates the location on which the network's attention is focused.
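
The kWTA competition mentioned in the procedure can be illustrated with a deliberately simplified sketch. Leabra's actual kWTA function computes a shared inhibition level for the layer rather than zeroing units; the hard cut-off below, the function name and the example values are simplifying assumptions.

```python
def kwta(activations, k):
    """Keep the k most active units and silence the rest (simplified kWTA)."""
    if k <= 0:
        return [0.0] * len(activations)
    ranked = sorted(range(len(activations)),
                    key=lambda i: activations[i], reverse=True)
    winners = set(ranked[:k])
    return [a if i in winners else 0.0 for i, a in enumerate(activations)]


# Hypothetical layer of four units with 50% of the units allowed to be active.
layer = [0.2, 0.9, 0.5, 0.1]
print(kwta(layer, k=len(layer) // 2))
```

With 50% of the units allowed to win, as in the Saliency_map layer, `k` is half the layer size.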

3 Results and Analysis

3.1 Multiple Objects without Any Attention

We fed the trained network with two, three and four objects at a time to observe the network behavior. The network output was arbitrary, as it could not handle multiple objects. The multiple objects, at multiple locations, activated units representing their features. But, due to kWTA, only the most active units across all objects remained active. The representation at the higher layers therefore contained features belonging to all objects fed at the input, and it was down to sheer chance which category the network decided on at the output. This clearly exposes the network’s inability to deal with multiple objects simultaneously.

Fig. 5. Consecutive snapshots of activations in the various network layers (each layer is made up of a matrix of units, and the activation values of these matrices are shown here). The recorded changes in activation for different processing cycles illustrate how task-based focus of attention emerges as a result of top-down and bottom-up interactions. For each graph, in the order from left to right, the columns represent: number of the processing cycle (how far computation of activation has gone), activations in the Object_cat layer, Input layer, Saliency_map layer, and Attention layer of the network. Yellow (light) colors denote high activation values, red (dark) colors low activation. Gray (neutral) color means no activation.

3.2 Multiple Objects with Bottom-Up Attentional Effects

The bottom-up effect along the dorsal pathway removed the arbitrary behavior of the network in the presence of multiple objects in the input. The network began to recognize the most salient object in the input field. This is because the Saliency_map layer generates a saliency map on the basis of input from V1. Due to kWTA, the activations of the strongest object survive, while the rest are inhibited. The mutual interaction between the Saliency_map and Attention layers activates the location unit of this object. The active unit in the Attention layer, through interaction, enhances the activations of the units at the location of the strongest object in the input. This process can be interpreted as attentional focus on a particular location due to the dorsal pathway. If a proper inhibition-of-return mechanism were implemented, the focus of attention would next move to the second strongest object, and so on. But if the objective is to look for a specific object, top-down effects are required.

3.3 Multiple Objects with Both Bottom-Up as Well as Top-Down Attentional Effect

The interaction between the bottom-up and top-down effects, as well as between these two effects and the feedforward flow of the image information along the ventral pathway, leads to a more controlled network behavior. For example, consider a single case (Fig. 4): an input containing two different objects, a cup and a crayfish, is presented at the input layer of the network. The activations produced by the cup are stronger than those of the crayfish. Therefore, the saliency map alone would select the cup as the object to attend to. But the Object_cat layer, which has changed its role from output to input layer, is clamped with the pattern that represents crayfish. It works as a top-down effect that biases the network towards searching for the crayfish. From Fig. 4 it is evident that initially the Saliency_map layer builds a saliency map which shows the most salient object, i.e. the cup (cycles 11-14). The Attention layer, through interaction with the Saliency_map layer, activates the position of the cup, in this case the top right unit in the Attention layer (cycles 19, 20). Meanwhile, top-down effects interact with the unit activations in V4 and V2 and bias the layer activations towards crayfish-specific features. This interaction strengthens the activity at the location where the crayfish was actually presented, which in turn activates the unit representing the crayfish location in the Attention layer. At this point, a competition for the winning location takes place between bottom-up and top-down effects (cycles 21-42). In this case, the competition gives way to attention focus on the crayfish (cycles 42-113).
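
The outcome of the competition described above can be illustrated with a toy calculation. In the network the winner emerges from iterative settling over many processing cycles; the static weighted sum below, the function name and all numbers are assumptions made purely to show how a top-down bias can override bottom-up saliency.

```python
def attended_location(saliency, top_down, top_down_weight):
    """Index of the winning location after adding a weighted top-down bias."""
    combined = [s + top_down_weight * b for s, b in zip(saliency, top_down)]
    return max(range(len(combined)), key=lambda i: combined[i])


# Hypothetical values for four locations: the cup (index 1) is more salient
# than the crayfish (index 3), but the task bias points to the crayfish.
saliency = [0.3, 0.8, 0.2, 0.5]
crayfish_bias = [0.0, 0.0, 0.0, 1.0]

print(attended_location(saliency, crayfish_bias, top_down_weight=0.0))
print(attended_location(saliency, crayfish_bias, top_down_weight=0.5))
```

Without the bias the more salient location wins; with a sufficiently strong bias the winner switches to the task-relevant location, which mirrors the shift from cup to crayfish described in the text.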

4 Conclusions

We have presented a model in which attentional focus arises as an emergent side-effect of mechanisms that are inherent to visual computation, namely the interaction between top-down expectations and bottom-up visual information, and intrinsic lateral competition within processing modules (layers or groups of artificial units). In this model, attentional focus arises from the mutual interaction between the top-down and bottom-up influences. We have shown how these influences give rise to a step-by-step development of activations in the network that reflect attention focused on a specific location and at the same time on a specific object. We have demonstrated how top-down influence can override bottom-up saliency in some cases, so that an object that is at first only weakly activated by visual cues can be focused on intentionally (top-down), so that its activation is enhanced and it finally emerges as the winner (gains focus). While top-down effects ensure that only task-relevant objects and features get activated, their interaction with bottom-up effects helps delineate the object's features and ignore the clutter within the image.

References 1. Lee, S., Kim, K., Kim, J., Kim, M., Yoo, H.: Familiarity based unified visual attention model for fast and robust object recognition. Pattern Recognition 43, 1116–1128 (2010) 2. Poggio, T., Serre, T., Tan, C., Chikkerur, S.: An integrated model of visual attention using shape-based features. MIT-CSAIL-TR-2009-029 (2009)


3. Navalpakkam, V., Itti, L.: Modeling the influence of task on attention. Vision Research 45, 205–231 (2005) 4. Duncan, J.: Converging levels of analysis in the cognitive neuroscience of visual attention. Philosophical Transactions of the Royal Society B: Biological Sciences 353, 1307–1317 (1998) 5. O’Reilly, R.C., Munakata, Y.: Computational Explorations in Cognitive Neuroscience: Understanding the Mind by Simulating the Brain. MIT Press, Cambridge (2000) 6. O’Reilly, R.C.: Six principles for biologically based computational models of cortical cognition. Trends in Cognitive Sciences 2, 455–462 (1998) 7. Behrmann, M., Zemel, R., Mozer, M.: Object-Based Attention and Occlusion: Evidence from Normal Participants and a Computational Model. Journal of Experimental Psychology: Human Perception and Performance 24, 1011–1036 (1998) 8. Phaf, R., Van der Heijden, A., Hudson, P.: SLAM: A connectionist model for attention in visual selection tasks. Cognitive Psychology 22, 273–341 (1990) 9. Hamker, F.H.: The emergence of attention by population-based inference and its role in distributed processing and cognitive control of vision. Computer Vision and Image Understanding 100, 64–106 (2005) 10. Rothenstein, A.L., Rodríguez-Sánchez, A.J., Simine, E., Tsotsos, J.K.: Visual feature binding within the selective tuning attention framework. International Journal of Pattern Recognition and Artificial Intelligence 22, 861 (2008) 11. Sun, Y., Fisher, R., Wang, F., Gomes, H.: A computer vision model for visual-objectbased attention and eye movements. Computer Vision and Image Understanding 112, 126–142 (2008) 12. Aisa, B., Mingus, B., O’Reilly, R.: The emergent neural modeling system. Neural Networks: The Official Journal of the International Neural Network Society 21, 1146–1152 (2008) 13. Kovordányi, R., Roy, C.: Cyclone track forecasting based on satellite images using artificial neural networks. ISPRS Journal of Photogrammetry and Remote Sensing 64, 513–521 (2009) 14. 
Kovordanyi, R., Saifullah, M., Roy, C.: Local feature extraction — What receptive field size should be used? In: Presented at the International Conference on Image Processing, Computer Vision and Pattern Recognition, Las Vegas, USA (2009) 15. Fei-Fei, L., Fergus, R., Perona, P.: Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. In: IEEE. CVPR 2004. Workshop on Generative-Model Based Vision (2004)

Visualizing Multidimensional Data through Multilayer Perceptron Maps

Antonio Neme1,2 and Antonio Nido2

1 Adaptive Informatics Research Centre, Aalto University, Helsinki, Finland [email protected]
2 Complex Systems Group, Universidad Autonoma de la Ciudad de Mexico, Mexico

Abstract. Visualization of high-dimensional data is a major task in data mining. The main idea of visualization is to map data from the high-dimensional space onto positions in a low-dimensional space. Of all possible mappings, only those that lead to maps that are good approximations of the data distribution observed in the high-dimensional space are of interest. Here, we present a mapping scheme based on multilayer perceptrons that forms a two-dimensional representation of high-dimensional data. The core idea is that the system maps every vector to a position in the two-dimensional space. We then measure how closely this map resembles the distribution in the original high-dimensional space, which leads to an error measure. Based on this error, we apply reinforcement learning to multilayer perceptrons to find good maps. We present the description of the model as well as some results on well-known benchmarks. We conclude that the multilayer perceptron is a good tool to visualize high-dimensional data.

Keywords: data visualization, reinforcement learning, multilayer perceptrons.

1 Introduction

Visualizing multidimensional data in a low-dimensional space, such as a computer screen, is a high-effort task. Several algorithms have been proposed in recent years, and dozens of groups are still working on the problem. The relevance of data visualization becomes clear when considering data sets from, say, molecular biology, in which the features that define the data space number in the hundreds [1], or natural language processing, in which the relevant features tend to exceed the hundreds. A good visualization tool may enhance the understanding of data, or may even lead to the discovery of meaningful patterns.

In general, every measured feature defines a dimension. As it is impossible to visualize more than two dimensions (three if projections are considered) on a computer, it is necessary to resort to either feature selection or feature transformation [2]. The first consists of trying to find a pair of dimensions that explains most of the data distribution in the high-dimensional input space. This is, of course, a very limited scheme. Feature transformation by a nonlinear mapping instead refers to finding a nonlinear projection in which the data distribution observed in the high-dimensional (input) space is maintained in the low-dimensional (output) space.

Here, we present a visualization system based on multilayer perceptrons (MLP). The MLP maps high-dimensional data to a two-dimensional space, and such maps present data distributions similar to those observed in the high-dimensional space. In Section 2 we present a general overview of data visualization and a brief description of some of the best-known methods. In Section 3 we present the high-dimensional data visualization scheme based on multilayer perceptrons, and in Section 4 we evaluate the performance of the proposed model by comparing it with several visualization models. Finally, we present some conclusions in Section 5.

A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part I, LNCS 6593, pp. 210–219, 2011. © Springer-Verlag Berlin Heidelberg 2011

2 Multidimensional Data Visualization

Several methods have been proposed to visualize high-dimensional data. Among the most common is principal component analysis (PCA), a linear transformation that projects data onto a new space defined by the eigenvectors corresponding to the largest eigenvalues of the covariance matrix of the data. This method maps data in such a way that low-order (first- and second-order) statistics are preserved, in the mean square error sense [3]. PCA is of very limited use in high-dimensional data visualization, as it cannot approximate the data distribution when the data is not linearly separable.

In the case of non-linear projections, the self-organizing map (SOM) is a major tool, as it is able to capture high-order statistics, and not only first- or second-order statistics as in PCA [4,5]. One of the major properties of the SOM is its ability to preserve in the output map the topographical relations present in the input data. This is achieved by transforming an incoming signal pattern of arbitrary dimension into a low-dimensional (usually two-dimensional) discrete map [4]. One major concern when visualizing data with the SOM is that there is no error measure that guides the learning. It also introduces distortions, as underrepresented regions in the high-dimensional space tend to be overrepresented in the low-dimensional map [4].

In general, when unsupervised models are applied for visualization, no error measure is available during the training or mapping process. That is, the model learns the relations present in the data distribution, but there is no feedback with which to modify the resulting maps, although some variants of unsupervised learning can include clues that modify the map formation [6].

A newer group of visualization tools is known as manifold learning, in which a model tries to learn the manifold in which the high-dimensional data is embedded.
One of the best-known manifold learning algorithms is ISOMAP, in which a graph is constructed so that the points in the high-dimensional space define nodes and edges connect nearby nodes. It makes use of the geodesic distance induced by the defined neighborhood graph [7]. One of the main disadvantages of this method is that, although very good for manifold learning, it is not very good for data visualization [8].

A recent algorithm proposed in [9] is able to find a low-dimensional distribution of high-dimensional data by minimizing an error measure. This algorithm finds the location of the data by optimizing an error function with respect to each axis in a low-dimensional space, by means of a standard conjugate gradient algorithm. Although it is able to find good maps, it is unable to map new data, that is, data that were not part of the training set.

Roughly speaking, there are two kinds of visualization tools, distinguished by their ability to include in the low-dimensional representation data that were not previously presented. The first kind comprises models that have to be retrained, or in which the positions of all vectors have to be recalculated. The second kind comprises models that adapt the medium onto which the low-dimensional representations are mapped. Such a model need not be retrained: only the new input vector is presented, and its position is calculated without modifying the positions of the other vectors in the low-dimensional space. Multidimensional scaling, PCA, the neighborhood retrieval visualizer (NeRV) and ISOMAP are among the visualization tools of the first kind. Among the second kind, SOM is one of the best-known methods. Our model also belongs to the second kind, as it does not need to be retrained to allow the visualization of new data. There is thus a trade-off between learning the data with high accuracy and learning the relations present in the data.
SOM and the proposal introduced in this contribution belong to the latter case, whereas the other mentioned models belong to the former. In general, the statistical relations in high-dimensional data are of higher order, which means that a non-linear projection is needed to obtain a good representation in a low-dimensional space. In this contribution, we propose the use of multilayer perceptrons as models to achieve such non-linear projections from high-dimensional spaces to a low-dimensional one. As MLPs are universal approximators [10], we applied them to learn the relations in the data. For a more detailed description of MLPs, several works may be consulted, for example [11].

3 Data Mapping through Multilayer Perceptrons

HDVMLP, the acronym for high-dimensional data visualization through multilayer perceptrons, is a visualization tool that maps from high-dimensional spaces to two-dimensional spaces. The map is obtained by a multilayer perceptron (MLP). The MLP has as many inputs as there are dimensions in the high-dimensional space and two outputs, one for each axis of the two-dimensional map space. The MLP is presented with the training data and generates a two-dimensional representation of the data. We now proceed to discuss the details of this map formation model.


Fig. 1. a) Visualization of high-dimensional data through a mapping obtained by MLP. b) Some examples showing the error for strict and weak neighborhood cases. Ni (r) is the list of the closest r elements to data i in the k−dimensional space. ni (r) is the list of the closest r elements to data i in the two-dimensional space. Es and Ew correspond to the strict and weak neighborhood errors.

An MLP is evaluated only when all data vectors have been mapped. What is measured is the map, and it can only be evaluated once it is complete. Let $D$ be the distance matrix of the high-dimensional data. For each input vector $i$, the $r$ vectors closest to it (besides itself) form the neighborhood $N_i(r)$. From the two-dimensional distribution obtained by the MLP, the distance matrix $d$ is calculated. For each vector $i$, the $r$ vectors closest to it (besides itself) are selected, forming $n_i(r)$. To measure the goodness of the map, we compare $N_i(r)$ with $n_i(r)$ for all vectors. We compare neighborhoods instead of distances between pairs of vectors, since we are not interested in scaling distances, as is done in models related to multidimensional scaling. The $j$-th element of $N_i(r)$ or of $n_i(r)$ is denoted $N_i(r, j)$ and $n_i(r, j)$, respectively.

There are two possible scenarios for evaluating maps. In the first one, referred to as strict mapping, each element of $N_i(r)$ must be in the same position in $n_i(r)$. As $N_i(r)$ and $n_i(r)$ are ordered according to distance, the first element is the closest one, the second element is the second closest, and so on. Then, for each vector $i$, the vector in position $j$ of $N_i(r)$ should also be in position $j$ of $n_i(r)$:

$$E_s = \frac{1}{Mr} \sum_{i=1}^{M} \sum_{j=1}^{r} \delta(N_i(r, j), n_i(r, j)) \qquad (1)$$

with $\delta(a, b) = 0$ if $a = b$ and $\delta(a, b) = 1$ otherwise. $M$ is the number of vectors, so the term $1/(Mr)$ is a normalizing factor. The strict error $E_s$ is 0 only when the map is order-preserving for all vectors and neighbors. It is 1 when all neighbor vectors are mapped in a different order. Note that the elements of $N_i(r)$ may be the same as the elements of $n_i(r)$, but if their order is different there is a penalty (see Fig. 1-b).

Relaxing this measure leads to the second scenario, a weak-neighborhood measure related to those proposed in [9]. Following their proposal, we define a weak neighborhood error; Fig. 1-b gives a schematic explanation of it. Note that in [9] the neighborhood is defined in terms of the distance between vectors. In that scheme, the number of vectors in the neighborhood may differ between the high-dimensional space and the low-dimensional space. By defining the neighborhood in terms of distance, they obtain two types of errors for the weak neighborhood, namely precision and recall. Precision is defined in terms of vectors that are not neighbors in the high-dimensional space but are neighbors in the low-dimensional space, and recall is defined in terms of vectors that, although neighbors in the high-dimensional space, are not part of the vector's neighborhood in the low-dimensional space. In our proposal, as the number of neighbors is fixed (only the $r$ closest vectors are neighbors), there is no difference between precision and recall. This is valuable because the user does not need to decide which of the two errors is worse; our proposal weighs all errors equally. Thus, if a vector is in $n_i(r)$ and also in $N_i(r)$, regardless of its position in it, there is no penalty for that vector:

$$E_w = \frac{1}{Mr} \sum_{i=1}^{M} \sum_{j=1}^{r} \Delta(n_i(r, j), N_i(r)) \qquad (2)$$

where $\Delta(i, L) = 0$ if $i \in L$ and $\Delta(i, L) = 1$ otherwise. There are, then, two ways to evaluate a map.

The MLP has one input for each dimension of the high-dimensional input space, and two outputs, one for each axis of the two-dimensional output space. Each input vector $i$ is presented to the MLP and an output is obtained. The output $(X_i, Y_i)$ is the position assigned to $i$ in the low-dimensional space. A map is formed when all input vectors have been mapped, so the map $(X, Y)$ is subject to evaluation as defined in eqs. 1 and 2. This training scheme is based on batch learning, in which weight adaptation (learning) is possible only when all input vectors have been presented to the MLP. This is, of course, an example of reinforcement learning [12], in which the system can be evaluated only when it has completed its task.

Once a map $(X, Y)$ has been obtained and evaluated, it is possible to apply a learning methodology to the MLP. We applied genetic algorithms with elitism to train the MLP. Genetic algorithms are heuristic search methods that have been useful in several contexts [13,12], even though they have well-documented constraints [13]. In our model, each MLP in a population has a fixed number of hidden units and the same activation function for its neurons, which reduces the search space, as it is not necessary to search for the network topology. Each MLP $a$ is coded as a vector:


$$V_a = [w_{1,1}, w_{1,2}, \ldots, w_{1,hu}, w_{2,1}, w_{2,2}, \ldots, w_{2,hu}, \ldots, w_{K,1}, w_{K,2}, \ldots, w_{K,hu}, v_1^x, v_1^y, \ldots, v_{hu}^x, v_{hu}^y]$$

in which $w_{i,j}$ refers to the weight that connects input $i$ to hidden neuron $j$, $K$ is the number of variables (the dimension of the high-dimensional space) and $hu$ is the number of units in the hidden layer. $v_i^x$ is the weight that connects hidden unit $i$ to output unit $x$, whereas $v_i^y$ is the weight that connects hidden unit $i$ to output unit $y$. The number of weights is then $hu \times (K + 2)$, as there are two output units. All units are activated by a sigmoid function of the form $f(x) = 1/(1 + e^{-\alpha x})$ with $\alpha = 2$, and there is only one hidden layer. The general algorithm is sketched in Fig. 2.
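The encoding above can be made concrete with a short sketch of how a flat weight vector might be decoded into a forward pass. It assumes the layout described in the text (input-to-hidden weights stored input by input, followed by interleaved x and y output weights), no bias terms (consistent with the weight count of hu × (K + 2)), and a clamp on the sigmoid argument added only to avoid numerical overflow. The function names are hypothetical.

```python
import math


def sigmoid(x, alpha=2.0):
    """Logistic activation f(x) = 1 / (1 + exp(-alpha * x)), with alpha = 2."""
    z = max(-500.0, min(500.0, alpha * x))  # clamp: guards math.exp against overflow
    return 1.0 / (1.0 + math.exp(-z))


def map_point(weights, vector, hidden_units):
    """Map one K-dimensional vector to a position (X, Y) from a flat weight vector."""
    k = len(vector)
    assert len(weights) == hidden_units * (k + 2), "expected hu * (K + 2) weights"
    # Input-to-hidden weights: the weight from input i to hidden unit j
    # is stored at index i * hidden_units + j.
    hidden = []
    for j in range(hidden_units):
        net = sum(weights[i * hidden_units + j] * vector[i] for i in range(k))
        hidden.append(sigmoid(net))
    # Hidden-to-output weights follow as interleaved pairs (x weight, y weight).
    offset = k * hidden_units
    net_x = sum(weights[offset + 2 * j] * hidden[j] for j in range(hidden_units))
    net_y = sum(weights[offset + 2 * j + 1] * hidden[j] for j in range(hidden_units))
    return sigmoid(net_x), sigmoid(net_y)


# K = 2 inputs and hu = 3 hidden units give 3 * (2 + 2) = 12 weights.
print(map_point([0.0] * 12, [1.0, 2.0], 3))  # (0.5, 0.5)
```

Because the output units are also sigmoidal, every mapped position falls inside the open unit square, so the map lives in (0, 1) × (0, 1) under these assumptions.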

Fig. 2. The algorithm to form two-dimensional maps by multilayer perceptrons

The objective is to find a weight vector such that the error measure of the map obtained by an MLP is minimal, and thus the quality of the map is maximal. The foundation for this comes from neural network theory: if there is a function $f : \mathbb{R}^K \to \mathbb{R}^2$ such that the neighborhood error is minimal, then the multilayer perceptron is able to approximate $f$ [11].
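The error that this objective refers to is the strict or weak neighborhood error of eqs. 1 and 2. A minimal sketch of both measures follows, assuming Euclidean distance and ties broken by vector index (neither is stated in the text); the function names are hypothetical and r must be smaller than the number of vectors.

```python
def nearest(points, i, r):
    """Indices of the r vectors closest to points[i], nearest first, excluding i."""
    def sq_dist(j):
        return sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
    others = [j for j in range(len(points)) if j != i]
    return sorted(others, key=sq_dist)[:r]  # stable sort: ties keep index order


def neighborhood_errors(high, low, r):
    """Return (E_s, E_w) for the data `high` and its two-dimensional map `low`."""
    m = len(high)
    strict = weak = 0
    for i in range(m):
        big_n = nearest(high, i, r)    # N_i(r), ordered by distance
        small_n = nearest(low, i, r)   # n_i(r), ordered by distance
        strict += sum(1 for a, b in zip(big_n, small_n) if a != b)  # delta of eq. 1
        weak += sum(1 for j in small_n if j not in big_n)           # Delta of eq. 2
    return strict / (m * r), weak / (m * r)


high = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0), (7.0, 0.0, 0.0)]
swapped = [(0.0, 0.0), (3.0, 0.0), (1.0, 0.0), (7.0, 0.0)]  # vectors 1 and 2 exchanged
print(neighborhood_errors(high, swapped, 2))  # (1.0, 0.0)
```

In this toy map every vector keeps the same two neighbors but in a different order, so the strict error is maximal while the weak error is zero, which is the situation the text describes as being penalized only by the strict measure.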

4 Results

To measure the quality of the maps achieved by HDVMLP, several data sets were considered. The first one is artificial and is shown in Fig. 3. It consists of five well-defined clusters in a three-dimensional space, plus a conflicting group of points that may not be defined as a cluster (sinusoidal shape). For each of the six clusters, identified as 0, ..., 5, there were 300 points. The remaining data sets are benchmarks for cluster analysis, described in Table 1. We compared our model with SOM, PCA (provided by the R software), ISOMAP (isomap.stanford.edu) and NeRV (http://www.cis.hut.fi/projects/mi/software/dredviz/).

To train HDVMLP, we set the population size to 50, the probability of crossover to 1.0, the probability of mutation to 0.07, and the number of epochs to 100. The number of units in the hidden layer was fixed at 15 for all data sets. The SOM was trained on a 30 × 30 lattice with hexagonal neighborhood, an initial neighborhood area of 20, a final neighborhood of 0, an initial learning parameter of 0.1, and an exponentially decreasing neighborhood. NeRV was trained with parameter λ = 0.5, which establishes the tradeoff between precision and recall. The best result out of 100 experiments was considered for SOM and NeRV.
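The HDVMLP training settings above can be read against a generic elitist genetic algorithm. The sketch below uses the reported population size, crossover probability, mutation probability and number of epochs as defaults. Everything else is an assumption made for illustration, since the text does not specify it: binary tournament selection, uniform crossover, Gaussian mutation with standard deviation 0.3, initial weights in [-1, 1], and a single elite individual. The name `evolve` is hypothetical.

```python
import random


def evolve(error, n_weights, pop_size=50, p_cross=1.0, p_mut=0.07,
           epochs=100, seed=0):
    """Minimise error(weights) with a simple elitist genetic algorithm."""
    rng = random.Random(seed)
    population = [[rng.uniform(-1.0, 1.0) for _ in range(n_weights)]
                  for _ in range(pop_size)]

    def pick(scored):
        # Binary tournament selection (an assumption, not taken from the paper).
        a, b = rng.sample(scored, 2)
        return a[1] if a[0] <= b[0] else b[1]

    for _ in range(epochs):
        scored = sorted(((error(ind), ind) for ind in population),
                        key=lambda s: s[0])
        next_pop = [list(scored[0][1])]  # elitism: the best individual survives unchanged
        while len(next_pop) < pop_size:
            mother, father = pick(scored), pick(scored)
            if rng.random() < p_cross:   # uniform crossover (assumed operator)
                child = [m if rng.random() < 0.5 else f
                         for m, f in zip(mother, father)]
            else:
                child = list(mother)
            # Each gene mutates independently with probability p_mut.
            child = [g + rng.gauss(0.0, 0.3) if rng.random() < p_mut else g
                     for g in child]
            next_pop.append(child)
        population = next_pop
    return min(population, key=error)


# Toy error standing in for a map error: squared distance of the weights from zero.
best = evolve(lambda w: sum(x * x for x in w), n_weights=10, epochs=30)
print(len(best))  # 10
```

For HDVMLP the `error` callable would map every data vector with the candidate weights and return the strict or weak neighborhood error of the resulting map, with `n_weights` equal to hu × (K + 2). Because of elitism, the best error in the population never increases from one epoch to the next.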

Fig. 3. The artificial data set consists of five well-defined clusters, plus a conflicting one (sinusoidal shape). The cluster identifiers (0, ..., 5) are indicated.

Table 1. Analyzed data sets

Name                    Dimension  Number of vectors  Number of clusters
3D artificial data set  3          1800               6
Iris                    4          150                4
Ionosphere              34         350                NA
Letter                  16         20000              26
Landsat                 36         2000               6

As the quality of the map is a function of both the number of neighbors, r, and the type of evaluation (weak or strict), we performed several evaluations on the data sets in Table 1, varying these parameters. Fig. 4-b shows the error of the map for the compared visualization models under weak neighborhood, and Fig. 4-a shows the error when strict neighborhood is considered. Under the strict neighborhood measure, map quality decreases with r for all methods, as it becomes increasingly difficult to maintain the order of the neighbors. In general, PCA is the worst method, as expected. HDVMLP is in general a very good tool, as its errors in all cases are lower than those observed for the other tools. For the artificial and Iris data sets, we show the maps obtained by HDVMLP in Fig. 5-a. Clusters (classes) tend to be mapped to well-segregated areas.

Fig. 4. Error for strict and weak neighborhood for the analyzed data sets. HDVMLP presents a good performance when compared to other visualization tools.

In a second set of experiments, the visualization of new data (not present during map construction) was evaluated. Of the compared models, only the SOM is able to map data a posteriori without additional modifications, so we compared the performance of HDVMLP and SOM. Fig. 5-b shows the maps obtained for the Letter and Landsat data sets. The test errors for the two data sets are shown in Table 2. The performance of HDVMLP is better.

Fig. 5. Maps obtained by HDVMLP. a) Maps for the artificial and the Iris data sets. b) Once the HDVMLP is trained, a test with previously unseen data is performed. Left: Letter data set; right: Landsat data set.

Table 2. Test errors for the Letter and Landsat data sets

Data set  r  SOM (weak)  SOM (strict)  HDVMLP (weak)  HDVMLP (strict)
Letter    1  0.583       0.583         0.547          0.547
Letter    2  0.596       0.613         0.568          0.578
Letter    3  0.598       0.658         0.583          0.591
Letter    4  0.613       0.715         0.612          0.617
Landsat   1  0.612       0.612         0.563          0.563
Landsat   2  0.637       0.689         0.593          0.624
Landsat   3  0.659       0.734         0.626          0.638
Landsat   4  0.712       0.756         0.687          0.667

The MLP ensemble approximates the best function that maps from a high-dimensional space to a low-dimensional space, under the constraint that neighbor vectors should be placed in nearby areas. The advantage of HDVMLP over the SOM lies not only in the performance already shown in the tables, but also in the possibility of controlling the mapping. In the SOM, for instance, there is no error measure. In contrast, in our model there is an error measure that is useful for guiding the construction of better maps. It is also possible to decide the precision of the map, as the number of neighbors may be specified.

5 Conclusion

We presented a tool for high-dimensional data visualization. The idea is that a good map is one in which the data distribution in the map space, in general two-dimensional, resembles the data distribution observed in the high-dimensional input space. From this, it is possible to define an error measure, which can be used to conduct reinforcement learning on a multilayer perceptron such that the error measure is minimized. If there is a function, possibly non-linear, that maps from a high-dimensional space to a low-dimensional space such that the error is very low, then the multilayer perceptron should be able to approximate that function. We conducted several experiments in which our proposal presents lower errors than other visualization tools. However, one drawback of our proposal is time. Our model may not be well suited to online applications, as the training time is at least one order of magnitude greater than the time required to train the SOM. However, if the visualization can be done offline, then our proposal is worth considering, since its errors tend to be lower than those of the SOM.

References
1. Wang, et al.: Data mining in bioinformatics. Springer, Heidelberg (2005)
2. Torkkola, K.: Feature extraction by non-parametric mutual information maximization. Journal of Machine Learning Research 3, 1415–1438 (2003)
3. Jolliffe, I.: Principal Component Analysis, 2nd edn. Springer, Heidelberg
4. Kohonen, T.: Self-Organizing Maps, 3rd edn. Springer, Heidelberg (2000)
5. Hujun, Y.: The self-organizing maps: Background, theories, extensions and applications. In: Computational Intelligence: A Compendium, pp. 715–762 (2008)
6. Kaski, S., Sinkkonen, J.: Principle of learning metrics for exploratory data analysis. The Journal of VLSI Signal Processing 37(2-3), 177–188 (2004)
7. Tenenbaum, J., de Silva, V., Langford, J.: A global geometric framework for nonlinear dimensionality reduction. Science 290, 2319–2323 (2000)
8. Venna, J., Kaski, S.: Local multidimensional scaling. Neural Networks 19, 889–899 (2006)
9. Venna, J., Peltonen, J., Nybo, K., Aidos, H., Kaski, S.: Information retrieval perspective to nonlinear dimensionality reduction for data visualization. Journal of Machine Learning Research 11, 451–490 (2010)
10. Deco, G., Schürmann, B.: Information dynamics, foundations and applications. Springer, Heidelberg (2001)
11. Rojas, R.: Neural networks, a systematic introduction. Springer, Heidelberg (1996)
12. Chellapilla, K., Fogel, D.: Evolving an Expert Checkers Playing Program without Using Human Expertise. IEEE Tr. on Evol. Comp. 5(4), 422–428 (2001)
13. Mitchell, M.: An introduction to genetic algorithms. The MIT Press, Cambridge (1998)

Input Separability in Living Liquid State Machines

Robert L. Ortman1,3, Kumar Venayagamoorthy4, and Steve M. Potter1,2

1 Laboratory for Neuroengineering, Georgia Institute of Technology
2 Coulter Department of Biomedical Engineering, Georgia Institute of Technology
3 School of Electrical and Computer Engineering, Georgia Institute of Technology, 313 Ferst Dr., Atlanta, GA 30313, USA
4 Real-Time Power and Intelligent Systems Laboratory, Missouri University of Science and Technology, 1870 Miner Circle, Rolla, MO 65409, USA
[email protected], [email protected], [email protected]

Abstract. To further understand computation in living neuronal networks (LNNs) and improve artificial neural networks (NNs), we seek to create a hybrid liquid state machine (LSM) that relies on an LNN for the reservoir. This study embarks on a crucial first step, establishing effective methods for finding large numbers of separable input stimulation patterns in LNNs. The separation property is essential for information transfer to LSMs and therefore necessary for computation in our hybrid system. In order to successfully transfer information to the reservoir, it must be encoded into stimuli that reliably evoke separable responses. Candidate spatio-temporal patterns are delivered to LNNs via microelectrode arrays (MEAs), and the separability of their corresponding responses is assessed. Support vector machine (SVM) classifiers assess separability and a genetic algorithm-based method identifies subsets of maximally separable patterns. The tradeoff between symbol set size and separability is evaluated.

Keywords: Separation property, cultured neuronal network, liquid state machine, support vector machine, microelectrode array.

1 Introduction

The central nervous system of complex organisms is well known to effectively handle complex computational problems such as pattern recognition and non-linear control, having an unmatched ability to adapt in real time to changing environmental cues. The most advanced artificial neural networks (NNs) have yet to rival the computational performance of simple brains for certain types of problems, leaving vast potential for improvement of NNs if their living counterparts can be better understood. While the nervous system has been examined on a multitude of scales, including the single-neuron level and the whole-brain level, only recently have neuroscientists and neuroengineers had the ability to study neurons on the small network level. This has been done by plating living neuronal cultures, known as living neuronal networks (LNNs), on microelectrode arrays (MEAs), forming a bi-directional interface between living neurons and computer systems. Electrodes embedded in the MEA substrate allow cultures to be monitored and stimulated for extended periods of time [1].

A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part I, LNCS 6593, pp. 220–229, 2011. © Springer-Verlag Berlin Heidelberg 2011

1.1 Motivation

This research embarks on preliminary steps necessary to enable advanced future studies of network learning and computational mechanisms in closed-loop hybrid neural microsystems. The external computer system interfaced to the LNN can send and receive arbitrary sensory and motor information to and from the LNN, limited only by the maximum data rate to which the LNN can respond and generate meaningful responses. Identifying a set of input stimulation patterns for a specific LNN that achieves an optimal tradeoff between raw data rate and separability is therefore of paramount importance. While previous studies of hybrid computational systems using in vitro LNNs have been limited to very low input/output rates and simple computational tasks with static goals, we aspire to perform far more advanced computations, including the control of complex dynamical systems. Past research by the authors includes the use of multi-electrode stimuli to train an LNN to control the behavior of a simulated animal (animat) in a goal-directed navigation task [2]. Current research aims to examine the ability of hybrid LNN-based systems to accurately predict and control non-linear, non-stationary dynamical systems, including simulated power systems (NSF EFRI-COPN Project #0836017). Other studies have applied NNs and advanced computational techniques to identification, control, and optimization of power systems and achieved promising results [3]. However, NNs have not been able to achieve the degree of optimal control and significant scalability evident in biological networks. Knowledge gained from studying LNNs can be exploited to better emulate their behavior in next generation NNs and produce vastly superior computational systems.
1.2 Neuronal Cell Cultures Cells from E18 (embryotic day 18) rat cortices (supplied by BrainBits®, Springfield, IL, USA) wereenzymatically and mechanically dissociated to obtain a target density of approximately 2500 cell/μL of medium and then layered onto laminin-coated 60electrode (59 recording/stimulation electrodes plus one ground) Multichannel Systems MEAs (30 μm diameter titanium nitride electrodes in a square grid with 200 μm spacing) [2]. Cells were plated and grown in Jimbo’s medium (containing 10% equine serum, sodium pyruvate, insulin, and GlutaMAX™) [4] [5]. When not in use, the LNNs were stored in an incubator at 35oC with 5% CO2, 9% O2, and 65% relative humidity in Teflon®-membrane sealed MEAs [6]. Experiments were performed during three to five weeks in vitro on cultures of approximately 20 000 living neurons. 1.3 Data Acquisition and Stimulation System Our customized electrophysiology system, NeuroRighter™[7] [8] allows for versatile low-latency closed-loop experiments. The hardware for stimulation and recording includesa Multichannel Systems MEA60 preamp to which the MEA directly connects. The amplified MEA output, containing neural signals, passes through custom signal conditioning interface boards before terminating onto two National Instruments™ (NI)

R.L. Ortman, K. Venayagamoorthy, and S.M. Potter

PCIe-6259 data acquisition cards (32 analog input channels each) installed in a PC. The stimulation output originates from the computer, from a PCIe-6259 card via its four analog outputs, and then passes through custom interface boards, multiplexer headstages, and finally into the MEA. The PCIe-6259 cards’ digital outputs are used to control the multiplexers. Independent recording and stimulation is possible from all 59 electrodes.

[Figure: block diagram with components labeled MCS Preamp, Custom Interface Boards, and Computer with A/D Cards (PCIe-6259s) running NeuroRighter]

Fig. 1. NeuroRighter System

The stimulation and recording protocol was programmed as a plug-in for NeuroRighter and is written in C#. The DAQ cards are controlled by the NeuroRighter software via NI Measurement Studio™ using NI DAQmx drivers.

2 Living Reservoir LSM Experiment

Previous research has demonstrated that the LSM model provides an excellent framework for inducing computation in excitable dynamical systems, referred to as reservoirs, such as biological neuronal networks [9]. Typical LSM research uses an NN as the reservoir, but recently the behavior of systems utilizing in vitro LNNs as a reservoir has been studied [11] [12]. Our experiments investigate the essential separation property for large numbers of input stimuli in a living reservoir LSM.

2.1 Method Details

Each input symbol was represented as a stimulation pattern defined by a specific sequence of four electrodes and a particular interpulse frequency. Data were collected from two different LNNs using stimulus pattern sets composed of 48 unique symbols (patterns). The maximally separable subsets of size two through approximately 25 were computed for each experimental data set using the methods described in Section 3. Patterns were tested in random order to minimize the effects of any potential short-term plasticity. Biphasic square pulses, 0.6 V peak-to-peak and 800 µs long, were used. From the 59 MEA electrodes, 32 stimulation electrodes were chosen at random from the subset found to evoke responses in a preliminary electrode screening procedure executed in NeuroRighter. The 32 stimulation electrodes were further subdivided into eight randomly chosen, non-overlapping groups of four electrodes each. In addition, six stimulation frequencies were selected in uniform steps between 15 Hz and 40 Hz (inclusive), yielding a total of 8 × 6 = 48 individual symbols. Each symbol set was tested using uniformly distributed random symbol input files with 1083 total symbol stimulations,

Input Separability in Living Liquid State Machines

corresponding to about five minutes of real time per experiment. Consequently, each unique pattern was tried about 23 times in a given experiment. Twenty such experiments were performed on each of the two LNNs.

[Figure: Raster Plot of Stimulation and LNN Response for Three-Pattern Subsequence; axes: Electrode Number vs. Time (seconds)]

Fig. 2. The plot above shows three stimulation patterns (symbols) being delivered to the LNN and their corresponding responses. The red asterisk symbols mark the times and electrode numbers on which stimulation occurred. The three input patterns shown here all differ in both spatial region (electrode set) and frequency. The blue dots indicate detected spikes after artifact removal.

2.2 Classification of LNN Responses to Sensory Input

Stimulation patterns (symbols) were interleaved with quiet (non-stimulation) periods. Pattern time length varied from approximately 100–300 ms (depending on frequency), and inter-symbol delay was fixed to 100 ms. Pattern parameters were chosen such that the mean stimulation frequency across the entire MEA was constrained to be approximately 20 Hz in order to prevent bursting activity that would disrupt meaningful responses [10].

[Figure: timeline of alternating phases: Sym3 (200 ms), Response (100 ms), Sym42 (200 ms), Response (100 ms), Sym15 (200 ms), …]

Fig. 3. The figure above is a graphical representation of the interleaved stimulation and response phase architecture described in this section. The times given are approximate mean durations for the different phases. Each Symx indicates stimulation with a randomly chosen symbol (pattern) from the set of 48.

During the non-stimulation period following each input symbol, the recorded spike activity was passed through a leaky integrator function, (1), producing the liquid state output. The output was subsequently sampled at 16 evenly spaced intervals five ms apart. Such an approach is based on commonly used techniques for extracting responses from LSMs and LNNs [11] [12].

$$x_i(t) = \sum_{s_i \le t} e^{-(t - s_i)/\tau} \tag{1}$$

In (1), x_i(t) are the values of the liquid state readout function over time for electrode i, and the s_i values are the time stamps designating when spikes were detected during the response period for a particular electrode. Time is represented by t; the time constant, τ = 60 ms, limits the memory of the output. Each stimulation trial corresponds to 59 liquid state outputs, one per electrode. Since each electrode’s integrator output is represented by 16 samples, every response is represented as a 59 × 16 = 944-dimensional vector.

[Figure: Example Leaky Integrator Output During Response Period; axes: Electrode Number vs. Time Elapsed Since End of Stimulus Pattern (ms)]

Fig. 4. The figure shows an example liquid state output from the leaky integrator in response to neuronal spikes recorded following a stimulation pattern (stimulus not shown). Liquid state output voltage is represented by color, with red indicating highest voltage and dark blue indicating zero. The time constant is 60 ms. Such responses are computed for every trial of each pattern and then sampled (as described above) to produce the input data for the classifier.
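The construction of the 944-dimensional response vector can be sketched as follows. This is a minimal illustration, not the authors' analysis code: it assumes that (1) is the usual sum of exponentials decaying with time constant τ after each spike, and that the 16 samples are taken at 5, 10, ..., 80 ms after the end of the stimulus. The function names and the spike data are hypothetical.

```python
import math

def liquid_state(spike_times_ms, t_ms, tau_ms=60.0):
    """Leaky-integrator value at time t for one electrode.

    Assumed form: each spike at time s <= t contributes exp(-(t - s) / tau).
    """
    return sum(math.exp(-(t_ms - s) / tau_ms) for s in spike_times_ms if s <= t_ms)

def response_vector(spikes_by_electrode, n_samples=16, step_ms=5.0, tau_ms=60.0):
    """Concatenate n_samples integrator readings per electrode into one vector."""
    features = []
    for spikes in spikes_by_electrode:
        for j in range(1, n_samples + 1):  # assumed sample times: 5, 10, ..., 80 ms
            features.append(liquid_state(spikes, j * step_ms, tau_ms))
    return features

# Hypothetical trial: 59 electrodes, only electrode 0 fires (at 2 ms and 30 ms).
spikes = [[2.0, 30.0]] + [[] for _ in range(58)]
vec = response_vector(spikes)
print(len(vec))  # 59 electrodes x 16 samples = 944
```

Silent electrodes contribute zeros, so the vector length stays fixed regardless of how many spikes a trial contains.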

2.3 Stimulus Artifact Suppression Techniques

Stimulus artifacts present a formidable obstacle to reliable data collection and analysis for these types of experiments. Several techniques are used to ensure recorded results reflect actual neuronal activity. NeuroRighter incorporates band-pass filtering and thresholding to detect spikes. In addition, NeuroRighter includes post-processing

using the SALPA (subtraction of artifacts by local polynomial approximation) algorithm [8], a variable time-constant polynomial curve fit used to subtract large voltage changes due to stimulation [13]. The real-time SALPA algorithm in NeuroRighter effectively suppresses a large amount of stimulation artifacts [8] [13], but some still remain. Further post-processing of data removes spikes with greater than 300 μV peak-to-peak amplitudes. Finally, offline waveform clustering software is used to cluster spike response waveforms into up to 192 units [14]. Spikes belonging to units whose average waveform deviates greatly from valid spike waveforms are removed. For typical data sets, 20% of spikes detected by NeuroRighter were eliminated by the clustering approach.

3 Response Classification and Determination of Separable Pattern Subsets

Offline analysis software was used for detection of stimulus patterns given the liquid state responses described in Section 2.2. The open source support vector machine (SVM) package, LIBSVM, was used to solve the multiclass SVM detection problem [15]. Training was performed on 33% of the data, and classification was attempted on the remaining 67%. Repeated random sub-sampling cross-validation was used in order to eliminate the bias that might occur from only choosing one random training and classification set [15] [16]. The cross-validation reduces variance and protects against Type III statistical errors [17]. After the first training and testing iteration, the training and classification groups were randomly reselected, new support vectors were trained, and classification was attempted with the same approach as before. This process was repeated for 50 trial iterations, and the overall mean performance was calculated. Fifty iterations were used because increasing the iteration count beyond 20–50 (depending on the data set) produced no further statistically significant reduction in the standard deviation of results. Performance was evaluated by mean classification accuracy for each particular set of patterns evaluated. Classification accuracy (1 – probability of symbol error) is defined as the ratio of the number of symbols correctly identified by the classifier to the total number tested.

3.1 Classifier Parameter Optimization

Leaky integration and SVM parameters were varied to optimize detection. The time constant, τ, was varied over a range of 5–100 ms, and 60 ms was found to be generally optimal. The number of samples taken between zero and 80 ms of the leaky integrated response period was varied between one and 32, inclusive, in powers of two. Accuracy increased with increasing number of samples until about eight, above which improvements substantially diminished. A setting of 16 samples was chosen since no classification performance improvement was observed beyond 16 samples, and more samples (therefore more SVM input features) significantly increase computation time and memory requirements.

The SVM kernel and corresponding parameters were also optimized for best classification accuracy. The kernels evaluated were linear, quadratic, third-order polynomial, fourth-order polynomial, radial basis function, and sigmoid (multilayer perceptron (MLP)). The sigmoid kernel (2) was selected since it produced the best results. Here u and v are training data vectors, and the kernel parameters are given by γ and C0. The optimal parameters were found to be as follows: C = 16 (cost parameter of C-SVC SVM), γ = (1 / number of features) = 1/944, and C0 = 0.

$$K(u, v) = \tanh\left(\gamma\, u^{T} v + C_0\right) \tag{2}$$
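The evaluation protocol of Section 3 can be sketched in a few lines. The block below is an illustration under stated assumptions, not the authors' software: it does not call LIBSVM, it writes the sigmoid kernel in the standard tanh(γ·uᵀv + C0) form, and the helper names `sigmoid_kernel`, `random_splits`, and `mean_accuracy` are hypothetical. A real run would train and test a classifier inside the loop over splits.

```python
import math
import random

def sigmoid_kernel(u, v, gamma, coef0=0.0):
    """Sigmoid (MLP) kernel: tanh(gamma * <u, v> + coef0)."""
    return math.tanh(gamma * sum(a * b for a, b in zip(u, v)) + coef0)

def random_splits(n_trials, train_fraction=0.33, iterations=50, seed=0):
    """Yield (train_idx, test_idx) pairs for repeated random sub-sampling."""
    rng = random.Random(seed)
    n_train = round(n_trials * train_fraction)
    for _ in range(iterations):
        idx = list(range(n_trials))
        rng.shuffle(idx)  # a fresh random partition on every iteration
        yield idx[:n_train], idx[n_train:]

def mean_accuracy(per_split_accuracy):
    """Overall performance: mean of the per-iteration accuracies."""
    return sum(per_split_accuracy) / len(per_split_accuracy)

# 1083 stimulations per experiment, 33% used for training in each of 50 iterations.
splits = list(random_splits(1083))
train_idx, test_idx = splits[0]
print(len(splits), len(train_idx), len(test_idx))

# Kernel value for two hypothetical all-ones feature vectors with gamma = 1/944.
k = sigmoid_kernel([1.0] * 944, [1.0] * 944, gamma=1.0 / 944)
```

Reshuffling the full index list on each iteration keeps the training and test sets disjoint within an iteration while letting them overlap across iterations, which is what distinguishes repeated random sub-sampling from k-fold cross-validation.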

3.2 Genetic Algorithm-Based Approach for Finding Best Pattern Subsets

From each initial candidate set of 48 patterns, we sought an algorithmic approach for finding the most separable pattern subsets of varying sizes in an effort to construct subsets corresponding to the greatest net bit rate (occurring at the optimal tradeoff between maximizing symbol set size and minimizing error probability). The chosen technique, based on genetic algorithms [18], was found to be highly effective. For each data set, some of the candidate patterns were found to evoke far more complex and reliable responses than others. Patterns evoking no response in the LNN for more than 33% of their trials were excluded at the start of the algorithm. Size-n subsets of the most separable patterns were determined from the remaining patterns in the following manner. First, separability between all pairwise combinations (n = 2) of the remaining patterns was assessed. The top 5% most separable subsets were selected to serve as the basis for forming the next generation, pattern subsets of size n = 3. This process was repeated until subset size reached the total number of usable patterns (typically about 20–25 of the original 48). Classification accuracy was used as the fitness function. Next-generation candidate pattern sets of (n + 1) patterns were bred from the most separable 5% of the size-n subsets by appending a single additional pattern chosen from the set of patterns whose fitness placed them in the best 10% of pattern subsets of size two. All possible combinations of new candidate subsets subject to these fitness constraints (the next generation) were formed and evaluated, and the process was repeated to produce the subsequent generations.

3.3 Results

The plots below show the results for maximally separable stimulus pattern (symbol) subsets based on data from the two different LNNs. Separability is indicated by mean classifier performance (1 – probability of error) for the chosen patterns and data set. The results portray the best pattern sets determined for the two LNNs. Based on these data, one could reasonably expect to transfer data into the living reservoir LSMs used in our experiments at a rate of approximately one byte per second.
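The generation-by-generation search of Section 3.2 can be sketched as below. This is one reading of the description, offered as an assumption rather than the authors' implementation: the extension pool is taken to be every pattern that appears in the best 10% of the size-two subsets, ties are broken by enumeration order, and `toy_fitness` is a hypothetical stand-in for the SVM classification accuracy used in the paper.

```python
import itertools
import math

def top_fraction(scored, fraction):
    """Keep the best `fraction` of (subset, fitness) pairs, at least one."""
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    return ranked[:max(1, math.ceil(fraction * len(ranked)))]

def grow_subsets(patterns, fitness, max_size, keep=0.05, pool=0.10):
    """Return {size: (best_subset, best_fitness)} for sizes 2..max_size."""
    pairs = [(frozenset(p), fitness(frozenset(p)))
             for p in itertools.combinations(patterns, 2)]
    # Patterns allowed as extensions: members of the best `pool` share of pairs.
    extension_pool = set().union(*(s for s, _ in top_fraction(pairs, pool)))
    best, generation = {}, pairs
    for size in range(2, max_size + 1):
        survivors = top_fraction(generation, keep)
        best[size] = survivors[0]
        if size == max_size:
            break
        # Breed size+1 candidates by appending one pool pattern to each survivor.
        children = {s | {p} for s, _ in survivors for p in extension_pool if p not in s}
        if not children:
            break
        generation = [(c, fitness(c)) for c in children]
    return best

def toy_fitness(subset):
    """Hypothetical separability score: smallest gap between pattern indices."""
    return min(abs(a - b) for a, b in itertools.combinations(sorted(subset), 2))

best = grow_subsets(range(8), toy_fitness, max_size=3)
print(best[2], best[3])
```

Because only the top 5% of each generation is extended, the number of fitness evaluations grows far more slowly than the number of all possible subsets, at the cost of possibly missing the global optimum.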

[Figure, two panels with curves for Culture 1 and Culture 2. Top: Probability of Error vs. Number of Symbols; axes: Probability of Symbol Error vs. Number of Symbols (Patterns). Bottom: Probability of Error vs. Raw Bit Rate; axes: Probability of Error vs. Raw Bit Rate (bit/s)]

Fig. 5. The above plots show performance for the system as it varies with the number of distinct patterns (symbols) and corresponding bit rate. Results for two LNNs, labeled Culture 1 and Culture 2, are displayed.
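The paper does not state how the raw bit rate on the horizontal axis was computed. A plausible reading, given here only as an assumption, is log2(number of symbols) multiplied by the symbol rate, with the symbol rate taken from the 1083 stimulations delivered in roughly five minutes (300 s is an approximation). The function name `raw_bit_rate` is hypothetical.

```python
import math

def raw_bit_rate(n_symbols, symbols_sent, duration_s):
    """Bits per second if every symbol were received without error."""
    symbol_rate = symbols_sent / duration_s  # symbols per second
    return math.log2(n_symbols) * symbol_rate

rate_48 = raw_bit_rate(48, 1083, 300.0)  # the full 48-symbol set
rate_4 = raw_bit_rate(4, 1083, 300.0)    # a four-symbol subset
print(round(rate_48, 2), round(rate_4, 2))
```

Under these assumptions the full set corresponds to roughly 20 bit/s, which matches the upper end of the bit-rate axis; the usable rate is lower because larger symbol sets are classified with more errors.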

4 Conclusions and Future Work

The results show that substantially separable input patterns can be found, and in future experiments, greater performance is expected. Given the present results, we can

expect to transfer input data to the LNNs at a rate of about one byte per second. Such performance is useful for certain types of potential in vitro learning experiments but falls short of that expected given better MEAs, cultures, and coding schemes. Of the two cultures used in these experiments, one produced reliable recordings from only one-fifth to one-fourth of its electrodes due to a combination of MEA electrode failure and culture death. The other LNN exhibited pathological bursting that often could not be consistently suppressed. In subsequent studies, we plan to collect data from large numbers of high-quality LNNs plated on fully functional MEAs and use a wider range of stimulus patterns. In addition to SVM-based classification, other approaches may be explored for assessing separability. In upcoming experiments, stimulation patterns will be optimized online during experiments via a closed-loop stimulation pattern plug-in for NeuroRighter. We also seek to identify which stimulus pattern parameters correlate with greater separability and reliability over large time scales (hours to days).

4.1 Closed-Loop Input Pattern Separability Optimization

Although these experiments were successful in finding some highly separable pattern sets, the patterns were found much later than the experiments on the living cultures that responded to them. This is problematic for future computational experiments, since we aim to find separable patterns and subsequently use them to transfer sensory input to the LNNs. Since living neuronal networks are highly dynamic over time, even the most reliable patterns may not be effective at later times. Separable patterns must ideally be found just prior to a learning experiment and their performance frequently reevaluated over time. With a closed-loop system, candidate patterns will be tested on live cultures and analyzed online for separability using variants of the methods previously described. Ineffective patterns will be eliminated and replaced with new patterns. The process will repeat until pattern sets are found that achieve the net bit rate required for a particular experiment. The sets’ effectiveness can be monitored over time, and patterns can be adjusted in an attempt to maintain successful communication.

4.2 Characteristics of Separable Patterns

Analysis of the electrode subsets and frequencies corresponding to the most and least separable patterns for an LNN is ongoing. Preliminary results show that a large difference between spatial positions of electrodes is much more correlated with high separability than a difference in frequency between patterns. We intend to analyze performance with respect to each pattern’s frequency, electrode impedances, spatial locations in the MEA, and other parameters in an effort to determine which characteristics correlate with high reliability and separability.

Acknowledgments. This work is supported under a grant from the National Science Foundation, “Neuroscience and Neural Networks for Engineering the Future Intelligent Electric Power Grid,” as part of the Emerging Frontiers in Research and Innovation (EFRI) program for Cognitive Optimization and Prediction (COPN) projects, NSF EFRI-COPN Grant #0836017.

References

1. Taketani, M., Baudry, M.: Advances in Network Electrophysiology Using Multi-Electrode Arrays. Springer, Heidelberg (2006)
2. Bakkum, D.J., Chao, Z.C., Potter, S.M.: Spatio-Temporal Electrical Stimuli Shape Behavior of an Embodied Cortical Network in a Goal-Directed Learning Task. J. Neur. Eng., 310–323 (2008)
3. Park, J., Harley, R., Venayagamoorthy, G.: Adaptive Critic Based Optimal Neurocontrol for Synchronous Generator in Power System Using MLP/RBF Neural Networks. IEEE Trans. on Industry Applications 39(5) (2003)
4. Potter, S.M., Wagenaar, D.A., DeMarse, T.B.: Closing the Loop: Stimulation Feedback Systems for Embodied MEA Cultures (2001)
5. Wagenaar, D.A.: Persistent Dynamic Attractors in Activity Patterns of Cultured Neuronal Networks. Phys. Rev. E 73(5) (2006)
6. Potter, S.M., DeMarse, T.B.: A New Approach to Neural Cell Culture for Long-Term Studies. J. Neurosci. Methods 110, 17–24 (2001)
7. Rolston, J.: Creating NeuroRighter (2009), http://groups.google.com/group/neurorighter-users
8. Rolston, J.D., Gross, R.E., Potter, S.M.: A Low-Cost Multielectrode System for Data Acquisition Enabling Real-Time Closed-Loop Processing with Rapid Recovery from Stimulation Artifacts. Frontiers in Neuroengineering 2(12), 1–17 (2009)
9. Maass, W., Natschlaeger, T., Markram, H.: A Model for Real-Time Computation in Generic Neural Microcircuits. Adv. Neural Info. Proc. Sys. 15, 229–236 (2003)
10. Wagenaar, D.A., Madhavan, R., Potter, S.M.: Controlling Bursting in Cortical Cultures with Multi-Electrode Stimulation. J. Neurosci. 25(3), 680–688 (2005)
11. Hafizovic, S., Heer, F., et al.: A CMOS-Based Microelectrode Array for Interaction with Neuronal Cultures. J. Neurosci. Methods, 93–106 (2007)
12. Dockendorf, K.P., Park, I.: Liquid State Machines and Cultured Cortical Networks: The Separation Property. Biosystems 95, 90–97 (2009)
13. Wagenaar, D.A., Potter, S.M.: Real-Time Multi-Channel Stimulus Artifact Suppression by Local Curve Fitting. J. Neurosci. Methods 120, 113–120 (2002)
14. Quian Quiroga, R., Nadasdy, Z., Ben-Shaul, Y.: Unsupervised Spike Detection and Sorting with Wavelets and Superparamagnetic Clustering. Neural Comput. 16, 1661–1687 (2004)
15. Chang, C.-C., Lin, C.-J.: LIBSVM: A Library for Support Vector Machines (2001), http://www.csie.ntu.edu.tw/~cjlin/libsvm (September 2010 release)
16. Geisser, S.: Predictive Inference, New York (1993)
17. Mosteller, F.: A k-Sample Slippage Test for an Extreme Population. Annals of Mathematical Statistics 19(1), 58–65 (1948)
18. Fraser, A.S.: Simulation of Genetic Systems by Automatic Digital Computers. Australian Journal of Bio. Sci. 10, 484–491 (1957)

Predictive Control of a Distillation Column Using a Control-Oriented Neural Model

Maciej Ławryńczuk

Institute of Control and Computation Engineering, Warsaw University of Technology
ul. Nowowiejska 15/19, 00-665 Warsaw, Poland
Tel.: +48 22 234-76-73
[email protected]

Abstract. This paper describes a special neural model developed with the specific aim of being used in nonlinear Model Predictive Control (MPC). The model consists of two neural networks. The model structure strictly mirrors its role in a suboptimal (linearisation-based) MPC algorithm: the first network is used to calculate on-line the influence of the past, while the second network directly estimates the time-varying step-response of the locally linearised neural model, without explicit on-line linearisation. Advantages of MPC based on the described model structure (high control accuracy, computational efficiency and ease of development) are demonstrated in the control system of a distillation column.

Keywords: Process control, Model Predictive Control, neural networks, optimisation, soft computing.

1 Introduction

Model Predictive Control (MPC) is a computer control algorithm in which an explicit model of the process is used on-line to predict its future behavior over some horizon and to optimise the future control policy [6,10]. MPC algorithms have been successfully used for years in advanced industrial applications [9]. This is because MPC algorithms, unlike any other control technique, can take into account constraints imposed on process inputs (manipulated variables) and outputs (controlled variables). Moreover, MPC can be efficiently used for multivariable processes, with many inputs and outputs, and for processes with difficult dynamic properties (e.g. with significant time-delays).

The role of a model in MPC algorithms is fundamental. The model must be selected very carefully, bearing in mind its role in MPC. Not only accuracy but also suitability for on-line control must be taken into account. First of all, the model must be accurate enough, capable of precise long-range prediction. An inaccurate model gives erroneous predictions, which is likely to result in unacceptable control. Although classical linear MPC algorithms which use linear models give good or satisfactory results quite frequently, the majority of processes are nonlinear by nature. For such processes nonlinear MPC algorithms based on nonlinear models must be used [7,10]. Application of neural models in MPC [4,5,8,10] is an interesting option because neural models have excellent approximation abilities, a simple, regular structure and a limited number of parameters (when compared to other model types). Thanks to this, neural models can be used in MPC very efficiently. A rudimentary neural model is the simplest choice; specialised neural models designed for MPC can also be used, for example multi-models [3]. Such models are trained easily as one-step ahead predictors (no recurrent training is necessary), but for prediction they are not used recurrently.

This paper details a special control-oriented neural model and its application in MPC of a distillation column. The model is designed taking into account its specific role in MPC. It consists of two neural networks. The model structure strictly mirrors its role in a suboptimal (linearisation-based) MPC algorithm [4,5]: the first network is used to calculate on-line the free trajectory (i.e. the influence of the past), while the second network directly estimates the time-varying step-response of the locally linearised model, without explicit on-line linearisation. The algorithm requires solving on-line a quadratic programming task.

A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part I, LNCS 6593, pp. 230–239, 2011. © Springer-Verlag Berlin Heidelberg 2011

2 Model Predictive Control Algorithms

In MPC algorithms [6,10] at each consecutive sampling instant k, k = 0, 1, 2, ..., a set of future control increments

$$\Delta \mathbf{u}(k) = \left[\Delta u(k|k) \;\; \Delta u(k+1|k) \;\ldots\; \Delta u(k+N_u-1|k)\right]^T \tag{1}$$

is calculated. It is assumed that $\Delta u(k+p|k) = 0$ for $p \ge N_u$, where $N_u$ is the control horizon. The objective is to minimise differences between the reference trajectory $y^{\mathrm{ref}}(k+p|k)$ and predicted values of the output $\hat{y}(k+p|k)$ over the prediction horizon $N \ge N_u$. Constraints are usually imposed on input and output variables. Future control increments (1) are determined from the following MPC optimisation task (hard output constraints [6,10] are used for simplicity)

$$\min_{\Delta \mathbf{u}(k)} \left\{ \sum_{p=1}^{N} \left(y^{\mathrm{ref}}(k+p|k) - \hat{y}(k+p|k)\right)^2 + \sum_{p=0}^{N_u-1} \lambda_p \left(\Delta u(k+p|k)\right)^2 \right\}$$

subject to

$$\begin{aligned} u^{\min} \le u(k+p|k) \le u^{\max}, &\quad p = 0, \ldots, N_u-1 \\ -\Delta u^{\max} \le \Delta u(k+p|k) \le \Delta u^{\max}, &\quad p = 0, \ldots, N_u-1 \\ y^{\min} \le \hat{y}(k+p|k) \le y^{\max}, &\quad p = 1, \ldots, N \end{aligned} \tag{2}$$

Only the first element of the determined sequence (1) is applied to the process, i.e. $u(k) = \Delta u(k|k) + u(k-1)$. At the next sampling instant, k + 1, measurements are updated and the whole procedure is repeated. Predictions $\hat{y}(k+p|k)$ are calculated from a dynamic model of the process.

3 Classical Linear MPC of a Distillation Column

The plant under consideration, shown in Fig. 1, is a distillation column used for separation and purification of a mixture of cyclohexane and heptane. The feedstream is introduced on stage 17 and it has composition $x_f = 0.5$. The column has 30 trays. In order to control the composition $x_1$ of the top product (distillate) a supervisory MPC algorithm is used. From the perspective of MPC, the process has one manipulated variable R, which is the reflux ratio ($R = L_1/D$, where $L_1$ is the flow rate of the reflux and D is the flow rate of the top product), and one controlled variable $x_1$, which is the composition of the top product. The sampling time of MPC is 1 min. Two additional fast single-loop PID controllers (denoted as LC) are used to stabilise levels in the reflux tank and in the bottom product tank.

Fig. 1. The distillation column control system structure

The distillation column manifests very nonlinear behaviour. In particular, its steady-state characteristic is highly nonlinear, as shown in Fig. 2 (left). The manipulated variable changes from $R^{\min} = 1.5$ (which corresponds to composition $x_1^{\min} = 0.8301$) to $R^{\max} = 7$ (which corresponds to $x_1^{\max} = 0.9975$). As a result, for the considered operation range the steady-state gain changes approximately 201 times, from $1.295 \times 10^{-1}$ to $6.425 \times 10^{-4}$, which is illustrated in Fig. 2 (right). The fundamental model of the distillation column (a set of ordinary differential equations solved using the Runge-Kutta RK45 method) is used as the real process during simulations. At first, it is simulated open-loop in order to obtain the training and validation data sets depicted in Fig. 3. Each set has 2000 samples. The output signal contains small measurement noise. A linear model is obtained

$$y(k) = b_1 u(k-1) + b_2 u(k-2) - a_1 y(k-1) - a_2 y(k-2) \tag{3}$$

Fig. 2. The steady-state characteristics of the process (left) and the gain (right)

Fig. 3. The training data set (left) and the validation data set (right)

where $u = R - R_{\mathrm{nom}}$ and $y = 10(x_1 - x_{1,\mathrm{nom}})$; $R_{\mathrm{nom}} = 3.9236$ and $x_{1,\mathrm{nom}} = 0.99$ correspond to the nominal operating point. The linear model has very poor accuracy, as shown in Fig. 4. In terms of the Sum of Squared Errors (SSE), $\mathrm{SSE}_{\mathrm{training}} = 3.380 \times 10^{2}$ and $\mathrm{SSE}_{\mathrm{validation}} = 1.552 \times 10^{2}$. Simulation results of the classical MPC algorithm based on the linear model (3) are depicted in Fig. 5. Tuning parameters are: $N = 10$, $N_u = 1$, $\lambda_p = 0.5$. As the reference trajectory ($x_1^{\mathrm{ref}}$), five set-point changes are considered which cover the whole range of interest. Unfortunately, due to inaccuracy of the linear model, the classical MPC algorithm is slow.
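To make the accuracy measure concrete, the sketch below evaluates a second-order linear model of the form (3) on scaled data and computes its SSE. It is an illustration only: the paper does not give the identified coefficients, so the values used here are hypothetical, and the SSE is computed for one-step-ahead predictions, which is an assumption about how the reported figures were obtained.

```python
def scale(R, x1, R_nom=3.9236, x1_nom=0.99):
    """Scaled input and output around the nominal point: u = R - R_nom, y = 10 (x1 - x1_nom)."""
    return R - R_nom, 10.0 * (x1 - x1_nom)

def arx_predict(u, y, b1, b2, a1, a2):
    """One-step-ahead predictions of the linear model for k = 2 .. len(y) - 1."""
    return [b1 * u[k - 1] + b2 * u[k - 2] - a1 * y[k - 1] - a2 * y[k - 2]
            for k in range(2, len(y))]

def sse(y, y_hat):
    """Sum of squared errors between measured y(k), k >= 2, and the predictions."""
    return sum((yk - yh) ** 2 for yk, yh in zip(y[2:], y_hat))

# Hypothetical coefficients and input sequence, used only to exercise the functions.
b1, b2, a1, a2 = 0.5, 0.1, -0.6, 0.05
u = [0.0, 1.0, 1.0, -0.5, 0.2, 0.0, 0.7, 0.7]
y = [0.0, 0.0]
for k in range(2, len(u)):
    y.append(b1 * u[k - 1] + b2 * u[k - 2] - a1 * y[k - 1] - a2 * y[k - 2])

# Data generated by the model itself is reproduced exactly, so the error vanishes.
print(sse(y, arx_predict(u, y, b1, b2, a1, a2)))
```

On real process data the same two functions would return a nonzero SSE, and a large value relative to the signal range indicates the kind of mismatch reported for the linear model.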

4 Nonlinear MPC Using Control-Oriented Neural Model

Let the dynamic process under consideration be described by the following discrete-time Nonlinear Auto Regressive with eXternal input (NARX) model

$$y(k) = f(\mathbf{x}(k)) = f(u(k-\tau), \ldots, u(k-n_B), y(k-1), \ldots, y(k-n_A)) \tag{4}$$

Fig. 4. The output of the process (solid line) vs. the output of the linear model (dashed line) for the training data set (left) and for the validation data set (right)

Fig. 5. Simulation results of the MPC algorithm based on the linear model

where $f: \mathbb{R}^{n_A+n_B-\tau+1} \to \mathbb{R}$ is a nonlinear function realised by a neural network, and the integers $n_A$, $n_B$, $\tau$ define the order of dynamics, $\tau \le n_B$. If the nonlinear model (4) is directly used for prediction, the MPC optimisation problem (2) becomes a nonlinear one. To eliminate the necessity of on-line nonlinear optimisation (which is computationally demanding and may terminate in local minima), a suboptimal MPC algorithm with Nonlinear Prediction and Linearisation (MPC-NPL) [4,5] can be used. A linear approximation

$$y(k) = \sum_{l=\tau}^{n_B} b_l(k)\, u(k-l) - \sum_{l=1}^{n_A} a_l(k)\, y(k-l) \tag{5}$$

of the nonlinear model (4) is successively calculated on-line for the current operating point and next used for prediction and calculation of the future control policy (1). Coefficients $a_l(k)$ and $b_l(k)$ are calculated on-line [4,5]. It can be

shown [4] that thanks to linearisation, the output prediction vector is

$$\hat{\mathbf{y}}(k) = \left[\hat{y}(k+1|k) \ldots \hat{y}(k+N|k)\right]^T = \mathbf{G}(k)\Delta\mathbf{u}(k) + \mathbf{y}^0(k) \tag{6}$$

The output prediction is expressed as the sum of a forced trajectory, which depends only on the future (on future control moves $\Delta\mathbf{u}(k)$), and a free trajectory $\mathbf{y}^0(k) = \left[y^0(k+1|k) \ldots y^0(k+N|k)\right]^T$, which shows the influence of the past. The dynamic matrix $\mathbf{G}(k)$ of dimensionality $N \times N_u$ contains step-response coefficients of the linearised model (5). The free trajectory and step-response coefficients are calculated on-line for the current operating point [4,5]. Thanks to the fact that a linearised model is used for prediction, the MPC optimisation problem (2) becomes an easy to solve quadratic programming task

$$\min_{\Delta\mathbf{u}(k)} \left\{ \left\| \mathbf{y}^{\mathrm{ref}}(k) - \mathbf{G}(k)\Delta\mathbf{u}(k) - \mathbf{y}^0(k) \right\|^2 + \left\| \Delta\mathbf{u}(k) \right\|^2_{\boldsymbol{\Lambda}} \right\}$$

subject to

$$\begin{aligned} \mathbf{u}^{\min} \le \mathbf{J}\Delta\mathbf{u}(k) + \mathbf{u}(k-1) \le \mathbf{u}^{\max} \\ -\Delta\mathbf{u}^{\max} \le \Delta\mathbf{u}(k) \le \Delta\mathbf{u}^{\max} \\ \mathbf{y}^{\min} \le \mathbf{G}(k)\Delta\mathbf{u}(k) + \mathbf{y}^0(k) \le \mathbf{y}^{\max} \end{aligned} \tag{7}$$

Definitions of all vectors and matrices are given in [4,5]. In contrast to nonlinear optimisation, the MPC-NPL quadratic programming problem (7) can be solved on-line within a foreseeable time, and its unique global solution is always found.

4.1 Control-Oriented Neural Model

The control-oriented model and the discussed MPC algorithm are depicted in Fig. 6. The model structure strictly mirrors its role in the MPC-NPL algorithm. For prediction the suboptimal prediction equation (6) is used, but unlike the MPC-NPL algorithm the model is comprised of two neural networks. The first network (NN1) is used to calculate on-line the free trajectory, while the second network (NN2) directly estimates the time-varying step-response of the locally linearised neural model. On the one hand, the algorithm calculates on-line the control policy from the MPC-NPL quadratic programming task (7). On the other hand, in contrast to the classical linearisation-based MPC-NPL algorithm, the nonlinear model is not explicitly linearised on-line and step-response coefficients of the linearised model are not calculated on-line from this linearisation. At each sampling instant k of the algorithm the following steps are repeated:

1. The first neural network (a dynamic model of the process, NN1) is used to calculate the nonlinear free trajectory $\mathbf{y}^0(k)$ [4,5].
2. The second neural network (a neural approximator, NN2) is used to find step-response coefficients $s_1(k), \ldots, s_N(k)$ of the linearised model (NN1).
3. The future control policy $\Delta\mathbf{u}(k)$ is calculated from (7).
4. The first element of the calculated sequence is applied to the process, i.e. $u(k) = \Delta u(k|k) + u(k-1)$.
5. The iteration of the algorithm is increased, i.e. set k := k + 1, go to step 1.
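Steps 3 and 4 above can be sketched for the special case of a control horizon of one, where the quadratic programme has a closed-form solution. The block is a simplified illustration under explicit assumptions: output constraints are ignored, the single column of the dynamic matrix is the list of step-response coefficients, the previous input is assumed feasible, and the function name `mpc_step_nu1` and all numbers in the example call are hypothetical. With a longer control horizon a general QP solver would be needed instead.

```python
def mpc_step_nu1(s, y_ref, y0, u_prev, lam, u_min, u_max, du_max):
    """One MPC iteration for a control horizon of one, without output constraints.

    s     : step-response coefficients s_1..s_N (single column of the dynamic matrix)
    y_ref : reference trajectory over the prediction horizon
    y0    : free trajectory over the prediction horizon
    """
    # Unconstrained minimiser of sum((y_ref - s*du - y0)^2) + lam * du^2.
    num = sum(sj * (r - f) for sj, r, f in zip(s, y_ref, y0))
    den = sum(sj * sj for sj in s) + lam
    du = num / den
    # Intersect the rate constraint with the magnitude constraint on u_prev + du.
    lo = max(-du_max, u_min - u_prev)
    hi = min(du_max, u_max - u_prev)
    # Clipping is exact here because the cost is a convex function of one variable.
    du = min(max(du, lo), hi)
    return u_prev + du

# Hypothetical numbers: two-step horizon, unit step-response coefficients.
u_next = mpc_step_nu1(s=[1.0, 1.0], y_ref=[1.0, 1.0], y0=[0.0, 0.0],
                      u_prev=3.0, lam=2.0, u_min=1.5, u_max=7.0, du_max=1.0)
print(u_next)  # 3.5
```

In the algorithm described above, `y0` would come from the first network and `s` from the second network at every sampling instant, so no explicit linearisation appears in the loop.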

Fig. 6. The MPC algorithm based on the control-oriented neural model

4.2 Neural Models and Training

Both neural networks (NN1 and NN2) used in this study are MultiLayer Perceptron (MLP) structures with one hidden layer and a linear output [1]. They are trained off-line. The first network is the classical NARX neural model (4). It is trained using data sets shown in Fig. 3, preferably in a recurrent mode. The second network calculates on-line the approximation of step-response coefficients $s_1(k), \ldots, s_N(k)$ for the current operating point, which is defined by the vector $\mathbf{x}(k) = \left[u(k-1) \ldots u(k-\bar{n}_B) \;\; y(k) \ldots y(k-\bar{n}_A)\right]^T$

$$\begin{aligned} s_1(k) &= g_1(\mathbf{x}(k)) \qquad (8) \\ &\;\;\vdots \\ s_N(k) &= g_N(\mathbf{x}(k)) \qquad (9) \end{aligned}$$

The most straightforward choice is $\bar{n}_A = n_A$, $\bar{n}_B = n_B$, where $n_A$ and $n_B$ determine the order of dynamics of the NARX neural model (NN1). The nonlinear mapping $g: \mathbb{R}^{\bar{n}_A+\bar{n}_B+1} \to \mathbb{R}^N$ is realised by the second network. It is necessary to emphasise the fact that unlike the first network, the second network works as an ordinary (steady-state) approximator. Hence, it is not trained recurrently. Delayed input and output signals defining the operating points are inputs of the network NN2, and step-response coefficients are desired outputs (targets) of the network. There are 3 different methods of obtaining data sets for training:

1. The neural model NN1 is simulated open-loop; the training and validation data sets (Fig. 3) used for training this model serve as excitation signals. During simulation the model NN1 is successively linearised and step-response coefficients are calculated. As a result of simulation, data sets necessary for training the second neural network are generated.

Predictive Control of a Distillation Column


Fig. 7. The output of the process (solid line with dots) vs. the output of the neural model NN1 (dashed line with circles) for the training data set (left) and for the validation data set (right)

2. Having obtained the neural model NN1, the MPC-NPL algorithm is developed and simulated for a randomly changing reference trajectory. During simulation the model is linearised and step response coefficients are calculated. As a result, data sets for the second network training are recorded.
3. Step responses of the process for different operating conditions must be recorded. Next, the neural network is trained to find relations (8)-(9) between the operating point and step response coefficients [2]. The operating point can be defined in a straightforward way by the most recent measurements of input and output signals, i.e. by u(k − 1) and (or) y(k).
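The linearise-then-compute-step-response part shared by the first two methods can be sketched as below. The sketch makes several assumptions that are not taken from the paper: the model is linearised numerically by central differences rather than analytically, the linearised model is written as y(k) = Σ b_i u(k−i) − Σ a_i y(k−i), the step response follows the standard recursion for such a model, and `toy_narx` is a made-up stand-in for the trained network NN1.

```python
import math


def linearise(f, x, eps=1e-6):
    """Central-difference gradient of the scalar model f at the point x."""
    grad = []
    for i in range(len(x)):
        hi = list(x)
        lo = list(x)
        hi[i] += eps
        lo[i] -= eps
        grad.append((f(hi) - f(lo)) / (2.0 * eps))
    return grad


def step_response(b, a, horizon):
    """s_j = sum_{i<=min(j,nB)} b_i - sum_{i<=min(j-1,nA)} a_i * s_{j-i}."""
    s = []
    for j in range(1, horizon + 1):
        value = sum(b[:min(j, len(b))])
        for i in range(1, min(j - 1, len(a)) + 1):
            value -= a[i - 1] * s[j - i - 1]
        s.append(value)
    return s


def toy_narx(x):
    """Hypothetical second-order model with arguments
    [u(k-1), u(k-2), y(k-1), y(k-2)]; it stands in for NN1."""
    u1, u2, y1, y2 = x
    return math.tanh(0.5 * u1 + 0.2 * u2) + 0.6 * y1 - 0.1 * y2


def training_targets(f, x, horizon):
    """Targets for NN2 at one operating point: linearise f, then step response."""
    grad = linearise(f, x)
    b = grad[:2]                 # sensitivities to u(k-1), u(k-2)
    a = [-g for g in grad[2:]]   # sign convention of the linearised model
    return step_response(b, a, horizon)


targets = training_targets(toy_narx, [0.0, 0.0, 0.0, 0.0], 3)
```

Repeating `training_targets` along a simulated trajectory would yield (operating point, step-response) pairs of the kind used to train the second network.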

5 Nonlinear MPC of a Distillation Column

The dynamic neural model (the neural network NN1) has the same arguments as the linear model (3)

\[
y(k) = f(u(k-1), u(k-2), y(k-1), y(k-2))
\]

As in the case of the linear model, the same data sets are used (Fig. 3). A few model candidates are trained, with different numbers of hidden nodes. As a good compromise between accuracy and complexity, the model with 6 hidden nodes is selected. Unlike the linear model (Fig. 4), the accuracy of the neural model is very high, as depicted in Fig. 7. Model errors are: $SSE_{training} = 7.955 \times 10^{-2}$, $SSE_{validation} = 7.705 \times 10^{-2}$. Next, the neural model NN1 is simulated open-loop, with the data sets shown in Fig. 3 used as excitation signals. During simulation the model is linearised and step response coefficients are calculated. As a result, data sets necessary for training and validation of the network NN2 are generated. They are shown in Fig. 8. The operating point is defined by the most recent measurements of input and output signals, i.e. $x(k) = [u(k-1) \;\; y(k)]^T$ in (8)-(9). The model NN2 with 3 hidden nodes is chosen as sufficiently precise.


Fig. 8. Data sets for training and validation of the neural network NN2

Fig. 9. Simulation results: the MPC-NO algorithm with nonlinear optimisation based on the classical neural model (solid line with dots) and the MPC algorithm based on the control-oriented neural model and quadratic programming (dashed line with circles)

The following MPC algorithms are compared: a) the discussed MPC algorithm based on the control-oriented neural model and quadratic programming, b) the classical MPC-NPL algorithm with on-line model linearisation and quadratic programming [4,5], c) the MPC-NO algorithm with on-line nonlinear optimisation [5]. In the last two algorithms the classical neural model (NN1 ) is used. Parameters of all algorithms are the same as in the case of linear MPC. Fig. 9 shows trajectories obtained in the MPC-NO algorithm (it is treated as the reference) and in the described algorithm based on the control-oriented neural model. Table 1 shows accuracy and computational burden of compared algorithms. Results obtained in both suboptimal approaches are very similar, but the described algorithm is 15.28% less computationally demanding than the classical MPC-NPL algorithm. The MPC-NO algorithm is slightly faster, but computationally demanding.
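As a quick check of the quoted saving, the relative reduction in computational load can be recomputed from the rounded MFLOPS values reported in Table 1 (0.4317 for the control-oriented model, 0.5096 for the classical MPC-NPL algorithm, 3.4518 for MPC-NO). The small difference from the quoted 15.28% is presumably a rounding effect of the tabulated values; that explanation is an assumption, not a statement from the source.

```latex
\[
\frac{0.5096 - 0.4317}{0.5096} = \frac{0.0779}{0.5096} \approx 0.153,
\qquad
\frac{3.4518}{0.4317} \approx 8.0
\]
```

The second ratio indicates that, on these figures, nonlinear optimisation costs roughly eight times as much as the control-oriented approach.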

Table 1. Accuracy (SSE) and computational load (MFLOPS) of compared algorithms

Algorithm                                  Optimisation   SSE      MFLOPS
MPC based on the control-oriented model    Quadratic      0.0916   0.4317
MPC-NPL based on the classical model       Quadratic      0.0918   0.5096
MPC-NO based on the classical model        Nonlinear      0.0896   3.4518

6 Conclusions

The control-oriented neural model described in this paper is designed taking into account its specific role in suboptimal (linearisation-based) MPC algorithms. In comparison with classical suboptimal MPC algorithms, the model is not successively linearised on-line; instead, the time-varying step-response coefficients are directly estimated by a neural network. The algorithm requires solving on-line a quadratic programming task. Advantages of the described algorithm are: high control accuracy (close to that of the MPC-NO algorithm with nonlinear optimisation), computational efficiency (better than that of the classical suboptimal MPC-NPL algorithm) and ease of development.

Acknowledgement. The work presented in this paper was supported by Polish national budget funds for science.

References

1. Haykin, S.: Neural networks – a comprehensive foundation. Prentice Hall, Englewood Cliffs (1999)
2. Ławryńczuk, M.: Neural Dynamic Matrix Control algorithm with disturbance compensation. In: García-Pedrajas, N., Herrera, F., Fyfe, C., Benítez, J.M., Ali, M. (eds.) IEA/AIE 2010. LNCS (LNAI), vol. 6098, pp. 52–61. Springer, Heidelberg (2010)
3. Ławryńczuk, M., Tatjewski, P.: Nonlinear predictive control based on neural multimodels. International Journal of Applied Mathematics and Computer Science 20, 7–21 (2010)
4. Ławryńczuk, M.: Neural networks in model predictive control. In: Nguyen, N.T., Szczerbicki, E. (eds.) Intelligent Systems for Knowledge Management. Studies in Computational Intelligence, vol. 252, pp. 31–63. Springer, Heidelberg (2009)
5. Ławryńczuk, M.: A family of model predictive control algorithms with artificial neural networks. International Journal of Applied Mathematics and Computer Science 17, 217–232 (2007)
6. Maciejowski, J.M.: Predictive control with constraints. Prentice Hall, Harlow (2002)
7. Morari, M., Lee, J.H.: Model predictive control: past, present and future. Computers and Chemical Engineering 23, 667–682 (1999)
8. Nørgaard, M., Ravn, O., Poulsen, N.K., Hansen, L.K.: Neural networks for modelling and control of dynamic systems. Springer, London (2000)
9. Qin, S.J., Badgwell, T.A.: A survey of industrial model predictive control technology. Control Engineering Practice 11, 733–764 (2003)
10. Tatjewski, P.: Advanced control of industrial processes, Structures and algorithms. Springer, London (2007)

Neural Prediction of Product Quality Based on Pilot Paper Machine Process Measurements

Paavo Nieminen1, Tommi Kärkkäinen1, Kari Luostarinen2, and Jukka Muhonen2

1 Department of Mathematical Information Technology, University of Jyväskylä, Finland
{paavo.j.nieminen,tommi.karkkainen}@jyu.fi
2 Metso Paper, Jyväskylä, Finland
{kari.luostarinen,jukka.muhonen}@metso.com

Abstract. We describe a multilayer perceptron model to predict the laboratory measurements of paper quality using the instantaneous state of the papermaking production process. Actual industrial data from a pilot paper machine was used. The final model met its goal accuracy 95.7% of the time at best (tensile index quality) and 66.7% at worst (beta formation). We anticipate usage possibilities in lowering machine prototyping expenses, and possibly in quality control at production sites.

Keywords: MLP, Prediction, Paper quality, Pilot paper machine.

1 Introduction

Challenges in the paper industry include degradation of raw materials to reduce costs, minimizing environmental impacts, and producing paper of tolerable quality with high throughput. Paper machine design must evolve accordingly. New designs are tested in a pilot paper machine in which components, raw materials, and process control parameters can be varied. The effect on the end product quality is measured in a paper laboratory. This is a necessary but costly activity.

In this article, we envision a real-time system to help with the pilot paper machine measurements: while an experiment is taking place, a database of preceding recordings of process states and quality measurements would be used to train a neural network model in real time to predict paper quality based on the current process state. As soon as new laboratory measurements are made, they would be appended to the database. The goal is that after a while of accumulating data, the predictions would become accurate enough to interpolate or extrapolate quality, thus reducing the number of costly experiments required.

A similar system could be installed in a production machine in which similar measurements are performed. The predictions might detect anomalous conditions in the process state by showing degradation in the predicted “virtual quality”, perhaps before the condition becomes noticeable in the actual product.

Section 2 briefly discusses the industry and related research. Section 3 describes our data, Section 4 our algorithm, and Section 5 the tuning of metaparameters. Section 6 discusses our results and future research topics.

A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part I, LNCS 6593, pp. 240–249, 2011. © Springer-Verlag Berlin Heidelberg 2011

Neural Prediction of Paper Quality

2 Related Work

The modern paper machine is a complex process line with an extensive set of components, subprocesses, and interactions, many of which are difficult to model using first principles. Soft modeling tools, such as artificial neural networks, are therefore required [1]. Neural networks are often applied in models and controllers of the various subprocesses found in the pulp and paper industry [2,3,4]. Examples of systems close to our current goal are [5] and [6], which predict curl, an important quality attribute of final paper, from selected process parameters. Even more similar to our work is [7], which addresses multiple paper quality measurements using a number of input variables comparable to ours. Our work, on the other hand, focuses more on pilot machine operations (i.e., prototyping workflow) than on finding process control paths at a delivered production machine. We also touch on the issue of algorithmic meta-parameter selection as a follow-up to earlier methodological research described in [8,9,10,11].

3 Pilot Paper Machine Experiments

The expensive pilot paper machine experiments need to be carefully planned beforehand. An experiment plan contains a number of “trial points”, i.e., instants of time when the paper production process is set up to a state of interest. While the machine is running, an expert whom we call the “trial leader” gives instructions to the machine crew of how to run it for the next trial point. Online measurements from the sensors installed in the process are then stored, and samples of the produced paper are taken to the laboratory for off-line physical and chemical analyses. The trial leader receives the laboratory results, which may necessitate fine-tuning of the plan amidst the experiments. At times, a “change of mechanical settings” is performed, which can mean adjusting existing settings, replacing a prototype component with another one, and/or changing the composition of the raw materials inserted into the process.

We analyzed two datasets, referred to as D1 and D2, spanning 10 days of pilot paper machine runs distributed over seven months. They are from typical experiments in which a new machine concept had been installed and the quality response to various process conditions had been examined. There were 229 trial points involving 10 mechanical settings. Some trials addressed process behaviour instead of end product quality, and therefore they lacked all quality measurements. Additionally, in some trials, interesting behaviour had been expected of only some of the qualities, and other measurements had been omitted. We restricted our analysis to the available measurements.

Table 1 lists the six quality variables that we considered and the number of trial points in which they had been measured. The first two are alternative quantities indicative of the uniformity of the paper; the others reflect the performance in the final application of the product. Table 1 also shows the measurement units (some of which are derived, dimensionless indices) and rough estimates of the relative error of each measurement, used as goal accuracies of our prediction algorithm.

Table 1. Quality variables, number of samples, and measurement errors

Quality variable             Unit      N     Error
Formation Index (Kajaani)    (index)   182   ±5%
Beta-formation               g/m2      178   ±5%
Air permeance                ml/min    182   ±10%
Huygen – low range           J/m2      118   ±8%
Oil absorbency symmetry      (index)   140   ±7%
Tensile index                (index)   182   ±6%

Figure 1 shows all the 229 trial points on the horizontal axis, and two of the quality measurements are plotted on the vertical axes. To conserve space, we omit the other four qualities from the illustrations in this paper. Continuous lines are plotted between consecutive trial points, i.e., when the mechanical setting remains the same and the quality measurement is available for at least two sequential trial points. Gaps show trial points where no quality data was available. Each vertical divider line marks the first trial point after a change of mechanical settings. Numbering is shown on the bottommost plot, beginning with the trial points 1, 2, . . . , 124 of D1 and followed by 1, 2, . . . , 105 of D2. In this research, we refrain from any interpretations of the quality values, i.e., whether high or low values should be pursued for each. Instead, we focus on assessing the accuracy of the developed neural prediction algorithm.

[Figure 1: two panels, “Formation − ForInd” and “Huygen, J/m2 − low range”; horizontal axis: trial point number, with ticks at 1, 26, 38, 47, 72, 90, 104 (D1) and 1, 32, 50, 76 (D2).]

Fig. 1. Measurements of two quality variables from a pilot paper machine. 124 trial points in D1 and 105 in D2.

In addition to quality measurements (output data), time-averaged process state measurements (input data) had been stored at each trial point. We used in total 213 state variables (including various pressures, speeds and material consistencies), known to industry professionals to be the most inﬂuential ones. Also available were the original experiment plans which contain the goal states of process control parameters (subset of the 213 state variables) and additional notes written by the pilot machine personnel. We omit details, but in Table 2 we give a coarse overview of the experiment plans with expert assessments of how big the expected diﬀerences between the mechanical settings were. Every change was highly signiﬁcant except perhaps that on trial point 47 of D1.

Table 2. Trial points and mechanical setting changes

Dataset   Trial points   Mechanical setting
D1        1–25           #1 (first one in our data)
          26–37          #2
          38–46          #3a (very big change)
          47–71          #3b (very small change)
          72–89          #4 (big change)
          90–103         #5
          104–124        #6
D2        1–31           #7 (greatly different from those in D1)
          32–49          #8
          50–75          #9
          76–105         #8 (same as for 32–49)

4 Method for Predicting Quality from Process State

To simulate potential real-world use, we proceed as follows:

– Assume that in the beginning there is no previous knowledge; append only the first three trial points of D1 to our database of previous knowledge.
– Then, for each successive trial point (beginning from the fourth one of D1), train a neural network model using the previous points as training samples and attempt to predict the newest quality measurement.
– Add each trial point one-by-one to the database, and continue the procedure from the next trial point, just like a real-time application would work.

4.1 Trial Point Selection and Preprocessing

The number of available trial points differs for each quality variable (see Table 1). Also, it is apparent from the experiment plans that often the focus is on just one or a few of the qualities. Based on these facts, we train a separate neural network model for each quality variable. When dealing with one of these models, we select only the trial points where the quality has been measured.

It is likely that there are correlations between many of the input variables. Therefore, we begin by reducing input dimensionality using the well-known principal component analysis (PCA) method [12]. We compute the principal components for the whole data available for the quality being modeled. We acknowledge that in a real-world scenario, complete data would be unavailable before all the experiments were done. However, for preprocessing purposes, a representative sample of process state variability from earlier tests is sufficient, and we expect such data to be abundant in the pilot machine archives. Alternatively, the PCA could be applied for only the previous process states each time.

We project the process measurement data on the major principal components that explain 90% of the variance. Then we introduce additional vectors that contain the squares of the projected coordinates, followed with another pass of PCA and a new selection of components with 90% variance. The latter step facilitates the handling of expected nonlinearities in the process data (see [9] and references therein). We acknowledge that better results could be obtained with some threshold other than our admittedly ad hoc selection of 90%. As a result, some 15–17 coordinates, depending on the quality-specific selection of data, remain in the input vectors. Finally, the resulting state vectors are scaled componentwise to the interval [−1, 1], which is the range of the tanh function used in the neural model.

4.2 Neural Model

For predicting an individual quality measurement, we train a multilayer perceptron (MLP) neural network model. For general information about neural networks, we refer to [13]. The specific MLP formulation we used has been addressed in [8,9], and later developments of the method for classification tasks are found in [10,11]. This time the model is used for prediction of a continuous value instead of classification, thus extending the earlier scope of applications. Our MLP is implemented directly from the matrix representation described for example in [8,9,14]. For any input vector $x \in \mathbb{R}^D$, where D is the input dimension (in our case the number of principal components chosen in our preprocessing scheme), we set

\[
o^0 = x, \qquad o^l = \mathcal{F}^l(W^l \hat{o}^{(l-1)}) \quad \text{for } l = 1, \ldots, L. \tag{1}
\]

The result of Eq. (1) is determined by the layer-wise neural weight matrices $W^l$. The hat notation $\hat{o}^{(l-1)}$ means that the output vector $o^{(l-1)}$ of the previous layer is extended by an additional coordinate of value one, facilitating the bias mechanism so that the first column of $W^l$ contains the biases of neurons on layer l. Finally, the notation $\mathcal{F}^l(\cdot)$ denotes the application of a function matrix to a vector. Here, the function matrices are diagonal, and they consist solely of the tanh activation function on hidden layers and the identity mapping on the output layer (also known as “linear activation”). The final output of the MLP resides in the output vector $o^L$, which in the present case is one-dimensional. We use the notation $\mathcal{N}(\{W^l\})(x) = o^L$ to denote the output.

For training the neural network model, we use the set $\{x_i\}_{i=1}^N$ of training input vectors, i.e., the preprocessed process state vectors of the N trial points encountered so far. Our training targets $\{t_i\}_{i=1}^N$ are the laboratory measurements of paper quality. The prediction error of the i-th training sample is $e_i = \mathcal{N}(\{W^l\})(x_i) - t_i$, and the training of the neural network happens by minimizing the cost function

\[
J_\beta(\{W^l\}) = \frac{1}{2N} \sum_{i=1}^{N} |e_i|^2 + \beta \sum_{l=1}^{L} \frac{1}{2 S_l} \sum_{(i,j) \in I_l} |W^l_{i,j}|^2, \tag{2}
\]

where β ≥ 0 is the regularization parameter, common in MLP training. In our variant, the set $I_l$ contains all other indices of the weight matrices except those corresponding to the first column of $W^L$, i.e., the biases of the linear output layer. $S_l$ is the number of elements in the index set $I_l$. See [8] for argumentation.

The outcome of our algorithm depends on two meta-parameters: architecture of the neural network and the regularization coefficient β. We fixed the architecture to have one hidden layer of 25, 10, or 3 neurons, and the training was done with each β ∈ {0.4, 0.2, 0.1, 0.01}. To avoid overfitting, we use these relatively large values of β and also a rather loose minimization accuracy criterion. We minimize (2) using the following multistart scheme:

1. For 25 times, begin from random initial weight matrices and use conjugate gradient minimization to move towards a local minimum, up to a loose accuracy, i.e., until the norm of the cost function gradient is below $10^{-2}$.
2. From among the 25 results obtained, select the network that has yielded the smallest cost function value.

A new MLP model is trained for every quality variable every time a new data point has been added to the history database. For our current purpose of building a proof-of-concept predictor of paper quality, this re-training scheme suffices, taking only a few seconds on a desktop computer.
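The forward pass of Eq. (1) and the cost function of Eq. (2) can be sketched in a few lines. This is a reading of the formulas above rather than the authors' implementation: weight matrices are plain nested lists, the bias coordinate is assumed to be prepended (so that column 0 holds the biases), only the output-layer biases are excluded from the penalty, and the function names `forward` and `cost` are hypothetical. The multistart conjugate gradient training itself is not shown.

```python
import math


def forward(weights, x):
    """Eq. (1): o^0 = x, o^l = F^l(W^l [1; o^(l-1)]).

    tanh is applied on hidden layers and the identity on the last layer.
    """
    o = list(x)
    last = len(weights) - 1
    for l, w in enumerate(weights):
        o_hat = [1.0] + o  # extra coordinate of value one; column 0 of W holds biases
        z = [sum(w_ij * o_j for w_ij, o_j in zip(row, o_hat)) for row in w]
        o = z if l == last else [math.tanh(v) for v in z]
    return o


def cost(weights, xs, ts, beta):
    """Eq. (2): mean squared error term plus layer-wise scaled weight penalty."""
    n = len(xs)
    fit = sum((forward(weights, x)[0] - t) ** 2 for x, t in zip(xs, ts)) / (2.0 * n)
    penalty = 0.0
    last = len(weights) - 1
    for l, w in enumerate(weights):
        start = 1 if l == last else 0  # skip only the output-layer biases
        values = [w_ij for row in w for w_ij in row[start:]]
        penalty += sum(v * v for v in values) / (2.0 * len(values))  # 1 / (2 S_l)
    return fit + beta * penalty


# Tiny example: one input, one hidden neuron, one output.
example_weights = [[[0.0, 0.0]], [[2.0, 1.0]]]
example_output = forward(example_weights, [0.3])
example_cost = cost(example_weights, [[0.3]], [1.0], beta=0.2)
```

In the example the hidden neuron outputs tanh(0) = 0, so the network output equals the output bias, and the penalty acts only on the single non-bias output weight.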

5 Results

In Figure 2 we have plotted the complete set of predictions for the two qualities of Figure 1 as plus-signs. The true measurements are again plotted as connected dots. Here, the MLPs had 25 hidden units, and the regularization coefficient was β = 0.2. From overview plots of this kind, we made the following conclusions:

– Predictions of the first 30–40 trial points are oscillatory and highly inaccurate, as is expected because there has been no sufficient training history.
– Later, there is less oscillation in the predictions, unfortunately even so much that the prediction seems to follow only a more or less average behaviour rather than the actual measurements.
– In some parts, the predictions seem to be strikingly accurate.
– Gross errors are rare and concentrated either at the mechanical setting boundaries or at isolated instances for which explanations can be found from the detailed experiment plans and notes made by the trial leader.
– The predictions seem to be able to follow the change from D1 to D2, where the mechanical setting is different, months have passed since the recordings of D1, and the overall level of the true measurements jumps for many of the output variables. This unexpected reward may be partly due to our preprocessing that uses information from both parts of the data.

Similar plots were made for all the six qualities and for other meta-parameter settings. With less regularization, the predictions avoid the averaging nature better, but the errors tend to be exaggerated.

[Figure 2: two panels, “Formation − ForInd” and “Huygen, J/m2 − low range”, plotted against trial point number.]

Fig. 2. Prediction results (heavy regularization, β = 0.2 while training)

Let us now examine the accuracy of the predictions in more detail. Figure 3 shows the absolute value of the relative error (in percentages) made by each prediction, $100 \times \left| \frac{\mathcal{N}(\{W^l\}^{(j)})(x_j) - t_j}{t_j} \right|$, where $\{W^l\}^{(j)}$ is the neural network newly trained to predict a quality variable at the j-th trial point, $x_j$ is the preprocessed state vector at the trial point, and $t_j$ is the true value. The dots on the vertical axes of the figure show the relative errors of each quality variable upon each available trial point. The errors correspond to the predictions shown in Figure 2. The vertical labels highlight two values for each quality: the worst case error and the “goal error”. The latter reflects the known typical accuracy of the laboratory measurement, and we could say that any prediction with relative error smaller than this goal is a “good”, reliable one. The captions summarize the number and percentage of predictions that comply with the goal.

[Figure 3: two panels of |error %| against trial point number. “Formation − ForInd”: worst case error 22, goal error 5, 122/179 predictions (68 %) good. “Huygen, J/m2 − low range”: worst case error 49, goal error 8, 83/115 predictions (72 %) good.]

Fig. 3. Relative error of prediction (heavy regularization, β = 0.2)

Space allows no further illustrations here, but we note that plots like Figures 2 and 3 were a key element in evaluating the performance of the algorithm between meta-parameter settings because they pinpoint the trial points of good and bad accuracy, enabling comparisons with the experiment plans, and they also aid in observing how well the model follows each change from a trial point to the next.

In addition to the detailed plots, we used the overall percentage of good predictions as a general measure of prediction performance. Table 3 presents these numbers for our combinations of model training meta-parameters: 25, 10, and 3 hidden units (on table rows) and regularization coefficients of β ∈ {0.4, 0.2, 0.1, 0.01} (on columns). The best rates found for each quality variable are highlighted with an asterisk. It can be seen that an optimal combination for predicting one quality may not be the optimal one for the others. This evidence supports our decision to use independent networks.

We also found out that looking at all of the data may yield accuracy measures that are unnecessarily discouraging. Consider the following facts:

Table 3. Ratio of good predictions

Formation - ForInd:
        regularization
#hid    0.40    0.20    0.10    0.01
25      67.6    68.2*   67.0    60.3
10      63.7    63.7    65.9    65.9
3       59.2    64.2    64.8    66.5

Huygen, J/m2 - low range:
        regularization
#hid    0.40    0.20    0.10    0.01
25      75.7    72.2    68.7    65.2
10      69.6    73.9    73.9    68.7
3       70.4    71.3    76.5*   70.4

– In the beginning of data gathering, it is unrealistic to expect good accuracy from a predictor trained with data that is clearly insufficient as of yet.
– It is known that a change of mechanical settings may drive the process and quality measurables to a different range. We should be fair to the prediction algorithm in that errors during the first couple of trial points after a mechanical setting change should be forgiven.
– Additional “unfair” or impossible trial points can be singled out using the information in the experiment plans: at times, a process control is taken very far away from the earlier standard window, or some controls that usually remain untouched have been used temporarily in just a few trial points. A prediction method cannot extrapolate into magical lengths, nor does it have to in these singular cases which belong to the natural pilot machine workflow.

Based on these considerations, and a thorough examination of the experiment plans in collaboration with the trial leader responsible for D1 and D2, we selected trial points where we can fairly expect a prediction method to work. Table 4 lists the ratios of good predictions in this reduced set. The number of samples is small, and thus many equal numbers are marked as the best. Figure 4 illustrates predicted and true values; they are a subset of the results shown in Figure 2.

Table 4. Ratios of good predictions among a subset chosen with prior expert knowledge of the experiment plans

Formation - ForInd:
        regularization
#hid    0.40    0.20    0.10    0.01
25      76.8*   73.9    72.5    62.3
10      76.8*   73.9    72.5    76.8*
3       68.1    76.8*   75.4    75.4

Beta-formation, g/m2:
        regularization
#hid    0.40    0.20    0.10    0.01
25      66.7    53.6    58.0    49.3
10      65.2    66.7    66.7    53.6
3       59.4    68.1*   65.2    58.0

Air permeance, ml/min:
        regularization
#hid    0.40    0.20    0.10    0.01
25      60.9    72.5*   68.1    65.2
10      62.3    65.2    69.6    71.0
3       46.4    58.0    56.5    66.7

Huygen, J/m2 - low range:
        regularization
#hid    0.40    0.20    0.10    0.01
25      86.5*   80.8    80.8    69.2
10      78.8    82.7    80.8    78.8
3       80.8    84.6    82.7    82.7

Oil absorbency - ts/bs:
        regularization
#hid    0.40    0.20    0.10    0.01
25      76.6    74.5    72.3    70.2
10      83.0*   78.7    74.5    70.2
3       74.5    83.0*   78.7    74.5

Tensile index - geom ind:
        regularization
#hid    0.40    0.20    0.10    0.01
25      94.2    95.7*   88.4    85.5
10      88.4    94.2    91.3    87.0
3       71.0    85.5    91.3    92.8

[Figure 4: two panels, “Formation − ForInd” and “Huygen, J/m2 − low range”, plotted against trial point number.]

Fig. 4. Prediction of points known to be of greatest interest

Underlined entries in Table 4 indicate ﬁnal model selections, which sometimes diﬀer from the best values. To arrive at these ﬁnal meta-parameters, we used all the tools described so far: First, the tabular assessments of accuracy were used to rule out clearly unfavourable alternatives. Then the point-by-point plots of predictions and errors were used to ﬁne-tune the selection towards somewhat more complex and less regularized networks that respond better to faster changes and not only show a long term average. This tuning was based on the subset of trial points carefully selected using knowledge from the experiment plans. In general, the models with 10 hidden neurons worked well, and 0.20 was often a good regularization coeﬃcient. In some cases the performance changes monotonically with the change of meta-parameters, yet in some cases not. For future examinations, a wider and more dense set of meta-parameters ought to be tried to arrive at more reliable selections.
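The evaluation protocol used throughout this section (train on all earlier trial points, predict the next one, and count the predictions whose relative error is below the goal error) can be sketched as follows. The predictor is deliberately a placeholder: `mean_predictor` simply returns the mean of the past targets and stands in for the re-trained MLP, and all function names here are hypothetical.

```python
def walk_forward(xs, ts, fit_predict, n_init=3):
    """Predict each trial point from all earlier ones, starting after n_init points."""
    predictions = []
    for j in range(n_init, len(xs)):
        predictions.append(fit_predict(xs[:j], ts[:j], xs[j]))
    return predictions


def relative_errors(predictions, truths):
    """Absolute relative error of each prediction, in percent."""
    return [100.0 * abs(p - t) / abs(t) for p, t in zip(predictions, truths)]


def good_ratio(errors, goal):
    """Percentage of predictions with relative error smaller than the goal error."""
    return 100.0 * sum(1 for e in errors if e < goal) / len(errors)


def mean_predictor(past_xs, past_ts, x_new):
    """Placeholder model: ignores the inputs and returns the mean past target."""
    return sum(past_ts) / len(past_ts)


# Made-up measurements for illustration only (not data from the paper).
demo_ts = [10.0, 10.0, 10.0, 10.0, 20.0]
demo_xs = [[0.0]] * len(demo_ts)
demo_preds = walk_forward(demo_xs, demo_ts, mean_predictor)
demo_errors = relative_errors(demo_preds, demo_ts[3:])
demo_ratio = good_ratio(demo_errors, goal=5.0)
```

Starting after three initial points is consistent with the counts reported above, e.g. 179 predictions for a quality measured at 182 trial points.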

6 Discussion

We developed a multilayer perceptron neural network model for predicting the six most important quality measurables of newspaper grade paper from 213 process state measurements of a pilot paper machine. We created a proof-of-concept prototype of the prediction system, suitable for visualizing prediction errors and the effect of meta-parameters. Future work is needed before practical use is possible. Development foci identified during this research are as follows:

– Incremental learning (such as RAN-LTM [15]) instead of re-training.
– “Forgetting” or weighting of early history, emphasizing the latest data points.
– Combine many predictors: For example, one for the overall trend of the process and another one for local fluctuations within one mechanical setting.
– Assess the sensitivity of the prediction to changes in the controllable process parameters, to improve real-world usability.
– Broader and denser grid of meta-parameters and tuning heuristics.

As a conclusion, we find that our neural system yields reliable estimates most of the time, when it works close to the process state window used for training. This assumption of standardized conditions is fulfilled in a production machine, and also with an experimental pilot machine at times known to the personnel.


As real-world use scenarios, we envisioned two main possibilities: cost savings in pilot machine operations, and anomaly detection in a production machine.

References

1. Leiviskä, K. (ed.): Papermaking Science and Technology, 2nd edn. Process and Maintenance Management, vol. 14. Paperi ja Puu Oy (2009)
2. Ribeiro, B., Dourado, A., Costa, E.: Industrial kiln multivariable control: MNN and RBFNN approaches. In: ICANNGA 1995, pp. 408–411. Springer, Heidelberg (1995)
3. Ribeiro, B.: Prediction of the lime availability on an industrial kiln by neural networks. In: IJCNN 1998, vol. 3, pp. 1987–1991. IEEE, Los Alamitos (1998)
4. Rajesh, K., Ray, A.K.: Artificial neural network for solving paper industry problems: A review. Journal of Scientific & Industrial Research 65, 565–573 (2006)
5. Edwards, P., Murray, A., Papadopoulos, G., Wallace, A., Barnard, J., Smith, G.: The application of neural networks to the papermaking industry. IEEE Transactions on Neural Networks 10(6), 1456–1464 (1999)
6. Wang, F., Sanguansintukul, S., Lursinsap, C.: Curl forecasting for paper quality in papermaking industry. In: Asia Simulation Conference – 7th International Conference on System Simulation and Scientific Computing, pp. 1079–1084 (2008)
7. Lampinen, J., Taipale, O.: Optimization and simulation of quality properties in paper machine with neural networks. In: ICNN 1994, pp. 3812–3815. IEEE, Los Alamitos (1994)
8. Kärkkäinen, T.: MLP in layer-wise form with applications to weight decay. Neural Computation 14(6), 1451–1480 (2002)
9. Kärkkäinen, T., Heikkola, E.: Robust formulations for training multilayer perceptrons. Neural Computation 16(4), 837–862 (2004)
10. Nieminen, P., Kärkkäinen, T.: Ideas about a regularized MLP classifier by means of weight decay stepping. In: Kolehmainen, M., Toivanen, P., Beliczynski, B. (eds.) ICANNGA 2009. LNCS, vol. 5495, pp. 32–41. Springer, Heidelberg (2009)
11. Nieminen, P., Kärkkäinen, T.: Comparison of MLP cost functions to dodge mislabeled training data. In: IJCNN 2010. IEEE, Los Alamitos (2010)
12. Jolliffe, I.T.: Principal Component Analysis, 2nd edn. Springer, New York (2002)
13. Haykin, S.: Neural Networks: A Comprehensive Foundation, 2nd edn. Prentice Hall, New Jersey (1999)
14. Hagan, M.T., Menhaj, M.B.: Training feedforward networks with the Marquardt algorithm. IEEE Transactions on Neural Networks 5(6), 989–993 (1994)
15. Kobyashi, M., Zamani, A., Ozawa, S., Abe, S.: Reducing computations in incremental learning for feedforward neural network with long-term memory. In: IJCNN 2001, pp. 1989–1994. IEEE, Los Alamitos (2001)

A Robotic Scenario for Programmable Fixed-Weight Neural Networks Exhibiting Multiple Behaviors

Guglielmo Montone, Francesco Donnarumma, and Roberto Prevete

Dipartimento di Scienze Fisiche, Università di Napoli Federico II

Abstract. Artificial neural network architectures usually exhibit a single, special-purpose behavior determined by a fixed structure, i.e., by parameters computed in a training phase. In contrast with this approach, we present a robotic scenario in which an artificial neural network architecture, the Multiple Behavior Network (MBN), is proposed as a robotic controller in a simulated environment. MBN is composed of two Continuous-Time Recurrent Neural Networks (CTRNNs) organized hierarchically: the Interpreter Module (IM) and the Program Module (PM). IM is a fixed-weight CTRNN designed to behave as an interpreter of the signals coming from PM, and is thus able to switch among different behaviors in response to the programs output by PM. We suggest how such an MBN architecture can be incrementally trained to exhibit, and even acquire, new behaviors by letting PM learn new programs without modifying the structure of IM.

Keywords: Multiple behaviors, fixed-weights, CTRNN, programmability, robotics.

1 Multiple Behaviors in a Fixed-Weight Neural Network

A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part I, LNCS 6593, pp. 250–259, 2011. © Springer-Verlag Berlin Heidelberg 2011

The ability to exhibit rapid, substantial and qualitative changes of behavior as a consequence of environmental changes is typical of human beings. It is an open question whether this kind of ability can always be explained as the result of rapid changes of brain connectivity (potentiation and/or depression of the synapses), or could also be the consequence of a computation carried out by a fixed connectivity, i.e., without potentiation or depression of the synapses. This issue has recently been discussed, for example, in [1]. In spite of this debate, Artificial Neural Network (ANN) architectures are usually developed as special-purpose systems, especially when they are used to model the activity of brain neuronal areas (see for example [13]). In other words, ANN architectures are developed so as to exhibit, after a possible training phase, a unique/special behavior in response to the input signals while preserving the structure, i.e., network parameters such as weights and biases. Understanding if and how it is possible to achieve an ANN architecture with a fixed structure able to quickly switch among different behaviors is a recent and open problem.

One of the first papers in this field is due to Yamauchi and Beer [15]. More recently, the study of ANNs able to show multiple behaviors has been associated with the realization of autonomous robots driven by biologically plausible artificial neural networks such as Continuous-Time Recurrent Neural Networks (CTRNNs) [11,3]. Blynel and Floreano [3] built a robot able to display learning-like abilities, driven by a network in which no modification of synaptic strengths takes place. Paine and Tani [11] realized a network hierarchically organized in two layers: the lower layer learns to execute two behaviors, while the upper layer, on receiving an environmental input, leads the lower network to reactively select the appropriate behavior between the two learned ones. In a series of papers [9,14,6], some of them by authors of this paper, the problem of multiple behaviors in ANNs is analyzed from a different point of view. These papers focus on the concept of programmability, as defined in computer science, explored in a biologically plausible neural network. By programmability in ANNs we mean the property of a fixed-weight ANN of working as an interpreter capable of emulating the behavior of other networks, given the appropriate codes. In [7,5] a possible architecture is presented to provide the CTRNN framework with programmability. Building on this approach, we propose here a CTRNN architecture, the Multiple Behavior Network (MBN), able to control a robot in a simulated environment. In our experiment the robot explores a maze environment using sonars and a camera and shows two different behaviors depending on the camera input. In particular, the robot can behave as a right-follower, following the wall on its right at every crossroads, or as a left-follower, following the wall on its left.
The MBN is hierarchically organized into two CTRNN modules. The lower one is programmable: it can emulate either a network that achieves a right-follower or a network that achieves a left-follower when the higher module of the MBN applies the appropriate code for that behavior to special input lines, the auxiliary inputs. In the next section we describe the architecture used to provide CTRNNs with programmability. In Section 3 we describe the MBN realization and the robotic scenario in which we test it. We conclude with a short discussion in Section 4.

2 A Programmable Fixed-Weight CTRNN

CTRNNs are networks of biologically inspired neurons described by the following general equations [10,2]:

\[ \frac{dy_i}{dt} = \frac{1}{\tau_i}\left( -y_i + \sum_{j=1}^{N} W_{ij}\,\sigma(y_j) + \sum_{k=1}^{L} WE_{ik}\, I_k \right) \tag{1} \]

In equation (1), y_i is the membrane potential of the i-th neuron and τ_i is its time constant, which sets how quickly the neuron responds to the external sensory inputs I_k and to the signals σ(y_j) coming from the presynaptic neurons. The function σ(x) is the standard logistic function.
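As an illustration of equation (1), the following minimal sketch integrates a CTRNN with the forward Euler method. The integration scheme, the step size `dt` and the function names are assumptions made for this example; the text does not state how the equations were integrated.

```python
import math


def logistic(x):
    """Standard logistic function sigma(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def ctrnn_step(y, inputs, W, WE, tau, dt):
    """Advance all N membrane potentials by one forward-Euler step of eq. (1).

    y      : list of N membrane potentials y_i
    inputs : list of L external inputs I_k
    W      : N x N recurrent weights, W[i][j] goes from neuron j to neuron i
    WE     : N x L input weights, WE[i][k] goes from input k to neuron i
    tau    : list of N time constants tau_i
    dt     : integration step (assumed much smaller than every tau_i)
    """
    n = len(y)
    outputs = [logistic(v) for v in y]
    new_y = []
    for i in range(n):
        recurrent = sum(W[i][j] * outputs[j] for j in range(n))
        external = sum(WE[i][k] * inputs[k] for k in range(len(inputs)))
        dy_dt = (-y[i] + recurrent + external) / tau[i]
        new_y.append(y[i] + dt * dy_dt)
    return new_y


def simulate(y0, inputs, W, WE, tau, dt=0.01, steps=1000):
    """Integrate the network for `steps` Euler steps with constant inputs."""
    y = list(y0)
    for _ in range(steps):
        y = ctrnn_step(y, inputs, W, WE, tau, dt)
    return y
```

For a single neuron with no recurrent connection, a unit input weight and a constant input of 0.5, the membrane potential relaxes towards 0.5, which is a quick sanity check of the sign conventions.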


W_ij is the weight of the connection from the j-th to the i-th neuron, while WE_ik is the weight of the connection from the external input I_k to the i-th neuron. As usual, the input to each neuron is the sum of the output signals of the other neurons, each multiplied by the weight of the connection that carries it. The behavior of a CTRNN is therefore grounded in sums of products between weights and output signals. Here, following the architecture suggested in [7,5], we "pull out" the multiplication by using subnetworks that compute the product of an output signal and a weight. Once a connection is replaced by a network realizing multiplication, the weight of the connection can be supplied on auxiliary (or programming) input lines, in addition to the standard data input lines.

To clarify our approach, suppose we have an ideal CTRNN, mul, that returns as output the product of two inputs a, b ∈ (0, 1). Given a network S of N neurons and two neurons i and j linked by a connection with weight W_ij ∈ (0, 1), it is possible to build an "equivalent" Programmable Neural Network (PNN) according to the following steps (see figure 1):

1. redirect the output of neuron j to an input of a mul network,
2. set the second input of the mul network to W_ij,
3. redirect the output of the mul network to the input of neuron i.

Fig. 1. The substitution which achieves a programmable connection in the target network
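The three substitution steps can be sketched at the level of the recurrent input received by neuron i. This is only an illustration under the idealization stated in the text: `mul` here is an exact product rather than a trained CTRNN, and the names `programmable` and `codes` are hypothetical.

```python
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def mul(a, b):
    """Ideal multiplicative subnetwork: returns a * b for a, b in (0, 1)."""
    return a * b


def net_input_fixed(i, y, W):
    """Recurrent input to neuron i in the original network S."""
    return sum(W[i][j] * logistic(y[j]) for j in range(len(y)))


def net_input_programmable(i, y, W, programmable, codes):
    """Recurrent input to neuron i in the PNN.

    programmable : set of (i, j) pairs whose connection was replaced by mul
    codes        : dict mapping (i, j) to the value fed on the auxiliary line
    For a programmable pair the fixed weight W[i][j] is never read: the
    output of neuron j goes to mul (step 1), the code is the second input
    of mul (step 2), and the mul output feeds neuron i (step 3).
    """
    total = 0.0
    for j in range(len(y)):
        if (i, j) in programmable:
            total += mul(logistic(y[j]), codes[(i, j)])
        else:
            total += W[i][j] * logistic(y[j])
    return total
```

Feeding the original weight as the code reproduces the input of S, while feeding a different code makes the same fixed structure behave like a network with a different weight on that connection.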

Under the approximation that the neurons of the multiplicative networks have time constants much smaller than those of the other neurons of the network, the dynamics of the constructed PNN, restricted to the neurons i and j, is identical to that of the original network S. As shown in [5], this procedure can always be extended to a generic weight W_ij ∈ (min, max) by rescaling the parameter W_ij with the transformation

\[ p_{ij} = \frac{W_{ij} - \min}{\max - \min}, \qquad p_{ij} \in (0, 1) \tag{2} \]

setting the weight of the connection between the mul output and neuron i equal to max − min, and adding a connection between the neurons j and i with weight equal to min. The input that neuron i receives from neuron j is then (max − min) · p_ij · σ(y_j) + min · σ(y_j) = W_ij · σ(y_j), so the weight W_ij is encoded by p_ij. This process can be repeated for every connection of the network S. The resulting network is a PNN, able to emulate the behavior of any other CTRNN of N neurons when it receives that network's encoded weights as auxiliary inputs.

2.1 A Multiplicative CTRNN

The substitution discussed in the previous section requires a suitable multiplicative CTRNN, able to multiply its input values. To carry out this substitution in practice, we obtained an approximation of an ideal multiplicative network by training a 3-neuron CTRNN with the Backpropagation Through Time (BPTT) learning algorithm [12]. We built a training set of 100 input-output pairs. The inputs (x1, x2) are equally spaced in (0, 1) × (0, 1), and the output is the corresponding target value t = x1 · x2 ∈ (0, 1). Each pair (x1, x2) is presented to the network for a time T = 10 · τ_m, where τ_m is the common time constant of the three neurons of the network. After this period the output of the third neuron is recorded. For the BPTT training we initialized the weights of the network in the range [−10, 10]. The validation set consisted of 100 randomly chosen input-target pairs. On this validation set the network maps each input pair to its output with an error never larger than 5% of the target value. This approximation could be improved by using networks with more than three neurons; however, this error was small enough to perform successfully the experiments described in the next section.
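The data sets and the acceptance criterion described above can be sketched as follows. The 10 × 10 layout of the grid, the use of interior points k/11, the random seed and all function names are assumptions of this example; the BPTT training itself is not reproduced, and `predict` stands for any trained approximation of the product.

```python
import random


def grid_training_set(points_per_axis=10):
    """100 equally spaced inputs in (0, 1) x (0, 1) with targets t = x1 * x2.

    Assumption: interior points k / (n + 1), so the endpoints 0 and 1 of the
    open interval are never used.
    """
    axis = [(k + 1) / (points_per_axis + 1) for k in range(points_per_axis)]
    return [((x1, x2), x1 * x2) for x1 in axis for x2 in axis]


def random_validation_set(size=100, seed=0):
    """Randomly chosen input-target pairs, strictly inside (0, 1)."""
    rng = random.Random(seed)
    pairs = []
    while len(pairs) < size:
        x1, x2 = rng.random(), rng.random()
        if x1 > 0.0 and x2 > 0.0:
            pairs.append(((x1, x2), x1 * x2))
    return pairs


def max_relative_error(predict, dataset):
    """Largest |prediction - target| / target over the data set."""
    return max(abs(predict(x1, x2) - t) / t for (x1, x2), t in dataset)
```

A network would pass the criterion quoted in the text if `max_relative_error(predict, random_validation_set())` does not exceed 0.05.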

3 A Programmable Neural Network for Robot Control

We developed a CTRNN architecture, the Multiple Behavior Network (MBN), able to govern a simulated robot equipped with sonars and a camera in a maze environment according to two different behaviors. One-color or two-color tiles are located along the corridors. The robot behaves as a right-follower after the camera sees a one-color tile and as a left-follower after it sees a two-color tile. The network is composed of two modules organized hierarchically. The lower module, the Interpreter Module (IM), is implemented by a PNN and behaves as an interpreter: it realizes the same mapping from sonars to motor controls as a right-follower or a left-follower when it receives the corresponding code/program as auxiliary input. The upper module, the Program Module (PM), has to pass the code to the lower one: PM implements a functional relation between the two kinds of camera input and the two programs of the right-follower and the left-follower (see figure 2).


Fig. 2. The Multiple Behavior Network (MBN) that governs the robot

3.1 The Robotic Scenario

Robot simulations were carried out using the open source software Player-Stage to simulate a Pioneer 3DX robot. The robot is equipped with 10 sonars placed on its frontal and lateral parts (s1, ..., s10 in Figure 4) and a camera placed in the middle of the robot. The output of a blobfinder software, which reports the number of spots the camera sees, is given as input to the controlling CTRNN. The robot is governed by setting its angular and linear velocity. The environment consists of corridors of different lengths, three times as wide as the robot, which cross at angles of 90°.

The realization of the MBN controlling the robot involves three main steps. The first step is the building of two CTRNNs, RF and LF, capable of behaving as a right-follower and a left-follower, respectively. Knowledge of the structure of RF and LF is fundamental for the second step, building the interpreter network IM. Following the procedure proposed in Section 2, we develop IM on the basis of the structure of RF and LF: the connections whose values change in passing from RF to LF are the ones that become programmable in the interpreter network, i.e., the connections to which the procedure described in Section 2 is applied. In the last step, we realize and train the MBN, made up of the interpreter module IM and the program module PM. Using an evolutionary algorithm, during the training phase we compare the behaviors of robots driven by different MBN networks. Each MBN has the same lower level IM and differs in the upper level network PM. Consequently, the best MBN consists of the fixed interpreter network IM and the network PM that associates the two camera inputs with the programs/codes of the most suitable right-follower and left-follower behaviors.

The Right-Follower and the Left-Follower. As explained above, the first step in building the MBN is the construction of two CTRNNs, RF and LF, able to behave as a right-follower and a left-follower, respectively. We tried networks of different sizes; the ones finally selected are both CTRNNs composed of three neurons.


They receive three inputs, which are the weighted sums of three sonars facing right, three facing left and two frontal sonars, respectively:

\[ \begin{aligned} I_1 &= 0.2 \cdot S_2 + 0.4 \cdot S_3 + 0.4 \cdot S_4, \\ I_2 &= 0.2 \cdot S_9 + 0.4 \cdot S_8 + 0.4 \cdot S_7, \\ I_3 &= 0.5 \cdot S_5 + 0.5 \cdot S_6. \end{aligned} \tag{3} \]

The first, second and third neurons receive the first, second and third input, respectively, over a weighted connection. The output of the network consists of the outputs of the second and third neurons: the output of the third neuron determines the linear speed of the robot, while the output of the second neuron controls the angular velocity. The neurons of the two networks share the same characteristic time τ, which is an order of magnitude larger than the characteristic time τ_m of the multiplicative networks.

We evolve the weights of the networks RF and LF using Differential Evolution (DE) [4], a population-based evolutionary algorithm. We initialized a population of 18 elements controlled by networks with weights randomly chosen in the range [−100, 100]. Every controller obtained is evaluated with a fitness function F while performing the task of behaving as a right-follower or as a left-follower. A new population is obtained using the best element of the previous population. In our training we used a crossover coefficient (CR ∈ [0, 1]) of 0.2 and a step-size coefficient (ST ∈ [0, 1]) of 0.85. This means that the algorithm builds the next generation preserving the architecture of the best element of the previous generation (the value of the crossover coefficient is low) while also preserving the variance of the previous generation (the value of the step-size coefficient is high).

The task used to evaluate the robot is structured as follows. There are 7 kinds of crossroads in the maze (see Figure 4). Each robot is placed at the beginning of each crossroads and is free to move for about 20 seconds. The final evaluation of the "life" of a robot is the sum of the evaluations obtained in the seven distinct simulations. The fitness function that evaluates the robot at every crossroads is made of two components.
The first component, F_M, is derived from the one proposed in [8] and rewards straight, fast movement and obstacle avoidance. This component is the same in the left-follower and the right-follower task. The second component differs between the two trainings: in the right-follower training it rewards the robot that turns right at a crossroads (F_R), while in the left-follower training it rewards the robot that turns left (F_L). In equation (4), V is the average speed of the robot, V_A is the average angular speed, and S_min is the shortest distance from an obstacle measured during the task period.

\[ F_M = V \cdot (1 - V_A) \cdot S_{min}, \qquad V_A \in [0, 1],\; S_{min} \in [0, 1] \tag{4} \]

\[ F_R = \bar{S}_1 + \bar{S}_2 - \bar{S}_9 - \bar{S}_{10}, \qquad F_L = \bar{S}_9 + \bar{S}_{10} - \bar{S}_1 - \bar{S}_2 \tag{5} \]
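A minimal sketch of the fitness components in equations (4) and (5) follows. It assumes that speeds and sonar readings have already been normalized to [0, 1] and logged once per time step; the log layout and the function names are hypothetical. The text does not state how F_M is combined with F_R or F_L, so the components are left separate here.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)


def fitness_movement(speeds, angular_speeds, min_distances):
    """F_M = V * (1 - V_A) * S_min, equation (4).

    speeds         : linear speed at each time step
    angular_speeds : angular speed at each time step, in [0, 1]
    min_distances  : shortest obstacle distance seen at each step, in [0, 1]
    """
    v = mean(speeds)
    v_a = mean(angular_speeds)
    s_min = min(min_distances)
    return v * (1.0 - v_a) * s_min


def fitness_right(sonar_log):
    """F_R of equation (5); sonar_log[t][k] is the reading of sonar S_(k+1)."""
    avg = [mean(step[k] for step in sonar_log) for k in range(10)]
    return avg[0] + avg[1] - avg[8] - avg[9]


def fitness_left(sonar_log):
    """F_L of equation (5), which is the negative of F_R."""
    return -fitness_right(sonar_log)
```

Because F_L = −F_R term by term, a controller cannot score well on both components with the same trajectory, which is what makes the two trainings select opposite turning behaviors.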

In F_R the average measure of the left sonars over the task period is subtracted from the average measure of the right ones; the opposite happens in F_L. While the training to obtain the right-follower was done by evolving all the weights of the network, this was not the case for the left-follower training. The connections that have the same values for the left-follower and the right-follower are not substituted by a multiplicative network in the programmable net, so the larger the number of unchanged connections, the simpler the structure of the programmable network remains. To obtain a left-follower we chose some of the connections of the right-follower and changed their values by evolving them with the DE algorithm described above. We succeeded in obtaining a good left-follower by changing the values of three of the connections of the right-follower. The networks obtained were tested by placing the robot in 10 different positions in the maze and observing its behavior while it drove through three consecutive crossroads. The positions were chosen so that the robot starts its test in the middle of a corridor, oriented with its lateral part parallel to the wall. Under these conditions the right-follower and the left-follower were both successful in all the trials, showing the appropriate behavior in each of the corridors.

The Interpreter IM. The second step is the construction of the fixed-weight interpreter IM, capable of switching between the right-follower and left-follower tasks in response to suitable input codes and without changing its structure. The starting point is the set of matrices obtained in the first step for the right-follower and the left-follower:

\[ W_{RF} = \begin{pmatrix} 178.2 & -457.4 & -335.1 \\ -365.3 & -73.4 & 66.3 \\ 883 & 422.2 & -266.1 \end{pmatrix}, \qquad WE_{RF} = \begin{pmatrix} 0 & 0 & 0 \\ 0 & -19 & 0 \\ 0 & 0 & 769.5 \end{pmatrix} \]

\[ W_{LF} = \begin{pmatrix} 178.2 & -457.4 & -335.1 \\ -365.3 & -73.4 & \underline{-4} \\ 883 & 422.2 & -266.1 \end{pmatrix}, \qquad WE_{LF} = \begin{pmatrix} 0 & 0 & 0 \\ \underline{18} & \underline{0} & 0 \\ 0 & 0 & 769.5 \end{pmatrix} \]

where the values of the changed connections are underlined for the left-follower. The network that achieves the right-follower and the one that achieves the left-follower are both CTRNNs composed of three neurons. Accordingly, the interpreter IM consists of three neurons plus the "fast" neurons of the mul subnetworks, resulting in a network of 9 neurons. As explained above, the right-follower network and the left-follower network differ in the values of three connections, so we built a PNN by substituting those three connections with three multiplicative networks. The resulting network, shown in figure 3, has three standard inputs and three auxiliary inputs.

Fig. 3. The interpreter IM. I1, I2 and I3 are the standard inputs coming from sonars; Ip1, Ip2 and Ip3 are the auxiliary inputs

MBN: training and test. In this section we describe the training and test phases of the Multiple Behavior Network (MBN) system. As described above, the system is composed of two hierarchically organized modules (see figure 2): the lower one, IM, which is the fixed-weight CTRNN described in the previous section, and a higher level, PM, which sends suitable codes/programs to the interpreter. During the MBN training phase, PM learns to associate two different signals (one-color tile/two-color tile), forwarded by the visual system, with two different codes (two triples of numbers), which encode two distinct CTRNNs capable of exhibiting two different behaviors: the right-follower and the left-follower, respectively. Of course, the codes (weights) in the matrices of the previous section would achieve this purpose; in this section, however, we let PM learn them. Note that the codes are not necessarily unique, so different trainings may find different codes, possibly different from the ones obtained in the previous section.

The training of the overall MBN system was carried out with the DE algorithm, as follows. We prepared 18 MBN individuals differing in the weights of the higher level PM. These weights were randomly initialized in [−10, 10]. Each network was evaluated on the basis of the behavior of the controlled robot in the 7-crossroads maze used before, in two trials during which the robot had to travel for 20 seconds. During the first trial one-color tiles were placed in the environment and the robot was evaluated with the fitness function used to train the right-follower, F_R. During the second trial two-color tiles were placed in the maze and the robot was evaluated with the fitness function F_L. The product of these two values was taken as the robot's reward. It is important to stress that in this way the DE rewards the robot that chooses the best programs for the right-follower and the left-follower: neither of the two motor primitives is defined before training starts. We can say that the DE chooses at the same time the best complex behavior and the best primitives.
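The evolutionary loop described above can be sketched as follows. The text specifies a population of 18, initialization in [−10, 10], CR = 0.2, a step size of 0.85, and the use of the best element of the previous population; the exact DE variant is not given, so the best/1/bin scheme below is an assumption. The function `toy_reward` is a purely hypothetical stand-in for the two simulated maze trials, kept only so that the sketch runs; it multiplies two terms in the same way the robot reward multiplies the two trial scores.

```python
import random


def differential_evolution(fitness, dim, pop_size=18, bounds=(-10.0, 10.0),
                           cr=0.2, step=0.85, generations=50, seed=0):
    """Maximize `fitness` with a best/1/bin differential evolution loop."""
    rng = random.Random(seed)
    low, high = bounds
    pop = [[rng.uniform(low, high) for _ in range(dim)] for _ in range(pop_size)]
    scores = [fitness(ind) for ind in pop]
    for _ in range(generations):
        # Best element of the previous population drives every mutation.
        best = pop[max(range(pop_size), key=lambda i: scores[i])]
        for i in range(pop_size):
            r1, r2 = rng.sample([k for k in range(pop_size) if k != i], 2)
            forced = rng.randrange(dim)  # at least one gene is always mutated
            trial = []
            for d in range(dim):
                if rng.random() < cr or d == forced:
                    trial.append(best[d] + step * (pop[r1][d] - pop[r2][d]))
                else:
                    trial.append(pop[i][d])
            trial_score = fitness(trial)
            if trial_score >= scores[i]:  # greedy one-to-one replacement
                pop[i], scores[i] = trial, trial_score
    winner = max(range(pop_size), key=lambda i: scores[i])
    return pop[winner], scores[winner]


def toy_reward(genome):
    """Hypothetical reward: product of a 'right' and a 'left' term in (0, 1]."""
    right = 1.0 / (1.0 + sum((g - 1.0) ** 2 for g in genome[:3]))
    left = 1.0 / (1.0 + sum((g + 1.0) ** 2 for g in genome[3:]))
    return right * left
```

With a product reward, an individual scores well only if both terms are high, which mirrors the requirement that a single PM produce good programs for both behaviors.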
To test the obtained behaviors, as in section 3.1, the robot is placed in 10 different positions in the maze and its behavior is observed while it drives through three consecutive crossroads. Each test maze is prepared with different tiles (one-color or two-color), as shown in figure 4. During the test phase the trained MBN was always capable of switching to left-follower or right-follower behavior depending on the last type of tile visually recognized. The successful operation of the system in its simulated environment is documented at http://people.na.infn.it/~donnarumma/files/Prog_CTRNN.avi


Fig. 4. On the right the robot sensor position is shown. On the left a picture of the maze in which one and two-color tiles are placed.

4 Discussion

We have endowed a specific variety of dynamical neural network with programmability, a property originally associated with algorithmic computability. The result is the Programmable Neural Network (PNN), which consists of particular fixed-weight CTRNNs. We carried out a robotic application which suggests a way of building ANNs able to model more complex multiple behaviors. We focused in particular on the possibility of building networks that quickly change behavior, showing the importance of programming, which makes it possible to modify the behavior of a neural network by changing a code instead of modifying its structure. This capability is implemented by a two-layer architecture, MBN, in which the upper level learns to associate an environmental input with the code of an appropriate CTRNN, while the lower level interprets the code provided by the upper level and virtually realizes the coded network. The training of MBN presents two interesting aspects. Firstly, during the training phase, we do not restrict the system to learning sequences of predetermined motor primitives. On the contrary, the network architecture constrains the evolutionary algorithm to tune the appropriate motor primitives, setting the behavior of the lower level network IM so as to obtain the best performance of the robot in the maze. Moreover, with this kind of architecture the network learns to realize different behaviors because the higher level network PM learns to provide suitable codes to the interpreter IM. Both aspects could be improved by building an interpreter network in which all the connections are programmable. In this process some difficulties must be addressed. In order to perform the experiments of section 3, we constructed multiplicative networks with an error of 5%, and for them we chose a time constant one order of magnitude smaller than the time constant of the neurons of the original network.
This heuristic choice worked well for our tests in the simulated environment, because we experimentally met the approximation hypothesis of section 2 on the multiplicative networks. This might not be the case in a real environment, in which the noise introduced by actuators and sensors must also be compensated. More generally, as the interpreter network and the emulated CTRNN are both non-linear systems, the larger the number of programmable connections, the higher the chance that the two networks show different dynamics. Further studies are therefore needed to obtain a robust PNN capable of interpreting a vast number of codes/programs.

Acknowledgments

We wish to thank Prof. Giuseppe Trautteur for his many helpful comments and discussions about programmability in brain, biological systems and nature.

References

1. Anderson, M.L.: Neural re-use as a fundamental organizational principle of the brain - target article. Behavioral and Brain Sciences 33(04) (2010)
2. Beer, R.D.: On the dynamics of small continuous-time recurrent neural networks. Adaptive Behavior 3(4), 469–509 (1995)
3. Blynel, J., Floreano, D.: Exploring the T-Maze: Evolving Learning-Like Robot Behaviors using CTRNNs. In: 2nd European Workshop on Evolutionary Robotics (2003)
4. De Falco, I., Cioppa, A.D., Donnarumma, F., Maisto, D., Prevete, R., Tarantino, E.: CTRNN parameter learning using differential evolution. In: ECAI 2008, vol. 178, pp. 783–784 (July 2008)
5. Donnarumma, F.: A Model for Programmability and Virtuality in Dynamical Neural Networks. PhD thesis, Università di Napoli Federico II (2010)
6. Donnarumma, F., Prevete, R., Trautteur, G.: Virtuality in neural dynamical systems. In: International Conference on Morphological Computation, Venice, Italy (2007)
7. Donnarumma, F., Prevete, R., Trautteur, G.: How and over what timescales does neural reuse actually occur? Behavioral and Brain Sciences 33(04), 272–273 (2010)
8. Floreano, D., Mondada, F.: Automatic creation of an autonomous agent: Genetic evolution of a neural-network driven robot. In: Proceedings of the Conference on Simulation of Adaptive Behavior, pp. 421–430. MIT Press, Cambridge (1994)
9. Garzillo, C., Trautteur, G.: Computational virtuality in biological systems. Theoretical Computer Science 410(4-5), 323–331 (2009); Computational Paradigms from Nature
10. Hopfield, J.J., Tank, D.W.: Computing with neural circuits: A model. Science 233, 625–633 (1986)
11. Paine, R.W., Tani, J.: Evolved motor primitives and sequences in a hierarchical recurrent neural network. In: Deb, K., et al. (eds.) GECCO 2004. LNCS, vol. 3102, pp. 603–614. Springer, Heidelberg (2004)
12. Pearlmutter, B.A.: Gradient calculations for dynamic recurrent neural networks: A survey. IEEE Transactions on Neural Networks 6(5), 1212–1228 (1995)
13. Riesenhuber, M., Poggio, T.: Models of object recognition. Nature Neuroscience 3, 1199–1204 (2000)
14. Trautteur, G., Tamburrini, G.: A note on discreteness and virtuality in analog computing. Theoretical Computer Science 371, 106–114 (2007)
15. Yamauchi, B.M., Beer, R.D.: Sequential behavior and learning in evolved dynamical neural networks. Adaptive Behavior 2(3), 219–246 (1994)

Self-Organising Maps in Document Classification: A Comparison with Six Machine Learning Methods

Jyri Saarikoski¹, Jorma Laurikkala¹, Kalervo Järvelin², and Martti Juhola¹

¹ Department of Computer Sciences, ² Department of Information Studies and Interactive Media, 33014 University of Tampere, Finland
{Jyri.Saarikoski,Jorma.Laurikkala,Kalervo.Jarvelin,Martti.Juhola}@uta.fi

Abstract. This paper focuses on the use of self-organising maps, also known as Kohonen maps, for the classification of text documents. The aim is to classify documents effectively and automatically into separate classes based on their topics. Classification with the self-organising map was tested on three data sets, and the results were compared to those of six well-known baseline methods: k-means clustering, Ward's clustering, k nearest neighbour searching, discriminant analysis, Naïve Bayes classifier and classification tree. The self-organising map yielded the highest accuracies among the tested unsupervised methods in the classification of the Reuters news collection and the Spanish CLEF 2003 news collection, and accuracies comparable to some of the supervised methods on all three data sets.

Keywords: machine learning, neural networks, self-organising map, document classification.

A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part I, LNCS 6593, pp. 260–269, 2011. © Springer-Verlag Berlin Heidelberg 2011

1 Introduction

Finding relevant information is of the highest importance, particularly in electronic documents. We need to seek information in our everyday lives, both at home and at work, while the amount of information available is becoming enormous. The Internet is full of digital documents covering almost every topic one can imagine. How can one find relevant information in this massive collection of documents? One cannot really do it manually, so automatic methods are needed. These methods can help in the search for useful information by clustering, classifying and labelling the documents. When the documents are ordered and preclassified, one can browse through them effectively. This is why we need document classification methods.

The machine learning solutions [21] for the text document classification task mostly use the supervised learning procedure. These include, for example, classic methods such as k nearest neighbour searching and the Naïve Bayes classifier [8]. However, we are interested in the unsupervised self-organising map method [13], also known as the Kohonen map or SOM. It is an artificial neural network originally designed for clustering high-dimensional data samples onto a low-dimensional map. It is widely used in the clustering and classification of text documents. WEBSOM [12, 15] is a self-organising map based method for effective text mining and clustering of massive document collections. However, it is not really designed for the classification of documents. ChandraShekar and Shoba [2] classified 7000 documents with self-organising maps, while Chowdhury and Saha [4], Eyassu and Gambäck [9] and Guerrero-Bote et al. [11] used smaller collections of a few hundred documents. Recently, Chumwatana et al. [5] used maps to cluster a collection of 50 Thai news documents, and Chen et al. [3] compared self-organising maps with the k-means clustering method on a collection of 420 documents. The method has also been used in information retrieval; see for instance [10] and [14].

We have used self-organising maps earlier in information retrieval [18] and document classification [19] tasks on a German news document collection. In classification, the self-organising map showed some potential by beating k nearest neighbour searching and k-means clustering in the five-class (278 documents) and ten-class (425 documents) cases, with accuracies as high as 88-89%. Encouraged by that performance, we decided to test its classification ability in this research with multiple reasonably large data sets and against well-known baseline methods, something that is seldom seen in research papers of this field. We were also interested in testing the self-organising map classification method with documents in different languages. Therefore, one of our present data sets is in Spanish.

The research showed that the self-organising map is an effective method for document classification even when there are thousands of documents in the data set. The self-organising map yielded over 90% micro-averaged accuracy on Data Sets 1 (Reuters, Mod Apte Split collection) and 3 (Spanish CLEF 2003 collection), performed very well against the unsupervised methods, and was comparable to some of the supervised methods.

2 The Data Sets

2.1 Data Set 1: Reuters-21578, Mod Apte Split

The first data set is a subset of the well-known Reuters-21578 collection [17, 21]. The complete collection includes 21578 English Reuters news documents from the year 1987. We chose the widely used Mod Apte split [1, 21] subset, which contains 10789 documents and 90 classes. Some of these documents have multiple class labels. To keep things simple, we discarded those and kept only the documents with one label. Then we selected the 10 largest classes and obtained our final collection of 8008 documents, consisting of 5754 training samples and 2254 test samples. The class labels are words, for example 'earn', 'coffee' and 'ship'.

2.2 Data Set 2: 20 Newsgroups, Matlab/Octave

The second data set is also a widely used collection of 20 Internet newsgroups [21, 23]. We selected its Matlab/Octave version, which provides 18774 English documents, 11269 in the training set and 7505 in the test set. The class labels are names of newsgroups, for instance 'rec.sport.hockey', 'soc.religion.christian' and 'sci.space', and each document has only one class label.


J. Saarikoski et al.

2.3 Data Set 3: Spanish CLEF 2003

The third data set is the Spanish collection of CLEF 2003 news documents [6]. The collection contains news articles from the years 1994 and 1995. There are 454045 documents in the complete collection. Here the test topics form the classes, and the documents relevant to each topic are the class members. From the 60 available classes we selected the 20 largest. There were, in all, 1901 documents in these top-20 classes. Finally, we constructed 10 test sets using a 10-fold cross-validation procedure. In this data set each document has only one class label. The labels are news topics, such as 'Epidemia de ébola en Zaire', 'Los Juegos Olímpicos y la paz' and 'El Shoemaker-Levy y Júpiter'.

3 Preprocessing

Conventional preprocessing was applied to all three data sets. First, the SNOWBALL stemmer was used to transform words to their stems; for instance, the word 'continued' became 'continu'. Then stopwords, uninformative “little” words such as 'a', 'about' and 'are', were removed. At this point, words shorter than three letters, numbers and special characters were also discarded, because short words generally have little information value for the classification of documents. Next, we calculated the frequencies of the remaining words (word stems). Then we computed document vectors for all documents by applying the common vector space model [20] with tf·idf weighting for all remaining word stems. Thus, a document was represented in the following form

$$D_i = (w_{i1}, w_{i2}, w_{i3}, \ldots, w_{it}) \tag{1}$$

where wik is the weight of word k in document Di, 1 ≤ i ≤ n, 1 ≤ k ≤ t, where n is the number of documents and t is the number of word stems in all documents. Weights are given in tf·idf form as the product of term frequency (tf) and inverse document frequency (idf). The former for word k in document Di is computed with

$$tf_{ik} = \frac{freq_{ik}}{\max_l \{ freq_{il} \}} \tag{2}$$

where freqik is the number of occurrences of word k in document Di and l ranges over all words of Di, l = 1, 2, ..., t. The latter is computed for word k in the document set with

$$idf_k = \log \frac{N}{n_k} \tag{3}$$

where N is the number of documents in the set and nk is the number of documents that contain word k at least once. Combining equations (2) and (3), we obtain a weight for word k in document Di

Self-Organising Maps in Document Classification

$$w_{ik} = tf_{ik} \cdot idf_k \tag{4}$$
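The weighting in equations (2)-(4) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `tfidf_vectors` is hypothetical, documents are assumed to be lists of already stemmed and filtered tokens, and the natural logarithm is assumed because the text does not state the base of the logarithm.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, tf-idf vectors) for a list of token lists.

    Hypothetical helper: tokens are assumed to be stemmed and
    stop-word filtered beforehand.
    """
    vocab = sorted({w for d in docs for w in d})
    n_docs = len(docs)
    # n_k: number of documents that contain word k at least once
    doc_freq = Counter(w for d in docs for w in set(d))
    # Equation (3); natural logarithm is an assumption
    idf = {w: math.log(n_docs / doc_freq[w]) for w in vocab}
    vectors = []
    for d in docs:
        freq = Counter(d)
        max_freq = max(freq.values()) if freq else 1
        # Equation (2) times equation (3) gives equation (4)
        vectors.append([freq[w] / max_freq * idf[w] for w in vocab])
    return vocab, vectors
```

Note that a word occurring in every document receives idf = log(1) = 0 and therefore carries no weight in any vector.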

After this procedure every document has its own document vector. Finally, each document vector was shortened to the 1000 middle-frequency word stems, taken around the median of the total word frequency distribution sorted in ascending order. The most and least frequent words are very often pruned in information retrieval applications, because their capacity to distinguish relevant from non-relevant documents (for a topic) is known to be poor. Only 1000 stems were chosen to ease the computational burden, and because this had proven a quite effective choice in a previous study [19]. The same vectors were then used for all of the methods, except for the Naïve Bayes method, which needed frequency-weighted vectors.

It should also be noted that document vectors were computed from the training sets only. Information about the corresponding test set was not used, in order to create as realistic a classification situation as possible, where the system knows an existing training set and its words in advance, but not those of the test set. Thus, each training set had its own word set, somewhat different from those of the other training sets, and the document vectors of its corresponding test set were prepared according to the words of the training set.
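The pruning step can be illustrated with a short Python sketch. It rests on an assumption: "around the median" is interpreted here as a window of stems centred on the middle of the frequency-sorted vocabulary, which the text does not spell out. The function name is hypothetical, and only training documents are passed in, as the text requires.

```python
from collections import Counter


def select_middle_frequency_stems(train_docs, n_keep=1000):
    """Pick n_keep stems from the middle of the frequency-sorted vocabulary.

    Assumption: the window is centred on the median position of the
    ascending frequency ranking of the training set.
    """
    total = Counter(w for d in train_docs for w in d)
    # Ascending total frequency; ties broken alphabetically for determinism
    ranked = sorted(total, key=lambda w: (total[w], w))
    if len(ranked) <= n_keep:
        return ranked
    start = (len(ranked) - n_keep) // 2
    return ranked[start:start + n_keep]
```

Test-set vectors would then be built over the returned stems only, so that no test vocabulary leaks into the feature selection.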

4 Document Classification with Self-Organising Map

In order to use a self-organising map in the document classification task we needed to label the map nodes with class labels of the training data set in some meaningful way. The labelled nodes then represent the document classes and the map is able to classify new documents (test set samples) by mapping them. The following simple procedure was implemented to label the self-organising map with class labels:

• Create a self-organising map using a training data set.
• Map each training set sample to the map.
• Determine a class for each node of the map according to the numbers of training documents of different classes mapped on that node. The most frequent document class determines the class of the node. If more than one class shares the same maximum, label the node according to the class of the document (from the maximum classes) closest to the model vector of the node.

After this procedure the map is labelled with class labels. Fig. 1 shows an example of a labelled map. The data on the map is the training set of Data Set 1 and the labels are: #1 earn, #2 acq, #3 crude, #4 trade, #5 money-fx, #6 interest, #7 money-supply, #8 ship, #9 sugar and #10 coffee. Most of the classes seem to form one or two clusters on the map. After giving the labels to the map nodes, the classification of the test set was done by mapping each test sample and comparing the classification result given by the map with the known class label of the sample.
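The labelling and classification steps described above can be sketched as follows. This is a hedged illustration, not the SOM_PAK or Java code used by the authors: the map is assumed to be already trained and given as a list of model vectors (`codebook`), squared Euclidean distance is assumed for the mapping, and all function names are hypothetical. The text does not say how nodes that received no training documents are handled; the sketch simply ignores them and uses the nearest labelled node.

```python
from collections import Counter, defaultdict


def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def best_matching_unit(x, codebook):
    """Index of the node whose model vector is closest to x."""
    return min(range(len(codebook)), key=lambda j: sq_dist(x, codebook[j]))


def label_nodes(codebook, train_vectors, train_labels):
    """Label each hit node with the majority class of its training documents."""
    hits = defaultdict(list)  # node index -> [(distance, label), ...]
    for x, y in zip(train_vectors, train_labels):
        j = best_matching_unit(x, codebook)
        hits[j].append((sq_dist(x, codebook[j]), y))
    node_label = {}
    for j, docs in hits.items():
        counts = Counter(y for _, y in docs)
        top = max(counts.values())
        tied = {y for y, c in counts.items() if c == top}
        # Tie-break: class of the tied-class document closest to the model vector
        node_label[j] = min((d, y) for d, y in docs if y in tied)[1]
    return node_label


def classify(x, codebook, node_label):
    """Class of the nearest labelled node (assumption: unlabelled nodes are skipped)."""
    j = min(sorted(node_label), key=lambda k: sq_dist(x, codebook[k]))
    return node_label[j]
```

Classification accuracy then follows by comparing `classify(x, ...)` with the known label of each test sample.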


Fig. 1. Labelled self-organising map. The numbers on the map surface are class labels. The darker the colour on the map, the closer the neighbouring nodes are to each other.

The self-organising map was implemented with the SOM_PAK [22] program, written in C at Helsinki University of Technology, Finland. We programmed supporting software tools in Java.

5 The Baseline Methods

To evaluate the classification performance of self-organising maps, six classic baseline methods were used for comparison. The idea was to include some unsupervised methods as well as some supervised ones. Being unsupervised, a self-organising map is at a disadvantage against the supervised methods, as it does not use the class information of the training samples at all when the map is built. On the other hand, unsupervised methods can be used even without class labels, and in real life labels are rarely available. The selected unsupervised methods were k-means clustering and Ward’s clustering. For these methods, a labelling procedure similar to the one described earlier for the self-organising map had to be implemented for the classification task. The chosen supervised methods were k nearest neighbour searching, discriminant analysis, naïve Bayes classifier and classification tree. All these baseline methods were implemented with Matlab software. More information about them can be found in numerous sources, for example in [8].


6 Results

In the classification of Data Sets 1 and 2 we used a single test set, while with Data Set 3 we used the 10-fold cross-validation procedure. Because of the randomness in the initialization phase of the self-organising map method, we built 10 maps for each test set and calculated the average outcome. The results of the baseline methods were calculated with the same test sets, but they were run once for each set, because there was no randomness in these methods. The same preprocessed document vector data was used for all methods, except for the Bayes method, which needed frequency-weighted data. No information about the test set vocabulary was used in the word selection of the vectorization.

Two measures of classification performance were used: micro- and macro-averaged accuracy [21]. Micro-averaged accuracy for a given test set j is

$$a_j^{micro} = \frac{c_j}{n_j} \cdot 100\% \tag{5}$$

where cj is equal to the number of the correctly classified documents in test set j and nj is the number of all documents in that test set. Macro-averaged accuracy for test set j is computed with

$$a_j^{macro} = \frac{\sum_{k=1}^{nc_j} d_{jk}}{nc_j} \tag{6}$$

where ncj is the number of classes in test set j and djk (k=1,...,ncj) is of the form

$$d_{jk} = \frac{c_{jk}}{n_{jk}} \cdot 100\%, \tag{7}$$

where cjk is the number of correctly classified documents in class k of test set j and njk is the number of documents in class k of test set j. The micro-averaged accuracy tells how well all the documents were classified, but it does not take the class differences into account and is therefore strongly influenced by the largest classes when class sizes are imbalanced. The macro-averaged measure gives equal importance to all classes and lessens the influence of large classes.

The preprocessed vectors of 1000 features were used, and the free parameters of the methods were tested for optimal results in each data set. The free parameters were the map size for the self-organising map, the number of nearest neighbours searched for the k nearest neighbour method, the number of clusters for k-means clustering and the maximum number of clusters for Ward’s clustering. Based on the obtained accuracies, the best result for every method was selected for comparison.

Table 1 shows the results for Data Set 1, the Reuters news collection (Mod Apte split). Overall the results are good, with most of the methods reaching over 90% accuracy (micro-averaged). Naïve Bayes, discriminant analysis and the self-organising map yielded the top results. The self-organising map was the best unsupervised method.
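Equations (5)-(7) translate directly into code. The following minimal Python sketch (the function name is hypothetical) computes both measures for one test set from lists of true and predicted labels.

```python
from collections import Counter


def micro_macro_accuracy(true_labels, predicted_labels):
    """Return (micro, macro) accuracy in percent for one test set."""
    n = len(true_labels)
    per_class_total = Counter(true_labels)  # n_jk
    per_class_correct = Counter(            # c_jk
        t for t, p in zip(true_labels, predicted_labels) if t == p
    )
    # Equation (5): share of all documents classified correctly
    micro = 100.0 * sum(per_class_correct.values()) / n
    # Equations (6) and (7): mean of the per-class accuracies
    macro = sum(
        100.0 * per_class_correct[k] / per_class_total[k] for k in per_class_total
    ) / len(per_class_total)
    return micro, macro
```

With eight documents of a large class all classified correctly and one of two documents of a small class misclassified, the micro-averaged accuracy is 90% while the macro-averaged accuracy is only 75%, which illustrates how the micro measure is dominated by large classes.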


Table 2 shows the results of Data Set 2, which was more difficult to classify than the other two data sets. Only the top two methods (Bayes and discriminant analysis) gave over 50% of correct answers (micro-averaged). Self-organising map was the best of the unsupervised methods.

Table 1. Micro- and macro-averaged classification accuracies (%) and the significant differences (Friedman test) of the methods for the Data Set 1. Significant statistical differences are here notated with ’>’ and ’<’ characters. For example, A > {B, C} means that A is significantly better than B and A is significantly better than C

| Method | Accuracy (%), micro | Accuracy (%), macro | Significant differences (macro) |
|---|---|---|---|
| Self-organising map (som) | 92.3 | 83.5 | > war, < nba |
| k-means clustering (kme) | 90.7 | 79.2 | > war, < {nba, dis} |
| Ward's clustering (war) | 81.1 | 55.8 | < {knn, dis, clt, nba, kme, som} |
| k nearest neighbour search (knn) | 83.0 | 76.7 | > war, < {dis, nba} |
| Discriminant analysis (dis) | 95.0 | 87.0 | > {knn, kme, war} |
| Naive Bayes (nba) | 95.2 | 90.4 | > {som, kme, war, knn, clt} |
| Classification tree (clt) | 91.1 | 81.7 | > war, < nba |

Table 2. Micro- and macro-averaged classification accuracies (%) and the significant differences (Friedman test) of the methods for the Data Set 2. Significant statistical differences are here notated with ’>’ and ’<’ characters.

| Method | Accuracy (%), micro | Accuracy (%), macro | Significant differences (macro) |
|---|---|---|---|
| Self-organising map (som) | 42.3 | 41.4 | > kme, < {dis, nba} |
| k-means clustering (kme) | 30.6 | 29.9 | < {som, war, dis, clt, nba} |
| Ward's clustering (war) | 41.9 | 40.8 | > {kme, knn}, < {clt, nba} |
| k nearest neighbour search (knn) | 38.9 | 38.6 | < {war, dis, nba, clt} |
| Discriminant analysis (dis) | 60.1 | 59.4 | > {som, kme, war, knn, clt} |
| Naive Bayes (nba) | 62.0 | 61.1 | > {som, kme, war, knn, clt} |
| Classification tree (clt) | 46.2 | 45.5 | > {kme, knn}, < {dis, nba} |

Table 3. Micro- and macro-averaged classification accuracies (%) and the significant differences (Friedman test) of the methods for the Data Set 3. Significant statistical differences are here notated with ’>’ and ’<’ characters.

| Method | Accuracy (%), micro | Accuracy (%), macro | Significant differences (micro) |
|---|---|---|---|
| Self-organising map (som) | 95.6 | 91.7 | > kme, < {war, knn, dis, nba} |
| k-means clustering (kme) | 90.8 | 86.7 | < {war, knn, dis, clt, nba, som} |
| Ward's clustering (war) | 97.2 | 96.0 | > {dis, clt, kme, som}, < nba |
| k nearest neighbour search (knn) | 97.0 | [missing] | > {som, kme, dis, clt}, < nba |
| Discriminant analysis (dis) | 96.3 | [missing] | > {som, kme, clt}, < {war, knn, nba} |
| Naive Bayes (nba) | 98.1 | 97.5 | > {som, kme, war, knn, dis, clt} |
| Classification tree (clt) | 94.5 | 92.3 | > {kme}, < {war, knn, dis, nba} |


The results of Data Set 3, the Spanish CLEF news collection, are in Table 3. It turned out to be the easiest case of the three data sets. All methods gave over 90% accuracy (micro-averaged), and Naïve Bayes outperformed the others with a very good result of 98.1%. The self-organising map performed well with 95.6% and was the second best of the unsupervised methods. In this set the macro-averaged accuracies were also reasonably high.

Naïve Bayes proved to be the most effective in all cases, and discriminant analysis performed almost at the same level. Self-organising maps were at least average compared to the others in all cases, and among the unsupervised methods they were the most consistent. Another interesting outcome was that in k nearest neighbour classification the best result was always, with all three data sets, obtained with k=1, although we tried k values up to 20.

We conducted the Friedman test [7] to compare the results. For Data Set 3 we used the micro-averaged accuracies of the 10 test sets. For Data Sets 1 and 2 we had to use the macro-averaged accuracies to get enough data, because there was only one test set in these data sets. All the significant differences (p < 0.05) between methods are shown in Tables 1-3. For example, the self-organising map was significantly better than k-means clustering in Data Sets 2 and 3, and also significantly better than Ward’s clustering in Data Set 1. On the other hand, Naïve Bayes was significantly better than the self-organising map in all three data sets.
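The omnibus Friedman statistic behind such a comparison can be sketched in pure Python. This is a minimal sketch under stated assumptions: tied values receive average ranks, no tie correction is applied, and the pairwise post-hoc procedure that yields the '>' and '<' entries in Tables 1-3 is not described in the text and is not reproduced here. The function name is hypothetical; each row of `blocks` would hold the accuracies of all methods on one test set (or one class).

```python
def friedman_statistic(blocks):
    """Friedman chi-square statistic and per-method rank sums.

    blocks: list of rows, one row per block, one value per method.
    Assumptions: average ranks for ties, no tie correction.
    """
    n, k = len(blocks), len(blocks[0])
    rank_sums = [0.0] * k
    for row in blocks:
        order = sorted(range(k), key=lambda j: row[j])
        i = 0
        while i < k:
            # Find the run of tied values starting at sorted position i
            m = i
            while m + 1 < k and row[order[m + 1]] == row[order[i]]:
                m += 1
            avg_rank = (i + m) / 2 + 1  # mean of ranks i+1 .. m+1
            for pos in range(i, m + 1):
                rank_sums[order[pos]] += avg_rank
            i = m + 1
    stat = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    return stat, rank_sums
```

Under the null hypothesis the statistic is approximately chi-square distributed with k-1 degrees of freedom, so a large value indicates that at least one method ranks consistently differently from the others.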

7 Conclusions and Discussion

We tested the self-organising map in the text document classification task with three different kinds of collections and compared the results to those of standard text classification methods. Naïve Bayes turned out to be the most effective of all, but the self-organising map performed well in its own category of unsupervised methods. Overall, the results of self-organising maps were encouraging, with over 90% classification accuracy (micro-averaged) in Data Sets 1 and 3. This suggests that it is an effective method for document classification tasks. Furthermore, self-organising maps performed comparably against some of the supervised classification methods tested.

The intuitive visual map (see Fig. 1) and the unsupervised learning phase are also benefits of using the self-organising map in document classification, because the map enables data visualization and, in some applications, browsing. Additionally, labelled data is rarely available.

Even the costly learning procedure has its benefits. If we compare the self-organising map with k nearest neighbour, it is easy to see that the learning phase of the map takes much more time than the learning of k nearest neighbour. In actual classification it is quite the opposite. The map usually has an order of magnitude fewer nodes than there are documents in the training set, and this leads to about 10 times faster classification compared to k nearest neighbour with the same data. One does not have to construct a new map every time a new document is added to the collection; one can just map it and do the learning later, when more new data is available. The slow learning is done rarely and the fast classification often.

The self-organising map method could give even better results if some more advanced features were implemented. For example, it is possible to calculate the classification based on multiple class hits on the map, for instance using the classes of the three nearest nodes. Another approach is the use of multiple maps. In the future, we will consider these options and focus on the dimensionality reduction and feature selection problem associated with document vectors.

Acknowledgements

Jyri Saarikoski was supported by the Tampere Graduate Programme in Information Science and Engineering (TISE). The SNOWBALL stemmer is by Martin Porter.

References

1. Apte, C., Damerau, F.J., Weiss, S.M.: Automated learning of decision rules for text categorization. ACM Transactions on Information Systems 12, 233–251 (1994)
2. ChandraShekar, B.H., Shobha, G.: Classification of Documents Using Kohonen’s Self-Organizing Map. International Journal of Computer Theory and Engineering 5(1), 610–613 (2009)
3. Chen, Y., Qin, B., Liu, T., Liu, Y., Li, S.: The Comparison of SOM and K-means for Text Clustering. Computer and Information Science 2(3), 268–274 (2010)
4. Chowdhury, N., Saha, D.: Unsupervised text classification using kohonen’s self organizing network. In: Gelbukh, A. (ed.) CICLing 2005. LNCS, vol. 3406, pp. 715–718. Springer, Heidelberg (2005)
5. Chumwatana, T., Wong, K., Xie, H.: A SOM-Based Document Clustering Using Frequent Max Substring for Non-Segmented Texts. Journal of Intelligent Learning Systems & Applications 2, 117–125 (2010)
6. CLEF: The Cross-Language Evaluation Forum, http://www.clef-campaign.org/
7. Conover, W.J.: Practical Nonparametric Statistics. John Wiley & Sons, New York (1999)
8. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification, 2nd edn. John Wiley & Sons, New York (2001)
9. Eyassu, S., Gambäck, B.: Classifying Amharic News Text Using Self-Organizing Maps. Proceeding of the ACL Workshop on Computational Approaches to Semitic Languages, Ann Arbor, Michigan, USA, pp. 71–78 (2005)
10. Fernández, J., Mones, R., Díaz, I., Ranilla, J., Combarro, E.F.: Experiments with Self Organizing Maps in CLEF 2003. In: Peters, C., Gonzalo, J., Braschler, M., Kluck, M. (eds.) CLEF 2003. LNCS, vol. 3237, pp. 358–366. Springer, Heidelberg (2004)
11. Guerro-Bote, V.P., Moya-Anegón, F., Herrero-Solana, V.: Document organization using Kohonen’s algorithm. Information Processing and Management 38, 79–89 (2002)
12. Honkela, T.: Self-Organizing Maps in Natural Language Processing, Academic Dissertation. Helsinki University of Technology, Finland (1997)
13. Kohonen, T.: Self-Organizing Maps. Springer, Berlin (1995)
14. Lagus, K.: Text retrieval using self-organized document maps. Neural Processing Letters 15, 21–29 (2002)
15. Lagus, K., Kaski, S., Kohonen, T.: Mining massive document collections by the WEBSOM method. Information Sciences 163(1-3), 135–156 (2004)
16. Moya-Anegón, F., Herrero-Solana, V., Jiménez-Contreras, E.: A connectionist and multivariate approach to science maps: the SOM, clustering and MDS applied to library and information science research. Journal of Information Science 32(1), 63–77 (2006)
17. Reuters-21578 collection, http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html
18. Saarikoski, J., Laurikkala, J., Järvelin, K., Juhola, M.: A study of the use of self-organising maps in information retrieval. Journal of Documentation 65(2), 304–322 (2009)
19. Saarikoski, J., Järvelin, K., Laurikkala, J., Juhola, M.: On Document Classification with Self-Organising Maps. In: Kolehmainen, M., Toivanen, P., Beliczynski, B. (eds.) ICANNGA 2009. LNCS, vol. 5495, pp. 140–149. Springer, Heidelberg (2009)
20. Salton, G.: Automatic Text Processing: The Transformation, Analysis and Retrieval of Information by Computer. Addison-Wesley, Reading (1989)
21. Sebastiani, F.: Machine learning in automated text categorization. ACM Computing Surveys 34(1), 1–47 (2002)
22. SOM_PAK, http://www.cis.hut.fi/research/som-research/nnrc-programs.shtml
23. 20 newsgroups collection, http://people.csail.mit.edu/jrennie/20Newsgroups/

Analysis and Short-Term Forecasting of Highway Traffic Flow in Slovenia

Primož Potočnik and Edvard Govekar

Faculty of Mechanical Engineering, University of Ljubljana, Slovenia
{primoz.potocnik,edvard.govekar}@fs.uni-lj.si

Abstract. Analysis and short-term forecasting of traffic flow data for several locations of the Slovenia highway network are presented. Daily and weekly seasonal components of the data are analysed and several features are extracted to support the forecasting. Various short-term forecasting models are developed for one-hour-ahead forecasting of the traffic flow. Models include benchmark models (random walk, seasonal random walk, naive model), AR and ARMA models, and various configurations of feedforward neural networks. Results show that the best forecasting results (correlation coefficient R > 0.99) are obtained by a feedforward neural network with a selected set of inputs, but this sophisticated model, surprisingly, only slightly surpasses the accuracy of a simple naive model.

Keywords: traffic flow, analysis, forecasting, neural networks.

A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part I, LNCS 6593, pp. 270–279, 2011. © Springer-Verlag Berlin Heidelberg 2011

1 Introduction

As modern transportation networks are saturating toward their maximum transfer capacities, proper understanding, modelling, simulation and forecasting of network behaviour is becoming of great importance. Evolving Intelligent Transportation Systems (ITS) connect transport infrastructure and vehicles with information support and communications technology in order to understand, simulate, forecast and control the dynamics of modern traffic flow. Traffic congestion has been increasing in recent years but, in contrast to the past, it is no longer possible to react by constructing new highways [1]. Therefore, various simulation, forecasting, optimization, and control strategies have been proposed for efficient traffic network management.

An important issue for the application of ITS is short-term traffic flow forecasting, with the aim of determining the traffic volume in the next time interval [2]. A variety of forecasting approaches have been proposed in recent years, including various ARIMA, parametric and nonparametric modelling approaches [3,4], conditional average estimator [5], several configurations of neural networks [6-9], hybrid fuzzy approaches [10,11], multivariate state-space approach [12], support vector regression (SVR) [13,14], Bayesian networks [15], and even a forecasting approach based on chaos time series theory [16]. Short-term traffic flow forecasting can be further divided into urban and highway areas, which differ considerably in their dynamic properties.

In this paper we analyze the traffic flow data of the Slovenia highway network. Several locations are selected to study the one-step-ahead forecasting problem at one-hour resolution. The paper is organized as follows: traffic flow data of the Slovenia highway network are introduced in the next section; data analysis is then performed in section 3 with the aim of understanding the data and extracting informative features; section 4 introduces the forecasting framework for the application and evaluation of various forecasting approaches. In sections 5 and 6, benchmark models, ARIMA models and neural network models are discussed, and the forecasting results are presented in section 7. Finally, some conclusions are summarized in section 8.

2 Traffic Flow Data

Traffic flow data for the Slovenia highway network are provided by the Slovenian Roads Agency, which is a body within the Ministry of Transport. The data are collected by automatic and manual counting methods, with the former being suitable for the application of forecasting methods as presented in this paper. Figure 1 shows the map of traffic flow measurement stations around the city of Ljubljana. For the analysis in this paper, the following locations were selected: 645 (Šmarje Sap AC), 854 (Drenov Grič AC) and 174 (Ljubljana S Obvoznica). Locations 645 and 854 represent major transportation links between the capital and the other regions of Slovenia, and location 174 represents the ring road around Ljubljana.

Fig. 1. Locations of traffic flow measurement stations on the Slovenia highway network. Locations 645 (Šmarje Sap AC), 854 (Drenov Grič AC) and 174 (Ljubljana S Obvoznica) are selected for the forecasting analysis in this paper.


Traffic flow data Q are represented in hourly resolution with each data point representing the total number of vehicles passing in the last hour through the selected location. Since both passing directions for all locations are monitored separately, only a single direction for each location is selected for our treatment. Data for the last three years (2007-2009) are available. Figure 2 shows traffic flow data for location 645 (Šmarje Sap AC). The upper plot shows complete data in daily resolution Q [cars/day] with a trend line for three subsequent years (2007-2009). The lower plot displays a 4-week interval of both hourly [cars/hour] and daily data [cars/day].

Fig. 2. Traffic flow data Q for location 645 (Šmarje Sap AC). The upper plot shows complete data in daily resolution Q [cars/day] with a trend line for three subsequent years (2007-2009). The lower plot displays a 4-week interval of both hourly [cars/hour] and daily data [cars/day].

3 Traffic Flow Data Analysis

Traffic flow is known to follow the population activity pattern that is influenced by many agents but appears to be synchronized to a high degree [5]. The synchronization is stimulated externally by the environment, as well as internally by social agreements about working days and holidays. In this section, preliminary data analysis is discussed and several informative features are extracted to support the development of forecasting models.


In general, traffic flow data for the Slovenia highway network exhibit a slowly increasing trend and various seasonal cycles, such as yearly, weekly, and daily cycles. As can be noted from Figure 2, the yearly cycle and trend are not very pronounced and do not seem to be important for the short-term forecasting task. Weekly seasonality (sw) is more evident. It can be calculated by normalizing the data for each week to the weekly average; the normalized weekly profiles can then be plotted as a boxplot in daily resolution, as shown in Figure 3. A very stable weekly cycle sw(k; k=1,2,...,7) emerges, with outliers mainly caused by holiday patterns.

Fig. 3. Weekly seasonality (sw) of the traffic flow data: stable normalized weekly pattern is shown with outliers mostly caused by holidays

Next, the daily seasonality (sd) needs to be addressed. We apply a similar normalization procedure and normalize the traffic flow data to the daily average. This calculation is performed for the various days of the week and results in four characteristic patterns, for workdays (Mon,…,Thu), Friday, Saturday and Sunday. The boxplots of daily cycles are shown in Figure 4.

Weekly and daily seasonality can also be combined as a weekly seasonality expressed in hourly resolution (swd). Here the hourly traffic flow data are normalized to the average of the current week, and the emerging combined weekly/daily seasonality swd(j; j=1,2,...,168) is shown in Figure 5. Such seasonal profiles are very stable for each location and can support the building of forecasting models.

As the last data analysis step, we calculate the autocorrelation (R) and mutual information (MI) for the delayed values of traffic flow data. As shown in Figure 6, both daily and weekly seasonalities are strongly emphasised. The red squares denote the set of 18 maximum MI values. The information obtained through data analysis in this section will be utilized in subsequent sections to build the forecasting models.
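The normalization that produces the combined profile swd can be sketched as follows. The sketch makes several assumptions that the text does not state: the series starts at the first hour of a Monday, its length is a whole number of weeks, and the profile is summarized by the mean over weeks (the paper shows the distribution with 5% and 95% margins). The function name is hypothetical; with `hours_per_week=7` and daily totals as input, the same routine would yield the weekly cycle sw.

```python
def weekly_hourly_profile(hourly, hours_per_week=168):
    """Mean normalized weekly profile in hourly resolution.

    Assumptions: hourly[0] is the first hour of a Monday and
    len(hourly) is a multiple of hours_per_week.
    """
    n_weeks = len(hourly) // hours_per_week
    sums = [0.0] * hours_per_week
    for w in range(n_weeks):
        week = hourly[w * hours_per_week:(w + 1) * hours_per_week]
        week_mean = sum(week) / hours_per_week
        for j, q in enumerate(week):
            # Normalize each hourly value to the average of its own week
            sums[j] += q / week_mean
    return [s / n_weeks for s in sums]
```

Because each week is divided by its own mean, a slow trend in absolute traffic volume does not distort the profile.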

Fig. 4. Daily seasonality (sd) of the traffic flow data: 4 groups of normalized daily patterns can be formed, namely workdays (Mon-Thu), Friday, Saturday, and Sunday

Fig. 5. Combined daily/weekly seasonality (swd) of traffic flow data: the normalized week in hourly resolution with lower (5%) and upper (95%) confidence margins


Fig. 6. Correlation coefficients (R) and mutual information (MI) values for the delayed traffic flow data. Daily and weekly seasonalities are strongly emphasised. The red squares denote the set of 18 maximum MI values.

4 Forecasting Framework

The objective of this paper is to construct and test forecasting models for 1-hour ahead traffic flow forecasting. The available three years of data are split into train and test data. The first two years (2007-2008) are utilized as training data, and the last year (2009) is used as test data. The performance of the various models is compared and evaluated based on error (comparing measured data yt and forecasts yf) obtained on test data. Three error measures are applied in order to compare the performance of various models:

1. Correlation measure (R):

$$R = \frac{E[y_t y_f]}{\sigma_t \sigma_f} \tag{1}$$

2. Mean absolute percentage error (MAPE):

$$MAPE = \frac{1}{N} \sum_{n=1}^{N} \left| \frac{y_t(n) - y_f(n)}{y_t(n)} \right| \cdot 100\% \tag{2}$$

3. Mean absolute normalized error (MANE):

$$MANE = \frac{1}{N} \sum_{n=1}^{N} \frac{\left| y_t(n) - y_f(n) \right|}{\overline{y_t}} \cdot 100\% \tag{3}$$
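The three error measures can be sketched in Python as follows. Two assumptions are made and should be checked against the original paper: the expectation in the correlation measure is taken over mean-removed series (the usual Pearson coefficient), and MANE normalizes the absolute error by the mean of the measured data. Function names are hypothetical.

```python
import math


def correlation(y_true, y_fore):
    """Pearson correlation (assumes the expectation is over mean-removed series)."""
    n = len(y_true)
    mt, mf = sum(y_true) / n, sum(y_fore) / n
    cov = sum((a - mt) * (b - mf) for a, b in zip(y_true, y_fore)) / n
    st = math.sqrt(sum((a - mt) ** 2 for a in y_true) / n)
    sf = math.sqrt(sum((b - mf) ** 2 for b in y_fore) / n)
    return cov / (st * sf)


def mape(y_true, y_fore):
    """Mean absolute percentage error; undefined if any measured value is zero."""
    n = len(y_true)
    return 100.0 * sum(abs((a - b) / a) for a, b in zip(y_true, y_fore)) / n


def mane(y_true, y_fore):
    """Mean absolute error normalized by the mean measured value (assumption)."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    return 100.0 * sum(abs(a - b) for a, b in zip(y_true, y_fore)) / n / mean_true
```

In `mape` each term is divided by its own measured value, so hours with very low traffic can dominate the sum, whereas `mane` uses one common denominator for all hours.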


Although R is a standard statistical measure and MAPE is often applied as a measure of performance in traffic forecasting reports, we prefer the evaluation based on MANE, which is not as sensitive as MAPE to very low traffic flow volumes approaching zero. Consequently, the conclusions will be based on the MANE error measure. The following two sections describe the construction of various forecasting models, starting with several benchmark models for comparison with the more elaborate models.

5 Benchmark Models

Defined error measures are not very informative for the evaluation of the forecasting models unless they are compared with simplified benchmark models. We propose the following three benchmark models with the increasing level of complexity:

1. Random walk model (RW):

$$y_f(t+1) = y_t(t) \tag{4}$$

2. Seasonal RW model based on weekly seasonality (7×24–1 hours):

$$y_f(t+1) = y_t(t-167) \tag{5}$$

3. Naive model, improving the RW model by the ratio of corresponding combined daily/weekly seasonalities swd(t+1)/swd(t):

$$y_f(t+1) = y_t(t) \, \frac{s_{wd}(t+1)}{s_{wd}(t)} \tag{6}$$

Although neither random walk model is very strong, the results discussed in section 7 demonstrate the surprisingly accurate predictions of the naive model, which can hardly be beaten by the more elaborate models.
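The three benchmark forecasts of equations (4)-(6) can be sketched as small Python functions. The indexing convention is an assumption: `y` is a zero-based list of hourly counts whose index 0 is the first hour of a Monday, and `s_wd` is a zero-based list of 168 profile values, whereas the paper indexes swd from 1 to 168. Function names are hypothetical.

```python
HOURS_PER_WEEK = 168


def random_walk(y, t):
    """Equation (4): the forecast for hour t+1 is the value at hour t."""
    return y[t]


def seasonal_random_walk(y, t):
    """Equation (5): the forecast for hour t+1 is the value one week earlier."""
    return y[t + 1 - HOURS_PER_WEEK]  # same as y[t - 167]; requires t >= 167


def naive(y, t, s_wd):
    """Equation (6): random walk scaled by the ratio of seasonal profile values."""
    return y[t] * s_wd[(t + 1) % HOURS_PER_WEEK] / s_wd[t % HOURS_PER_WEEK]
```

The naive model needs no training beyond estimating the profile: if the profile says the next hour typically carries twice the traffic of the current hour, the current count is simply doubled.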

6 ARIMA and Neural Network Models

Beyond the rudimentary benchmark models, we construct various ARIMA models and feedforward neural network (FNN) models with sigmoid hidden layer transfer functions. The following forecasting models are examined:

1. AR model (NAR=24),
2. ARMA model (NAR=24, NMA=24),
3. NAR model, implemented by FNN (NAR=24, NHidden1=8, NHidden2=4),
4. Custom ARX model (custom input selection),
5. Custom NARX model, implemented by FNN (custom input selection).

Parameters NAR and NMA denote the number of auto-regressive delays and moving-average delays, and NHidden1, NHidden2 denote the number of neurons in the first and the second hidden layer of a neural network. The number of delays is similar for all models so that the predictive power of the various models can be compared. The proposed selection of parameters was optimized by preliminary experiments.

Regarding the selection of inputs for the considered models, the first three models are purely autoregressive, and the last two models are supported by custom input selection that also includes the time descriptors. Custom input selection was composed of 23 values as follows:

a) AR inputs selected by max(MI) criterion (NAR=10),
b) the same selected AR inputs (NAR=10) modified by local gradients obtained by the ratio of corresponding daily/weekly seasonalities swd(t+1)/swd(t-k),
c) time information, including holiday mark (0,1), day of the week (1,2,...,7), and hour of the day (0,1,...,23).

Custom inputs are prepared based on prior assumptions about influential factors that could enhance the predictive power of traffic flow forecasting models. The forecasting models were trained and tested as described in the forecasting framework (section 4), and the forecasting results are presented in the next section.
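The composition of the 23 custom inputs can be sketched as below. Several details are assumptions rather than facts from the paper: the "modification by local gradients" is interpreted as multiplying each delayed value by swd(t+1)/swd(t-k), the time descriptors are taken for the target hour t+1, index 0 of the series is the first hour of a Monday, and the concrete delays chosen by the max(MI) criterion are not listed in the text, so `lags` is left as a parameter. The function name is hypothetical.

```python
def custom_inputs(y, s_wd, t, lags, is_holiday):
    """Input vector for forecasting y[t+1] (hedged sketch, see assumptions above).

    y: hourly counts, index 0 assumed to be the first hour of a Monday
    s_wd: 168 seasonal profile values, zero-based
    lags: delays k selected by the max(MI) criterion (10 values in the paper)
    """
    target = t + 1
    # a) plain autoregressive inputs y(t-k)
    ar = [y[t - k] for k in lags]
    # b) the same inputs scaled by the seasonal ratio s_wd(t+1) / s_wd(t-k)
    scaled = [y[t - k] * s_wd[target % 168] / s_wd[(t - k) % 168] for k in lags]
    # c) time information for the target hour
    day_of_week = (target // 24) % 7 + 1  # 1 = Monday, ..., 7 = Sunday
    hour_of_day = target % 24             # 0, ..., 23
    return ar + scaled + [1 if is_holiday else 0, day_of_week, hour_of_day]
```

With ten delays the vector has 10 + 10 + 3 = 23 entries, matching the count given in the text; a linear model on these inputs would correspond to the custom ARX model and a feedforward network to the custom NARX model.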

7 Forecasting Results

Figure 7 summarizes the forecasting results obtained by the set of proposed forecasting models. Results are expressed by the MANE error measure. The best results (MANE = 5.2±0.2%) are obtained by the custom NARX model, implemented by a feedforward neural network (FNN).

[Figure 7: bar chart of MANE [%] (axis range 0 to 25) for the models Random Walk, Seasonal RW, Naive, AR, ARMA, NAR (FNN), Custom ARX and Custom NARX (FNN), shown for the locations 174 S Obvoznica, 645 Šmarje Sap AC and 854 Drenov Gric AC]

Fig. 7. Forecasting results for 1-hour ahead forecasting by various models. Best results (MANE = 5.2±0.2%) are obtained by a custom NARX model, implemented by a feedforward neural network.


P. Potočnik and E. Govekar

For the winning model, all three error measures, averaged over the three traffic flow locations, are as follows: R = 0.993, MANE = 5.2%, MAPE = 11.5%. A forecasting example is shown in Figure 8, which displays two weeks of 1-step ahead traffic flow forecasting with the custom NARX/FNN model on test data.

[Figure 8: time series plot for location 645 Šmarje Sap AC showing true data, forecast and absolute error; vertical axis Q [vehicles/hour] from 0 to 3500, horizontal axis Time [hours] from 01-Mar-2009 to 15-Mar-2009]

Fig. 8. Custom feedforward neural network based 1-step ahead traffic flow forecasting on test data for the location 645 (Šmarje Sap AC)

While the best forecasting results are obtained with the custom neural network based model, surprisingly accurate forecasts are also obtained by the simple naive model (Eq. 5). This confirms that highway traffic flow is quite regular and synchronized to a high degree with the population dynamics patterns. Since the scaling factor of the naive model is calculated as the ratio of combined daily/weekly seasonalities swd(t+1)/swd(t), and this information is also included in the winning custom FNN model, we conclude that the combined weekly/daily seasonality swd is an informative regressor that considerably enhances the predictive power of the forecasting models. Additional forecasting experiments were performed by including sine and cosine terms with frequencies corresponding to the major seasonal cycles (daily, weekly, yearly), but the presented results could not be improved, as the cyclic information is already adequately described by the combined daily/weekly seasonality swd.
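The role of the seasonality ratio can be made concrete with a short Python sketch. Eq. 5 is not reproduced here, so the form below is an assumption based only on the description of the scaling factor: the last measured flow is multiplied by swd(t+1)/swd(t). The function name naive_forecast is hypothetical.

```python
def naive_forecast(q, swd):
    """1-step-ahead forecasts f[t+1] = q[t] * swd[t+1] / swd[t] (assumed form of the naive model).

    q   -- measured traffic flow series
    swd -- combined daily/weekly seasonality, indexed like q
    Returns forecasts for time steps 1 .. len(q) - 1.
    """
    return [q[t] * swd[t + 1] / swd[t] for t in range(len(q) - 1)]
```

If the flow were an exact multiple of the seasonality profile, this forecast would be error-free, which is one way to see why a regular, strongly seasonal series leaves little room for more elaborate models.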

8 Conclusions

The paper presents a preliminary forecasting study of the Slovenian highway traffic flow. While the only similar study so far [5] reports an average correlation measure of R = 0.96, this study confirms that the Slovenian highway traffic flow is quite regular and predictable. The paper demonstrates very accurate forecasting results (R > 0.99) obtained by combining a proper input selection with an optimized model construction. As a possible application of the presented traffic flow forecasting approach in an Intelligent Transportation System (ITS), we propose to apply it as an automated detector of


abnormal highway activity. In such a system, the measured traffic flow data can be compared to the forecast, so that unusual traffic phenomena can be detected automatically. If the 1-hour resolution proves too coarse, shorter intervals (e.g. 15 minutes) could be considered in order to design efficient ITS support that facilitates the management of regional traffic flows.
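The proposed detector can be sketched in a few lines of Python. This is only an illustration of the comparison between measured and forecast flow; the function name flag_abnormal and the fixed absolute threshold are assumptions, since the paper does not specify a decision rule.

```python
def flag_abnormal(measured, forecast, threshold):
    """Return the indices of time steps where |measured - forecast| exceeds the threshold.

    threshold -- hypothetical tolerance in vehicles/hour; in practice it would
                 have to be tuned, e.g. from the forecast error on normal days.
    """
    return [i for i, (m, f) in enumerate(zip(measured, forecast))
            if abs(m - f) > threshold]


# Example: the third hour deviates strongly from its forecast.
suspicious_hours = flag_abnormal([100, 200, 900], [110, 190, 300], threshold=50)
```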

References

1. Schadschneider, A., Knospe, W., Santen, L., Schreckenberg, M.: Optimization of highway networks and traffic forecasting. Physica A: Statistical Mechanics and its Applications 346, 165–173 (2005)
2. Sun, S., Zhang, C., Zhang, Y.: Traffic Flow Forecasting Using a Spatio-temporal Bayesian Network Predictor. In: Duch, W., Kacprzyk, J., Oja, E., Zadrożny, S. (eds.) ICANN 2005. LNCS, vol. 3697, pp. 273–278. Springer, Heidelberg (2005)
3. Smith, B.L., Williams, B.M., Oswald, R.K.: Comparison of parametric and nonparametric models for traffic flow forecasting. Transportation Research Part C: Emerging Technologies 10, 303–321 (2002)
4. Kirby, H.R., Watson, S.M., Dougherty, M.S.: Should we use neural networks or statistical models for short-term motorway traffic forecasting? International Journal of Forecasting 13, 43–50 (1997)
5. Grabec, I., Kalcher, K., Švegl, F.: Modeling and Forecasting of Traffic Flow. Nonlinear Phenomena in Complex Systems 12, 1–10 (2009)
6. Dia, H.: An object-oriented neural network approach to short-term traffic forecasting. European Journal of Operational Research 131, 253–261 (2001)
7. Chen, H., Grant-Muller, S.: Use of sequential learning for short-term traffic flow forecasting. Transportation Research Part C: Emerging Technologies 9, 319–336 (2001)
8. Vlahogianni, E.I., Karlaftis, M.G., Golias, J.C.: Optimized and meta-optimized neural networks for short-term traffic flow prediction: A genetic approach. Transportation Research Part C: Emerging Technologies 13, 211–234 (2005)
9. Wong, W.K., Xia, M., Chu, W.C.: Adaptive neural network model for time-series forecasting. European Journal of Operational Research 207, 807–816 (2010)
10. Yin, H., Wong, S.C., Xu, J., Wong, C.K.: Urban traffic flow prediction using a fuzzy-neural approach. Transportation Research Part C: Emerging Technologies 10, 85–98 (2002)
11. Dimitriou, L., Tsekeris, T., Stathopoulos, A.: Adaptive hybrid fuzzy rule-based system approach for modeling and predicting urban traffic flow. Transportation Research Part C: Emerging Technologies 16, 554–573 (2008)
12. Stathopoulos, A., Karlaftis, M.G.: A multivariate state space approach for urban traffic flow modeling and prediction. Transportation Research Part C 11, 121–135 (2003)
13. Jin, X., Zhang, Y., Yao, D.: Simultaneously Prediction of Network Traffic Flow Based on PCA-SVR. In: Liu, D., Fei, S., Hou, Z., Zhang, H., Sun, C. (eds.) ISNN 2007. LNCS, vol. 4492, pp. 1022–1031. Springer, Heidelberg (2007)
14. Hong, W.-C., Dong, Y., Zheng, F., Lai, C.-Y.: Forecasting urban traffic flow by SVR with continuous ACO. Applied Mathematical Modelling (2010), article in press
15. Castillo, E., Menéndez, J.M., Sánchez-Cambronero, S.: Predicting traffic flow using Bayesian networks. Transportation Research Part B: Methodological 42, 482–509 (2008)
16. Xue, J., Shi, Z.: Short-Time Traffic Flow Prediction Based on Chaos Time Series Theory. Journal of Transportation Systems Engineering and Information Technology 8, 68–72 (2008)

A New Method of EEG Classification for BCI with Feature Extraction Based on Higher Order Statistics of Wavelet Components and Selection with Genetic Algorithms

Marcin Kołodziej, Andrzej Majkowski, and Remigiusz J. Rak

Warsaw University of Technology, Pl. Politechniki 1, 00-662 Warsaw
{kolodzim,amajk,rakrem}@iem.pw.edu.pl

Abstract. A new method of feature extraction and selection of EEG signal for brain-computer interface design is presented. The proposed feature extraction method is based on higher order statistics (HOS) calculated for the details of the discrete wavelet transform (DWT) of the EEG signal. Then a genetic algorithm is used for feature selection. During the experiment classification is conducted on a single trial of EEG signals. The proposed novel method of feature extraction using HOS and DWT gives more accurate results than the algorithm based on the discrete Fourier transform (DFT).

Keywords: feature extraction, feature selection, genetic algorithms (GA), higher order statistics (HOS), discrete wavelet transform (DWT), brain-computer interface (BCI), data-mining.

A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part I, LNCS 6593, pp. 280–289, 2011. © Springer-Verlag Berlin Heidelberg 2011

1 Introduction

Constructing an efficient brain-computer interface (BCI) is one of the most challenging scientific problems and attracts the attention of scientists from all over the world. Working out an efficient and accurate brain-machine interface requires the cooperation of scientists from many disciplines, such as medicine, psychology and computer science. The best known BCI interfaces are based on EEG signals recorded from the surface of the scalp, because this method of monitoring brain activity is easy to use and quite inexpensive. Besides, electroencephalography is widely used in medicine, for example for the diagnosis of certain neurological diseases. Brain-computer interfaces make use of several brain potentials such as P300, SSVEP or ERD/ERS [1,2,3]. The most difficult case for implementation is a BCI based on brain potentials associated with movements (ERD/ERS). The name ERD/ERS originates from the phenomenon of a rise or fall of EEG signal power in the bands of about 8-12 Hz and 18-26 Hz when the subject spontaneously imagines a movement. For example, the imagination of left or right hand movement causes a rise or fall of EEG signal power collected from various locations on the scalp. In our experiment we tried to classify EEG signals for single asynchronous


trial of movement imagination. There were three different classes associated with the different actions the subjects were asked to imagine. The aim of the proposed algorithm was to decide to which class a particular one-second window of the EEG signal belonged, that is, what the subject imagined at that time.

2 Dataset Description

In the experiment we used a dataset of EEG signals provided by the IDIAP Research Institute (Silvia Chiappa, José del R. Millán) [10]. The set contains data from 3 normal subjects acquired during 3 non-feedback sessions. The subjects were relaxed and sat in normal chairs with their arms resting on their legs. Each subject had three tasks to execute:

1. imagination of repetitive self-paced left hand movements,
2. imagination of repetitive self-paced right hand movements,
3. generation of words which start with the same random letter.

All 3 sessions were conducted on the same day. Each session lasted 4 minutes, with 5-10 minute breaks between them. The subject performed a given task for about 15 seconds and then switched randomly to another task on the operator's request. EEG data was not split into trials. In the experiment we focused on the first session of only one subject. EEG signals were recorded with a Biosemi system using a cap with 32 integrated electrodes located at standard positions of the conventional 10-20 system. The sampling rate was 512 Hz. No artifact rejection or correction was employed. The dataset contains raw EEG signals: 32 EEG potentials acquired in the following order: Fp1, AF3, F7, F3, FC1, FC5, T7, C3, CP1, CP5, P7, P3, Pz, PO3, O1, Oz, O2, PO4, P4, P8, CP6, CP2, C4, T8, FC6, FC2, F4, F8, AF4, Fp2, Fz, Cz. In the training files, each line has a 33rd component indicating the class label.

3 Proposed Feature Extraction Method Based on DWT and HOS

There exist many methods of feature extraction from a signal. The most widely used for EEG signals are based on frequency analysis, for example the discrete Fourier transform (DFT) or power spectral density (PSD). We propose a new method based on higher order statistics (HOS) of wavelet components (DWT). First, the EEG signal is divided into one-second windows overlapping by half a second. The half-second overlap makes it possible to generate a data set large enough for efficient classifier learning. Each one-second window of the EEG signal contains information from 32 channels, with 512 samples per channel. In the following we refer to that portion of data as a block. Next, the wavelet transform is calculated for each block. The Continuous Wavelet Transform (CWT) of x(t)∈L2(ℜ) (where L2(ℜ) denotes the vector space of square-integrable one-dimensional functions) for a certain wavelet ψ(t) is defined as:


$$W(\tau,\sigma) = \int_{-\infty}^{+\infty} x(t)\,\psi_{\tau,\sigma}(t)\,dt, \qquad \psi_{\tau,\sigma}(t) = \frac{1}{\sqrt{\sigma}}\,\psi\!\left(\frac{t-\tau}{\sigma}\right) \tag{1}$$

where σ denotes the scale and τ the delay of the wavelet window ψ_{τ,σ}(t). For the Discrete Wavelet Transform (DWT) we assume that:

$$\sigma = 2^{-s}, \qquad \tau = 2^{-s}\,l \tag{2}$$

(where l describes the delay and s the scale coefficients: l=0,1,2,... s=0,1,2,...), which, after discretization of the signal x(t), yields a new form of the wavelet transform:

$$W(\tau,\sigma) = W(l\,2^{-s}, 2^{-s}) = W(l,s) = 2^{s/2} \sum_{n} x(n)\,\psi(2^{s} n - l) \tag{3}$$

It is also possible to define a wavelet series, as in the case of the Fourier transform, for any function x(t)∈L2:

$$x(t) = \sum_{s}\sum_{l} w_{l,s}\,\psi_{l,s}(t), \qquad \psi_{l,s}(t) = 2^{s/2}\,\psi(2^{s} t - l) \tag{4}$$

When {ψl,s(t)} forms an orthonormal basis of the L2(ℜ) space then, as in the case of the Fourier series, we have:

$$w_{l,s} = \langle x(t), \psi_{l,s}(t) \rangle = 2^{s/2}\, W_{\psi,x}(l,s) \tag{5}$$

A very useful and practical implementation of the DWT algorithm was proposed in 1998 by S. Mallat [11]. It relies on the decomposition of the signal into wavelet components with the help of digital filters.

Fig. 1. Filtering process of wavelet decomposition – first decomposition level


We can distinguish approximations and details. The approximations are the low-frequency components of the signal. The details, in turn, are the high-frequency components. The filtering process of the DWT at its most basic level is presented in figure 1. The process of decomposition can be continued, so that one signal is broken down into many lower resolution components. This is called the wavelet decomposition tree (fig. 2).

Fig. 2. Wavelet decomposition tree
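The filter-and-decimate structure of the decomposition tree can be illustrated with a short Python sketch. To keep it dependency-free it uses the two-tap Haar filters, which is a deliberate simplification and not the wavelet used in the study; the function names haar_step and wavelet_tree are hypothetical.

```python
import math


def haar_step(signal):
    """One decomposition level: low-pass and high-pass filtering, each followed by decimation by 2."""
    scale = 1.0 / math.sqrt(2.0)
    pairs = range(0, len(signal) - 1, 2)
    approximation = [(signal[i] + signal[i + 1]) * scale for i in pairs]  # low-frequency part
    detail = [(signal[i] - signal[i + 1]) * scale for i in pairs]         # high-frequency part
    return approximation, detail


def wavelet_tree(signal, levels):
    """Split the approximation repeatedly; return ([d1, ..., dN], aN)."""
    details = []
    approximation = list(signal)
    for _ in range(levels):
        approximation, detail = haar_step(approximation)
        details.append(detail)
    return details, approximation


# A 512-sample window decomposed to level 7 leaves 4 samples in d7 and in a7.
window = [math.sin(0.1 * n) for n in range(512)]
details, a7 = wavelet_tree(window, levels=7)
```

Because every level halves the number of samples, the detail lengths for a 512-sample window are 256, 128, 64, 32, 16, 8 and 4.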

For the decomposition we used the 5th order wavelet from the Daubechies family (db5) [7]. In our experiment a 7-level DWT decomposition was performed, so for a one-second window of the EEG signal we obtained 7 details and an approximation (fig. 3). All steps include decimation, so at the 7th level the detail signal consists of only 4 samples. In the next step higher order statistics (HOS) are calculated: variance, skewness and kurtosis were computed for the successive details d1,d2,d3,…d7. As a result of that process 21 features were obtained (7 details × 3 HOS) for each one-second window. Since the operation was performed on 32 channels, we had 672 features from each one-second block.

Fig. 3. Details (d1,d2,d3,...d7) and the approximation (a7) obtained by using wavelet transform (db5) of EEG signal
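The statistics computed on each detail signal can be sketched as follows. The sketch assumes population (1/N) moments and non-excess kurtosis; the normalisation actually used in the study is not stated, and the function names are hypothetical. The per-channel ordering (7 variances, then 7 kurtoses, then 7 skewnesses) follows the feature numbering given in the caption of Fig. 8.

```python
def hos_features(x):
    """Return (variance, skewness, kurtosis) of a sequence using population moments."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m3 = sum((v - mean) ** 3 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    if m2 == 0.0:
        # Constant signal: skewness and kurtosis are undefined, reported as 0.0 here.
        return 0.0, 0.0, 0.0
    return m2, m3 / m2 ** 1.5, m4 / m2 ** 2


def channel_features(details):
    """21 features for one channel: variances of d1..d7, then kurtoses, then skewnesses."""
    stats = [hos_features(d) for d in details]
    variances = [s[0] for s in stats]
    skewnesses = [s[1] for s in stats]
    kurtoses = [s[2] for s in stats]
    return variances + kurtoses + skewnesses
```

Applying channel_features to the 7 details of each of the 32 channels and concatenating the results gives the 7 × 3 × 32 = 672 features of one block.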


For feature extraction we used our previously prepared Matlab FE_Toolbox [8]. For the HOS method the toolbox makes it possible to select the type of decomposition wavelet, the number of decomposition levels and the statistics to be computed. The calculated feature vectors, together with a description of their features, are saved in the Matlab workspace.

4 Proposed Feature Selection Method Based on Genetic Algorithms

As we used a database that contained EEG signals collected from 32 electrodes, we obtained a large number of features (672 features). In that case a very important problem was to select the best features and, in consequence, the electrodes which carry the most important information for the classifier. To solve this problem a genetic algorithm (GA) was implemented in our experiment. The GA is a popular technique used in optimization tasks. Here it is used to select the combination of features that minimizes the classification error. It was assumed that out of all 672 features only 30 features would be selected, so a chromosome consisted of 30 genes. A randomly chosen set of 500 chromosomes formed the initial population (fig. 4). The size of the initial population was chosen experimentally. In order to verify which chromosomes were the best adapted, a special fitness function was used. The main part of the fitness function was a classifier. For that purpose discriminant analysis (Matlab function classify with the option 'quadratic', i.e. a quadratic discriminant) was used. The fitness function performed a 10-fold cross-validation test and returned the percentage classification error. Smaller errors indicated more relevant chromosomes (more relevant features).

[Figure 4: flowchart with the blocks Initial population, Fitness evaluation, Continue? (with exit to stop), Parent selection, Crossover function and Mutation function]

Fig. 4. The phases of the proposed genetic algorithm
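The chromosome encoding and the cross-validated fitness can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: a nearest-centroid classifier stands in for the Matlab discriminant classifier, folds are formed by taking every tenth sample rather than by random partitioning, and all function names are hypothetical.

```python
import random

N_FEATURES = 672   # 32 channels x 7 details x 3 statistics
N_GENES = 30       # number of features encoded by one chromosome


def random_chromosome(rng):
    """A chromosome is a list of 30 distinct 0-based feature indices."""
    return rng.sample(range(N_FEATURES), N_GENES)


def centroids(rows, labels):
    """Mean feature vector of every class."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    return {label: [sum(col) / len(col) for col in zip(*members)]
            for label, members in groups.items()}


def predict(cents, row):
    """Label of the closest class centroid (squared Euclidean distance)."""
    return min(cents, key=lambda lab: sum((a - b) ** 2 for a, b in zip(row, cents[lab])))


def cv_error(features, labels, chromosome, folds=10):
    """Fitness: cross-validated classification error in percent (lower is better)."""
    rows = [[row[g] for g in chromosome] for row in features]
    wrong = 0
    for fold in range(folds):
        test = set(range(fold, len(rows), folds))          # held-out samples of this fold
        train = [i for i in range(len(rows)) if i not in test]
        cents = centroids([rows[i] for i in train], [labels[i] for i in train])
        wrong += sum(predict(cents, rows[i]) != labels[i] for i in test)
    return 100.0 * wrong / len(rows)


# An initial population of 500 random chromosomes, as in the experiment.
rng = random.Random(0)
population = [random_chromosome(rng) for _ in range(500)]
```

Each fitness evaluation trains and tests the classifier ten times, which is the main reason why a full GA run is computationally expensive.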


The selection operation determines which parent chromosomes are involved in producing the next generation of offspring. In our case the process was based on "spinning the roulette wheel", in which parents were selected for mating with a probability proportional to their fitness values. The crossover function makes it possible to exchange chromosome information between highly-fit individuals. In our experiment the probability of crossover was 80%, which means that 80% of the chromosomes were allowed to exchange genes. The crossover point was selected randomly, and for two given parent chromosomes the values to the left of the crossover point were swapped. As the population may not contain all the information needed for optimal classification of the EEG signal, a mutation function is introduced. The probability of mutation was 5%. A chromosome that mutated was allowed to change one gene to any of the 672 available features. The GA was stopped after 100 generations. It is also worth noting that the calculations lasted about 30 minutes for one run of the GA on a PC with a quad-core processor, so the algorithm is not suitable for real-time use. We ran the GA 10 times and checked which features were selected most often. The results are presented in fig. 5. It is also interesting to see how the features were distributed over the channels. The distribution of features per channel is presented in fig. 6.

Fig. 5. Features that repeated in 10 runs of genetic algorithm. Darker shades mean that the features described by them occurred more frequently.
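The three genetic operators described above can be sketched in Python. Several details are assumptions: since the fitness is an error to be minimised, the roulette weight is taken as (100 - error %), the treatment of duplicate genes after crossover is not specified in the text and is left unhandled, and the function names are hypothetical.

```python
import random


def roulette_select(population, errors, rng):
    """Pick one parent with probability proportional to (100 - error %), an assumed weighting."""
    weights = [100.0 - e for e in errors]
    return rng.choices(population, weights=weights, k=1)[0]


def crossover(parent_a, parent_b, rng, probability=0.8):
    """One-point crossover: genes to the left of a random point are swapped."""
    if rng.random() >= probability:
        return parent_a[:], parent_b[:]
    point = rng.randrange(1, len(parent_a))
    child_a = parent_b[:point] + parent_a[point:]
    child_b = parent_a[:point] + parent_b[point:]
    return child_a, child_b


def mutate(chromosome, rng, probability=0.05, n_features=672):
    """With the given probability, replace one randomly chosen gene by any feature index."""
    mutated = chromosome[:]
    if rng.random() < probability:
        mutated[rng.randrange(len(mutated))] = rng.randrange(n_features)
    return mutated
```

One generation would then consist of repeated calls to roulette_select, crossover and mutate until a new population of the same size has been produced, followed by a new fitness evaluation.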


Fig. 6. The distribution of features per frequency

The classification error for a single launch of the genetic algorithm ranged from 11.6% to 8.6%. Detailed results for the 10 runs of the genetic algorithm are presented in table 1. It can be noticed that the most often selected features were taken from channels 8, 9, 23 and 32 (that is, from the electrodes C3, CP1, C4 and Cz).

Table 1. Classification error for 10 runs of genetic algorithm (BEST - is referred to the error achieved for the best adapted individual from the population, MEAN - to mean error for all individuals in the population)

Run    BEST     MEAN
1      11.6%    13.1%
2      9.50%    11.2%
3      10.1%    11.5%
4      8.10%    9.41%
5      10.3%    11.9%
6      9.10%    10.5%
7      9.70%    11.4%
8      9.90%    11.6%
9      8.60%    9.90%
10     11.8%    13.7%

As we knew which channels (electrodes) carry the most information, it was interesting to check whether information from only these channels (instead of all 32 electrodes) would be sufficient for constructing a brain-computer interface. Our further studies went in that direction. We examined the classification error for only four electrodes (C3, CP1, C4 and Cz). The genetic algorithm was then run again to select the new best features and determine the classification error. For the best case, the mean classification error over 10 runs of the GA for the 4 selected electrodes was 10.5%. Figure 7 presents the fitness value for 4 channels as a function of generation number for one run of the GA.

Fig. 7. The fitness value for 4 channels as a function of generation number for one run of GA (upper curve – mean value, bottom curve – best value)

Fig. 8. The distribution of features. Feature number from 1 to 7 denote the variance of details d1…d7. Feature number from 8 to 14 denote the kurtosis of details d1…d7. Feature number from 15 to 21 denote the skewness of details d1…d7.
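Reading the selection results requires mapping a feature number back to its electrode, statistic and detail level. The Python sketch below assumes that the 672 features are stored channel by channel (21 consecutive features per channel, in the acquisition order of Section 2), with the within-channel numbering of the Fig. 8 caption; this layout and the function name decode_feature are assumptions.

```python
CHANNELS = ["Fp1", "AF3", "F7", "F3", "FC1", "FC5", "T7", "C3", "CP1", "CP5", "P7",
            "P3", "Pz", "PO3", "O1", "Oz", "O2", "PO4", "P4", "P8", "CP6", "CP2",
            "C4", "T8", "FC6", "FC2", "F4", "F8", "AF4", "Fp2", "Fz", "Cz"]
STATISTICS = ["variance", "kurtosis", "skewness"]   # features 1-7, 8-14, 15-21 of a channel


def decode_feature(number):
    """Map a 1-based feature number (1..672) to (electrode, statistic, detail level)."""
    if not 1 <= number <= 21 * len(CHANNELS):
        raise ValueError("feature number out of range")
    channel, within = divmod(number - 1, 21)   # assumed channel-major layout
    statistic, level = divmod(within, 7)
    return CHANNELS[channel], STATISTICS[statistic], level + 1
```

Under this layout, feature number 3 within a channel is the variance of the d3 detail, and channel numbers 8, 9, 23 and 32 correspond to C3, CP1, C4 and Cz.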

5 Conclusion

The results show that it is possible to successfully apply DWT and HOS as a method of feature extraction for brain-computer interfaces. The selected features can be considered statistical parameters describing the signals after the application of filter banks. The variance can be interpreted as the power of the variable component of the signal. The skewness is most often interpreted as a measure of the asymmetry of the signal distribution. Kurtosis, in turn, is interpreted as a measure of the flatness of the signal distribution. In our experiment feature number 3 was the most often selected feature during the ten runs of the genetic algorithm (fig. 8). This feature is the variance of the d3 detail signal. An important disadvantage of using the genetic algorithm in practice is the long time it requires to complete the calculations. It is hard to imagine using the GA while carrying out an experiment on-line (for example while operating the BCI system in real time). In this case, it seems necessary to implement other methods of feature selection, for example ranking methods. We compared the results with our previous research, where we used the discrete Fourier transform as the method of feature extraction [9]. In that case the mean classification error for the same subject was about 22%. With the newly proposed method the same error was only 10.5%. The next step of our research showed that it is possible to reduce the number of channels to 4 without significant changes in the classification results. It is worth noting that the set of the most relevant electrodes selected with the HOS/DWT algorithm was different than in the case of pure FFT. This means that the set of selected electrodes depends on the implemented selection method. For final verification the new method should be tested for all subjects and all sessions. Such an experiment could identify features that are both best and universal. Furthermore, the implementation of biofeedback sessions could bring better results.

References

1. Vidal, J.J.: Direct brain-computer communication. Ann. Rev. Biophys Bioeng. 2 (1973)
2. Molina, G.: Direct Brain-Computer Communication through scalp recorded EEG signals. PhD Thesis, École Polytechnique Fédérale de Lausanne (2004)
3. Wolpaw, J.R., Birbaumer, N., Heetderks, W.J., Mcfarland, D.J., Hunter Peckham, P., Schalk, G., Donchin, E., Quatrano, L.A., Robinson, C.J., Vaughan, T.M.: Brain–Computer Interface Technology: A Review of the First International Meeting. IEEE Transactions on Rehabilitation Engineering 8(2) (June 2000)
4. Kantardzic, M.: Data Mining: Concepts, Models, Methods, and Algorithms. IEEE Press & John Wiley (November 2002)
5. Documentation of Genetic Algorithm and Direct Search Toolbox™ – MATLAB
6. Kołodziej, M., Majkowski, A., Rak, R.J.: A new method of feature extraction from EEG signal for brain-computer interface design. Przeglad Elektrotechniczny 86(9), 35–38 (2010)
7. Kołodziej, M., Majkowski, A., Rak, R.J.: Matlab FE-Toolbox - An universal utility for feature extraction of EEG signals for BCI realization. Przeglad Elektrotechniczny 86(1), 44–46
8. Kołodziej, M., Majkowski, A., Rak, R.J.: Implementation of genetic algorithms to feature selection for the use of brain-computer interface. In: CPEE 2010 (2010)
9. del Millán, J.R.: On the need for on-line learning in brain-computer interfaces. In: Proc. Int. Joint Conf. on Neural Networks (2004)
10. Mallat, S.: A wavelet Tour of Signal Processing. Academic Press, London (1998)

Regressor Survival Rate Estimation for Enhanced Crossover Configuration

Alina Patelli and Lavinia Ferariu

“Gh. Asachi” Technical University of Iasi, Department of Automatic Control and Applied Informatics, D. Mangeron 27, 700050 Iasi, Romania
{apatelli,lferaru}@tuiasi.ro

Abstract. In the framework of nonlinear systems identification by means of multiobjective genetic programming, the paper introduces a customized crossover operator, guided by fuzzy controlled regressor encapsulation. The approach is aimed at achieving a balance between exploration and exploitation by protecting well adapted subtrees from division during recombination. To reveal the benefits of the suggested genetic operator, the authors introduce a novel mathematical formalism which extends the Schema Theory for cut point crossover operating on trees encoding regressor based models. This general framework is afterwards used for monitoring the survival rates of fit encapsulated structural blocks. Other contributions are proposed in answer to the specific requirements of the identification problem, such as a customized tree building mechanism, enhanced elite processing and the hybridization with a local optimization procedure. The practical potential of the suggested algorithm is demonstrated in the context of an industrial application involving the identification of a subsection within the sugar factory of Lublin, Poland.

Keywords: genetic operators, fuzzy control, schema theory, nonlinear systems identification, multiobjective optimization.

A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part I, LNCS 6593, pp. 290–299, 2011. © Springer-Verlag Berlin Heidelberg 2011

1 Introduction

One of the key features of a successful evolutionary algorithm is an efficient symbiosis between exploiting the well adapted components of the considered individuals and exploring new regions of the search space [1]. On the one hand, if the algorithm fails to encourage the survival of fit building blocks, the evolutionary process is most likely to waste computational resources on poorly adapted chromosomes. On the other hand, a high selection pressure in favor of elite individuals is prone to cause premature convergence due to lack of population diversity. One way to avoid these two undesired tendencies is to configure specific genetic operators capable of implementing an appropriate compromise between exploration and exploitation. The suggested approach to the problem of nonlinear systems identification considers nonlinear models, linear in parameters (NLP), a formalism proven to be a universal approximator, capable of encoding a model of any desired degree of accuracy for any continuous bounded function [2]. NLP models are recursive combinations of


regressors (building blocks) (see section 3), easily represented by tree structures, that the crossover operator recombines in order to generate offspring. In that context, a blind selection of cut points might lead to an uncontrolled increase in tree dimension, without any significant fitness improvement, a phenomenon known as bloat [3]. To prevent such a situation, the paper suggests a novel method to identify potentially fit regressors by means of a similarity analysis, and to encapsulate them based on the decision of a fuzzy controller. The membership function parameters are trained during the first generations of the evolutionary process. The encapsulation labels assigned to each regressor are used in selecting appropriate cut nodes, therefore encouraging the production of fitter offspring. An early version of the fuzzy controlled encapsulation procedure was described in [4], where the membership function parameters had predefined, fixed values. A dynamic tuning procedure for these parameters is considered in [5], without any mathematical support for proving the tool’s efficiency. The main contribution of the present paper consists in the introduction of variable size and shape schemata, an original instrument designed to compute the survival rates of well adapted regressors. For a sound analysis, the authors provide a mathematical formalism which represents a general tool, capable of characterizing the effect of any structural crossover operating on NLP compliant subtrees of different sizes and shapes. This paper exploits the resulting theorems for monitoring the algorithm’s efficiency in detecting and encapsulating fit hierarchical substructures. The suggested identification algorithm is a multiobjective optimization tool, as it considers both accuracy and parsimony in evaluating potential models.
By evolving populations of candidate solutions, a diverse set of nondominated points, close to the Pareto optimal front, is generated in a single run [6]. The search is dynamically focused on a specific zone of practical importance within the Pareto front [4]. The inherent robustness of evolutionary techniques is exploited to cope with vast nonlinear problem domains [7], whilst search efficiency is enhanced by hybridization with a local deterministic optimization procedure, namely QR decomposition [8]. Insular evolution and genetic material infusion [4] are employed in order to encourage structural diversity, thus allowing the discovery of unexpected relations between input variables. The layout of the paper is as follows. The next section represents a brief summary of the research in the field. Details about chromosome representation and regressor classification are provided in section 3, followed by the description of the fuzzy controller that manages the encapsulation process. Section 5 introduces variable size and shape schemata, while section 6 includes the industrial plant application used to test the presented algorithm along with experimental observations. The conclusions are drawn in the final section.

2 Related Work

Selecting appropriate NLP model structures is a difficult endeavor given the increased dimension of the search space. Therefore, deterministic approaches such as incremental model building, which starts with a minimal structure and iteratively adds new terms [2], or pruning techniques, which consider the maximal achievable architecture


and progressively eliminate terms [1], are usually resource consuming and unsuited for complex identification problems. Evolutionary optimization methods represent an attractive alternative, partly due to their flexible configuration, allowing the inclusion of domain specific knowledge, and/or the integration of multiple objectives [7]. Depending on the shape of the Pareto front, the search process can be guided towards specific points or regions known as Pareto knees [9] by evolving specialized clusters of individuals. To handle rugged objectives spaces, the decision variables may be adjusted (relocated) to efficiently explore regions marked by discontinuities and/or nonlinearities [10]. The difficult problem of mapping a solution in the objectives space back to its corresponding point in the decision space, necessary when it is more convenient to exploit information contained by the former, is solved by using neural networks [10] or fuzzy logic. Some of the most proficient documented methods to encourage solution diversity, speedy adaptation and even distribution across the Pareto front are nondominance analysis [11], elitist approaches, with or without archiving [6], refined Pareto ranking [11] or specialized memetic hybrids [6]. They all exploit well adapted building blocks, which according to the schema theory [3] lead up to fitter individuals, in an implicit manner [6]. One way of performing the same task explicitly is to take the usefulness of tree building blocks (BB) into account at evaluation stage. The approach suggested in [12] evolves BB of all sizes between 1 and an imposed value, leading to large populations which are difficult to process. The survival rate of fit building blocks may be statistically evaluated in the framework of the fixed size and shape schema theory [3], developed for different types of genetic operators employed on hierarchical chromosomes. 
However, due to the commutative nature of the operators assumed by the NLP encoding, regressors are building blocks of flexible shapes and sizes; therefore this paper suggests a novel variation of Poli’s schema theory, specifically tailored to fit this case.

3 Regressor Based Hierarchical Representation

To ensure heightened exploration capabilities as well as an efficient hybridization with the local optimization procedure [8], the potential models evolved by the algorithm are encrypted by regressive trees (Fig. 1). The polynomial form of an m-input, n-output NLP model, is the one in (1), where yi stands for the ith output of the system at the current time instant k, and consists in a linear combination of nonlinear functions Fiq called regressors, with their corresponding coefficients ciq:

$$ y_i(k) = \sum_{q=1}^{Q} c_{iq} F_{iq}(x(k)), \quad i = 1, \dots, n. \tag{1} $$

A regressor is a product of elements from the terminal set x, which contains lagged values of system inputs, ui, i = 1..m, and outputs, yi, i = 1..n, as shown in (2), where nu, ny stand for the maximum allowed input and output lags, respectively:

$$ F = \prod_{i=1}^{I} x_i, \quad x(k) = \{u_1(k), \dots, u_1(k - n_u), \dots, u_m(k), \dots, u_m(k - n_u), y_1(k - 1), \dots, y_n(k - n_y)\}. \tag{2} $$
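As a minimal sketch of how (1) and (2) combine for a single output, the snippet below evaluates a model given as a list of regressors, each regressor being a list of terminal names. The function names, the dictionary layout and the numeric values are illustrative assumptions, not part of the original method; in the paper the coefficients come from the local optimization procedure [8].

```python
from math import prod

def regressor_value(regressor, terminals):
    """Product of the terminal values that make up one regressor, as in (2)."""
    return prod(terminals[name] for name in regressor)

def nlp_output(coefficients, regressors, terminals):
    """Linear combination of regressor values for one output, as in (1)."""
    return sum(c * regressor_value(r, terminals)
               for c, r in zip(coefficients, regressors))

# Hypothetical lagged values at time instant k (illustrative numbers only).
terminals = {"u(k)": 2.0, "u(k-1)": 3.0, "y(k-1)": 0.5}
# Two global regressors: u(k)*u(k-1) and y(k-1).
regressors = [["u(k)", "u(k-1)"], ["y(k-1)"]]
# Hypothetical coefficients c_q.
coefficients = [0.5, 4.0]

y = nlp_output(coefficients, regressors, terminals)
print(y)  # 0.5 * (2.0 * 3.0) + 4.0 * 0.5
```

Listing a terminal name twice inside a regressor raises its power, which is how a product such as u(k-1)*u(k-1) would be expressed in this layout.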


Let us denote a regressor with Ri, where i represents its root within the containing tree. An atomic regressor is made out of a single terminal which is tree root or successor to a “+” node (type I) or is successor to a “*” node (type II). All terminals contained by the trees in Fig. 1 are type II atomic regressors. A recursive regressor is a combination of terminals, connected by “*” operator nodes, with a depth of at least two tree levels. Note that a minimal recursive regressor consists of two type II atomic regressors, as is the case with R3, R6, R10 in T2. Regressors R2 and R9 in T2 are also recursive, yet not minimal. A regressor is called global (Fiq in (1)) if its root coincides with that of the tree or is the successor of a “+” node. R2 and R9 are global regressors in both T1 and T2. A local or nested regressor is rooted in a node that is the successor of a “*” operator. All terminal nodes as well as R4, R6, R10 in T1 are local regressors. Note that type I atomic regressors are always global, whereas type II atomic regressors are always local. Recursive regressors may be global (R2, R9 in both trees in Fig. 1) or local (R3, R6, R10 in T2). Due to singularity issues concerning the regression matrix processed by the QR decomposition procedure, one tree may not contain two identical global regressors [8]. There are no restrictions relative to nested regressors. Coherent regressors contain directly connected component nodes, and incoherent regressors are made up of scattered nodes, with no direct links. Mathematically, according to (1), nodes 3 and 7 in T1 may be viewed as nested regressors, yet, from a computational standpoint, they are not directly connected. Within the context of the suggested recursive representation, coherent tree regressors are building blocks as defined in [1], [3] and [7], with a direct influence on the quality of the containing individuals. 
The survival of well adjusted BBs is encouraged via fuzzy controlled encapsulation (section 4) and monitored by studying the survival rates of variable size and shape schemata (section 5).

[Figure: three regressive trees, T1, T2 and T3, with numbered nodes; internal nodes are "+" and "*" operators, terminals are u(k), u(k-1), u(k-2) and y(k-1)]

Fig. 1. Regressive trees encrypting single input single output NLP models

4 Fuzzy Controlled Encapsulation

As the population moves closer to the Pareto optimal front, which represents the solution of the multiobjective optimization problem (MOO) [4], [8], the well adapted individuals start featuring similar regressors as a result of the fittest chromosomes taking over the population [6]. Ergo, it is highly probable that these common subtrees represent BBs with a positive influence over the chromosomes' performance in terms of the considered objectives. The authors suggest searching for all coherent regressors located at least twice within the structure of the trees that form the nondominated set of the current generation, and storing them in a global list R. The degree of adaptation of each element in the list will then be evaluated relative to its complexity and the accuracy of the containing individuals, by a fuzzy controller operating according to the following rules:

IF var(SEF) IS small AND size(REG) IS convenient THEN P[enc] = 1
IF var(SEF) IS medium AND size(REG) IS convenient THEN P[enc] = 0.8
IF var(SEF) IS high AND size(REG) IS convenient THEN P[enc] = 0
IF var(SEF) IS small AND size(REG) IS large THEN P[enc] = 0.9
IF var(SEF) IS medium AND size(REG) IS large THEN P[enc] = 0.5
IF var(SEF) IS high AND size(REG) IS large THEN P[enc] = 0.   (3)

In (3), var(SEF) stands for the variance of the squared error function values accomplished by all nondominated trees that feature the current element in list R, marked REG, size(REG) denotes its dimension, small, medium, high are the fuzzy sets associated to the var(SEF) variable, whilst convenient and large are the fuzzy sets corresponding to the size(REG) variable. P[enc] represents the encapsulation probability, a label attached to each regressor in R to mark its estimated degree of adaptation. The crossover operator is allowed to select a cut point inside an encapsulated regressor with a probability of one minus the encapsulation label, a customization meant to encourage exploitation (inheritance of well adapted BBs by the offspring solutions). The trapezoidal membership functions (MF) associated to the fuzzy sets and variables mentioned before are available in [4]. During the first generations of the evolutionary process, similar regressors are mostly accidental, therefore labeling is premature. These early stages of the algorithm are used as a training period to tune the MF parameters according to the average accuracy and parsimony of the nondominated trees at each generation [4]. The authors consider the two fuzzy variables, var(SEF) and size(REG), as accuracy and parsimony are the most relevant indicators of tree adaptation. The five fuzzy sets in (3) provide an accurate classification of the evolved models relative to their performances in terms of the considered optimization criteria, whilst maintaining the complexity of the fuzzy module to a minimum. Any additional fuzzy sets/variables would lead to an increase in the complexity of the fuzzy parameter training procedure.

In order to explain the encapsulation process more clearly, let us assume that the nondominated front of the current generation is made up of the three trees in Fig. 1. There is a high probability that similar atomic regressors do not influence the overall tree accuracy significantly, therefore the similarity analysis only considers coherent recursive regressors. In addition to the previous approaches [4], the encapsulation described in this paper targets not only global regressors but also nested ones. R1 = {R6, R4, R2, R10, R9}, R2 = {R3, R6, R2, R10, R9} and R3 = {R1} are the regressor lists for the three trees in Fig. 1, assuming a post-order traversal. Regressors R6 and R10 in T1, R10 in T2 and R1 in T3 are identical (denoted R’), as well as R9 in T1 and R9 in T2 (denoted R’’), and are both included in R. Let us assume that var(SEF) for all three trees that contain R’ belongs to set medium with the highest degree of membership, and that size(R’) is convenient. Let us also assume that var(SEF) for T1 and T2, featuring R’’, is small, and that size(R’’) is large. After running the fuzzy control procedure, the encapsulation probability for all the instances of R’ is 0.8 (the second rule in (3) is fired) and that of all the instances of R’’ is 0.9 (the fourth rule in (3) is fired). As dividing R10 would also destroy R9, the encapsulation probability of R10 is upgraded to that of the nesting regressor, namely 0.9. At offspring generation stage, nodes 7, 8 and 6 in T1 as well as 2, 3, 1 in T3 will be considered as cut nodes with a probability of 0.2, whereas nodes 11, 12, 10, 13, 9 in T1 and T2 are assigned a probability of 0.1. The well adapted regressors, according to the fuzzy controller, are therefore protected against division by crossover, and stand a better chance of being passed on, unaltered, to the offspring solutions.
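The rule base in (3) and the resulting cut probabilities can be sketched as follows. This is only an illustration under stated assumptions: the membership parameters are invented (the tuned ones are described in [4]), the AND connective is taken as the minimum, and the crisp label is obtained as a weighted average of the rule consequents; the source does not specify the inference or defuzzification scheme.

```python
INF = float("inf")

def trapezoid(x, a, b, c, d):
    """Trapezoidal membership: 1 on [b, c], 0 outside (a, d), linear in between."""
    if b <= x <= c:
        return 1.0
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)

# Hypothetical membership parameters (assumed; not the ones from [4]).
VAR_SETS = {
    "small": (-1.0, 0.0, 0.2, 0.4),
    "medium": (0.2, 0.4, 0.6, 0.8),
    "high": (0.6, 0.8, INF, INF),
}
SIZE_SETS = {
    "convenient": (-1.0, 0.0, 5.0, 9.0),
    "large": (5.0, 9.0, INF, INF),
}

# Rule base (3): (var(SEF) set, size(REG) set) -> P[enc].
RULES = {
    ("small", "convenient"): 1.0,
    ("medium", "convenient"): 0.8,
    ("high", "convenient"): 0.0,
    ("small", "large"): 0.9,
    ("medium", "large"): 0.5,
    ("high", "large"): 0.0,
}

def encapsulation_probability(var_sef, size_reg):
    """Weighted average of rule consequents (assumed inference scheme)."""
    numerator = denominator = 0.0
    for (var_set, size_set), p_enc in RULES.items():
        # Firing strength of the rule: AND taken as min (assumption).
        strength = min(trapezoid(var_sef, *VAR_SETS[var_set]),
                       trapezoid(size_reg, *SIZE_SETS[size_set]))
        numerator += strength * p_enc
        denominator += strength
    return numerator / denominator if denominator > 0 else 0.0

def cut_probability(p_enc):
    """Probability of choosing a cut node inside an encapsulated regressor."""
    return 1.0 - p_enc

# Inputs chosen so that a single rule fires fully, as in the worked example.
p_r1 = encapsulation_probability(0.5, 3)    # medium, convenient
p_r2 = encapsulation_probability(0.1, 11)   # small, large
print(p_r1, cut_probability(p_r1))
print(p_r2, cut_probability(p_r2))
```

With these assumed parameters the two calls reproduce the labels of R’ and R’’ from the example, and the corresponding cut probabilities are one minus those labels.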

5 Variable Size and Shape Schemata

The survival rate of well adapted regressors is an indicator of a given crossover operator’s efficiency in protecting useful BBs. Therefore, the authors propose a variable size and shape schema theory, specifically designed to assess the life span of regressors over generations. In this paper, the theorems presented below are used to monitor the behavior of the fuzzy controller, by computing the expected number of encapsulated regressor instances, at generation x+1, using only information available at generation x (section 6). The computations are performed without explicitly constructing all possible offspring, and are meant to reveal the efficiency of encapsulation during the evolutionary process. A regressor may be passed on to the offspring in one of two ways. Firstly, if one of the parents, T, features the targeted regressor R, and the selected cut point is p ∈ T \ R, then the BB will be inherited by at least one of the two offspring. Secondly, if the crossover operator selects convenient nodes in both parents, then the desired BB would be reconstructed in the offspring solutions by recombining the resulting subtrees. The authors suggest representing a regressor R by means of an extended set S(R) made up of unique pairs containing the names of the terminals inside R and their frequencies. For example, R2 in T1 (Fig. 1) is denoted by set S(R2) = {(u(k), 1), (u(k – 1), 2), (y(k – 1), 1)}. The estimated number of schema H samples, featured by the offspring solutions, computed at generation t and noted α(H, t) is the sum of two components, one referring to preservation, αpres(H, t) and the other to the join effect, αjoin(H, t):

$$ \alpha(H, t) = \alpha_{pres}(H, t) + \alpha_{join}(H, t). \tag{4} $$
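The extended set S(R) described above behaves like a multiset of terminal names. A minimal sketch, assuming Python's `collections.Counter` as the container (a choice made here for illustration, not one stated in the paper), shows why two regressors with the same terminals but different shapes compare as equal:

```python
from collections import Counter

def extended_set(terminals):
    """Extended set S(R): terminal name -> number of occurrences inside R."""
    return Counter(terminals)

# S(R2) for R2 in T1, as given in the text.
s_r2 = extended_set(["u(k)", "u(k-1)", "u(k-1)", "y(k-1)"])

# The same terminals listed in another order, standing for a differently
# shaped tree that encodes the same product (multiplication commutes).
s_r2_reshaped = extended_set(["y(k-1)", "u(k-1)", "u(k)", "u(k-1)"])

print(dict(s_r2))
print(s_r2 == s_r2_reshaped)
```

Because the comparison ignores ordering and tree shape, this representation needs no wildcard symbols to match differently shaped instances of one regressor.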

Let us assume that trees T1 and T2 in Fig. 1 are two parents affected by crossover, and that the regressor being traced is S(H) = {(y(k – 1), 1), (u(k – 1), 2)}. As T1 features an instance of S(H), namely R4, the targeted regressor may be passed on to at least one of the two offspring by selecting any of the following cut nodes within T1: 3, 11, 12, 10, 13, 9, paired up with any node in T2. T2 does not feature any instance of the schema, so

$$ \alpha_{pres}(H, t) = \sum_{\substack{i \in T_1 \setminus R_4 \\ j \in T_2}} p(i \mid T_1)\, p(j \mid T_2). $$

To identify the cut nodes that would yield subtrees representing S(H) fragments, which could be conveniently joined to form the targeted regressor within the offspring, all T1 and T2 “*” and terminal nodes are considered in sequence. Note that, for one particular cut point, the regressor rooted in the considered node as well as all the nesting ones will be analyzed. Relative to the considered example, αjoin(H, t) adds up to:

$$ \begin{aligned} & p(5|T_1)p(5|T_2) + p(5|T_1)p(12|T_2) + p(6|T_1)p(10|T_2) + p(7|T_1)p(5|T_2) + p(7|T_1)p(12|T_2) \\ & + p(8|T_1)p(10|T_2) + p(8|T_1)p(7|T_2) + p(8|T_1)p(11|T_2) + p(4|T_2)p(6|T_1) + p(4|T_2)p(10|T_1) \\ & + p(11|T_2)p(6|T_1) + p(11|T_2)p(10|T_1) + p(13|T_2)p(5|T_1) + p(13|T_2)p(7|T_1) + p(13|T_2)p(11|T_1) \\ & + \sum_{i \in T_2} p(i|T_2)\, p(4|T_1). \end{aligned} $$

By representing a schema using extended sets instead of trees that include wildcard symbols [3], the suggested approach effectively handles the case when the same combination of terminals is encoded by differently shaped regressors. Additionally, the proposed procedure considers all nesting regressors, each of a different size, for each potential cut node. Therefore, the described method can process hierarchical individuals of various dimensions and shapes. This allows the reformulation of the fixed size and shape schema theory [3], in the context of NLP compliant tree based structures, as indicated below.

Theorem 1. The expected number of schema H samples, featured by the offspring trees, under cut point crossover, due to preservation, αpres(H, t), is:

$$ \alpha_{pres}(H, t) = \sum_{T} p(T, t)\, \delta(H \in Rg(T)) \sum_{i \in T \setminus \{+\}} p(i \mid T, t)\, \delta(i \in N(T \setminus H)), \tag{5} $$

where T is a tree index iterated throughout the entire population, p(T, t) marks the probability of selecting tree T to become a parent, p(i|T, t) represents the probability of selecting cut node i inside tree T, Rg(T) denotes the set of regressors featured by T, S(R) stands for the set associated to regressor R, N(T) returns the set of nodes within tree T, δ() is a predicate that returns the truth value of its argument, and t symbolizes the current generation.

Proof. Let χ(T, i, t) = δ(H ∈ Rg(T)) δ(i ∈ N(T \ H)) be a Bernoulli distributed random variable, as it can only assume two values: 1 in case schema H is passed on from T to one of the offspring (success), and 0 otherwise (failure). The expected value of χ(T, i, t) equals the probability of success, namely αpres(H, t). Therefore,

$$ E(\chi(T, i, t)) = \sum_{T} \sum_{i \in T \setminus \{+\}} \chi(T, i, t)\, p(T, i, t) = \alpha_{pres}(H, t). $$

The event of selecting a specific cut node is conditioned by the one of selecting its containing tree, whereas the selections of two different trees in the population are independent events: p(T, i, t) = p(i | T, t) p(T, t). Equation (5) follows by substituting the last equation into the one before it.

Theorem 2. The expected number of schema H samples, featured by the offspring trees, under cut point crossover, due to join effect, αjoin(H, t), is:

$$ \alpha_{join}(H, t) = \sum_{T_1, T_2} p(T_1, t)\, p(T_2, t) \sum_{\substack{i \in T_1 \setminus \{+\} \\ j \in T_2 \setminus \{+\}}} p(i \mid T_1, t)\, p(j \mid T_2, t) \sum_{k=0}^{K} \delta\big( S(Rg(T_1, i, k)) - S(Rg(T_1, i)) + S(Rg(T_2, j)) = H \big), \tag{6} $$

where Rg(T, i) denotes the regressor rooted in i within tree T, Rg(T, i, k) refers to the k-nesting regressor of the one rooted at node i, namely the regressor whose root is situated k levels closer to the tree root than the level of node i. For example, relative to T1 in Fig. 1, R8 is the 0-level nesting regressor for itself and R6 is the 1-level nesting regressor of R8. The proof for the theorem above is constructed similarly to the one for Theorem 1 (5). The suggested mathematical framework is general, valid for any type of structural crossover. It may be used to monitor the survival rate of encapsulated regressors in order to outline the efficiency of the enhanced crossover operator. High values for αpres(H, t) and αjoin(H, t), computed for regressors with strong encapsulation probabilities, would indicate a success scenario. More specifically, extended sets are built for all encapsulated regressors featured by the nondominated trees, at certain generations. By employing (5) and (6), the transmission rates due to preservation and join effect are computed, as shown in the following section.
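To make the two theorems more concrete, the sketch below evaluates the preservation term (5) on a toy population and implements the predicate inside the innermost sum of (6) with multiset arithmetic. The dictionary layout, the node numbers and all probabilities are hypothetical; only the formulas follow the text, and a tree is assumed to hold at most one instance of the schema.

```python
from collections import Counter

def alpha_pres(population, schema):
    """Expected schema samples due to preservation, following (5)."""
    total = 0.0
    for tree in population:
        # delta(H in Rg(T)): look for an instance of the schema in this tree.
        instance = next((nodes for s, nodes in tree["regressors"] if s == schema),
                        None)
        if instance is None:
            continue
        # Sum p(i|T) over candidate cut nodes outside the instance, N(T \ H).
        outside = sum(p for node, p in tree["cut_prob"].items()
                      if node not in instance)
        total += tree["p_select"] * outside
    return total

def join_matches(nesting, removed, inserted, schema):
    """Predicate of (6): S(Rg(T1,i,k)) - S(Rg(T1,i)) + S(Rg(T2,j)) = H."""
    return (nesting - removed) + inserted == schema

H = Counter({"y(k-1)": 1, "u(k-1)": 2})

def toy_population(cut_prob):
    # Tree "A" holds one instance of H on nodes 3, 4, 5; tree "B" holds none.
    return [
        {"p_select": 0.5, "cut_prob": cut_prob, "regressors": [(H, {3, 4, 5})]},
        {"p_select": 0.5, "cut_prob": {1: 0.5, 2: 0.5}, "regressors": []},
    ]

uniform = {node: 0.2 for node in range(1, 6)}
# Hypothetical guided distribution: nodes inside the instance are less likely.
guided = {1: 0.35, 2: 0.35, 3: 0.1, 4: 0.1, 5: 0.1}

print(alpha_pres(toy_population(uniform), H))
print(alpha_pres(toy_population(guided), H))

# Join effect: replacing a u(k)*u(k-1) fragment by a u(k-1) subtree yields H.
nesting = Counter({"u(k)": 1, "u(k-1)": 2, "y(k-1)": 1})
print(join_matches(nesting, Counter({"u(k)": 1, "u(k-1)": 1}),
                   Counter({"u(k-1)": 1}), H))
```

In this toy setting, lowering the cut probabilities inside the instance raises the preservation term, which is the effect the encapsulation guided crossover is meant to produce.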

6 Application

The suggested memetic evolutionary algorithm, FCE-STRM-MEA, featuring fuzzy controlled encapsulation, monitored via a schema theory extension for regressive models, has been deployed to obtain a model for the steam subsection of the evaporation station within the sugar factory of Lublin, Poland. The targeted industrial plant represents a complex nonlinear system, with two distinct representative data sets obtained experimentally: one used for training, and the other for validation. Two versions of the algorithm were considered, both evolving a population of 50 individuals over 50 generations. Both alternatives feature the encapsulation mechanism described in section 4, yet the first employs a crossover operator that takes the encapsulation probabilities into account when selecting the cut nodes, whilst the second does not. The selection of a given cut node has a probability of p = 1 – P[enc], in the first case, and is uniformly distributed, p = 1/Nr(T), in the second case. The results in Table 1 show the mean values of αpres(H, t) (5) and αjoin(H, t) (6), computed for all regressors with the same encapsulation label, considering both versions of the identification method. The analysis is carried out on the trees of the nondominated set at each of the sampled generations, the first 30 of them having been used to adaptively configure the fuzzy controller parameters. The mean squared error values obtained by employing the encapsulation guided crossover operator decrease over the generations. That indicates that the regressors classified by the fuzzy controller as well adapted have a high survival rate. This effect is confirmed by increased transmission probabilities due to preservation. When the classic version of the crossover operator is applied, the mean squared error values fluctuate, while the survival rate of most regressors, regardless of their encapsulation probability, varies insignificantly around the same value. That is most likely a sign that traditional crossover is less efficient in encouraging the survival of useful BBs. Note that, in most cases, αjoin(H, t) is slightly lower in the case of the enhanced crossover operator, as most fragments that could be spliced to form a certain BB are probably part of an encapsulated regressor as well. However, the high survival rate of the fittest BBs does not have a negative impact on the diversity of the nondominated set. Hence, the most accurate solution extracted from the first order Pareto set, at generation 50, shows appropriate generalization capacities with a mean relative error of 0.98% over the validation data set (Fig. 2).

Table 1. Expected regressors samples number. Enh/Std: enhanced/standard crossover, rg: regressor number, Penc: encapsulation probability, Gen: generation, acc: mean accuracy, αpres/αjoin: mean expected samples number for all regressors with the given Penc

Gen acc [enh/std] | 30 [1.32/2.03] | 35 [0.98/3.05] | 40 [0.85/2.96] | 45 [0.71/3.21] | 50 [0.52/2.87]
Penc   Enh/Std    | αpres αjoin rg | αpres αjoin rg | αpres αjoin rg | αpres αjoin rg | αpres αjoin rg
0      Enh        | 0.42  0.63   7 | 0.16  0.10   3 | 0.11  0.07   2 | 0.05  0.01   1 | 0.00  0.00   0
0      Std        | 0.53  0.51  12 | 0.92  0.45   7 | 1.67  0.45   8 | 0.43  0.44   9 | 0.45  0.34  12
0.5    Enh        | 0.32  0.45   8 | 0.23  0.14   2 | 0.12  0.11   1 | 0.12  0.03   1 | 0.00  0.00   0
0.5    Std        | 0.56  0.44   8 | 0.67  0.56  11 | 0.87  0.87   7 | 0.00  0.00   0 | 0.55  0.61  10
0.8    Enh        | 0.78  0.76   6 | 0.54  0.24   5 | 0.55  0.45   5 | 0.45  0.23   3 | 0.34  0.12   2
0.8    Std        | 0.12  0.06   3 | 0.12  0.66   6 | 0.33  0.07   2 | 0.39  0.78   2 | 0.00  0.00   0
0.9    Enh        | 1.56  0.75  12 | 1.89  1.54  13 | 1.78  1.08   7 | 1.02  0.78   7 | 1.23  0.12   8
0.9    Std        | 0.34  0.45   3 | 0.02  0.12   2 | 0.51  0.88   2 | 0.56  0.12   2 | 0.12  0.34   1
1      Enh        | 2.76  1.45  10 | 3.06  1.03   9 | 2.98  0.87  13 | 3.25  0.56  11 | 2.99  0.34  12
1      Std        | 0.12  0.34   5 | 0.87  0.45   1 | 0.68  0.55   2 | 0.43  0.32   0 | 0.34  0.55   2

[Figure: validation data set; process output and model output plotted against time [sec*0.01]]

Fig. 2. Generalization capacity of the most accurate solution from the final nondominated set

7 Conclusions

The authors suggest a novel identification tool for complex nonlinear systems, based on genetic programming and enhanced to better fit the specifics of engineering problems. The first key feature of the described algorithm is a fuzzy controlled encapsulation mechanism aimed at estimating the degree of regressor adaptation. The second important enhancement consists in a customized crossover operator with a specific cut point selection procedure. Identifying potentially useful building blocks by means of similarity analysis and encouraging their survival via encapsulation enhance the method’s exploitation capacities, increasing the overall speed in locating the final Pareto front. The exploratory power of the proposed tool is favored by selecting cut nodes from the less adapted subtrees of the considered parents, thus maintaining population diversity. The balance between exploration and exploitation is achieved dynamically by hybridizing the evolutionary process with a fuzzy controller whose parameters are configured in an unsupervised manner. In order to assess the performance of the upgraded crossover operator in protecting well adapted regressors from division during reproduction, this paper introduces a variable size and shape schema theory tailored to suit the specifics of the regressive tree representation. The novel schema theory allows the computation of BB survival rates, results that may be used to estimate the life span of regressors, thus evaluating the quality of the encapsulation process. Experiments conducted within the framework of a real life industrial system show that encapsulation guided crossover is superior to the classic version with respect to the survival of well adapted BBs. The proposed variable size and shape schema theory is used merely as a validation framework for the performances of the encapsulation guided crossover operator. Future research will directly include schema theory results within the encapsulation process, and consider additionally employing the mutation operator.

References

1. Fogel, D.B.: Evolutionary Computation – Towards a new Philosophy of Machine Intelligence, 3rd edn. IEEE Press Series on Computational Intelligence. IEEE Press, Los Alamitos (2006)
2. Nelles, O.: Nonlinear System Identification – From Classical Approaches to Neural Networks and Fuzzy Models. Springer, Heidelberg (2001)
3. Poli, R., McPhee, N.F.: General Schema Theory for Genetic Programming with Subtree-Swapping Crossover: Part 2. Evol. Comp. 11(2), 169–206 (2003)
4. Patelli, A., Ferariu, L.: Dynamic Fuzzy Controlled Regressor Encapsulation in Evolving Nonlinear Models. In: 14th Int. Conf. on System Theory and Control, pp. 373–378 (2010)
5. Patelli, A., Ferariu, L.: Increasing Crossover Operator Efficiency in Multiobjective Nonlinear Systems Identification. In: Proc. of IEEE Intelligent Systems Conference, pp. 426–431 (2010)
6. Coello Coello, C.A., Lamont, G.B., Van Veldhuizen, D.A.: Evolutionary Algorithms for Solving Multi-Objective Problems. Springer, Heidelberg (2007)
7. De Jong, K.A.: Evolutionary Computation – A Unified Approach. MIT Press, Cambridge (2006)
8. Ferariu, L., Patelli, A.: Multiobjective Genetic Programming for Nonlinear Systems Identification. In: Kolehmainen, M., Toivanen, P., Beliczynski, B. (eds.) ICANNGA 2009. LNCS, vol. 5495, pp. 233–242. Springer, Heidelberg (2009)
9. Rachmawati, L., Srinivasan, D.: Multiobjective Evolutionary Algorithm with Controllable Focus on the Knees of the Pareto Front. IEEE Trans. Evol. Comp. 13(4), 810–824 (2009)
10. Adra, S.F., Dodd, T.J., Griffin, I.A., Fleming, P.J.: Convergence Acceleration Operator for Multiobjective Optimization. IEEE Trans. on Evol. Comp. 13(4), 825–847 (2009)
11. Deb, K.: Multiobjective Optimization Using Evolutionary Algorithms. Wiley & Sons, Chichester (2001)
12. Van Veldhuizen, D.A., Lamont, G.B.: Multiobjective Optimisation with Messy Genetic Algorithms. In: Proc. of the 2000 ACM Symposium on Applied Computing, p. 470 (2000)

A Study on Population’s Diversity for Dynamic Environments

Anabela Simões 1,2, Rui Carvalho 1, João Campos 1, and Ernesto Costa 2

1 Coimbra Institute of Engineering, Polytechnic Institute of Coimbra, Rua Pedro Nunes - Quinta da Nora, 3030-199 Coimbra, Portugal
2 Centre for Informatics and Systems, University of Coimbra, Polo II, Pinhal de Marrocos, 3030-290 Coimbra, Portugal
{abs,a21160212,a21160025}@isec.pt, [email protected]

Abstract. The use of mechanisms that generate and maintain diversity in the population was always seen as fundamental to help Evolutionary Algorithms to achieve better performances when dealing with dynamic environments. In the last years, several studies showed that this is not always true and, in some situations, too much diversity can hinder the performance of the Evolutionary Algorithms dealing with dynamic environments. In order to have more insight about this important issue, we tested the performance of four types of Evolutionary Algorithms using different methods for promoting diversity. All the algorithms were tested in cyclic and random dynamic environments using two different benchmark problems. We measured the diversity of the population and the performances obtained by the algorithms and important conclusions were obtained. Keywords: Evolutionary Computation, Dynamic Optimization, Diversity.

A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part I, LNCS 6593, pp. 300–309, 2011. © Springer-Verlag Berlin Heidelberg 2011

1 Introduction

Evolutionary Algorithms (EAs) have been successfully used in a wide area of applications. Traditionally, EAs are well-suited to solve problems where the environment is static. The generational process of evolution in the standard EA often leads to the premature convergence of the population which is not suitable for dynamic optimization problems (DOP). Therefore, when dealing with DOP, some improvements have been proposed as extensions of the classical EA: (1) maintaining diversity, (2) using memory schemes, (3) using multi-populations or (4) anticipating the changes in the environment. In this paper we are interested in the first type of methods. The promotion of diversity was always considered as beneficial to EAs to cope with dynamic environments. This makes sense, for if we have individuals in different regions of the search space, when a change occurs, the probability of having an individual close to the new optimum is higher. Different schemes have been proposed under this topic: to restart the EA from scratch every time a change is detected [8], re-initialization using memorized solutions [11], hypermutation [5], self adaptation of the mutation rate [2], using random immigrants [7] or combining the concepts of elitism and random immigrants [17]. Moreover, the use of sentinels in different regions of the search space [10] and the application of ideas from immune system [12] are other approaches used to maintain the population’s diversity. However, different research has shown that, in some circumstances, it is not necessary to have too much diversity in the population [1], [3], [13], [17]. These studies show that, in some situations, a high diversity level can be detrimental to the performance of the EAs. Nevertheless, these studies do not provide an exhaustive investigation about this issue, since the diversity is not measured in a consistent and complete manner and also, because the characteristics of the algorithms and the dynamics of the environments are not analyzed. The goal of this paper is to explore the real importance of diversity and its relation with the performance of different EAs. We are interested in studying the impact of diversity on two types of algorithms: memory-based and immigrants-based EAs. The promotion of diversity can be achieved by changing the genetic operators rate or using different recombination operators. In this study, different levels of diversity were obtained using three different recombination operators: uniform crossover, transformation and conjugation. This paper investigates and analyzes the relation between the diversity of the population and the performance of different EAs in cyclic and random dynamic environments using two benchmark problems. The remainder of the paper is organized as follows. The next section briefly reviews the memory-based and immigrant-based algorithms used in our experimental study. Section 3 describes the used genetic operators. Section 4 explains the dynamic optimization problem (DOP) generator proposed in [17] and used in our experiments. The parameter settings and the experimental results and analysis are presented in section 5. Finally, in Section 6 relevant conclusions are presented and discussed.

2 Evolutionary Algorithms

In this work we used two types of memory-based EAs and two different memoryless and immigrant-based EAs.

The Memory-Immigrants Genetic Algorithm (MIGA) [17] evolves a population of individuals in the standard evolutionary way: selection, crossover and mutation. Additionally, a memory is initialized randomly and used to store the current best individual of the population, replacing one of the initially randomly generated individuals, if it exists, or replacing one memory individual, selected according to a replacing scheme, if it is better. Every generation, the best individual in the memory is used to create a percentage ri of new individuals, called immigrants that are introduced into the population replacing the worst ones. These new individuals are created mutating the best solution in memory using a chosen mutation rate pi. When a change is detected it is expected that the diversity introduced in the population by adding these immigrants can help the EA to readapt to the new conditions.

The Memory-Enhanced Genetic Algorithm (MEGA) [17] is an adaptation of Branke’s memory-based algorithm [4] and evolves a population of individuals through the application of selection, crossover and mutation. Additionally, a randomly initialized memory is used to store good solutions and to detect changes in the environment. When a change is detected, the memory is merged with the population and the best p (population size) individuals are used to form the new population, while memory remains unchanged.

The Random Immigrants Genetic Algorithm (RIGA) was proposed by [7] and also uses a standard EA modified by the introduction of immigrants during the evolutionary process. Every generation, a percentage ri of random individuals is created (immigrants) and replaces the same number of individuals of the current population. In our implementation we replace the worst individuals of the population.

The Elitism-based Immigrants Genetic Algorithm (EIGA) [17] is a standard EA that combines the concepts of elitism and random immigrants. Every generation, before applying the usual genetic operators, the best individual from the previous generation (elite) is used to create immigrants via mutation, using a fixed probability. The number of elitism-based immigrants is controlled by the percentage ri and these immigrants are introduced into the current population, replacing the worst individuals.
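The immigrant steps of RIGA and EIGA described above can be sketched as follows. This is a simplified illustration rather than the authors' implementation: the bit-string encoding, the function names, the OneMax fitness and the parameter values are assumptions, and the surrounding selection, crossover and mutation loop is omitted.

```python
import random

def mutate(bits, rate, rng):
    """Flip each bit independently with the given probability."""
    return [b ^ 1 if rng.random() < rate else b for b in bits]

def replace_worst(population, fitness, immigrants):
    """Keep the best individuals and overwrite the worst ones with immigrants."""
    ranked = sorted(population, key=fitness, reverse=True)
    keep = len(population) - len(immigrants)
    return ranked[:keep] + immigrants

def random_immigrants(population, fitness, ratio, rng):
    """RIGA-style step: ratio * p random individuals replace the worst ones."""
    count = int(ratio * len(population))
    length = len(population[0])
    immigrants = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(count)]
    return replace_worst(population, fitness, immigrants)

def elitism_immigrants(population, fitness, elite, ratio, rate, rng):
    """EIGA-style step: immigrants are mutated copies of the previous elite."""
    count = int(ratio * len(population))
    immigrants = [mutate(elite, rate, rng) for _ in range(count)]
    return replace_worst(population, fitness, immigrants)

# Usage with a hypothetical OneMax fitness (number of ones in the string).
rng = random.Random(0)
population = [[rng.randint(0, 1) for _ in range(8)] for _ in range(10)]
elite = max(population, key=sum)

after_riga = random_immigrants(population, sum, ratio=0.2, rng=rng)
after_eiga = elitism_immigrants(population, sum, elite, ratio=0.2, rate=0.1, rng=rng)
print(len(after_riga), len(after_eiga))
```

Both steps keep the population size constant; they differ only in where the immigrants come from, which is what drives the different diversity levels discussed in the paper.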

3 Genetic Operators

In order to have different levels of diversity in the population we kept constant all the components of the different EAs and changed the recombination operator. Three different operators were used to create diversity: uniform crossover, transformation and conjugation.

Uniform crossover: uniform crossover is probably the most widely used crossover operator because of its efficiency in not only identifying, inheriting and protecting common genes, but also in re-combining non-common genes [16]. After selecting two parents, the bits are randomly copied from the first or from the second parent to the offspring.

Transformation: in biology, it consists of the transfer of genetic material between organisms by means of extracellular pieces of DNA [6]. The computational approach of this operator was proposed in [12]. At the beginning of the process, a pool of segments of different random sizes is randomly created. Each segment can be selected to be incorporated into the individuals. Transformation works as follows: after selecting the parents, using the chosen selection method, they are transformed individually with the segments, with a fixed probability. Besides changing the individuals of the population, the pool of segments is also updated: a percentage (αt) of the segments is modified using the genome of the individuals of the population and the remaining segments are randomly created. The transformation of each selected individual follows three steps: (1) randomly select a segment from the segment pool; (2) randomly choose a point of transformation; (3) incorporate the segment in the genome of the individual, replacing the same number of genes.

Conjugation: bacterial conjugation consists of the transfer of genetic material between bacteria through cell-to-cell contact. To make conjugation possible, two bacterial cells must come together and a cytoplasmic bridge called pilus is built allowing the transfer of genetic material from the donor cell to the recipient cell [6]. Computational conjugation tries to mimic the biological mechanism and was already used by Smith [15]. In this paper we use a different version of computational conjugation proposed in [14]. In this approach, the donor and the recipient cells are not chosen at random but selected according to their fitness: for a population of size p, the p/2 best individuals become the ‘donors’ while the remaining become the ‘recipients’. Then, using a fixed probability, the ith donor transfers part of its genetic material to the ith recipient (i = 1, ..., p/2). Following that, all offspring created by this process are mutated and joined with the donor individuals, becoming the next population of size p.
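The three recombination operators can be sketched on bit strings as below. Several details are assumptions made for illustration, since the text does not fix them: the genome is treated as circular in transformation, the transferred material in conjugation is one contiguous segment between two random points, the population size is taken to be even, and the final mutation of the offspring is left out.

```python
import random

def uniform_crossover(parent_a, parent_b, rng):
    """Each offspring bit is copied at random from one of the two parents."""
    return [a if rng.random() < 0.5 else b for a, b in zip(parent_a, parent_b)]

def transform(individual, segment_pool, rng):
    """Transformation, following steps (1)-(3) in the text."""
    segment = rng.choice(segment_pool)           # (1) select a segment
    point = rng.randrange(len(individual))       # (2) choose a transformation point
    child = list(individual)
    for offset, gene in enumerate(segment):      # (3) replace the same number of genes
        child[(point + offset) % len(child)] = gene  # circular genome (assumption)
    return child

def conjugate(donor, recipient, rng):
    """The donor copies one contiguous segment into the recipient (assumption)."""
    start = rng.randrange(len(donor))
    end = rng.randrange(start, len(donor)) + 1
    return recipient[:start] + donor[start:end] + recipient[end:]

def conjugation_generation(population, fitness, prob, rng):
    """Best half donates to the other half; donors survive unchanged.

    Assumes an even population size; offspring mutation is omitted here.
    """
    ranked = sorted(population, key=fitness, reverse=True)
    half = len(ranked) // 2
    donors, recipients = ranked[:half], ranked[half:2 * half]
    offspring = [conjugate(d, r, rng) if rng.random() < prob else list(r)
                 for d, r in zip(donors, recipients)]
    return donors + offspring

rng = random.Random(42)
print(uniform_crossover([1] * 8, [0] * 8, rng))
print(transform([0] * 8, [[1, 1, 1]], rng))
print(conjugate([1] * 8, [0] * 8, rng))
```

Uniform crossover mixes two parents gene by gene, whereas transformation and conjugation inject whole segments, so they disturb (or preserve) longer runs of genes at once.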


4 Dynamic Test Environments

The dynamic environments used in our experiments were created with Yang’s Dynamic Optimization Problems (DOP) generator [17]. This generator allows different dynamic environments to be constructed from any binary-encoded stationary function using the bitwise exclusive-or (XOR) operator. The characteristics of the change are controlled by two parameters: the speed of the change, r, which is the number of generations between two changes, and the magnitude of the change, ρ, which controls how different the new environment is from the previous one. The DOP generator can construct three types of dynamic environments: cyclic, cyclic with noise and random. In this work we constructed two types of environments: cyclic and random¹. For each type of DOP, the parameter r was set to 10 and 50. The ratio ρ was set to different values in order to test different levels of change: 0.1 (a light shift), 0.2, 0.5 and 1.0 (a severe change).

The selected benchmark problems were the dynamic Knapsack problem and the Royal Road F1 function. The Knapsack problem is an NP-complete combinatorial optimization problem which consists of selecting a number of items to place in a knapsack with limited capacity. Each item has a value and a weight, and the objective is to choose the items that maximize the total value without exceeding the capacity of the bag. The initial values, weights and capacity were created using strongly correlated sets of randomly generated data. The fitness of an individual using binary representation is equal to the sum of the values of the selected items, if the weight limit is not exceeded. If too many items are selected, then the fitness is penalized in order to ensure that invalid individuals are distinguished from the valid ones.

The Royal Road functions [9] consist of a list of partially specified bit strings (schemas) si in which ‘*’ denotes a wild card (i.e., allowed to be either 0 or 1). A bit string x is said to be an instance of a schema s, i.e., x ∈ s, if x matches s in all non-‘*’ positions. Each schema si contributes a coefficient ci which is equal to the schema’s order, i.e., ci = o(si). The order of a schema si is the number of defined bits in si. In this work we use the Royal Road F1 (RRF1), with ci = 8 for all si, and i = 1, ..., 8.

Both problems, Knapsack and the Royal Road function, were transformed from static to dynamic using the DOP generator described above. A total of 16 DOPs was tested: two types of environment and two values of r, combined with four values of ρ. Those 16 DOPs were used to run four different algorithms with three recombination operators on two benchmarks, giving a total of 16 × 4 × 3 × 2 = 384 different situations tested in this work.
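As an illustration of the two ingredients described in this section, the following Python sketch evaluates the Royal Road F1 function on 64-bit strings and applies an XOR mask to make it dynamic. It is a sketch under stated assumptions rather than the generator of [17]: the function names are hypothetical, only the random type of environment is shown, and the change is modelled by flipping ρ × l randomly chosen positions of the current mask every r generations.

```python
import random


def royal_road_f1(bits, block=8):
    """Add c_i = 8 for every schema (block of 8 consecutive bits) fully set to 1."""
    return sum(block for i in range(0, len(bits), block)
               if all(bits[i:i + block]))


def next_mask(mask, rho, rng):
    """Environmental change: flip rho * l randomly chosen positions of the mask."""
    flips = rng.sample(range(len(mask)), int(rho * len(mask)))
    new_mask = list(mask)
    for i in flips:
        new_mask[i] ^= 1
    return new_mask


def dynamic_fitness(static_fitness, bits, mask):
    """Evaluate the stationary function on the individual XOR-ed with the mask."""
    return static_fitness([b ^ m for b, m in zip(bits, mask)])


# Example: the all-ones optimum of the static problem loses all its
# fitness after a change of magnitude rho = 1.0.
rng = random.Random(1)
mask = [0] * 64
optimum = [1] * 64
before = dynamic_fitness(royal_road_f1, optimum, mask)
mask = next_mask(mask, 1.0, rng)
after = dynamic_fitness(royal_road_f1, optimum, mask)
```

With ρ = 1.0 every bit of the mask is flipped, so the previous optimum becomes the worst string; with ρ = 0.1 only about six of the 64 positions change, which corresponds to the light shift mentioned above.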

5 Experimental Study

5.1 Experimental Design

For the experiments, we used standard parameters for all EAs: generational replacement with elitism of size one, tournament selection with tournament of size two, recombination (uniform crossover, conjugation or transformation) with probability pc = 0.7 and flip mutation with probability pm = 0.01. Binary representation was used, with chromosomes of size 100 for the Knapsack and 64 for the Royal Road F1 function. The probability of mutation pi used to create the immigrants of MIGA and EIGA was set to 0.01.

¹ The noisy environments were not included in this paper due to lack of space.


The ratio of immigrants ri for MIGA, RIGA and EIGA was set to 0.2. The percentage αt of segments generated at random each generation when using transformation was set to 0.5. The gene segment pool size used in transformation was set to 50. In order to have the same number of function evaluations per generation, the global number of individuals (n) was set as follows: MEGA used n = 100; MIGA, RIGA and EIGA used n = 100/(1 + ri), since in these algorithms the ri × n immigrants were also evaluated every generation. So, for MIGA, EIGA and RIGA we used n = 83. The memory size for MEGA and MIGA was set to m = 0.2 × n. The generational replacing strategy proposed in [13] was used in all memory-based EAs.

For each experiment of an algorithm, 30 runs were executed. Each algorithm was run for a number of generations corresponding to 200 environmental changes. The overall performance used to compare the algorithms was the best-of-generation fitness averaged over 30 independent runs, executed with the same random seeds. The diversity of the population was computed every generation using the following equation:

$Div(k, t) = \frac{1}{l\,p\,(p-1)} \sum_{i=1}^{p} \sum_{j \neq i}^{p} HD(I_i, I_j)$,

where k is the index of the run, t is the generation, p is the population size, l is the chromosome length and HD(I_i, I_j) is the Hamming distance between individuals i and j. The final diversity, presented in the next section, corresponds to the average of the diversity over the 30 runs.

5.2 Results on Cyclic Environments

The overall performance obtained by the EAs on cyclic dynamic environments is presented in Table 1. The overall diversity maintained by the different genetic operators in each EA is reported in Table 2. Table 3 reports the related statistical results of comparing the different methods. These tables use the following notation: Tf for transformation, Cx for uniform crossover and Cj for conjugation. The statistical validation was made using the nonparametric Friedman test at the 0.01 level of significance.
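Two computations from the experimental design of Section 5.1, the population sizing of the immigrant-based algorithms and the diversity measure, can be sketched in Python as follows. The function names are hypothetical, and the normalisation by l × p × (p − 1) is an assumption chosen so that a uniformly random binary population has a diversity close to 0.5, which agrees with the largest values reported for transformation.

```python
from itertools import combinations


def immigrant_population_size(budget, ratio):
    """n = budget / (1 + r_i): n individuals plus r_i * n immigrants fit the budget."""
    return int(budget / (1 + ratio))


def hamming(a, b):
    """Number of positions in which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


def diversity(population):
    """Mean pairwise Hamming distance, normalised by the chromosome length l."""
    p, l = len(population), len(population[0])
    # Each unordered pair appears twice in the double sum over i and j != i.
    total = 2 * sum(hamming(a, b) for a, b in combinations(population, 2))
    return total / (l * p * (p - 1))


n = immigrant_population_size(100, 0.2)      # evaluation budget of 100, r_i = 0.2
converged = diversity([[0, 1, 1, 0]] * 5)     # identical individuals
opposite = diversity([[0, 0], [1, 1]])        # maximally different pair
```

Under this normalisation the measure lies in [0, 1]: it is 0 for a fully converged population and 1 for two complementary individuals.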
The multiple pairwise comparisons were performed using the Nemenyi procedure at the 0.01 level of significance with Bonferroni correction. When comparing two algorithms, the notation used is “+”, “−”, “++” or “−−”, respectively, when the first algorithm was better than, worse than, significantly better than, or significantly worse than the second one. We were not concerned with comparing the performance of the different EAs against each other; such comparisons can be found in [17] and [14]. Our main goal was to see whether, for the same algorithm, a relation between the diversity of the population and the performance of the algorithm could be found.

Results show that, for all problems, the operator that kept the highest diversity was transformation, while conjugation maintained the lowest diversity. The diversity obtained by crossover and conjugation was similar for r = 50, mainly for the RRF1 problem and in MIGA. The diversity had similar values, independently of the severity of the change (ρ). For larger values of r the level of diversity decreased, since the population had more time to converge.

The results show that, for both problems, the memory-based EAs obtained the best results using conjugation, which maintained the lowest level of diversity. On the other hand, those EAs with transformation attained the worst results, corresponding to the highest diversity of the population. The retrieval of the memorized information associated with some exploration


(moderate/lower diversity) of the search space ensured the best performance of this type of EA. The results obtained show that, at the beginning of the run, transformation allowed a better performance of MIGA, and after that, the EA’s performance was improved using conjugation. This happened because, at the beginning, the memory individuals corresponded to random points or to solutions that were not completely optimized. Later, memory kept the best solutions, which made it possible to guide the immigrants even with lower diversity. Generating too much diversity in these cases was damaging because it continued disrupting a population that had already found some “good” areas to explore and exploit by mutation. Note that the studied memory-based algorithms are appropriate for cyclic environments. The retrieval of information from memory helps the algorithm’s re-adaptation when a change happens. The genetic operators that introduce the lowest diversity are useful to continue the search after the retrieval of the memory individuals. So, if the memory stores the appropriate information to introduce into the population when a change is detected, mechanisms that promote higher diversity are disruptive, slowing the search for the optimum.

The immigrant-based EAs under cyclic environments achieved higher performance with the mechanisms that generated higher diversity (crossover and transformation). Conjugation (lower diversity) never allowed those EAs to obtain good performance. In general, for r = 10 transformation allowed the best scores, and for r = 50 the best marks were achieved using crossover. The absence of memory was critical for those algorithms, and extra diversity in the population improved the EA’s performance when a change was detected. More diversity in the population indicates that the solutions are covering different regions of the search space. So, when a change occurs, higher diversity is advantageous for these memory-less algorithms.

Table 1. Overall performance of the algorithms on cyclic environments

Knapsack problem
            MIGA                        MEGA                        RIGA                        EIGA
r    ρ      Cx       Tf       Cj        Cx       Tf       Cj        Cx       Tf       Cj        Cx       Tf       Cj
10   0.1    1778.28  1780.71  1782.78   1771.01  1773.46  1776.24   1766.74  1759.43  1763.81   1768.24  1769.76  1762.42
10   0.2    1778.75  1782.73  1785.09   1772.91  1776.03  1779.69   1758.15  1758.29  1748.20   1746.23  1765.83  1737.94
10   0.5    1784.94  1788.03  1791.18   1775.24  1777.12  1780.74   1744.42  1754.48  1717.45   1696.91  1758.09  1682.32
10   1.0    1790.82  1794.62  1796.83   1779.67  1779.90  1781.11   1727.99  1744.02  1688.59   1664.05  1749.96  1642.38
50   0.1    1804.45  1797.90  1808.80   1791.55  1779.20  1795.40   1795.57  1769.03  1792.52   1796.20  1783.21  1793.36
50   0.2    1809.42  1801.60  1815.30   1795.81  1788.90  1801.50   1779.93  1768.56  1779.11   1785.66  1779.59  1781.22
50   0.5    1813.27  1810.80  1820.40   1798.27  1793.60  1803.00   1769.18  1766.99  1764.37   1771.41  1775.70  1756.30
50   1.0    1817.33  1814.10  1823.60   1802.91  1799.20  1805.80   1759.61  1765.18  1753.23   1740.74  1774.45  1736.04

Royal Road Function F1
            MIGA                        MEGA                        RIGA                        EIGA
r    ρ      Cx       Tf       Cj        Cx       Tf       Cj        Cx       Tf       Cj        Cx       Tf       Cj
10   0.1    22.31    27.31    28.37     21.78    23.13    24.98     21.68    15.02    21.37     25.98    24.06    22.80
10   0.2    23.51    28.33    29.89     24.02    24.61    25.57     13.90    14.63    13.38     12.19    19.90    10.57
10   0.5    27.72    32.59    34.18     25.13    28.64    30.00     10.72    11.04    9.97      9.74     15.40    8.48
10   1.0    28.30    35.66    36.96     31.48    32.09    32.76     29.20    12.16    29.15     28.57    18.97    24.69
50   0.1    44.80    39.20    47.20     58.40    38.40    59.00     51.79    24.39    47.50     50.70    39.63    47.13
50   0.2    44.89    42.40    49.00     59.11    39.20    62.40     35.44    22.27    32.05     34.80    34.55    30.80
50   0.5    46.79    44.80    52.20     59.74    45.60    63.00     23.59    23.38    22.66     15.34    36.83    13.60
50   1.0    50.23    45.00    53.00     59.46    54.00    63.60     22.71    28.18    22.57     20.17    46.25    17.30

Table 2. Diversity of the population on cyclic environments

Knapsack problem
            MIGA               MEGA               RIGA               EIGA
r    ρ      Cx    Tf    Cj     Cx    Tf    Cj     Cx    Tf    Cj     Cx    Tf    Cj
10   0.1    0.22  0.46  0.14   0.36  0.49  0.15   0.38  0.49  0.16   0.09  0.42  0.05
10   0.2    0.21  0.46  0.12   0.37  0.49  0.16   0.39  0.49  0.20   0.09  0.43  0.05
10   0.5    0.19  0.46  0.12   0.37  0.49  0.16   0.41  0.49  0.23   0.10  0.43  0.06
10   1.0    0.19  0.45  0.12   0.34  0.47  0.15   0.42  0.49  0.27   0.10  0.44  0.06
50   0.1    0.12  0.34  0.10   0.30  0.39  0.20   0.32  0.49  0.09   0.06  0.41  0.03
50   0.2    0.11  0.33  0.10   0.31  0.39  0.20   0.33  0.49  0.11   0.07  0.41  0.04
50   0.5    0.12  0.33  0.09   0.32  0.39  0.19   0.35  0.49  0.13   0.08  0.42  0.04
50   1.0    0.12  0.32  0.10   0.32  0.38  0.19   0.37  0.49  0.15   0.08  0.42  0.05

Royal Road Function F1
            MIGA               MEGA               RIGA               EIGA
r    ρ      Cx    Tf    Cj     Cx    Tf    Cj     Cx    Tf    Cj     Cx    Tf    Cj
10   0.1    0.13  0.37  0.11   0.25  0.46  0.14   0.29  0.49  0.17   0.10  0.37  0.09
10   0.2    0.13  0.39  0.11   0.29  0.46  0.12   0.40  0.49  0.25   0.12  0.39  0.12
10   0.5    0.13  0.37  0.12   0.29  0.46  0.12   0.45  0.49  0.30   0.14  0.41  0.13
10   1.0    0.11  0.32  0.10   0.19  0.45  0.12   0.17  0.49  0.11   0.08  0.40  0.09
50   0.1    0.06  0.12  0.06   0.09  0.30  0.09   0.13  0.47  0.09   0.06  0.34  0.06
50   0.2    0.07  0.14  0.06   0.10  0.30  0.08   0.20  0.47  0.14   0.09  0.35  0.10
50   0.5    0.07  0.14  0.06   0.14  0.28  0.10   0.29  0.47  0.20   0.15  0.33  0.15
50   1.0    0.07  0.11  0.06   0.17  0.25  0.10   0.28  0.46  0.19   0.14  0.28  0.14

Table 3. Statistical results on cyclic environments

Memory-based EAs
                        Knapsack problem        Royal Road F1
r    Comparison   ρ ⇒   0.1   0.2   0.5   1.0   0.1   0.2   0.5   1.0
10   MIGA Cj - Cx       ++    ++    ++    ++    ++    ++    ++    ++
10   MIGA Cj - Tf       ++    ++    ++    ++    ++    ++    ++    ++
10   MIGA Tf - Cx       ++    ++    ++    ++    ++    ++    ++    ++
10   MEGA Cj - Cx       ++    ++    ++    ++    ++    ++    ++    ++
10   MEGA Cj - Tf       ++    ++    ++    ++    ++    ++    ++    +
10   MEGA Tf - Cx       ++    ++    ++    +     ++    ++    ++    ++
50   MIGA Cj - Cx       ++    ++    ++    ++    ++    ++    ++    ++
50   MIGA Cj - Tf       ++    ++    ++    ++    ++    ++    ++    ++
50   MIGA Tf - Cx       −−    −−    −−    −−    −−    −−    −−    −−
50   MEGA Cj - Cx       ++    ++    ++    ++    ++    ++    ++    ++
50   MEGA Cj - Tf       ++    ++    ++    ++    ++    ++    ++    ++
50   MEGA Tf - Cx       −−    −−    −−    −−    −−    −−    −−    −−

Immigrant-based EAs
                        Knapsack problem        Royal Road F1
r    Comparison   ρ ⇒   0.1   0.2   0.5   1.0   0.1   0.2   0.5   1.0
10   RIGA Cj - Cx       −−    −−    −−    −−    −     −     −     −
10   RIGA Cj - Tf       ++    −−    −−    −−    ++    −−    −−    ++
10   RIGA Tf - Cx       −−    +     ++    ++    −−    ++    ++    −−
10   EIGA Cj - Cx       −−    −−    −−    −−    −−    −−    −     −−
10   EIGA Cj - Tf       −−    −−    −−    −−    −−    −−    −−    ++
10   EIGA Tf - Cx       +     ++    ++    ++    −−    ++    ++    −−
50   RIGA Cj - Cx       −−    −     −−    −−    −−    −−    −−    −−
50   RIGA Cj - Tf       ++    ++    −−    −−    ++    −−    −−    −−
50   RIGA Tf - Cx       −−    −−    −−    ++    −     ++    ++    −−
50   EIGA Cj - Cx       −−    −−    −−    −−    −−    −−    −     −
50   EIGA Cj - Tf       ++    ++    −−    −−    ++    −−    −−    −−
50   EIGA Tf - Cx       −−    −−    ++    ++    −−    −     ++    ++

5.3 Results on Random Environments

The overall performance obtained by the EAs on random dynamic environments is presented in Table 4. Table 5 summarizes the overall diversity maintained by the different genetic operators. Table 6 reports the related statistical results of comparing the different methods. For random environments, the results of the memory-based EAs on the Knapsack problem were similar to those for cyclic environments. Conjugation allowed these methods to achieve the best results. Comparing the diversity levels, we see that conjugation maintained the lowest values. The results obtained on the RRF1 were slightly different. The overall diversity of crossover and conjugation was similar and the performance of the


Table 4. Overall performance of the algorithms on random environments

Knapsack problem
            MIGA                        MEGA                        RIGA                        EIGA
r    ρ      Cx       Tf       Cj        Cx       Tf       Cj        Cx       Tf       Cj        Cx       Tf       Cj
10   0.1    1770.43  1762.56  1772.42   1769.06  1758.63  1773.71   1774.91  1758.40  1772.78   1776.80  1770.23  1771.69
10   0.2    1764.28  1760.24  1765.56   1763.63  1757.29  1764.20   1764.66  1767.42  1760.64   1765.73  1765.88  1757.17
10   0.5    1757.35  1756.90  1758.75   1757.63  1755.92  1757.96   1754.49  1755.08  1738.60   1734.98  1760.13  1725.04
10   1.0    1788.13  1783.95  1796.48   1777.70  1777.73  1792.27   1726.77  1744.09  1691.81   1663.63  1749.60  1643.76
50   0.1    1792.48  1774.99  1793.31   1783.16  1766.38  1792.95   1790.79  1768.34  1786.86   1798.96  1782.81  1794.41
50   0.2    1783.44  1771.49  1785.92   1778.53  1765.45  1783.32   1782.12  1767.62  1778.18   1789.10  1778.70  1784.63
50   0.5    1772.24  1769.12  1775.28   1769.65  1764.73  1771.81   1771.46  1773.12  1766.42   1772.41  1774.93  1765.23
50   1.0    1817.22  1808.76  1819.31   1802.17  1793.14  1817.48   1759.71  1765.16  1753.31   1741.00  1774.47  1735.92

Royal Road Function F1
            MIGA                        MEGA                        RIGA                        EIGA
r    ρ      Cx       Tf       Cj        Cx       Tf       Cj        Cx       Tf       Cj        Cx       Tf       Cj
10   0.1    22.11    15.77    21.77     25.96    14.51    23.14     24.80    14.16    24.48     28.70    21.91    25.06
10   0.2    13.66    13.14    12.61     15.17    12.30    13.58     16.02    12.21    15.06     14.53    13.81    13.94
10   0.5    9.67     11.77    8.24      10.26    11.28    8.70      11.07    10.90    10.98     10.47    9.22     9.31
10   1.0    28.47    58.19    27.30     31.05    55.14    30.75     29.60    11.94    29.23     28.14    19.22    25.76
50   0.1    46.60    25.56    45.80     51.66    20.82    46.88     52.01    20.78    50.10     51.98    34.44    49.00
50   0.2    30.65    21.21    29.59     37.29    18.75    32.54     38.10    18.81    34.53     37.68    28.76    33.20
50   0.5    17.35    19.44    15.69     20.86    21.58    17.36     24.48    17.72    22.82     18.30    24.99    14.85
50   1.0    51.75    63.72    51.62     59.90    63.67    59.94     22.61    28.14    22.26     19.92    46.37    17.17

Table 5. Diversity of the population on random environments

Knapsack problem
            MIGA               MEGA               RIGA               EIGA
r    ρ      Cx    Tf    Cj     Cx    Tf    Cj     Cx    Tf    Cj     Cx    Tf    Cj
10   0.1    0.22  0.46  0.14   0.34  0.49  0.16   0.37  0.49  0.15   0.08  0.42  0.05
10   0.2    0.26  0.46  0.17   0.35  0.49  0.20   0.38  0.49  0.17   0.09  0.43  0.05
10   0.5    0.28  0.46  0.20   0.38  0.49  0.25   0.39  0.49  0.21   0.10  0.43  0.06
10   1.0    0.22  0.46  0.12   0.34  0.49  0.18   0.42  0.49  0.27   0.10  0.44  0.06
50   0.1    0.13  0.45  0.06   0.30  0.49  0.09   0.32  0.49  0.09   0.06  0.41  0.03
50   0.2    0.14  0.45  0.07   0.30  0.49  0.11   0.33  0.49  0.10   0.07  0.42  0.04
50   0.5    0.16  0.45  0.09   0.32  0.49  0.14   0.35  0.49  0.13   0.08  0.42  0.04
50   1.0    0.12  0.44  0.06   0.32  0.49  0.08   0.37  0.49  0.15   0.08  0.42  0.04

Royal Road Function F1
            MIGA               MEGA               RIGA               EIGA
r    ρ      Cx    Tf    Cj     Cx    Tf    Cj     Cx    Tf    Cj     Cx    Tf    Cj
10   0.1    0.16  0.44  0.13   0.19  0.48  0.13   0.26  0.49  0.15   0.09  0.39  0.09
10   0.2    0.21  0.45  0.17   0.25  0.49  0.17   0.37  0.49  0.23   0.12  0.40  0.11
10   0.5    0.22  0.46  0.19   0.28  0.49  0.19   0.45  0.49  0.30   0.14  0.41  0.14
10   1.0    0.12  0.33  0.11   0.17  0.44  0.13   0.17  0.49  0.11   0.09  0.40  0.08
50   0.1    0.07  0.40  0.06   0.09  0.47  0.08   0.12  0.48  0.08   0.06  0.37  0.06
50   0.2    0.11  0.41  0.10   0.14  0.48  0.12   0.19  0.48  0.13   0.09  0.38  0.09
50   0.5    0.15  0.42  0.14   0.21  0.48  0.18   0.29  0.48  0.19   0.14  0.39  0.14
50   1.0    0.07  0.21  0.06   0.12  0.34  0.05   0.28  0.46  0.19   0.14  0.28  0.14

memory-based EAs was also equivalent when using these two genetic operators. This can be confirmed by the statistical tables, where the comparison between Cj and Cx for the RRF1 was not significant. The best results were achieved by crossover and conjugation for ρ = 0.1 and ρ = 0.2, and by transformation for ρ = 0.5 and ρ = 1.0. In general, the memorized information was also useful in random environments and


Table 6. Statistical results on random environments

Memory-based EAs
                        Knapsack problem        Royal Road F1
r    Comparison   ρ ⇒   0.1   0.2   0.5   1.0   0.1   0.2   0.5   1.0
10   MIGA Cj - Cx       ++    +     +     ++    −     −     −     −
10   MIGA Cj - Tf       ++    ++    ++    ++    ++    −     −−    −−
10   MIGA Tf - Cx       −−    −−    −     −−    −−    −     ++    ++
10   MEGA Cj - Cx       ++    +     +     ++    −     −     −     −
10   MEGA Cj - Tf       ++    ++    ++    ++    ++    +     −−    −−
10   MEGA Tf - Cx       −−    −−    −−    −     −−    −−    +     ++
50   MIGA Cj - Cx       +     +     +     +     −     −     −     −
50   MIGA Cj - Tf       ++    ++    ++    ++    ++    ++    −−    −−
50   MIGA Tf - Cx       −−    −−    −−    −−    −−    −−    ++    ++
50   MEGA Cj - Cx       ++    ++    +     ++    −−    −−    −     +
50   MEGA Cj - Tf       ++    ++    ++    ++    ++    ++    −−    −−
50   MEGA Tf - Cx       ++    ++    ++    ++    −−    −−    +     ++

Immigrant-based EAs
                        Knapsack problem        Royal Road F1
r    Comparison   ρ ⇒   0.1   0.2   0.5   1.0   0.1   0.2   0.5   1.0
10   RIGA Cj - Cx       −−    −−    −−    −−    −     −     −     −
10   RIGA Cj - Tf       ++    −−    −−    −−    ++    ++    +     ++
10   RIGA Tf - Cx       −−    ++    ++    ++    −−    −−    −     −−
10   EIGA Cj - Cx       −−    −−    −−    −−    −     −     −     −
10   EIGA Cj - Tf       +     −−    −−    −−    ++    +     +     ++
10   EIGA Tf - Cx       −−    +     ++    ++    −−    −−    −−    −−
50   RIGA Cj - Cx       −−    −−    −−    −−    −     −−    −     −
50   RIGA Cj - Tf       ++    ++    −−    −−    ++    ++    ++    −−
50   RIGA Tf - Cx       −−    −−    ++    ++    −−    −−    −−    ++
50   EIGA Cj - Cx       −−    −−    −−    −−    −−    −−    −−    −−
50   EIGA Cj - Tf       ++    ++    −−    −−    ++    ++    −−    −−
50   EIGA Tf - Cx       −−    −−    ++    ++    −−    −−    ++    ++

helped the algorithms to react to the changes. Higher diversity was only needed for the more severe changes of the RRF1.

The immigrant-based EAs under random environments needed more diversity to achieve better results. Conjugation obtained the worst results, while the highest performance was obtained using crossover, for lower values of ρ, and transformation, for higher values of ρ. The reason for this was the same as before: more diversity was needed because, in the absence of memory, when a change happened, the re-adaptation of the algorithm to the new environment depended on the presence of diverse solutions in the search space.

6 Conclusions

In recent years, different mechanisms that promote and maintain the population’s diversity have been proposed and used in EAs for dynamic environments. Recently, some studies showed that in some situations the presence of excessive diversity could damage the performance of EAs for dynamic environments. In order to gain more insight into this important issue, we carried out an empirical study to analyze the relation between the diversity of the population and the performance of different EAs in dynamic environments. Two memory-based and two immigrant-based EAs were compared on two dynamic test problems under cyclic and random environments. The different algorithms used the same base, except for the standard recombination operator. To promote different levels of diversity, three genetic operators (uniform crossover, transformation and conjugation) were applied to the EAs. Experiments were executed to examine the relation between the diversity of the population and the performance of the EAs.

From the experimental results and analysis, several conclusions can be highlighted. First, a relation between the population’s diversity and the performance of the EAs was observed. Second, higher diversity did not necessarily imply better performance of the algorithms. Third, in memory-based EAs, the memory was crucial to the performance of the algorithms. In combination with memory, methods that promote too much diversity were damaging to the EAs. The best results corresponded to lower diversity in the population, for cyclic and random environments. Finally, in immigrant-based EAs, the diversity


assumed an important role. In general, the best results were obtained with crossover for environments changing rapidly, and with transformation for slower changes in the environment. Lower diversity (conjugation) always obtained the worst results. The results obtained in this paper provide important information about how the diversity of the population can influence different types of EAs in dynamic environments. As future work, we intend to extend this study to different types of algorithms and to environments with other characteristics. We are also interested in exploring methods that can maintain the appropriate diversity level at different stages of the evolutionary process.

References

1. Andrews, M., Tuson, A.: Diversity does not necessarily imply adaptability. In: Proc. of the 2003 Workshop on GECCO 2003, pp. 1–6. ACM Press, New York (2003)
2. Angeline, P.: Tracking extrema in dynamic environments. In: Angeline, P.J., McDonnell, J.R., Reynolds, R.G., Eberhart, R. (eds.) EP 1997. LNCS, vol. 1213, pp. 335–345. Springer, Heidelberg (1997)
3. Blackwell, T.M.: Particle swarms and population diversity 1: Analysis. In: Proc. of the 2003 Workshop on GECCO 2003. ACM Press, New York (2003)
4. Branke, J.: Evolutionary Optimization in Dynamic Environments. Kluwer Academic Publishers, Dordrecht (2002)
5. Cobb, H.G.: An investigation into the use of hypermutation as an adaptive operator in genetic algorithms having continuous, time-dependent nonstationary environments. Technical Report TR AIC-90-001, Naval Research Laboratory (1990)
6. Gould, J.L., Keeton, W.T.: Biological Science. W. W. Norton & Company, New York (1996)
7. Grefenstette, J.J.: Genetic algorithms for changing environments. In: Parallel Problem Solving from Nature (PPSN II) (1992)
8. Grefenstette, J.J., Ramsey, C.L.: An approach to anytime learning. In: Ninth International Conference on Machine Learning, pp. 189–195. Morgan Kaufmann, San Francisco (1992)
9. Mitchell, M., Forrest, S., Holland, J.: The royal road for genetic algorithms: fitness landscapes and GA performance. In: Proc. of the First ECAL, pp. 245–254. MIT Press, Cambridge (1992)
10. Morrison, R.W.: Designing Evolutionary Algorithms for Dynamic Environments. Springer, Heidelberg (2004)
11. Ramsey, C.L., Grefenstette, J.J.: Case-based initialization of genetic algorithms. In: Proc. of the Fifth ICGA, pp. 84–91. Morgan Kaufmann, San Francisco (1993)
12. Simões, A., Costa, E.: An immune system-based GA to deal with dynamic environments: Diversity and memory. In: Proc. of the 6th ICANNGA, pp. 168–174. Springer, Heidelberg (2003)
13. Simões, A., Costa, E.: Improving memory’s usage in evolutionary algorithms for changing environments. In: Proc. of the 2007 IEEE CEC, pp. 276–283. IEEE Press, Los Alamitos (2007)
14. Simões, A., Costa, E.: Variable-size memory evolutionary algorithm to deal with dynamic environments. In: Giacobini, M., et al. (eds.) EvoWorkshops 2007. LNCS, vol. 4448, pp. 617–626. Springer, Heidelberg (2007)
15. Smith, P.: Conjugation: A bacterially inspired form of genetic. In: Late Breaking Papers, 1996 GP Conference (1996)
16. Syswerda, G.: Uniform crossover in genetic algorithms. In: Proc. of the Third International Conference on Genetic Algorithms, pp. 2–9. Morgan Kaufmann Publishers Inc., San Francisco (1989)
17. Yang, S.: Genetic algorithms with memory- and elitism-based immigrants in dynamic environments. Evolutionary Computation 16(3), 385–416 (2008)

Effect of the Block Occupancy in GPGPU over the Performance of Particle Swarm Algorithm

Miguel Cárdenas-Montes¹, Miguel A. Vega-Rodríguez², Juan José Rodríguez-Vázquez³, and Antonio Gómez-Iglesias⁴

¹ Centro de Investigaciones Energéticas Medioambientales y Tecnológicas, Department of Fundamental Research, Madrid, Spain
² University of Extremadura, ARCO Research Group, Dept. Technologies of Computers and Communications, Cáceres, Spain
³ Centro de Investigaciones Energéticas Medioambientales y Tecnológicas, Department of Fundamental Research, Madrid, Spain
⁴ Centro de Investigaciones Energéticas Medioambientales y Tecnológicas, National Laboratory of Fusion, Madrid, Spain

Abstract. Diverse technologies have been used to accelerate the execution of Evolutionary Algorithms. GPGPU cards have demonstrated high efficiency in reducing execution times for a wide range of scientific problems, including some excellent examples with diverse categories of Evolutionary Algorithms. Nevertheless, in-depth studies of the efficiency of each of these technologies, and of how they affect the final performance, are still scarce. Such studies are relevant in order to reduce the execution time budget and therefore to tackle higher-dimensional problems. In this work, the improvement of the speed-up as a function of the percentage of threads used per block in the GPGPU card is analysed. The results show that a correct choice of the occupancy (the number of threads per block) yields an additional speed-up.

Keywords: GPGPU, Performance Analysis, Particle Swarm Algorithm (PSO), Schwefel Problem 1.2.

A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part I, LNCS 6593, pp. 310–319, 2011. © Springer-Verlag Berlin Heidelberg 2011

1 Introduction

With the advent of General-Purpose Graphics Processing Unit (GPGPU) cards, the possibilities for accelerating the execution of scientific problems are numerous. Many scientific and technological areas have ported applications to GPGPU, and optimisation problems solved with Evolutionary Algorithms (EAs) are no exception. Despite the numerous applications and the corresponding papers reporting the adaptation process and its benefits, few works report an in-depth analysis of the impact of the technical features of GPGPUs on the final speed-up.

As computational capabilities have grown, researchers have used them to tackle more complex optimization problems. Increasing the dimensionality of these problems is one of the fundamental ways to raise their complexity, attracting the interest of researchers and yielding theoretical and practical studies. However, dealing with high-dimensional problems implies large increments in the execution time. In this scenario, the capabilities of GPGPU cards play a fundamental role, providing cheap and portable hardware and promising to keep the execution time budget limited.

GPGPUs have been harnessed for general-purpose computation. In addition to their low cost and ubiquitous availability, they have a superior processing architecture when compared with modern CPUs, and they present a tremendous opportunity for developing lines of research in optimization. However, to profit from these capabilities, a simple adaptation of the problems does not suffice; on the contrary, deep studies of the technical features of GPGPUs applied to optimization problems are required.

Two main features produce a clear reduction in the execution time: the matching between data and parallel elements (blocks and threads), and the use of the diverse types of memory implemented in the GPGPU [1], [2]. In this context, the occupancy of the block (the number of threads used per block) becomes a key feature to increase the speed-up. This article proposes a study of the speed-up attained as a function of the occupancy of the blocks. For this study, a Particle Swarm Algorithm (PSO) optimizing the Schwefel Problem 1.2 for a very large-scale case (dimensionality of 20,000) has been employed.

This paper is organized as follows: Section 2 summarizes the related work and previous efforts. Section 3.1 gives a brief introduction to GPGPUs. In Section 3.2, a summary of the parallel models of Evolutionary Algorithms is presented. In Section 3.3, the Particle Swarm Algorithm used in this article is briefly described. In Section 4, the implementation details and the production set-up are shown. The results are displayed and analysed in Section 5. Finally, the conclusions and the future work are presented in Section 6.

2 Related Work

In recent years, a plethora of works have covered diverse topics related to the adaptation of EA problems to the GPGPU architecture. Most of them present the adaptation of diverse EAs to GPGPU. A few examples of this kind of work are: speeding up the optimization of the 0/1 knapsack problem with a genetic algorithm [3], the mapping of the parallel island-based genetic algorithm [4], and an example of a cellular genetic algorithm [5]. There are also examples in accelerating learning systems [6] and, specifically, examples of general-purpose parallel implementations of PSO in GPGPU [7], [8].

This last study [8] is the closest to the work presented in this article, since it also uses benchmark functions and the PSO algorithm. The main difference is the dimensionality employed. In that study the dimensionality ranges from 50 to 200, whereas in our study a much higher dimensionality (20,000) has been used in order to check the behaviour for extremely large-scale problems. Although the most frequent topic is the adaptation of EA applications and problems to GPGPU, other studies cover theoretical aspects of optimisation problems.


An example of this kind of work is the study of the models of parallel EAs in GPGPU [9], where three basic models for the adaptation of EAs to GPGPU hardware are presented. However, in the literature reviewed there are no examples of a deeper analysis of the capacity of different implementation models to accelerate the execution of EAs in GPGPU. The current work covers this kind of in-depth analysis, focusing on the occupancy of the blocks and its impact on the final performance.

3 Methods and Materials

In this section, diverse models of parallel evolutionary algorithms and a summary of PSO are presented, as well as a brief description of the GPGPU cards used in this study.

3.1 GPGPU Cards

For this study, two different GPGPU cards have been used: GTX 295 and TESLA C2050. The specifications of these cards that are most relevant for the present study are displayed in Table 1. Of the two cards, the TESLA C2050 is more modern and powerful than the GTX 295, incorporating more cores, more and faster memory per CUDA-enabled core and more threads per block. This last specification is very relevant when studying the occupancy of the blocks, as will be shown later. The GTX 295 is a general-purpose GPGPU card not specifically devoted to scientific activity. By contrast, the TESLA C2050 has been designed for scientific simulations. As can be observed in Table 1, the main differences between the two cards are the higher capacity of the blocks to allocate threads in the TESLA C2050 in comparison to the GTX 295, and the total dedicated memory of the TESLA C2050 compared with that of the GTX 295. These differences lead to different behaviour in the performance when the cards are applied to optimisation problems. In the experiments performed, only one GPGPU core is used, in order to avoid distortions in the measurement of the speed-up due to the activation of several GPGPU cores, and thus to compare fairly the performance of one CPU with that of one GPGPU.

Table 1. Specifications of GTX 295 and TESLA C2050 cards

                              GTX 295                            TESLA C2050
Compute capability            1.3                                2.0
Number of CUDA Cores          480 (240 per GPGPU)                448
Frequency of CUDA Cores       1.24 GHz                           1.15 GHz
Total Global Memory           1792 MB GDDR3 (896 MB per GPGPU)   3 GB GDDR5
Memory Frequency              1.5 GHz                            999 MHz
Memory Bandwidth (GB/sec)     144                                223.8
Threads in warp               32                                 32
Maximum threads per block     512                                1024
Max thread dimensions         (512, 512, 64)                     (1024, 1024, 64)
Max grid dimensions           (65535, 65535, 1)                  (65535, 65535, 1)

Effect of the Block Occupancy in GPGPU over the Performance of PSO Algorithm


The CPU experiments were executed on a machine equipped with two Intel Xeon E5520 processors (16 logical processors) running at 2.27 GHz, with 6 GB of main memory, under Fedora Core 10, 64 bits (kernel 2.6.27.37-170.2.104.fc10.x86-64).

3.2 Parallel Models of Evolutionary Algorithm

For non-trivial problems, executing the reproductive cycle of a simple EA with long individuals and/or large populations requires high computational resources. In general, evaluating the fitness function for every individual is the most costly operation of the EA. In EAs, parallelism arises naturally when dealing with populations, since each individual is an independent unit. For this reason, the performance of population-based algorithms improves considerably when they run in parallel [10]. Most variation operations can easily be undertaken in parallel, which makes Parallel Evolutionary Algorithms (PEAs) a natural choice. Using a PEA often leads not only to a faster algorithm, but also to superior numerical performance when the island model is implemented. Basically, three major parallel models for EAs can be distinguished [9]: the island (a)synchronous cooperative model, the parallel evaluation of the population and the distributed evaluation of a single solution. The parallel evaluation of the population is recommended when the evaluation is the most time-consuming step. This model has been selected for the adaptation to the GPGPU in this case, because the evaluation of Schwefel's Problem 1.2 is extremely slow. The parallel evaluation follows a master-worker model. The operations executed on the CPU (master) are, in general, the transformation of the population and the generation of the initial random population, whereas the evaluation of the population is performed on the GPGPU (worker).
When the particles need to be evaluated, the required data are transferred from main memory to the global memory of the GPGPU. After the evaluation, the results are returned to the CPU, and the CPU part of the code regains control. In the next cycle, the evaluation of the population is again allocated to the GPGPU. This study has been conducted using a PSO algorithm with a panmictic population structure (all manipulations take place over the whole population) and a generational model (a whole new population replaces the previous one).

3.3 Particle Swarm Optimizer

In this paper, the Particle Swarm Optimizer (PSO) has been chosen to test its response to large optimization problems. In PSO [11], [12], [13], a set of particles is initially created at random. During the process, each particle keeps track of the coordinates in the problem space that are associated with the best solution it has achieved so far. Not only the best historical position of each particle is kept, but also the associated fitness. This value is called localbest. Another "best" value tracked and stored by the global version of PSO is the overall best value, and its location, obtained so far by any particle in the population. This location is called globalbest.


M. Cárdenas-Montes et al.

The PSO concept consists of changing, at each time step, the velocity of (accelerating) each particle toward its localbest and the globalbest locations (in the global version of PSO). Acceleration is weighted by a random term, with separate random numbers being generated for acceleration toward the localbest and globalbest locations. The process for implementing the global version of PSO is as follows:

1. Creation of a random initial population of particles. Each particle has a position vector and a velocity vector on N dimensions in the problem space.
2. Evaluation of the desired (benchmark function) fitness in N variables for each particle. This step is the only one implemented on the GPGPU in the GPGPU version. In the CPU version of the code, the evaluation is implemented on the CPU.
3. Comparison of the fitness of each particle with its localbest. If the current value is better than the recorded localbest, it replaces it, and the current position is recorded as the localbest position.
4. For each particle, comparison of the present fitness with the globalbest fitness. If the current fitness improves the globalbest, it replaces it, and the current position is recorded as the globalbest position.
5. Updating of the velocity and the position of the particle according to Eqs. 1 and 2:

$$ v_{id} = v_{id} + c_1 \cdot Rand() \cdot (x_{id}^{localbest} - x_{id}) + c_2 \cdot Rand() \cdot (x_{id}^{globalbest} - x_{id}) \tag{1} $$

$$ x_{id}(t + \delta t) \leftarrow x_{id}(t) + v_{id} \tag{2} $$

6. If an end-of-execution criterion (fitness threshold or number of generations) is not met, return to step 2.

At first sight, Eq. 2 adds a velocity to a position. However, this addition occurs over a single time increment (iteration), so the equation remains coherent. In the implementation of the PSO algorithm, the constants c1 and c2 were set to c1 = c2 = 1 and the maximum velocity to Vmax = 2.
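The six steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' CUDA/CPU code: the search bounds `lo` and `hi`, the random initial velocities, the clamping of each velocity component to [-Vmax, Vmax], the fresh random number per dimension and the function names are all hypothetical choices; only c1 = c2 = 1, Vmax = 2, the population of 20 and the 1,000 cycles come from the text.

```python
import random

def evaluate(population, fitness_fn):
    """Step 2: fitness of every particle (the step the paper offloads to the GPGPU)."""
    return [fitness_fn(x) for x in population]

def pso(fitness_fn, dim, n_particles=20, cycles=1000, c1=1.0, c2=1.0,
        v_max=2.0, lo=-100.0, hi=100.0, seed=0):
    """Global-version PSO for minimisation; lo/hi bounds are assumed, not from the paper."""
    rng = random.Random(seed)
    # Step 1: random initial positions and velocities.
    xs = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vs = [[rng.uniform(-v_max, v_max) for _ in range(dim)] for _ in range(n_particles)]
    local_pos = [x[:] for x in xs]
    local_fit = [float("inf")] * n_particles
    global_pos, global_fit = xs[0][:], float("inf")
    for _ in range(cycles):                      # step 6: fixed number of generations
        fits = evaluate(xs, fitness_fn)          # step 2
        for i, f in enumerate(fits):
            if f < local_fit[i]:                 # step 3: update localbest
                local_fit[i], local_pos[i] = f, xs[i][:]
            if f < global_fit:                   # step 4: update globalbest
                global_fit, global_pos = f, xs[i][:]
        for i in range(n_particles):             # step 5: Eqs. 1 and 2
            for d in range(dim):
                v = (vs[i][d]
                     + c1 * rng.random() * (local_pos[i][d] - xs[i][d])
                     + c2 * rng.random() * (global_pos[d] - xs[i][d]))
                vs[i][d] = max(-v_max, min(v_max, v))  # assumed clamping to Vmax
                xs[i][d] += vs[i][d]
    return global_pos, global_fit
```

Keeping `evaluate` as a separate function mirrors the master-worker split: everything else stays on the CPU side, and only that call would be replaced by a transfer to the device and a kernel launch.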

4 Production Setup

In order to study the effect of the occupancy (threads per block) in GPGPUs on the final speed-up of PSO, Schwefel's Problem 1.2 (Eq. 3) has been used. Schwefel's Problem 1.2 is a fully non-separable, unimodal function with a global minimum at $\mathbf{0} = (0_1, 0_2, \ldots, 0_D)$. Its main feature is the high CPU time required for its evaluation. This function has been used as a benchmark for large-scale global optimization in recent editions of the CEC competitions (CEC 2008 and 2010 Special Sessions and Competitions on Large-Scale Global Optimization) [14], [15].

$$ f_{\text{Schwefel's Problem 1.2}} = \sum_{i=1}^{D} \Big( \sum_{j=1}^{i} x_j \Big)^2 \tag{3} $$
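To make the cost of Eq. 3 concrete, the following Python sketch evaluates it in two ways. The direct double sum performs on the order of D² additions, which is what makes the function expensive at D = 20,000; the prefix-sum variant is an equivalent O(D) formulation shown only for comparison. Which formulation the original codes used is not stated in the text, and the function names are hypothetical.

```python
from itertools import accumulate

def schwefel_1_2_naive(x):
    """Direct transcription of Eq. 3: for each i, sum x_1..x_i and square it."""
    return sum(sum(x[: i + 1]) ** 2 for i in range(len(x)))

def schwefel_1_2(x):
    """Same value using running prefix sums, so each partial sum is reused."""
    return sum(s * s for s in accumulate(x))
```

For x = (1, 2, 3) the partial sums are 1, 3 and 6, so both functions return 1 + 9 + 36 = 46.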


The configuration employed in this study has been selected to stress the capabilities of the GPGPU cards, to spread the data as widely as possible over the threads and blocks, and to be representative of large-scale problems in continuous optimization. In all cases, the population size is 20 individuals, the dimensionality of the search space is 20,000 and the number of cycles is 1,000. The occupancies range from 100% to 12.5%. For each configuration, 15 tries were executed. The pseudorandom number generator is a subroutine based on the Mersenne Twister [16]. Concerning the adaptation process, the matching between data and parallel processing elements is essential: the mapping between the data array and the threads is a key element for maximizing the performance. The kernel is invoked with a bi-dimensional grid of blocks, and all threads of each block are arranged in a one-dimensional array. In the grid of blocks, the y-axis represents the particles (blockIdx.y variable), and the number of blocks on the x-axis is calculated to accommodate the number of dimensions; e.g. for 20,000 dimensions and 512 threads per block, ⌈20,000/512⌉ = 40 blocks are necessary.
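The block count in the example above follows from a ceiling division, which can be checked with a short Python sketch. The helper names are hypothetical, and the index formula is the usual CUDA pattern (block index times block size plus thread index), assumed here rather than taken from the paper.

```python
import math

def launch_geometry(dimensions, particles, threads_per_block):
    """Return (grid, block): grid x covers the dimensions, grid y the particles."""
    blocks_x = math.ceil(dimensions / threads_per_block)
    return (blocks_x, particles), (threads_per_block,)

def global_index(block_idx_x, thread_idx_x, threads_per_block):
    """Dimension handled by a thread; threads with index >= dimensions must stay idle."""
    return block_idx_x * threads_per_block + thread_idx_x

def idle_threads(dimensions, threads_per_block):
    """Threads in the last block of each row that have no dimension to process."""
    return math.ceil(dimensions / threads_per_block) * threads_per_block - dimensions
```

With 20,000 dimensions, 20 particles and 512 threads per block this gives a 40 x 20 grid, and 480 threads of the last block in each row are left without work; with 64 threads per block the grid grows to 313 x 20 and only 32 threads per row are idle.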

5 Analysis and Results

Table 2 and Fig. 1 present the speed-ups obtained on the GTX 295 as the occupancy of the blocks is reduced (100%, 50%, 25% and 12.5%). The data show that progressively reducing the occupancy from 100% to 25% increases the speed-up by more than 1.7 (from 26.755 to 28.542). The gain shrinks with each further reduction: 1.332 from 100% to 50%, and 0.455 from 50% to 25%. Moreover, the improvement disappears when the occupancy falls below 25%: in this case, the experimental data show a strong degradation of the performance (a speed-up of 20.442 at 12.5%).

Fig. 1. Comparative box plot of speed-up of GTX 295 for diverse configurations of threads per block (100%, 50%, 25% and 12.5%) and 15 tries per configuration

Table 2. Mean speed-up and standard deviation (after 15 tries per configuration) in GTX 295 and TESLA C2050 of GPGPU versus CPU codes for each number of threads per block

Percentage of        GTX 295                               TESLA C2050
threads per block    Threads per block   Speed-up          Threads per block   Speed-up
100%                 512                 26.755 (0.200)    1024                43.430 (0.301)
50%                  256                 28.087 (0.128)    512                 43.641 (0.233)
25%                  128                 28.542 (0.143)    256                 43.806 (0.196)
12.5%                64                  20.442 (0.101)    128                 43.899 (0.258)

Fig. 2. Comparative box plot of speed-up of TESLA C2050 for diverse configurations of threads per block (100%, 50%, 25% and 12.5%) and 15 tries per configuration

Table 2 and Fig. 2 also present the speed-ups obtained on the TESLA C2050 as the occupancy of the blocks is reduced. As with the GTX 295, the speed-up increases when the occupancy is reduced from 100% to 50%, and again for further reductions. However, the increment on the TESLA C2050 is much smaller than the equivalent increment observed on the GTX 295 (0.469 in total, from 43.430 to 43.899). In contrast to the GTX 295, the TESLA C2050 shows no degradation of the performance when the occupancy drops below 25%. Nevertheless, the values are so close that the significance of the differences has to be checked by statistical methods. Comparing the increments on both cards, reducing the occupancy has a stronger effect on the speed-up of the GTX 295 than on that of the TESLA C2050. The most probable origin of the slighter improvement on the TESLA C2050 is its higher memory capacity, which mitigates the effect of reducing the occupancy.
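The increments discussed above can be recomputed directly from the mean speed-ups of Table 2. The following Python sketch only restates the table values and takes differences between consecutive occupancies; the variable and function names are hypothetical.

```python
# Mean speed-ups from Table 2, ordered from 100% down to 12.5% occupancy.
MEAN_SPEEDUP = {
    "GTX 295": [26.755, 28.087, 28.542, 20.442],      # 512, 256, 128, 64 threads
    "TESLA C2050": [43.430, 43.641, 43.806, 43.899],  # 1024, 512, 256, 128 threads
}

def consecutive_gains(values):
    """Change in mean speed-up each time the occupancy is halved."""
    return [round(after - before, 3) for before, after in zip(values, values[1:])]

GAINS = {card: consecutive_gains(values) for card, values in MEAN_SPEEDUP.items()}
```

For the GTX 295 the three steps are +1.332, +0.455 and -8.100; for the TESLA C2050 they are +0.211, +0.165 and +0.093, which shows both the smaller effect on the newer card and the absence of a drop at 12.5%.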


In order to check whether the differences in the speed-ups shown in Table 2 are significant, the Wilcoxon signed-rank test is used [17], [18], [19]. This test ascertains whether two data sets come from the same distribution. In this analysis, the performance data sets are used to build a hypothesis-testing procedure in which the null hypothesis is H0: μ1 = μ2 (both distributions have equal means), the sub-indexes referring to each pair of threads-per-block values, and the alternative hypothesis is H1: μ1 ≠ μ2. A confidence level of 95% is considered, that is, a significance level of 5%: a p-value under 0.05 indicates that the observed differences are unlikely to have occurred by chance. In all cases for the speed-up of the GTX 295, the differences are significant (the null hypothesis can be rejected), which means that these data come from different distributions. For the TESLA C2050, the Wilcoxon signed-rank test confirms the significance of the differences in speed-up for all pairs except the last one, 256 versus 128 threads. In this case, the test cannot reject the null hypothesis: both data sets could come from the same distribution. Regarding the best fitness obtained for each configuration and GPGPU card, the Wilcoxon signed-rank test concludes that the values belong to the same distribution. Since the porting activity focused on making the code run faster, improvements in the best fitness obtained were not expected.
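For readers unfamiliar with the test, the following pure-Python sketch computes an exact two-sided Wilcoxon signed-rank p-value for paired samples by enumerating all sign patterns. It is an illustration under assumptions, not the procedure or software the authors used: the handling of zeros (dropped) and ties (average ranks) is one common convention, and exhaustive enumeration is only practical for small samples such as the 15 tries per configuration used here (2^15 patterns).

```python
from itertools import product

def wilcoxon_signed_rank(a, b):
    """Exact two-sided Wilcoxon signed-rank test; returns (w_plus, p_value)."""
    diffs = [x - y for x, y in zip(a, b) if x != y]   # zero differences dropped
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:                                      # average ranks for ties
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average = (i + j) / 2 + 1                     # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    centre = sum(ranks) / 2                           # mean of W+ under H0
    observed = abs(w_plus - centre)
    extreme = 0
    for signs in product((0, 1), repeat=n):           # all 2**n sign patterns
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - centre) >= observed - 1e-12:
            extreme += 1
    return w_plus, extreme / 2 ** n
```

As a usage example with made-up numbers: for six pairs whose differences are 1, 2, ..., 6, all positive, the statistic is W+ = 21 and only two of the 64 sign patterns are as extreme, so p = 2/64 = 0.03125 and H0 would be rejected at the 5% level.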

6 Conclusion and Future Work

In this work, the role of the occupancy of the blocks in GPGPU (the number of threads per block) in the final speed-up of an EA has been analysed in depth. The study has been performed on two GPGPU cards: a GTX 295 and a TESLA C2050. In order to test the role of the occupancy, a Particle Swarm Optimizer, Schwefel's Problem 1.2 as benchmark function and a high-dimensionality configuration have been used. The data obtained highlight the importance of the occupancy in the final speed-up of the PSO algorithm. The analysis concludes that the ideal occupancy lies within the range from 50% to 25% of the total threads of the block. A correct choice of the occupancy yields an additional, non-negligible speed-up. Conversely, an incorrect choice of the occupancy, i.e. a very low number of threads per block, can degrade the final performance. The optimization of a CUDA program essentially consists of striking the optimum balance between the number of blocks and their size: more threads per block help to mask the latency of the memory operations, but at the same time reduce the number of registers available per thread. The multiprocessors have 8,192 registers that are shared among all the threads of all the active blocks on that multiprocessor. The number of active blocks per multiprocessor cannot exceed eight. Based on the experimental data, block occupancies ranging from 50% to 25% can be recommended as the best compromise between masking latency and the number of registers needed for most kernels. GPGPU cards implement other advanced features devoted to improving the performance. The verification of these features, such as the shared memory, is proposed as future work.
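The register trade-off mentioned above can be illustrated with a rough calculation. The sketch below assumes that the 8,192 registers quoted in the text are divided evenly among all threads of the active blocks and ignores any allocation granularity of the hardware; both are simplifications made for illustration, and the function name is hypothetical.

```python
def registers_per_thread(threads_per_block, active_blocks, registers=8192):
    """Upper bound on registers per thread under an even split (simplified model)."""
    return registers // (threads_per_block * active_blocks)

# One active block per multiprocessor, for the block sizes tested on the GTX 295.
BUDGET = {threads: registers_per_thread(threads, 1) for threads in (512, 256, 128, 64)}
```

Under this simplified model a block of 512 threads leaves at most 16 registers per thread, while blocks of 256, 128 and 64 threads leave 32, 64 and 128 respectively, which is the pressure that smaller blocks relieve at the cost of less latency masking.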


Acknowledgement

The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) via the project EUFORIA under grant agreement number 211804; and the project EGI-InSPIRE under the grant agreement number RI-261323. The author thanks the Spanish Network for e-Science (CAC-2007-52) for their support when using the NGI resources.

References

1. Sanders, J., Kandrot, E.: CUDA by Example: An Introduction to General-Purpose GPU Programming, 1st edn. Addison-Wesley Professional, Reading (July 2010)
2. Kirk, D.B., Hwu, W.: Programming Massively Parallel Processors: A Hands-on Approach, 1st edn. Morgan Kaufmann, San Francisco (February 2010)
3. Pospíchal, P., Schwarz, J., Jaros, J.: Parallel genetic algorithm solving 0/1 knapsack problem running on the GPU. In: 16th International Conference on Soft Computing MENDEL, Brno University of Technology, pp. 64–70 (2010)
4. Pospíchal, P., Jaros, J., Schwarz, J.: Parallel genetic algorithm on the CUDA architecture. In: Di Chio, C., Cagnoni, S., Cotta, C., Ebner, M., Ekárt, A., Esparcia-Alcazar, A.I., Goh, C.-K., Merelo, J.J., Neri, F., Preuß, M., Togelius, J., Yannakakis, G.N. (eds.) EvoApplications 2010. LNCS, vol. 6024, pp. 442–451. Springer, Heidelberg (2010)
5. Vidal, P., Alba, E.: Cellular genetic algorithm on graphic processing units. In: González, J.R., Pelta, D.A., Cruz, C., Terrazas, G., Krasnogor, N. (eds.) NICSO 2010. Studies in Computational Intelligence, vol. 284, pp. 223–232. Springer, Heidelberg (2010)
6. Franco, M.A., Krasnogor, N., Bacardit, J.: Speeding up the evaluation of evolutionary learning systems using GPGPUs. In: GECCO 2010: Proceedings of the 12th Annual Conference on Genetic and Evolutionary Computation, pp. 1039–1046. ACM, New York (2010)
7. Zhou, Y., Tan, Y.: Particle swarm optimization with triggered mutation and its implementation based on GPU. In: Proceedings of Genetic and Evolutionary Computation Conference, GECCO 2010, Portland, Oregon, USA, July 7-11, pp. 1–8. ACM, New York (2010)
8. Zhou, Y., Tan, Y.: GPU-based parallel particle swarm optimization. In: Proceedings of the IEEE Congress on Evolutionary Computation, CEC 2009, Trondheim, Norway, May 18-21, pp. 1493–1500. IEEE, Los Alamitos (2009)
9. Luong, T.V., Melab, N., Talbi, E.G.: GPU-based island model for evolutionary algorithms. In: Proceedings of Genetic and Evolutionary Computation Conference, GECCO 2010, Portland, Oregon, USA, July 7-11, pp. 1089–1096. ACM, New York (2010)
10. Alba, E., Tomassini, M.: Parallelism and evolutionary algorithms. IEEE Trans. Evolutionary Computation 6(5), 443–462 (2002)
11. Kennedy, J., Eberhart, R.C.: Particle swarm optimization. In: Proceedings of the IEEE International Conference on Neural Networks, vol. IV, pp. 1942–1948 (1995)
12. Eberhart, R.C.: Computational Intelligence: Concepts to Implementations. Morgan Kaufmann Publishers Inc., San Francisco (2007)
13. Eberhart, R.C., Kennedy, J.: A new optimizer using particle swarm theory. In: Proceedings of the Sixth International Symposium on Micro Machine and Human Science, pp. 39–43 (1995)
14. Tang, K., Li, X., Suganthan, P.N., Yang, Z., Weise, T.: Benchmark functions for the CEC 2010 special session and competition on large-scale global optimization. Technical report, Nature Inspired Computation and Applications Laboratory (NICAL), School of Computer Science and Technology, University of Science and Technology of China (USTC), Electric Building No. 2, Room 504, West Campus, Huangshan Road, Hefei 230027, Anhui, China (2009)
15. Tang, K., Yao, X., Suganthan, P.N., MacNish, C., Chen, Y.P., Chen, C.M., Yang, Z.: Benchmark functions for the CEC 2008 special session and competition on large scale global optimization. Technical report, Nature Inspired Computation and Applications Laboratory, USTC, China (2007)
16. Matsumoto, M., Nishimura, T.: Mersenne twister: A 623-dimensionally equidistributed uniform pseudo-random number generator. ACM Trans. Model. Comput. Simul. 8(1), 3–30 (1998)
17. Montgomery, D.C., Runger, G.C.: Applied Statistics and Probability for Engineers, 4th edn. John Wiley & Sons, Chichester (May 2006)
18. Sheskin, D.: Handbook of parametric and nonparametric statistical procedures. CRC Press, Boca Raton (2004)
19. García, S., Molina, D., Lozano, M., Herrera, F.: A study on the use of non-parametric tests for analyzing the evolutionary algorithms behaviour: a case study on the CEC 2005 special session on real parameter optimization. J. Heuristics 15(6), 617–644 (2009)

Two Improvement Strategies for Logistic Dynamic Particle Swarm Optimization

Qingjian Ni and Jianming Deng

School of Computer Science & Engineering, Southeast University, Nanjing, China
{nqj,jmdeng}@seu.edu.cn

Abstract. A new variant of particle swarm optimization, Logistic Dynamic Particle Swarm Optimization (termed LDPSO), is introduced in this paper. LDPSO builds on a new approach in which the population is generated according to the historical information of the particles. It has a better searching capability than the canonical method. Furthermore, according to the characteristics of LDPSO, two improvement strategies are designed. A mutation strategy is employed to prevent premature convergence of the particles. A selection strategy is adopted to maintain the diversity of the particles. Experimental results demonstrate the efficiency of LDPSO and the effectiveness of the two improvement strategies.

Keywords: logistic dynamic particle swarm optimization, mutation, selection.

1 Introduction

Particle swarm optimization (PSO) is an evolutionary computation method inspired by the biological mechanisms of swarm behavior, such as bird flocking and fish schooling. It was initially developed by Kennedy and Eberhart [6]. As an evolutionary computation method, PSO is a population-based algorithm similar to evolutionary programming and genetic algorithms. The population of PSO consists of particles (individuals) that have a velocity property. When solving an optimization problem, a particle is defined as a point in the D-dimensional solution space of the problem, in which the particles can 'fly'. Particles are initialized as a set of random solutions and update their positions according to their own flying experience and that of the population. During the evolution, a fitness function is used to evaluate the particles, as in other evolutionary computation methods, and these evaluations are associated with the flying experience of the particles. PSO finds the solutions of the optimization problem through the evolution of the particle swarm.

Since the introduction of PSO, many variants of the original algorithm have been developed for different applications. Several typical variants are the following: the time-varying inertia weight PSO [9], the PSO with constriction coefficient [1,2], the PSO with neighborhood topology [3,7], etc. Recently, Kennedy recommended trying a new variant of PSO without particle velocity [4,5]. The Gaussian dynamic PSO, which belongs to this new kind of variant, shows performance comparable to other excellent PSO variants [4]. Furthermore, a more in-depth analysis of this variant is presented in [10]. These preliminary studies show that the new PSO variant is worthy of further research.

The present paper introduces a new version of this PSO variant, Logistic Dynamic PSO, together with two improvement strategies designed to enhance its searching ability. The remaining sections are organized as follows. Section 2 introduces the canonical PSO and Logistic Dynamic PSO. Section 3 describes the two improvement strategies in detail. Section 4 provides the experimental design and the experimental results. Section 5 outlines the contributions and results of the present paper.

A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part I, LNCS 6593, pp. 320–329, 2011. © Springer-Verlag Berlin Heidelberg 2011

2 Logistic Dynamic Particle Swarm Optimization

2.1 Canonical PSO

Many variants of PSO have been proposed to increase the searching ability. Among them, the PSO with constriction coefficient [1,2] is one of the most successful: it is widely used and is often taken as the typical canonical PSO in comparisons. Compared with the original PSO [6], its most significant improvement is the use of a constriction coefficient when updating the velocity of the particles. The general procedure for implementing the canonical PSO is given in Algorithm 1.

Algorithm 1. Canonical PSO

Initialize positions and velocities of a swarm of particles randomly;
while the termination criterion is not satisfied do
    Evaluate the fitness values for each particle in the swarm;
    Update the previous best position of each particle;
    Update the global previous best position of the swarm;
    Compute the new velocities and positions according to equations 1 and 2;
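The last step of Algorithm 1, the constricted velocity and position update, can be sketched in Python as follows. The paper does not give values for the coefficients; the defaults k = 0.7298 and c1 = c2 = 2.05 used here are the values commonly quoted in the constriction-coefficient literature and are an assumption of this sketch, as are the function name and the choice of drawing a fresh random number for every dimension.

```python
import random

def constricted_update(x, v, p_best, g_best, rng, k=0.7298, c1=2.05, c2=2.05):
    """One canonical-PSO update of a single particle; returns (new_x, new_v)."""
    new_x, new_v = [], []
    for d in range(len(x)):
        vd = k * (v[d]
                  + c1 * rng.random() * (p_best[d] - x[d])   # pull toward own best
                  + c2 * rng.random() * (g_best[d] - x[d]))  # pull toward swarm best
        new_v.append(vd)
        new_x.append(x[d] + vd)                              # position update
    return new_x, new_v
```

A particle that already sits on both best positions with zero velocity stays where it is, and a particle with velocity 1 in that situation simply has its velocity scaled by k, which is how the constriction coefficient damps the motion.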

The velocity and position update equations of the canonical PSO are as follows:

$$ v_{id} = k \cdot (v_{id} + c_1 \cdot rand() \cdot (p_{id} - x_{id}) + c_2 \cdot Rand() \cdot (p_{gd} - x_{id})) \tag{1} $$

$$ x_{id} = x_{id} + v_{id} \tag{2} $$

where $c_1$ and $c_2$ are two constant coefficients, $k$ is the constriction coefficient, rand() and Rand() are two uniform random number generators in the range [0,1], $V_i = (v_{i1}, v_{i2}, \ldots, v_{id}, \ldots, v_{iD})$ is the velocity of particle i, $X_i = (x_{i1}, x_{i2}, \ldots, x_{id}, \ldots, x_{iD})$ is the position of particle i in the D-dimensional space, $P_i = (p_{i1}, p_{i2}, \ldots, p_{id}, \ldots, p_{iD})$ is the previous best position of particle i, and $P_g = (p_{g1}, p_{g2}, \ldots, p_{gd}, \ldots, p_{gD})$ is the previous best position found in the population or swarm.

2.2 Logistic Dynamic PSO

The usual variants of PSO (including the canonical PSO) can be considered trajectory approaches with a velocity property. Kennedy first proposed trying a new variant of PSO without the velocity property [4,5]. The preliminary theoretical and experimental studies of this new variant show good prospects for PSO theory and applications [4,5,8,10]. This paper describes a new variant of PSO without velocity called Logistic Dynamic PSO (LDPSO). In LDPSO, the 'flying' direction of particle i focuses on the space around the expected value of $X_i$, and the probability of a specific position decays with its distance from the expected value. The procedure for implementing LDPSO is given in Algorithm 2.

Algorithm 2. Logistic Dynamic PSO

Initialize positions of a swarm of particles randomly;
while the termination criterion is not satisfied do
    Evaluate the fitness values for each particle in the swarm;
    Update the previous best position of each particle;
    Calculate the CT values for each particle according to equation 4;
    Calculate the OT values for each particle according to equation 5;
    Compute the new positions according to equation 3;

The velocity and position update equations evolve into the following position update rule:

$$ X_i(t+1) = X_i(t) + \alpha \cdot (X_i(t) - X_i(t-1)) + \beta \cdot CT_i(t) + \gamma \cdot Gen(LogisticRNG, para) \cdot OT_i(t) \tag{3} $$

$$ CT_{id}(t) = \Big( \sum_{k=1}^{K} P_{kd}(t) \Big) / K - X_{id}(t) \tag{4} $$

$$ OT_{id}(t) = \Big( \sum_{k=1}^{K} |P_{id} - P_{kd}| \Big) / K \tag{5} $$

where t is the evolution generation, i is the index of the particle, k is the index of the neighborhood particles of particle i, K is the number of neighborhood particles, $P_k$ is the previous best position of particle k, d is the dimension index of particle i, and α, β and γ are three positive constant coefficients. $CT_i(t)$ is the tendency of particle i to move closer to its neighborhood particles, $OT_i(t)$ is the divergence between particle i and its neighborhood particles, and Gen(LogisticRNG, para) is the dynamic probabilistic evolution operator, which generates random numbers according to a Logistic distribution with parameter para. In LDPSO, particles have no velocity property: first, the CT and OT values of a particle are calculated from its own historical information and that of its neighborhood particles; then, the position of the particle in the new generation is computed with the dynamic probabilistic evolution operator Gen(LogisticRNG, para).
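Equations 3 to 5 can be sketched in Python as follows. Several details are assumptions of this sketch rather than statements of the paper: the parameter para is interpreted as the scale of a zero-centred logistic distribution, a fresh random number is drawn for every dimension, the neighborhood is passed in as an explicit list of previous best positions, and all function names are hypothetical.

```python
import math
import random

def logistic_sample(scale, rng):
    """Zero-centred logistic variate by inverse-CDF sampling (scale is assumed 'para')."""
    u = rng.random()
    while u == 0.0:                    # avoid log(0)
        u = rng.random()
    return scale * math.log(u / (1.0 - u))

def ct_ot(x_i, p_i, neighbour_bests):
    """Equations 4 and 5 for one particle, dimension by dimension."""
    K, D = len(neighbour_bests), len(x_i)
    ct = [sum(p[d] for p in neighbour_bests) / K - x_i[d] for d in range(D)]
    ot = [sum(abs(p_i[d] - p[d]) for p in neighbour_bests) / K for d in range(D)]
    return ct, ot

def ldpso_position(x_now, x_prev, ct, ot, alpha, beta, gamma, scale, rng):
    """Equation 3: new position from the last two positions, CT and OT."""
    return [x_now[d] + alpha * (x_now[d] - x_prev[d]) + beta * ct[d]
            + gamma * logistic_sample(scale, rng) * ot[d]
            for d in range(len(x_now))]
```

For example, with neighborhood bests (1, 1) and (3, 5), a particle at the origin whose own best is (1, 1) gets CT = (2, 3) and OT = (1, 2): CT points toward the mean of the neighborhood bests, while OT sets the spread of the random logistic step.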

3 Two Improvement Strategies on LDPSO

Experiments on LDPSO show its good performance. In practice, further improvements should be considered for better performance. In this section, two possible improvement strategies for LDPSO are presented.

3.1 Mutation Strategy Based Improvement

PSO, genetic algorithms and other evolutionary algorithms are population-based methods. However, typical PSO algorithms do not have the selection, crossover and mutation operations of genetic algorithms. In LDPSO, the evolution of the particles is driven by their CT and OT values combined with the dynamic probabilistic evolution operator. The calculation of the CT and OT values of a particle depends on the historical information of its neighborhood particles, and this is how the historical information of the particles in a swarm influences the evolution of the entire population. When solving complex optimization problems, the historical experience of a few individual particles must be prevented from prematurely dominating the entire population during the evolution. This paper proposes a mutation strategy based improvement on LDPSO. The main strategy is as follows: a mutation operation is applied to the particle that has obtained the best position ('the optimal particle') in the evolution process. Specifically, the position of 'the optimal particle' is updated according to equation 6, and the positions of the other particles are still updated according to equation 3.

$$ X_i(t+1) = X_i(t) + \alpha \cdot (X_i(t) - X_i(t-1)) + \beta \cdot CT_i(t) + \gamma \cdot Gen(LogisticRNG, para) \cdot OT_i(t) + Gen(CauchyRNG, para) \cdot OT_i(t) \tag{6} $$

In equation 6, Gen(CauchyRNG, para) is the dynamic probabilistic evolution operator which generates random numbers according to a Cauchy distribution with parameter para. The mutation strategy is employed to maintain the population diversity and to prevent the swarm from converging prematurely. The procedure for implementing LDPSO with the mutation strategy is given in Algorithm 3.
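The extra term of equation 6 can be sketched in Python as a perturbation added to a position already computed with equation 3. As before, this is an illustration under assumptions: para is read as the scale of a zero-centred Cauchy distribution, one random number is drawn per dimension, a lower fitness is taken to be better, and the function names are hypothetical.

```python
import math
import random

def cauchy_sample(scale, rng):
    """Zero-centred Cauchy variate by inverse-CDF sampling (scale is assumed 'para')."""
    return scale * math.tan(math.pi * (rng.random() - 0.5))

def optimal_particle(best_fitness):
    """Index of the particle holding the best (lowest) previous best fitness."""
    return min(range(len(best_fitness)), key=lambda i: best_fitness[i])

def mutate(position_eq3, ot, scale, rng):
    """Equation 6: add Gen(CauchyRNG, para) * OT to the equation 3 position."""
    return [x + cauchy_sample(scale, rng) * o for x, o in zip(position_eq3, ot)]
```

Because the Cauchy distribution has heavy tails, the mutated particle occasionally takes a very long step, and since the step is scaled by OT it vanishes automatically in dimensions where the particle's best already coincides with all neighborhood bests.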


Algorithm 3. Logistic Dynamic PSO with Mutation Strategy

Initialize positions of a swarm of particles randomly;
while the termination criterion is not satisfied do
    Evaluate the fitness values for each particle in the swarm;
    Update the previous best position of each particle;
    Calculate the CT values for each particle according to equation 4;
    Calculate the OT values for each particle according to equation 5;
    for the optimal particle do
        Compute the new position according to equation 6;
    for the remaining particles do
        Compute the new positions according to equation 3;

3.2 Selection Strategy Based Improvement

In genetic algorithms, the selection mechanism is an important operation that promotes the evolution of the population. An appropriate selection mechanism is an important factor in the efficiency of a genetic algorithm when solving optimization problems. Common selection mechanisms include the roulette wheel mechanism, the strategy of keeping the best individual and ranking methods. In general, individuals with better fitness values have a greater chance to reproduce, while individuals with worse fitness values have a lower chance of reproduction. In typical PSO algorithms, particles with worse fitness values are not discarded directly, but continue to evolve under the influence of the historical experience of their neighborhood particles and move closer to the possible optimal solution. Therefore, there is no selection mechanism in typical PSO algorithms.

Algorithm 4. Logistic Dynamic PSO with Selection Strategy

Initialize positions of a swarm of particles randomly;
while the termination criterion is not satisfied do
    Evaluate the fitness values for each particle in the swarm;
    Update the previous best position of each particle;
    Calculate the CT values for each particle according to equation 4;
    Calculate the OT values for each particle according to equation 5;
    Compute the new positions according to equation 3;
    if the occasion of performing the selection operation is met then
        Discard the particle with the worst fitness value;
        Generate a new particle randomly to replace the discarded particle;

This paper introduces a selection strategy based improvement on LDPSO. The procedure for implementing LDPSO with the selection strategy is given in Algorithm 4. The main strategy is as follows: the selection operation is performed at regular intervals during the evolution. More specifically, the particle with the worst fitness value at that moment is discarded, and a new particle is randomly generated in the solution space to replace it.


In LDPSO with the selection strategy, the selection operation is performed at regular intervals and the discarded particle is the one with the worst fitness value, so the evolution of the entire population is not negatively affected. In addition, this regular disturbance maintains the diversity of the population during the evolution process, which is a key factor in keeping the entire population evolving and moving closer to the optimal solution.
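The selection step described above can be sketched in Python as follows. The length of the interval is not given in the text, so `period` is a hypothetical parameter, as are the function name, the per-dimension bounds used to draw the new particle and the convention that a higher fitness value is worse (minimisation).

```python
import random

def apply_selection(positions, fitness, generation, period, bounds, rng):
    """Every `period` generations, replace the worst particle by a random one.

    Returns the index of the replaced particle, or None if no selection occurred.
    """
    if generation % period != 0:
        return None
    worst = max(range(len(positions)), key=lambda i: fitness[i])  # minimisation
    positions[worst] = [rng.uniform(lo, hi) for lo, hi in bounds]
    return worst
```

What happens to the stored history of the replaced particle (its previous position and previous best) is not specified in the text; a full implementation would have to decide whether to reset them along with the position.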

4 Experiment and Analysis

4.1 Experiment Setting

Five common benchmark functions were selected to test the performance of LDPSO and its two improvement strategies. These functions are shown in Table 1. The specific parameters of the benchmark functions are given in the following subsections.

Table 1. Five benchmark functions

Function        Formula                                                                                  Accepted error
Sphere          $f(x) = \sum_{i=1}^{n} x_i^2$                                                            < 0.01
Schaffer's F6   $f(x) = \frac{\sin^2\sqrt{x_1^2 + x_2^2} - 0.5}{[1 + 0.001(x_1^2 + x_2^2)]^2} + 0.5$     < 0.00001
Rosenbrock      $f(x) = \sum_{i=1}^{n-1} (100(x_{i+1} - x_i^2)^2 + (x_i - 1)^2)$                          < 100
Rastrigin       $f(x) = \sum_{i=1}^{n} (x_i^2 - 10\cos(2\pi x_i) + 10)$                                  < 100
Griewank        $f(x) = \frac{1}{4000}\sum_{i=1}^{n} x_i^2 - \prod_{i=1}^{n} \cos\frac{x_i}{\sqrt{i}} + 1$   < 0.1
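The five benchmark functions of Table 1 can be written in Python as below. The formulas follow the standard definitions of these functions, which is how the table is read here; the function names and the `ACCEPTED_ERROR` dictionary are hypothetical conveniences, and the success criterion in `is_success` (fitness below the accepted error) is an assumption about how the success rate is counted.

```python
import math

def sphere(x):
    return sum(v * v for v in x)

def schaffer_f6(x1, x2):
    s = x1 * x1 + x2 * x2
    return (math.sin(math.sqrt(s)) ** 2 - 0.5) / (1.0 + 0.001 * s) ** 2 + 0.5

def rosenbrock(x):
    return sum(100.0 * (x[i + 1] - x[i] ** 2) ** 2 + (x[i] - 1.0) ** 2
               for i in range(len(x) - 1))

def rastrigin(x):
    return sum(v * v - 10.0 * math.cos(2.0 * math.pi * v) + 10.0 for v in x)

def griewank(x):
    total = sum(v * v for v in x) / 4000.0
    product = math.prod(math.cos(v / math.sqrt(i)) for i, v in enumerate(x, start=1))
    return total - product + 1.0

# Accepted errors from Table 1.
ACCEPTED_ERROR = {"sphere": 0.01, "schaffer_f6": 0.00001, "rosenbrock": 100.0,
                  "rastrigin": 100.0, "griewank": 0.1}

def is_success(name, value):
    """Assumed success criterion: final fitness below the accepted error."""
    return value < ACCEPTED_ERROR[name]
```

All five functions have a minimum value of 0, reached at the origin for every function except Rosenbrock, whose minimum lies at (1, ..., 1).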

All experiments were repeated 100 times independently, and the number of evolution generations was set to 3000. The swarm size is 20, and the population topology is a fully connected structure. The following experimental data were collected: the optimal value, the median value, the mean value, the standard deviation, the worst value and the success rate.

4.2 Comparison between LDPSO and Canonical PSO

LDPSO is compared in Figure 1 and Table 2 with the canonical PSO (abbreviated as CPSO) on the Sphere (30 dimensions), Schaffer F6 (2 dimensions), Rastrigin (30 dimensions) and Griewank (30 dimensions) functions. As can be seen from Figure 1 and Table 2, LDPSO found better solutions than CPSO on all four functions. The experimental results show the efficiency of the proposed PSO variant without velocity.

4.3 Performance of Mutation Strategy

The mutation-strategy-based LDPSO is compared with LDPSO in figure 2 and table 3 on the Sphere (60 dimensions), Schaffer F6 (2 dimensions), Rosenbrock (60 dimensions) and Griewank (60 dimensions) functions.

Table 2. Comparison of experimental results between CPSO and LDPSO

Function    Method   Optimal     Median     Mean       SD         Worst
Sphere      CPSO     1.96E-39    7.00E-32   5.38E-26   4.35E-25   4.26E-24
Sphere      LDPSO    3.80E-77    7.64E-76   5.28E-69   5.28E-68   5.28E-67
F6          CPSO     0 [56%]     0          0.004275   0.004847   0.009716
F6          LDPSO    0 [77%]     0          0.000434   0.001931   0.009716
Rastrigin   CPSO     41.788      100.99     103.29     28.942     180.23
Rastrigin   LDPSO    5.9698      15.919     16.963     4.9516     31.839
Griewank    CPSO     0 [6%]      0.041663   9.1311     30.066     180.3
Griewank    LDPSO    0 [90%]     0          0.001059   0.003867   0.027037

[Figure 1: four panels (Sphere, Schaffer F6, Rastrigin, Griewank), each plotting Log10(fitness value) against Iteration (0 to 3000) for CPSO and LDPSO.]

Fig. 1. Comparison of evolution curves between CPSO and LDPSO

Table 3. Comparison of experimental results between LDPSO and LDPSO-Mutation

Function     Method    Optimal    Median     Mean       SD         Worst      Succ. rate
Sphere       LDPSO     5.07E-27   6.76E-07   0.23809    1.0057     7.8772     84%
Sphere       LDPSO-M   3.36E-17   1.31E-15   4.74E-15   1.74E-14   1.66E-13   100%
F6           LDPSO     5.07E-27   6.76E-07   0.23809    1.0057     7.8772     84%
F6           LDPSO-M   3.36E-17   1.31E-15   4.74E-15   1.74E-14   1.66E-13   100%
Rosenbrock   LDPSO     52.982     165.3      382.43     1009.9     9688.8     19%
Rosenbrock   LDPSO-M   44.85      56.669     90.644     55.485     372.59     62%
Griewank     LDPSO     0          0.00045    0.051824   0.13826    0.97338    87%
Griewank     LDPSO-M   0          1.61E-15   0.00442    0.0118     0.080382   100%

As can be seen, the mutation-strategy-based LDPSO outperformed LDPSO on all functions, and the evolution curves indicate that the influence of mutation becomes stronger as the number of generations increases.

[Figure 2: four panels (Sphere, Schaffer F6, Rosenbrock, Griewank), each plotting Log10(fitness value) against Iteration (0 to 3000) for LDPSO and LDPSO-mutation.]

Fig. 2. Comparison of evolution curves between LDPSO and LDPSO-Mutation

4.4 Performance of Selection Strategy

The selection-strategy-based LDPSO is compared with LDPSO in figure 3 and table 4 on the Sphere (90 dimensions), Rastrigin (90 dimensions), Rosenbrock (90 dimensions) and Griewank (90 dimensions) functions.

Table 4. Comparison of experimental results between LDPSO and LDPSO-Selection

Function     Method    Optimal    Median     Mean       SD        Worst
Sphere       LDPSO     1.14E-05   1.7651     4.4721     7.2767    34.76
Sphere       LDPSO-S   3.45E-16   1.73E-14   0.016593   0.1645    1.645
Rosenbrock   LDPSO     239.44     726.56     6016.9     32105     3.12E+05
Rosenbrock   LDPSO-S   82.158     246.23     376.6      391.42    2088.2
Rastrigin    LDPSO     59.188     90.845     92.837     15.694    147.74
Rastrigin    LDPSO-S   54.723     88.551     87.934     13.924    117.41
Griewank     LDPSO     239.44     726.56     6016.9     32105     3.12E+05
Griewank     LDPSO-S   82.158     246.23     376.6      391.42    2088.2

In the selection-strategy-based LDPSO, the timing of the selection operation is an important control parameter. The experiments showed that the algorithm performs better when the interval is between 150 and 200 generations; in this paper a selection operation is performed every 200 generations. As can be seen, the selection-strategy-based LDPSO performed better than LDPSO on all functions, and the advantage is obvious on three of them. The evolution curves also show that the selection strategy is effective in maintaining the diversity of particles.

[Figure 3: four panels (Sphere, Rosenbrock, Rastrigin, Griewank), each plotting Log10(fitness value) against Iteration (0 to 3000) for LDPSO and LDPSO-selection.]

Fig. 3. Comparison of evolution curves between LDPSO and LDPSO-Selection

5 Conclusion

The paper describes a new variant of PSO, called Logistic Dynamic PSO (LDPSO). Unlike the typical PSO, LDPSO does not have a velocity property. In LDPSO, the position of a particle in the new generation is computed from the historical information of the particle itself and of the particles in its neighborhood. Experimental results on benchmark functions demonstrate the efficiency of LDPSO. The paper also proposes two possible improvement strategies for LDPSO. The mutation strategy applies a mutation operation to the particle that holds the best position at each generation. The selection strategy applies a selection operation to the particle with the worst fitness value once every fixed number of generations. Experimental results demonstrate that the two improvement strategies can maintain the diversity of particles effectively. More detailed experiments should be carried out to analyze the characteristics of LDPSO, and future work should include an in-depth theoretical analysis of the working mechanism of LDPSO and similar PSO variants.

Acknowledgement

This work is supported by the National Natural Science Foundation of China under Grant No. 60803061.

References

1. Clerc, M.: The swarm and the queen: towards a deterministic and adaptive particle swarm optimization. In: The 1999 Congress on Evolutionary Computation, vol. 3, pp. 1951–1957. IEEE, Piscataway (1999)
2. Clerc, M., Kennedy, J.: The particle swarm-explosion, stability, and convergence in a multidimensional complex space. IEEE Transactions on Evolutionary Computation 6(1), 58–73 (2002)
3. Kennedy, J.: Small worlds and mega-minds: effects of neighborhood topology on particle swarm performance. In: The 1999 Congress on Evolutionary Computation, vol. 3, pp. 1931–1938. IEEE, Piscataway (1999)
4. Kennedy, J.: Dynamic-probabilistic particle swarms. In: The 2005 Genetic and Evolutionary Computation Conference, pp. 201–207. ACM, Washington (2005)
5. Kennedy, J.: In search of the essential particle swarm. In: The 2006 IEEE Congress on Evolutionary Computation, pp. 1694–1701. Inst. of Elec. and Elec. Eng. Computer Society, Vancouver (2006)
6. Kennedy, J., Eberhart, R.: Particle swarm optimization. In: The 1995 IEEE International Conference on Neural Networks, vol. 4, pp. 1942–1948. IEEE, Perth (1995)
7. Kennedy, J., Mendes, R.: Population structure and particle swarm performance. In: The 2002 World Congress on Computational Intelligence, vol. 2, pp. 1671–1676. IEEE, Piscataway (2002)
8. Poli, R.: Mean and variance of the sampling distribution of particle swarm optimizers during stagnation. IEEE Transactions on Evolutionary Computation 13(4), 712–721 (2009)
9. Shi, Y., Eberhart, R.: A modified particle swarm optimizer. In: The 1998 IEEE International Conference on Evolutionary Computation, pp. 69–73. IEEE, Anchorage (1998)
10. Wang, Z., Xing, H.: Dynamic-probabilistic particle swarm synergetic model: A new framework for a more in-depth understanding of particle swarm algorithms. In: The 2008 IEEE Congress on Evolutionary Computation, pp. 312–321. IEEE, Hong Kong (2008)

Digital Watermarking Enhancement Using Wavelet Filter Parametrization

Piotr Lipiński and Jan Stolarek

Institute of Information Technology, Technical University of Lodz, Poland
[email protected], [email protected]

Abstract. In this paper a genetic-based enhancement of digital image watermarking in the Discrete Wavelet Transform domain is presented. The proposed method is based on adaptive synthesis of a mother wavelet used for image decomposition. Wavelet synthesis is performed using parametrization based on an orthogonal lattice structure. A genetic algorithm is applied as an optimization method to synthesize a wavelet that provides the best watermarking quality with respect to the given optimality criteria. The effectiveness of the proposed method is demonstrated by comparing watermarking results obtained with synthesized wavelets and with the most commonly used Daubechies wavelets. Experiments demonstrate that mother wavelet selection is an important part of a watermark embedding process and can influence watermarking robustness, separability and fidelity.

Keywords: watermarking, adaptive wavelets, genetic algorithms.

A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part I, LNCS 6593, pp. 330–339, 2011. © Springer-Verlag Berlin Heidelberg 2011

1 Introduction

The concept of digital watermarking is to embed additional data ("a watermark") into a medium. This can be used either to ensure that the medium has not been modified (such watermarks should be fragile, i.e. they should be destroyed when the medium is altered in any way) or to allow copyright verification (such watermarks should be persistent, i.e. removal of the watermark should be impossible without damaging the watermarked medium beyond usability). In this paper persistent blind watermarking [5] of images is considered.

In recent years digital watermarking in the wavelet domain has gained much popularity. This is due to the good time-frequency localization properties of the Discrete Wavelet Transform (DWT), which allow a watermark to be embedded only in selected regions and frequencies of an image. So far, authors of watermarking algorithms have chosen the basis wavelet function used for image decomposition and synthesis arbitrarily (Haar or Daubechies wavelets in most cases). The influence of the wavelet on the watermarking process has been noticed by some authors [7,13], while others have proposed wavelet parametrization to improve watermark security [3,6,9]. Nevertheless, the problem of adjusting the wavelet in order to improve watermarking robustness and fidelity has not been addressed so far.

The authors of this paper have already done some research in that field. In [17] it was demonstrated that wavelets synthesized in order to maximize energy compaction can also improve the performance of watermarking algorithms proposed in the literature. However, that approach took into account neither the characteristics of the cover image nor the watermark and the watermarking algorithm itself. In this paper that problem is addressed. A genetic algorithm is used to adapt wavelets to the cover image, the embedded watermark and the watermarking algorithm. It will be shown that such an approach can significantly improve watermark embedding robustness, separability and fidelity. Robustness is defined as the ability to confirm the presence of a watermark in the watermarked image. Separability is defined as the ability to faultlessly distinguish the extracted watermark from random watermarks. Fidelity is measured in terms of minimizing the distortions introduced to the image by the watermarking process. It will be demonstrated that the proposed approach synthesizes wavelets that, in comparison to Daubechies wavelets, perform better in terms of all the above criteria.

This paper is divided into the following sections. Section 2 introduces the basic concepts of a DWT-based digital watermarking algorithm. Section 3 presents the embedding algorithm and the concept of a lattice structure used for wavelet parametrization, and describes the genetic algorithm used for wavelet synthesis. Section 4 presents the testing methodology and the results of the experiments. Section 5 summarizes the paper and discusses directions for future research.

2 Digital Watermarking in the DWT Domain

Many digital watermarking algorithms operating in the DWT domain have been proposed in recent years [1,2,8,10,11,19,20]. All these algorithms share the watermark embedding scheme shown in Figure 1. In this scheme the original data is first decomposed using the DWT. The watermark is then embedded by applying a selected embedding algorithm. The inverse DWT ($DWT^{-1}$ in Figure 1) is applied to reconstruct the data. To extract the embedded watermark, the DWT must be applied to the watermarked data. The watermark is extracted from the wavelet coefficients and compared with the original one.

[Figure 1: block diagram. Embedding path: data, DWT, wavelet coefficients, embed (with the watermark as second input), watermarked wavelet coefficients, DWT^{-1}, watermarked data. Extraction path: watermarked data, DWT, watermarked wavelet coefficients, extract, extracted watermark, compare (with the watermark).]

Fig. 1. Generic scheme of watermark embedding and extraction in the wavelet transform domain

3 Genetic-Based Digital Image Watermarking Enhancement

In the proposed adaptive digital image watermarking enhancement approach, the DWT and $DWT^{-1}$ steps in Figure 1 are modified. Instead of using an arbitrarily chosen mother wavelet, a genetic algorithm is applied to adapt the mother wavelet to the cover image, a watermark and an embedding algorithm.

3.1 Embedding Algorithm

Due to the proliferation of wavelet-based watermarking algorithms, in this paper a generic watermarking algorithm based on the E_BLIND/D_LC algorithm (Embedding: Blind / Detection: Linear Correlation) [5] is used to demonstrate the proposed watermarking enhancement method, without loss of generality. In this algorithm the watermark $w_r$ is a random sequence of $N$ integer numbers from the set $\{-1, 1\}$. Multilevel wavelet decomposition of the image is performed using Mallat's algorithm. The $N$ largest wavelet coefficients from all three detail subbands on the third level of image decomposition are selected. The watermark is embedded in the selected coefficients using the formula

$$ c_w = c_0 + \alpha w_r , \tag{1} $$

where $c_0$ are the selected wavelet coefficients, $\alpha$ is the embedding strength, $w_r$ is the watermark and $c_w$ are the watermarked wavelet coefficients. To extract the watermark, the watermarked image has to be decomposed using Mallat's algorithm. The watermark is detected by computing the normalized correlation between the watermarked wavelet coefficients and the original watermark according to the formula

$$ C = \frac{1}{N-1} \sum_{i} \frac{(c_w(i) - \bar{c}_w)(w_r(i) - \bar{w}_r)}{\sigma_c \sigma_w} , \tag{2} $$

where $N$ is the length of the watermark, $c_w$ denotes the watermarked coefficients, $\bar{c}_w$ is the mean value of the watermarked coefficients, $w_r$ denotes the embedded watermark, $\bar{w}_r$ is the mean value of the embedded watermark, and $\sigma_c$ and $\sigma_w$ are the standard deviations of the watermarked coefficients and of the watermark respectively. Presence or absence of the watermark is usually determined with a threshold $\tau$. If the correlation $C$ is greater than $\tau$ then the watermark is present, otherwise it is absent. Therefore, it is important to maximize the correlation.
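Equations (1) and (2) translate into a few lines of code. The following is a minimal sketch that works on plain lists of already selected coefficients; the function names are introduced here for illustration, and the use of sample standard deviations (divisor $N-1$), which makes $C$ coincide with the usual correlation coefficient in $[-1, 1]$, is an assumption, since the text does not state how $\sigma_c$ and $\sigma_w$ are computed.

```python
import math


def embed(c0, wr, alpha):
    """Equation (1): c_w = c_0 + alpha * w_r, element by element."""
    return [c + alpha * w for c, w in zip(c0, wr)]


def normalized_correlation(cw, wr):
    """Equation (2), assuming sample standard deviations (divisor N - 1)."""
    n = len(cw)
    mean_c = sum(cw) / n
    mean_w = sum(wr) / n
    sigma_c = math.sqrt(sum((c - mean_c) ** 2 for c in cw) / (n - 1))
    sigma_w = math.sqrt(sum((w - mean_w) ** 2 for w in wr) / (n - 1))
    cov = sum((c - mean_c) * (w - mean_w) for c, w in zip(cw, wr))
    return cov / ((n - 1) * sigma_c * sigma_w)


def detect(cw, wr, tau):
    """The watermark is declared present when C exceeds the threshold tau."""
    return normalized_correlation(cw, wr) > tau
```

Both sequences must be non-constant, otherwise one of the standard deviations is zero and the correlation is undefined.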

3.2 Wavelet Parametrization

In this paper wavelet parametrization based on an orthogonal lattice structure is used [21]. Such a structure can be used to perform wavelet decomposition of a signal. Properties of this structure are presented and discussed in detail in [18]. Below is a short summary.

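[Figure 2: lattice structure with inputs x0 to x7 and outputs y0 to y7, built from three stages of two-point base operations D1, D2 and D3; the wrap-around connections between stages are labelled t1 and t2.]

Fig. 2. Lattice structure performing 6-tap transform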

The lattice structure is based on two-point base operations

$$ D_k = \begin{bmatrix} w_{11}^{k} & w_{12}^{k} \\ w_{21}^{k} & w_{22}^{k} \end{bmatrix} , \tag{3} $$

where $k$ stands for the index of the operation (see Fig. 2). Such a two-point base operation can be written in the form of a matrix equation:

$$ \begin{bmatrix} b_1 \\ b_2 \end{bmatrix} = D_k \cdot \begin{bmatrix} a_1 \\ a_2 \end{bmatrix} . \tag{4} $$

The lattice structure is composed of $K/2$ stages, each containing $D_k$ operations repeated $N/2$ times, where $K$ and $N$ are the lengths of the filter's impulse response and of the processed signal respectively. On each stage of the lattice structure, elements of the signal are processed in pairs by $D_k$ base operations. After each stage, base operations are shifted down by one and a lower input of the last base operation in the current stage is connected to the upper output of the first base operation in the preceding stage ($t_1$ and $t_2$ in Fig. 2). Upper outputs of base operations in the last layer ($y_0$, $y_2$, $y_4$ and $y_6$ in Fig. 2) correspond to the low-pass filter signal and lower outputs ($y_1$, $y_3$, $y_5$ and $y_7$ in Fig. 2) correspond to the high-pass filter signal. Wavelet filter bank coefficients are calculated based on the $D_k$ base operations.
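The staged, pairwise processing described above can be sketched in code. This is only an assumed reading of the wiring: stage $s$ (counting from 0) is taken to pair samples $s + 2j$ and $s + 2j + 1$ with indices wrapped modulo $N$, and outputs are written back to the same positions, so the output ordering may differ from the labelling in Fig. 2. The function name `lattice_forward` and the tuple representation of a base operation are hypothetical.

```python
def lattice_forward(signal, base_ops):
    """Apply one 2x2 base operation per stage to neighbouring pairs.

    signal:   list of N samples, N even
    base_ops: one 2x2 matrix ((w11, w12), (w21, w22)) per stage
    Stage s pairs samples (s + 2j, s + 2j + 1), indices taken modulo N,
    which is an assumed reading of the "shifted down by one" wiring.
    """
    n = len(signal)
    data = list(signal)
    for shift, ((w11, w12), (w21, w22)) in enumerate(base_ops):
        out = data[:]
        for j in range(n // 2):
            top = (shift + 2 * j) % n
            bottom = (top + 1) % n
            a1, a2 = data[top], data[bottom]
            out[top] = w11 * a1 + w12 * a2      # equation (4), upper output
            out[bottom] = w21 * a1 + w22 * a2   # equation (4), lower output
        data = out
    return data
```

Because every stage touches each sample exactly once, the whole structure preserves the energy of the signal whenever each base operation is an orthogonal matrix.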

Let us assume a $D_k$ base operation such that the condition

$$ D_k \cdot D_k^{T} = I \tag{5} $$

holds true, i.e. the $D_k$ matrix is orthogonal. This implies that

$$ w_{11}^{k} w_{21}^{k} + w_{12}^{k} w_{22}^{k} = 0 , \tag{6} $$

$$ (w_{11}^{k})^2 + (w_{12}^{k})^2 = 1 . \tag{7} $$

As was proposed in [15], the following substitution into equation (3) is a sufficient condition to satisfy equation (6):

$$ w_{21}^{k} = w_{12}^{k} , \qquad w_{22}^{k} = -w_{11}^{k} . \tag{8} $$

Substituting equation (8) into equation (3), we can rewrite the $D_k$ base operation in the new form of an $S_k$ base operation containing only two parameters, $w_{11}^{k}$ and $w_{12}^{k}$, instead of four:

$$ S_k = \begin{bmatrix} w_{11}^{k} & w_{12}^{k} \\ w_{12}^{k} & -w_{11}^{k} \end{bmatrix} . \tag{9} $$

Equation (7) implies that the $S_k$ transform preserves energy. Such an $S_k$ base operation is called an orthogonal base operation, and the lattice structure based on $S_k$ operations is called an orthogonal lattice structure. An orthogonal lattice structure must be used in order to synthesize orthogonal wavelets that fulfil the conditions of the wavelet decomposition [4,14]. Let us assume

$$ w_{11}^{k} = \cos(\alpha_k) , \qquad w_{12}^{k} = \sin(\alpha_k) . \tag{10} $$

Under this assumption equation (7) holds true, and we can rewrite equation (9) as

$$ S_k = \begin{bmatrix} \cos(\alpha_k) & \sin(\alpha_k) \\ \sin(\alpha_k) & -\cos(\alpha_k) \end{bmatrix} , \tag{11} $$

which allows the two weights of the $S_k$ base operation to be replaced with only one parameter $\alpha_k$. Therefore a lattice structure consisting of $N$ layers can be represented using only $N$ numbers $(\alpha_1, \alpha_2, \ldots, \alpha_N)$, where each $\alpha_i \in [0, 2\pi)$.

3.3 Wavelet Synthesis Algorithm

An orthogonal lattice structure allows a wavelet filter bank to be adapted by adjusting the base operations. When the base operations are modified, the output signal from the lattice structure changes. This signal can then be rated in terms of its fitness with respect to some quality criteria. In digital image watermarking, discussed in this paper, there are two contradicting fitness criteria:

1. the difference between (a) the correlation of the extracted watermark with the embedded watermark and (b) the correlation of the extracted watermark with random watermarks should be maximized,
2. the visual difference between the original image and the watermarked image should be minimized.

This turns the problem of mother wavelet synthesis using a lattice structure into a multiobjective optimization problem, which is usually solved using an evolutionary approach. In [16] a genetic algorithm for synthesizing a wavelet that compacts energy into low-pass wavelet coefficients was introduced. A Simple Genetic Algorithm with Evolutionary Strategies was applied to synthesize the optimal mother wavelet by optimizing a defined objective function. Algorithm 1 shows an outline of that algorithm. This method can be easily adapted to synthesize wavelets conforming to both above-mentioned criteria by modifying the fitness functions.

Algorithm 1. Genetic algorithm outline
    initialize random population P of μ individuals
    for k = 1 to ITERATIONS_COUNT do
        evaluate fitness of individuals in population P
        create temporary population T containing λ individuals using tournament selection from population P
        perform crossover and mutation on individuals in population T
        evaluate fitness of individuals in population T
        select μ individuals to form new population P
    end for
    display best individual in population P
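The outline of Algorithm 1 can be turned into a small runnable sketch. Several details are not fixed by the outline and are assumptions here: individuals are lists of angles in $[0, 2\pi)$, crossover is one-point, mutation adds Gaussian noise, and the new population is formed from the best μ individuals of P and T together. The function names and default parameter values are hypothetical, and the fitness function is passed in by the caller.

```python
import math
import random

TWO_PI = 2.0 * math.pi


def tournament(population, scores, rng, size=2):
    """Return the best of `size` randomly drawn individuals."""
    picks = [rng.randrange(len(population)) for _ in range(size)]
    return population[max(picks, key=lambda i: scores[i])]


def evolve(fitness, n_angles, mu=10, lam=20, iterations=50, sigma=0.3, seed=0):
    rng = random.Random(seed)
    # initialize random population P of mu individuals
    pop = [[rng.uniform(0.0, TWO_PI) for _ in range(n_angles)] for _ in range(mu)]
    history = []
    for _ in range(iterations):
        scores = [fitness(ind) for ind in pop]
        # temporary population T of lam individuals via tournament selection
        temp = [list(tournament(pop, scores, rng)) for _ in range(lam)]
        # one-point crossover on consecutive pairs (assumed operator)
        for a, b in zip(temp[0::2], temp[1::2]):
            cut = rng.randrange(1, n_angles) if n_angles > 1 else 0
            a[cut:], b[cut:] = b[cut:], a[cut:]
        # Gaussian mutation, wrapped back into the angle range (assumed operator)
        for ind in temp:
            for g in range(n_angles):
                ind[g] = (ind[g] + rng.gauss(0.0, sigma)) % TWO_PI
        # survivor selection: best mu individuals of P and T together (assumption)
        merged = pop + temp
        merged.sort(key=fitness, reverse=True)
        pop = merged[:mu]
        history.append(fitness(pop[0]))
    return pop[0], history
```

With this elitist survivor selection the best fitness in the population can never decrease from one iteration to the next, which makes the sketch easy to check on a toy fitness function before plugging in the watermarking criteria.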

To evaluate the fitness in terms of criterion 1, a set of random watermarks must be generated. The normalized correlation between the extracted watermark and the embedded watermark is calculated, and then the normalized correlation between the extracted watermark and each of the random watermarks is calculated. Since the normalized correlation falls into the range $[-1, 1]$, the fitness of the $i$-th individual in terms of criterion 1 is calculated using the formula

$$ F_{i1} = \max\left( \frac{\min_j (C_e - C_{rj})}{2} ,\, 0 \right) , \tag{12} $$

where $C_e$ is the normalized correlation between the extracted watermark and the embedded watermark and $C_{rj}$ is the normalized correlation between the extracted watermark and the $j$-th random watermark. This means that we select the smallest ("the worst") difference. Since both $C_e$ and $C_{rj}$ fall in the range $[-1, 1]$, the result of $C_e - C_{rj}$ falls in the range $[-2, 2]$. The fitness must be in the range $[0, 1]$, hence the division by 2 and the $\max(\cdot)$ function have to be applied.
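Equation (12) reduces to a one-line computation once the correlations are known. In the sketch below the function name `fitness_separability` is introduced for illustration; `c_embedded` plays the role of $C_e$ and `c_random` is the list of values $C_{rj}$.

```python
def fitness_separability(c_embedded, c_random):
    """Equation (12): worst-case correlation gap, mapped into [0, 1]."""
    worst_gap = min(c_embedded - c for c in c_random)
    return max(worst_gap / 2.0, 0.0)
```

A single random watermark that correlates with the extracted one as strongly as the embedded watermark is enough to drive this fitness to zero, which is what makes the criterion a worst-case measure.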

The fitness of the $i$-th individual in terms of criterion 2 is calculated using the formula

$$ F_{i2} = 1 - \frac{MSE}{PMSE} , \tag{13} $$

where MSE stands for Mean Square Error and PMSE stands for Peak Mean Square Error between the original and the watermarked image. After the $F_{i2}$ values have been calculated for all the individuals in a population, they are normalized to fit into the range $[0, 1]$. Thus the worst individual has fitness 0 and the best individual has fitness 1.

Algorithm 2. Fitness evaluation
    for all (individual_i ∈ population) do
        convert (α_i1, α_i2, ..., α_iN) to wavelet filter coefficients
        embed watermark in an image
        calculate fitness F_i2 of the watermarked image
        extract watermark
        calculate fitness F_i1
    end for
    Require: ∀i F_i1 ∈ [0, 1] ∧ F_i2 ∈ [0, 1]
    for all (individual_i ∈ population) do
        F_i = min(F_i1, F_i2)
    end for

The Global Optimality Level [12] is used as the method for multiobjective optimization. The concept of this approach is to calculate each individual's fitness for all optimization criteria, ensure that the values fall into the same range ($[0, 1]$ in the presented approach) and, for each individual, select the worst of its partial fitnesses as the total fitness of that individual. The evaluation of the individuals' fitness is outlined in Algorithm 2. The presented enhancement method is a generic one and can be used to improve any digital watermarking algorithm. This requires modifying the "embed watermark in an image" and "extract watermark" steps in Algorithm 2. In this paper these steps represent embedding the watermark using equation (1) and extracting it using equation (2), but they can be substituted with any embedding/extraction algorithm. It will be demonstrated that the fitness evaluation method outlined above leads to the synthesis of wavelets that perform better than Daubechies wavelets in terms of robustness, separability and fidelity.
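The combination step of the Global Optimality Level can be sketched as follows. The helper names are hypothetical, and min-max scaling is assumed as the normalization that maps the worst fidelity value in the population to 0 and the best to 1; the handling of a population in which all fidelity values are equal is also an assumption.

```python
def normalize(values):
    """Min-max scaling: the worst value maps to 0, the best to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: identical individuals are all treated as "best".
        return [1.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def global_optimality_level(f1, f2):
    """Total fitness: the worse of the two partial fitnesses per individual.

    f1: separability fitness per individual, already in [0, 1]
    f2: raw fidelity fitness per individual, normalized here over the population
    """
    f2n = normalize(f2)
    return [min(a, b) for a, b in zip(f1, f2n)]


total = global_optimality_level([0.4, 0.9, 0.2], [0.75, 0.25, 0.5])
```

Taking the minimum means an individual can only score well if it is good on both criteria at once, so neither separability nor fidelity can be traded away entirely for the other.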

4 Experimental Results

Experiments were carried out to verify the effectiveness of the proposed approach. 64 grey scale images from the USC-SIPI Image Database were chosen (e.g. Lena, Mandrill, Boat, House). The watermark was a sequence of 256 random numbers from the set {−1, 1}. For every image an adaptive 4-tap wavelet was synthesized using the genetic algorithm approach. The performance of this wavelet was compared with the Daubechies 4 wavelet in terms of robustness, separability and fidelity. Table 1 shows a comparison of the results. Image fidelity is expressed using the Peak Signal-to-Noise Ratio (PSNR) between the original image and the watermarked one. The first row is the average value for all 64 images. The second row contains the standard deviation, to provide deeper insight into the distribution of the results. The remaining two rows contain the minimal and maximal values of the measured quantities. It can be clearly seen that the Daubechies wavelet is outperformed significantly in terms of correlation and separability. In terms of image fidelity the adaptive wavelet performs slightly better.

Table 1. Results comparison

                  correlation             separability            PSNR [dB]
                  Daubechies  Adaptive    Daubechies  Adaptive    Daubechies  Adaptive
Average           0.38        0.87        0.18        0.87        46.97       47.09
Std. deviation    0.17        0.13        0.17        0.13        1.08        1.19
Min. value        0.00        0.33        −0.20       0.33        43.36       43.37
Max. value        0.88        1.00        0.68        1.00        50.60       50.59

[Figure 3: two image fragments, (a) Daubechies wavelet and (b) Adaptive wavelet.]

Fig. 3. Comparison of image watermarking artifacts

Figure 3 shows two fragments of the Lena image. Figure 3a demonstrates artifacts caused by watermark embedding using the Daubechies 4 wavelet for image decomposition. In Figure 3b the same watermark has been embedded using an adaptively synthesized 4-tap wavelet. It can be seen that the amplitude of the watermarking distortions is smaller with adaptive wavelets, and therefore they are less visible. It must be noted, however, that for some images the visual difference is practically imperceptible.

5 Conclusions

Robustness, separability and fidelity are the three major requirements in digital image watermarking. The aim of this paper was to show that it is possible to improve all three of these parameters simultaneously by adjusting the mother wavelet to the properties of an image, a watermark and an embedding algorithm. To achieve this goal, a genetic-based enhancement method has been developed and tested using well-known test images. As was shown in Section 4, the proposed method can effectively synthesize wavelets that outperform the Daubechies wavelets in terms of watermarking robustness, watermark separability and image fidelity. Tests were carried out using a generic watermarking algorithm; however, the proposed method can be used to enhance any existing watermarking algorithm operating in the DWT domain. In further research the presented algorithm can be extended to adaptively select the length of the wavelet filter, the number of image decomposition levels and the subbands to be watermarked. The algorithm can also take into account various attacks against the embedded watermark. Image quality evaluation based on the Human Visual System can be introduced instead of the MSE-based one.

References

1. Buse, R.D., Mwangi, E.: A digital image watermarking scheme based on localised wavelet coefficients dependency. In: AFRICON 2009, pp. 1–4 (September 2009)
2. Chang, C.-C., Tai, W.-L., Lin, C.-C.: A multipurpose wavelet-based image watermarking. In: First International Conference on Innovative Computing, Information and Control, ICICIC 2006, vol. 3, pp. 70–73 (2006)
3. Cheng, G., Yang, J.: A watermark scheme based on two-dimensional wavelet filter parametrization. In: Fifth International Conference on Information Assurance and Security, pp. 301–304 (2009)
4. Cooklev, T.: An efficient architecture for orthogonal wavelet transforms. IEEE Signal Processing Letters 13(2) (February 2006)
5. Cox, I.J., Miller, M.L., Bloom, J.A., Fridrich, J., Kalker, T.: Digital Watermarking and Steganography, 1st edn. Elsevier, Amsterdam (2008)
6. Dietl, W., Meerwald, P., Uhl, A.: Protection of wavelet-based watermarking systems using filter parametrization. Signal Processing 83(10), 2095–2116 (2003)
7. Dietze, M., Jassim, S.: Filter ranking for DWT-domain robust digital watermarking. EURASIP Journal on Applied Signal Processing 14, 2093–2101 (2004)
8. Ejima, M., Miyazaki, A.: A wavelet-based watermarking for digital images and video. In: Proceedings of International Conference on Image Processing, 2000, vol. 3, pp. 678–681 (2000)
9. Huang, Z.Q., Jiang, Z.: Watermarking still images using parametrized wavelet systems. In: Image and Vision Computing, Institute of Information Sciences and Technology, Massey University (2003)
10. Huo, F., Gao, X.: A wavelet based image watermarking scheme. In: IEEE International Conference on Image Processing 2006, pp. 2573–2576 (October 2006)
11. Kim, J.R., Moon, Y.S.: A robust wavelet-based digital watermark using level-adaptive thresholding. In: Proceedings of the 6th IEEE International Conference on Image Processing, ICIP 1999 (October 1999)
12. Kowalczuk, Z., Bialaszewski, T.: Genetic algorithms in multi-objective optimization of detection observers. In: Korbicz, J., Kościelny, J.M., Kowalczuk, Z., Cholewa, W. (eds.) Fault Diagnosis. Models, Artificial Intelligence, Applications, pp. 511–556. Springer, Heidelberg (2004)
13. Miyazaki, A.: A study on the best wavelet filter bank problem in the wavelet-based image watermarking. In: 18th European Conference on Circuit Theory and Design, ECCTD 2007, pp. 184–187 (August 2007)
14. Rieder, P., Gotze, J., Nossek, J.S., Burrus, C.S.: Parameterization of orthogonal wavelet transforms and their implementation. IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing 45(2), 217–226 (1998)
15. Stasiak, B., Yatsymirskyy, M.: Fast orthogonal neural networks. In: Rutkowski, L., Tadeusiewicz, R., Zadeh, L.A., Żurada, J.M. (eds.) ICAISC 2006. LNCS (LNAI), vol. 4029, pp. 142–149. Springer, Heidelberg (2006)
16. Stolarek, J.: Improving energy compaction of a wavelet transform using genetic algorithm and fast neural network. Archives of Control Sciences 20(4), 381–397 (2010)
17. Stolarek, J., Lipiński, P.: Improving digital watermarking fidelity using fast neural network for adaptive wavelet synthesis. Journal of Applied Computer Science 18(1), 61–74 (2010)
18. Stolarek, J., Yatsymirskyy, M.: Fast neural network for synthesis and implementation of orthogonal wavelet transform. In: Image Processing & Communications Challenges. AOW EXIT (2009)
19. Tsai, M.-J., Hung, H.-Y.: DCT and DWT-based image watermarking by using subsampling. In: Proceedings of 24th International Conference on Distributed Computing Systems Workshops, 2004, pp. 184–189 (2004)
20. Wang, Y., Doherty, J.F., Van Dyck, R.E.: A wavelet-based watermarking algorithm for ownership verification of digital images. IEEE Transactions on Image Processing 11(2), 77–88 (2002)
21. Yatsymirskyy, M.: Lattice structures for synthesis and implementation of wavelet transforms. Journal of Applied Computer Science 17(1), 133–141 (2009)

CellularDE: A Cellular Based Differential Evolution for Dynamic Optimization Problems

Vahid Noroozi, Ali B. Hashemi, and Mohammad Reza Meybodi

Computer Engineering and Information Technology Department, Amirkabir University of Technology, Tehran, Iran
{vnoroozi,a_hashemi,mmeybodi}@aut.ac.ir

Abstract. In real life we are often confronted with dynamic optimization problems whose optima change over time. These problems challenge traditional optimization methods as well as conventional evolutionary optimization algorithms. In this paper, we propose an evolutionary model that combines the differential evolution algorithm with cellular automata to address dynamic optimization problems. In the proposed model, called CellularDE, a cellular automaton partitions the search space into cells. Individuals in each cell, which implicitly create a subpopulation, are evolved by the differential evolution algorithm to find the local optimum in the cell neighborhood. Experimental results on the moving peaks benchmark show that CellularDE outperforms DynDE, cellular PSO, FMSO, and mQSO in most tested dynamic environments.

Keywords: Dynamic environments, Differential evolution, Cellular Automata.

1 Introduction Many optimization problems in the real-world are dynamic in which the fitness function changes over time. These problems can be modeled with multiple local optimums that one of them is the global optimum. When a change is occurred in the environment, the global optimum may change. Hence the optimization algorithm has to track the changes in the environment and find the new global optimum quickly. The dynamic optimization problems have challenged traditional optimization algorithms as well as conventional evolutionary algorithms, which were designed for nonstationary problems. This is due to the goal of conventional evolutionary algorithms, converging to the fixed global optimum, which results in losing diversity and decreases the power of algorithm to search for the new optimum after the environment changes. Many approaches have been introduced to address dynamic optimization problems that can be categorized into four groups: (1) Diversity maintenance approaches [1-2]. (2) Memory based approaches [3-4]. (3) Increasing diversity approaches [5-6]. (4) Multi-population approaches [7-8]. A comprehensive survey on evolutionary algorithms that are developed and applied to dynamic environments can be found in [9]. Most of the recent researches on evolutionary algorithms for dynamic optimization problems are focused on the fourth category which also has been shown the most outstanding results in dynamic optimization among other proposed approaches [10]. A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part I, LNCS 6593, pp. 340–349, 2011. © Springer-Verlag Berlin Heidelberg 2011

CellularDE: A Cellular Based Differential Evolution


It has been shown that the differential evolution algorithm (DE) [11] is a simple and powerful algorithm for continuous function optimization, not only in static [12] but also in dynamic environments [7, 13-15]. Moreover, DynDE [7], which to the best of our knowledge is the best-performing differential evolution algorithm for dynamic optimization problems, produces competitive results compared to other dynamic optimization algorithms. Although an improved version of DynDE has been presented [13], it only slightly improves the performance in a few environments, while it significantly increases the complexity of the algorithm.
In [16-17], Hashemi and Meybodi proposed cellular PSO, a hybrid model of particle swarm optimization and cellular automata, and showed that it produces promising results. In cellular PSO, a cellular automaton is embedded into the search space and partitions it such that each partition corresponds to a cell. At any time, in some cells of the cellular automaton, a group of particles is present and searches for an optimum. In this search, particles in a cell use their best personal experience and the best solution found in their neighboring cells. Moreover, in order to prevent the loss of particle diversity, a limit is imposed on the number of particles searching in each cell.
In this paper, we combine DE and cellular automata in order to maintain the diversity of the population in the search space and to perform effective local search at the same time. In the proposed model, named CellularDE, a cellular automaton is embedded into the search space, as in cellular PSO, and partitions it such that each partition corresponds to a cell. The population is implicitly divided into sub-populations, each residing in a cell. Individuals in each cell search for a local optimum using local information in that cell and its neighbors. In moving towards a dynamic optimum, keeping the density of individuals in each cell below a specified threshold is of great concern. To keep the density below this threshold, a portion of the individuals may be reinitialized at the end of each iteration. This mechanism spreads the individuals over the highest peaks across the environment, while it helps them converge to each peak in a short time. Moreover, after detecting a change in the environment, individuals temporarily perform a random local search around the local optimum in their neighborhood, which helps the algorithm to quickly find new peaks.
The performance of the proposed CellularDE is compared with DynDE [7], cellular PSO [16-17], FMSO [18] and mQSO [19-20] in various dynamic environments modeled by the moving peaks benchmark [21-22]. The results of the experiments show that CellularDE outperforms the other algorithms in most tested dynamic environments. In addition, it is more robust to the number of peaks in the environment. Moreover, CellularDE takes advantage of the local information exchange and decentralized control of the cellular automaton, which makes the algorithm scalable.
The rest of this paper is organized as follows. The next section describes the differential evolution algorithm. Section 3 provides a detailed specification of the proposed algorithm, with a brief introduction to cellular automata as the foundation of our approach. Section 4 presents experimental results of the proposed model. Finally, Section 5 concludes this paper.


V. Noroozi, A.B. Hashemi, and M.R. Meybodi

2 Differential Evolution
Differential Evolution (DE) [11] is a population-based evolutionary algorithm which differs from other evolutionary algorithms in its mechanism of generating offspring. In evolutionary algorithms, an individual plays the role of a parent to generate an offspring. However, in DE an offspring is generated using vector differences among individuals in the population [12]. The canonical DE algorithm works as follows.
1. Initialize population with uniform distribution.
2. Until a termination condition is met, for each individual $\vec{x}_i$ in the population do in parallel:
   i. Select three different individuals $\vec{x}_{r1}$, $\vec{x}_{r2}$, and $\vec{x}_{r3}$ from the current population randomly.
   ii. Generate a trial vector $\vec{v}_i$ using eq. (1).

       $$\vec{v}_i = \vec{x}_{r1} + F \cdot (\vec{x}_{r2} - \vec{x}_{r3}) \tag{1}$$

       where F∈[0,1] is a scale factor.
   iii. Create a new vector $\vec{u}_i$ using binomial crossover according to eq. (2).

       $$u_{ij} = \begin{cases} v_{ij} & \text{if } U(0,1) \le C_r \text{ or } j = r_j \\ x_{ij} & \text{otherwise} \end{cases} \tag{2}$$

       where Cr∈(0,1) is known as the crossover probability, rj∈[1,D] is a random integer, and D is the number of dimensions.
   iv. Evaluate the candidate $\vec{u}_i$.
   v. Replace individual $\vec{x}_i$ with $\vec{u}_i$ if the fitness of the newly created vector $\vec{u}_i$ is better than the fitness of individual $\vec{x}_i$.

In this algorithm, F and Cr are control parameters which play the same role as the mutation and crossover probabilities in evolutionary algorithms such as genetic algorithms. Many variants of DE, called schemes, have been proposed in the literature. The schemes differ in the number of random individuals that are used to construct a new trial vector, and in whether the current individual or the global best individual is used as part of that computation.
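The canonical DE steps of this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `de_rand_1_bin`, the default parameter values, the clamping to the search box, the sequential (rather than parallel) update, and the requirement that the three random individuals also differ from the current one are all choices made here for the example. Fitness is maximized, as in the moving peaks setting.

```python
import random

def de_rand_1_bin(fitness, bounds, pop_size=20, F=0.5, Cr=0.9, generations=100, seed=0):
    """Maximize `fitness` over the box `bounds` with a DE/rand/1/bin sketch."""
    rng = random.Random(seed)
    D = len(bounds)
    # Step 1: uniform initialization inside the bounds.
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    fit = [fitness(x) for x in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Step i: three mutually different individuals (assumed != i as well).
            r1, r2, r3 = rng.sample([k for k in range(pop_size) if k != i], 3)
            # Step ii: trial vector, eq. (1).
            v = [pop[r1][j] + F * (pop[r2][j] - pop[r3][j]) for j in range(D)]
            # Step iii: binomial crossover, eq. (2); component rj always comes from v.
            rj = rng.randrange(D)
            u = [v[j] if (rng.random() <= Cr or j == rj) else pop[i][j] for j in range(D)]
            # Clamp to the search box (an assumption, not stated in the text).
            u = [min(max(u[j], bounds[j][0]), bounds[j][1]) for j in range(D)]
            # Steps iv-v: evaluate the candidate and keep it only if it is fitter.
            fu = fitness(u)
            if fu > fit[i]:
                pop[i], fit[i] = u, fu
    best = max(range(pop_size), key=lambda k: fit[k])
    return pop[best], fit[best]
```

For example, `de_rand_1_bin(lambda x: -sum(v * v for v in x), [(-5.0, 5.0)] * 3)` searches for the maximum of a negated sphere function in three dimensions.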

3 The Proposed Algorithm
In the proposed algorithm, we follow the partitioning approach using cellular automata [16-17]. A cellular automaton (CA) is a mathematical model for systems consisting of a large number of simple identical components with local interactions. It is called cellular because it is made up of cells, like points in a lattice or like squares of a checkerboard, and it is called an automaton because it follows a simple rule [23]. Each individual cell is in a specific state. The cells synchronously update their states in discrete steps according to a local rule, which depends on the previous states of the cell itself and its neighborhood. Fig. 1 (a, b) shows two well-known neighborhoods in a 2-dimensional cellular automaton, Moore and von Neumann. The overall structure can be viewed as a parallel processing device.
In CellularDE, a cellular automaton is embedded into the search space and divides it into a number of partitions, each of which corresponds to one cell in the cellular automaton. Fig. 1 (c) illustrates a 2-D cellular automaton which is embedded into a two-dimensional search space and partitions it into 5×5 cells. The cellular automaton implicitly divides the population into a number of sub-populations, each of which resides in a cell and is responsible for finding the highest peak in that cell and its neighborhood, if there is any peak. In the following, we first introduce the components of CellularDE and then describe the detailed steps of the proposed algorithm.
3.1 Memory of the Cells
In CellularDE, each cell has two memories, cellBestMem_i and gBestMem_i. cellBestMem_i is the best position found in cell i since the last change in the environment (eq. (3)). gBestMem_i is the best position found in the neighborhood of cell i, which includes cell i and its neighbors, since the last change in the environment (eq. (4)).

$$\mathrm{cellBestMem}_i \leftarrow \operatorname*{argmax}_{\forall k,\ \mathrm{individual}_k \text{ is in cell } i} \{ f(\mathrm{individual}_k),\ f(\mathrm{cellBestMem}_i) \} \tag{3}$$

$$\mathrm{gBestMem}_i \leftarrow \operatorname*{argmax}_{\forall j,\ \text{cell } j \text{ is a neighbor of cell } i} \{ f(\mathrm{cellBestMem}_i),\ f(\mathrm{cellBestMem}_j) \} \tag{4}$$
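The two cell memories can be illustrated with a small Python sketch. It assumes a search box with the same bounds in every dimension, a grid with the same number of cells per dimension, and a Moore neighborhood without wrap-around; the helper names `cell_of`, `moore_neighbors`, and `update_memories` are hypothetical and do not come from the paper. Memories are stored as (position, fitness) pairs, and gBestMem is only computed for cells that already hold a cellBestMem, which is a simplification.

```python
from itertools import product

def cell_of(position, lower, upper, cells_per_dim):
    """Map a position inside [lower, upper]^D to the index tuple of its cell."""
    width = (upper - lower) / cells_per_dim
    return tuple(min(int((x - lower) / width), cells_per_dim - 1) for x in position)

def moore_neighbors(cell, cells_per_dim, radius=1):
    """Cells within `radius` steps in every dimension, excluding `cell` itself."""
    ranges = [range(max(c - radius, 0), min(c + radius, cells_per_dim - 1) + 1)
              for c in cell]
    return [n for n in product(*ranges) if n != cell]

def update_memories(individuals, fitness, cell_best, lower, upper, cells_per_dim, radius=1):
    """Apply eq. (3) to every occupied cell, then eq. (4) to every cell with a memory."""
    for x in individuals:
        c = cell_of(x, lower, upper, cells_per_dim)
        fx = fitness(x)
        if c not in cell_best or fx > cell_best[c][1]:
            cell_best[c] = (list(x), fx)           # eq. (3)
    g_best = {}
    for c in cell_best:
        candidates = [cell_best[c]]
        candidates += [cell_best[n] for n in moore_neighbors(c, cells_per_dim, radius)
                       if n in cell_best]
        g_best[c] = max(candidates, key=lambda m: m[1])  # eq. (4)
    return cell_best, g_best
```

Enumerating a Moore neighborhood grows quickly with the number of dimensions, so a practical implementation would likely only visit cells that currently hold a memory.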

3.2 Cellular Differential Evolution
In CellularDE, individuals within a cell and its neighbors implicitly create a subpopulation. Each sub-population aims to search for the local optimum in its neighborhood using the differential evolution algorithm. Therefore, differential evolution in CellularDE should provide local exploration capability rather than the global exploration of canonical DE. Hence, CellularDE utilizes the DE/rand-to-best/1/bin scheme, in which the mutant vector for an individual residing in cell j is calculated using eq. (5).

[Eq. (5): the formula of the mutant vector is not recoverable from the source text.]

where $\vec{x}_{r1}$, $\vec{x}_{r2}$, and $\vec{x}_{r3}$ are three different individuals that are randomly selected among the individuals residing in cell j, F specifies the greediness of the search algorithm towards the current best point, and λrand is a uniform random variable in the range [0,1].
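The description above names three random cell members, a best point, the greediness factor F, and a uniform random weight λrand. One common way to combine these in a rand-to-best/1 mutation is sketched below. The exact formula used here, `x_r1 + F * (best - x_r1) + lambda_rand * (x_r2 - x_r3)`, is an assumption for illustration and is not taken from the paper; the function name `rand_to_best_mutant` is hypothetical as well.

```python
import random

def rand_to_best_mutant(cell_members, best, F=0.2, rng=random):
    """One plausible rand-to-best/1 mutant for a cell with at least three members.

    Assumed form: v = x_r1 + F * (best - x_r1) + lambda_rand * (x_r2 - x_r3).
    """
    x1, x2, x3 = rng.sample(cell_members, 3)   # three different cell members
    lam = rng.random()                         # lambda_rand drawn uniformly from [0, 1)
    return [a + F * (b - a) + lam * (c - d)
            for a, b, c, d in zip(x1, best, x2, x3)]
```

With a larger F the mutant is pulled more strongly towards the best point, which matches the role of F as a greediness parameter in the text.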


Fig. 1. (a) Moore neighborhood (b) Von Neumann neighborhood (c) A 2-D search space partitioned by a cellular automaton with 5×5 cells

3.3 Maintaining Diversity of Population
In CellularDE, to prevent the convergence of many individuals to one region and to maintain the diversity of the population in the search space, the number of individuals in each cell is limited to a specific value, referred to as the cell capacity (θ). When the number of individuals in a cell exceeds the cell capacity, the worst individuals in the cell are moved to new randomly selected cells such that the size of the population in the cell is reduced to the cell capacity. Since the connections in the cellular automaton are local, an individual that is selected to move to a random cell can only move one cell at a time. Until this individual arrives at its destination cell, it is inactive, i.e. it does not participate in the evolution process and hence does not require any fitness evaluation.
3.4 Dealing with the Change
CellularDE detects a change in the environment by re-evaluating cellBestMem of each cell. If the fitness of cellBestMem has changed since its last evaluation, a change is detected. Upon detecting a change in the environment, the fitness of every individual is re-evaluated and gBestMem of all cells is cleared. However, in order to use the previous search efforts of the population, each cell preserves cellBestMem after re-evaluation. Moreover, after a change is detected in the environment, the population performs local search for the next LSnum iterations. To perform the local search, for each individual in cell i, a new individual is generated and positioned randomly in a hyper-sphere with radius LSr centered at gBestMem_i. If the fitness of the new individual is better than that of the old one, it replaces the old individual. This local search helps the individuals around each peak to follow the peak and find its new position quickly. The detailed steps of the proposed CellularDE are summarized in Algorithm 1.
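The local search after a change can be sketched as follows. The text does not say how positions inside the hyper-sphere are distributed, so uniform sampling over the ball is assumed here; the function names `random_point_in_ball` and `local_search_step` are hypothetical. The acceptance rule follows the text of Section 3.4 (keep the new individual only if it is fitter) for a maximization problem.

```python
import math
import random

def random_point_in_ball(center, radius, rng=random):
    """Sample uniformly from the ball of the given radius around `center` (assumed)."""
    direction = [rng.gauss(0.0, 1.0) for _ in center]
    norm = math.sqrt(sum(d * d for d in direction)) or 1.0
    # Scaling by U^(1/D) makes the samples uniform over the volume of the ball.
    r = radius * rng.random() ** (1.0 / len(center))
    return [c + r * d / norm for c, d in zip(center, direction)]

def local_search_step(individual, fit, g_best_pos, fitness, ls_radius=1.0, rng=random):
    """Generate one candidate around gBestMem and keep the fitter of old and new."""
    candidate = random_point_in_ball(g_best_pos, ls_radius, rng)
    f_cand = fitness(candidate)
    return (candidate, f_cand) if f_cand > fit else (individual, fit)
```

Calling `local_search_step` once per individual in each of the LSnum iterations after a detected change would correspond to the procedure described above.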


Algorithm 1. CellularDE
Initialize a cellular automaton with C^D equal-sized cells
Initialize population of size Pnum randomly in the cellular automaton
do
    for all cell_i in cellular automaton do
        Update cellBestMem_i according to eq. (3)
        Update gBestMem_i according to eq. (4)
        if there was a change in the past LSnum iterations then
            for all individual_m in cell_i do
                Set individual_m to a random position in a hyper-sphere with radius LSr centered at gBestMem_i
            end-for
        else
            Evolve active population in cell_i by DE using the mutant vector eq. (5)
        end-if
        while population of cell_i > θ
            Re-initialize the worst active individual to a random cell in the cellular automaton
        end-while
        if a change is detected then
            Re-evaluate cellBestMem_i
            Clear gBestMem_i
            for all individual_m in cell_i do
                Re-evaluate individual_m
            end-for
        end-if
    end-for
until a termination condition is met
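The capacity check in the while-loop of Algorithm 1 can be sketched as below. This is a simplified illustration: each surplus individual is given a fresh uniformly random position in the search box, which places it in a random cell, while the one-cell-at-a-time migration and the inactive state described in Section 3.3 are omitted. The function name `enforce_capacity` and the data layout (a dict from cell index to a list of (position, fitness) pairs) are assumptions made for the example.

```python
import random

def enforce_capacity(cells, theta, lower, upper, dims, rng=random):
    """Trim every cell to at most `theta` members and return new random positions.

    cells: dict mapping a cell index to a list of (position, fitness) pairs.
    The worst members of an overfull cell are removed; one fresh position is
    returned for each removed member so the caller can re-insert it.
    """
    relocated = []
    for members in cells.values():
        members.sort(key=lambda m: m[1], reverse=True)   # fittest first
        while len(members) > theta:
            members.pop()                                # drop the current worst
            relocated.append([rng.uniform(lower, upper) for _ in range(dims)])
    return relocated
```

The caller would evaluate the returned positions and add them to the cells they fall into before the next iteration.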

4 Experiments
4.1 Moving Peaks Benchmark
The Moving Peaks Benchmark (MPB) [21-22] is widely used in the literature to evaluate the performance of optimization algorithms in dynamic environments [10]. This benchmark defines several moving peaks in a multi-dimensional space, in which the height, width, and position of the peaks can be changed periodically. The default parameter setting of MPB, known as scenario II, is presented in Table 1. In MPB, the shift length s is the radius of peak movement after an environment change, m is the number of peaks, and f is the frequency of the changes in the environment, measured in number of fitness evaluations. H and W denote the ranges of the height and width of the peaks, which change according to the height severity and the width severity, respectively. I is the initial height of the peaks. Parameter A denotes the range of the search space in all dimensions.

Table 1. Parameters of Moving Peaks Benchmark (MPB) for scenario II

Parameter                 Value    Parameter                  Value
Number of peaks (m)       10       Shift length (s)           1.0
Change frequency (f)      5000     Number of dimensions (D)   5
Height severity           0.7      A                          [0, 100]
Width severity            0.1      H                          [30, 70]
I                         50       W                          [1, 12]

In order to measure the performance of an algorithm, the offline error is used, which is defined as the average fitness error of the best position found by the algorithm at every point in time:

$$\text{Offline Error} = \frac{1}{FEs} \sum_{t=1}^{FEs} \big( f(\mathrm{bestSolution}(t)) - f(\mathrm{globalOptimum}(t)) \big) \tag{6}$$

where FEs is the maximum number of fitness evaluations, and bestSolution(t) and globalOptimum(t) are the best position found by the algorithm and the global optimum at the t-th fitness evaluation, respectively.
4.2 Experimental Settings
The results of the proposed algorithm are compared with DynDE [7], cellular PSO [16-17], FMSO [18] and mQSO [19-20]. To the best of our knowledge, DynDE is the best-performing DE algorithm introduced for dynamic environments, and cellular PSO [16-17] shares the idea of embedding a cellular automaton in the search space. FMSO [18] and mQSO [19-20] are recently introduced dynamic optimization algorithms that have shown good performance in dynamic environments [10, 20]. For all experiments, the parameters of FMSO, mQSO, DynDE, and cellular PSO are set to the values reported in [18], [19], [7] and [16-17], respectively.
The parameter values of CellularDE are selected so that the proposed algorithm performs best for scenario II of MPB, introduced above. In CellularDE, the cellular automaton partitions the search space into 10^5 cells, i.e. each of the 5 dimensions is divided into 10 partitions, to maximize the probability of cells containing one peak. Moreover, the Moore neighborhood with a radius of 2 cells is used in the cellular automaton, which is large enough to maintain the exploration of the sub-populations in a neighborhood and small enough to prevent the convergence of the population to a local optimum. The population size (Pnum) is set to 100 and the capacity of the cells (θ) is set to 10 individuals per cell. This way, each of the 10 peaks in an environment defined by scenario II of MPB can be equally exploited by 10 individuals, if the peaks are located in 10 different cells. The mutation (F) and crossover (Cr) parameters of DE are empirically set to 0.2 and 0.4, respectively. The number of local search iterations (LSnum) and the radius of the local search (LSr) are empirically set to 6 and 1.0, respectively.
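The offline error of eq. (6) is an average over all fitness evaluations, so it can be computed from two logged sequences. The sketch below is an illustration only: the function name `offline_error` and the idea of logging both values per evaluation are assumptions, and the absolute gap is used so that the result is non-negative whichever sign convention is chosen for the difference.

```python
def offline_error(best_fitness_per_eval, optimum_fitness_per_eval):
    """Average gap between the best-found fitness and the optimum's fitness.

    Both arguments hold one value per fitness evaluation t = 1..FEs: the fitness
    of the best position found so far and the fitness of the current global optimum.
    """
    gaps = [abs(best - opt)
            for best, opt in zip(best_fitness_per_eval, optimum_fitness_per_eval)]
    return sum(gaps) / len(gaps)
```

For instance, if the best-found fitness over three evaluations is 40, 45, 50 while the optimum stays at 50, the gaps are 10, 5, and 0, and their average is 5.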


4.3 Experimental Results
All experiments were performed for 500,000 fitness function evaluations. The average offline errors of the algorithms in 100 runs, with 95% confidence intervals, are given for various dynamic environments in Tables 2 to 4. For each environment, t-tests with a significance level of 0.05 have been applied and the result of the best performing algorithm(s) is printed in bold. When the offline errors of the best performing algorithms are not significantly different, all are printed in bold.

Table 2. Offline errors for different number of peaks (f = 1000)

m      CellularDE    DynDE         Cellular PSO   FMSO          mQSO
1      4.98±0.35     16.84±9.39    7.90±0.47      14.42±0.4     13.30±0.40
5      3.96±0.04     5.32±0.44     5.81±0.21      10.59±0.2     5.76±0.15
10     3.98±0.03     4.25±0.05     5.70±0.15      10.40±0.1     5.43±0.10
20     4.53±0.02     5.34±0.02     5.92±0.18      10.33±0.1     5.84±0.09
30     4.77±0.02     5.80±0.04     6.07±0.16      10.06±0.1     6.12±0.11
40     4.87±0.02     6.09±0.02     5.95±0.15      9.85±0.11     6.36±0.11
50     4.87±0.02     6.22±0.02     6.01±0.15      9.54±0.11     6.61±0.15
100    4.85±0.02     6.55±0.03     6.04±0.13      8.77±0.09     6.62±0.11
200    4.46±0.01     6.49±0.02     5.90±0.13      8.06±0.07     6.59±0.10

Table 3. Offline errors for different number of peaks (f = 2500)

m      CellularDE    DynDE         Cellular PSO   FMSO          mQSO
1      2.38±0.78     7.08±2.08     4.91±0.28      6.29±0.20     6.03±0.22
5      2.12±0.02     3.14±0.11     2.95±0.20      5.03±0.12     2.98±0.09
10     2.42±0.02     2.81±0.05     2.97±0.15      5.09±0.09     3.05±0.09
20     3.05±0.04     3.83±0.05     3.51±0.14      5.32±0.08     4.05±0.08
30     3.29±0.03     4.32±0.05     3.87±0.12      5.22±0.08     4.53±0.17
40     3.43±0.03     4.54±0.05     3.89±0.10      5.09±0.06     4.61±0.13
50     3.44±0.02     4.71±0.05     4.16±0.15      4.99±0.06     4.86±0.12
100    3.36±0.01     4.90±0.05     4.18±0.11      4.60±0.05     5.13±0.13
200    3.13±0.01     4.91±0.04     4.04±0.09      4.34±0.04     5.10±0.15

Table 4. Offline errors for different number of peaks (f = 5000)

m      CellularDE    DynDE         Cellular PSO   FMSO          mQSO
1      1.53±0.07     4.06±0.49     3.46±0.22      3.44±0.11     3.00±0.09
5      1.50±0.04     2.22±0.15     1.79±0.12      2.94±0.07     1.70±0.09
10     1.64±0.03     2.26±0.05     1.84±0.08      3.11±0.06     1.96±0.08
20     2.46±0.05     3.14±0.07     2.63±0.11      3.36±0.06     3.11±0.07
30     2.62±0.05     3.49±0.60     2.91±0.10      3.28±0.05     3.61±0.08
40     2.76±0.05     3.82±0.09     3.16±0.11      3.26±0.04     3.88±0.07
50     2.75±0.05     4.05±0.07     3.23±0.11      3.22±0.05     3.99±0.13
100    2.73±0.03     4.18±0.09     3.43±0.10      3.06±0.04     4.30±0.08
200    2.61±0.02     4.06±0.05     3.38±0.09      2.84±0.03     4.32±0.09

The results of the experiments show that CellularDE outperforms all other tested algorithms in all experiments except one case. Moreover, as depicted in Fig. 2 for an environment with 50 peaks, CellularDE can find better solutions after most environment changes. This is because CellularDE, by imposing a limit on the number of individuals in each cell, maintains the diversity of the individuals better than DynDE, thus performing a better exploration of the environment. In addition, CellularDE takes advantage of the local search capability of DE, which results in more powerful exploitation compared to cellular PSO. Moreover, as the number of peaks in the environment increases, the offline error of all algorithms, except CellularDE, increases significantly. This is because in DynDE, mQSO, and FMSO the number of sub-populations is limited, hence they can only scout a few peaks in the environment, whereas CellularDE is able to cover many peaks simultaneously. Furthermore, the binomial crossover operation in CellularDE brings about a good exploration in a neighborhood, which leads to outperforming cellular PSO in the environments with many peaks.

[Figure: error (logarithmic axis, 10^0 to 10^2) versus fitness evaluations (×10^4); curves show the current error and the offline error of CellularDE, DynDE, and Cellular PSO.]
Fig. 2. The current error and the offline error for a dynamic environment with 50 peaks, f=5000

5 Conclusion In this paper, we proposed CellularDE, a cellular automata based differential evolution algorithm to tackle dynamic optimization problems. In CellularDE a cellular automaton is embedded into the search space and partitions the search space into cells. By imposing a limit on the number of individuals searching in each cell, the proposed algorithm maintains a balance between exploration and exploitation in the environment. Moreover, individuals in each neighborhood search for a peak together using the local information exchange between cells. In addition, in order to track the local optima after a change occurred in the environment, a local random search is performed in the few iterations after the change is detected. Extensive experiments in various dynamic environments modeled by the moving peaks benchmarks were conducted to evaluate the performance of CellularDE. The results of the experiments show that CellularDE outperforms DynDE [7], cellular PSO [16-17], FMSO [18], and mQSO [19-20] in most tested environments which contain many peaks. In addition, it has been shown that CellularDE is more robust to the number of peaks in the environment than other tested algorithms.

References
1. Yang, S.: Genetic algorithms with memory- and elitism-based immigrants in dynamic environments. Evolutionary Computation 16, 385–416 (2008)
2. Grefenstette, J.J.: Genetic algorithms for changing environments. In: Männer, R., Manderick, B. (eds.) Proc. of the 2nd Int. Conf. on Parallel Problem Solving from Nature, pp. 137–144 (1992)
3. Goldberg, D., Smith, R.: Nonstationary function optimization using genetic algorithms with dominance and diploidy, pp. 59–68. L. Erlbaum Associates Inc., Mahwah (1987)
4. Yang, S.: Non-stationary problem optimization using the primal-dual genetic algorithm, vol. 3, pp. 2246–2253. IEEE, Los Alamitos (2004)
5. Cobb, H.G., Grefenstette, J.J.: Genetic Algorithms for Tracking Changing Environments. In: Proceedings of the 5th International Conference on Genetic Algorithms, June 01, pp. 523–530 (1993)
6. Morrison, R., De Jong, K.: Triggered hypermutation revisited, vol. 2, pp. 1025–1032. IEEE, Los Alamitos (2002)
7. Mendes, R., Mohais, A.: DynDE: a differential evolution for dynamic optimization problems, vol. 3, pp. 2808–2815. IEEE, Los Alamitos (2005)
8. Parrott, D., Li, X.: Locating and tracking multiple dynamic optima by a particle swarm model using speciation. IEEE Transactions on Evolutionary Computation 10, 440–458 (2006)
9. Jin, Y., Branke, J.: Evolutionary optimization in uncertain environments-a survey. IEEE Transactions on Evolutionary Computation 9, 303–317 (2005)
10. Moser, I., Chiong, R.: Dynamic function optimisation with hybridised extremal dynamics. Memetic Computing 2, 137–148 (2010)
11. Storn, R., Price, K.: Differential evolution–a simple and efficient heuristic for global optimization over continuous spaces. Journal of Global Optimization 11, 341–359 (1997)
12. Price, K., Storn, R., Lampinen, J.: Differential evolution: a practical approach to global optimization. Springer, Heidelberg (2005)
13. du Plessis, M., Engelbrecht, A.: Improved differential evolution for dynamic optimization problems, pp. 229–234. IEEE, Los Alamitos (2008)
14. Brest, J., Zamuda, A., Boskovic, B., Maucec, M., Zumer, V.: Dynamic optimization using self-adaptive differential evolution, pp. 415–422. IEEE, Los Alamitos (2009)
15. Kanlikilicer, A., Keles, A., Uyar, A.: Experimental analysis of binary differential evolution in dynamic environments, pp. 2509–2514. ACM, New York (2007)
16. Hashemi, A.B., Meybodi, M.: A multi-role cellular PSO for dynamic environments, pp. 412–417. IEEE, Los Alamitos (2009), http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5349615
17. Hashemi, A.B., Meybodi, M.: Cellular PSO: A PSO for dynamic environments. In: Cai, Z., Li, Z., Kang, Z., Liu, Y. (eds.) ISICA 2009. LNCS, vol. 5821, pp. 422–433. Springer, Heidelberg (2009)
18. Li, C., Yang, S.: Fast multi-swarm optimization for dynamic optimization problems, vol. 7, pp. 624–628. IEEE, Los Alamitos (2008)
19. Blackwell, T., Branke, J.: Multiswarms, exclusion, and anti-convergence in dynamic environments. IEEE Transactions on Evolutionary Computation 10, 459–472 (2006)
20. Blackwell, T.: Particle swarm optimization in dynamic environments. In: Evolutionary Computation in Dynamic and Uncertain Environments, pp. 29–49 (2007)
21. Branke, J.: Memory enhanced evolutionary algorithms for changing optimization problems, vol. 3. IEEE, Los Alamitos (2002)
22. Branke, J.: Evolutionary optimization in dynamic environments. Kluwer Academic Publishers, Norwell (2001)
23. Fredkin, E.: An informational process based on reversible universal cellular automata. Physica D: Nonlinear Phenomena 45, 254–270 (1990)

Optimization of Topological Active Nets with Differential Evolution
Jorge Novo, José Santos, and Manuel G. Penedo
Computer Science Department, University of A Coruña, Spain
{jnovo,santos,mgpenedo}@udc.es

Abstract. The Topological Active Net model for image segmentation is a deformable model that integrates features of region-based and boundary-based segmentation techniques. The segmentation process turns into a minimization task of the energy functions which control the model deformation. We used Differential Evolution as an alternative evolutionary method that minimizes the decisions of the designer with respect to other evolutionary methods such as genetic algorithms. Moreover, we hybridized Differential Evolution with a greedy search to integrate the advantages of global and local searches at the same time that the segmentation speed is improved.
Keywords: Deformable contours, Genetic Algorithms, Differential Evolution, Image Segmentation.

1 Introduction and Previous Work

The active nets model for image segmentation was proposed by Tsumiyama and Yamamoto [1] as a variant of the deformable models [2] that integrates features of region-based and boundary-based segmentation techniques. To this end, active nets distinguish two kinds of nodes: internal nodes, related to the region-based information, and external nodes, related to the boundary-based information. The former model the inner topology of the objects whereas the latter fit the edges of the objects. The Topological Active Net (TAN) [3] model was developed as an extension of the original active net model. It solves some problems intrinsic to deformable models, such as the initialization problem. It also has a dynamic behavior that allows local topological changes in order to perform accurate adjustments and find all the objects of interest in the scene. The model deformation is controlled by energy functions in such a way that the mesh energy has a minimum when the model is over the objects of the scene. This way, the segmentation process turns into a minimization task.
There is very little work on the optimization of active models with evolutionary algorithms, especially genetic algorithms (GA), and it is mainly in edge or surface extraction [4][5] in 2D tasks. In the case of snakes (deformable contours), one of the first works which used genetic algorithms was the concept of "genetic snakes" by Ballerini [6]. MacEachern and Manku [7] were also among the first to apply GAs for optimizing snakes. Fan et al. [8] used a parallel GA to optimize active contours in magnetic resonance brain images. Ooi and Liatsis [9] developed a procedure to perform object tracking in real scenarios. Different subpopulations of the GA corresponded to different subcontours, and all of them evolved in a cooperative manner to achieve the best contour that segmented the object. Tanatipanond and Covavisaruch [10] also applied a GA to optimize snakes in brain MR images. They used a multiscale approach, beginning at coarser scales to extract rough contours. Then, the best deformable contours at this stage were the parent chromosomes at finer scales. Séguier and Cladel [5] used genetic snakes in a speech recognition application that integrated information from audio and the visual processing of the mouth. In their approach, there were two snakes that define the lip contours and converge in parallel. Tohka and Mykkänen [11] improved the results of deformable surface meshes with a dual contour method in PET (positron emission tomography) brain images. Tohka [12] also proposed a hybrid approach where a GA globally minimized the energy of a deformable surface mesh. The minimum obtained was further strengthened by a greedy algorithm. The GA roughly detected the target objects and then the greedy search was used for precision.
In a previous work [13] we proved the superiority of a global search method by means of a GA in the optimization of the TAN model. The results showed that the GA is less sensitive to noise, energy parameters or the mesh size than local search methods. In this paper, we used Differential Evolution (DE) [14][15] as an alternative evolutionary method. Moreover, we hybridized DE with a greedy method, so that we can join the advantages of the global and local search methods.
This paper is organized as follows: Section 2 introduces the basis of the TAN model. Section 3 briefly explains the DE used in the model optimization. In Section 4 representative examples are included to show the capabilities of the different approaches. Finally, Section 5 expounds the conclusions.
A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part I, LNCS 6593, pp. 350–360, 2011. © Springer-Verlag Berlin Heidelberg 2011

2 Brief Description of Topological Active Nets

A Topological Active Net (TAN) is a discrete implementation of an elastic n-dimensional mesh with interrelated nodes [3]. The model has two kinds of nodes: internal and external. Each kind of node represents different features of the objects: the external nodes fit their edges whereas the internal nodes model their internal topology. As with other deformable models, its state is governed by an energy function, with the distinction between the internal and external energy. The internal energy controls the shape and the structure of the net whereas the external energy represents the external forces which govern the adjustment process. These energies are composed of several terms and in all cases the aim is their minimization.
Internal energy terms. The internal energy depends on first and second order derivatives which control contraction and bending, respectively. The internal energy term is defined through the following equation for each node:

$$E_{int}(v(r,s)) = \alpha \left( |v_r(r,s)|^2 + |v_s(r,s)|^2 \right) + \beta \left( |v_{rr}(r,s)|^2 + |v_{rs}(r,s)|^2 + |v_{ss}(r,s)|^2 \right) \tag{1}$$


where the subscripts represent partial derivatives, and α and β are coefficients that control the first and second order smoothness of the net. The first and second derivatives are estimated using the finite differences technique.
External energy terms. The external energy represents the features of the scene that guide the adjustment process:

$$E_{ext}(v(r,s)) = \omega f[I(v(r,s))] + \frac{\rho}{|\aleph(r,s)|} \sum_{p \in \aleph(r,s)} \frac{1}{\|v(r,s) - v(p)\|} f[I(v(p))] \tag{2}$$

where ω and ρ are weights, I(v(r, s)) is the intensity of the original image in the position v(r, s), ℵ(r, s) is the neighborhood of the node (r, s) and f is a function, which is different for both types of nodes since the external nodes must fit the edges whereas the internal nodes model the inner features of the objects. If the objects to detect are bright and the background is dark, the energy of an internal node will be minimum when it is on a point with a high grey level. Also, the energy of an external node will be minimum when it is on a discontinuity and on a dark point outside the object. Given these circumstances, the function f is defined as:

$$f[I(v(r,s))] = \begin{cases} IO_i(v(r,s)) + \tau\, IOD_i(v(r,s)) & \text{for internal nodes} \\ IO_e(v(r,s)) + \tau\, IOD_e(v(r,s)) + \xi \left( G_{max} - G(v(r,s)) \right) + \delta\, GD(v(r,s)) & \text{for external nodes} \end{cases} \tag{3}$$

where τ, ξ and δ are weighting terms, Gmax and G(v(r, s)) are the maximum gradient and the gradient of the input image at node position v(r, s), I(v(r, s)) is the intensity of the input image at node position v(r, s), IO is a term called "In-Out", IOD is a term called "distance In-Out", and GD(v(r, s)) is a gradient distance term. The IO term minimizes the energy of individuals with the external nodes on background intensity values and the internal nodes on object intensity values, whereas the IOD terms act as a gradient: for the internal nodes (IODi) the value decreases towards brighter values of the image, whereas for the external nodes (IODe) it decreases towards low values (background).
The optimizations with a greedy algorithm [3] and with a genetic algorithm [13] consider a global energy as the sum of the different terms, weighted with the parameters described above. The adjustment process consists of minimizing these energy functions. In the case of the greedy algorithm, the mesh is placed over the whole image and, in each step, the energy of each node is computed in its current position and in its nearest neighborhood. The position with the lowest energy value is selected as the new position of the node. The algorithm stops when there is no node in the mesh that can move to a position with lower energy.
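The greedy adjustment described above can be sketched as a simple loop. This is an illustration under assumptions rather than the TAN implementation: nodes are integer pixel coordinates, the "nearest neighborhood" is taken to be the 8-connected neighborhood of a pixel, and `energy` is a hypothetical callback that returns the energy of one node at a candidate position given the current mesh. A step limit is added as a safeguard.

```python
def greedy_step(nodes, energy):
    """Move each node to the lowest-energy position among itself and its 8 neighbors.

    Returns True if at least one node moved during this pass.
    """
    moved = False
    for idx, (x, y) in enumerate(nodes):
        candidates = [(x + dx, y + dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)]
        best = min(candidates, key=lambda p: energy(idx, p, nodes))
        # Move only on a strict improvement so that ties leave the node in place.
        if energy(idx, best, nodes) < energy(idx, (x, y), nodes):
            nodes[idx] = best
            moved = True
    return moved

def greedy_minimize(nodes, energy, max_steps=1000):
    """Repeat greedy passes until no node can move to a lower-energy position."""
    for _ in range(max_steps):
        if not greedy_step(nodes, energy):
            break
    return nodes
```

With a toy energy such as the squared distance of a node to a target pixel, a single node walks to that pixel and the loop then stops, which mirrors the stopping rule given in the text.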

3 Differential Evolution

Diﬀerential Evolution (DE) [14][15] is a population-based search method. DE creates new candidate solutions by combining existing ones according to a simple

Optimization of Topological Active Nets with Diﬀerential Evolution

353

1. Initialize all individuals x with random positions in the search space 2. Until a termination criterion is met, repeat the following: For each individual x in the population do: 2.1 Pick three random individuals x1 ,x2 ,x3 from the population they must be distinct from each other and from individual x. 2.2 Pick a random index R1, ..., n, where the highest possible value n is the dimensionality of the problem to be optimized. 2.3 Compute the individual’s potentially new position y = [y1 , ..., yn ] by iterating over each i1, ..., n as follows: 2.3.1 Pick ri U (0, 1) uniformly from the open range (0,1). 2.3.2 If (i = R) or (ri < CR) let yi = x1 + F (x2 − x3 ), otherwise let yi = xi . 2.4 If (f (y) < f (x)) then replace the individual x in the population with the improved candidate solution, that is, set x = y in the population. 3. Pick the agent from the population that has the lowest fitness and return it as the best found candidate solution. Fig. 1. Diﬀerential Evolution Algorithm

formulae of vector crossover and mutation, and then keeping whichever candidate solution has the best score or ﬁtness on the optimization problem at hand. The central idea of the algorithm is the use of diﬀerence vectors for generating perturbations in a population of vectors. This algorithm is specially suited for optimization problems where possible solutions are deﬁned by a real-valued vector. The basic DE algorithm is summarized in the pseudo-code of Figure 1. One of the reasons why Diﬀerential Evolution is an interesting method in many optimization or search problems is the reduced number of parameters that are needed to deﬁne its implementation. The parameters are F or diﬀerential weight and CR or crossover probability. The weight factor F (usually in [0, 2]) is applied over the vector resulting from the diﬀerence between pairs of vectors (x2 and x3 ). CR is the probability of crossing over a given vector (individual) of the population (x1 ) and a vector created from the weighted diﬀerence of two vectors (F (x2 − x3 )), to generate the candidate solution or individual’s potentially new position y. Finally, the index R guarantees that at least one of the parameters (genes) will be changed in such generation of the candidate solution. The main problem of the genetic algorithm (GA) methodology is the need of tuning of a series of parameters: probabilities of diﬀerent genetic operators such as crossover or mutation, decision of the selection operator (tournament, roulette,), tournament size. Hence, in a standard GA it is diﬃcult to control the balance between exploration and exploitation. On the contrary, DE reduces the parameters tuning and provides an automatic balance in the search. As Feoktistov [16] indicates, the fundamental idea of the algorithm is to adapt the step length (F (x2 − x3 )) intrinsically along the evolutionary process. At the beginning of generations the step length is large, because individuals are far

354

J. Novo, J. Santos, and M.G. Penedo

away from each other. As the evolution goes on, the population converges and the step length becomes smaller and smaller. In our application each individual encodes a TAN. The genotypes code the Cartesian coordinates of the TAN nodes. If a component of a mutant vector (candidate solution) goes beyond its limits, then the component is set to the bound limit. In this application it means that, in order to avoid crossings in the net structure, each node coordinate cannot go beyond the limits established by its neighbors. The neighboring nodes set the boundaries of the area where a node can move as a result of the DE formula for each candidate solution. So, in a given direction, the new coordinate of a node cannot exceed the nearest coordinate of its 3 neighboring nodes in that direction. Moreover, the usual implementation of DE chooses the base vector x1 randomly or as the individual with the best fitness found up to the moment (xbest). To avoid the high selective pressure of the latter, the usual strategy is to alternate the two possibilities across generations. Instead of this, we used a tournament to pick the vector x1, which allows us to easily establish the selective pressure by means of the tournament size.
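As a rough illustration of step 2.3 of Figure 1 combined with the bound handling described above, the following C sketch builds one candidate vector. It is a sketch under stated assumptions, not the authors' implementation: the function name de_trial, the pre-drawn random values r, and the per-component bounds lo and hi (standing in for the limits imposed by the neighboring nodes) are hypothetical, and indices are 0-based rather than 1-based as in the figure.

```c
#include <assert.h>
#include <stddef.h>

/* Clamp v to the closed interval [lo, hi]. */
static double clamp(double v, double lo, double hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Build the candidate y from the target x and the vectors x1 (base), x2, x3.
   r[i] are uniform random values in (0, 1) drawn by the caller, R is the
   index that is always mutated, and lo[i]/hi[i] are hypothetical bounds for
   component i. Out-of-range components are set to the bound limit. */
void de_trial(const double *x, const double *x1, const double *x2,
              const double *x3, const double *r, size_t R,
              double F, double CR,
              const double *lo, const double *hi,
              size_t n, double *y)
{
    for (size_t i = 0; i < n; ++i) {
        if (i == R || r[i] < CR) {
            double v = x1[i] + F * (x2[i] - x3[i]);
            y[i] = clamp(v, lo[i], hi[i]);
        } else {
            y[i] = x[i];   /* component inherited from the target vector */
        }
    }
}
```

The candidate y then replaces x only if its energy is lower, as in step 2.4 of Figure 1. Passing the random draws in as arguments keeps the sketch independent of any particular random number generator.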

4

Results

We selected representative artificial and real CT images to show the capabilities and advantages of the DE approach and its hybridization with a greedy local search. All the processes used a population of 1000 individuals. The tournament size to select the base individual x1 in the DE runs was 3% of the population. We used a fixed value for the CR parameter (1.0), whereas we used a maximum value of 0.6 for the F parameter. In the different applications of the equation which determines a candidate solution (step 2.3.2 in Figure 1), we used a random value for F between 0.2 and that maximum value (for each node); these parameters were experimentally tuned to provide the best results in most of the images. This allows each node to move by a different amount, although in the direction imposed by the difference vector (x2 − x3), which makes it easier for each node to fall independently into its best location, such as the object boundaries in the case of the external nodes. This strategy provided us with the best results in all the images. Table 1 includes the TAN energy parameters used in the segmentation examples. They were experimentally set as the ones with which the corresponding evolutionary algorithm gave the best results for each kind of image.

4.1

Comparison with a Genetic Algorithm

Table 1. TAN parameter sets used in the segmentation processes of the examples

Figure   Size      α     β     ω      ρ     ξ     δ      τ
2        10 × 8    1.0   2.5   10.0   4.0   5.0   10.0   20.0
3 & 4    8 × 8     3.5   5.0   10.0   4.0   5.5   8.0    10.0
5        8 × 8     2.0   5.0   10.0   4.0   5.0   6.0    40.0
6(a)     10 × 10   5.0   1.0   10.0   4.0   1.0   20.0   10.0
6(b)     9 × 8     2.0   3.0   10.0   4.0   0.0   6.0    10.0
6(c)     10 × 10   3.0   5.0   1.0    4.0   0.0   5.0    30.0
6(d)     8 × 8     5.0   2.0   20.0   4.0   0.0   10.0   10.0

Fig. 2. Comparison of evolutionary methodologies. The four figures in each row correspond to the best individual at different generations, from the initial generation to the best found result at the end of the evolutionary process. 1st row, classic GA process, one evolutionary phase and initial random individual sizes. 2nd row, classic GA process, one evolutionary phase and initial large individual sizes. 3rd row, classic GA process, two evolutionary phases. 4th row, differential evolution process.

Fig. 3. Best individual fitness (energy) and average fitness of the population with the different evolutionary processes. The curves are an average of 20 different runs with different initial populations.

The genetic operators used with a GA are defined in [13]: arithmetic crossover, mutation of a node, mutation of a group of neighboring nodes, shift of a mesh and spread of a mesh. That work also defines two evolutionary phases that are necessary to obtain correct results in any image. The two phases are: a first one, whose aim is to produce a population of individuals that cover the object in the image, and a second one, with a different set of energy parameters, to refine the adjustment. So, the first phase provides a rough boundary detection while the second phase provides a better boundary segmentation and a better distribution of nodes. In the first phase, to obtain a rough boundary detection, the parameters were tuned giving high importance to the gradient distance energy term while the internal energy parameters took low values. These two phases are necessary because with only one phase it is easy to fall into local minima, since the internal energy parameters tend to compress the mesh. This can be seen in Figure 2, 1st row, where the best individuals only cover a part of the object. Meanwhile, the classic GA process
with the two phases, can first cover the object and then refine it into a correct final result, as shown in Figure 2, 3rd row. The problem of falling into local minima, as shown in Figure 2, 1st row, can be solved by initializing the population with individuals of a large minimum size. Thus, these individuals can cover the possible objects present in the scene and avoid collapsing onto parts of them. An example of this solution with the GA process with only one phase can be seen in Figure 2, 2nd row, where the individuals cover the entire object and, thanks to the mutation and group mutation operators, converge progressively to the object contour. However, this solution implies a very slow convergence, as we show later. Nevertheless, the same idea is useful with the proposed differential evolution process. Hence, we can tune only one set of energy parameters to be used with only one evolutionary phase. An example of the evolution of the best individual with DE is shown in Figure 2, 4th row. The proposed DE methodology simplified the genetic process because the set of genetic operators of the GA (crossover, mutation, group mutation, spread and shift) was avoided. Moreover, the convergence was faster. An example of this is shown in Figure 3. This figure compares the evolution of the best energy (fitness) and the average energy of the population over the generations using DE, a GA with only one phase and a GA with two evolutionary phases, using the initializations explained before. In this last case, the graph shows only the evolution of the second phase, from generation 100, as the energy of the first phase is not comparable because of the different energy parameters. Moreover, these fitness evolutions were the result of an average of 20 different evolutionary processes with different initial populations. This comparison was made by testing the different approaches on the object of Figure 4, using the same energy parameter set in all cases.
As can be seen, the convergence of the GA process with one phase (dashed lines) is the worst because the large initial individuals have to be progressively brought closer to the boundary of the object. As explained before, this is solved with the definition of the two different evolutionary phases in the GA process with different tasks (dotted lines). However, with the DE process, with only one phase and initial large individuals, we obtain a significantly faster convergence (solid lines). In Figure 4 we can see the best individual at different generations with all the mentioned processes: the best individual after the 1st generation (1st column), intermediate results in generations 50 and 150 (2nd and 3rd columns) and the final result (4th column). As can be seen in the intermediate results, the GA with only one phase and large initial individuals (2nd row) moves the external nodes slowly to the object boundary, a situation that is overcome by the GA with 2 phases (1st row). In this case, the first phase is focused on surrounding the contour of the object, so the external nodes are well placed in fewer generations. However, the DE approach, with only one phase and large initial individuals, can quickly obtain a population surrounding the object thanks to the way it produces new individuals, producing a faster correct segmentation with a correct distribution of nodes.

Fig. 4. Best individual across generations in different evolutionary processes. 1st row, classic GA process, two evolutionary phases. 2nd row, classic GA process, one evolutionary phase. 3rd row, differential evolution process.

4.2

Hybridization of Differential Evolution and Greedy Search

We combined DE with the greedy local search with two aims: to integrate the advantages of global and local searches, and to obtain faster segmentations. The hybrid approach followed a Lamarckian strategy, since the results of the greedy search are written back into the original genotypes used by DE. Hence, a number of greedy steps was applied to all the genotypes of the population used by the DE process. A greedy step implies the application of the greedy movements to all the nodes of the encoded TAN. The number of steps is small (chosen randomly between 0 and 4) to reduce the risk of falling into local minima. Moreover, the greedy algorithm is applied only in particular generations of the DE process.
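The schedule of the hybrid approach can be sketched as a small C helper. This is a minimal sketch under assumptions: the function name greedy_steps and its parameters are hypothetical, the raw random value rnd is supplied by the caller, and the "particular generations" are taken to be every period-th generation.

```c
#include <assert.h>
#include <stdint.h>

/* Number of greedy steps to apply to an individual at a given generation.
   Greedy steps are applied only every `period` generations (an assumption
   about how the particular generations are chosen); the count is drawn from
   0..max_steps using the caller-supplied random value rnd. */
int greedy_steps(int generation, int period, int max_steps, uint32_t rnd)
{
    if (period <= 0 || max_steps < 0 || generation % period != 0) {
        return 0;
    }
    return (int)(rnd % (uint32_t)(max_steps + 1));
}
```

With max_steps set to 4 this matches the range of 0 to 4 greedy steps stated above; in a Lamarckian scheme the node positions reached after those steps would be written back into the genotype.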


Fig. 5. (a) Best individual evolution comparing the DE approach with different hybrid approaches and the greedy method. (b) Best final result with the greedy method. (c) Best final result of the hybrid approach with greedy steps between 0 and 4.

Figure 5(a) shows different evolutions of the best individual over the generations using different configurations of the hybrid approach, compared with the DE and the greedy approach. In this case, the greedy steps were applied to the individuals every 10 generations. The graphs of fitness evolution are an average of 10 evolutionary runs of the corresponding algorithm with different initial populations. As the graph shows, the more greedy steps are used, the faster the energy minimization is, but the higher the predominance of the greedy minimization with respect to the DE minimization. So, we can use a hybrid combination with a relatively small number of greedy steps to speed up the process without penalizing the robustness of the DE methodology. Finally, we point out the poor segmentation provided by the greedy approach, which gave a result with several nodes stuck in the local noise present in the image used (Figure 5(b)). This is a CT image of the knee with external noise, where the internal nodes must escape from the internal knee bones. Meanwhile, the hybrid approach (with greedy steps between 0 and 4 in this case) was able to overcome these difficulties (Figure 5(c)). Regarding computing times, on an Intel Core 2 at 2.83 GHz, the evolution of the DE approach across the 250 generations (1000 individuals) required an average of 8 minutes, whereas the hybrid combination (with steps between 0 and 4) required an average time of 4 minutes to obtain the same fitness as the DE alternative (in generation 100). Figure 6 shows different representative segmentation examples with a high level of difficulty. In this case we used the hybrid method described above, applying a random number of greedy steps between 0 and 4 to all the individuals every 10 generations. The 1st and 2nd examples are images with uniform Gaussian noise added over an artificial image with a hole and a CT image of a foot, the 3rd one incorporates spots added as local noise and the 4th is an original CT image of the skull with real noise produced by the scanner in the bone contour. The greedy search was not able to segment these images, whereas the hybrid method overcame the noise problems, providing correct segmentations.

Fig. 6. Best final results obtained using the hybridized DE in noisy images. (a),(b) Uniform Gaussian noise added. (c) Spots added as noise. (d) CT noisy image.

5

Conclusions

We applied DE to the optimization of the TAN model for image segmentation. The DE approach allowed a simpler implementation, as the decisions of the designer were reduced to the experimental tuning of the two main defining parameters of DE. The main operator of DE was able to cope with the necessary changes in the topological net structure, and with only one evolutionary phase in comparison with the GA alternative. Moreover, the hybrid combination of DE and the greedy local search integrated the advantages of both searches: the global DE overcame the problems in noisy images whereas the greedy search improved the segmentation speed. So, the combination avoided falling into local minima, providing correct segmentations in representative images with different levels of difficulty.

Acknowledgments. This paper has been funded by the Ministry of Science and Innovation of Spain (project TIN2007-64330) and by the Instituto de Salud Carlos III (grant contract PI08/90420) using FEDER funds.

References

1. Tsumiyama, K.S.Y., Yamamoto, K.: Active net: Active net model for region extraction. IPSJ SIG notes 89(96), 1–8 (1989)
2. Kass, M., Witkin, A., Terzopoulos, D.: Snakes: Active contour models. International Journal of Computer Vision 1(2), 321–323 (1988)
3. Ansia, F.M., Penedo, M.G., Mariño, C., Mosquera, A.: A new approach to active nets. Pattern Recognition and Image Analysis 2, 76–77 (1999)
4. Ballerini, L.: Medical image segmentation using genetic snakes. In: Proceedings of SPIE: Application and Science of Neural Networks, Fuzzy Systems, and Evolutionary Computation II, vol. 3812, pp. 13–23 (1999)
5. Séguier, R., Cladel, N.: Genetic snakes: Application on lipreading. In: International Conference on Artificial Neural Networks and Genetic Algorithms (2003)
6. Ballerini, L.: Genetic snakes: Active contour models by genetic algorithms. In: Cagnoni, S., Lutton, E., Olague, G. (eds.) Genetic and Evolutionary Computation in Image Processing and Computer Vision. EURASIP Book Series on SP & C, pp. 177–194 (2007)
7. MacEachern, L.A., Manku, T.: Genetic algorithms for active contour optimization. In: Proc. IEEE International Symposium on Circuits and Systems, vol. 4, pp. 229–232 (1998)
8. Fan, Y., Jiang, T.Z., Evans, D.J.: Volumetric segmentation of brain images using parallel genetic algorithms. IEEE Tran. on Medical Imaging 21(8), 904–909 (2002)
9. Ooi, C., Liatsis, P.: Co-evolutionary-based active contour models in tracking of moving obstacles. In: International Conference on Advanced Driver Assistance Systems, pp. 58–62 (2001)
10. Tanatipanond, T., Covavisaruch, N.: A multiscale approach to deformable contour for brain MR images by genetic algorithm. In: The Third Annual National Symposium on Computational Science and Engineering, pp. 306–315 (1999)
11. Tohka, J., Mykkänen, J.M.: Deformable mesh for automated surface extraction from noisy images. Int. J. Image Graphics 4(3), 405–432 (2004)
12. Tohka, J.: Global optimization of deformable surface meshes based on genetic algorithms. In: Proceedings ICIAP, pp. 459–464 (2001)
13. Ibáñez, O., Barreira, N., Santos, J., Penedo, M.G.: Genetic approaches for topological active nets optimization. Pattern Recognition 42, 907–917 (2009)
14. Price, K.V., Storn, R.M.: Differential evolution - a simple and efficient heuristic for global optimization over continuous spaces. Journal of Global Optimization 11(4), 341–359 (1997)
15. Price, K.V., Storn, R.M., Lampinen, J.A.: Differential Evolution. A Practical Approach to Global Optimization. Natural Computing Series. Springer, Heidelberg (2005)
16. Feoktistov, V.: Differential Evolution: In Search of Solutions. Springer, New York (2006)

Study on the Effects of Pseudorandom Generation Quality on the Performance of Differential Evolution

Ville Tirronen, Sami Äyrämö, and Matthieu Weber

University of Jyväskylä, Mattilanniemi 2, 40100 Jyväskylä, Finland
{ville.tirronen,sami.ayramo,matthieu.weber}@jyu.fi

Abstract. Experiences in the field of Monte Carlo methods indicate that the quality of a random number generator is exceedingly significant for obtaining good results. This result has not been demonstrated in the field of evolutionary optimization, and many practitioners of the field assume that the choice of the generator is superfluous and fail to document this aspect of their algorithm. In this paper, we demonstrate empirically that the requirement of a high quality generator does not hold in the case of Differential Evolution.

Keywords: Differential Evolution, Pseudorandom number generation, Optimization.

1

Introduction

Most publications that detail a new evolutionary algorithm include the following sentence: "The random number generator rand(0,1) returns a uniformly distributed value in the range [0, 1)." In this work we try to look deeper into this mysterious rand(0,1) and the pseudorandom number generators (hereafter abbreviated as PRNGs) producing these values. It is widely known that Monte Carlo simulations require high quality random numbers. In [3], for instance, it was found that otherwise well-behaving generators fail in Monte Carlo studies. Besides the obvious algorithmic qualities, there are important differences between the PRNGs, such as computational cost, ease of use in parallel computations [6], and the different programming strategies that can be used. These are important factors in building embedded systems and applying high performance computing systems, such as computational clusters or GPU processing, which results in a need to know if their empirical performance is stable. The issue of random number quality is largely ignored in the practical work of evolutionary algorithm design and testing, often by using the most elaborate algorithm available. However, studies done with Genetic Algorithms (GA) do not show correlation between high quality of the PRNG and high quality optimization results. In [15], Meysenburg and Foster demonstrate that driving a standard GA with twelve different PRNGs has no significant statistical effect on the performance of the GA on many common test functions. In the follow-up research [16] much more sophisticated testing took place, but instead of finding a positive correlation between the respective qualities of the PRNG and the GA, the study uncovered the fact that in some cases a very poor PRNG achieved better results than more sophisticated ones. A similar conclusion has also been reached in [2], where the effect of PRNGs on different components of a GA is studied. The study concludes that, though there is no correlation between the quality of the PRNG and the GA runs, the initial sampling is most affected by different PRNGs. In this work, we perform a similar study by driving a more modern optimization method, Differential Evolution [18] (DE), by various PRNGs in an attempt to confirm the previous studies in the context of a different method and real-valued optimization.

In this paper we present evidence that strongly indicates that, counterintuitively, most PRNGs perform equally in optimization. This is done by empirically evaluating different pseudorandom number generators (PRNGs) in the context of Differential Evolution. The result is an important discovery, since it is consistent with the assumption on which the majority of evolutionary optimization papers are based. More precisely, we study whether

- the generator quality affects the speed of evolution (no),
- the generator quality affects the quality of the final solution (no),
- the use of inferior generators can cause a higher incidence of stagnation or premature convergence (no),
- computation can be saved by using a simpler generator (yes, about 15%).

A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part I, LNCS 6593, pp. 361–370, 2011. © Springer-Verlag Berlin Heidelberg 2011

2

Pseudorandom Number Generation in Evolutionary Methods

Today, the most used random number generators seem to be either simple generators, like Linear Congruential Generators (LCG) and shift-register generators, or complex generators, such as the Mersenne Twister. Requirements for a proper PRNG include properties like uniformity of the random number distribution, independence of subsequences, the length of the period, repeatability, portability, disjointness of subsequences generated for several machines, and efficiency [1]. Use of a proper PRNG is considered to be vital for many modern applications such as Monte Carlo simulations and encryption schemes [11], and many standard PRNGs, such as the venerable LCG, are not up to this task. For example, Marsaglia shows that all the linear congruential generators generalize poorly to high-dimensional problems [10]. Moreover, modern computing environments often require generators that have specific properties, such as the ability to run them in parallel, which is not a given even for otherwise excellent generators [6]. The tests used for evaluating PRNGs have grown more stringent and show spectacular differences in quality between the generators. The testing procedures
range from simulating some process for which a result is known, such as playing a game of craps in the DIEHARD tests by G. Marsaglia, to complex spectral and theoretical tests (for a survey, see [8]). Trends like the above cast serious doubt over the field of evolutionary optimization (EO). EO methods are often likened to Monte Carlo simulations. Moreover, the basic testing procedure for the algorithms is also a Monte Carlo simulation. In this light, the choice of the PRNG seems especially relevant.

3

Empirical Framework

Differential evolution is a modern real-valued optimization algorithm with characteristic versatility and reliability. DE can be characterized as a population-based evolutionary method with a steady state logic. In this paper we use a variant called DE/rand/1/exp that is used as the benchmark case in [9] after substantial tuning efforts. DE is a steady state algorithm updating a population of candidate vectors $P^i = \{p^i_1, p^i_2, p^i_3, \ldots, p^i_n\}$ according to equation (1). First, candidate vectors are derived by applying the scaled difference between two random points $p^i_b$ and $p^i_c$ to a third random point $p^i_a$, and part of the vector resulting from this operation is used to update the individual we are seeking to replace:

$$
o_j = \mathrm{Crossover}\left(p^i_j,\; p^i_a + F\,(p^i_b - p^i_c)\right). \qquad (1)
$$

The crossover operator samples a random value from the exponential distribution with parameter λ = Cr and swaps that many consecutive elements from the candidate vector to the original vector, starting from a uniformly picked index and wrapping around the end of the vector if necessary. Finally, the original individual is replaced if the resulting vector $o_j$ is better than the original vector $p^i_j$. The main novelty of DE is in its way of generating the candidate offspring by sampling direction vectors contained in the population, as depicted in Figure 1 and formalized in equation (1). This allows the algorithm to employ the shape of the population to guide the search over the fitness landscape by exploiting promising exploration directions.

Fig. 1. Differential evolution schematic

To study the effect of PRNGs, we construct, using different PRNGs, several variants of the Differential Evolution algorithm. We consider the six generators in the following list. An implementation in C is given for the first five and omitted, due to length, for the Mersenne Twister. Reference implementations for MT are available from the homepage of the original author, M. Matsumoto.

1. RANDU
     static uint32_t randu_x;  /* seed value, set before use */
     uint32_t RANDU(void) {
       return (randu_x = ((65539 * randu_x + 362437) % 2147483648));
     }

2. Linear Congruential Generator (LCG) [7]
     static uint32_t seed;  /* seed value, set before use */
     uint32_t cong(void) { return (seed = 69069 * seed + 362437); }

3. Xorshift (XOR) [12]
     static uint32_t x, y, z, w, v;  /* x, y, z, w, v are seed values */
     uint32_t xorshift(void) {
       uint32_t t = (x ^ (x >> 7));
       x = y; y = z; z = w; w = v;
       v = (v ^ (v << 6)) ^ (t ^ (t << 13));
       return (y + y + 1) * v;
     }

4. Multiply with Carry Generator (MWC-256) [13]
     static uint32_t c = 262436;
     static uint32_t seed[256];  /* 256 random 32-bit integers */
     uint32_t MWC256(void) {
       uint64_t t, a = 809430660LL;
       static uint8_t i = 255;  /* 8-bit index wraps around after 255 */
       t = a * seed[++i] + c;
       c = (t >> 32);
       return (seed[i] = t);
     }

5. Multiply with Carry Generator (MWC-4096) [13]
     static uint32_t c = 362436;
     static uint32_t seed[4096];  /* 4096 random 32-bit integers */
     uint32_t MWC4096(void) {
       uint64_t t, a = 18782LL;
       static uint32_t i = 4095;
       uint32_t x, r = 0xfffffffe;
       i = (i + 1) & 4095;
       t = a * seed[i] + c;
       c = (t >> 32);
       x = t + c;
       if (x

The infamous Park-Miller style generator, also known as 'RANDU', has been in use since the 1960s and is often used as an example of a bad PRNG (see [10]), while the rest of the generators have a long history of use and are available in the standard libraries of various languages. The complexity of the generators ranges from the very simple LCG to the extremely complicated Mersenne Twister. The expense of complexity ranges from 32 bits of state for the LCG to 19937 bits for the Mersenne Twister. In this paper we use the test benchmark devised for the Special Issue of Soft Computing: A Fusion of Foundations, Methodologies and Applications on Scalability of Evolutionary Algorithms and other Metaheuristics for Large Scale Continuous Optimization Problems [9]. This test benchmark contains a well-thought-out set of high dimensional benchmarking functions along with well-tuned DE parameters for this set of functions. We use relatively high dimensions in order to reveal the effects of random variable interdependencies. We also expect that the large number of dimensions could amplify the differences between the PRNGs, if any are found. In contrast, we also provide a short evaluation of the same benchmark in the more realistic 10 dimensions. In each test, we perform 100 simulations for each DE variant. Both the initial population and the subsequent runs are generated with the same PRNG. Both visual and statistical evaluations of the end result sets are presented. The visual evaluation is done with sparkline histograms (for more details, see [19]). The visualisation is done by selecting a suitable interesting range of values containing the optimal value found and plotting the values in this range as a histogram (Table 2). Values outside the range (i.e., those not good enough for the comparison) are placed into the rightmost bin of the histogram and separated from the rest by a typographical ellipsis.
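To make the link between a 32-bit generator and the rand(0,1) of the introduction concrete, the following C sketch wraps the LCG recurrence with the constants 69069 and 362437 quoted in this paper. It is a sketch under assumptions rather than the authors' code: the function names lcg_next and lcg_uniform are hypothetical, the state is passed explicitly, and dividing by 2^32 is just one common way to map the output to [0, 1).

```c
#include <assert.h>
#include <stdint.h>

/* Advance the LCG state and return the new 32-bit value.
   Unsigned 32-bit arithmetic wraps modulo 2^32, which supplies the modulus. */
uint32_t lcg_next(uint32_t *state)
{
    *state = 69069u * (*state) + 362437u;
    return *state;
}

/* Map the next 32-bit output to a double in [0, 1) by dividing by 2^32.
   The largest possible result is (2^32 - 1) / 2^32, which is below 1. */
double lcg_uniform(uint32_t *state)
{
    return lcg_next(state) / 4294967296.0;
}
```

Swapping lcg_next for another 32-bit generator leaves lcg_uniform unchanged, which is the kind of substitution needed to build DE variants that differ only in their PRNG.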
To augment the inspection of the end results we also give examples of convergence graphs in Figure 3, where all runs except RANDU are basically overlapping. A small number of examples suffices, as the rest of the result set is almost identical to the trend exhibited here. Subsequently, the Q-test (Q stands for Quality) described in [5] is applied to give a numerical representation of the speed of the algorithm in Table 1. For each test problem and each algorithm, the Q measure is computed as $Q = n_e / R$, where the robustness R is the percentage of successful runs, i.e., the runs reaching a given threshold, and $n_e$ is the number of fitness evaluations required to reach this threshold. It is clear that, for each test problem, the smallest value equals the best performance in terms of convergence speed. The value "∞" means that R = 0, i.e., the algorithm never reached the threshold. Further, the analysis of the runtimes of the algorithms is presented in Figure 2, where we present relative kernel density estimates for the runtime of each algorithm. These results are based on bootstrapping of a calibrated sample taken with a modern desktop computer. As always, the runtime estimates are subject to some noise resulting from the intricacies of modern operating systems and the differences between various computer architectures.
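The Q measure can be computed with a few lines of C. The sketch below rests on assumptions not fixed by the text: the function name q_measure is hypothetical, n_e is taken as the mean number of fitness evaluations over the successful runs, and R is expressed in percent.

```c
#include <assert.h>
#include <math.h>

/* Q measure: mean number of fitness evaluations needed to reach the
   threshold, divided by the robustness R (percentage of successful runs).
   Returns INFINITY when no run reached the threshold (R = 0). */
double q_measure(double mean_evals, int successful_runs, int total_runs)
{
    if (total_runs <= 0 || successful_runs <= 0) {
        return INFINITY;
    }
    double robustness = 100.0 * successful_runs / total_runs;  /* R in percent */
    return mean_evals / robustness;
}
```

For example, 50 successful runs out of 100 with a mean of 1000 evaluations give R = 50 and Q = 20, while an algorithm that never reaches the threshold gets an infinite Q, matching the "∞" convention described above.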

Fig. 2. Kernel density estimate of runtimes of various PRNGs (curves: LCG, Xorshift, MWC256, Mersenne Twister, MWC4096; horizontal axis: runtime, 145 ns to 185 ns)

4

Numerical Results and Analysis

Table 2 shows the distributions of the results of the algorithm variants over all the functions in 100 dimensions. The results of the 100d test cases show more variance than the 10d cases. This variance arises when DE is applied to a function for which it is either unable to find the optimum or has insufficient computational resources to do so. For our purpose, however, this case is more enlightening, as it allows us to compare non-converged algorithm states. The results in Figure 3 show that, apart from RANDU, all generators produce mostly similar results. For the easy functions F1–F6 the algorithm converges to a near optimum with all the other PRNGs. This result contradicts the similar result on GA given in [17], where it was found that RANDU is mostly stable in GA. Curiously enough, functions F2 and F6 converge even with RANDU, which further emphasises that DE seems rather indifferent towards the PRNG.

For the more difficult functions F7 through F19 we observe non-converged end states. With these functions RANDU is never competitive, and there are some slight differences between the other PRNGs as well. For example, on F15 and F3 the variant with Mersenne Twister obtains the overall best runs, MWC-4096 does the same on F8, and Xorshift on F11. However, there is no statistically significant difference between the distributions of the algorithms, and the effect size is negligible. Importantly, we can conclude that the simplest real generator, the LCG, is practically equivalent to the modern Mersenne Twister. Furthermore, since the initial points were generated with the same PRNG, we can also conclude that whatever non-uniformity the generators have is not enough to affect DE's performance.

The convergence speed is studied with the Q-Test in Table 1, which shows minor differences between the PRNGs. Apart from RANDU and Xorshift, each PRNG performs slightly better than the others in some cases. No trend in the performance can be seen, and since the effect size is quite small, the differences must be attributed to the overall stochastic nature of the algorithm.

Runtime distribution estimates of the various PRNGs are plotted in Figure 2. The figure shows a Gaussian kernel density estimate based on a bootstrapped [4] runtime sample. The median runtimes range from 152 ns for the simple LCG to 177 ns for the slowest generator, MWC-4096, a performance difference of approximately 15%. Considering the number of pseudorandom numbers generated in a single run, or worse, a batch of runs by DE,

Study on the Eﬀects of Pseudorandom Generation Quality


Table 1. Q-Test results for 100 dimensional test case

Function  MWC-256   MWC-4096  MT        RANDU     Xorshift  LCG
F1        1.22e+02  1.22e+02  1.23e+02  ∞         1.22e+02  1.22e+02
F2        1.80e+00  1.80e+00  1.80e+00  1.80e+00  1.80e+00  1.80e+00
F3        7.04e+01  7.00e+01  6.93e+01  ∞         6.96e+01  6.97e+01
F4        4.18e+02  4.25e+02  4.21e+02  ∞         4.19e+02  4.20e+02
F5        1.02e+02  1.01e+02  1.02e+02  ∞         1.02e+02  1.01e+02
F6        1.80e+00  1.80e+00  1.80e+00  1.80e+00  1.80e+00  1.80e+00
F7        1.80e+00  1.80e+00  1.80e+00  2.21e+00  1.80e+00  1.80e+00
F8        1.80e+00  1.80e+00  1.80e+00  1.16e+01  1.80e+00  1.80e+00
F9        9.18e+02  9.19e+02  9.19e+02  ∞         9.17e+02  9.17e+02
F10       1.74e+02  1.73e+02  1.73e+02  ∞         1.74e+02  1.73e+02
F11       9.22e+02  9.20e+02  9.26e+02  ∞         9.25e+02  9.20e+02
F12       1.00e+02  1.01e+02  1.01e+02  ∞         1.01e+02  1.02e+02
F13       5.76e+01  5.79e+01  5.74e+01  ∞         5.78e+01  5.70e+01
F14       1.33e+03  1.32e+03  1.33e+03  ∞         1.34e+03  1.33e+03
F15       1.86e+00  2.11e+00  1.88e+00  1.82e+00  1.97e+00  1.92e+00
F16       7.24e+01  7.07e+01  7.12e+01  ∞         7.20e+01  7.13e+01
F17       3.05e+01  3.15e+01  2.90e+01  ∞         3.08e+01  3.04e+01
F18       1.07e+03  1.07e+03  1.06e+03  ∞         1.07e+03  1.05e+03
F19       2.11e+00  2.33e+00  2.47e+00  1.80e+00  2.12e+00  2.29e+00

Fig. 3. Examples of convergence: fitness against the number of fitness evaluations (0 to 300000) for MWC-256, MWC-4096, LCG, MT, RANDU and Xorshift. Panels: (a) F14 in 10D, (b) F17 in 10D, (c) F14 in 100D, (d) F17 in 100D. (Note that the lines overlap in the figure.)

this is quite a significant amount. In simple cases, such as in this article, this means that with a simple PRNG the algorithm can perform 10% longer runs, which has a more significant impact than the difference in PRNG quality.
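The runtime comparison above rests on a Gaussian kernel density estimate over a bootstrapped sample. The following is only a minimal sketch of that idea, under stated assumptions: the text does not say which statistic was bootstrapped, so resampled medians are assumed here, and the function names, the bandwidth and the runtime values are all hypothetical.

```python
import math
import random
import statistics


def bootstrap_medians(sample, n_resamples=200, seed=0):
    """Medians of resamples drawn with replacement from `sample`."""
    rng = random.Random(seed)
    n = len(sample)
    return [statistics.median(rng.choices(sample, k=n))
            for _ in range(n_resamples)]


def gaussian_kde(points, bandwidth):
    """Return a function x -> Gaussian kernel density estimate at x."""
    norm = 1.0 / (len(points) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2)
                          for p in points)

    return density


def relative_difference(fast, slow):
    """How much longer `slow` takes than `fast`, as a fraction of `fast`."""
    return slow / fast - 1.0


# Hypothetical per-call runtimes in nanoseconds (made-up illustration data).
runtimes_ns = [150.0, 151.0, 152.0, 152.0, 153.0, 154.0, 156.0, 160.0]
medians = bootstrap_medians(runtimes_ns)
density = gaussian_kde(medians, bandwidth=1.0)

# Medians reported in the text: 152 ns (LCG) and 177 ns (MWC-4096).
print(f"relative difference: {relative_difference(152.0, 177.0):.1%}")
```

Only the last line uses figures from the text. The ratio of the two reported medians is in line with the difference of approximately 15% quoted above.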

Table 2. Sparkline histograms for algorithm results in 100d. For each function F1–F19 the table gives the minimum and maximum of the results and one sparkline histogram per PRNG (MWC-256, MWC-4096, LCG, MT, RANDU, Xorshift).

V. Tirronen, S. Äyrämö, and M. Weber

5 Conclusions and Further Work

This paper presents strong empirical evidence that it is safe to assume nearly identical performance from most pseudorandom number generators in the context of Differential Evolution. This result, however, calls for further studies with different algorithms and casts some doubt on the stochastic nature of Differential Evolution itself. Furthermore, even though the optimization algorithm itself is not affected by the choice of PRNG, the method of repeated tests and statistical analysis does require independent initializations of the algorithm and is often run in parallel, which places considerable demands on the PRNG.

References

1. Brent, R.P.: Uniform random number generators for supercomputers. In: Proceedings Fifth Australian Supercomputer Conference, pp. 95–104 (1992)
2. Cantú-Paz, E.: On random numbers and the performance of genetic algorithms. In: Langdon, W.B., Cantú-Paz, E., Mathias, K.E., Roy, R., Davis, D., Poli, R., Balakrishnan, K., Honavar, V., Rudolph, G., Wegener, J., Bull, L., Potter, M.A., Schultz, A.C., Miller, J.F., Burke, E.K., Jonoska, N. (eds.) GECCO 2002: Proceedings of the Genetic and Evolutionary Computation Conference, pp. 311–318 (2002)
3. Coddington, P.: Analysis of random number generators using Monte Carlo simulation. International Journal of Modern Physics C 5(3), 547–560 (1994)
4. Davison, A., Hinkley, D.: Bootstrap methods and their application. Cambridge University Press, Cambridge (1997)
5. Feoktistov, V.: Differential evolution: in search of solutions. Springer, Inc., New York (2006)
6. Hellekalek, P.: Don't trust parallel Monte Carlo! In: Proceedings of Twelfth Workshop on Parallel and Distributed Simulation, PADS 1998, pp. 82–89. IEEE, Los Alamitos (1998)
7. Knuth, D.E.: The Art of Computer Programming. Seminumerical Algorithms, vol. 2. Addison-Wesley, Reading (1997)
8. L'Ecuyer, P.: Testing random number generators (1992)
9. Lozano, M., Herrera, F.: Special issue of Soft Computing: A fusion of foundations, methodologies and applications on scalability of evolutionary algorithms and other metaheuristics for large scale continuous optimization problems (2009), Web document http://sci2s.ugr.es/eamhco/CFP.php
10. Marsaglia, G.: Random numbers fall mainly in the planes. Proceedings of the National Academy of Sciences of the United States of America 61(1), 25 (1968)
11. Marsaglia, G.: A current view of random number generators. In: Computer Science and Statistics, Sixteenth Symposium on the Interface, pp. 3–10 (1985)
12. Marsaglia, G.: Xorshift RNGs. Journal of Statistical Software 8(14), 1–6 (2003)
13. Marsaglia, G., Zaman, A.: A new class of random number generators. The Annals of Applied Probability 1(3), 462–480 (1991)
14. Matsumoto, M., Nishimura, T.: Mersenne twister: a 623-dimensionally equidistributed uniform pseudo-random number generator. ACM Transactions on Modeling and Computer Simulation (TOMACS) 8(1), 3–30 (1998)
15. Meysenburg, M., Foster, J.: The quality of pseudo-random number generators and simple genetic algorithm performance. In: Proceedings of the Seventh International Conference on Genetic Algorithms, pp. 276–282 (1997)
16. Meysenburg, M., Foster, J.: Randomness and GA performance, revisited. In: Proceedings of the Genetic and Evolutionary Computation Conference, vol. 1, pp. 425–432. Morgan Kaufmann Publishers, San Francisco (1999)
17. Meysenburg, M., Hoelting, D., McElvain, D., Foster, J.: How random generator quality impacts genetic algorithm performance. In: Langdon, W.B., Cantú-Paz, E., Mathias, K., et al. (eds.) Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2002), pp. 480–483. Citeseer (2002)
18. Storn, R., Price, K.: Differential evolution – a simple and efficient heuristic for global optimization over continuous spaces. Journal of Global Optimization 11(4), 341–359 (1997)
19. Tirronen, V., Weber, M.: Sparkline histograms for comparing evolutionary optimization methods. In: Proceedings of International Conference on Evolutionary Computation 2010. Springer, Heidelberg (2010)

Sensitiveness of Evolutionary Algorithms to the Random Number Generator

Miguel Cárdenas-Montes1, Miguel A. Vega-Rodríguez2, and Antonio Gómez-Iglesias3

1 Centro de Investigaciones Energéticas Medioambientales y Tecnológicas, Department of Fundamental Research, Madrid, Spain [email protected]
2 University of Extremadura, ARCO Research Group, Dept. Technologies of Computers and Communications, Cáceres, Spain [email protected]
3 Centro de Investigaciones Energéticas Medioambientales y Tecnológicas, National Laboratory of Fusion, Madrid, Spain [email protected]

Abstract. This article presents an empirical study of the impact of changing the Random Number Generator on the performance of four Evolutionary Algorithms: Particle Swarm Optimisation, Differential Evolution, Genetic Algorithm and Firefly Algorithm. Random Number Generators are a key component in the production of e-science, including optimisation problems solved by Evolutionary Algorithms. A Random Number Generator therefore ought to be selected carefully, taking into account its quality. Analysing how a change of Random Number Generator affects the performance of an evolutionary algorithm requires a very large production of simulated data, as well as statistical techniques to extract relevant information from the large data sets. To support this production, a grid computing infrastructure has been employed. In this study, the most frequently employed high-quality Random Number Generators and Evolutionary Algorithms are coupled in order to cover the widest portfolio of cases. As a consequence of this study, an evaluation of the impact of different Random Number Generators on the final performance of the Evolutionary Algorithms is presented.

Keywords: Performance Analysis, Evolutionary Algorithm, Random Number Generator.

A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part I, LNCS 6593, pp. 371–380, 2011. © Springer-Verlag Berlin Heidelberg 2011

1 Introduction

Numerous papers are published every year on optimisation problems solved with Evolutionary Algorithms (EAs), which implement diverse high-quality Random Number Generators (RNGs). However, it is difficult to ascertain the role played by the RNGs in the final results. The intent of this paper is to determine the role of the RNG in the final performance, and to assess whether this element has any impact on the results obtained.

EA techniques rely heavily on the use of RNGs. From the generation of the initial population through the specific canonical operators applied to create each new temporary population, the use of randomness is pervasive in EAs. Therefore, it is reasonable to wonder how RNG quality affects EA performance. In this paper two RNGs (Mersenne Twister and GCC rand()) have been used to test their impact on the final performance of four EAs (Particle Swarm Optimisation, Differential Evolution, Genetic Algorithm, and Firefly Algorithm). The two RNGs have been selected based on the following two criteria:

– The RNG has to be frequently used in research papers.
– The RNG has to be considered a high-quality RNG.

In order to test our hypothesis about the impact of the RNG on the performance of these four EAs, a large number of experiments have to be executed to obtain sufficient statistics. As a consequence, statistical techniques are strictly necessary for the analysis of these large data sets. Due to the large volume of data to analyse and the need to extract relevant information or statements about some parameters, the most suitable statistical technique for this work is statistical inference. Statistical inference allows deciding whether to accept or to reject statements about some parameters of the data sets. In this paper, this technique is put into practice to decide whether or not the use of a particular RNG affects the final performance of a particular EA.

Grid computing has emerged as a powerful paradigm in e-Science, providing researchers with an immense volume of computational resources distributed across diverse institutions. This paradigm has proved able to cover the requirements of many scientific communities [1]. The computing capabilities delivered by this paradigm have increased the generation of new science. For this reason, a grid computing platform has been selected for the present work in order to provide the resources necessary to produce the large data sets required. In particular, the grid computing platform of the Spanish National Grid Initiative has been used to support the large-scale production.

This paper is organised as follows: Section 2 summarises the related work and previous efforts. Section 3 gives a summary of the Random Number Generators used in the research. Section 4 briefly describes the Evolutionary Algorithms tested in this article. The underpinnings of the statistical inference are presented in Section 5. Section 6 gives the details of the implementation and the production setup. The results and their analysis are presented in Section 7. Finally, the conclusions and future work are presented in Section 8.

2 Related Work

Several works have examined diverse aspects of the impact of the RNG on the final performance of EAs. However, there are relevant differences between these previous works and the present study. The first works in this area [2], [3] examined the impact of the RNG choice on the performance of a Genetic Algorithm (GA). In those works the RNG drove a simple GA, applied to a collection of several well-known GA test functions. These studies found no statistical evidence of any impact of RNG quality on GA performance.


However, the coarse-grained statistical measure employed calls the conclusions attained into question. In a further study using finer statistics [4], no correlation was found between good results on the RNG tests (the Diehard suite) and good performance of the GA. The Diehard tests are a set of statistical tests for measuring the quality, that is the randomness, of an RNG.

Another paper [5] studied the sensitiveness of the GA to the choice of RNG, focusing on the components that are most affected by the RNG. That work presents an ablation experiment using two RNGs and true random numbers from an atmospheric noise source. The experiments showed that the RNG used to initialise the population had a critical impact on the final performance, whereas the RNG used as input to the other operations (crossover and mutation) did not affect the performance significantly.

3 Random Number Generator

It may seem to be a conceptual impossibility to produce "random" numbers with computers, which are completely deterministic machines. Nevertheless, Random Number Generators (RNGs) are in common use, especially in science and engineering disciplines. Sometimes the term pseudorandom is used for computer-generated sequences, while the word random is reserved for intrinsically random physical processes. Such fine distinctions are not made in this paper.

In general, the random number sequences produced by an RNG ought to be uniform, uncorrelated and of extremely long period. It is clear that for optimum performance and accuracy of an EA, the RNG used ought to have these good properties. Some of the most used tests to examine the quality of RNGs are:

Uniformity test: Break up the interval between zero and one into a large number of small bins and, after generating a large number of random numbers, check for uniformity in the number of entries in each bin.
Overlapping M-tuple test: Check the statistical properties of the number of times that M-tuples of digits appear in the sequence of random numbers.
Parking lot test: Plot points in an m-dimensional space where the m coordinates of each point are determined by m successive calls to the RNG. Then look for regular structures.

3.1 The Random Number Generators Tested

The two RNGs tested in this paper are Mersenne Twister and GCC RAND. Both RNGs fit the criteria previously stated: they are of high quality and widely used in optimization problems with EAs.

Mersenne Twister RNG. The Mersenne Twister RNG was developed in 1997 by Matsumoto and Nishimura [6]. It is a high-quality RNG designed specifically to rectify many of the flaws found in older RNGs. Its name derives from the fact that its period length is chosen to be a Mersenne prime number. The main features of the Mersenne Twister RNG are:

– It has a very long period of $2^{19937} - 1 \approx 10^{6000}$. While a long period is not a guarantee of quality in a random number generator, short periods, common in many software packages, are problematic.
– It is k-distributed to 32-bit accuracy for every 1 ≤ k ≤ 623 (Overlapping M-tuple test).
– It passes numerous tests for statistical randomness, including the Diehard tests. It passes most, but not all, of the even more stringent TestU01 Crush randomness tests.

GCC RAND RNG. In C and C++, the default RNG is the rand() function. Many platforms have poor-quality versions of the rand() function; however, GNU platforms (glibc) implement a version of higher quality that is broadly accepted as a good-quality RNG. The main features of the GCC RNG are:

– The glibc implementation belongs to the category of Linear Congruential Generators [7].
– It has a period of $2^{31} - 1$. This period is generally accepted as long, but it is clearly shorter than that of the Mersenne Twister RNG.

Despite the difference in period between both RNGs, the GNU rand() is accepted as a good-quality RNG and is used in many scientific and technical works. The question is whether these differences affect the final results of an optimization process based on EAs. The intent of this work is to answer this question. In our implementations, the seed of the RNG is reinitialised for each new execution in order to keep the study as fair as possible.
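The uniformity test and the linear congruential scheme described in this section can be illustrated with a short sketch. It is an illustration under assumptions, not the code used in the study: the LCG constants are the textbook ANSI C example values and are not claimed to be those of glibc's rand(), the chi-square statistic is one common way to "check for uniformity" of the bin counts, and all names are hypothetical. The Mersenne Twister side relies on the fact that CPython's `random` module uses that generator.

```python
import random


class LCG:
    """Linear congruential generator x <- (a*x + c) mod m.

    The constants are the textbook ANSI C example values; they are an
    assumption for illustration, not necessarily those of glibc's rand().
    """

    def __init__(self, seed, a=1103515245, c=12345, m=2**31):
        self.a, self.c, self.m = a, c, m
        self.state = seed % m

    def random(self):
        """Advance the state and return a float in [0, 1)."""
        self.state = (self.a * self.state + self.c) % self.m
        return self.state / self.m


def uniformity_chi2(next_float, n_samples=100_000, n_bins=100):
    """Chi-square statistic of bin counts against a uniform expectation."""
    counts = [0] * n_bins
    for _ in range(n_samples):
        # min() guards against a value that rounds up to exactly 1.0
        counts[min(int(next_float() * n_bins), n_bins - 1)] += 1
    expected = n_samples / n_bins
    return sum((c - expected) ** 2 / expected for c in counts)


# Reseed for every execution, as the text describes for its own runs.
mt = random.Random(12345)   # CPython's random module uses Mersenne Twister
lcg = LCG(seed=12345)

chi2_mt = uniformity_chi2(mt.random)
chi2_lcg = uniformity_chi2(lcg.random)
print(f"chi-square, Mersenne Twister: {chi2_mt:.1f}")
print(f"chi-square, LCG:              {chi2_lcg:.1f}")
```

With 100 bins, a generator that fills the bins uniformly should give a statistic of the order of the number of bins. A degenerate source that always returns the same value gives a statistic far above that.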

4 Evolutionary Algorithms

In this paper, four EAs have been tested with respect to the change of RNG. These EAs are employed in the community to optimise a huge number of problems. The selected EAs are: Particle Swarm Optimisation (PSO) [8], [9], [10], Differential Evolution (DE) [11], [12], Genetic Algorithm (GA) [13], [14], and Firefly Algorithm (FA) [15], [16]. They have been selected because they are highly representative and widely employed in the community. A deeper description of the EAs tested can be found in the referenced articles, since it is beyond the scope of the present paper.

In all EAs used in this work, the population structure is panmictic. Thus, the operations intrinsic to each EA (mutation, reproduction, selection, replacement, etc.) take place globally over the whole population. Furthermore, in all cases the EAs follow a generational model, in which a whole new population of individuals replaces the old one [17].
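To make the panmictic, generational structure concrete, the sketch below shows one way a Differential Evolution loop can draw all of its randomness from a single injected generator. It is an assumption-laden illustration rather than the implementation used in the study: the DE/rand/1/bin variant, the parameters F = 0.5 and CR = 0.9, the sphere objective and every function name are choices made here, not details given in the text.

```python
import random


def sphere(x):
    """Simple test objective: sum of squares, minimum 0 at the origin."""
    return sum(v * v for v in x)


def de_generation(pop, fitness, objective, rng, f=0.5, cr=0.9):
    """One generational step of DE/rand/1/bin over a panmictic population."""
    dim = len(pop[0])
    new_pop, new_fit = [], []
    for i, target in enumerate(pop):
        # Any three distinct individuals other than i may be chosen:
        # the whole population is one mating pool (panmictic).
        a, b, c = rng.sample([j for j in range(len(pop)) if j != i], 3)
        forced = rng.randrange(dim)  # at least one coordinate is mutated
        trial = [
            pop[a][k] + f * (pop[b][k] - pop[c][k])
            if (k == forced or rng.random() < cr) else target[k]
            for k in range(dim)
        ]
        trial_fit = objective(trial)
        # Greedy selection; results go into the *new* population only.
        if trial_fit <= fitness[i]:
            new_pop.append(trial)
            new_fit.append(trial_fit)
        else:
            new_pop.append(target)
            new_fit.append(fitness[i])
    return new_pop, new_fit


def run_de(rng, dim=5, pop_size=20, generations=50, low=-5.12, high=5.12):
    """Run DE on the sphere function; all randomness comes from `rng`."""
    pop = [[rng.uniform(low, high) for _ in range(dim)]
           for _ in range(pop_size)]
    fit = [sphere(x) for x in pop]
    initial_best = min(fit)
    for _ in range(generations):
        pop, fit = de_generation(pop, fit, sphere, rng)
    return initial_best, min(fit)


initial_best, final_best = run_de(random.Random(1))
print(initial_best, final_best)
```

Because initialisation, parent choice and crossover all go through `rng`, swapping in another generator with the same interface changes every random decision of the run. That is the kind of substitution the study performs.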

5 Statistical Inference

Many problems in engineering require deciding whether to accept or to reject a statement about some parameters. The statement is a hypothesis, and the decision-making procedure about the hypothesis is called hypothesis testing [18]. This is one of the most useful aspects of statistical inference, and it is employed in this paper to decide whether the use of a particular RNG affects the algorithm performance.

Statistical hypothesis testing is a fundamental method used at the data analysis stage of a comparative experiment. For this comparison, two kinds of tests can be used: parametric and non-parametric. The main difference between parametric and non-parametric tests lies in the assumption of a distribution underlying the sample data. Given that non-parametric tests do not require explicit conditions on the underlying sample data, they are recommended when the statistical model of the data is unknown [19].

The hypothesis-testing procedure relies on the information in the samples. If this information is consistent with the hypothesis, it can be concluded that the hypothesis is true; however, if this information is inconsistent with the hypothesis, it can be concluded that the hypothesis is false. The null hypothesis ($H_0$) is the hypothesis to be tested. Rejection of the null hypothesis always leads to accepting the alternative hypothesis ($H_1$). In our case, the null hypothesis is $H_0: \mu_1 - \mu_2 = \Delta_0 = 0$, or $H_0: \mu_1 = \mu_2$, where $\mu_1$ and $\mu_2$ are the means of the minima of each case (each RNG). The alternative hypothesis is $H_1: \mu_1 \neq \mu_2$.

5.1 Non-parametric Statistical Inference

When the conditions for the safe use of parametric tests (independence of events, normality of the underlying distribution and homoscedasticity of distributions [20], [21]) are not met, non-parametric tests are the recommended statistical analysis. The Wilcoxon signed-rank test belongs to the category of non-parametric tests. It is a pairwise test that aims to detect significant differences between two sample means [19], [22]; in our study, that is the behaviour of an EA with respect to the RNG.

The Wilcoxon signed-rank test works as follows. Let $d_i$ be the difference between the performance scores of the two RNGs on the $i$-th of $N_{ds}$ data sets. The differences are ranked according to their absolute values; average ranks are assigned in case of ties. Let $R^+$ be the sum of ranks for the data sets on which the first RNG outperformed the second, and $R^-$ the sum of ranks for the opposite. Ranks of $d_i = 0$ are split evenly among the sums; if there is an odd number of them, one is ignored. The two rank sums are given in Eq. 1:

$$R^+ = \sum_{d_i > 0} \operatorname{rank}(d_i) + \frac{1}{2} \sum_{d_i = 0} \operatorname{rank}(d_i), \qquad R^- = \sum_{d_i < 0} \operatorname{rank}(d_i) + \frac{1}{2} \sum_{d_i = 0} \operatorname{rank}(d_i) \qquad (1)$$

Let $T$ be the smaller of the sums, $T = \min(R^+, R^-)$. If $T$ is less than or equal to the critical value of the Wilcoxon distribution for $N_{ds}$ degrees of freedom, the null hypothesis of equality of means is rejected. For a significance level of α = 0.05, the acceptance or rejection of the null hypothesis depends on whether the p-value produced by the Wilcoxon signed-rank test is higher or lower than α.
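The rank sums of Eq. 1 can be computed directly from a list of paired differences. The sketch below is a plain-Python illustration with a hypothetical function name. It reads "one is ignored" as dropping one zero difference before ranking, which is an interpretation of the description above, and it stops at $T$ without computing a p-value.

```python
def signed_rank_sums(diffs):
    """Return (R_plus, R_minus, T) for a list of paired differences."""
    diffs = list(diffs)
    zeros = [d for d in diffs if d == 0]
    nonzero = [d for d in diffs if d != 0]
    # With an odd number of zero differences, one of them is ignored.
    if len(zeros) % 2 == 1:
        zeros.pop()
    values = zeros + nonzero

    # Rank by absolute value; tied values share the average of their ranks.
    order = sorted(range(len(values)), key=lambda i: abs(values[i]))
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and abs(values[order[end + 1]]) == abs(values[order[start]])):
            end += 1
        average = (start + end) / 2.0 + 1.0   # ranks are 1-based
        for pos in range(start, end + 1):
            ranks[order[pos]] = average
        start = end + 1

    # Ranks of zero differences are split evenly between the two sums.
    zero_share = 0.5 * sum(r for v, r in zip(values, ranks) if v == 0)
    r_plus = sum(r for v, r in zip(values, ranks) if v > 0) + zero_share
    r_minus = sum(r for v, r in zip(values, ranks) if v < 0) + zero_share
    return r_plus, r_minus, min(r_plus, r_minus)


# Differences between the minima found with two RNGs (made-up values).
print(signed_rank_sums([1.0, -2.0, 3.0, -4.0, 5.0]))
```

For the five made-up differences, the positive ranks are 1, 3 and 5 and the negative ranks are 2 and 4, so $T$ is the smaller sum of the two.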


6 Production Setup

The empirical study was conducted using a set of benchmark functions that are widely used in studies of this kind. The functions were selected so that the set contains a mixture of multimodal functions (f1, f2, f6 and f8) and monomodal functions (f3, f4, f5, f7, f9, f10 and f11). The benchmark functions selected are presented in Table 1, and their characteristics in Table 2. These functions have been extracted from or inspired by the CEC 2010 and CEC 2008 Special Sessions and Competitions on Large-Scale Global Optimization [23], [24], as well as other papers benchmarking EAs.

Table 1. Benchmark functions used in the paper

Function  Expression    Optimum
f1    $\sum_{i=1}^{D} \left[\sin(x_i) + \sin\left(\frac{2 \cdot x_i}{3}\right)\right]$    $-1.21598 \cdot D$
f2    $\sum_{i=1}^{D-1} \left[\sin(x_i \cdot x_{i+1}) + \sin\left(\frac{2 \cdot x_i \cdot x_{i+1}}{3}\right)\right]$    $-2 \cdot D + 2$
f3    $\sum_{i=1}^{D} \left[(x_i + 0.5)^2\right]$    $0$
f4    $\sum_{i=1}^{D} \left[(x_i)^2 - 10 \cdot \cos(2\pi x_i) + 10\right]$    $0$
f5    $\sum_{i=1}^{D} \left[(x_i)^2\right]$    $0$
f6    $\sum_{i=1}^{D} \left[x_i \cdot \sin(10 \cdot \pi \cdot x_i)\right]$    $-5 \cdot D$
f7    $20 + 20 \cdot \exp\left(-20 \cdot \exp\left(-0.2 \sqrt{\frac{\sum_{i=1}^{D} x_i^2}{D}}\right)\right) - \exp\left(\frac{\sum_{i=1}^{D} \cos(2\pi x_i)}{D}\right)$    $0$
f8    $418.9828 \cdot D - \sum_{i=1}^{D} \left[x_i \cdot \sin\left(\sqrt{|x_i|}\right)\right]$    $0$
f9    $\sum_{i=1}^{D-1} \left[100 \cdot (x_{i+1} - x_i^2)^2 + (x_i - 1)^2\right]$    $0$
f10   $\sum_{i=1}^{D} \left[i \cdot (x_i)^2\right]$    $0$
f11   $\sum_{i=1}^{D} \left[(x_i)^2\right] + \left[\sum_{i=1}^{D} \left(\frac{i}{2} \cdot x_i\right)\right]^2 + \left[\sum_{i=1}^{D} \left(\frac{i}{2} \cdot x_i\right)\right]^4$    $0$

Table 2. Characteristics of benchmark functions used in the paper

Function  Interval        Characteristics
f1        [3, 13]         Multimodal Separable
f2        [3, 13]         Multimodal Full-non-separable
f3        [-100, 100]     Monomodal Separable
f4        [-5.12, 5.12]   Monomodal Separable
f5        [-5.12, 5.12]   Monomodal Separable
f6        [-1, 2]         Multimodal Separable
f7        [-30, 30]       Monomodal Separable
f8        [-500, 500]     Multimodal Separable
f9        [-5.12, 5.12]   Monomodal Full-non-separable
f10       [-5.12, 5.12]   Monomodal Separable
f11       [-5.12, 5.12]   Monomodal Separable

The selected functions are as diverse as possible in order to cover different scenarios for the RNG and EA. The functions termed multimodal (Table 2) have several minima in the interval of the variables, whereas the functions termed monomodal have only one minimum.
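As an illustration of how such benchmark functions are evaluated, the sketch below implements five of the simpler expressions from Table 1. The Python function names simply mirror the table labels, and the descriptive names in the docstrings are informal labels added here, not terms used in the paper.

```python
import math


def f3(x):
    """Shifted sphere: sum of (x_i + 0.5)^2."""
    return sum((v + 0.5) ** 2 for v in x)


def f4(x):
    """Rastrigin-type: sum of x_i^2 - 10 cos(2 pi x_i) + 10."""
    return sum(v * v - 10.0 * math.cos(2.0 * math.pi * v) + 10.0 for v in x)


def f5(x):
    """Sphere: sum of x_i^2."""
    return sum(v * v for v in x)


def f9(x):
    """Rosenbrock-type: couples each variable with its successor."""
    return sum(100.0 * (x[i + 1] - x[i] ** 2) ** 2 + (x[i] - 1.0) ** 2
               for i in range(len(x) - 1))


def f10(x):
    """Weighted sphere: sum of i * x_i^2 with i counted from 1."""
    return sum(i * v * v for i, v in enumerate(x, start=1))


D = 100  # dimensionality used in the study
print(f3([-0.5] * D), f4([0.0] * D), f5([0.0] * D),
      f9([1.0] * D), f10([0.0] * D))
```

Each call evaluates the function at a point where every term of its sum vanishes, which is consistent with the optimum of 0 listed for these functions in Table 1.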


Concerning the separability of the variables, if the variables involved in a function are independent of each other, the function is termed separable. In this case, the problem can be easily solved by decomposing it into a number of sub-problems, each of which involves only one decision variable while treating all the others as constants. In contrast, the functions f2 and f9 are fully non-separable because no two of their variables are independent.

In order to obtain solid statistics, a total of $10^4$ tries of each benchmark function have been executed. In this production, the powerful machinery of the grid was used to support the computational activity. To manage the complexity of the problem, which involves diverse benchmark functions, each grid job was created with 500 tries of a configuration. This structure optimises the execution time in the grid environment. Several runs were executed to reach the desired statistical relevance.

Each job is composed of a shell script that handles the execution, and a tarball containing the source code (C++) of the program and the configuration files. When the job arrives at the Worker Node, it executes the instructions of the shell script: unpack the tarball, compile the source code, execute the 500 tries of the configuration for the benchmark function, and finally pack the result files into a tarball. When the job finishes, the results tarball is retrieved. Taking into account the total number of fitness functions, algorithms and runs, the whole production involved 1,760 jobs and 880,000 tries.

The configuration was selected in order to have the maximum number of calls to the corresponding RNG. Thus, the configuration selected involved 10,000 cycles, a population size of 100 particles/individuals and a dimensionality of 100 for all fitness functions. The use of this configuration, independently of the EA employed, involves at least $10^8$ calls to the RNG per execution, and $10^{12}$ calls per benchmark function.
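The production figures quoted above can be cross-checked with a few lines of arithmetic. The sketch assumes that the $10^4$ tries apply to every combination of benchmark function, EA and RNG, and that each cycle uses at least one random number per coordinate of each individual. Both are readings of the text rather than statements made in it, and the variable names are hypothetical.

```python
functions = 11          # f1 .. f11
algorithms = 4          # PSO, DE, GA, FA
generators = 2          # Mersenne Twister, GCC rand()
tries_per_case = 10**4  # tries per function, algorithm and generator
tries_per_job = 500

total_tries = functions * algorithms * generators * tries_per_case
total_jobs = total_tries // tries_per_job

cycles = 10_000
population = 100
dimensions = 100
# Lower bound: one random number per coordinate, individual and cycle.
calls_per_execution = cycles * population * dimensions
calls_per_function = calls_per_execution * tries_per_case

print(total_tries, total_jobs, calls_per_execution, calls_per_function)
```

Under these assumptions the totals agree with the 880,000 tries, 1,760 jobs, $10^8$ calls per execution and $10^{12}$ calls per benchmark function reported in the text.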

7 Results and Analysis

Table 3 presents the p-value of the Wilcoxon signed-rank test for each EA and fitness function. In our study, the analysis of the sensitiveness of the EAs is based on these values. The acceptance or rejection of the null hypothesis at a significance level of α = 0.05 depends on whether the p-value in Table 3 is higher (acceptance) or lower (rejection) than α.

As can be observed in Table 3, the null hypothesis ($H_0: \mu_1 = \mu_2$) can be rejected for DE in all cases in which the test can be performed. Hence, the DE algorithm is very sensitive to the choice of RNG, producing significantly different results depending on the RNG implemented. For f1 the p-value cannot be obtained because all results for both RNGs are equal, which makes it impossible to perform the Wilcoxon signed-rank test.

For PSO, the null hypothesis can be rejected in 5 tests (f1, f3, f6, f10 and f11), whereas in 6 cases it cannot be rejected. Consequently, PSO shows a low sensitiveness to the choice of the RNG. Furthermore, for FA the p-values obtained allow the null hypothesis to be rejected in 4 cases (f1, f5, f6 and f9) and not in the remaining 7, which is the lowest sensitiveness of the four EAs studied.

Table 3. p-value for non-parametric hypothesis testing for each EA and fitness function

Function  PSO       DE        GA        FA
f1        1.79e-06  nan       8.31e-11  0.034
f2        0.572     0.0       0.0       0.574
f3        3.40e-06  0.0       0.292     0.885
f4        0.783     0.0       0.0       0.834
f5        0.816     0.0       2.78e-10  5.89e-06
f6        0.0001    2.62e-12  0.0       0.013
f7        0.222     0.0       0.0       0.558
f8        0.130     0.0       0.0       0.522
f9        0.640     0.0       0.003     0.014
f10       0.013     0.0       0.646     0.355
f11       0.0002    0.0       0.002     0.135

Between these two extreme cases, DE and FA, the p-values obtained for GA allow the null hypothesis to be rejected in 9 cases; only in 2 cases (f3 and f10) can it not be rejected.

With regard to the main focus of this work, most of the cases analysed indicate that the use of a particular RNG affects the final performance. This effect is more marked in DE, where the null hypothesis is rejected in all cases, than in PSO, where it can be rejected in only 5 cases, or FA, where it can be rejected in only 4 cases. The analysis of these results allows a scale of sensitiveness to be built for the EAs tested: DE > GA > PSO > FA.

Based on the non-parametric analysis performed in this section, it can be concluded that the choice of the RNG affects the final performance of the EAs tested, although the level of sensitiveness differs between them. Unfortunately, this analysis does not allow any conclusion to be drawn about which RNG gives the best performance for each EA.
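The decision rule and the resulting scale of sensitiveness can be reproduced from the p-values of Table 3. The sketch below uses hypothetical names and treats the undefined p-value (nan) of DE on f1 as a case in which no test can be made, which is an assumption about how to count it.

```python
import math

ALPHA = 0.05
NAN = float("nan")

# p-values from Table 3, one list per EA, ordered f1 .. f11.
P_VALUES = {
    "PSO": [1.79e-06, 0.572, 3.40e-06, 0.783, 0.816, 0.0001,
            0.222, 0.130, 0.640, 0.013, 0.0002],
    "DE": [NAN, 0.0, 0.0, 0.0, 0.0, 2.62e-12, 0.0, 0.0, 0.0, 0.0, 0.0],
    "GA": [8.31e-11, 0.0, 0.292, 0.0, 2.78e-10, 0.0, 0.0, 0.0,
           0.003, 0.646, 0.002],
    "FA": [0.034, 0.574, 0.885, 0.834, 5.89e-06, 0.013, 0.558,
           0.522, 0.014, 0.355, 0.135],
}


def rejected_functions(p_values, alpha=ALPHA):
    """Names of the functions whose null hypothesis is rejected."""
    return [f"f{i}" for i, p in enumerate(p_values, start=1)
            if not math.isnan(p) and p < alpha]


rejections = {ea: rejected_functions(p) for ea, p in P_VALUES.items()}
ranking = sorted(rejections, key=lambda ea: len(rejections[ea]), reverse=True)

for ea in ranking:
    print(ea, len(rejections[ea]), rejections[ea])
```

Ordering the EAs by the number of rejections gives the same scale as the text, with DE first and FA last.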

8 Conclusions and Future Work

This paper presents an empirical study of the impact of changing the RNG on the performance of four EAs: PSO, DE, GA and FA. For the analysis of this impact, a large-scale production as well as statistical techniques have been necessary. The tests evaluated show that each EA has a very different sensitivity to the change of RNG. Some EAs, namely DE and GA, have proved to be extremely sensitive to this change, whereas PSO and FA have shown a very low sensitivity. However, for each EA tested, at least one function obtains a significantly different performance according to the Wilcoxon signed-rank test. More comparative work and further studies should be carried out to provide a more detailed analysis and refinement. Future work with other EAs and RNGs should be performed to observe whether those EAs are also sensitive to the change of RNG. From the research that has been carried out, it can be concluded that the choice of a particular RNG could have an important impact on the final performance of EAs.


Specifically, the EAs tested have displayed very different degrees of sensitiveness to the change of RNG.

Acknowledgement

The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) via the project EUFORIA under grant agreement number 211804, and the project EGI-InSPIRE under grant agreement number RI-261323. The authors thank the Spanish Network for e-Science (CAC-2007-52) for its support in using the NGI resources.

References 1. Kesselman, C., Foster, I.: The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann Publishers, San Francisco (November 1998) 2. Meysenburg, M.M., Foster, J., Saghi, G., Dickinson, J., Jacobsen, R.T., Shreeve, J.M.: The effect of pseudo-random number generator quality on the performance of a simple genetic algorithm. Master’s thesis (1997) 3. Meysenburg, M.M., Foster, J.A.: The quality of pseudo-random number generations and simple genetic algorithm performance. In: ICGA, pp. 276–282 (1997) 4. Meysenburg, M.M., Foster, J.A.: Randomness and GA performance, revisited. In: Banzhaf, W., Daida, J., Eiben, A.E., Garzon, M.H., Honavar, V., Jakiela, M., Smith, R.E. (eds.) Proceedings of the Genetic and Evolutionary Computation Conference, Orlando, Florida, USA, July 13-17, vol. 1, pp. 425–432. Morgan Kaufmann, San Francisco (1999) 5. Cantú-Paz, E.: On random numbers and the performance of genetic algorithms. In: GECCO, pp. 311–318 (2002) 6. Matsumoto, M., Nishimura, T.: Mersenne twister: A 623-dimensionally equidistributed uniform pseudorandom number generator. ACM Transactions on Modeling and Computer Simulation 8(1), 3–30 (1999) 7. Press, W., Flannery, B., Teukolsky, S., Vetterling, W.: Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, Cambridge (1992) 8. Kennedy, J., Eberhart, R.C.: Particle swarm optimization. In: Proceedings of the IEEE International Conference on Neural Networks, vol. IV, pp. 1942–1948 (1995) 9. Eberhart, R.C., Shi, Y., Kennedy, J.: Swarm Intelligence, 1st edn. The Morgan Kaufmann Series in Artificial Intelligence. Morgan Kaufmann, San Francisco (April 2001) 10. Eberhart, R.C., Kennedy, J.: A new optimizer using particle swarm theory, 39–43 (1995) 11. Price, K.V., Storn, R., Lampinen, J.: Differential Evolution: A practical Approach to Global Optimization. Springer, Berlin (2005) 12. 
Storn, R., Price, K.: Differential evolution – a simple and efficient heuristic for global optimization over continuous spaces. J. of Global Optimization 11(4), 341–359 (1997) 13. Michalewicz, Z.: Genetic Algorithms + Data Structures = Evolution Programs. Springer, Inc., New York (1994) 14. Mitchell, M.: An Introduction to Genetic Algorithms. MIT Press, Cambridge (1998) 15. Yang, X.-S.: Firefly algorithms for multimodal optimization. In: Watanabe, O., Zeugmann, T. (eds.) SAGA 2009. LNCS, vol. 5792, pp. 169–178. Springer, Heidelberg (2009) 16. Yang, X.-S., Deb, S.: Eagle strategy using lévy walk and firefly algorithms for stochastic optimization. In: González, J.R., Pelta, D.A., Cruz, C., Terrazas, G., Krasnogor, N. (eds.) NICSO 2010. Studies in Computational Intelligence, vol. 284, pp. 101–111. Springer, Heidelberg (2010)


17. Alba, E., Tomassini, M.: Parallelism and evolutionary algorithms. IEEE Trans. Evolutionary Computation 6(5), 443–462 (2002) 18. Montgomery, D., Runger, G.: Applied Statistics and Probability for Engineers. John Wiley and Sons Ltd, New York (2002) 19. García, S., Molina, D., Lozano, M., Herrera, F.: A study on the use of non-parametric tests for analyzing the evolutionary algorithms behaviour: a case study on the cec’2005 special session on real parameter optimization. J. Heuristics 15(6), 617–644 (2009) 20. Sheskin, D.: Handbook of parametric and nonparametric statistical procedures. Chapman Hall CFC, London (2003) 21. Zar, J.: Biostatistical Analysis. Prentice-Hall, Inc., Upper Saddle River (2007) 22. García, S., Fernández, A., Luengo, J., Herrera, F.: A study of statistical techniques and performance measures for genetics-based machine learning: accuracy and interpretability. Soft Comput. 13(10), 959–977 (2009) 23. Tang, K., Li, X., Suganthan, P.N., Yang, Z., Weise, T.: Benchmark functions for the cec’2010 special session and competition on large-scale global optimization. Technical report, Nature Inspired Computation and Applications Laboratory (NICAL), School of Computer Science and Technology, University of Science and Technology of China (USTC), Electric Building No. 2, Room 504, West Campus, Huangshan Road, Hefei 230027, Anhui, China (2009) 24. Tang, K., Yao, X., Suganthan, P.N., MacNish, C., Chen, Y.P., Chen, C.M., Yang, Z.: Benchmark functions for the CEC 2008 special session and competition on large scale global optimization. Technical report, Nature Inspired Computation and Applications Laboratory, USTC, China (2007)

New Efficient Techniques for Dynamic Detection of Likely Invariants Saeed Parsa, Behrouz Minaei, Mojtaba Daryabari, and Hamid Parvin Computer Engineering School, Iran University of Science and Technology (IUST), Tehran, Iran {Parsa,b_minaei,Drayabari,Parvin}@iust.ac.ir

Abstract. Invariants can be defined as prominent relations among program variables. The Daikon software implements a practical algorithm for invariant detection, and several other dynamic approaches to invariant detection exist. Compared with these other methods, Daikon is considered the best software developed for dynamic invariant detection. However, the method has some problems: its time complexity is high, which limits its usefulness in practice. The bottleneck of the algorithm is predicate checking. In this paper, two new techniques are presented to improve the performance of the Daikon algorithm. Experimental results show that, with these amendments, the runtime of dynamic invariant detection is much better than that of the original method. Keywords: Dynamic Invariant Detection, Genetic Algorithm, Daikon.

A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part I, LNCS 6593, pp. 381–390, 2011. © Springer-Verlag Berlin Heidelberg 2011

1 Introduction

Invariants are formulas or rules that emerge from the source code of a program and remain unchanged when the program is run with different parameters. For instance, in a program that sorts an array of integers in descending order, the first item in the array must be at least as large as the second, the second at least as large as the third, and so on. Invariants have a significant impact on software testing. Among the tools developed so far, Daikon is well suited to dynamic invariant detection; however, the method has some problems and weaknesses. Many studies have therefore aimed at improving its performance, which has resulted in several versions of Daikon (Ernst et al. 2000), (Perkins and Ernst 2004). For instance, the latest version of Daikon includes new techniques for equal variables, dynamically constant variables, variable hierarchy and suppression of weaker invariants (Perkins and Ernst 2004). This paper discusses dynamic invariant detection in Daikon and presents two ideas to improve its performance.

Invariants in programs are sets of rules that hold among the values of variables and remain unchanged across different values of the input variables in consecutive runs of the program. Invariants are very useful in testing software behavior: based on them, a programmer can decide whether the behavior of the program is correct (Kataoka et al. 2001), (Nimmer and Ernst 2002). For instance, a programmer who realizes from the invariants that the value of a variable is unintentionally always constant may conclude that the code has a bug. Invariants also help programmers compare two programs and check their validity. For instance, a person who writes a program for sorting a series of data can judge whether it is correct by comparing its invariants against those of a well-known, reliable sort program such as Merge Sort; the presupposition is that the two sets of invariants must be almost the same. Additionally, invariants are useful for documenting and introducing the attributes of a program: when no documentation or explanation exists for a program and someone wants to understand its attributes in order to correct or extend it, invariants are very helpful, especially if the program is large and its code is unstructured.

There are two approaches to invariant detection, static and dynamic. In the static approach, invariants are detected by compiler-based techniques (for example, extraction of data flow graphs from the program source code). The dynamic approach, on the other hand, detects invariants by running the program several times with different input parameter values and examining the values of variables and the relations between them. Dynamic methods are explained in detail in the next section (Ernst et al. 2006). Each of the two approaches has advantages and disadvantages, which are discussed in this paper. Tools for static invariant detection include KeY and ESC for the Java language and LCLint for the C language (Evans et al. 1994), (Schmitt and Benjamin 2007). In static detection, the biggest problem is the difficulty of discovering the invariants: tracing the code and detecting rules between variable values is hard, especially when pointers, polymorphism and similar features must be considered. In dynamic methods, the biggest problem is that they are imprecise and time-consuming and, more importantly, do not provide very reliable answers. In particular, when the number of variables is high, the performance of dynamic invariant detection methods decreases, so turning to heuristic methods is justified in such situations. If very certain invariants are required, large test files must be used, which in turn slows down dynamic invariant detection software. In these situations a heuristic method is also justified and effective.

This paper first explains the dynamic invariant detection method in Section 2. It then discusses the problems of the Daikon algorithm in Section 3. Two ideas for reducing Daikon's runtime are elaborated in Section 4. Section 5 compares the results of the new ideas with the previous version of Daikon on two programs written in C. Finally, Section 6 presents the conclusion and future work.

2 Daikon Algorithm

Daikon first runs the program with several different input parameters. To do so it instruments the program, as explained in the next section, and on every run saves the variable values to a file called a data trace file. Daikon then extracts the values of the variables from the data trace files and, using a series of predefined relations, discovers the invariants and saves them. Daikon discovers unary, binary and ternary invariants. Unary invariants involve one variable; for instance, X > a states that a variable is larger than a constant, and X ≡ a (mod b) states that X mod b = a. X > Y and X = Y + c are examples of binary invariants, and Y = aX + bZ is an example of a ternary invariant considered in Daikon, where X, Y and Z are variables and a, b and c are constants. Daikon checks the invariants against the data trace file of the next run and removes from the list of true invariants any invariant that does not hold for the current values of the variables. This procedure is repeated until the invariants are considered sufficiently reliable (Ernst et al. 2006).

2.1 Problem Space

As described above, Daikon runs the program with several different input parameters and extracts invariants over the variables recorded in the data trace files. The program code must be instrumented, which means recording the values of variables at program points; this can be done dynamically by means of special tools. Daikon provides two such tools: Chicory, a front end for Java programs, and Kvasir, a front end for C and C++ programs. Both are open source. Kvasir runs only on Linux. For each procedure in the source code it produces one "Enter Procedure" record and one "Exit Procedure" record: the former lists the values of the variables on entry to the procedure and the latter lists their values on exit (Ernst et al. 2006). Daikon uses Kvasir's results, which are files with the dtrace extension, to discover invariants. A sample source code is presented in Fig 1.
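Kvasir's corresponding output for Fig 1 is shown in Fig 2.

    #include <stdio.h>

    /* Forward declaration so that main() can call compute(). */
    int compute(int yr, int d1, int d2);

    int main() {
        int year;
        scanf("%d", &year);
        int w;
        int s;
        w = year;
        s = year + 1;
        compute(year, w, s);
    }

    /* Adds d1 and d2, plus one when yr is divisible by 4. */
    int compute(int yr, int d1, int d2) {
        if (yr % 4)
            return d1 + d2;
        else
            return d1 + d2 + 1;
    }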

Fig. 1. A sample source code in C language


As can be seen, enter and exit sections are produced for both the main and compute functions of the source code, and the variable values are listed as stated previously.

    ..main():::ENTER

    ..compute():::ENTER
    yr 4 1
    d1 4 1
    d2 5 1

    ..compute():::EXIT0
    yr 4 1
    d1 4 1
    d2 5 1
    return 10 1

    ..main():::EXIT0
    return 10 1

Fig. 2. Kvasir's corresponding output to code mentioned in Fig 1
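The detection loop of Section 2 can be sketched as follows: candidate invariants are instantiated from a fixed set of templates, and a candidate is dropped as soon as one sample contradicts it. This is only a minimal Python sketch under stated assumptions, not Daikon's implementation: the three binary templates, the function names and the dictionary layout are hypothetical, and the samples are the compute() exit values for yr = 4, 2 and 5.

```python
from itertools import permutations

def candidate_invariants(variables):
    """Instantiate a few hypothetical binary templates for every ordered pair."""
    candidates = {}
    for x, y in permutations(variables, 2):
        candidates[f"{x} == {y}"] = lambda s, x=x, y=y: s[x] == s[y]
        candidates[f"{x} < {y}"] = lambda s, x=x, y=y: s[x] < s[y]
        candidates[f"{x} == {y} + 1"] = lambda s, x=x, y=y: s[x] == s[y] + 1
    return candidates

def detect_invariants(samples):
    """Keep only the candidates that hold on every sample (falsification)."""
    surviving = candidate_invariants(sorted(samples[0]))
    for sample in samples:
        surviving = {name: check for name, check in surviving.items()
                     if check(sample)}
    return sorted(surviving)

# Exit values of compute() for three runs (assumed inputs yr = 4, 2, 5).
samples = [
    {"yr": 4, "d1": 4, "d2": 5, "ret": 10},
    {"yr": 2, "d1": 2, "d2": 3, "ret": 5},
    {"yr": 5, "d1": 5, "d2": 6, "ret": 11},
]
likely = detect_invariants(samples)
```

With these samples, relations such as d1 == yr and d2 == d1 + 1 survive, while d2 < d1 is falsified by the first sample. The surviving relations are only likely invariants: a further run could still falsify them.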

3 Daikon Disadvantage

The Daikon algorithm has some disadvantages. One of the most important is its long run time, which depends on the number of program runs, the size of the program and the number of templates against which the values of the variables are tested. The long run time has also led to imprecision, because Daikon has been forced to reduce its set of templates to cope with the run time limitation: it discovers only invariants over one, two and three variables. Two ideas are presented in the next section; employing them significantly reduces the run time (Ernst et al. 2006).

4 New Ideas for Improving Invariant Detection

This paper aims to show that applying the following amendments to the Daikon source code significantly improves its runtime. Note that, despite the changes in the Daikon source code, the output invariants are exactly the same as before. In the following, each idea is tested and its results are presented. The first idea rests on the fact that some variables do not need to be checked; the properties of such variables are described in Section 4.1. The second idea builds on the first: it concerns sorting the data trace files and is examined in Section 4.2.

4.1 The First Idea

The Daikon algorithm runs the program with several different input parameters and then extracts and checks the values of the variables, although many derived variables and function parameters may keep the same values in subsequent runs. It is therefore unnecessary to check the values of unmodified variables. For instance, if there were three variables named X, Y, Z with the values 3, 4 and 5 respectively, some of the invariants that may be discovered are X

(a) 1 2 3 4 5 6        (b) 5 6 1 2 4 3

Fig. 3. Chromosomes corresponding to the data trace file orderings of Table 1 (a) and Table 2 (b)

4.2.2 Crossover Operator

The problem uses cycle crossover, which creates a child in which every gene is taken from the corresponding position of one of the parents, cycle by cycle. For an example see Fig 4, which depicts two parents and their resulting children after cycle crossover (Moraglio et al. 2006).

4.2.3 Mutation Operator

The probability of mutation is set to a positive real number near zero; here 0.001 is selected. It is applied per chromosome. The operator exchanges the values of two random genes.

4.2.4 Selection Operator

Truncation selection is employed as the selection operator of the genetic algorithm. All chromosomes, parents and children alike, are first sorted according to their fitness. The best of them are then selected as the new population of the genetic algorithm.

4.2.5 Fitness Function

The goal is to order the data trace files so that the number of variables whose values change between consecutive runs is minimal. Table 1 shows that the variables are modified at just 8 points: there are four differences on the run of 3.dtrace and four on the run of 4.dtrace. In Table 2 there are twelve differences altogether, on the runs of 1.dtrace, 4.dtrace and 3.dtrace. The first ordering of the data trace files is therefore better than the second. The fitness values for Table 1 and Table 2 are 8 and 12 respectively.
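The crossover and mutation operators of Sections 4.2.2 and 4.2.3 can be sketched in Python as below. This is an illustrative sketch rather than the authors' code: it assumes the common form of cycle crossover in which successive cycles are taken alternately from the two parents, and the function names and the `rng` parameter are hypothetical.

```python
import random

def cycle_crossover(p1, p2):
    """Cycle crossover (CX) for two permutations; returns two children."""
    n = len(p1)
    pos_in_p1 = {gene: i for i, gene in enumerate(p1)}
    child1, child2 = [None] * n, [None] * n
    take_from_p1 = True
    for start in range(n):
        if child1[start] is not None:
            continue                      # position already covered by a cycle
        i = start
        while child1[i] is None:          # walk one cycle of positions
            if take_from_p1:
                child1[i], child2[i] = p1[i], p2[i]
            else:
                child1[i], child2[i] = p2[i], p1[i]
            i = pos_in_p1[p2[i]]
        take_from_p1 = not take_from_p1   # alternate the source parent per cycle
    return child1, child2

def swap_mutation(chromosome, rate=0.001, rng=random):
    """With probability `rate` per chromosome, exchange two random genes."""
    mutated = list(chromosome)
    if rng.random() < rate:
        i, j = rng.sample(range(len(mutated)), 2)
        mutated[i], mutated[j] = mutated[j], mutated[i]
    return mutated

children = cycle_crossover([1, 2, 3, 4, 5, 6], [2, 1, 4, 3, 6, 5])
```

Because every position of a child is filled from the same position of one parent, both children remain valid permutations of the data trace files, which is why this operator suits an ordering problem.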

5 Comparison of the Results

Two programs written in C were used for comparing the results. Kvasir was run six times on the first program, whose source is presented in Fig 1, and the resulting values of the variables are shown in Table 1.


Fig. 4. Cycle crossover of two parents and their resulting children

Then the Daikon source code was changed so as not to check unmodified variables in consecutive data trace files, and the word "Trace" was written to the output each time a variable was checked. The word "Trace" appeared sixteen times when the unrevised version of Daikon was run over the data in Table 1. Every occurrence of the word "Trace" in the output indicates that Daikon has checked all of the invariant templates. When the revised version of Daikon was run with the above-mentioned amendment (so as not to check unmodified variables), the word "Trace" appeared only eight times. Halving the number of occurrences of the word "Trace" shows that the run time decreased effectively. However, if the data trace files are processed in the arrangement of Table 2, the word "Trace" appears twelve times, because that arrangement has more differences in the values of the variables.

Table 1. The values of the variables obtained by Kvasir over the first program

    Data Trace File   Yr   d1   d2   Return
    1.dtrace          4    4    5    10
    2.dtrace          4    4    5    10
    3.dtrace          2    2    3    5
    4.dtrace          5    5    6    11
    5.dtrace          5    5    6    11
    6.dtrace          5    5    6    11


As mentioned before, Table 1 shows that the variables are modified at just 8 points, while Table 2 shows that they are modified at 12 points. As discussed above, if the best or nearly the best combination of data trace files, based on minimum differences in the values of the variables, is selected, then the performance improves considerably, provided that the first modification to Daikon is applied.

Table 2. The values of the variables in Table 1 in a non-effective reordering

    Data Trace File   Yr   d1   d2   Return
    5.dtrace          5    5    6    11
    6.dtrace          5    5    6    11
    1.dtrace          4    4    5    10
    2.dtrace          4    4    5    10
    4.dtrace          5    5    6    11
    3.dtrace          2    2    3    5
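The fitness function of Section 4.2.5 and the truncation selection of Section 4.2.4 can be illustrated with a short Python sketch over the data of Tables 1 and 2. The function names and the dictionary layout are hypothetical; the values are copied from Table 1, and a lower fitness is treated as better because the number of changed values is to be minimised.

```python
def fitness(order, traces):
    """Count the variable values that differ between consecutive data trace files."""
    changes = 0
    for prev, curr in zip(order, order[1:]):
        changes += sum(a != b for a, b in zip(traces[prev], traces[curr]))
    return changes

def truncation_selection(population, traces, size):
    """Sort parents and children together by fitness and keep the best `size`."""
    return sorted(population, key=lambda order: fitness(order, traces))[:size]

# (Yr, d1, d2, Return) for each data trace file, copied from Table 1.
traces = {
    1: (4, 4, 5, 10), 2: (4, 4, 5, 10), 3: (2, 2, 3, 5),
    4: (5, 5, 6, 11), 5: (5, 5, 6, 11), 6: (5, 5, 6, 11),
}
table1_order = [1, 2, 3, 4, 5, 6]
table2_order = [5, 6, 1, 2, 4, 3]
print(fitness(table1_order, traces), fitness(table2_order, traces))  # prints "8 12"
```

The two values agree with the fitness values stated for Table 1 and Table 2, and with the eight and twelve occurrences of "Trace" reported for the revised Daikon on these two arrangements.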

As a second instance, the source code of Fig 5 was tested; it is the same as the source code in Fig 1 but uses pointers.

    #include <stdio.h>

    /* Forward declaration so that main() can call compute(). */
    int compute(int yr, int* d1, int* d2);

    int main() {
        int year;
        scanf("%d", &year);
        int w = year, s = year + 1;
        compute(year, &w, &s);
    }

    int compute(int yr, int* d1, int* d2) {
        *d1 = *d2;
        if (yr % 4) {
            return *d1 + *d2;
        } else {
            return *d1 + *d2 + 1;
        }
    }

Fig. 5. The sample source code of Fig 1 employing the pointers

Table 3 lists the values of the variables obtained on six runs of Kvasir.

Table 3. The values of the variables obtained by Kvasir over the second program

    Data Trace File   Yr   d1   d2   Return
    1.dtrace          1    1    2    4
    2.dtrace          3    3    4    8
    3.dtrace          4    4    5    11
    4.dtrace          2    2    3    6
    5.dtrace          4    4    5    11
    6.dtrace          2    2    3    6


After the original Daikon was run on the values of the variables listed in Table 3, the word "Trace" was written 35 times. When the revised version of Daikon (which does not check unmodified variables) was run over them, it also appeared 35 times, because every variable has a different value from the corresponding one in the previous run. If the revised version of Daikon runs on the combination of data trace files obtained by the genetic algorithm, i.e. Table 4, the word "Trace" appears 23 times, while with the original Daikon it appears 35 times as before.

Table 4. The values of the variables in Table 3 reordered by GA in an effective manner

    Data Trace File   Yr   d1   d2   Return
    1.dtrace          1    1    2    4
    2.dtrace          3    3    4    8
    6.dtrace          2    2    3    6
    4.dtrace          2    2    3    6
    3.dtrace          4    4    5    11
    5.dtrace          4    4    5    11
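For only six data trace files, the ordering found by the GA can be checked against an exhaustive search over all 720 permutations. The Python sketch below uses that exhaustive baseline instead of a genetic algorithm, purely as a cross-check; it is not the method of the paper. The names are hypothetical, and the counts cover only the four columns tabulated in Table 3, so they are not the "Trace" counts of 35 and 23 reported above.

```python
from itertools import permutations

# (Yr, d1, d2, Return) for each data trace file, copied from Table 3.
table3 = {
    1: (1, 1, 2, 4), 2: (3, 3, 4, 8), 3: (4, 4, 5, 11),
    4: (2, 2, 3, 6), 5: (4, 4, 5, 11), 6: (2, 2, 3, 6),
}

def changed_values(order, traces):
    """Number of tabulated values that differ between consecutive files."""
    return sum(a != b
               for prev, curr in zip(order, order[1:])
               for a, b in zip(traces[prev], traces[curr]))

original_order = [1, 2, 3, 4, 5, 6]   # ordering of Table 3
ga_order = [1, 2, 6, 4, 3, 5]         # ordering of Table 4

# Exhaustive baseline: try every ordering and keep the minimum.
best_order = min(permutations(sorted(table3)),
                 key=lambda order: changed_values(order, table3))
best_count = changed_values(best_order, table3)
```

Under this count the ordering of Table 4 reaches the minimum over all orderings, since it places the runs with identical values next to each other. Exhaustive search grows factorially with the number of files, which is the reason a heuristic such as the GA is needed for realistic numbers of data trace files.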

6 Conclusion and Further Works

Because invariants have a very influential role in software testing, decreasing the run time of invariant detection and increasing its precision helps the field of software engineering. Two improvements are proposed to increase the performance of dynamic invariant detection. With the amendments discussed above, the runtime of dynamic invariant detection was much better than that of the original method. In particular, when the number of variables is high, the performance of dynamic invariant detection methods decreases, so turning to heuristic methods such as the proposed improvements is justified in such situations, and the proposed method shows its major effect when the number of variables is large. From another perspective, if very certain invariants are required, large data trace files must be used, which slows down dynamic invariant detection software; in these situations the proposed method is also justified and effective. For further study, one open issue is to measure the runtime of comparing invariant templates against the values of the variables and vice versa, to determine which of the two options is better, and then to use fuzzy algorithms to select one of them in each case.

References

1. Cook, J.E., Wolf, A.L.: Discovering Models of Software Processes from Event-Based Data. ACM Trans. Software Eng. and Methodology (1998a)
2. Cook, J.E., Wolf, A.L.: Event-Based Detection of Concurrency. In: Proc. ACM SIGSOFT Symp. (1998b)
3. Dwyer, M.B., Clarke, L.A.: Data Flow Analysis for Verifying Properties of Concurrent Programs. In: Proc. Second ACM SIGSOFT Symp. (1994)
4. Ernst, M.D., Griswold, W.G., Kataoka, Y., Notkin, D.: Dynamically Discovering Program Invariants Involving Collections. Technical Report UW-CSE-99-11-02 (2000)
5. Ernst, M.D., Perkins, J.H., Guo, P.J., McCamant, S., Pacheco, C., Tschantz, M.S., Xiao, C.: The Daikon System for Dynamic Detection of Likely Invariants. Science of Computer Programming (2006)
6. Evans, D., Guttag, J., Horning, J., Tan, Y.M.: LCLint: A Tool for Using Specification to Check Code. In: Proc. Second ACM SIGSOFT Symp. (1994)
7. Kataoka, Y., Ernst, M.D., Griswold, W.G., Notkin, D.: Automated Support for Program Refactoring Using Invariants. In: Proc. Int'l Conf. Software Maintenance (2001)
8. Lencevicius, R., Hölzle, U., Singh, A.K.: Query-Based Debugging of Object-Oriented Programs. In: Proc. Conf. Object-Oriented Programming, Systems, Languages, and Applications (1997)
9. Mitchell, M.: An Introduction to Genetic Algorithms. The MIT Press, Cambridge (1998)
10. Moraglio, A., Kim, Y.H., Yoon, Y., Moon, B.R., Poli, R.: Generalized cycle crossover for graph partitioning. In: GECCO (2006)
11. Nimmer, J.W., Ernst, M.D.: Automatic Generation of Program Specifications. In: Proc. Int'l Symp. Software Testing and Analysis (2002)
12. Perkins, J.H., Ernst, M.D.: Efficient Incremental Algorithms for Dynamic Detection of Likely Invariants. In: Proc. ACM SIGSOFT Symp. (2004)
13. Schmitt, H., Weiß, B.: Inferring Invariants by Static Analysis in KeY (2007)

Classification Ensemble by Genetic Algorithms Hamid Parvin, Behrouz Minaei, Akram Beigi, and Hoda Helmi School of Computer Engineering, Iran University of Science and Technology (IUST),Tehran, Iran {Parvin,B_Minaei,Beigi,Helmi}@iust.ac.ir

Abstract. Different classifiers with different characteristics and methodologies can complement each other and cover each other's weaknesses; thus a classifier ensemble is an important approach to handling this drawback. If an automatic and fast method is available to approximate the accuracies of different classifiers on a typical dataset, learning can be converted to an optimization problem, for which the genetic algorithm is an important approach. We propose a selection method for classification ensembles that applies a GA to improve classification performance. The proposed method, CEGA, is examined on several datasets and shows considerable improvements. Keywords: Classifier Selection, Classifier Ensemble, Genetic Algorithms.

A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part I, LNCS 6593, pp. 391–399, 2011. © Springer-Verlag Berlin Heidelberg 2011

1 Introduction

Classification and regression are common problems in the field of pattern recognition. Classification methods are supervised learning methods that determine a discriminative function for mapping the n-dimensional feature space to decision regions by assigning data samples to predefined classes [1]. Designing models with high recognition rates is a very important goal in this field. In pattern recognition, the input space is mapped into a high-dimensional feature space, and the aim is to determine the optimal hyperplane in that feature space. The mapped function approximates the main function for each unseen data sample. Regression includes any technique for modeling and analyzing several variables when the focus is on the relationship between a dependent variable and one or more independent variables. The accuracy of a regression is evaluated by the sum of squared errors between the target function and the regression function on the training data. Recognition systems are used in many applications, and researchers aim to improve their performance [2]. Although the accuracy of a classifier ensemble is not always better than that of the most accurate classifier in the ensemble, it is never less than their average accuracy [3]. Combination of multiple classifiers (CMC) can be considered a general solution method for pattern recognition problems [4]. The inputs of CMC are the results of separate classifiers and its output is their consensus decision. Neural network ensembles are also used in data mining [5]. It has been shown that if more diverse classifiers are used in the ensemble, the error can be reduced considerably [6]. Ensemble learning algorithms train multiple base classifiers and then combine their predictions. Ensemble learning is firmly established as a practical and effective solution for difficult problems. It has been shown that the genetic algorithm is an effective tool for data mining and pattern recognition tasks [7]. There are two different approaches to applying genetic algorithms: the first is direct use as a classifier and the second is use as an optimizer [8, 9].

2 Background

2.1 Classifier Ensemble

A classifier ensemble works well because different classifiers with different characteristics and methodologies can complement each other and cover each other's weaknesses. If different classifiers vote as an ensemble, the overall error rate decreases significantly compared with using each one individually. One of the most common policies in classifier ensembles is majority voting. In this method the output of each classifier is its vote, and the class with the most votes wins. Assume that E is an ensemble of n classifiers {e1, e2, …, en} and that there are m classes. A binary matrix D is obtained by applying the ensemble to a data sample:

$$D = \begin{bmatrix} d_{1,1} & d_{1,2} & \cdots & d_{1,n} \\ \vdots & \vdots & \ddots & \vdots \\ d_{m-1,1} & d_{m-1,2} & \cdots & d_{m-1,n} \\ d_{m,1} & d_{m,2} & \cdots & d_{m,n} \end{bmatrix} \qquad (1)$$

If classifier j votes that the data sample belongs to class i, $d_{i,j}$ is one; otherwise it is zero. According to eq. 2, the ensemble decides that the data sample belongs to class b:

$$b = \arg\max_{1 \le i \le m} \sum_{j=1}^{n} d_{i,j} \qquad (2)$$
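As an illustration of eq. 1 and eq. 2, the following Python sketch builds the binary matrix D from the votes of the classifiers and returns the class whose row sum is largest. The function name, the 0-based class indices and the tie-breaking rule (the lowest class index wins) are assumptions made for the example.

```python
def majority_vote(votes, num_classes):
    """votes[j] is the 0-based class index chosen by classifier j."""
    # Binary m-by-n matrix D of eq. 1: D[i][j] = 1 iff classifier j votes for class i.
    D = [[1 if vote == i else 0 for vote in votes] for i in range(num_classes)]
    row_sums = [sum(row) for row in D]
    # eq. 2: the winning class maximises the row sum; ties go to the lowest index.
    return max(range(num_classes), key=lambda i: row_sums[i])

winner = majority_vote([0, 2, 2, 1, 2], num_classes=3)
```

Here five classifiers vote over three classes; class 2 receives three of the five votes and is returned.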

In the weighted majority vote algorithm, the members' votes have different worth, and every vote is multiplied by its worth [3].

2.2 Genetic Algorithm

Genetic algorithms are adaptive heuristic search algorithms premised on the evolutionary ideas of natural selection and genetics. GAs represent an intelligent exploitation of random search within a defined search space to solve problems. Many real-world problems involve finding optimal parameters, which may prove difficult for traditional methods but is well suited to GAs [8].

2.3 Regression

Suppose a training set contains n data samples {x1, x2, …, xn} with corresponding values {t1, t2, …, tn}. Fig. 1 shows a training set with n = 10 data points; the curve shows the function sin(2πx) used to generate the data. The goal is to predict the value of t for some new value of x without knowledge of the real curve.


Fig. 1. Function sin (2πx) used to generate 10 data samples

The support vector machine is one of the best learners for regression problems. To make the regression results robust and to reduce output variance, an ensemble of support vector machines is used as regressors, i.e. k support vector machines are prepared (trained).

3 Related Works

For classifier selection, the assumption is that there is an oracle that can decide which classifier is best for classifying each data sample, and that classifier's result is used as the final decision of the classifier ensemble [6]. Classifiers are sorted by their predetermined accuracies, and for a data sample the reliability degree of each classifier is taken to be the highest probability with which it assigns a class to the instance. The procedure starts from the most accurate classifier and continues as follows. If the reliability degree of the classifier for the input sample is high, its output is the final decision of the ensemble; if not, the reliability degree of the next classifier is calculated, until either a reliable classifier is found or all classifiers in the ensemble have been checked. If all classifiers have low reliability degrees for the input sample, then either the ensemble makes no decision for that sample or the classifier with the highest reliability degree is chosen [10]. There are two general approaches to classifier selection, which are explained in the following sections.

3.1 Dynamic Estimation of the Best Local Classifier

In this approach, the best classifier is selected for each region while the ensemble is constructed. In [11], classifier selection is based on the position of the input sample in feature space. To estimate the reliability degree for an input sample, the classes of its k nearest neighbors are determined by all the classifiers. The accuracy of each classifier is then calculated for the region of that instance, and these accuracies are used as estimates of the reliability degrees for the instance.
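The sequential selection procedure and the local accuracy estimate described above can be combined in a short Python sketch. It is illustrative only: the data layout (validation samples that record which classifiers labelled them correctly), the threshold of 0.6 and all names are assumptions, and the reliability degree is estimated by the k-nearest-neighbor accuracy rather than by a class probability.

```python
import math

def local_accuracy(x, validation, clf_index, k=3):
    """Fraction of the k nearest validation samples that classifier clf_index got right."""
    nearest = sorted(validation, key=lambda v: math.dist(x, v["x"]))[:k]
    return sum(v["correct"][clf_index] for v in nearest) / len(nearest)

def select_classifier(x, validation, global_accuracy, threshold=0.6, k=3):
    """Return the index of the classifier trusted for the input sample x."""
    order = sorted(range(len(global_accuracy)), key=lambda j: -global_accuracy[j])
    reliability = {j: local_accuracy(x, validation, j, k) for j in order}
    for j in order:                       # most accurate classifier first
        if reliability[j] >= threshold:
            return j
    # No classifier is reliable enough: fall back to the most reliable one.
    return max(order, key=lambda j: reliability[j])

# Hypothetical validation set: classifier 0 is good near the origin,
# classifier 1 is good near (1, 1).
validation = [
    {"x": (0.0, 0.0), "correct": (True, False)},
    {"x": (0.1, 0.0), "correct": (True, False)},
    {"x": (0.0, 0.1), "correct": (True, True)},
    {"x": (1.0, 1.0), "correct": (False, True)},
    {"x": (0.9, 1.0), "correct": (False, True)},
    {"x": (1.0, 0.9), "correct": (True, True)},
]
global_accuracy = [0.70, 0.65]
```

The sort over the whole validation set for every query is the time-consuming k-nearest-neighbor step that motivates moving the cost from the test phase to the training phase.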


3.2 Estimation of Competence Regions

Calculating the k nearest neighbors is a time-consuming task. To reduce the time complexity, the classification task is divided into subsystems. When the whole space is divided into regions, some regions may contain few data samples. To balance the regions, the data can first be clustered and a region then specified for each cluster [8]. To gain the necessary diversity, the data sets can be modified so that each classifier in the ensemble is trained on its own dataset. Bagging and boosting have proven to be two of the most successful approaches in the field of classifier ensembles. In bagging, the ensemble is made of classifiers built on bootstrap replicates of the training set. After the base classifiers have been trained, whether in parallel or serially, their outputs are combined by the majority vote rule [12]. Boosting is also one of the best approaches and is able to create a strong ensemble [13].
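Bagging as described above can be sketched in a few lines of Python. The nearest-centroid base learner, the number of models and the function names are assumptions chosen to keep the example self-contained; any base classifier could take its place.

```python
import random
from collections import Counter

def bootstrap(data, rng):
    """Draw len(data) samples with replacement (one bootstrap replicate)."""
    return [rng.choice(data) for _ in data]

def train_centroid(data):
    """Hypothetical base learner: one mean feature vector per class."""
    sums, counts = {}, Counter()
    for x, label in data:
        acc = sums.setdefault(label, [0.0] * len(x))
        for d, value in enumerate(x):
            acc[d] += value
        counts[label] += 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict_centroid(model, x):
    """Predict the class whose centroid is closest to x."""
    return min(model, key=lambda label: sum((a - b) ** 2
                                            for a, b in zip(model[label], x)))

def bagging_predict(data, x, n_models=11, seed=0):
    """Train n_models base classifiers on bootstrap replicates and take a majority vote."""
    rng = random.Random(seed)
    votes = [predict_centroid(train_centroid(bootstrap(data, rng)), x)
             for _ in range(n_models)]
    return Counter(votes).most_common(1)[0][0]

data = [((0.0, 0.1), "a"), ((0.2, 0.0), "a"), ((0.1, 0.2), "a"),
        ((1.0, 0.9), "b"), ((0.9, 1.1), "b"), ((1.1, 1.0), "b")]
```

Each replicate leaves out some training samples and repeats others, which gives the base classifiers the diversity that the majority vote then exploits.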

4 Proposed Method

4.1 Problem Definition

How can a near-optimal classification framework be designed for a huge dataset? An optimum classifier is both robust and accurate. Classifier fusion is more robust than a single base classifier. Since accuracy is the most important characteristic of a classifier, classifier selection is also necessary. In addition, the time overhead of the system should be moved from the test phase to the training phase.

4.2 Challenges of the Problem

The accuracies of the components (i.e. the base classifiers) are distributed unevenly over the feature space. This can be a drawback, because for a given input instance it is not simple to decide which classifier must be applied. Classifier selection, however, benefits from this uneven distribution: it tries to find the classifiers with the best performance in the desired region (the neighborhood of the input instance). Classifier fusion tries to overcome the drawback by increasing the number of voters, which makes the results more robust. But classifier fusion also faces another difficult problem, the diversity problem: the components of an ensemble must be diverse for the ensemble to achieve higher performance. Some algorithms, such as genetic algorithms, can solve hard problems to an admissible performance. In particular, it has been shown that this family of algorithms is applicable to finding diverse classifier ensembles [14]. How are continuous features handled? To handle continuous features, one should divide the data space into subspaces [15]; in other words, discretization is needed. As mentioned before, another challenge in existing methods is the time overhead of finding the k nearest neighbors, which increases the test time, while our goal is to shift the time overhead to the training phase and keep the test phase as short as possible. In [10] it is shown that each continuous feature can be converted to a discrete one. A kind of classifier that uses rules for classification is also presented there. Each

rule consists of a condition part (A) and a result part (B) and is written as A→B. For each rule, if the condition part is true (it usually determines whether a sample lies in a given subset of the feature space), the result part is executed.

4.3 Genetic Algorithm Based Classifier Ensemble

In this work, a data set is divided into several subsets, and each subset is itself treated as a sub-data set. These subsets may overlap. For each subset, the best classifier is selected or designed. Some rules can also be extracted in parallel.

CEGA Algorithm:
Initialization (dataset D, min_number of regions MINR, max_number of regions MAXR, number of classes c, number of features f, q);
TRD = Extract_training_data(D);
VD = Extract_validation_data(D);
TED = Extract_test_data(D);
Ensemble = Create_ensemble_of_our_classifier(TRD, VD);
DD = Generate_dataset(q, c, f);
Features = Extract_features(DD);
for i = 1 to Num(classifiers)
    Accuracies = train_classifier[i](DD);
    Dataset_of_datasets[i] = Combine(Features, Accuracies);
end for
for i = 1 to Num(classifiers)
    regressor[i] = obtain_regressors_ensemble(Dataset_of_datasets[i]);
previous_acc = 0;
for i = MINR to MAXR
    subspaces = Run_Ga[i];
    GA_Ensemble = Analysis(subspaces);
    current_acc = evaluate_Ensemble(GA_Ensemble, Ensemble, VD);
    if (current_acc < previous_acc) Break;
    previous_acc = current_acc;
end for
GA_Ensemble = Analysis(subspaces);
validation_accuracy = evaluate_Ensemble(GA_Ensemble, regressor[1:p], VD);
test_accuracy = evaluate_Ensemble(GA_Ensemble, regressor[1:p], TED);
return GA_Ensemble, regressor[1:p], validation accuracy, test accuracy;

function evaluate_Ensemble(GA_Ensemble, regressor[1:p], data)
    Result = Test Ensemble(GA_Ensemble, data);
    for i = 1 to number of data
        if (isEmpty(Result(i)))
            Result(i) = Test Ensemble(GA_Ensemble, data[i]);
    return compute accuracy(Result);

Fig. 2. GA based algorithm for classifiers ensemble
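The region-growing loop of Fig. 2 (increase the number of regions while the validation accuracy does not drop) can be sketched in Python as follows. The callables run_ga and evaluate are hypothetical stand-ins for Run_Ga and evaluate_Ensemble, and returning the last non-decreasing candidate, rather than the subspaces of the final iteration, is an assumption of this sketch.

```python
def grow_regions(run_ga, evaluate, min_regions, max_regions):
    """Add regions one at a time until validation accuracy decreases.

    run_ga(r)   -> candidate ensemble found by the GA with r regions (hypothetical)
    evaluate(c) -> validation accuracy of candidate c (hypothetical)
    Returns (number_of_regions, candidate, accuracy) of the last candidate
    whose accuracy did not decrease, or None if the range is empty.
    """
    best = None
    previous_acc = 0.0
    for r in range(min_regions, max_regions + 1):
        candidate = run_ga(r)
        current_acc = evaluate(candidate)
        if current_acc < previous_acc:
            break  # accuracy dropped: stop growing
        best = (r, candidate, current_acc)
        previous_acc = current_acc
    return best
```

With stub callables this could be exercised as grow_regions(lambda r: r, lambda c: table[c], 2, 5), where table maps a region count to a made-up validation accuracy.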

For example, one such problem is deciding which set of classifiers is more diverse and consequently forms a better ensemble. Because of the high complexity of this task, it is solved approximately rather than exactly. GA can solve this problem with a good approximation. The main goal is to make the classification more accurate. In fact, GA is used to escape local optima and to reach a near-optimum ensemble. In the proposed method, each chromosome defines several subspaces of the training data set. The data samples are learned by the classifier that is selected by the regression function of the classifiers' error rates. The chromosome length depends on the feature space and the maximum number of subspaces. The fitness value is high for ensembles that are more accurate and cover more data. The outputs are the subspaces with maximum accuracy. Pseudo-code of the algorithm is presented in Fig. 2. The data set is first split into a training set, a testing set and an evaluation set. Then a number of classifiers are trained on the training set and put in an ensemble of classifiers. Next, some synthetic datasets are produced, and some representative and quickly extractable features are computed from each. This unlabeled dataset is called the dataset of datasets. Each of the base classifiers in the ensemble is then trained on each dataset (each sample in the dataset of datasets) and its accuracy is calculated. These accuracies are set as labels for that dataset (sample), so each sample has as many labels as there are base classifiers. After that, an ensemble of regressors is employed to learn each of these labels. We therefore have an ensemble of regressors that approximates the accuracy of each classifier. Then, as long as the accuracy on the evaluation dataset increases, a region is added to the GA. The GA looks for subspaces (regions) that contain most of the training data and on which the classifiers are the most accurate. Fig. 3 depicts the structure of a chromosome.

Fig. 3. A chromosome: each chromosome consists of m regions, each region is defined with a lower and an upper bound of features

Each chromosome determines all the regions, so the samples belonging to each region can be easily determined. The accuracy on the training data in each region is approximated using an ensemble of SVMs. The fitness value of a chromosome is defined so that chromosomes whose regions contain more data and have high accuracy gradually fill the chromosome pool. Assume that Acc_i (i = 1…m) is the maximum approximated accuracy that the regressors predict for region i, and Num is the number of samples in the training dataset that fall into regions 1 to m. If the number of training samples in a region is less than a threshold, that region is ignored. The fitness function of a chromosome is calculated as in Eq. 3.

\[ \mathrm{Fitness} = \frac{\sum_{i=1}^{m} Acc_i}{\sum_{i=1}^{m} (1 - Acc_i) + 1} \times Num \qquad (3) \]
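A minimal Python sketch of Eq. 3 is given below, assuming that a chromosome is decoded into a list of (lower bound, upper bound) pairs per region as in Fig. 3. Treating the bounds as inclusive, counting each training sample once in Num even when regions overlap, and the parameter name min_samples for the threshold are assumptions of this sketch.

```python
def in_region(x, lower, upper):
    """True if every feature of x lies within the region's bounds (inclusive)."""
    return all(lo <= v <= hi for v, lo, hi in zip(x, lower, upper))


def fitness(regions, samples, region_accuracy, min_samples=1):
    """Fitness of a chromosome following Eq. 3.

    regions         : list of (lower, upper) bound tuples, one per region
    samples         : training samples (feature tuples)
    region_accuracy : Acc_i, the maximum approximated accuracy for region i
    min_samples     : regions with fewer samples than this are ignored
    """
    kept_acc = []
    covered = set()
    for (lower, upper), acc in zip(regions, region_accuracy):
        members = [k for k, x in enumerate(samples) if in_region(x, lower, upper)]
        if len(members) < min_samples:
            continue  # region too sparse: ignore it
        kept_acc.append(acc)
        covered.update(members)
    num = len(covered)  # Num: samples that fall into the kept regions
    return sum(kept_acc) / (sum(1.0 - a for a in kept_acc) + 1.0) * num
```

The "+ 1" in the denominator keeps the fitness finite even when every kept region has an approximated accuracy of 1.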

5 Experimental Results

5.1 Reference Datasets

The proposed method is evaluated on seven standard datasets, namely Iris, Wine, Balance-scale, Bupa and the Monk's problems (three problems), and also on three 2-dimensional hand-made datasets containing three classes each, which are depicted in Fig. 4. None of the datasets had missing values. Before use, all features in all datasets are normalized to N(0,1). The standard data sets are obtained from the UCI repository [16].

Fig. 4. Three hand-made datasets

The experiments are also run on these three hand-made datasets. Each has 3 classes, and the samples are drawn from the normal distribution N(0, 1).

5.2 Parameter Settings

In the experiments, 65% of the records are used as the training set, 15% as the testing set and 20% as the evaluation set. The base classifiers are SVM, MLP, KNN, MKNN1, MKNN2 [17], and PNN. Using LDA with the same set of classifiers, another six classifiers are created and added to the ensemble. One thousand synthetic datasets are produced and nine features are extracted from each. The nine features used in our algorithm are as follows:

• The class change rate in the minimum spanning tree of the data set.
• Degree of linearity of the dataset, obtained by averaging over random selections of two samples from the dataset and testing the samples on the line between those two random samples using a simple linear classifier.
• Number of inherent clusters in the training data.
• Ratios of the minimum and maximum number of samples belonging to a class to the number of samples in the training set.
• Maximum Fisher measure on the features of the training samples.
• The ratio of the number of samples to the number of features in the training samples.
• Average of the distances between the first neighbor of the same class and the first neighbor that belongs to another class.
• Average number of neighbors of the same class in the training samples.
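Some of these dataset-level features are cheap to compute. The Python sketch below covers three of them under stated assumptions: the class change rate is interpreted as the fraction of minimum-spanning-tree edges that join samples of different classes, the tree is built with Prim's algorithm on Euclidean distances, and all function names are illustrative rather than taken from the paper.

```python
import math
from collections import Counter


def class_proportions(labels):
    """Smallest and largest class size divided by the number of samples."""
    counts = Counter(labels)
    n = len(labels)
    return min(counts.values()) / n, max(counts.values()) / n


def samples_per_feature(X):
    """Ratio of the number of samples to the number of features."""
    return len(X) / len(X[0])


def mst_edges(X):
    """Prim's algorithm on the complete Euclidean graph; returns n-1 edges."""
    n = len(X)
    in_tree = [False] * n
    best_dist = [math.inf] * n
    parent = [-1] * n
    best_dist[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the closest vertex that is not yet in the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best_dist[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(X[u], X[v])
                if d < best_dist[v]:
                    best_dist[v] = d
                    parent[v] = u
    return edges


def mst_class_change_rate(X, labels):
    """Fraction of MST edges whose endpoints carry different class labels."""
    edges = mst_edges(X)
    changes = sum(1 for u, v in edges if labels[u] != labels[v])
    return changes / len(edges)
```

The quadratic-time Prim loop is adequate for small synthetic datasets; a larger study would likely use an optimized graph library instead.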

So the dataset of datasets has 1000 records, each with 9 features plus 12 labels. We therefore have an ensemble of SVM regressors that approximates the accuracy of SVM and the other classifiers mentioned above on a dataset. As long as the accuracy on the evaluation dataset increases, a region is added to the GA. The population size and the number of generations in this experiment are 500 and 1000, respectively. Proportional selection is used. The results are averaged over 5 distinct runs and are shown in Table 1.

Table 1. Experimental results

Dataset          Accuracy of Gabe   Accuracy of Gabe   Accuracy of Ensemble   Accuracy of Ensemble
                 on Validation      on Test            on Validation          on Test
Monk 1           0.9896             0.9896             0.9669                 0.9896
Monk 2           0.8838             0.8773             0.8311                 0.8227
Monk 3           0.9803             0.9776             0.9692                 0.9624
Bupa             0.7501             0.7059             0.7147                 0.6897
Iris             0.934              0.9642             0.873                  0.9011
Wine             1                  0.9783             0.9702                 0.9432
Balance_Scale    0.9362             0.8817             0.8936                 0.8387
Dataset 1        0.7988             0.8000             0.7536                 0.7626
Dataset 2        0.8461             0.7778             0.7738                 0.7149
Dataset 3        0.8454             0.7978             0.7646                 0.7131

6 Conclusion

Different classifiers with different characteristics and methodologies can complement each other and cover each other's weaknesses; thus, a classifier ensemble is an important approach to handling this drawback. Determining which classifiers are better, or which set of classifiers is more diverse and consequently forms a better ensemble, is a very important question in a complex problem. If an automatic and fast method is available to approximate the accuracies of different classifiers on a typical dataset, the learning can be converted to an optimization problem, so heuristic optimizers such as GA can solve it with a good approximation. In the proposed method (CEGA), the space is divided into subspaces encoded in a chromosome. Then, using fast regressors, the accuracy of each base classifier in each subspace is approximated, and the maximum accuracy in each subspace is selected. The fitness function is set to the weighted average accuracy of the subspaces. The proposed CEGA is examined on several datasets.

References

1. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification, 2nd edn. John Wiley & Sons, NY (2001)
2. Parvin, H., Alizadeh, H., Minaei-Bidgoli, B., Analoui, M.: An Scalable Method for Improving the Performance of Classifiers in Multiclass Applications by Pairwise Classifiers and GA. NCM, Korea (2008)
3. Kuncheva, L.I.: Combining Pattern Classifiers, Methods and Algorithms. Wiley, New York (2005)
4. Saberi, A., Vahidi, M., Minaei-Bidgoli, B.: Learn to Detect Phishing Scams Using Learning and Ensemble Methods. In: IEEE/WIC/ACM International Conference on Intelligent Agent Technology, Workshops (IAT 2007), Silicon Valley, USA, November 2-5, pp. 311–314 (2007)
5. Qiang, F., Shang-xu, H., Sheng-ying, Z.: Clustering-based selective neural network ensemble. Journal of Zhejiang University Science 6A(5), 387–392 (2005); ISSN 1009-3095
6. Lam, L.: Classifier combinations: Implementations and theoretical issues. In: Kittler, J., Roli, F. (eds.) MCS 2000. LNCS, vol. 1857, pp. 77–86. Springer, Heidelberg (2000)
7. Freitas, A.A.: A survey of Evolutionary Algorithms for Data Mining and Knowledge Discovery. Advances in Evolutionary Computation. Springer, Heidelberg (2002)
8. Bandyopadhyay, S., Muthy, C.A.: Pattern Classification Using Genetic Algorithms. Pattern Recognition Letters 16, 801–808 (1995)
9. Bala, J., De Jong, K., Huang, J., Vafaie, H., Wechsler, H.: Using learning to facilitate the evolution of features for recognizing visual concepts. Evolutionary Computation 4(3) (1997); Special Issue on Evolution, Learning, and Instinct: 100 years of the Baldwin Effect
10. Woods, K., Kegelmeyer, W.P., Bowyer, K.: Combination of multiple classifiers using local accuracy estimates. IEEE Transactions on Pattern Analysis and Machine Intelligence 19, 405–410 (1997)
11. Michalewicz, Z.: Genetic Algorithms + Data Structures = Evolution Programs, 3rd edn. Springer, Heidelberg (1996)
12. Breiman, L.: Random forests. Machine Learning 45, 5–32 (2001)
13. Breiman, L.: Arcing classifiers. The Annals of Statistics 26(3), 801–849 (1998)
14. Gunter, S., Bunke, H.: Creation of classifier ensembles for handwritten word recognition using feature selection algorithms. In: IWFHR (2002)
15. Puuronen, S., Tsymbal, A., Skrypnyk, I.: Correlation-based and contextual merit-based ensemble feature selection. In: Hoffmann, F., Adams, N., Fisher, D., Guimarães, G., Hand, D.J. (eds.) IDA 2001. LNCS, vol. 2189, pp. 135–144. Springer, Heidelberg (2001)
16. Blake, C.L., Merz, C.J.: UCI Repository of machine learning databases (1998), http://www.ics.uci.edu/~mlearn/MLRepository.html
17. Parvin, H., Alizadeh, H., Minaei-Bidgoli, B.: MKNN: Modified K-Nearest Neighbor. In: WCECS 2008, San Francisco, USA (2008)

Simulated Evolution (SimE) Based Embedded System Synthesis Algorithm for Electric Circuit Units (ECUs)

Umair F. Siddiqi1, Yoichi Shiraishi1, Mona A. El-Dahb1, and Sadiq M. Sait2

1 Department of Production Science & Technology, Gunma University, Japan
{umair,siraisi,mona}@emb.cs.gunma-u.ac.jp
2 King Fahd University of Petroleum & Minerals, Dhahran, Saudi Arabia
[email protected]

Abstract. An ECU (Electric Circuit Unit) is a type of embedded system that is used in automobiles to perform different functions. The synthesis process of an ECU requires that the hardware be optimized for cost and power consumption and that it provide fault tolerance, as many applications are related to car safety systems. This paper presents a Simulated Evolution (SimE) based multiobjective optimization algorithm to perform the ECU synthesis. The optimization objectives are minimizing hardware cost and power consumption while providing tolerance to single faults. The performance of the proposed algorithm is measured and compared with Parallel Re-combinative Simulated Annealing (PRSA) and Genetic Algorithm (GA). The comparison shows that the proposed algorithm has an execution time that is 5.19 and 1.15 times lower, and a synthesized hardware cost that is 3.35 and 2.73 times lower, than PRSA and GA, respectively. The power consumption of PRSA and GA (without fault tolerance) is 0.94 and 0.68 times that of the proposed algorithm with fault tolerance.

Keywords: Electric Circuit Unit, Embedded Systems, Synthesis, allocation, assignment, scheduling, Simulated Evolution.

A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part I, LNCS 6593, pp. 400–409, 2011. © Springer-Verlag Berlin Heidelberg 2011

1 Introduction

In automotive electronics, Electric Circuit Unit (ECU) refers to any embedded system that is responsible for controlling one or more electrical systems or subsystems. Automobiles can have from dozens to hundreds of ECUs. The ECUs are generally partitioned by their domains. The two broad classes of ECUs are: (1) hard real-time control of mechanical parts, and information-entertainment, and (2) information management, navigation, computing, external communication and entertainment [1]. Examples of ECUs include the following. The engine control unit controls the fuel injection, ignition timing, idle speed, valve timing, and electronic valves. On-board diagnostics is another type of ECU, responsible for controlling and managing the vehicle's self-diagnostic and reporting

system. The ECU for transmission control uses sensors from the vehicle and data provided by the ECU to determine when and how to change gears in an automobile for optimum performance, fuel efficiency and shift quality. The propulsion system of Electric Vehicles (EVs) uses microprocessors, digital signal processors (DSPs), and microcontrollers. In all types of automobiles the ECU plays an important role, and an improvement in its design brings an overall improvement. Recently, new applications have been deployed in automobiles, and the complexity of the ECU design has increased from a single microcontroller to a system that includes processors, Field Programmable Gate Arrays (FPGAs) and/or Application Specific Integrated Circuits (ASICs). Some examples of new applications for ECUs are Driver Assistance (DA) functions, which help drivers park vehicles, maintain a safe distance from other cars, warn drivers of threats that they cannot see, and perform many other useful tasks. The DA functions use a number of near-range sensors (ultrasonic sensors and cameras) that are located all around the vehicle. Similarly, night vision systems use cameras with infrared (IR) illumination and thermal imaging technology, and use the navigation system display of the automobile to show heat-sensitive or IR images to the driver. In all such applications, ECUs are responsible for processing the signals and generating responses. The requirements for effective real-time operation of most types of ECUs are: meeting hard time deadlines, fault tolerance, minimizing power, and minimizing the hardware cost. The proposed synthesis process is designed to optimize the same four parameters. It is referred to as the ECU synthesis process in the rest of the paper. The inputs to the synthesis process are directed task graphs in which the nodes represent the tasks or computations and the edges represent the communication between the tasks.
The process of synthesizing embedded systems consists of three sub-problems: (a) allocation, (b) assignment, and (c) scheduling. Allocation refers to selecting suitable hardware units (also called processing elements (PEs)) and communication resources (CRs). The assignment process involves binding the tasks to specific PEs and the edges to CRs. If the nodes at the two ends of an edge are assigned to the same PE, then no CR is required for that edge. The scheduling problem is to find the order of execution of tasks on each PE. In our research we allow dedicated connections between PEs; therefore, the CRs do not require scheduling. The assignment/allocation and scheduling problems are known to be NP-complete [3] [4]. Fault tolerance is an important feature of ECUs: the task graphs complete their execution and meet their hard time deadlines even if faults occur during the execution. Many ECUs perform safety-related functions such as DA functions, for which this guarantee is essential. Faults can be of two types: (1) permanent faults, and (2) transient and intermittent faults. Permanent faults include microprocessor failures, etc. Tolerance from permanent faults is provided through redundant hardware. Transient faults can occur due to electromagnetic interference, and intermittent faults can appear and disappear repeatedly. The behavior of transient

and intermittent faults appears very similar to the fault tolerance techniques, and therefore the same methods can provide tolerance from both these types of faults. The ratio of transient and intermittent faults to permanent faults is 100:1, so the former two types form the majority of the faults. A hardware solution that provides tolerance from n permanent faults also provides tolerance from n transient and intermittent faults. However, the solutions for tolerance from permanent faults are expensive, and transient and intermittent faults are much more frequent than permanent faults; therefore, separate methods are developed to provide tolerance from transient and intermittent faults. In this paper, the authors introduce a method of synthesizing optimized ECU hardware for an input task graph with hard time deadlines. The proposed method optimizes the system cost and energy consumption while ensuring that all hard time deadlines are met, and implements fault tolerance from a single fault per task. The input is any task graph with information on time deadlines, the execution times and energy consumption of PEs, and the time delays of CRs. The output of the method is the assignment of tasks and edges to PEs and CRs and the order of execution (or schedule) in which the tasks will be executed on each PE. All outputs are valid solutions in which no hard time deadline is violated. SimE has been found to be very effective in solving circuit design problems [6] [7] with low computational and memory requirements. The work in this paper differs from previous work because it evaluates Simulated Evolution (SimE) [5] for solving the multi-objective ECU synthesis problem. SimE maintains one solution at a time, whereas previous work considered algorithms such as GA or PRSA that maintain several solutions at a time. This paper shows that SimE can be used to solve ECU synthesis problems and can yield good results.
SimE is less computationally demanding than GA or PRSA. The paper is organized as follows: Section 2 surveys the relevant previous work. Section 3 contains a detailed description of the proposed method. Simulations and a comparison of the proposed method with other methods are shown in Section 4. Finally, conclusions are presented in Section 5.

2 Previous Work

This section summarizes some of the previous work that is most relevant to the proposed method. SLOPES by L. Shang, R. Dick, and N. Jha [2] is a hardware-software co-synthesis algorithm that uses Parallel Re-combinative Simulated Annealing (PRSA) [8] to solve the allocation/assignment problem. Their algorithm maintains a library of hardware units such as processors, memories, dynamically reconfigurable FPGAs, and communication resources. The FPGA can configure any task by loading the appropriate bit stream from external memory. The reconfiguration latency of tasks is called the reconfiguration time. They solved the scheduling problem with two approaches. In the first approach, the scheduling sequence, i.e. the order in which the tasks are scheduled, is based on a static slack based priority. In the second approach, which is recommended for FPGAs, the scheduling sequence is determined dynamically by task priorities that

consider both real-time constraints and frame-by-frame reconfiguration overhead information. Vida et al. [9] proposed the use of the Strength Pareto Evolutionary Algorithm (SPEA) [10], a multi-objective optimization heuristic. The objective of their synthesis process is to simultaneously minimize multiple objectives whose values cannot be improved without degrading another. SPEA uses Pareto dominance in calculating the fitness function and in maintaining population density. They also used clustering to reduce the complexity of the assignment and scheduling problems. In clustering, all the tasks that should be mapped to the same PE are included in one cluster. Kuchcinski proposed solving the embedded system synthesis problem using constraint logic programming [11]. Two types of constraints are considered: the first type belongs to the system functional specification and the second type belongs to the non-functional requirements. The system is represented by constraints over finite domain variables (FDVs). Each FDV eventually obtains an integer value that specifies a solution. Different constraint solving techniques are used to find optimal and suboptimal solutions in which the given constraints are satisfied and the cost function is optimized. P. Eles et al. [12] proposed a Tabu Search [5] based method that optimizes the assignment of a fault tolerance policy (rollback recovery with checkpointing, active replication, passive replication, or a combination of these techniques) to each process and also optimizes the mapping of processes to PEs (or hardware nodes). The method finds the upper bound for the number of checkpoints using a formula derived by Punnekkat et al. [13], and the Tabu Search based method performs the further optimization. The method yields solutions in which tolerance is provided from up to k (where k > 0) faults, the hard time deadlines are met and the cost constraints are satisfied.

3 Proposed Algorithm

This section describes the method proposed for the synthesis of ECUs. As described in the first section, the input consists of task graphs with hard time deadlines. The rest of the section shows the design of SimE for the ECU synthesis problem. The proposed method uses two types of PEs: (a) uniprocessor PEs, in which only one task can be executed at a time, and (b) multiprocessor PEs, in which n tasks can be executed concurrently, where n is the number of processors. FPGAs can also fall into the category of multiprocessors when several tasks are configured in them in parallel.

3.1 SimE for the ECU Synthesis Problem

SimE is an iterative heuristic for combinatorial optimization problems. It is a general randomized search algorithm that is based on concepts learned from biological evolution. SimE can be tailored for specific applications. This paper shows SimE tailored for solving the allocation/assignment problem in ECU synthesis. The allocation/assignment problem in ECU synthesis is an NP-complete optimization problem in which the assignment of tasks and edges to different PEs

B = input value ∈ [-0.2, 0.2], VS = null
φ = Population of all tasks & edges in the task graph
Initialization: φ = Random_Assignment(φ)
while (stopping criteria is not met) {
    Evaluation: for i = 0 to φ_size: find_goodness(m_i); m_i ∈ φ
    Selection: for i = 0 to φ_size: S = S ∪ m_i if Selection(m_i) returns true
    Scheduling: Schedule tasks in each PE
    if check_validity(φ) = Yes then VS = VS ∪ φ
    Sort(S)
    Allocation: for i = 0 to S_size: Allocation(n_i), n_i ∈ S
}
where φ_size = total elements in the population, S_size = total elements in the set S

Fig. 1. Simulated Evolution (SimE) Algorithm with time deadline checking

and CRs needs to be optimized in order to meet the real-time deadlines and minimize the system cost and energy consumption. The PEs and CRs are randomly selected from the library (allocation). However, the goodness measure in SimE favours the selection of PEs and CRs that benefit the optimization objectives. The SimE algorithm is shown in Fig. 1. In SimE, the population consists of all tasks and edges in the task graph. It maintains one solution at a time. However, as shown in Fig. 1, all valid solutions can be stored in a set VS. A solution is valid if it does not violate any hard time deadline. The algorithm starts with an initial solution in which the tasks and edges are randomly assigned to PEs and CRs. This is done by the function Random_Assignment. The value of B is an input to the algorithm and can vary within [-0.2, 0.2]. The algorithm loop continues until one of the stopping criteria is met. A criterion could be the maximum number of valid solutions obtained. In the loop, the first step is evaluation, in which the goodness of each element in the population is calculated. The goodness function is described in Fig. 2. It is based on ranking different quantities. factor[0] is the rank given by the number of alternative assignments for the node or edge i that are not better than the current assignment in terms of execution time or time delay; in other words, the current assignment of the node or edge i is better than that many other assignments. In the same way, factor[1] is the rank of the hardware cost that is associated with the node

factor[0] = rank(Exec_Time_i or Delay_i)
factor[1] = rank(HW_Cost_i)
factor[2] = rank(Power_i)
factor[3] = rank(Minimum(T_p / M_p, M_p / T_p))
goodness_i = \frac{1}{4} \sum_{k=0}^{3} \frac{factor[k]}{\text{Total number of PEs or CRs}}
where the rank(x) function returns the number of possible assignments to other PEs or CRs that are not better than the current assignment
T_p = Number of tasks ready for Parallel Execution
M_p = Maximum number of tasks that can execute in parallel on the PE
The goodness_i value ranges from 0 to 1

Fig. 2. Ranking based goodness function
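The ranking-based goodness of Fig. 2 and the selection rule of this subsection (select element i when random < 1 - goodness_i + B) can be sketched in Python as follows. This is only an illustration under stated assumptions: lower metric values are taken to be better, rank counts the alternatives that are not better than the current assignment, and the function names are hypothetical.

```python
import random


def rank(current, alternatives):
    """Number of alternative assignments that are not better than the current one.

    Assumes a lower metric value (time, cost, power) is better.
    """
    return sum(1 for alt in alternatives if alt >= current)


def goodness(factors, n_resources):
    """Average of the four rank factors, each normalised by the resource count."""
    return sum(f / n_resources for f in factors) / 4.0


def is_selected(goodness_i, bias, rng=random):
    """Selection rule: true when a uniform random number is below 1 - goodness_i + B."""
    return rng.random() < 1.0 - goodness_i + bias
```

An element with low goodness is therefore selected for re-allocation with high probability, and the bias B shifts that probability up or down for every element.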

or edge i, and factor[2] is the rank of the energy consumption of the node i (factor[2] = 0 and factor[3] = 0 for edges). factor[3] is the rank of the ratio between the number of tasks that can execute in parallel (i.e. the tasks that are independent of each other and also ready for execution at the same time) and the parallel processing capability of the assigned PE_i. The minimum is taken between the ratio and its inverse to make sure that the value remains between 0 and 1, where 1 is the optimum value, reached when the PE's allowed parallelism is fully used. The total goodness value can vary from 0 to 1. The next step in SimE is to apply the selection function to all elements of the population. The selection function is given as: if (random < 1 - goodness_i + B) return true, else return false. Here random is a number generated randomly within the range 0 to 1, and goodness_i is the goodness of the node or edge i. The nodes or edges for which the selection function returns true are placed into the set S. In the next step, the tasks in each PE are scheduled according to the scheduling scheme, which is described in the next subsection. The next step is to check the validity of the solution by checking all hard time deadlines. If all deadlines are met, the solution is placed in the set VS. The next step is sorting, in which the nodes and edges in the set S are sorted in ascending order of their goodness. Allocation is the next step, in which an exhaustive search [14] is carried out for each node and edge in the set S, comparing its current goodness value with the goodness values of other possible assignments. At the end of the search, the node or edge is either re-assigned to the PE or CR that has the maximum goodness value, with probability p_i, or reassigned randomly to any PE or CR. The value of p_i should be kept high (e.g. p_i = 0.90).

3.2 Scheduling Algorithms

The static slack based scheduling scheme [11] is used, in which the Earliest Start Time (EST) and Latest Start Time (LST) of each node are calculated using forward and reverse topological sorts. The idea behind the scheme is to assign low priority to tasks that can tolerate some delay in their execution. In the processor scheduling scheme, the priority of a task i is given by Priority_i = −(EST_i + LST_i).

3.3 Fault Tolerance (FT) Schemes

Different techniques exist to provide tolerance from transient and intermittent faults. The work in this paper uses two techniques, namely Rollback Recovery with Checkpointing (RRC) and Active Replication (AR). In RRC, the task is checkpointed at one or more points. When a checkpoint is reached during execution, the values of the inputs to that checkpoint are stored in a separate static memory. If a fault occurs after the checkpoint, the task can be re-executed from the last checkpoint using the values stored in the static memory. In that way, the re-execution time decreases compared to executing the complete task again after the fault occurs. In AR a replica of the task will

For each node n_i in Population do
    pe1 = PE to which n_i is assigned
    t1 = start time of n_i, t2 = end time of n_i as found by the scheduling operation
    set AR = 0
    For each element pe2 ∈ PE_Assigned and pe2 ≠ pe1 do
        if pe2 is free during the time slot t1 and t2
            Apply Active Replication (AR) and assign the replica of n_i to pe2 & set AR = 1
    end loop
    else (if AR == 0)
        Apply Rollback Recovery with Checkpointing (RRC) with m number of checkpoints, m is randomly chosen between 1 and cp, with cp × (α + λ) < t_node
end loop
PE_Assigned is the set of only those PEs to which task(s) are assigned in the assignment operation

Fig. 3. Proposed Technique to Apply Fault Tolerance using AR or RRC

be simultaneously executing on another PE. In that way, when a fault is detected in a task, its output can be obtained from the replica. Both RRC and AR increase the execution time of tasks. The execution time of a task when a fault occurs (t_faultcase) and when no fault occurs (t_nofaultcase) can be determined by the following formulas. When RRC is used: t_nofaultcase = t_node + cp × (α + λ), t_faultcase = t_nofaultcase + t_slack, and t_slack = t_node / cp + μ. When AR is used: t_nofaultcase = t_node and t_faultcase = t_node + α. Here α is the time required to detect the fault, λ is the time required for checkpointing at one point, cp is the number of checkpoints in the task, t_node is the task's execution time, and μ is the recovery overhead. FT is applied during initialization and reapplied after allocation in the optimization loop. FT provides tolerance from a single fault per node, i.e. the computation at any node can still proceed to completion even if a fault occurs during its execution. No fault tolerance is provided for the replicas' execution. The algorithm proposed to apply RRC and AR to the nodes is shown in Fig. 3.
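The timing formulas for RRC and AR can be evaluated with a short Python sketch. It assumes that the slack equals the re-execution of one checkpoint segment, t_node / cp, plus the recovery overhead μ; the function and parameter names are illustrative only.

```python
def rrc_times(t_node, cp, alpha, lam, mu):
    """Execution times under Rollback Recovery with Checkpointing.

    Returns (t_nofaultcase, t_faultcase) for a task with cp checkpoints.
    """
    t_no_fault = t_node + cp * (alpha + lam)
    t_slack = t_node / cp + mu  # assumed: re-run one segment plus recovery overhead
    return t_no_fault, t_no_fault + t_slack


def ar_times(t_node, alpha):
    """Execution times under Active Replication: (t_nofaultcase, t_faultcase)."""
    return t_node, t_node + alpha
```

For a task with t_node = 100 time units, two checkpoints, and overheads of α = 10, λ = 5 and μ = 30 units, rrc_times gives 130 units without a fault and 210 units with one, while ar_times gives 100 and 110 units.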

4 Simulations

This section presents the results of the synthesis process using the proposed SimE-based algorithm and compares them with previous algorithms. The input task graphs, PEs, and CRs are generated using the research tool TGFF [15]. TGFF can generate task graphs of different sizes that have hard deadlines and can resemble real-world applications. It can also generate PEs and CRs with information on task execution times, communication delays, cost, power consumption, and the parallel processing capability of the PEs. The software is developed using UML and Java. The UML model defines the structure of the task graphs, PEs, CRs, and SimE, and the relationships among them. The SimE functions are implemented as operations in the UML model. The functions used in task graph computations are included in the task graph structure so that the same task graphs can be used with any algorithm. The fault tolerance parameters are α = 10%, μ = 30%, and λ = 5% of the task execution time. The platform used is Intel Core i5 running at 2.27 GHz


and with 4 GB of memory. The SimE performance is compared with: (1) the PRSA architecture described in SLOPES [2], implemented with population size = 10 and number of clusters = 6, where the stopping criterion is that the optimization loop has executed for 15 minutes; and (2) a Genetic Algorithm (GA) [5] with a population size of 12 and a stopping criterion of 1.5 minutes for the optimization loop. PRSA and GA do not implement fault tolerance (FT). A total of 60 PEs and CRs (30/30) are generated in the resource library using TGFF. The comparison results are shown in Table 1. All results include only valid solutions. The first column contains the task graph characteristics, i.e., the task graph name and its number of nodes (N) and edges (E). The second column names the algorithm to which the remaining values in the row belong: the proposed algorithm with fault tolerance (Proposed(FT)), the proposed algorithm without fault tolerance (Proposed(nonFT)), PRSA, and GA. The third column contains the average execution time of one iteration. The fourth and fifth columns contain the minimum hardware cost and minimum power consumption obtained by the algorithms. The last column contains the average number of nodes that use active replication (AR) or rollback recovery with checkpointing (RRC) for fault tolerance; its values apply only to the Proposed(FT) algorithm. The bottom rows of Table 1 show the average gain (as a multiplicative factor) of the proposed algorithm over PRSA and GA. The average gains are: (1) The proposed algorithm with FT has an execution time that is on average 5.19 and 1.15 times lower than that of PRSA and GA, respectively, and produces hardware whose cost is 3.35 and 2.74 times lower. The power consumption of PRSA and GA is 0.94 and 0.68 times that of the proposed (FT) algorithm. (2) The proposed algorithm without FT has an execution time that is 88.76 and 19.77 times lower than that of PRSA and GA, respectively. Its hardware cost is 3.43 and 2.79 times lower and its power consumption is 1.53 and 2.11 times lower than that of PRSA and GA, respectively.

Table 1. Simulation & Comparison Results

| Task Graph (Nodes/Edges) | Algorithm | Execution Time (ms) | Min. Cost | Min. Power | FT Scheme |
|---|---|---|---|---|---|
| tg0 (57/56) | Proposed(FT)/Proposed(nonFT) | 1480.54/40.53 | 1054.96/1034.4 | 85.91/37.83 | AR=48, RRC=8 |
| | PRSA/GA | 1514.71/519.38 | 3467.90/3171.9 | 77.3/58.3 | |
| tg1 (45/48) | Proposed(FT)/Proposed(nonFT) | 10.41/8.27 | 808.06/890.46 | 65.77/29.43 | AR=29, RRC=15 |
| | PRSA/GA | 461.92/70.15 | 2702.8/2424.1 | 60.2/44.6 | |
| tg2 (44/43) | Proposed(FT)/Proposed(nonFT) | 31.76/6.9 | 821.07/852.99 | 61.01/26.69 | AR=37, RRC=6 |
| | PRSA/GA | 755.67/142.49 | 2757.30/2129.9 | 56.30/42.7 | |
| tg3 (44/45) | Proposed(FT)/Proposed(nonFT) | 14.38/7.84 | 823.58/763.72 | 60.690/26.44 | AR=36, RRC=7 |
| | PRSA/GA | 599.28/118.18 | 2634.60/2042.7 | 56.30/42.7 | |
| tg4 (73/76) | Proposed(FT)/Proposed(nonFT) | 158.21/2.87 | 1274.33/1401 | 111.1/50.44 | AR=71, RRC=1 |
| | PRSA/GA | 2323.91/758.65 | 4723.5/4069.7 | 100.3/79.4 | |
| tg5 (60/62) | Proposed(FT)/Proposed(nonFT) | 52.06/14.77 | 1206.66/1161.5 | 89.79/41.16 | AR=43, RRC=16 |
| | PRSA/GA | 752.87/244.15 | 3768.50/3214.4 | 91.20/60.5 | |
| tg6 (48/53) | Proposed(FT)/Proposed(nonFT) | 13.16/9.1 | 945.36/688.52 | 69.30/31.03 | AR=24, RRC=23 |
| | PRSA/GA | 566.52/81.46 | 2995.30/2341.7 | 68.10/45.5 | |
| tg7 (39/42) | Proposed(FT)/Proposed(nonFT) | 58.83/4.9 | 847.05/708.92 | 57.46/25.40 | AR=25, RRC=13 |
| | PRSA/GA | 752.63/23.89 | 2844.30/1898.8 | 57.70/36.4 | |
| tg8 (40/43) | Proposed(FT)/Proposed(nonFT) | 18.38/5.5 | 748.09/847.96 | 58.14/25.71 | AR=31, RRC=8 |
| | PRSA/GA | 469.92/78.91 | 2631.60/2130.5 | 53.80/39.8 | |
| tg9 (47/46) | Proposed(FT)/Proposed(nonFT) | 22.87/8.2 | 868.81/860.39 | 69.71/30.89 | AR=40, RRC=6 |
| | PRSA/GA | 1466.67/116.22 | 3047.5/2286.1 | 64.7/48.1 | |
| Average Gains | Proposed (FT) over PRSA | 5.19 | 3.35 | 0.94 | |
| | Proposed (FT) over GA | 1.15 | 2.74 | 0.68 | |
| | Proposed (nonFT) over PRSA | 88.76 | 3.43 | 1.53 | |
| | Proposed (nonFT) over GA | 19.78 | 2.79 | 2.25 | |

Fig. 4 plots the hardware costs of the solutions obtained in the VS set (Fig. 1) at the end of the optimization process for the task graph tg5. The x-axis shows, in increasing order, the iterations in which valid solutions were obtained, and the y-axis shows the corresponding hardware cost of the ECU.

Fig. 4. Optimization behavior of the SimE on task graph tg5 (x-axis: iterations; y-axis: cost)
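One way to read the average execution-time gains reported in Table 1 is as the ratio of the summed per-iteration times over the ten task graphs. The Python sketch below checks that reading against the times listed in Table 1. The ratio-of-totals interpretation and the function name are assumptions of this sketch, since the text does not state how the averages were formed.

```python
# Per-iteration execution times (ms) for tg0..tg9, copied from Table 1.
proposed_ft = [1480.54, 10.41, 31.76, 14.38, 158.21, 52.06, 13.16, 58.83, 18.38, 22.87]
proposed_nonft = [40.53, 8.27, 6.9, 7.84, 2.87, 14.77, 9.1, 4.9, 5.5, 8.2]
prsa = [1514.71, 461.92, 755.67, 599.28, 2323.91, 752.87, 566.52, 752.63, 469.92, 1466.67]
ga = [519.38, 70.15, 142.49, 118.18, 758.65, 244.15, 81.46, 23.89, 78.91, 116.22]


def gain(baseline, proposed):
    """Ratio of total baseline time to total proposed time (assumed definition)."""
    return sum(baseline) / sum(proposed)


for label, value in [
    ("Proposed(FT) over PRSA", gain(prsa, proposed_ft)),
    ("Proposed(FT) over GA", gain(ga, proposed_ft)),
    ("Proposed(nonFT) over PRSA", gain(prsa, proposed_nonft)),
    ("Proposed(nonFT) over GA", gain(ga, proposed_nonft)),
]:
    print(f"{label}: {value:.2f}")
```

Under this reading the four ratios come out close to the reported execution-time gains, with small differences in the last digit that are compatible with rounding or truncation.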

5 Conclusion

This paper has presented a new design of SimE that can solve the embedded system synthesis problem. ECU-specific optimization objectives were considered to ensure that it is useful for ECU synthesis. The goodness of a node or edge is measured by averaging the individual goodness values of the different optimization terms (time, cost, power, and resource utilization). Fault tolerance (FT) against a single fault per task is also provided by combining active replication (AR) and rollback recovery with checkpointing (RRC). The FT scheme first attempts to apply AR without using any new PEs exclusively for FT. If AR cannot be applied to a node, RRC is applied. The results have shown that the proposed algorithm can explore the search space quickly and apply alternative choices in less time than the PRSA architecture proposed in SLOPES and a conventional GA.


References

1. Sangiovanni-Vincentelli, A., Di Natale, M.: Embedded System Design for Automotive Applications. Computer 40(10), 42–51 (2007)
2. Shang, L., Dick, R.P., Jha, N.K.: SLOPES: Hardware-Software Cosynthesis of Low-Power Real-Time Distributed Embedded Systems with Dynamically Reconfigurable FPGAs. IEEE Transactions on Computer Aided Integrated Circuits & Systems 26(3) (2007)
3. Garey, M.R., Johnson, D.S.: Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman and Co., New York (1997)
4. Kwok, Y.-K., Ahmad, I.: Dynamic Critical-Path Scheduling: An Effective Technique for Allocating Task Graphs to Multiprocessors. IEEE Transactions on Parallel and Distributed Systems 7(5), 506–521 (1996)
5. Sait, S.M., Youssef, H.: Iterative Computer Algorithms with Applications in Engineering. IEEE Computer Society Press, Los Alamitos (1999)
6. Sait, S.M., El-Barr, A., Al-Saiari, U.S., Sarif, B.A.B.: Digital Circuit Design Through Simulated Evolution (SimE). In: The 2003 Congress on Evolutionary Computation (IEEE CEC 2003), Canberra, Australia, vol. 1, pp. 375–381 (2003)
7. Al-Saiari, U.S.: Digital Circuit Design Through Simulated Evolution, MS Thesis, King Fahd University of Petroleum & Minerals, Dhahran, Saudi Arabia (2003)
8. Mahfoud, S.W., Goldberg, D.E.: Parallel Recombinative Simulated Annealing: A Genetic Algorithm. Parallel Computing 21(1), 1–28 (1995)
9. Kianzad, V., Bhattacharyya, S.S.: CHARMED: A Multiobjective Co-Synthesis Framework for Multi-mode Embedded Systems. In: Proc. 15th IEEE Conference on Application Specific Systems, Architectures and Processors (ASAP 2004), pp. 28–40 (2004)
10. Zitzler, E., Laumanns, M., Thiele, L.: SPEA2: Improving the Strength Pareto Evolutionary Algorithm for Multiobjective Optimization. In: Evolutionary Methods for Design, Optimization, and Control, pp. 95–100 (2002)
11. Kuchcinski, K.: Constraints-Driven Scheduling and Resource Assignment. ACM Transactions on Design Automation of Electronic Systems 8(3), 355–383 (2003)
12. Pop, P., Izosimov, V., Eles, P., Peng, Z.: Design Optimization of Time- and Cost-Constrained Fault-Tolerant Embedded Systems with Checkpointing and Replication. IEEE Transactions on VLSI Systems 17(3) (2009)
13. Punnekkat, S., Burns, A., Davis, R.: Analysis of Checkpointing for Real-Time Systems. Real-Time Systems 20(1), 83–102 (2001)
14. Kling, R.M., Banerjee, P.: ESP: Placement by Simulated Evolution. IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems 8(3), 245–256 (1989)
15. Dick, R.P., Rhodes, D.L., Wolf, W.: TGFF: Task Graphs for Free. In: Proc. of the Sixth Intl. Workshop on Hardware/Software Codesign (CODES/CASHE 1998), Seattle, WA (1998)

Taxi Pick-Ups Route Optimization Using Genetic Algorithms

Jorge Nunes, Luís Matos, and António Trigo

Coimbra Institute of Engineering, Rua Pedro Nunes - Quinta da Nora, 3030-199, Coimbra, Portugal
{a21120620,a21160264}@alunos.isec.pt, [email protected]

Abstract. This paper presents a case study of a taxi company whose problem is to pick up passengers more efficiently in order to save time and fuel. The taxi journey starts and ends in two nearby locations, which can be treated as the same location for the purpose of solving the problem. This turns the problem into a typical Travelling Salesman Problem, where the goal is, given a set of cities and roads, to find the best route by which to visit every city and return home. The result of the study is a user-friendly software tool that allows the pick-up locations of the taxi passengers to be selected on a map and then presents, on the same map, the best route computed using a genetic algorithm. The taxi company is currently using the developed software.

Keywords: Travelling Salesman Problem, Genetic Algorithms, Software for route optimization.

1 Introduction

Gabriel Simões Figueira Lda is a taxi company based in Coimbra, Portugal, dedicated to the public transport of passengers. Beyond the normal pick-up service for passengers at taxi ranks or on the street, where the objective is the shortest possible distance between collection and delivery of the passenger, which can be achieved using a GPS device, the firm provides another service: transporting elderly people from home to hospital and back for treatments such as hemodialysis. This type of service requires a written contract with the passenger specifying various parameters, including a calendar of the days on which the patient has to go to hospital. The challenge proposed by the taxi company was to develop a software application that optimizes the taxi route to pick up and drop off passengers travelling to and from Coimbra's University Hospitals (HUC). A literature review [1,2] showed that this issue fits the problem of routing optimization, in particular the Travelling Salesman Problem (TSP), where the goal is, given a set of cities and roads, to find the best route by which to visit every city and return home. In this particular case the departure and arrival of the taxi occur in different places, which leads to a variant of the TSP, as described in the following section.

A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part I, LNCS 6593, pp. 410–419, 2011. © Springer-Verlag Berlin Heidelberg 2011


Fig. 1 shows a typical route driven by one of the company's taxis. The capital letters show five pick-up places: A, Rua 12 de Abril nº21E Porto do Bordalo Coimbra; B, Miranda do Corvo, Coimbra; C, Penela, Miranda do Corvo; D, Condeixa-a-Nova, Coimbra; E, Hospitais da Universidade de Coimbra.

Fig. 1. Route usually driven by one of the company's taxis

This paper is structured as follows: the next section presents a short review of the asymmetric version of the TSP; section three presents the mathematical formulation of the problem; section four presents the methodology adopted to solve the problem, namely the use of genetic algorithms; section five presents the application software, developed in C# using Visual Studio .NET and the Bing Maps Winforms User Control [3]; section six presents the results and a reflection on them; finally, the paper ends with some considerations regarding the developed work.

2 Asymmetric Travelling Salesman Problem

The Travelling Salesman Problem is combinatorial in nature and is used in various applications [4], such as the design of integrated circuits, vehicle tracking, production scheduling, and robotics. In its simplest form, given a set of cities, the salesman must visit each city once and then return to the home city. Given the cost of travel (or distance) between each pair of cities, the TSP is to find the route with the lowest cost. A TSP can be represented as a graph in which the cities are the vertices, the paths are the edges, and the distance between two cities is the edge length. Often the model is a complete graph, i.e. each vertex is connected to all others. Fig. 2 shows an example of a graph of a TSP.


Fig. 2. Example of a TSP graph

After analyzing Fig. 2, one conclusion that can be drawn is that the distance from 2 to 3 is the same as from 3 to 2, i.e., the distance is symmetric. This is not what happens in the case under study, where the distance between vertices is asymmetric. A traditional symmetric TSP has only two restrictions:

• The departure and arrival city are the same;
• A city cannot be visited more than once.

In the case under study there are the following restrictions:

• The place of departure of the taxi is always the same (the residence of the taxi driver);
• The last stop is also fixed (HUC);
• There is a maximum number of passengers that the taxi can transport;
• The distance between two places must always be smaller than the distance between the same two places via a third city.

3 Mathematical Formulation of the Problem

After a literature review, it was decided to adapt the formalization presented in [5] to our problem, the optimization of the taxi passenger pick-up route. The mathematical model is presented below in equations (1) to (5).

$$z = \sum_{i=1}^{n} \sum_{j=1}^{n} x(i,j)\, d(i,j) \tag{1}$$

Equation (1) is the objective function to be minimized, where $d(i,j)$ is the straight-line distance between cities $i$ and $j$, and $x(i,j)$ is 1 if the taxi goes from city $i$ to city $j$ and 0 otherwise. Here and in the remaining formulas, $n$ is the number of cities on the route. A taxi travels from one city to another along a single path, which leads to two restrictions: equation (2) states that each city is entered from exactly one other city, and equation (3) states that each city is left towards exactly one other city.

$$\sum_{i=1}^{n} x(i,j) = 1, \quad j = 1, 2, \ldots, n \tag{2}$$

$$\sum_{j=1}^{n} x(i,j) = 1, \quad i = 1, 2, \ldots, n \tag{3}$$

The route must also contain no sub-paths. Equations (2) and (3) already ensure that each node is entered and left by a single path, so it remains to require that the selected paths form one connected route rather than several disjoint ones.

$$\sum_{i,j \in S} x(i,j) \le |S| - 1, \quad \forall S \subset \{1, 2, \ldots, n\} \tag{4}$$

where $S$ is a subset of cities and $|S|$ is the number of elements contained in that subset. The total number of passengers carried is given by the following formula.

$$k = \sum_{i=1}^{n} p(i) \tag{5}$$

where $k$ is the total number of passengers carried and $p(i)$ is the number of passengers picked up at location $i$. The route driven by the taxi differs from the traditional TSP in that the origin and destination cities are not the same [6]. However, since the taxi must return to the company headquarters after arriving at the destination (HUC), the origin and destination were merged to simplify the problem, as illustrated in Fig. 3.
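The formulation in equations (1) to (5) can be made concrete with a short Python sketch that evaluates a candidate route. All function names are hypothetical, the city coordinates are made up for illustration, and the sub-path check follows successor links instead of enumerating every subset S, which is an equivalent test only when constraints (2) and (3) already hold.

```python
import math


def route_to_x(route, n):
    """Build the 0/1 matrix x(i, j) for a closed tour given as a list of city indices."""
    x = [[0] * n for _ in range(n)]
    for pos, i in enumerate(route):
        j = route[(pos + 1) % len(route)]
        x[i][j] = 1
    return x


def objective(x, d):
    """Equation (1): total distance of the selected arcs."""
    n = len(x)
    return sum(x[i][j] * d[i][j] for i in range(n) for j in range(n))


def satisfies_degree_constraints(x):
    """Equations (2) and (3): every city is entered once and left once."""
    n = len(x)
    entered_once = all(sum(x[i][j] for i in range(n)) == 1 for j in range(n))
    left_once = all(sum(x[i][j] for j in range(n)) == 1 for i in range(n))
    return entered_once and left_once


def is_single_tour(x):
    """Sub-path check in the spirit of equation (4).

    Assumes (2) and (3) hold, so each row contains exactly one 1.
    """
    seen, city = set(), 0
    while city not in seen:
        seen.add(city)
        city = x[city].index(1)
    return len(seen) == len(x)


# Four illustrative locations on a 4 x 3 rectangle (made-up coordinates).
points = [(0, 0), (0, 3), (4, 3), (4, 0)]
d = [[math.dist(a, b) for b in points] for a in points]
x = route_to_x([0, 1, 2, 3], len(points))
passengers = [0, 2, 1, 1]   # p(i), made-up values
k = sum(passengers)         # equation (5)
print(objective(x, d), satisfies_degree_constraints(x), is_single_tour(x), k)
```

A matrix that pairs the cities into two separate loops (0 to 1 and back, 2 to 3 and back) would still pass the degree check but fail `is_single_tour`, which is the situation equation (4) rules out.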

Fig. 3. Initial and final graph, following the merger of our origin (1) and destination (5) nodes

Tables 1 and 2 show an example of two possible cost matrices with the distances between different cities before and after the merger of the origin and destination nodes. The merger results in a smaller matrix.


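Table 1. Distances between cities before the merger

|   | 1     | 2 | 3 | 4 | 5     |
|---|-------|---|---|---|-------|
| 1 | -     | 2 | 4 | 5 | Fixed |
| 2 |       | - | 5 | 6 | 7     |
| 3 |       | 5 | - | 2 | 3     |
| 4 |       | 6 | 2 | - | 8     |
| 5 | Fixed | 7 | 3 | 8 | -     |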

Table 2. Distances between cities after the merger

|   | P | 2 | 3 | 4 |
|---|---|---|---|---|
| P | - | 2 | 4 | 5 |
| 2 | 7 | - | 5 | 6 |
| 3 | 3 | 5 | - | 2 |
| 4 | 8 | 6 | 2 | - |
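The merger of the origin and destination nodes can be sketched in Python as follows. This is a minimal illustration under stated assumptions: rows are read as the "from" city and columns as the "to" city, the merged node P takes its outgoing distances from the origin and its incoming distances from the destination, and the fixed origin-destination leg is left out because it adds the same constant to every route. The function name and the use of `None` for unused cells are hypothetical choices.

```python
def merge_origin_destination(d, origin, destination):
    """Collapse origin and destination into a single node P (index 0 of the result)."""
    others = [c for c in range(len(d)) if c not in (origin, destination)]
    size = len(others) + 1
    merged = [[None] * size for _ in range(size)]
    for a, i in enumerate(others, start=1):
        merged[0][a] = d[origin][i]        # leaving P: distances out of the origin
        merged[a][0] = d[i][destination]   # entering P: distances into the destination
        for b, j in enumerate(others, start=1):
            if i != j:
                merged[a][b] = d[i][j]
    return merged


# Example distances for cities 1..5 (indices 0..4), following one reading of
# the distances before the merger; None marks the diagonal, unused cells,
# and the fixed leg between destination and origin.
before = [
    [None, 2, 4, 5, None],
    [None, None, 5, 6, 7],
    [None, 5, None, 2, 3],
    [None, 6, 2, None, 8],
    [None, 7, 3, 8, None],
]
after = merge_origin_destination(before, origin=0, destination=4)
for row in after:
    print(row)
```

With these inputs the result is a 4 x 4 matrix over P, 2, 3, 4 in which the P row and the P column differ, which is what makes the merged problem asymmetric.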

4 Methodology

NP-hard problems such as the TSP are within the domain of interest of genetic algorithms (GAs) [6]. Although a deterministic algorithm could solve this particular case, it was decided to use a more general approach that could also be used to create routes with more cities, as in the case of a bus that has to visit 20 cities, where a deterministic algorithm might not deliver a result in time [7]. GAs belong to the area of nature-inspired systems [8], which simulate natural processes and apply them to the resolution of real problems. They are general search and optimization methods that simulate the natural processes of evolution, applying Darwin's idea of selection. GAs operate on "populations" of potential solutions, usually referred to as "chromosomes", each of which represents a set of parameters encoded as a chain of bits or characters. The chromosomes are evaluated, and those representing the best solutions are selected for a recombination process, which produces new chromosomes. The new, improved chromosomes replace those with poorer solutions. In this way, each new generation tends to move closer to the optimal solution. This continues for many generations until the termination condition is met [9, 10]. Mutations and different combining strategies ensure that a large portion of the search space is explored [11, 12]. GAs differ from traditional search and optimization methods mainly in four aspects [13]: they work with an encoding of the parameter set rather than the parameters themselves; they work with a population rather than a single point; they use cost or reward information, not derivatives or other auxiliary knowledge; and they use probabilistic rather than deterministic transition rules. Fig. 4 shows the flowchart of the GA used for the taxi company routing problem. It starts with the creation of the initial population, which was done randomly: the N individuals are randomly distributed within the space of solutions. After that, the GA iteratively generates a new population from the previous one through the reproduction and selection processes, which include crossover and mutation operators, until the termination conditions are met.


Fig. 4. Flowchart of Genetic Algorithm used for taxi pick-ups route optimization

4.1 GA Parameters

Fig. 5 presents the main class diagram of the algorithm. The class “Chromosome” represents a chromosome, i.e. a set of genes, which in this case are cities. The “City” class represents a city or specific location. The class “GeneticAlgorithmTSP” contains all the logic associated with the algorithm and its dependencies.

Fig. 5. Class diagram for the TSP algorithm

Representation of the Chromosome

Individuals of a population are represented by a chromosome comprising several genes, where each gene contains the information of one city being visited (pick-up location).

[ City 1 | City 2 | City 3 | City 4 | … | City n ]

Fig. 6. Chromosome representation


Gene Coding

A gene is in this case a location and has three attributes: longitude, latitude, and the number of passengers present at the location. The class City (see Fig. 5) is responsible for representing a gene and the operations on its attributes.

Reproduction

Reproduction was implemented using genetic operators [14]. Operators are designed once a coding for the elements of the search space has been defined; the most used are the recombination and mutation operators.

Selection

The class Chromosome (see Fig. 5) contains the operators implemented for the selection process, namely proportional (roulette wheel) selection and tournament selection [14]:

• Proportional Selection or Roulette Wheel: with this operator the probability of an individual being selected is proportional to its fitness value. This method gives the worst individuals an opportunity to be selected too. To apply this operator, a random value (within the population size) is drawn and the chromosome at that position is selected.
• Tournament Selection: this operator selects a group of individuals from the population at random. The selected individuals are then compared through their fitness function, and only the one with the highest value is passed to the next generation. To apply this operator, two random values (within the population size) are generated, the chromosomes at those positions are compared, and the one with the higher fitness value is used.

Mutation

The mutation operators allow new genetic information to be introduced into the individuals that constitute the population. The operators implemented in the application are the following [15]:

• Shift Mutation (also called PBM, Position-Based Mutation): one gene of the chromosome is selected at random, removed, and inserted at another position that is also selected at random.
• Exchange Mutation (also called swap): two genes of the chromosome are selected at random and their values are exchanged.
• Adjacent Exchange Mutation: two adjacent genes of the chromosome are selected at random and their values are exchanged.

Population

Since this GA is to be used in a software application, it is up to the user to choose the size of the population. Nevertheless, it was tested for the particular taxi company route optimization problem with N=4 and, for a larger population size, with N=20.
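The selection and mutation operators described above can be sketched in Python as follows. This is an illustrative sketch, not the application's C# code: the function names are hypothetical, a chromosome is modelled as a plain list of city identifiers, and `fitness` is any caller-supplied function in which a higher value is better.

```python
import random


def tournament_selection(population, fitness, rng):
    """Draw two random positions and keep the chromosome with the higher fitness."""
    a = population[rng.randrange(len(population))]
    b = population[rng.randrange(len(population))]
    return a if fitness(a) >= fitness(b) else b


def shift_mutation(route, rng):
    """Remove one randomly chosen gene and reinsert it at a random position."""
    child = list(route)
    gene = child.pop(rng.randrange(len(child)))
    child.insert(rng.randrange(len(child) + 1), gene)
    return child


def exchange_mutation(route, rng):
    """Swap the values of two distinct, randomly chosen genes."""
    child = list(route)
    i, j = rng.sample(range(len(child)), 2)
    child[i], child[j] = child[j], child[i]
    return child


def adjacent_exchange_mutation(route, rng):
    """Swap the values of two randomly chosen neighbouring genes."""
    child = list(route)
    i = rng.randrange(len(child) - 1)
    child[i], child[i + 1] = child[i + 1], child[i]
    return child


rng = random.Random(42)  # fixed seed so the example is repeatable
route = ["A", "B", "C", "D", "E"]
print(shift_mutation(route, rng))
print(exchange_mutation(route, rng))
print(adjacent_exchange_mutation(route, rng))
```

All three mutations return a new list that is a permutation of the input, so a mutated chromosome still visits every pick-up location exactly once.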


4.2 Fitness Function

The most commonly used evaluation function is the shortest path [16]: the sum of the distances between each pair of consecutive cities in the chromosome being evaluated. The best result of the previous generation is also taken as a benchmark. The pseudocode of the evaluation function is shown below; the index wraps around so that the route closes at the starting city:

cost <- 0
FOR i = 0 TO num_cities - 1 DO
    distance <- CALL DistanceCities(cities[i], cities[(i + 1) MOD num_cities])
    cost <- cost + distance
END FOR

Algorithm [17]

1. Generate the initial population of size N, where N is entered by the user;
2. The first solution uses the positions selected by the user, in the same order in which they were entered;
3. Initialize the coefficients of the GA;
4. Calculate the fitness value of each chromosome;
5. Repeat until one of the termination conditions is met:
   a. The best fitness value of the population stays the same 120 times in a row;
   b. The number of generations reaches 1000.
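The evaluation function and the two termination conditions can be combined into a compact Python sketch. Several choices here are assumptions made for illustration and are not taken from the application: cities are planar (x, y) points with Euclidean distance, the best route is carried over unchanged to the next generation, the first city stays in place as the merged origin and destination, and only tournament selection with exchange mutation is used.

```python
import math
import random


def route_cost(route):
    """Length of the closed tour visiting the (x, y) points in the given order."""
    return sum(math.dist(route[i], route[(i + 1) % len(route)])
               for i in range(len(route)))


def run_ga(cities, pop_size=20, max_generations=1000, max_stall=120, seed=0):
    """Tiny GA: stop after max_generations or max_stall generations without improvement."""
    rng = random.Random(seed)
    # First solution: the order entered by the user; the rest are random shuffles.
    population = [list(cities)]
    while len(population) < pop_size:
        rest = list(cities[1:])
        rng.shuffle(rest)
        population.append([cities[0]] + rest)
    best = min(population, key=route_cost)
    stall = 0
    for _ in range(max_generations):
        next_population = [best]  # assumption: keep the best route (elitism)
        while len(next_population) < pop_size:
            a, b = rng.choice(population), rng.choice(population)
            child = list(a if route_cost(a) <= route_cost(b) else b)
            i, j = rng.sample(range(1, len(child)), 2)  # exchange mutation
            child[i], child[j] = child[j], child[i]
            next_population.append(child)
        population = next_population
        candidate = min(population, key=route_cost)
        if route_cost(candidate) < route_cost(best):
            best, stall = candidate, 0
        else:
            stall += 1
            if stall >= max_stall:
                break
    return best


cities = [(0, 0), (3, 0), (0, 3), (3, 3), (1, 5), (5, 1)]
best = run_ga(cities)
print(best, round(route_cost(best), 2))
```

Because the best route is always kept, the returned tour is never longer than the order originally entered by the user, though the sketch gives no guarantee of reaching the optimal route.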

5 Implementation

The application presented here was developed in C# using Visual Studio .NET and the Bing Maps Winforms User Control [3]. It allows the user to quickly and easily trace the best path through a set of cities to pick up passengers. It is fully configurable: the coordinates of the cities (or pick-up locations) are entered through a map in the interface, together with the number of passengers picked up at each location. Although the production version of the application only shows the optimized route computed for the taxi in question, Fig. 7 shows a version that allows the GA parameters to be changed and has a tab called “Analyze results” for analyzing the computed results, which are presented for two particular cases (5 and 20 cities) in Fig. 8.

Fig. 7. Application main interface


6 Experimental Results

To demonstrate the results produced by the application, a usual route of the company's taxi was selected.

Fig. 8. Results for 20 cities (x-axis: number of generations; y-axis: fitness value)

Analyzing the graph, it can be seen that early on the application moved away from the best result; as it evolved, however, it approached the best result again, albeit with small variations. With 5 cities the calculation time is about 3 seconds. To verify the performance of the algorithm, a test with a larger number of cities (20) was made. Again, the graph shows many variations in the results, but here the number of cities is much greater, so the number of combinations increases drastically. The response time was very acceptable (about 30 seconds). It is important to note that quadrupling the number of cities made the execution time 10 times higher. With a larger number of cities the execution time will be higher still, since the number of possible routes grows very rapidly with the number of cities.
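To give a sense of how drastically the number of combinations increases, the following Python sketch counts the possible visiting orders for 5 and 20 cities. It assumes the starting city is fixed and that direction matters (the asymmetric case), so n cities give (n - 1)! orderings. The function name is hypothetical, and the count describes the size of the search space, not the running time of the GA.

```python
import math


def count_routes(n_cities):
    """Number of visiting orders when the starting city is fixed (assumption)."""
    return math.factorial(n_cities - 1)


for n in (5, 20):
    print(n, count_routes(n))
```

For 5 cities this gives 24 orderings, while for 20 cities it gives 19!, a number with 18 digits, which is why exhaustive enumeration stops being practical long before 20 cities.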

7 Conclusion and Future Work

In this paper we have studied and implemented a GA-based solution for the TSP, adapted to the particular case of transporting patients to and from HUC. The application developed is currently in use at the taxi company "Gabriel Simões Figueira Lda", contributing to the optimization of taxi routes and consequently to lower fuel consumption and improved comfort for passengers, who now make shorter trips. Several tests were performed to find the best combination of operators, to improve the speed of convergence, and to improve the reliability of the results. This project had an impact on the taxi company, which previously did not use any software tool for route optimization. Now that it is complete, the next step is to evaluate the use of the application and to start developing a new application that will manage the routes of a taxi fleet rather than a single taxi route. Academics and practitioners could use this work as a tutorial for developing similar applications based on GAs.


References

1. Lawler, E.L., Lenstra, J.K., Rinnooy Kan, A.H.G., Shmoys, D.B.: The Traveling Salesman Problem: A Guided Tour of Combinatorial Optimization. John Wiley & Sons, Chichester (1985) ISBN 0-471-90413-9
2. Gutin, G., Punnen, A.P.: The Traveling Salesman Problem and Its Variations. Springer, Heidelberg (2006) ISBN 0-387-44459-9
3. Bing Maps Winforms User Control, https://vearthcontrol.svn.codeplex.com/svn/
4. Applegate, D.L., Bixby, R.E., Chvatal, V., Cook, W.J.: The Traveling Salesman Problem: A Computational Study
5. Bektas, T.: The multiple traveling salesman problem: an overview of formulations and solution procedures. Omega 34, 209–219 (2006)
6. Üçoluk, G.: Genetic Algorithm Solution of the TSP Avoiding Special Crossover and Mutation. Intelligent Automation and Soft Computing 3(8) (2002)
7. Engebretsen, L., Karpinski, M.: Approximation hardness of TSP with bounded metrics. In: Yu, Y., Spirakis, P.G., van Leeuwen, J. (eds.) ICALP 2001. LNCS, vol. 2076, pp. 201–212. Springer, Heidelberg (2001)
8. Al-Dulaimi, B.F., Ali, H.A.: Enhanced Traveling Salesman Problem Solving by Genetic Algorithm Technique (TSPGA). In: Proceeding of the World Academy of Science, Engineering and Technology, pp. 296–302, Rome 25th -27th (2008)
9. Zhang, L., Yao, M., Zheng, N.: Optimization and Improvement of Genetic Algorithms Solving Traveling Salesman Problem. In: International Conference on Image Analysis and Signal Processing (2009)
10. Kaur, B., Mittal, U.: Optimization of TSP using Genetic Algorithm. Advances in Computational Sciences and Technology 3(2), 119–125 (2010) ISSN 0973-6107
11. Sallabi, O.M., El-Haddad, Y.: An Improved Genetic Algorithm to Solve the Traveling Salesman Problem. World Academy of Science, Engineering and Technology 52, 471–474 (2009) ISSN 1307-6892
12. Holland, J.: Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence. The University of Michigan Press, Ann Arbor (1975)
13. Berman, P., Karpinski, M.: 8/7-Approximation Algorithm for (1,2)-TSP. In: Proc. 17th ACM-SIAM SODA, pp. 641–648 (2006)
14. Gen, M., Cheng, R.: Genetic Algorithms and Engineering Design, pp. 59–64. Wiley-Interscience, Hoboken (1997)
15. Nearchou, A.C.: The effect of various operators on the genetic search for large scheduling problems. International Journal of Production Economics 88, 191–203 (2004)
16. Dantzig, G.B., Fulkerson, R., Johnson, S.M.: Solution of a large-scale traveling salesman problem. Operations Research 2, 393–410 (1954)
17. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms, 2nd edn., pp. 1027–1033 (Section 35.2: The traveling-salesman problem). MIT Press and McGraw-Hill (2001), ISBN 0-262-03293-7

Optimization of Gaussian Process Models with Evolutionary Algorithms

Dejan Petelin¹, Bogdan Filipič¹, and Juš Kocijan¹,²

¹ Jožef Stefan Institute, Jamova cesta 39, SI-1000 Ljubljana, Slovenia
² University of Nova Gorica, Vipavska cesta 13, SI-5000 Nova Gorica, Slovenia
[email protected]

Abstract. Gaussian process (GP) models are non-parametric, black-box models that represent a new method for system identification. The optimization of GP models, due to their probabilistic nature, is based on maximization of the probability of the model. This probability can be calculated by the marginal likelihood. Commonly used approaches for maximizing the marginal likelihood of GP models are deterministic optimization methods. However, their success critically depends on the initial values. In addition, the marginal likelihood function often has many local optima in which a deterministic method can be trapped. Therefore, stochastic optimization methods can be considered as an alternative approach. In this paper we test their applicability in GP model optimization. We performed a comparative study of three stochastic algorithms: the genetic algorithm, differential evolution, and particle swarm optimization. Empirical tests were carried out on a benchmark problem of modeling the concentration of CO2 in the atmosphere. The results indicate that with proper tuning differential evolution and particle swarm optimization significantly outperform the conjugate gradient method.

Keywords: Gaussian process models, hyperparameters optimization, evolutionary algorithms.

1 Introduction

Gaussian process (GP) models [8] form an emerging complementary method for nonlinear dynamic system identification. A GP model is a probabilistic non-parametric black-box model. It differs from most other frequently used black-box identification approaches in that it does not approximate the modeled system by fitting the parameters of the selected basis functions, but rather searches for the relationship among the measured data. GP models are closely related to approaches such as support vector machines and, especially, relevance vector machines. Because the GP model is a Bayesian model, its output is a normal distribution, expressed in terms of the mean and the variance. The mean value represents the most likely output, and the variance can be viewed as a measure of its confidence. The obtained variance, which depends on the amount of available identification data, is important information that distinguishes GP models from non-Bayesian methods. GP models can be used for model identification when the data are noisy and when there are outliers or gaps in the input data. Another useful attribute of GP models is the possibility to include various kinds of prior knowledge in the model, e.g., local models, static characteristics, etc.

The quality of GP models depends heavily on the covariance matrix. To calculate it, a covariance function needs to be selected according to the user's prior knowledge. The model can then be further adjusted to the data by appropriately tuning the covariance function parameters, known as hyperparameters. This can be done with various optimization algorithms; a conjugate gradient method is often used for this purpose. Due to its deterministic nature, its success depends on the initial values of the hyperparameters, especially for complex systems. In this case stochastic methods seem to be appropriate.

The purpose of this paper is to test a variety of evolutionary algorithms as alternative optimization methods for GP models and compare them to the conjugate gradient method. At first sight, evolutionary algorithms seem too computationally expensive, but for complex systems, such as dynamic systems, the criterion function often has many local optima in which the conjugate gradient method is often trapped. Stochastic optimization methods, on the other hand, can avoid local optima, as they usually search for the optimum using a population of candidates.

The rest of the paper is organized as follows. Section 2 briefly describes modeling with GP. The description of the tested algorithms follows in Section 3. The results of the optimization with these algorithms are compared with the results of the conjugate gradient method in Section 4. Section 5 concludes the paper with a summary of the work and indicates some directions for future work.

A. Dobnikar, U. Lotrič, and B. Šter (Eds.): ICANNGA 2011, Part I, LNCS 6593, pp. 420–429, 2011. © Springer-Verlag Berlin Heidelberg 2011

2 Modeling with Gaussian Processes

A GP model is a flexible, probabilistic, non-parametric model with uncertainty predictions. Its properties and application potentials are reviewed in [8]. A GP is a collection of random variables that have a joint multivariate Gaussian distribution (Fig. 1). Assuming a relationship of the form $y = f(x)$ between the input $x$ and the output $y$, we have $y_1, \ldots, y_n \sim N(0, \Sigma)$, where $\Sigma_{pq} = \mathrm{Cov}(y_p, y_q) = C(x_p, x_q)$ gives the covariance between the output points corresponding to the input points $x_p$ and $x_q$. Thus, the mean $\mu(x)$ and the covariance function $C(x_p, x_q)$ fully specify the GP. Note that the covariance function $C(\cdot, \cdot)$ can be any function having the property of generating a positive semi-definite covariance matrix. The covariance function $C(x_p, x_q)$ can be interpreted as a measure of the distance between the input points $x_p$ and $x_q$. It is usually composed of two parts:

$$C(x_p, x_q) = C_f(x_p, x_q) + C_n(x_p, x_q), \qquad (1)$$

where $C_f$ represents the functional part and describes the unknown system we are modeling, and $C_n$ represents the noise part and describes the model of the noise.


D. Petelin, B. Filipič, and J. Kocijan

[Figure: two panels, (a) and (b), with axes Input, x and Output, y; the points x1 and x2 are marked on the input axis, with labels μ(x1), 2σ(x1), μ(x2) and 2σ(x2).]

Fig. 1. Modeling with GP: (a) Gaussian prediction at a new point x1, conditioned on the training points (.); (b) the predictive mean together with its 2σ error bars for two points, x2 that is close to the training points, and x1 that is more distant

A commonly chosen covariance function is the squared exponential covariance function, which, together with a white-noise term, is of the following form:

$$C(x_p, x_q) = v_1 \exp\left[-\frac{1}{2}\sum_{d=1}^{D} w_d (x_{dp} - x_{dq})^2\right] + \delta_{pq} v_0, \qquad (2)$$

where $w_d$, $v_0$, $v_1$ are the 'hyperparameters' of the covariance function, $D$ is the input dimension, and $\delta_{pq} = 1$ if $p = q$ and 0 otherwise. The hyperparameters can be written as a vector $\Theta = [w_1 \ldots w_D \; v_0 \; v_1]^T$. This covariance function is smooth and continuous, with a presumption that the noise is white. Other forms and combinations of covariance functions suitable for various applications can be found in [8]. For a given problem, the hyperparameter values are learned using the data at hand. After the training, one can use the $w$ parameters as indicators of how important the corresponding inputs are: if $w_d$ is zero or near zero, the inputs in the dimension $d$ contain little information and could possibly be neglected.

Consider a set of $N$ $D$-dimensional input vectors $X = [x_1, x_2, \ldots, x_N]$ and a vector of output data $y = [y_1, y_2, \ldots, y_N]^T$. Based on the data $(X, y)$, and given a new input vector $x^*$, we wish to find the predictive distribution of the corresponding output $y^*$. Unlike in other models, there is no model parameter determination within a fixed model structure. In building such a model, most of the effort consists of tuning the hyperparameters of the covariance function. The number of parameters to be optimized is small ($D + 2$ for the commonly used squared exponential covariance function), which means that optimization convergence might be faster than with parametric models and that the 'curse of dimensionality', so common to black-box identification problems, is circumvented or at least decreased.

GP models can be easily utilized for regression calculations. Based on the training set $X$, a covariance matrix $K$ of size $N \times N$ is determined. As already mentioned, the aim is to find the distribution of the corresponding output $y^*$ for some new input vector $x^* = [x_1(N+1), x_2(N+1), \ldots, x_D(N+1)]^T$. For a new test input $x^*$, the predictive distribution of the corresponding output is $y^* | (X, y), x^*$ and is Gaussian, with the mean and variance

$$\mu(x^*) = k(x^*)^T K^{-1} y, \qquad (3)$$

$$\sigma^2(x^*) = \kappa(x^*) - k(x^*)^T K^{-1} k(x^*), \qquad (4)$$

where $k(x^*) = [C(x_1, x^*), \ldots, C(x_N, x^*)]^T$ is the $N \times 1$ vector of covariances between the test and training cases, and $\kappa(x^*) = C(x^*, x^*)$ is the covariance of the test input with itself. The obtained model not only describes the dynamic characteristics of the system, but also provides information about the confidence in these predictions by means of a prediction variance. The prediction variance is usually reported as 2σ error bars around the mean, which correspond to roughly a 95% confidence interval. This confidence region can be seen in the example in Fig. 2 as a gray band. It highlights areas of the input space where the prediction quality is poor, due to the lack of data or noisy data, by indicating the wider confidence band around the predicted mean.

[Figure: GP prediction plotted as Output, y against Input, x.]

Fig. 2. Using GP models: in addition to the mean value (prediction), we obtain a 95% confidence region for the underlying function f (shown in gray)
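As an illustration of Eqs. (2)-(4), the following minimal Python sketch computes the predictive mean and variance for a small data set. It is not the GP toolbox implementation used in this paper: the function names (sq_exp_cov, cholesky, chol_solve, gp_predict) are hypothetical, the linear systems are solved with a plain Cholesky factorization rather than an explicit inverse, and the noise term v0 is assumed to enter only on the diagonal of K and in C(x*, x*).

```python
import math


def sq_exp_cov(xp, xq, w, v1):
    """Squared exponential term of Eq. (2): v1 * exp(-0.5 * sum_d w_d (x_dp - x_dq)^2)."""
    return v1 * math.exp(-0.5 * sum(wd * (a - b) ** 2 for wd, a, b in zip(w, xp, xq)))


def cholesky(A):
    """Lower-triangular L with L L^T = A, for a symmetric positive-definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def chol_solve(L, b):
    """Solve (L L^T) x = b by forward and backward substitution."""
    n = len(b)
    z = [0.0] * n
    for i in range(n):
        z[i] = (b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def gp_predict(X, y, x_star, w, v0, v1):
    """Predictive mean and variance at x_star, following Eqs. (3) and (4)."""
    n = len(X)
    # Covariance matrix K: squared exponential part plus white noise v0 on the diagonal.
    K = [[sq_exp_cov(X[p], X[q], w, v1) + (v0 if p == q else 0.0)
          for q in range(n)] for p in range(n)]
    L = cholesky(K)
    k_star = [sq_exp_cov(xi, x_star, w, v1) for xi in X]   # k(x*)
    kappa = sq_exp_cov(x_star, x_star, w, v1) + v0          # kappa(x*) = C(x*, x*)
    mean = sum(k * a for k, a in zip(k_star, chol_solve(L, y)))
    var = kappa - sum(k * a for k, a in zip(k_star, chol_solve(L, k_star)))
    return mean, var
```

With a small noise level, a call such as gp_predict(X, y, [0.0], [1.0], 1e-4, 1.0) at one of the training inputs returns a mean close to the observed output and a small variance, whereas far from the data the variance approaches v1 + v0, which is the widening of the confidence band described above.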

To accurately reflect the correlations present in the training data, the hyperparameters of the covariance function need to be optimized. Due to the probabilistic nature of the GP models, the common model optimization approach, where the model parameters and possibly also the model structure are optimized through the minimization of a cost function defined in terms of model error (e.g. mean square error), is not readily applicable. A probabilistic approach to the optimization of the model seems more appropriate. Instead of minimizing the model error, the probability of the model is maximized.

The overall problem of learning unknown parameters from data corresponds to the predictive distribution $P(y^* | y, X, x^*)$ of the new target $y^*$, given the training data $(y, X)$ and a new input $x^*$. In order to calculate this posterior distribution, a distribution over the hyperparameters $P(\Theta | y, X)$ can first be defined, followed by the integration of the model over the hyperparameters

$$P(y^* | y, X, x^*) = \int P(y^* | \Theta, y, X, x^*) \, P(\Theta | y, X) \, d\Theta. \qquad (5)$$

The computation of such integrals can be difficult due to the intractable nature of the nonlinear functions. A solution to the problem of intractable integrals is to adopt numerical integration methods such as the Monte-Carlo approach. Unfortunately, significant computational effort may be required to achieve a sufficiently accurate approximation. An alternative approach based on the Maximum Likelihood optimization method has been developed and is applied to maximize the marginal likelihood. It can be restated as a cost function that is to be maximized. For numerical scaling purposes the log of the marginal likelihood is taken:

$$L(\Theta) = -\frac{1}{2}\log(|K|) - \frac{1}{2} y^T K^{-1} y - \frac{N}{2}\log(2\pi). \qquad (6)$$
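Eq. (6) can be evaluated without forming $K^{-1}$ explicitly. The sketch below is one possible way to do this, assuming a symmetric positive-definite K given as a list of lists; the name log_marginal_likelihood is illustrative and does not refer to a routine of the GP toolbox. It uses the Cholesky factor L of K, for which $\log|K| = 2\sum_i \log L_{ii}$ and $y^T K^{-1} y = \|L^{-1} y\|^2$.

```python
import math


def log_marginal_likelihood(K, y):
    """Eq. (6): -0.5*log|K| - 0.5*y^T K^-1 y - (N/2)*log(2*pi)."""
    n = len(y)
    # Cholesky factor L (lower triangular, L L^T = K); K must be positive definite.
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = K[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    # Forward substitution: z = L^-1 y, so that y^T K^-1 y = z^T z.
    z = [0.0] * n
    for i in range(n):
        z[i] = (y[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(n))  # log|K|
    quad = sum(v * v for v in z)                              # y^T K^-1 y
    return -0.5 * log_det - 0.5 * quad - 0.5 * n * math.log(2.0 * math.pi)
```

In a hyperparameter search, K would be rebuilt from the covariance function for every candidate Θ before this function is called, so each evaluation of the cost function involves one factorization of an N × N matrix.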

A frequently used method for optimizing the cost function is a conjugate gradient method, namely the Polak-Ribière version used in tandem with the Wolfe-Powell stopping conditions. Because this is a deterministic method, its result depends heavily on the initial values of the hyperparameters, especially for complex multidimensional systems where the cost function has many local optima. Therefore, the conjugate gradient method should be run repeatedly with various initial values of the hyperparameters. Because the space of possible values is huge, the initial values are often chosen randomly. Stochastic optimization methods can therefore be considered as an alternative approach. To test their applicability in GP model optimization, we performed a comparative study of three stochastic algorithms described in the next section: the genetic algorithm, differential evolution and particle swarm optimization.
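The dependence of a deterministic local method on its initial value can be illustrated with a small sketch. It assumes a plain fixed-step gradient ascent as a simple stand-in for the conjugate gradient method, and a hypothetical one-dimensional toy objective f with two local maxima; neither is taken from the experiments in this paper.

```python
import random


def gradient_ascent(grad, x0, step=0.01, n_steps=5000):
    """Plain fixed-step gradient ascent; a stand-in for a deterministic local method."""
    x = x0
    for _ in range(n_steps):
        x += step * grad(x)
    return x


def multistart(f, grad, lo, hi, n_starts=10, seed=0):
    """Run the local method from random initial values and keep the best end point."""
    rng = random.Random(seed)
    ends = [gradient_ascent(grad, rng.uniform(lo, hi)) for _ in range(n_starts)]
    return max(ends, key=f)


# Toy objective with local maxima near x = -0.96 and x = 1.04 (not the GP likelihood).
def f(x):
    return -(x * x - 1.0) ** 2 + 0.3 * x


def grad_f(x):
    return -4.0 * x * (x * x - 1.0) + 0.3
```

Started at x0 = -1.5 the ascent stops at the lower maximum on the left, while from x0 = 1.5 it reaches the higher one on the right; multistart(f, grad_f, -2.0, 2.0) keeps the better of several such runs, which is the repeated-restart strategy described above.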

3 Evolutionary Algorithms

Evolutionary algorithms are generic, population-based, stochastic optimization algorithms inspired by biological evolution. They use mechanisms similar to those known from the evolution of species: reproduction, mutation, recombination and selection. Candidate solutions to the optimization problem play the role of individuals in a population, and the fitness function determines the environment within which the solutions 'live'. The simulated evolution of the population then takes place through the repeated application of the above operators. Evolutionary algorithms [3] perform well at approximating solutions to diverse types of problems because they make no assumptions about the underlying fitness landscape; this generality is shown by their success in fields as diverse as science, engineering, economics, social sciences and art.


In most real-world applications of evolutionary algorithms, computational complexity is a prohibitive factor. This computational complexity is mainly due to the fitness function evaluation. In our case, as was shown in the previous section, the evaluation of the fitness function involves an inversion of the covariance matrix, for which the computational time increases with the third power of the amount of data. This 'inconvenience' is unfortunately inescapable without an approximation. The most commonly used evolutionary algorithms for numerical optimization, such as the maximization of the logarithmic marginal likelihood, are the genetic algorithm with a real-number representation, differential evolution and particle swarm optimization.

3.1 Genetic Algorithm

The genetic algorithm (GA) [3] is a flexible search technique used in computing to find exact or approximate solutions to optimization and search problems in many areas. Traditionally, solutions are represented as binary strings of 0s and 1s, but other encodings are also possible. The simulated evolution usually starts from a population of randomly generated individuals and proceeds in generations. In each generation, the fitness of every individual in the population is evaluated, and multiple individuals are stochastically selected from the current population (based on their fitness) and modified (recombined and possibly randomly mutated) to form a new population. The new population is then used in the next generation of the algorithm. Commonly, the algorithm terminates when either a maximum number of generations has been produced, or a satisfactory fitness level has been reached for the population. If the algorithm has terminated due to a maximum number of generations, a satisfactory solution may or may not have been reached.

3.2 Differential Evolution

Differential evolution (DE) is a method for numerical optimization without explicit knowledge of the gradients. It was presented by Storn and Price [9] and works on multidimensional, real-valued functions that are not necessarily continuous or differentiable. DE searches for a solution to a problem by maintaining a population of candidate solutions and creating new candidate solutions by combining existing ones according to its simple formula of vector crossover and mutation, and then keeping whichever candidate solution has the best score or fitness for the optimization problem at hand. In this way the optimization problem is treated as a black box that merely provides a measure of quality given a candidate solution and the gradient is therefore not needed. More details about DE can be found in [7].

3.3 Particle Swarm Optimization

Particle swarm optimization (PSO) is a method proposed by Kennedy and Eberhart [4] that is motivated by the social behavior of organisms such as bird flocking and fish schooling. Like DE, it is used for numerical optimization without explicit knowledge of the gradients. PSO provides a population-based search procedure in which individuals called particles change their position (state) with time. In a PSO system, particles "fly" around in a multidimensional search space. During flight, each particle adjusts its position according to its own experience and the experience of a neighboring particle, making use of the best position encountered by itself and its neighbor. Thus, a PSO system combines local search with global search, attempting to balance the exploration and exploitation. Further details about PSO can be found in [5].
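The position update described above can be sketched as follows. This is a minimal global-best variant, in which the whole swarm acts as the neighborhood of every particle; the inertia weight, the acceleration coefficients c1 and c2, and the function name pso_maximize are assumptions made for illustration and do not reproduce the toolbox from [1].

```python
import random


def pso_maximize(f, bounds, n_particles=20, n_iter=100,
                 inertia=0.7, c1=1.5, c2=1.5, seed=0):
    """Global-best PSO that maximizes f over a box; parameter values are illustrative."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]              # best position seen by each particle
    best_val = [f(p) for p in pos]
    g = max(range(n_particles), key=lambda i: best_val[i])
    gbest_pos, gbest_val = best_pos[g][:], best_val[g]   # best position seen by the swarm
    for _ in range(n_iter):
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (inertia * vel[i][d]
                             + c1 * r1 * (best_pos[i][d] - pos[i][d])   # own experience
                             + c2 * r2 * (gbest_pos[d] - pos[i][d]))    # swarm experience
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)     # stay inside the box
            val = f(pos[i])
            if val > best_val[i]:
                best_pos[i], best_val[i] = pos[i][:], val
                if val > gbest_val:
                    gbest_pos, gbest_val = pos[i][:], val
    return gbest_pos, gbest_val
```

For GP model optimization, f would be the log marginal likelihood as a function of the hyperparameter vector Θ, and bounds would be a box of admissible hyperparameter values.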

4 Experiments and Results

To assess the potential of evolutionary algorithms in the optimization of GP model hyperparameters, a problem concerning the concentration of CO2 in the atmosphere from [8] was chosen. The data consist of monthly average atmospheric CO2 concentrations derived from in-situ air samples collected at the Mauna Loa Observatory, Hawaii, between 1959 and 2009 (with some missing data) [2]. The goal is to model the CO2 concentration as a function of time. Although the data are one-dimensional, and therefore easy to visualize, a complex covariance function is used. It is derived by combining several kinds of simple covariance functions. First, a squared exponential covariance function (cov1) is used to model the long-term, smoothly rising trend. The seasonal component is modeled with the product of a periodic and a squared exponential covariance function (cov2). To model the medium-term irregularities, a rational quadratic covariance function (cov3) is used. Finally, the noise is modeled as the sum of a squared exponential and a constant covariance function (cov4). This complex covariance function involves 13 hyperparameters. Note that in [8] the covariance function involves only 11 hyperparameters due to the period of the periodic covariance function being fixed to one year.

In the experiments we used the Matlab toolbox for GP modeling (GP toolbox) that is freely available [8]. For differential evolution an implementation from [7], for particle swarm optimization an implementation from [1], and for the genetic algorithm an implementation from [6] were used. These methods were compared to a deterministic conjugate gradient (CG) method implemented in the GP toolbox. All the stochastic algorithms had the same population size, number of generations and number of solution evaluations. The population size was set to 50 individuals and the number of generations to 2,000, which gives 50 × 2,000 = 100,000 solution evaluations per run, and each algorithm was run 10 times.

Since the tested algorithms have various parameters, these were tuned in preliminary experiments. With the same number of experiments for each tested algorithm, the fairness of the experimental evaluation was ensured as far as possible. For the comparison of the tested stochastic methods with the conjugate gradient method, the same number of evaluations was used. Thus, the conjugate gradient method was executed 10 times with 100,000 solution evaluations available each time. This means that, in one run, the conjugate gradient method is executed with random initial values and restarted until all the available evaluations are spent. For each algorithm the following statistics were calculated: minimum, maximum, average and standard deviation; they are given in Table 1.
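The evaluation protocol (independent runs followed by summary statistics of the best value found) can be sketched as below, using a basic DE/rand/1/bin scheme as the example optimizer. The scale factor F, the crossover rate CR, the function names and the use of the sample standard deviation are assumptions for illustration; the sketch is not the implementation from [7] and does not reproduce the tuned settings of the experiments.

```python
import random
import statistics


def de_maximize(f, bounds, pop_size=50, generations=2000, F=0.5, CR=0.9, seed=None):
    """DE/rand/1/bin maximizing f; returns the best fitness in the final population."""
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    fit = [f(x) for x in pop]
    for _ in range(generations):
        new_pop, new_fit = [], []
        for i in range(pop_size):
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            j_rand = rng.randrange(dim)       # at least one coordinate comes from the mutant
            trial = []
            for d, (lo, hi) in enumerate(bounds):
                if d == j_rand or rng.random() < CR:
                    v = pop[a][d] + F * (pop[b][d] - pop[c][d])   # differential mutation
                    v = min(max(v, lo), hi)
                else:
                    v = pop[i][d]
                trial.append(v)
            f_trial = f(trial)
            if f_trial >= fit[i]:             # greedy one-to-one selection
                new_pop.append(trial)
                new_fit.append(f_trial)
            else:
                new_pop.append(pop[i])
                new_fit.append(fit[i])
        pop, fit = new_pop, new_fit
    return max(fit)


def run_statistics(optimizer, n_runs=10):
    """Maximum, minimum, average and sample standard deviation over n_runs runs."""
    results = [optimizer(seed) for seed in range(n_runs)]
    return max(results), min(results), statistics.mean(results), statistics.stdev(results)
```

With pop_size=50 and generations=2000 the main loop performs 50 × 2,000 = 100,000 evaluations, and this sketch spends 50 more on the initial population. A call such as run_statistics(lambda s: de_maximize(f, bounds, seed=s)) then yields the four quantities of the kind reported in Table 1.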

Table 1. Statistic values – maximum, minimum, average and standard deviation – over 10 runs for each algorithm: CG, GA, DE, PSO

                     CG          GA          DE          PSO
Maximum              −598.8627   −639.0114   −142.2658   −142.4529
Minimum              −638.9760   −703.9570   −176.2658   −216.9546
Average              −630.9088   −645.8582   −154.5382   −177.6854
Standard deviation   12.5953     20.4178     11.3083     26.9605

Fig. 3 shows the performance traces of each algorithm averaged over 10 runs. It is immediately apparent that DE and PSO perform similarly and much better than GA and CG. DE and PSO reached very similar maximum values, which were very close to the result from [8]. However, PSO reached a lower minimum value and a larger standard deviation than DE (Table 1). In other words, DE is more likely to find the optimal value or at least a value near it. On the other hand, PSO reached a value near the optimum faster than DE. Therefore, PSO can be used for problems where results are needed as soon as possible, even if they are only sub-optimal. The unsatisfactory results obtained by CG indicate the difficulty of the chosen problem for this traditional method.

Fig. 4 shows the predictions of CO2 for the next 20 years based on the model obtained from the best hyperparameter values found with DE, which are shown in Table 2. Note that the best hyperparameter values obtained by PSO differ from the ones found by DE by at most $10^{-5}$. Consequently, the log marginal likelihood and the predictions of PSO are almost identical to those of DE and are therefore not shown separately.

Table 2. The best hyperparameter values from our experiments, obtained with DE

cov1 cov2 cov3 cov4

Θ1 2.6741 8.5250 2.2481 6.9999

Θ2 Θ3 Θ4 Θ5 −0.6808 1.0000 0.5601 1.0000 2.4847 0.4687 −4.9666 4.9999 −1.6696

The obtained predictions are almost identical to the originals from [8], but the log marginal likelihood is smaller, due to a wider range of measurements for training and the wider range of predictions used in our experiment.


[Figure: four panels titled Conjugate gradient method, Genetic algorithm, Differential evolution and Particle swarm optimization, each plotting Log marginal likelihood against Evaluations.]

Fig. 3. Performance traces for the tested algorithms. Solid lines denote mean values, dashed lines denote maximum and minimum values, and gray areas represent standard deviations.

[Figure: CO2 concentration, ppm plotted against year, 1960 to 2030.]

Fig. 4. Concentration of the CO2 in the atmosphere and its predictions based on the GP model with the hyperparameters optimized with differential evolution

5 Conclusion

A commonly used method for the optimization of GP model hyperparameter values is the deterministic conjugate gradient method. Because the cost function, the log marginal likelihood, can have many local optima, three stochastic optimization methods from the domain of evolutionary computation were tested on this problem. The results from the experimental work indicate that the selected evolutionary algorithms, especially DE and PSO, successfully avoid the local optima and find near-optimal values. Therefore, they seem useful for optimizing complex GP models such as models of multi-dimensional and dynamic systems, or at least for finding good initial values of the hyperparameters.

Our further work will be directed towards the hybridization of deterministic and stochastic methods for faster convergence and a better final result. Since evolutionary algorithms can be easily parallelized, they could also be used for finding influential regressors. This could be done by using automatic relevance determination [8]. In the case of multi-variate systems with a large number of inputs, the number of hyperparameters becomes larger. The consequence of a larger amount of input data is a much longer execution due to the more complex evaluation and, most likely, a higher number of required solution evaluations. Therefore, the algorithm parallelization could represent a way of speeding up the optimization process.

References

1. Birge, B.: Matlab Central: Particle swarm optimization toolbox, http://www.mathworks.com/matlabcentral/fileexchange/7506-particle-swarm-optimization-toolbox
2. Carbon Dioxide Information Analysis Center: Atmospheric CO2 values collected at Mauna Loa, Hawaii, USA, http://cdiac.esd.ornl.gov/ftp/trends/co2/maunaloa.co2
3. Eiben, A.E., Smith, J.E.: Introduction to Evolutionary Computing. Natural Computing Series. Springer, Heidelberg (2003)
4. Kennedy, J., Eberhart, R.: Particle swarm optimization. In: Proceedings of IEEE International Conference on Neural Networks, pp. 1942–1948. IEEE Press, Los Alamitos (1995)
5. Kennedy, J., Eberhart, R., Shi, Y.: Swarm Intelligence. Morgan Kaufmann, San Francisco (2001)
6. Pohlheim, H.: Geatbx – The Genetic and Evolutionary Algorithm Toolbox for Matlab, http://www.geatbx.com/
7. Price, K., Storn, R., Lampinen, J.: Differential Evolution. Natural Computing Series. Springer, Heidelberg (2005)
8. Rasmussen, C.E., Williams, C.K.I.: Gaussian Processes for Machine Learning. The MIT Press, Cambridge (2006)
9. Storn, R., Price, K.: Differential evolution – A Simple and Efficient Heuristic for Global Optimization over Continuous Spaces. Journal of Global Optimization (11), 341–359 (1997)
