Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board David Hutchison Lancaster University, UK Takeo Kanade Carnegie Mellon University, Pittsburgh, PA, USA Josef Kittler University of Surrey, Guildford, UK Jon M. Kleinberg Cornell University, Ithaca, NY, USA Alfred Kobsa University of California, Irvine, CA, USA Friedemann Mattern ETH Zurich, Switzerland John C. Mitchell Stanford University, CA, USA Moni Naor Weizmann Institute of Science, Rehovot, Israel Oscar Nierstrasz University of Bern, Switzerland C. Pandu Rangan Indian Institute of Technology, Madras, India Bernhard Steffen TU Dortmund University, Germany Madhu Sudan Microsoft Research, Cambridge, MA, USA Demetri Terzopoulos University of California, Los Angeles, CA, USA Doug Tygar University of California, Berkeley, CA, USA Gerhard Weikum Max Planck Institute for Informatics, Saarbruecken, Germany
6274
Gianluca Tempesti Andy M. Tyrrell Julian F. Miller (Eds.)
Evolvable Systems: From Biology to Hardware 9th International Conference, ICES 2010 York, UK, September 6-8, 2010 Proceedings
Volume Editors Gianluca Tempesti University of York Department of Electronics Intelligent Systems Group York YO10 5DD, UK E-mail:
[email protected] Andy M. Tyrrell University of York Department of Electronics Intelligent Systems Group York YO10 5DD, UK E-mail:
[email protected] Julian F. Miller University of York Department of Electronics Intelligent Systems Group York YO10 5DD, UK E-mail:
[email protected]
Library of Congress Control Number: 2010932609
CR Subject Classification (1998): C.2, D.2, F.1, F.3, J.3, I.2
LNCS Sublibrary: SL 1 – Theoretical Computer Science and General Issues
ISSN 0302-9743
ISBN-10 3-642-15322-4 Springer Berlin Heidelberg New York
ISBN-13 978-3-642-15322-8 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. springer.com © Springer-Verlag Berlin Heidelberg 2010 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper 06/3180
Preface
Biology has inspired electronics from the very beginning: the machines that we now call computers are deeply rooted in biological metaphors. Pioneers such as Alan Turing and John von Neumann openly declared their aim of creating artificial machines that could mimic some of the behaviors exhibited by natural organisms. Unfortunately, technology had not progressed enough to allow them to put their ideas into practice. The 1990s saw the introduction of programmable devices, both digital (FPGAs) and analogue (FPAAs). These devices, by allowing the functionality and the structure of electronic devices to be easily altered, enabled researchers to endow circuits with some of the same versatility exhibited by biological entities and sparked a renaissance in the field of bio-inspired electronics with the birth of what is generally known as evolvable hardware. Ever since, the field has progressed along with the technological improvements and has expanded to take into account many different biological processes, from evolution to learning, from development to healing. Of course, the application of these processes to electronic devices is not always straightforward (to say the least!), but rather than being discouraged, researchers in the community have shown remarkable ingenuity, as demonstrated by the variety of approaches presented at this conference and included in these proceedings. Held without interruption since 1995, ICES has become the leading conference in the field of evolvable hardware and systems. The 9th ICES conference, held in York, UK, in September 2010, built on the success of its predecessors and brought together some of the leading researchers who combine biologically inspired concepts with hardware. The 33 papers included in this volume, accepted for oral presentation and publication following a rigorous review process by a selected Programme Committee, represent a good sample of some of the best research in the field and clearly illustrate the range of approaches that fall under the label of bio-inspired hardware, defined as electronic hardware that tries to draw inspiration from (and not, it is worth pointing out, to imitate) the world of biology to find solutions for the problems facing the design of computing systems. So a heartfelt note of thanks goes to the authors of the papers presented in these proceedings, who submitted material of remarkably high quality and contributed to making ICES 2010 a successful conference. This success was also a result of the outstanding work from the Organizing Committee, from the Local Chairs, Steve Smith and James Walker, who were instrumental in arranging the venue and all the intricate details involved in running a conference, to the Publicity Chairs, Andy Greensted and Michael Lones, who handled the interface with the world by setting up a great website and by making sure that the conference was advertised widely through the community.
We wish to show our particular gratitude to our Programme Committee: due to some unforeseen circumstances, we were forced to set a deadline for reviews that was considerably shorter than usual and the committee did a magnificent job in providing us with their invaluable feedback within a very short time. And we should not forget the contribution of the Steering Committee members whose oversight and commitment through the years ensures that the ICES series of conferences has a bright future ahead. Last but not least, we wish to thank our three outstanding Keynote Speakers, Steve Furber, Hod Lipson, and Andrew Turberfield, who stimulated thought and inspired us with their presentations. Of course, the papers in these proceedings represent just a few examples of how bio-inspired approaches are being applied to electronic hardware: analogies between the world of computer engineering and that of biology can be drawn, explicitly or implicitly, on many levels. By showcasing the latest developments in the field and by providing a forum for discussion and for the exchange of ideas, ICES 2010 represented, we hope, a small but significant step towards the fulfillment of some of our ambitions for this developing field and contributed novel ideas that will find fertile ground in our community and beyond. September 2010
Gianluca Tempesti Andy Tyrrell Julian Miller
Organization
ICES2010 was organized by the Intelligent Systems Group of the Department of Electronics, University of York, UK.
Executive Committee
General Chair: Gianluca Tempesti
Programme Chairs: Andy Tyrrell, Julian Miller
Local Arrangements: Stephen Smith, James A. Walker
Publicity: Andrew Greensted, Michael Lones
Steering Committee
Pauline C. Haddow, Norwegian University of Science and Technology, Norway
Tetsuya Higuchi, AIST, Japan
Julian Miller, The University of York, UK
Jim Torresen, The University of Oslo, Norway
Andy Tyrrell, The University of York, UK (Chair)
Programme Committee Andrew Adamatzky Burçin Aktan Tughrul Arslan Elhadj Benkhelifa Peter Bentley Michal Bidlo Stefano Cagnoni Carlos A. Coello Ronald F. DeMara Rolf Drechsler Marc Ebner R. Tim Edwards Stuart J. Flockton John Gallagher Takashi Gomi Garrison Greenwood
Pauline C. Haddow David M. Halliday Alister Hamilton Morten Hartmann Inman Harvey James Hereford Arturo Hernandez-Aguirre Jean-Claude Heudin Masaya Iwata Tatiana Kalganova Paul Kaufmann Krzysztof Kepa Didier Keymeulen Gul Muhammad Khan Gregory Larchev Per Kristian Lehre
Wenjian Luo Jordi Madrenas Trent McConaghy Bob McKay Maizura Mokhtar J. Manuel Moreno Arostegui Pierre-André Mudry Masahiro Murakawa Nadia Nedjah Andres Perez-Uribe Marek A. Perkowski Jean-Marc Philippe Tony Pipe Lucian Prodan Omer Qadir Daniel Roggen Joël Rossier Eduardo Sanchez Cristina Santini
Gilles Sassatelli Thorsten Schnier Lukáš Sekanina Giovanni Squillero Till Steiner Susan Stepney Uwe Tangen Christof Teuscher Jon Timmis Yann Thoma Adrian Thompson Jim Torresen Martin Trefzer Gunnar Tufte Andres Upegui Fabien Vannel Moritoshi Yasunaga Xin Yao Tina Yu
Table of Contents
Session 1: Evolving Digital Circuits Measuring the Performance and Intrinsic Variability of Evolved Circuits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . James Alfred Walker, James A. Hilder, and Andy M. Tyrrell An Efficient Selection Strategy for Digital Circuit Evolution . . . . . . . . . . . Zbyˇsek Gajda and Luk´ aˇs Sekanina Introducing Flexibility in Digital Circuit Evolution: Exploiting Undefined Values in Binary Truth Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . Ricky D. Ledwith and Julian F. Miller Evolving Digital Circuits Using Complex Building Blocks . . . . . . . . . . . . . Paul Bremner, Mohammad Samie, Gabriel Dragffy, Tony Pipe, James Alfred Walker, and Andy M. Tyrrell
1
13
25
37
Session 2: Artificial Development Fault Tolerance of Embryonic Algorithms in Mobile Networks . . . . . . . . . David Lowe, Amir Mujkanovic, Daniele Miorandi, and Lidia Yamamoto Evolution and Analysis of a Robot Controller Based on a Gene Regulatory Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Martin A. Trefzer, T¨ uze Kuyucu, Julian F Miller, and Andy M. Tyrrell A New Method to Find Developmental Descriptions for Digital Circuits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mohammad Ebne-Alian and Nawwaf Kharma Sorting Network Development Using Cellular Automata . . . . . . . . . . . . . . Michal Bidlo, Zdenek Vasicek, and Karel Slany
49
61
73
85
Session 3: GPU Platforms for Bio-inspired Algorithms Markerless Articulated Human Body Tracking from Multi-view Video with GPU-PSO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Luca Mussi, Spela Ivekovic, and Stefano Cagnoni
97
Evolving Object Detectors with a GPU Accelerated Vision System . . . . . Marc Ebner
109
Systemic Computation Using Graphics Processors . . . . . . . . . . . . . . . . . . . . Marjan Rouhipour, Peter J. Bentley, and Hooman Shayani
121
Session 4: Implementations and Applications of Neural Networks An Efficient, High-Throughput Adaptive NoC Router for Large Scale Spiking Neural Network Hardware Implementations . . . . . . . . . . . . . . . . . . Snaider Carrillo, Jim Harkin, Liam McDaid, Sandeep Pande, and Fearghal Morgan Performance Evaluation and Scaling of a Multiprocessor Architecture Emulating Complex SNN Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Giovanny S´ anchez, Jordi Madrenas, and Juan Manuel Moreno
133
145
Evolution of Analog Circuit Models of Ion Channels . . . . . . . . . . . . . . . . . . Theodore W. Cornforth, Kyung-Joong Kim, and Hod Lipson
157
HyperNEAT for Locomotion Control in Modular Robots . . . . . . . . . . . . . . Evert Haasdijk, Andrei A. Rusu, and A.E. Eiben
169
Session 5: Test, Repair and Reconfiguration Using Evolutionary Algorithms The Use of Genetic Algorithm to Reduce Power Consumption during Test Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jaroslav Skarvada, Zdenek Kotasek, and Josef Strnadel
181
Designing Combinational Circuits with an Evolutionary Algorithm Based on the Repair Technique . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Houjun Liang, Wenjian Luo, Zhifang Li, and Xufa Wang
193
Bio-inspired Self-testing Configurable Circuits . . . . . . . . . . . . . . . . . . . . . . . Andr´e Stauffer and Jo¨el Rossier Evolutionary Design of Reconfiguration Strategies to Reduce the Test Application Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ˇ aˇcek, Luk´ Jiˇr´ı Sim´ aˇs Sekanina, and Luk´ aˇs Stareˇcek
202
214
Session 6: Applications of Evolutionary Algorithms in Hardware Extrinsic Evolution of Fuzzy Systems Applied to Disease Diagnosis . . . . . Jo¨el Rossier and Carlos Pena
226
Automatic Code Generation on a MOVE Processor Using Cartesian Genetic Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . James Alfred Walker, Yang Liu, Gianluca Tempesti, and Andy M. Tyrrell Coping with Resource Fluctuations: The Run-time Reconfigurable Functional Unit Row Classifier Architecture . . . . . . . . . . . . . . . . . . . . . . . . . Tobias Knieper, Paul Kaufmann, Kyrre Glette, Marco Platzner, and Jim Torresen
238
250
Session 7: Reconfigurable Hardware Platforms A Self-reconfigurable FPGA-Based Platform for Prototyping Future Pervasive Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jean-Marc Philippe, Benoˆıt Tain, and Christian Gamrat The X2 Modular Evolutionary Robotics Platform . . . . . . . . . . . . . . . . . . . . Kyrre Glette and Mats Hovin Ubichip, Ubidule, and MarXbot: A Hardware Platform for the Simulation of Complex Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Andres Upegui, Yann Thoma, H´ector F. Satiz´ abal, Francesco Mondada, Philippe R´etornaz, Yoan Graf, Andres Perez-Uribe, and Eduardo Sanchez Implementation of a Power-Aware Dynamic Fault Tolerant Mechanism on the Ubichip Platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Kotaro Kobayashi, Juan Manuel Moreno, and Jordi Madrenas
262 274
286
299
Session 8: Applications of Evolution to Technology Automatic Synthesis of Lossless Matching Networks . . . . . . . . . . . . . . . . . . Leonardo Bruno de S´ a, Pedro da Fonseca Vieira, and Antonio Mesquita A Novel Approach to Multi-level Evolutionary Design Optimization of a MEMS Device . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Michael Farnsworth, Elhadj Benkhelifa, Ashutosh Tiwari, and Meiling Zhu From Binary to Continuous Gates – and Back Again . . . . . . . . . . . . . . . . . Matthias Bechmann, Angelika Sebald, and Susan Stepney Adaptive vs. Self-adaptive Parameters for Evolving Quantum Circuits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Cristian Ruican, Mihai Udrescu, Lucian Prodan, and Mircea Vladutiu
310
322
335
348
Session 9: Novel Methods in Evolutionary Design Imitation Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Larry Bull
360
EvoFab: A Fully Embodied Evolutionary Fabricator . . . . . . . . . . . . . . . . . . John Rieffel and Dave Sayles
372
Evolving Physical Self-assembling Systems in Two-Dimensions . . . . . . . . . Navneet Bhalla, Peter J. Bentley, and Christian Jacob
381
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
393
Measuring the Performance and Intrinsic Variability of Evolved Circuits James Alfred Walker, James A. Hilder, and Andy M. Tyrrell Intelligent Systems Group, Department of Electronics, University of York, Heslington, York, YO10 5DD, UK {jaw500,jah128,amt}@ohm.york.ac.uk
Abstract. This paper presents a comparison between conventional and multi-objective Cartesian Genetic Programming evolved designs for a 2-bit adder and a 2-bit multiplier. Each design is converted from a gate-level schematic to a transistor-level implementation, through the use of an open-source standard cell library, and simulated in NGSPICE in order to generate industry standard metrics, such as propagation delay and dynamic power. Additionally, a statistical intrinsic variability analysis is performed, in order to see how each design is affected by intrinsic variability when fabricated at a cutting-edge technology node. The results show that the evolved design for the 2-bit adder is slower and consumes more power than the conventional design. The evolved design for the 2-bit multiplier was found to be faster but consumed more power than the conventional design, and was also found to be more tolerant to the effects of intrinsic variability in both timing and power. This provides evidence that in the future, evolutionary-based approaches could be a feasible alternative for optimising designs at cutting-edge technology nodes, where traditional design methodologies are no longer appropriate, provided that speed and power information about the standard cell library is used.
1 Introduction
The construction of digital logic circuits has often been used as a method to evaluate the performance of non-standard computing techniques such as algorithms inspired by Darwinian evolution. Cartesian Genetic Programming (CGP), originally developed by Miller and Thomson, is a design technique which has been used to evolve novel logic-circuit topologies and has demonstrated efficiency in computation time and resources over other biologically-inspired methods such as Koza's Genetic Programming [9,8]. CGP differs from conventional Genetic Programming in its representation of a program, which is a directed graph as opposed to a tree. A key benefit of this representation is the implicit re-use of nodes whereby a node can be connected to the output of any previous node within the graph. The CGP genotype is a fixed-length list of integers encoding both the node-function and its connections within the directed graph. Each node within the directed graph represents a particular function, such as a logic gate, and is encoded by a number of genes; one gene encodes the functionality and the
remaining genes encode the inputs to the function. The nodes take feed-forward inputs from either previous nodes in the graph or a terminal input. In Miller's conventional CGP approach an evolutionary run would terminate once a circuit which met the target boolean output-functionality was found. This approach has been successfully used to create numerous novel topologies for building-block logic circuits such as full-adders and multipliers; however, the resultant circuits are often significantly larger than optimally-efficient designs need to be. Whilst the circuits are functionally correct in terms of binary output, they will often contain more gates or transistors than conventional human designs, and longer paths between input and output through gates and transistors. In standard logic design one of the primary goals is to minimise both the circuit area and the delay: fewer large circuits can be fabricated on a single wafer, which results in increased cost, while longer delays result in a decrease in the maximum operating frequency of the device. In previous work [7], the conventional CGP algorithm is augmented with a stage which further optimises circuits once a functionally-correct design has been found. To achieve this goal a two-tiered fitness function is used; the first tier is the conventional boolean-error score based on the binary Hamming distance between the observed output and the target truth-table. Each circuit found which is fully functionally correct is then rated for performance over a number of different criteria and sorted into Pareto-fronts using the Non-dominated Sorting Genetic Algorithm II (NSGA-II) [5]. The fitness criteria used are based on the total gate count of the circuit and the longest gate path, along with the total transistor count of the circuit (which will generally be proportional to the circuit area in a fabricated design), and the longest transistor path, which aims to give an approximation of the worst-case transition delay for the circuit. A similar procedure has been followed in recently published work by Wang and Lee, who use an adapted CGP algorithm which attempts to optimise gate count and gate path-lengths once functionally correct circuits have been found. Their solution, implemented in hardware on a Xilinx FPGA, does not however consider other important circuit parameters such as transistor count, treating all gates as equal [16]. Although the previous approaches [7,16] have found optimal designs, the fitness criteria used only monitor structural changes to the designs and do not give any feedback about other parameters of the design, such as speed or power consumption, which are crucial in order for an evolved design to be feasible and used in industry. Ideally, the optimisation process would have access to these figures, but running large circuits through an analogue circuit simulator such as NGSPICE in order to generate these figures is extremely costly in terms of time and would only be feasible on extremely large scale high-performance computing resources. Commercial design tools normally operate with standard cell libraries (gate level) that have been characterised in order to have access to the speed and power figures for a certain working range, thereby removing the need for an analogue simulator during the evaluation of a large circuit. However, one downfall of commercial design tools is that they are currently not capable of assessing how intrinsic variability will affect the design at cutting-edge
technology nodes. Recently the scale of transistors has approached the level where the precise placement of individual dopant atoms will affect the output characteristics of the transistor. As these intrinsic variations become more abundant, higher failure rates and lower yields will be observed from conventional designs. Coping with intrinsic variability has been recognised as one of the major unsolved challenges faced by the semiconductor industry [1,4]. In this paper, a conventional design and an evolved design for a 2-bit adder and a 2-bit multiplier (taken from [7]) are implemented at the transistor level and run through the analogue circuit simulator NGSPICE, in order to generate industry standard metrics for the designs, such as propagation delay and dynamic power, and to perform a comparison between the designs based on these metrics. Additionally, a statistical intrinsic variability analysis will be performed on the designs in order to see how intrinsic variability would affect the designs if they were to be fabricated at a cutting edge technology node. It will also be interesting to see if either design shows any signs of variability tolerance over the other. The structure of this paper is as follows: Section 2 discusses the causes and impact of transistor variability, and outlines the methods used to extract accurate data models which incorporate random variations. Section 3 describes the process of converting the conventional and evolved designs from the gate level to the transistor level and defines the performance metrics used. Section 4 provides details of the design comparison based on the performance and intrinsic variability analysis. The conclusions and proposals for future work are summarised in Section 5.
2 CMOS Variability
CMOS devices form the backbone of almost all modern digital circuits. Integrated circuits are assembled from complementary pairs of PMOS and NMOS transistors optimised for high speed and low-power consumption. For many years, the cyclical process of reducing transistor channel length has resulted in devices both faster and lower in power consumption than the previous generation, with modern microprocessors boasting in excess of one billion transistors and gate lengths of under 50nm [14]. The International Technology Roadmap for Semiconductors (ITRS), published by the Semiconductor Industry Association, projects an annual reduction of 11% in gate length, resulting in reduced operating voltages and a decrease in the gate delay of 10% per year [17]. This projected improvement is under threat from the problem of decreased yield caused by heightened variability as devices shrink.
2.1 Causes of Device Variability
The precision of individual device and interconnect parameters has traditionally been dependent on constraints within the manufacturing process, and has been considered deterministic in nature. As channel lengths shrink below 50nm, unavoidable stochastic variability due to the actual location of individual dopant
Fig. 1. Future transistors illustrated at the atomic scale: (a) a 22nm MOSFET, due c.2009; (b) a 4.2nm MOSFET, due c.2023; and (c) intrinsic parameter fluctuations within a simulated 35nm device [2]
atoms within the device channel is becoming increasingly significant. This is illustrated to scale in figures 1(a) and 1(b), which show that as devices get smaller (22nm to 4.2nm), the ratio of device size to constituent-atom size becomes less favourable; therefore, the variable constitution at the atomic scale has an increased effect on device behaviour. Many advances have been made to reduce the loss of precision caused by the manufacturing process; however, the fundamental quantum-mechanical limitations cannot be overcome, and their impact will increase as the technology shrinks further [1]. Device variability occurs in both the spatial and temporal domains, and each includes both deterministic and stochastic fluctuations. Spatial variability occurs when the produced device shape differs from the intended design, including uneven doping profiles, non-uniformity in layer thickness and poly-crystalline surfaces. This variability is found at all levels: over the lifetime of a fabrication system, across a wafer of chips, between cells within a VLSI chip, and between individual devices within that cell. Temporal variability includes the effects of electromigration, gate-oxide breakdown and the distribution of negative-bias temperature instability (NBTI). Such temporal variability has been estimated, and can be combined to give an expected lifetime calculation for an individual device, or simulated to determine the compound effect across a whole chip [3,13]. Whilst deterministic variability can be accurately estimated using specific design techniques, intrinsic parameter fluctuations can only be modelled statistically and cannot be reduced with improvements in the manufacturing process [2,10].
2.2 Intrinsic Parameter Fluctuations
Intrinsic variability is caused by the atomic-level differences in devices that could be considered identical in layout, construction and environment. Summarised below are the principal sources of intrinsic variability, as illustrated in figure 1(c). Random Dopant Fluctuations (RDF) are unavoidable variations caused by the precise number and position of dopant atoms within the silicon lattice, which exist even with a tightly controlled implant and annealing process. This uncertainty results in substantial variability in the device threshold voltage,
sub-threshold slope and drive current, with the most significant variations caused by atoms near the surface and channel of the device [1]. Line Edge Roughness (LER) is the deviation in the horizontal plane of a fabricated feature boundary from its ideal form. LER has both a deterministic nature, caused by imperfections in the mask-manufacturing, photo-resist and etching processes, and also a stochastic nature due to the discrete nature of molecules used within the photo-resist layer, resulting in a random roughness on the edges of blocks etched onto the wafer [2]. Surface Roughness (SR) is the vertical deviation of the actual surface compared to the ideal form. The shrinking of surface layers, in particular the oxide layer, results in variations in the parasitic capacitances between terminals which can add to VT variations [11]. Poly-Silicon Grain Boundary Variability (PSGB) is the variation due to the random arrangement of grains within the gate material due to their polycrystalline structure. Implanted ions can penetrate through the poly-silicon and insulator into the device channel, resulting in localised stochastic variations [6].
2.3 Modelling Intrinsic Variability
To accurately model the effects of intrinsic parameter fluctuations it is necessary to use statistical 3D simulation methods with a fine-grained discretisation. The Device Modelling Group (DMG) within the University of Glasgow [1,2] has become one of the leading research centres for 3D device modelling using their atomistic simulator, which adapts conventional 3D device modelling tools to incorporate the intrinsic effects described above. To categorise a particular transistor, a large number of current-voltage (I − V ) curves are extracted and then used to calibrate a sub-set of parameters to create a model library representing the device. For the experiments described in this paper, a library of 200 different NMOS and PMOS models, based on a 35nm × 35nm Toshiba device, has been used. To use these models within an open source implementation of the Berkeley SPICE (Simulation Program with Integrated Circuit Emphasis. See http://bwrc.eecs.berkeley.edu/Classes/icbook/SPICE/) circuit simulator, known as NGSPICE (http://ngspice.sourceforge.net/), the DMG has developed a tool, randomspice, which replaces the transistors within a template netlist with models selected randomly from the library. To allow transistors with different widths to be simulated, subcircuits of random transistors connected in parallel are assembled. To estimate the impact of variability, randomspice creates a set of output netlists which are then processed by NGSPICE. Randomspice can also create a single netlist in which only uniform 35nm transistor models are used, without the parameter fluctuations, allowing the variable output to be compared to a uniform ideal output.
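As a rough illustration of the substitution that randomspice performs, the sketch below rebinds every MOSFET instance in a template netlist to a model card drawn at random from a library of variability-enhanced devices. This is a hedged sketch and not the DMG tool itself; the netlist format, model-card names and the regular expression are assumptions made for illustration only.

```python
import random
import re

def randomise_netlist(template_lines, nmos_models, pmos_models, seed=None):
    """Sketch of a randomspice-style substitution: each MOSFET instance
    (SPICE 'M' card) in a template netlist is assigned a model card picked
    at random from a library of variability-enhanced devices."""
    rng = random.Random(seed)
    out = []
    for line in template_lines:
        # Simplified: assume cards look like "M1 d g s b NMOS W=... L=..."
        m = re.match(r"(M\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+)(NMOS|PMOS)(.*)", line, re.I)
        if m:
            pool = nmos_models if m.group(2).upper() == "NMOS" else pmos_models
            line = m.group(1) + rng.choice(pool) + m.group(3)
        out.append(line)
    return out

# Hypothetical 200-entry model libraries (placeholder names).
nmos_lib = [f"NMOS_VAR_{i:03d}" for i in range(200)]
pmos_lib = [f"PMOS_VAR_{i:03d}" for i in range(200)]
template = ["M1 out in 0 0 NMOS W=35n L=35n",
            "M2 out in vdd vdd PMOS W=70n L=35n"]
print(randomise_netlist(template, nmos_lib, pmos_lib, seed=1))
```

Generating a batch of such netlists and running each through NGSPICE then yields the kind of statistical sample used in the experiments described below.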
3 Experiment Details
The conventional designs for the 2-bit adder and 2-bit multiplier used in this paper are the standard designs taught to students in most digital electronics courses. The evolved designs are taken from previously published work [7]. These designs were found to be optimal for the structural criteria specified in that paper, namely, they contained the minimal number of gates and transistors relative to the lengths of the longest gate and transistor paths. However, the speed and power consumption of the designs were not analysed as part of the fitness criteria. In order to assess the designs on these criteria, they need to be converted to transistor-level schematics and simulated in NGSPICE. The conventional and evolved designs for the 2-bit multiplier are shown in figure 2.
Fig. 2. Conventional (a) and evolved (b) designs for a 2-bit multiplier
Converting the designs requires the use of a standard cell library (SCL). SCLs are the industry standard building blocks for constructing large circuits and consist of a number of transistor level implementations of logic and memory functions. In this paper, a number of standard cell layouts from the open-source vsclib library [12] have been used and are shown in figure 3. In order to use the uniform and variability enhanced 35nm models and randomspice discussed in section 2.3, the standard cell layouts and transistor sizes have been translated from their original 130nm process to the 35nm process. In order to convert the conventional and evolved gate-level designs for the 2-bit adder and 2-bit multiplier to transistor level schematics, it is simply a case of replacing each gate with its corresponding transistor implementation from the scaled down 35nm vsclib. Once the gate level designs for the conventional and evolved 2-bit adder and 2-bit multiplier have been translated to the transistor level, an input, supply and load stage are added to the transistor definitions to form the complete netlist, as illustrated in figure 4. This arrangement allows the voltage and current at the inputs, supply, and load to be measured, and allows realistic circuit loads to be connected to produce feasible results. The input signals for testing the designs are created using piece-wise linear (PWL) sources to approximate a transistor response with a given rise/fall time. One input is held logic high for a clock cycle then low for a clock cycle, and then high for a final clock cycle, whilst the
Fig. 3. Cells used from the open-source VSCLib
remaining three inputs are all held logic high. This process is repeated for each of the inputs. An NGSPICE transient analysis is used to observe the voltages and currents over a period of 15 clock cycles for the 2-bit adder and 12 clock cycles for the 2-bit multiplier.
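The stimulus just described can be generated programmatically; the sketch below emits standard SPICE piece-wise linear (PWL) voltage sources in which one input toggles high-low-high over three clock cycles while the remaining inputs stay high. The clock period, edge time and supply voltage used here are illustrative assumptions, not values taken from the paper.

```python
def pwl_source(name, node, levels, t_clk=1e-9, t_edge=50e-12, vdd=1.0):
    """Build a SPICE PWL voltage source that steps through the given logic
    levels, one per clock cycle, with a finite rise/fall time."""
    pts, t = [(0.0, levels[0] * vdd)], 0.0
    for prev, nxt in zip(levels, levels[1:]):
        t += t_clk
        pts += [(t, prev * vdd), (t + t_edge, nxt * vdd)]
    pairs = " ".join(f"{tt:.3e} {vv:.2f}" for tt, vv in pts)
    return f"V{name} {node} 0 PWL({pairs})"

# Input A toggles high-low-high; B, C and D are held at logic high.
print(pwl_source("a", "in_a", [1, 0, 1]))
for n in "bcd":
    print(pwl_source(n, f"in_{n}", [1, 1, 1]))
```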
3.1 Measuring Speed and Power Consumption
In order to assess whether the evolved designs are faster or lower power than the conventional designs, measure statements were used in the NGSPICE simulation to calculate the propagation delay and the dynamic power of each design. The propagation delay is defined as the time taken from an input reaching the 50% threshold to an output reaching the 50% threshold. As the designs have multiple outputs, it is the slowest time taken for an output to reach the 50% threshold over all input transitions that is used, as this is the delay that would determine the operating frequency of the design. The dynamic power of each design is defined as the integral of supply voltage × supply current for the region of the clock cycle that the design is switching. This switching region is defined from the
Fig. 4. The testbench used to evaluate the designs in NGSPICE
point when an input starts to switch (rise or fall) to the point when the slowest output has finished switching (falling or rising) and reached a stable state. Once again, as the designs have multiple outputs, it is the output transition(s) that consume the most power that are used.
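The same two measurements can also be reproduced offline from exported transient waveforms. The following sketch is a minimal illustration (not the authors' measurement script): it locates the 50% threshold crossings for the propagation delay and integrates supply voltage × supply current over the switching window defined above.

```python
import numpy as np

def cross_time(t, v, level):
    """First time at which waveform v crosses the given level (linear interpolation)."""
    s = np.signbit(v - level)
    idx = np.where(s[:-1] != s[1:])[0][0]
    t0, t1, v0, v1 = t[idx], t[idx + 1], v[idx], v[idx + 1]
    return t0 + (level - v0) * (t1 - t0) / (v1 - v0)

def propagation_delay(t, v_in, v_out, vdd=1.0):
    """Delay from the input 50% crossing to the output 50% crossing."""
    return cross_time(t, v_out, 0.5 * vdd) - cross_time(t, v_in, 0.5 * vdd)

def switching_power(t, v_sup, i_sup, t_start, t_stop):
    """Integral of supply voltage x supply current over the switching window
    (the quantity reported here as dynamic power)."""
    mask = (t >= t_start) & (t <= t_stop)
    return np.trapz(v_sup[mask] * i_sup[mask], t[mask])

# Toy example with synthetic ramps (illustration only).
t = np.linspace(0, 1e-9, 1001)
v_in = np.clip((t - 0.1e-9) / 50e-12, 0, 1)
v_out = np.clip((t - 0.2e-9) / 50e-12, 0, 1)
print(propagation_delay(t, v_in, v_out))
```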
3.2 Measuring Intrinsic Variability
In order to measure the effects of intrinsic variability, a batch of NGSPICE simulations is performed for both the conventional and evolved designs using a randomised set of 35nm variability enhanced models from randomspice. The speed and power consumption metrics described in the previous section are then calculated using the data from the entire batch of runs, and non-parametric statistics are generated to describe how intrinsic variability statistically affects these performance metrics. If the evolved designs show a significant reduction in variability for either of the performance metrics then they are said to be more variability tolerant than the conventional designs.
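As a hedged sketch of the kind of non-parametric summary implied here, the batch of per-run delay or power values can be reduced to its median, inter-quartile range and total range:

```python
import numpy as np

def spread_stats(samples):
    """Median, inter-quartile range (middle 50%) and full range of a batch
    of per-run delay or power measurements."""
    q1, med, q3 = np.percentile(samples, [25, 50, 75])
    return {"median": med, "iqr": q3 - q1,
            "range": np.max(samples) - np.min(samples)}

# e.g. delays from 1,000 variability-enhanced runs (synthetic numbers).
delays = np.random.normal(3.3e-11, 2e-12, 1000)
print(spread_stats(delays))
```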
4 Results
To perform a statistical analysis of the effects of intrinsic variability on the conventional and evolved designs for the 2-bit adder and 2-bit multiplier, 1,000 randomspice simulations are performed for each design and delay and power measurements are calculated for the batch of simulations. The results of the conventional and evolved designs are shown in figure 5, which shows a comparison of the propagation delay and dynamic power for the worst case output when
Fig. 5. Statistical intrinsic variability analysis of the conventional and evolved designs for a 2-bit adder and a 2-bit multiplier: (a) conventional adder, (b) evolved adder, (c) conventional multiplier, (d) evolved multiplier. Each point of the scatter plot for each design represents the propagation delay and dynamic power from a single NGSPICE simulation, whilst each cloud of points shows the variation in propagation delay and dynamic power when manipulating an input for each design. The plots above and to the right of each scatter plot show the kernel density estimates of each distribution in terms of propagation delay and dynamic power.
manipulating each input of the designs. The figure also clearly highlights the critical paths for timing and power for both the conventional and evolved designs, on which the worst case propagation delay and dynamic power figures in Table 1 are based. Additionally, the structural information in Table 1 was taken from [7] and was used for the objectives when optimising the evolved designs.
Table 1. Metrics from the CGP objectives for the conventional and evolved 2-bit adder and 2-bit multiplier designs compared with the NGSPICE measurements

                                        2-bit Adder                2-bit Multiplier
         Metric                    Conventional   Evolved     Conventional   Evolved
CGP      Gate Count                     10           10             8            7
         Transistor Count               64           60            54           35
         Longest Gate Path               4            4             3            2
         Longest Transistor Path        12           11             9            5
NGSPICE  Propagation Delay          2.98e-11     3.79e-11      4.56e-11     3.28e-11
         Dynamic Power              3.36e-7      1.05e-6       3.73e-15     4.39e-15
From the results, it can be seen that the evolved design for the adder is 27% slower and consumes 312% more power than the conventional design, whereas the evolved design for the multiplier is 28% faster but consumes 17% more power than the conventional design. The improvement in delay of the evolved multiplier corresponds to the reduction in path length between the two designs, whereas for the adder, the path lengths between the two designs are similar, so it is surprising to see the evolved design is so much slower. Both evolved designs consumed more power than the conventional designs, which is surprising considering both evolved designs have a reduction in either gate or transistor count. However, this highlights the fact that the evolved designs were not specifically optimised for power and that some sort of power measure should be incorporated into the CGP objectives. Interestingly, on comparing the statistics of the timing and power distributions for both the evolved and conventional designs, it can be seen that the evolved design for the 2-bit adder shows a greater amount of variability than the conventional design, but the evolved 2-bit multiplier has less variability than the conventional design in both distributions. The evolved design for the 2-bit multiplier shows a reduction of 39% in the inter-quartile range (IQR, defined as the middle 50% of the distribution) and a 27% reduction in the range of the timing distribution for the critical path. Also, the power distribution of the critical path of the evolved design for the 2-bit multiplier shows a reduction of 25% in the IQR and a 17% reduction in the range. Therefore, it can be said that the evolved design for the 2-bit multiplier is more variability tolerant than the conventional design, in addition to it being faster. However, this could be attributed to the evolved design consuming more power than the conventional design. This highlights the fact that the optimisation process used in [7] could be a feasible option for designing variability tolerant circuits at cutting-edge technology nodes, when traditional design methodologies are no longer appropriate, provided that the objectives used reflect more accurately the speed and power consumption of the designs.
5 Conclusions and Future Work
This paper has presented a comparison between conventional and evolved designs for a 2-bit adder and a 2-bit multiplier based on performance metrics and a statistical intrinsic variability analysis obtained from a batch of 1,000 NGSPICE simulations. The results show that the evolved design for the 2-bit adder was slower and consumed more power than the conventional design, and the evolved design for the 2-bit multiplier was faster but consumed more power than the conventional design. The results for the 2-bit multiplier show some correlation to the original objectives used in the optimisation process; however, no correlation can be seen for the 2-bit adder results. This partly supports the claims made in [7] that by optimising designs post-evolution using multiple objectives that consider the gate and transistor counts and path lengths, it is possible to produce fabricatable designs that show real-world improvements in circuit area and operating speed (in some cases). However, it highlights the fact that, in future work, the objectives used in the optimisation process from [7] need to be expanded to include power and delay measurements from each standard cell. This would enable the optimisation process to perform a similar role to some aspects of commercial design tools. The statistical intrinsic variability analysis showed that the evolved design for the 2-bit multiplier is also more tolerant to the effects of intrinsic variability than the conventional design in both timing and power. This shows that the optimisation process could be a feasible alternative for optimising designs at cutting-edge technology nodes where traditional design methodologies are no longer appropriate (as they cannot account for the effects of intrinsic variability), provided that the measures suggested above are incorporated. In future work, it is intended to expand the objectives used in the optimisation process from [7] to consider the effects of intrinsic variability on both timing and power. Additionally, the standard cells themselves could first be optimised for performance and variability tolerance using the approach from [15]. The optimisation process from [7] would then appear as the next design tool in a conventional tool chain that operates at a higher level of abstraction.
Acknowledgements The authors would like to thank all partners of the Nano-CMOS project, especially the Device Modelling Group at the University of Glasgow for providing the variability-enhanced models and the randomspice application. Nano-CMOS is funded by the EPSRC under grant No. EP/E001610/1.
References 1. Asenov, A.: Random dopant induced threshold voltage lowering and fluctuations in sub 50 nm mosfets: a statistical 3D ’atomistic’ simulation study. Nanotechnology 10, 153–158 (1999)
2. Asenov, A.: Variability in the next generation CMOS technologies and impact on design. In: International Conference on CMOS Variability (2007) 3. Bernstein, J.B., et al.: Electronic circuit reliability modeling. Microelectronics Reliability 46, 1957–1979 (2006) 4. Bernstein, K., et al.: High-performance CMOS variability in the 65-nm regime and beyond. Advanced Silicon Technology 50 (2006) 5. Deb, K.A.P., Agarwal, S., Meyarivan, T.: A fast and elitist multi-objective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation 6, 181–197 (2002) 6. Eccleston, W.: The effect of polysilicon grain boundaries on MOS based devices. Microelectronic Engineering 48, 105–108 (1999) 7. Hilder, J.A., Walker, J.A., Tyrrell, A.M.: Use of a multi-objective fitness function to improve cartesian genetic programming circuits. In: NASA/ESA Conference on Adaptive Hardware and Systems, AHS-2010 (2010) 8. Koza, J.R.: Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, Cambridge (1992) 9. Miller, J.F., Thomson, P.: Cartesian genetic programming. In: Poli, R., Banzhaf, W., Langdon, W.B., Miller, J., Nordin, P., Fogarty, T.C. (eds.) EuroGP 2000. LNCS, vol. 1802, pp. 121–132. Springer, Heidelberg (2000) 10. Design for Variability in Logic, Memory and Microprocessor. In: Mizuno, M., De, V. (eds.) VLSI Circuits Proc. Kyoto, Japan (2007) 11. Moroz, V.: Design for manufacturability: OPC and stress variations. In: International Conference on CMOS Variability (2007) 12. Petley, G.: VLSI and ASIC technology standard cell library design, http://www.vlsitechnology.org 13. Rubio, J., et al.: Physically based modelling of damage, amorphization and recrystallization for predictive device-size process simulation. Materials Science and Engineering B, 114–115 (2004) 14. Streetman, B.G., Banerjee, S.: Solid State Electronic Devices. Prentice-Hall, Englewood Cliffs (2000) 15. Walker, J.A., Sinnott, R., Stewart, G., Hilder, J.A., Tyrrell, A.M.: Optimising electronic standard cell libraries for variability tolerance through the Nano-CMOS grid. Philosophical Transactions of the Royal Society A (2010) 16. Wang, J., Lee, C.: Evolutionary design of combinational logic circuits using vra processor. IEICE Electronics Express 6, 141–147 (2009) 17. Wyon, C.: Future technology for advanced MOS devices. Nuclear Instruments and Methods in Physics Research B 186 (2002)
An Efficient Selection Strategy for Digital Circuit Evolution Zbyšek Gajda and Lukáš Sekanina Brno University of Technology, Faculty of Information Technology, Božetěchova 2, 612 66 Brno, Czech Republic
[email protected],
[email protected]
Abstract. In this paper, we propose a new modification of Cartesian Genetic Programming (CGP) that enables digital circuits to be optimized more effectively than with the standard CGP. We argue that considering a fully functional, but not necessarily the smallest discovered, individual as the parent of the new population can decrease the number of harmful mutations and so improve the search space exploration. This phenomenon was confirmed on common benchmarks such as combinational multipliers and the LGSynth91 circuits.
1 Introduction
Cartesian Genetic Programming (CGP) exhibits many interesting features, especially for circuit design. When CGP is applied to reduce the number of gates in digital circuits it starts with a fitness function which evaluates the circuit behavior only. Once one of the candidate circuits conforms to the behavioral specification, the number of gates becomes important and is reflected in the fitness value. This method, which will be called the standard CGP in this paper, is widely adopted in the literature [1, 2, 3, 4]. We have shown in our previous work [5] that area-efficient digital circuits can be evolved even if the requirement on the gate reduction is not specified explicitly. The method is based on modifying the selection mechanism and fitness function of the standard CGP. In this paper, we provide further experimental evidence for this phenomenon. In addition to testing the method using popular benchmarks such as multipliers we will perform an experimental evaluation using the LGSynth91 benchmark circuits. We hypothesize that the neutral search and redundancy of encoding of CGP (as demonstrated in [6, 7, 8]) are primarily responsible for this phenomenon. We argue that considering fully functional but not necessarily smallest-discovered individuals as parents improves the search space exploration in comparison with the standard CGP. The rest of the paper is organized as follows. Section 2 surveys the basic (standard) version of CGP. Benchmark problems are presented in Section 3. The proposed modification of CGP is formulated in Section 4. The results of experiments are summarized in Section 5. Section 6 deals with the analysis of results on the basis of measurement of non-destructive mutations. Finally, conclusions are given in Section 7.
2 Cartesian Genetic Programming
Cartesian Genetic Programming is a widely-used method for the evolution of digital circuits [9, 1]. In CGP, a candidate entity (circuit) is modeled as an array of nc (columns) × nr (rows) programmable nodes (gates). The number of inputs, ni, and outputs, no, is fixed. Each node input can be connected either to the output of a node placed in the previous l columns or to one of the program inputs. The l-back parameter, in fact, defines the level of connectivity and thus reduces/extends the search space. For example, if l = 1 only neighboring columns may be connected; if nr = 1 and l = nc then full connectivity is enabled. Feedback is not allowed. Each node is programmed to perform one of the na-input functions defined in the set Γ (nf denotes |Γ|). Each node is encoded using na + 1 integers, where the first na values are the indexes of the input connections and the last value is the function code. Every individual is encoded using nc · nr · (na + 1) + no integers. Figure 1 shows an example of a candidate circuit and its chromosome.
Fig. 1. An example of a candidate circuit in CGP and its chromosome (1,2,1; 1,2,2; 4,2,5; 3,4,3; 6,1,2; 0,5,5; 7,6): l = 3, nc = 3, nr = 2, ni = 3, no = 2, na = 2, Γ = {NOR (1), XOR (2), AND (3), NAND (4), NOT (5)}
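To make the encoding concrete, the following minimal sketch decodes a genotype of this form and evaluates it for one input assignment, using the function set and example chromosome of Fig. 1. The decoding order (three genes per node, node outputs indexed after the primary inputs, unary functions reading their first input) follows the description above; the sketch is an illustration, not the authors' implementation.

```python
# Function set from Fig. 1: function code -> two-argument operation
FUNCS = {
    1: lambda a, b: int(not (a or b)),   # NOR
    2: lambda a, b: a ^ b,               # XOR
    3: lambda a, b: a & b,               # AND
    4: lambda a, b: int(not (a and b)),  # NAND
    5: lambda a, b: int(not a),          # NOT (unary: uses the first input only)
}

def evaluate(genotype, n_inputs, n_nodes, inputs):
    """Decode a CGP genotype (3 genes per node, then output genes) and
    evaluate it for a single input assignment."""
    values = list(inputs)                          # signals 0 .. n_inputs-1
    node_genes = genotype[:3 * n_nodes]
    output_genes = genotype[3 * n_nodes:]
    for i in range(n_nodes):
        a, b, f = node_genes[3 * i: 3 * i + 3]
        values.append(FUNCS[f](values[a], values[b]))
    return [values[o] for o in output_genes]

# The chromosome from Fig. 1 (ni = 3, six nodes, two outputs):
chrom = [1, 2, 1, 1, 2, 2, 4, 2, 5, 3, 4, 3, 6, 1, 2, 0, 5, 5, 7, 6]
print(evaluate(chrom, n_inputs=3, n_nodes=6, inputs=[1, 0, 1]))
```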
CGP operates with a population of 1 + λ individuals (typically, λ is between 1 and 20). The initial population is constructed either randomly or by a heuristic procedure. Every new population consists of the best individual of the previous population and its λ offspring. The offspring individuals are created using a point mutation operator which modifies h randomly selected genes of the chromosome, where h is a user-defined value. There is one important rule for the selection of the parent. In the case when two or more individuals can serve as the parent, the individual which has not served as the parent in the previous generation will be selected as the new parent. This strategy is important because it ensures the diversity of the population [7]. The algorithm is terminated when the maximum number of generations is exhausted or a sufficiently working solution is obtained. Because we will deal with digital circuit evolution, let us consider the fitness function for that case only. The goal is to obtain a perfectly working circuit
(all assignments to the inputs have to be tested) with the number of gates as low as possible. Additional criteria can be included; however, we will not deal with them in this paper. The most effective strategy for the fitness calculation proposed so far is as follows. The fitness value of a candidate circuit is defined as [3]:

$$ fit_1 = \begin{cases} b, & \text{when } b < n_o 2^{n_i} \\ b + (n_c n_r - z), & \text{otherwise,} \end{cases} \qquad (1) $$

where b is the number of correct output bits obtained as the response for all possible assignments to the inputs, z denotes the number of gates utilized in a particular candidate circuit and $n_c \cdot n_r$ is the total number of available gates. It can be seen that the last term $n_c n_r - z$ is considered only if the circuit behavior is perfect, i.e. $b = b_{max} = n_o 2^{n_i}$. We can observe that the evolution has to discover a perfectly working solution first, while the size of the circuit is not important. Then, the number of gates is optimized. The encoding used in CGP is redundant since there may be genes that are entirely inactive. These genes do not influence the phenotype, and hence the fitness. This phenomenon is often referred to as neutrality. The role of neutrality has been investigated in detail [10, 6, 7]. For example, it was found that the most evolvable representations occur when the genotype is extremely large and over 95% of the genes are inactive [7]. On the other hand, Collins has shown that for some specific problems the neutrality-based search is not the best solution [11]. Miller has also identified that the problem of bloat is insignificant for CGP [12].
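Equation (1) translates directly into a few lines of code. The sketch below assumes that b and z have already been obtained by simulating the candidate circuit over all input assignments (for example with an evaluator like the one sketched after Fig. 1).

```python
def fit1(b, z, n_i, n_o, n_c, n_r):
    """Standard CGP fitness of Eq. (1): behavioural score only until the
    circuit is fully correct, then additionally reward every unused node."""
    b_max = n_o * 2 ** n_i
    return b if b < b_max else b + (n_c * n_r - z)
```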
3 Benchmark Problems
Design of small multipliers is the most popular benchmark problem for the gate level circuit evolution. Because the direct CGP approach is not scalable it works only for 4-bit multipliers (i.e. 8-input/8-output circuits) and smaller. Table 1 summarizes the best known results for various multipliers according to [1, 2]. CGP was used with two-input gates, l = nc, λ = 4, h = 3; the remaining parameters are given in Table 1. CGP was seeded using conventional designs. The fitness function was constructed according to equation 1. CGP is capable of creating innovative designs for this class of circuits. However, it is important to carefully initialize CGP parameters. For example, in order to reduce the search space the function set should contain just the logic functions that are important for multipliers (the solutions denoted as Best CGP in Table 1 were obtained using Γ = {x AND y, x XOR y, (not x) AND y}). However, the gate (not x) AND y is not usually considered as a single gate in digital design. Its implementation is constructed using two gates: AND and NOT. Hence we also included 'Recalc. CGP' in Table 1, which is the result recalculated when one considers (not x) AND y as two gates in the multipliers shown in [2].
Table 1. The number of two-input gates in multipliers according to [1, 2]

Multiplier   Best conv.   Best CGP   Recalc. CGP   nr × nc   Max. gener.
2b×2b             8            7           9        1×7          10k
3b×2b            17           13          14        1×17         200k
3b×3b            30           23          25        1×35         20M
4b×3b            47           37          44        1×56         200M
4b×4b            64           57          67        1×67         700M
For further comparison of the standard CGP and proposed method we have selected 16 circuits from the LGSynth91 benchmark suite [13] (see Table 4). In this case we have utilized CGP in the postsynthesis phase, i.e. CGP is employed to reduce the number of gates in already synthesized circuits. In this paper, we have used the ABC tool to perform (conventional) synthesis [14]. Each circuit is represented as a netlist of gates in the BLIF format (Berkeley Logic Interchange Format).
4 The Proposed Modification of CGP
From the perspective of this paper, the fitness function and selection strategy are the most interesting features of the standard CGP. Because the (1 + λ) strategy is used, the highest-scored individual p (whose fitness value will be denoted fp) is always preserved. The result of evolution is then just the highest-scored individual of the last generation in the standard CGP. Consider a situation in which a fully working circuit has already been obtained (b = bmax) and the number of gates is optimized now. If the mutation operator creates an individual x with the fitness value fx and fx ≥ fp then x will become the new parental solution p (assuming that there is no better result of mutation in the population). However, if the mutation operator creates an individual y with the fitness value fy and (fy < fp) ∧ (fy ≥ bmax) then p will be selected as the parent for the new population and y will be discarded (assuming that the fitness values of other solutions are lower than fy). In this way, many new fully functional solutions, albeit slightly worse than the parent, are lost. We will demonstrate in Section 5 that considering an individual y for which the property (fy < fp) ∧ (fy ≥ bmax) holds as a new parent is beneficial for an efficient search process. The new selection strategy and fitness function are proposed only for the situation when the number of gates is optimized, i.e. the fitness value of the best individual is higher than or equal to bmax. Otherwise, the algorithm works as the standard CGP. As the best individual found so far will not be copied to the new population automatically, it is necessary to store it in an auxiliary variable. Let β denote the best discovered solution and let fβ be its fitness value. In the first population, β is initialized using p. Assume that x1 . . . xλ are individuals (with fitness values fx1 . . . fxλ) created from the parental solution p using the mutation operator and fβ ≥ bmax (i.e. we are in the gate reduction phase now). Because the best individual β and parental
individual p are not always identical, we have to determine their new instances β′ and p′ separately. The best-discovered solution is defined as:

$$ \beta' = \begin{cases} \beta, & \text{when } f_\beta \ge f_{x_i},\ i = 1 \dots \lambda \\ x_j, & \text{otherwise,} \end{cases} \qquad (2) $$

where $x_j$ is the highest-scored individual for which $f_{x_j} > f_\beta$ holds. If multiple individuals exist in $\{x_1 \dots x_\lambda\}$ that have a higher fitness than $f_\beta$, the best one of them is chosen at random. The new parental individual is defined as:

$$ p' = \begin{cases} p, & \text{when } \forall i,\ i = 1 \dots \lambda : f_{x_i} < b_{max} \\ x_j, & \text{otherwise,} \end{cases} \qquad (3) $$

where $x_j$ is a randomly selected individual from those in $\{x_1 \dots x_\lambda\}$ which obtained a fitness score higher than or equal to $b_{max}$. In other words, the new parent must be a fully functional solution; however, the number of gates is not important for its selection. Note that the result of evolution is no longer p but β. The proposed strategy will be denoted fit2.
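One generation of the proposed selection, once the gate-reduction phase has been entered (fβ ≥ bmax), can be sketched as follows; fitness() is assumed to implement Eq. (1), and the random tie-breaking follows Eqs. (2) and (3). This is an illustrative reading of the strategy, not the authors' code.

```python
import random

def select(beta, parent, offspring, b_max, fitness):
    """One generation of the modified selection (Eqs. (2) and (3))."""
    # Eq. (2): beta tracks the best (smallest) fully functional circuit found so far.
    better = [x for x in offspring if fitness(x) > fitness(beta)]
    if better:
        best_f = max(fitness(x) for x in better)
        beta = random.choice([x for x in better if fitness(x) == best_f])
    # Eq. (3): the new parent is any fully functional offspring, regardless of size.
    functional = [x for x in offspring if fitness(x) >= b_max]
    parent = random.choice(functional) if functional else parent
    return beta, parent
```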
5 Results
5.1 Experimental Setup
CGP is used according to its definition in Section 2. In this paper, we always use nr = 1 and l = nc. The initial population is generated either randomly or using a solution obtained from a conventional synthesis method. If CGP is applied as a postsynthesis optimizer then the number of gates of the result of conventional synthesis is denoted as m (it is assumed that each of the gates has up to γ inputs). Then CGP will operate with the parameters nc = m, nr = 1, l = nc, na = γ. In all experiments λ = 14, γ = 2 and h is between 1 and 14 (the mean value is 7). We have used Γ = {and, or, not, nand, nor, xor, identity, const1, const0} where not and identity are unary functions (taking the first input of the gate) and constk is a constant generator with the value k. Each experiment is repeated ten times with a 100 million generation limit. In all experiments the standard fitness function of CGP (denoted fit1) is compared with the method presented in Section 4 (denoted fit2).
5.2 Evolution from a Random Population
In the first experiment, we have evolved multipliers with up to four-bit operands from a randomly generated initial population. Following the recommendations of [7], we intentionally allowed relatively long chromosomes to be used by CGP. The nc values were set on the basis of ABC synthesis (see the seed column in Table 3). Table 2 summarizes the number of gates (the best and mean values), the mean number of generations to reach bmax and the success rate for fit1 and fit2.
Table 2. The best-obtained and mean number of gates for the multiplier benchmarks when CGP starts from a randomly generated initial population

Circuit   Alg.   nc    gates (best)   gates (mean)   mean # gener.   succ. runs
2b×2b     fit1     7        7              7               2 738        100%
2b×2b     fit2              7              7               2 777        100%
3b×2b     fit1    16       13             13             651 297        100%
3b×2b     fit2             13             13             741 758        100%
3b×3b     fit1    57       25           27.7             476 812        100%
3b×3b     fit2             23           23.4             625 682        100%
4b×3b     fit1   125       46           52.7           2 714 891        100%
4b×3b     fit2             37           43.1           4 271 179        100%
4b×4b     fit1   269      110          128.3          29 673 418         90%
4b×4b     fit2             60          109.4          37 573 311         70%
As the design of 2b×2b and 3b×2b multipliers is easy for CGP, we will mainly analyse the results for the larger problem instances (here and in the next sections). It can be seen that fit2 gives better results than fit1. However, the mean number of generations is higher for fit2. We have obtained an almost identical minimum number of gates when compared with [2] (also given in Table 1, Best CGP), even when CGP is randomly initialized and a non-problem-specific set of gates is utilized.
5.3 Post-synthesis Optimization
The second set of experiments compares fit1 and fit2 when CGP is applied to reduce the number of gates in already functional circuits. We compared three approaches to seeding the initial population in the case of multipliers. The resulting multipliers of the ABC tool are taken as seeds in the first group of experiments (denoted 'seed: ABC' in Table 3). The second group of experiments is seeded using the best multipliers reported in [2] (denoted 'seed: Tab. 1' in Table 3). The seeds of the third group of experiments are created manually as combinational carry-save multipliers according to [15] (denoted 'seed: CM' in Table 3). Table 3 shows that fit2 can produce more compact designs than fit1 (see the 'best' column). The mean number of gates is given at generations 1M, 2M, 5M, 10M, 20M, 50M and 100M (M = 10^6). It can be seen that the best solution improves over time. The best-evolved multiplier (4b×4b) is composed of 56 gates taken from Γ (which does not treat the AND gate with one input inverted as a single gate). The best circuit presented in [2] consists of 57 gates in the gate set used there, i.e., 67 gates when only gates of Γ are counted. We can also express the implementation cost in terms of transistors used: while the 56-gate multiplier is composed of 400 transistors, the multiplier reported in [2] consists of 438 transistors. It is assumed that the number of transistors required to create a particular gate is as follows: nand (4 tr.), nor (4 tr.), or (6 tr.), and (6 tr.), not (2 tr.) and xor (10 tr.) [15].
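As a small worked illustration of the transistor cost model quoted above from [15] (the gate-type breakdown of the evolved 56-gate multiplier is not given here, so the histogram below is purely hypothetical):

```python
# Per-gate transistor costs as quoted from [15].
TRANSISTORS = {"nand": 4, "nor": 4, "or": 6, "and": 6, "not": 2, "xor": 10}

def transistor_cost(gate_histogram):
    """Total transistor count of a circuit given its gate-type histogram."""
    return sum(TRANSISTORS[g] * n for g, n in gate_histogram.items())

# Hypothetical 56-gate histogram (for illustration only, not the evolved design):
example = {"xor": 16, "and": 20, "nand": 10, "or": 6, "not": 4}
print(sum(example.values()), "gates ->", transistor_cost(example), "transistors")
```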
Table 3. The best-obtained and mean number of gates in generations 1M...100M for the multiplier benchmarks when CGP is seeded by functional solutions of different types

seed: ABC
Circuit   Alg.   seed   best     1M      2M      5M     10M     20M     50M    100M
2b×2b     fit1     17      7      7       7       7       7       7       7       7
2b×2b     fit2             7      7       7       7       7       7       7       7
3b×2b     fit1     16     13     13      13      13      13      13      13      13
3b×2b     fit2            13     13      13      13      13      13      13      13
3b×3b     fit1     57     26   38.2    36.1    34.3    32.6      31    29.8    28.7
3b×3b     fit2            23   31.5    28.8    27.2      25    24.5    24.2    23.5
4b×3b     fit1    125     54   93.2    88.3    79.3    75.6    71.6    66.6    64.4
4b×3b     fit2            37     80      68    55.9    49.9    46.9    44.1    41.1
4b×4b     fit1    269    140  212.4   190.6   178.9   170.9   165.2   158.5   152.4
4b×4b     fit2            68  218.2   182.2   151.3   136.5   121.2     107    93.3

seed: Tab. 1
Circuit   Alg.   seed   best     1M      2M      5M     10M     20M     50M    100M
2b×2b     fit1      9      7      7       7       7       7       7       7       7
2b×2b     fit2             7      7       7       7       7       7       7       7
3b×2b     fit1     14     13     13      13      13      13      13      13      13
3b×2b     fit2            13     13      13      13      13      13      13      13
3b×3b     fit1     25     23     25      25    24.7    23.9    23.5    23.2    23.1
3b×3b     fit2            23     25      25    24.7    24.4    24.2    23.5    23.1
4b×3b     fit1     44     36   38.5    37.8    37.1    36.8    36.8    36.4    36.3
4b×3b     fit2            35   37.9    37.1    36.5    36.4    36.2    36.2    36.1
4b×4b     fit1     67     57   59.6    58.8      58    57.8    57.5    57.3    57.1
4b×4b     fit2            56   59.5    59.2    58.7    58.3    57.2    56.8    56.8

seed: CM
Circuit   Alg.   seed   best     1M      2M      5M     10M     20M     50M    100M
2b×2b     fit1      8      7      7       7       7       7       7       7       7
2b×2b     fit2             7      7       7       7       7       7       7       7
3b×2b     fit1     17     13     13      13      13      13      13      13      13
3b×2b     fit2            13     13      13      13      13      13      13      13
3b×3b     fit1     30     23     28      28      28    27.8    27.6    26.5    25.8
3b×3b     fit2            23     28      28    27.6    26.8      25    24.4    23.4
4b×3b     fit1     45     37     43      43      43    42.4    41.9    40.6    39.2
4b×3b     fit2            37     43      43    42.6    42.2    41.5    39.9    38.4
4b×4b     fit1     64     59   62.9    62.6    62.6    62.3    61.5    60.6    60.2
4b×4b     fit2            59   62.9    62.9    62.8    62.4      62    61.3    60.8
Table 4 gives the best-obtained and mean number of gates for the LGSynth91 benchmark circuits when CGP is seeded by already-working circuits. The working circuits (of the size given by nc) were obtained using ABC initialized with the original LGSynth91 circuits (in the BLIF format) and mapped onto two-input gates of Γ. The 'exp. gates' column is the estimated number of two-input gates (after conventional synthesis) given in [13]. It can be seen that fit2 is more successful than fit1. In general, CGP gives better results than 'exp. gates' because it does not employ any deterministic synthesis algorithm; all the optimizations are done implicitly, without any structural biases.
Fig. 2. a) The number of gates of the parent individual (from the best run for the 4b×4b multiplier); b) The mean number of gates of the best-obtained individuals β (from 10 runs for the 4b×4b multiplier)

Table 4. The best-obtained and mean number of gates for the LGSynth91 benchmarks when CGP starts from the initial solution (of size nc) synthesized using ABC

Circuit    ni  no  exp. gates  nc (seed)  gates fit1 (best)  gates fit2 (best)  gates fit1 (mean)  gates fit2 (mean)
9symml      9   1      43        216            53                 23                68.5               25.5
C17         5   2       6          6             6                  6                 6                  6
alu2       10   6     335        422           134                 73               149                 89.4
alu4       14   8     681        764           329                274               358                279
b1          3   4       -         11             4                  4                 4                  4
cm138a      6   8      13         19            16                 16                16                 16
cm151a     12   2      17         34            24                 23                24                 23
cm152a     11   1      33         24            22                 21                22.1               21.8
cm42a       4  10      17         20            17                 17                17                 17
cm82a       5   3      27         12            10                 10                10                 10
cm85a      11   3      38         41            23                 22                24.1               22
decod       5  16      22         34            30                 26                30                 26.1
f51m        8   8      43        146            29                 26                32.9               27.3
majority    5   1       9         10             8                  8                 8                  8
x2         10   7      42         60            27                 27                29.6               27.4
z4ml        7   4      20         40            15                 15                15                 15
Figure 2a shows the number of gates of the parent individual p at every 1000th generation during the evolution of the 4b×4b multiplier using fit1 and fit2 (taken from the best runs; seeded by ABC). It can be seen that the parent is different from the best-obtained solution for fit2 (the curve is not monotonic). We can also observe that fit1 provides a better result than fit2 in the early stages of the evolution. However, fit2 outperforms fit1 when more generations are allowed for evolution. Figure 2b shows the mean number of gates of the best-obtained individuals (averaged over 10 independent runs).
6 Analysis
We have seen so far that selecting the parent individual solely on the basis of its functionality (and thus neglecting the number of gates) provides slightly better results at the end of evolution (when the goal is to reduce the phenotype size) than the standard CGP. Why does this approach work? Recall that the fitness landscape is rugged and neutral in the case of digital circuit evolution using CGP [6, 8]. Hence, relatively simple mutation-based search algorithms are more successful than sophisticated search algorithms and genetic operators such as those developed in the field of genetic algorithms and estimation of distribution algorithms. In the standard CGP, generating the offspring individuals is biased towards the best individual that has been discovered so far. The best individual is changed only if a better or equally-scored solution is found. In the proposed method, the changes of the parent individual are more frequent because the only requirement for a candidate individual to qualify as the parent is to be fully functional. Hence, we consider the proposed algorithm to be more explorative than the standard CGP. Our hypothesis is that if a high degree of redundancy is present in the genotype, the proposed method will generate more functionally correct individuals than the standard CGP; and because the fitness landscape is rugged and neutral, the proposed method is more efficient in finding compact circuit implementations than the standard CGP. In order to verify this hypothesis we have measured the number of mutations that lead to functionally correct circuits. When CGP is seeded with a working circuit, we have in fact measured the number of neutral and useful mutations. Figure 3 compares the results for fit1 and fit2 in the experiments that are reported in Table 2 and Table 3. The y-axis is labeled MNM, which stands for 'Millions of Non-destructive Mutations'. For the small multipliers (2b×2b, 3b×2b) fit1 always yields a higher MNM, which contradicts our hypothesis. However, we have already noted that these very small multipliers are not interesting because the problem is easy and an optimal solution can be discovered very quickly. In the case of more difficult circuits, fit2 provides a higher MNM in most cases, especially when sufficient redundancy is available (see Fig. 3a, d). When the best resulting multipliers of [2] are used to seed the initial population, the MNM of fit1 is always higher than that of fit2 (see Fig. 3b). This is consistent with the view that CGP (with almost zero redundancy in the genotype) has become stuck at a local optimum and fit2 has no room to work. The number of non-destructive mutations was counted in every 1000 generations and the resulting value was plotted as a single point in Fig. 4a (3b×3b multiplier) and Fig. 4b (4b×4b multiplier). The best run seeded using ABC is shown in both cases. It is evident that significantly more correct individuals have been generated for fit2 on average. It can also be seen that while fit1 tends to create a relatively stable number of correct individuals over time (the dispersion is approx. 200 individuals for the 4b×4b multiplier), great differences are observable in the number of correct individuals for fit2 (the dispersion is approx. 1000 individuals for the 4b×4b multiplier). This also supports the idea that the search performed by fit1 is biased.
Fig. 3. Millions of Non-destructive Mutations (MNM) for different experiments (mean values given). The four panels plot MNM for fit1 and fit2 over the 2b×2b to 4b×4b multipliers for the ABC-seeded, Table-1-seeded, randomly initialized and combinational-multiplier-seeded experiments.
Fig. 4. The number of non-destructive mutations per 1000 generations for: a) 3b×3b multiplier; b) 4b×4b multiplier
7 Conclusions
In this paper, we have shown that the selection of the parent individual on the basis of its functionality instead of compactness leads to smaller phenotypes at the end of evolution. The method is especially useful for the optimization of
nontrivial circuits when sufficient redundancy is available in terms of available gates and sufficient time is allowed for evolution. In future work we plan to test the proposed method for reducing the size of the phenotype in symbolic regression problems.
Acknowledgments This work was partially supported by the grant Natural Computing on Unconventional Platforms GP103/10/1517, the BUT FIT grant FIT-10-S-1 and the research plan Security Oriented Research in Information Technology MSM 0021630528.
References [1] Miller, J.F., Job, D., Vassilev, V.K.: Principles in the Evolutionary Design of Digital Circuits – Part I. Genetic Programming and Evolvable Machines 1(1), 8–35 (2000) [2] Vassilev, V., Job, D., Miller, J.: Towards the Automatic Design of More Efficient Digital Circuits. In: Proc. of the 2nd NASA/DoD Workshop on Evolvable Hardware, pp. 151–160. IEEE Computer Society, Los Alamitos (2000) [3] Kalganova, T., Miller, J.F.: Evolving more efficient digital circuits by allowing circuit layout evolution and multi-objective fitness. In: The First NASA/DoD Workshop on Evolvable Hardware, pp. 54–63. IEEE Computer Society, Los Alamitos (1999) [4] Gajda, Z., Sekanina, L.: Reducing the number of transistors in digital circuits using gate-level evolutionary design. In: 2007 Genetic and Evolutionary Computation Conference, pp. 245–252. ACM, New York (2007) [5] Gajda, Z., Sekanina, L.: When does cartesian genetic programming minimize the phenotype size implicitly? In: Genetic and Evolutionary Computation Conference. ACM, New York (2010) (accepted) [6] Vassilev, V.K., Miller, J.F.: The advantages of landscape neutrality in digital circuit evolution. In: Miller, J.F., Thompson, A., Thompson, P., Fogarty, T.C. (eds.) ICES 2000. LNCS, vol. 1801, pp. 252–263. Springer, Heidelberg (2000) [7] Miller, J.F., Smith, S.L.: Redundancy and Computational Efficiency in Cartesian Genetic Programming. IEEE Transactions on Evolutionary Computation 10(2), 167–174 (2006) [8] Miller, J.F., Job, D., Vassilev, V.K.: Principles in the Evolutionary Design of Digital Circuits – Part II. Genetic Programming and Evolvable Machines 1(3), 259–288 (2000) [9] Miller, J., Thomson, P.: Cartesian Genetic Programming. In: Poli, R., Banzhaf, W., Langdon, W.B., Miller, J., Nordin, P., Fogarty, T.C. (eds.) EuroGP 2000. LNCS, vol. 1802, pp. 121–132. Springer, Heidelberg (2000) [10] Yu, T., Miller, J.F.: Neutrality and the evolvability of boolean function landscape. In: Miller, J., Tomassini, M., Lanzi, P.L., Ryan, C., Tetamanzi, A.G.B., Langdon, W.B. (eds.) EuroGP 2001. LNCS, vol. 2038, pp. 204–217. Springer, Heidelberg (2001) [11] Collins, M.: Finding needles in haystacks is harder with neutrality. In: GECCO 2005: Proceedings of the 2005 conference on Genetic and evolutionary computation, pp. 1613–1618. ACM, New York (2005)
[12] Miller, J.: What bloat? Cartesian genetic programming on Boolean problems. In: 2001 Genetic and Evolutionary Computation Conference Late Breaking Papers, pp. 295–302 (2001) [13] Yang, S.: Logic Synthesis and Optimization Benchmarks User Guide, Version 3.0 (1991) [14] Berkeley Logic Synthesis and Verification Group: ABC: A System for Sequential Synthesis and Verification [15] Weste, N., Harris, D.: CMOS VLSI Design: A Circuits and Systems Perspective, 3rd edn. Addison-Wesley, Reading (2004)
Introducing Flexibility in Digital Circuit Evolution: Exploiting Undefined Values in Binary Truth Tables

Ricky D. Ledwith and Julian F. Miller

Dept. of Electronics, The University of York, York, UK
[email protected], [email protected]
Abstract. Evolutionary algorithms can be used to evolve novel digital circuit solutions. This paper proposes the use of flexible target truth tables, allowing evolution more freedom where values are undefined. This concept is applied to three test circuits with different distributions of “don’t care” values. Two strategies are introduced for utilising the undefined output values within the evolutionary algorithm. The use of flexible desired truth tables is shown to significantly improve the success of the algorithm in evolving circuits to perform this function. In addition, we show that this flexibility allows evolution to develop more hardware efficient solutions than using a fully-defined truth table. Keywords: Genetic Programming (GP), Evolutionary Algorithms, Cartesian Genetic Programming (CGP), Evolvable Hardware, “Don’t Care” Logic.
1 Introduction

The design of digital circuits using evolutionary algorithms has attracted interest [1, 2, 3, 14, 15]. In this paper the evolutionary design of digital combinational circuits is considered using the established technique of Cartesian Genetic Programming (CGP) [4]. However, for the first time as far as the authors are aware, this paper takes account of unspecified logic terms. These unspecified values are referred to as "don't cares", and often occur in the design of finite state machines and in logic synthesis for machine learning [5]. In CGP, genotypes are represented as a list of integers mapped to directed graphs, as opposed to the more typical tree mapping structure. This provides a general framework for solving a range of problems, which has been proven effective in multiple areas, including the evolution of combinational digital circuits. The evolution of digital circuits utilises a version of CGP where the behaviour of nodes is characterised by Boolean logic equations. A genotype is mapped to a phenotype by realisation of the digital circuit constructed from the nodes (and connections) encoded within the genotype. Since not all of the nodes will have connections that influence the outputs, either directly or indirectly, some of the nodes do not contribute to the resulting circuit. This introduces a level of neutrality to CGP, whereby multiple genotypes are mapped to the same phenotype and hence have equal fitness values. In this paper extrinsic evolution is employed, whereby circuit phenotypes are evaluated in software. An assemble-and-test approach is used, where the phenotype circuit is constructed from its components and simulated. The binary truth table of the assembled
circuit is then compared with the desired circuit truth table. The fitness function performs this comparison, with the fitness being the number of correct output bits in the table. Extrinsic evolution is accepted by many to be most suited to digital circuit evolution, as it has the advantage of providing symbolic solutions that can be implemented on a variety of devices. This method is used by Miller et al. in [1] and [3]. Limitations of this system arise when attempting to evolve a circuit for which there are outputs whose value is not specified for a given input pattern. Since an assemble-and-test strategy is being used, the entire truth table must be encoded and provided to the program at run-time to be available for the comparison tests. This requires each output value to be specified for all possible input combinations, and hence "don't care" values must be assigned a value. Arbitrarily selecting a value for these situations restricts the evolution of the circuit by forcing the program to evolve solutions which satisfy the entire truth table, including those values which are unspecified in the real world. This investigation looks at the potential improvements that can be achieved with the use of "don't care" logic in the desired truth table, by modifying the fitness function to allow this flexibility. Small test circuits are studied in this paper to provide a first investigation of the utility of "don't care" values in the evolutionary design of digital circuits. This paper is organised as follows. Section 2 details how digital circuits are encoded and evolved using CGP. Section 3 introduces example application areas of "don't care" logic, and provides the test problems for use in evolution trials. In Section 4 the changes to CGP that allow exploitation of undefined truth table values are described. Results of the changes on evolution of the example circuits are given in Section 5, and conclusions are drawn in Section 6.
2 Digital Circuit Evolution

2.1 Genotype Encoding

The digital circuit encoding used in this paper has been developed and improved over a number of years by Miller et al., as seen in [1][3]. A digital circuit is considered as a specific case of the general acyclic directed graph model used in Cartesian Genetic Programming [4]. A graph in CGP is seen as a rectangular array of nodes, characterised by the number of columns, the number of rows, and levels-back. The number of nodes available to the algorithm is the product of the two graph dimensions, the number of columns and the number of rows. The levels-back parameter specifies the maximum number of columns to the left of a node from which its inputs can originate. It also controls how many columns, counted from the right-hand side of the grid, the outputs can be taken from. Nodes cannot be connected to nodes within the same column. The graph has a feed-forward structure, whereby a node may not receive inputs from nodes in columns to its right. Fig. 1 displays these values diagrammatically, showing an example of a 5 by 4 array with levels-back of 3, where node 21 receives inputs from nodes 10 and 12, both within 3 columns to the left.
Fig. 1. Visual representation of an example array of nodes as used in CGP. Example has 4 inputs, 5 columns, 4 rows, levels-back value of 3 (shown as dotted box relative to node 21).
Each individual node is described by its inputs, output and function. The output from each node, and the provided input data, is sequentially indexed from zero, as seen in Fig. 1. All nodes utilised in this paper require 2 inputs, and their single-output functions are described by the Boolean logic equations in Table 1. The allowed node functions were selected fairly arbitrarily, although provided they are kept constant over all tests this is sufficient for comparisons to be made. All possible functions for a node are independently indexed. This separate sequential integer indexing for outputs and functions allows a single node to be fully described by its output index and 3 integer values: input1, input2, function. The genotype encoding maps the 2-dimensional graph to a flat list of integers. Node output indexing is sequential within this list, beginning with the first integer index after the inputs; this removes the need to index each node within the genotype encoding, since it is inherent in the node's location within the list. The outputs are specified at the end of the genotype as a list of integers specifying the node outputs to be used.

Table 1. Allowed node functions, subset of those used by Miller [1]
AND        OR       XOR
a·b        a+b      a⊕b
a·b̄                 a⊕b̄
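The following minimal sketch (not the authors' code) shows how such a flat integer genotype can be decoded and evaluated; the function list mirrors the set sketched in Table 1, and the example genotype at the end is purely illustrative.

```python
FUNCTIONS = [
    lambda a, b: a & b,          # a.b (AND)
    lambda a, b: a | b,          # a+b (OR)
    lambda a, b: a ^ b,          # a xor b (XOR)
    lambda a, b: a & (b ^ 1),    # a.b'
    lambda a, b: (a ^ b) ^ 1,    # a xnor b
]

def evaluate(genotype, inputs, n_nodes, n_outputs):
    """Decode the flat list [in1, in2, func, ...] + [output indices] and
    evaluate it on one vector of single-bit inputs."""
    signals = list(inputs)                                    # indices 0..n_inputs-1
    node_genes = genotype[:3 * n_nodes]
    output_genes = genotype[3 * n_nodes:3 * n_nodes + n_outputs]
    for i in range(n_nodes):
        a, b, f = node_genes[3 * i:3 * i + 3]
        signals.append(FUNCTIONS[f](signals[a], signals[b]))  # node index = n_inputs + i
    return [signals[o] for o in output_genes]

# Illustrative genotype: 2 inputs, 2 nodes, 1 output.
# Node 2 = XOR(input 0, input 1); node 3 = AND(input 0, node 2); output from node 3.
print(evaluate([0, 1, 2, 0, 2, 0, 3], [1, 1], n_nodes=2, n_outputs=1))   # -> [0]
```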
2.2 Fitness Evaluation

To calculate the fitness of a genotype, the evolved circuit's outputs are compared with the desired outputs as specified in a truth table. To perform this comparison, the CGP program makes efficient use of the processor by carrying out comparisons on multiple lines of a truth table simultaneously. This technique was introduced by R. Poli [6], and treats a 32-bit processor as 32 individual 1-bit processors for simple logic functions. Since bit comparison can be achieved using simple logic functions (see Section 4.1), this technique can be exploited to carry out comparisons of up to 32 lines of a truth table in just a few single-cycle operations. The genotype fitness is then defined as the total number of correct output bits in the resulting phenotype. For this to be achieved, the desired truth table must be provided in a 32-bit representation within the configuration file which describes the intended system.

2.3 The Evolutionary Algorithm

A form of the (1 + λ)-ES evolutionary algorithm discussed by Bäck et al. [7] is used throughout this paper. This strategy has also been used by Miller et al. [1][3] and has been shown to produce good results. The algorithm implements neutral search: if a parent and offspring have equal fitness, the offspring is always chosen, in the interests of finding neutral solutions. Neutral search has been shown to be crucial to the efficiency of CGP [4][8]. The algorithm can be described by the following steps (a minimal sketch of this loop is given after the list):

1. Randomly initialize a population of λ valid genotypes, where the constraints discussed in Section 2.1 are adhered to.
2. Evaluate the fitness of each genotype.
3. Identify the fittest genotype, giving priority to offspring if parent and offspring have equal fitness. Copy the fittest genotype into the new population to become the new parent.
4. Produce (λ – 1) offspring to fill the population by creating mutated versions of the parent genotype.
5. Destroy the old population and return to step 2 using the new population, unless a perfect solution or the maximum number of generations has been reached.
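A compact sketch of the loop described by steps 1–5 above (our own reading of the algorithm, not the authors' code); the preference for offspring over an equally fit parent is what permits neutral drift.

```python
def evolve(random_genotype, mutate, fitness, pop_size=5, max_fitness=None, max_gens=100_000):
    population = [random_genotype() for _ in range(pop_size)]            # step 1
    parent = population[0]
    for gen in range(max_gens):
        scored = [(fitness(g), i) for i, g in enumerate(population)]     # step 2
        # step 3: among equally fit genotypes the later (offspring) index wins,
        # so an offspring is preferred over an equally scored parent.
        best_f, best_i = max(scored)
        parent = population[best_i]
        if max_fitness is not None and best_f >= max_fitness:
            return parent, gen                                           # perfect solution found
        population = [parent] + [mutate(parent) for _ in range(pop_size - 1)]  # steps 4-5
    return parent, max_gens
```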
3 Problem Space

This investigation into the use of "don't care" logic will be tested by attempting to evolve the circuits for three problem areas.

3.1 Quotient and Remainder Hardware Divider

Division in microprocessors is most often performed by algorithms such as "shift and subtract" [9] or SRT (Sweeney, Robertson, and Tocher). Faster algorithms can also be used, such as Newton-Raphson and Goldschmidt, both of which are implemented in some AMD processors [10]. This paper, however, looks at developing a simple divider implemented entirely in hardware by standard logic gates. This circuit is selected as it demonstrates a clearly apparent and understandable existence of undefined
outputs, since calculations involving a division by zero are mathematically undefined. The divider will take the form of a quotient and remainder divider, with a single status output for the divide-by-zero (DIV/0) error. For 2 inputs A and B, where B is non-zero, this circuit will compute outputs Q and R to satisfy the following equation:

    A = (Q × B) + R, with 0 ≤ R < B.                        (1)

For the case where B is equal to zero the solution is undefined and the status output D goes active (defined as '1' for this case). At this point all of the bits in both output buses Q and R are undefined. As an initial investigation into the potential performance gains of utilizing "don't care" logic, and in order to keep the complexity of the tests low, this paper only considers a 2-bit divider. The 2-bit divider has 4 single-bit inputs (A1, A0, B1, B0) and 5 single-bit outputs (Q1, Q0, R1, R0, D); a short enumeration of its truth table is sketched at the end of this section. The efficiency of evolution making use of "don't care" logic will be compared against using fully-defined logic.

3.2 Finite State Machine Logic

"Don't care" states often arise when designing next-state and output logic for a finite state machine (FSM). Each state in the FSM must be assigned a binary value, and hence if the number of states is not an exact power of two there will be unused binary values. These unused values will result in entire "don't care" rows in the truth table. The FSM used in this paper is of a Mealy structure, where the output(s) depend on both the current state and the current input pattern. The logic to be designed is required to produce both the next-state value and the output. The design for the FSM was chosen from the benchmarks for the 1991 International Workshop on Logic Synthesis, referred to as the LGSynth91 benchmarks [11]. The selected FSM benchmark dk27 has 7 states, 1 input and 2 outputs. The state assignment is therefore 3-bit, and one value is unused (chosen as 000). To keep the complexity low, the 2 outputs in the dk27 circuit were flattened into a single-bit output. With the single-bit input and 3-bit state assignment this results in a circuit with 4 inputs and 4 outputs, and two rows of "don't care" values.

3.3 Distributed Don't Cares

The previous test cases both result in clusters of "don't care" values, where all or most of a row is undefined for specific input patterns. In order to ensure the experimental results are reflective of a range of circuits, this test case comprises a truth table designed under the constraint of a maximum of one "don't care" value per truth table row. The circuit was chosen to have 4 inputs and 4 outputs to match the finite state machine. The outputs were randomly generated, with ones and zeros having equal probability. The "don't care" states were also generated randomly, with the probability of a "don't care" within any row being 50%, and equal probabilities for each output. The resulting truth table is shown in Table 2.
Table 2. Truth table for the distributed “don't care” circuit, showing maximum of one undefined output per row
Inputs (A B C D)    Outputs (W X Y Z)
0 0 0 0             1 X 1 1
0 0 0 1             0 0 1 0
0 0 1 0             1 1 1 X
0 0 1 1             1 X 1 1
0 1 0 0             1 1 0 1
0 1 0 1             X 0 0 1
0 1 1 0             X 0 1 0
0 1 1 1             0 X 1 0
1 0 0 0             0 1 X 0
1 0 0 1             X 1 0 1
1 0 1 0             1 0 1 0
1 0 1 1             1 1 0 1
1 1 0 0             0 0 0 1
1 1 0 1             X 0 1 1
1 1 1 0             1 0 1 X
1 1 1 1             1 0 0 0
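For concreteness, the truth table of the 2-bit divider of Section 3.1 can be enumerated mechanically; the snippet below (an illustration, not part of the original experiments) marks the Q and R outputs as don't cares whenever B = 0.

```python
def divider_rows():
    """Yield (A, B, Q, R, D) rows of the 2-bit quotient/remainder divider."""
    for a in range(4):
        for b in range(4):
            if b == 0:
                yield a, b, "XX", "XX", 1        # divide-by-zero: Q and R undefined
            else:
                q, r = divmod(a, b)              # A = Q*B + R with 0 <= R < B
                yield a, b, f"{q:02b}", f"{r:02b}", 0

for a, b, q, r, d in divider_rows():
    print(f"A={a:02b} B={b:02b} -> Q={q} R={r} D={d}")
```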
4 Implementation of Don't Care Flexibility

4.1 Simple Don't Care Bitmask

In order to implement "don't care" logic, it was necessary to add a method of describing undefined states. In order to maintain efficient fitness evaluation, no changes were made to the 32-bit truth table representation method. Instead, an additional section was added describing a 32-bit bitmask for each value in the table. In this bitmask, a value of '1' indicates the truth table value is valid and fixed, and a value of '0' indicates flexibility (an undefined value). Before the comparison between the actual and desired truth table values is carried out, both undergo a logical AND operation with the bitmask. This process ensures that all undefined states appear as '0' in both the actual and desired truth tables, and hence match. This method allows for minimal changes to the fitness evaluation code, and thus minimises extra computational time. The fitness comparison for a single value thus changes from that in equation (2) to equation (3), where A is the actual output from the phenotype under evaluation, D is the desired output, and b is the bitmask:

    match = NOT(A XOR D)                                     (2)
    match = NOT((A AND b) XOR (D AND b))                     (3)

Extra efficiency can be gained if it is ruled that all undefined values are assigned the value '0' in the desired truth table; the logical AND with the bitmask is then not required for the desired truth table, resulting in equation (4). This comparison requires only one additional logical operation over the original, and hence should not slow the fitness evaluation by more than one clock cycle per 32-bit comparison.

    match = NOT((A AND b) XOR D)                             (4)

4.2 Extended Don't Care Method

The simple "don't care" method allows evolution the flexibility of exploiting all of the undefined states. The concept can, however, be extended further to allow evolution even more control over exactly how to utilise the undefined outputs.
This is achieved by appending additional genes to the chromosome, describing how to interpret each of the available undefined outputs. A simple binary gene representing whether or not to use each "don't care" was first considered; however, this method would then restrict evolution to the values encoded in the configuration file truth table. The extended version instead uses genes with 3 possible values: 0, 1, or 2. A value of zero or one specifies that the desired output should be interpreted as a '0' or '1' respectively. This effectively removes the "don't care" from the desired truth table and replaces it with a zero or one. A value of two indicates that the desired output should be considered as a "don't care" state, and treated as in the simple method. The fitness evaluation is then the same as for the simple method; however, the desired truth table row and "don't care" bitmask must be constructed for each evaluation using the "don't care" genes in the current chromosome.
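A sketch of how the two strategies could be realised on 32-bit packed truth-table words (our reading of equations (3)–(4) and of the gene interpretation above; the names and encodings are illustrative, not the authors' code):

```python
def correct_bits(actual, desired, mask, width=32):
    """Simple strategy, eq. (4): don't-care positions (mask bit 0) are forced to 0
    in the actual outputs; the desired word already stores 0 at those positions."""
    match = ~((actual & mask) ^ desired) & ((1 << width) - 1)
    return bin(match).count("1")

def apply_dc_genes(desired, mask, dc_positions, dc_genes):
    """Extended strategy: a gene of 0 or 1 fixes the corresponding don't-care bit
    to that value; a gene of 2 leaves it as a don't care (as in the simple method)."""
    for pos, gene in zip(dc_positions, dc_genes):
        if gene in (0, 1):
            mask |= 1 << pos
            desired = (desired & ~(1 << pos)) | (gene << pos)
    return desired, mask
```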
5 Evolved Data

5.1 Test Structure and Parameters

The size of the node array was not kept constant for each test case, since the differing complexities require different array sizes. However, the maximum number of generations was fixed for all tests at 100,000. For each test circuit the mutation rate was varied, with 100 runs for each mutation rate executed using: the fully defined truth table, the truth table with "don't care" bitmask using the simple strategy, and the truth table with "don't care" bitmask using the extended strategy.

5.2 Success of Evolving 2-bit Hardware Divider

The following parameters were used for evolution of the 2-bit hardware divider detailed in Section 3.1: the number of rows and columns was 4, and levels-back was also 4. The resulting genotype contains 53 genes, and therefore the minimum mutation rate for mutations to occur is 2% (1 gene per generation). The mutation rate was increased from 2% in steps of 2.0% until all runs failed to reach a perfect solution. At each mutation rate 100 runs were executed using the fully defined truth table and each strategy for the incompletely defined truth table. Fig. 2 clearly shows the improved performance of evolution using the flexible truth table compared with the fully defined truth table. It also demonstrates the superior performance of the simple strategy compared to the extended version for this circuit.

5.3 Success of Evolving FSM Next State Logic

The FSM next state and output logic is detailed in Section 3.2. The CGP grid was 6x6 with levels-back equal to 6. The resulting genotype contains 112 genes, and hence the minimum mutation rate is 1%. The mutation rate was increased in steps of 1.0% until all runs failed to reach a perfect solution. Once again, for each mutation rate, 100 runs were executed using the fully defined truth table and each strategy for the incompletely defined truth table.
The results are displayed in Fig. 3, which also shows the improved performance of evolution using the flexible truth table compared with the fully defined truth table. Once again, the simple "don't care" strategy outperforms the extended version for this circuit.
Fig. 2. Graph of the number of perfect solutions reached (out of 100 runs) by using standard and "don't care" truth tables for the 2-bit hardware divider
Fig. 3. Graph of the number of perfect solutions reached (out of 100 runs) by using standard and “don’t care” truth tables for the FSM next state and output logic
Fig. 4. Graph of the number of perfect solutions reached (out of 100 runs) by using standard and “don’t care” truth tables for the distributed “don’t care” circuit
5.4 Success of Evolving Distributed Don't Cares Circuit

Since this circuit was designed to mimic the complexity of the FSM logic, the same experimental parameters were used. The mutation rate was also varied from 1% upwards in steps of 1.0%. The results are displayed in Fig. 4, which once again supports the previous results of improved performance using the flexible truth table compared with the fully defined truth table. The simple "don't care" strategy also outperforms the extended version for this circuit.

5.5 Efficiency of Evolved Circuits

Whilst it is advantageous to consider the computational benefits of the flexible truth table, perhaps more exciting is to consider the hardware efficiency of the evolved solutions. To enable evolution to continue beyond the initial perfect solution and attempt to reduce hardware requirements, the genotype fitness for perfect circuits must be modified. The simple modification defines the fitness for perfect genotypes as the maximum fitness plus the number of redundant nodes (nodes which do not contribute to the outputs). This causes the algorithm to continue executing until the maximum number of generations is reached, attempting to reduce the number of active nodes. This algorithm was executed on each test case with the parameters and varying mutation rates given in previous sections. The array size was also varied, up to a maximum of 100 available nodes.
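An assumed form of this modified fitness (the exact bookkeeping of redundant nodes is not spelled out in the text):

```python
def size_aware_fitness(correct_bits, max_bits, total_nodes, active_nodes):
    """Perfect genotypes are rewarded for every redundant (inactive) node,
    so evolution keeps running and tries to shrink the active circuit."""
    if correct_bits < max_bits:
        return correct_bits
    return max_bits + (total_nodes - active_nodes)
```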
Conventional methods such as the Karnaugh map allow minimised Boolean equations to be obtained from a desired truth table (see [13] for a good explanation). Karnaugh maps cannot utilise the XOR operator, and as such the circuits evolved in the previous sections are expected to require fewer gates regardless of the "don't care" modifications. However, with this in mind, the Karnaugh map can still be used to identify a benchmark for the hardware requirements of the test circuits. A Karnaugh map was constructed for each of the outputs of each circuit, and the sum-of-products Boolean equations obtained. Considering only the use of 2-input gates, the required number of gates to synthesise each circuit is shown in Table 3.

Table 3. Number of 2-input gates required to synthesise the test circuits from Karnaugh map minimised sum-of-products

Circuit                    Number of 2-input gates required
2-bit Divider              16
FSM Logic                  41
Distributed Don't Cares    35
Hardware divider: The most efficient solution in terms of hardware requirements for the hardware divider was found to require 8 gates, a hardware saving of 50% compared with that found by conventional methods in Table 3. This solution used the simple "don't care" strategy. Without the "don't care" modification, the most efficient solution required 10 gates, so a hardware saving of 20% was achieved over standard CGP.

Finite State Machine Logic: The most hardware-efficient design for the finite state machine next state and output logic required 14 gates. This solution was also found using the simple "don't care" strategy, and gives a hardware saving of 26% over the most efficient solution without truth table flexibility, which required 19 gates.

Distributed Don't Cares: Once again the simple strategy outperformed the extended version for finding efficient solutions, with the smallest number of gates required being 15. Without any truth table flexibility a solution requiring 18 gates was achieved, giving a hardware saving of 17% by the "don't care" strategy.

Clearly, the extended strategy for "don't care" utilisation does not offer any benefits over the simple version for finding efficient circuits. The use of flexible truth tables does, however, have a clear advantage over standard CGP, resulting in at least a 17% reduction in hardware for all three test circuits.
6 Conclusion

The motivation behind introducing flexibility in the desired truth table has been discussed in this paper, and a method for implementing this technique using a "don't care" bitmask has been shown. Two strategies have been introduced for making use of available undefined states, although the simple strategy outperformed the extended
version for all test circuits presented. Using three circuits with incompletely defined truth tables, the use of unfixed output values has been demonstrated to increase the performance of CGP, as well as producing more hardware-efficient designs. Allowing "don't care" logic in the truth table can be thought of as increasing the potential number of perfect truth tables, and hence perfect phenotypes. Since CGP already has a many-to-one genotype-phenotype mapping, increasing the number of perfect phenotypes significantly increases the number of perfect-fitness genotypes. This greatly increases the level of neutrality in the search space, and therefore agrees with the findings of Miller et al. [4][8]. The chosen test circuits were kept small in order to keep complexity and required processing time low. Now that the technique has been proven, it could be extended to larger circuits. Note that for every addition of a "don't care" state in a binary truth table, the total number of possible truth tables which satisfy the requirement is doubled. This implies that with larger circuits, and possibly higher numbers of "don't cares", the potential benefits of this technique could be even greater.
References 1. Miller, J.F., Job, D., Vassilev, V.K.: Principles in the Evolutionary Design of Digital Circuits - Part I. Journal of Genetic Programming and Evolvable Machines 1, 8–35 (2000) 2. Perez, E.I., Coello, C.C.: Extracting and re-using design patterns from genetic algorithms using case-based reasoning. Engineering Optimization 35(2), 121–141 (2003) 3. Miller, J.F., Thomson, P., Fogarty, T.: Designing Electronic Circuits Using Evolutionary Algorithms. In: Quagliarella, D., Periaux, J., Poloni, C., Winter, G. (eds.) Arithmetic Circuits: A Case Study, Genetic Algorithms and Evolution Strategies in Engineering and Computer Science, pp. 105–131. Wiley, Chichester (1997) 4. Miller, J.F., Thomson, P.: Cartesian Genetic Programming. In: Poli, R., Banzhaf, W., Langdon, W.B., Miller, J., Nordin, P., Fogarty, T.C. (eds.) EuroGP 2000. LNCS, vol. 1802, pp. 121–132. Springer, Heidelberg (2000) 5. Perkowski, M., Foote, D., Chen, Q., Al-Rabadi, A., Jozwiak, L.: Learning hardware using multiple-valued logic-Part 1: introduction and approach. IEEE Mirco 22(3), 41–51 (2002) 6. Poli, R.: Sub-machine-code GP: New results and extensions. In: Langdon, W.B., Fogarty, T.C., Nordin, P., Poli, R. (eds.) EuroGP 1999. LNCS, vol. 1598, pp. 65–82. Springer, Heidelberg (1999) 7. Bäck, T., Hoffmeister, F., Schwefel, H.P.: A survey of evolution strategies. In: Belew, R., Booker, L. (eds.) Proceedings of the 4th International Conference on Genetic Algorithms, pp. 2–9. Morgan Kaufmann, San Francisco (1991) 8. Miller, J.F., Smith, S.L.: Redundancy and Computational Efficiency in Cartesian Genetic Programming. IEEE Trans. on Evolutionary Computation 10, 167–174 (2006) 9. Shaw, R.F.: Arithmetic Operations in a Binary Computer. The Review of Scientific Instruments 21(8) (1950) 10. Oberman, S.F.: Floating Point Division and Square Root Algorithms and Implementation in the AMD-K7 Microprocessor. In: Proc. IEEE Symposium on Computer Arithmetic, pp. 106–115 (1999) 11. Yang, S.: Logic synthesis and optimisation benchmark user guide version 3. MCNC (1991) 12. Koza, J.R.: Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, Cambridge (1992)
13. Holder, M.E.: A modified Karnaugh map technique. IEEE Transactions on Education 48(1), 206–207 (2005) 14. Sekanina, L.: Evolutionary Design of Digital Circuits: Where Are Current Limits? In: Proceedings of the First NASA/ESA Conference on Adaptive Hardware and Systems (AHS 2006), pp. 171–178. IEEE CS, Los Alamitos (2006) 15. Stomeo, E., Kalganova, T., Lambert, C.: Generalized Disjunction Decomposition for Evolvable Hardware. IEEE Trans. Syst., Man, and Cyb. Part B 36(5), 1024–1043 (2006)
Evolving Digital Circuits Using Complex Building Blocks

Paul Bremner1, Mohammad Samie1, Gabriel Dragffy1, Tony Pipe1, James Alfred Walker2, and Andy M. Tyrrell2

1 Bristol Robotics Laboratory, University of the West of England, Bristol, BS16 1QY
2 Intelligent Systems Group, Department of Electronics, University of York, Heslington, York, YO10 5DD
Abstract. This work is a study of the viability of using complex building blocks (termed molecules) within the evolutionary computation paradigm of CGP; extending it to MolCGP. Increasing the complexity of the building blocks increases the design space that is to be explored to find a solution; thus, experiments were undertaken to find out whether this change affects the optimum parameter settings required. It was observed that the same degree of neutrality and (greedy) 1+4 evolution strategy gave optimum performance. The Computational Effort used to solve a series of benchmark problems was calculated, and compared with that used for the standard implementation of CGP. Significantly less Computational Effort was exerted by MolCGP in 3 out of 4 of the benchmark problems tested. Additionally, one of the evolved solutions to the 2-bit multiplier problem was examined, and it was observed that functionality present in the molecules, was exploited by evolution in a way that would be highly unlikely if using standard design techniques.
1 Introduction
A proposed approach to tackling the issue of fault-tolerance, and hence reliability issues for digital systems, is a bio-inspired prokaryotic cell array. Under this paradigm a circuit is made up of interconnected, identical, cells that are configured, using bit strings of genes, to fulfill the necessary routing and functional properties to make up a digital system [1]. The cells in our proposed array have been designed with a great deal of functionality. A consequence of this is that it is a complex task to specify genes to fully exploit the functionality of the cells, when implementing digital systems using standard digital design techniques. An alternative to a deterministic technique of gene specification, Genetic Programming, has therefore been investigated. Genetic Programming (GP) has been shown to be capable of producing novel [2][3], compact [4] solutions to digital design problems; often these result in circuits that are unlikely to be conceived using standard design techniques. It therefore seems possible that some form of GP might be used to produce circuits, using the proposed cells, that would exploit their functionality in ways that a deterministic technique might not. Cartesian Genetic Programming, developed
by Miller and Thomson [5], is a method that could be readily adapted to allow the use of cells within the evolution process. In our case the standard 2-input logic gates that are normally used as nodes in CGP for digital circuit evolution will be replaced by a cut-down version of the proposed cells. The proposed cells are able to process 8 inputs to perform a variety of routing and functional roles. The routing functionality of the cells is not suitable for inclusion within the framework of CGP, so a cut-down version dubbed a molecule will be used; hence the name of the proposed technique, Molecular Cartesian Genetic Programming (MolCGP). However, a key issue with taking this approach is the efficiency with which a solution might be found. CGP has been shown to produce solutions to a number of benchmark problems with a useful degree of efficiency; by increasing the complexity of the nodes, the amount of design space that must be explored similarly increases. As a consequence, in addition to investigating the efficacy with which MolCGP exploits the functionality of the molecules, the efficiency and efficacy with which it is able to solve benchmark digital problems will also be investigated.
2 Related Works
It has been shown that Evolutionary Algorithms can be used to successfully evolve digital circuits [2][3]. Miller and Thomson proposed Cartesian Genetic Programming as one such method of approaching this problem. It differs from the original Genetic Programming (GP) technique proposed by Koza [6] in that a program is represented as an acyclic directed graph rather than a tree-like structure, each node in the graph representing a digital function. They demonstrated the ability to evolve novel solutions to 2-bit and 3-bit multipliers, and the even 4-bit parity problem. Although the technique affords exploration of the design space in ways that produce solutions that are beyond the remit of traditional design techniques, the building blocks of those solutions are restricted to the range of functions that are defined for the nodes. The work presented here proposes to extend CGP by using nodes with more functional flexibility. Other work has also sought to expand on the capabilities of CGP. Sekanina proposed the use of unconventional functions in the form of polymorphic logic gates [7]. He found that it was possible to evolve multi-functional digital circuits using a combination of standard and polymorphic logic gates as node functions. Thus, by expanding the range of available node functions, a wider range of design space can be successfully explored. Walker and Miller looked to improve the capabilities of CGP by allowing the automatic creation of multi-node modules, which could be inserted into the graph in place of a standard node during mutation [8]. Thus, the function set could be increased by adding other useful functions beyond the base primitives; this facilitated more efficient evolution, particularly in the case of modular designs. Haddow et al. have also attempted to find a method to produce the configuration bits for a Look Up Table (LUT) based cell array [9]. However, their technique is totally different from that presented here: they use a set of evolved growth
rules to propagate configuration bits to the array rather than evolving the configuration bits directly. Thus, they have shifted the complexity from the genotype to the method of conversion from genotype to phenotype. This produces some interesting results but they have not attempted to use their technique to evolve digital circuits.
3 Description of Molecule Functionality
A potential design paradigm for fault-tolerant digital systems is the bio-inspired prokaryotic array. This sort of system is made up of an array of identical cells, capable of being configured to perform a variety of functions [1]. The functional part of the cells, for the array that we are currently developing, is made up of two functional units. Each unit can be configured to operate independently, or cooperatively, to process 8 input lines, realising a variety of processing and routing functions on the data. The configuration of these cells is carried out using a bit string of genes. In order to constrain the design space of the system to a degree whereby a GP method can operate with some efficiency, the routing and cooperative functionality of the cells has been ignored. Hence, each cell has been broken down into two molecules, each of which is a cut-down version of a functional unit. Similarly, a segment of the complete gene string is used to control the functionality of the molecule. The cell and gene string have been decomposed in such a way that cells (and their requisite genes) could be reconstructed from the evolved molecules. Each molecule has four inputs and two outputs. The function realised at the primary output (PO) is driven by an 8-bit-wide LUT, so it can produce any arbitrary three-input Boolean function; its inputs (PI1-3) are in1, in2 and either in3, in4, in1 or logic 0 (the last two of these result in a 2-input function). The secondary output (SO) is primarily for use as a carry output when the molecule is configured as a full or half adder; otherwise, it either routes PI3 or produces in1·in2 + PI3·(in1 ⊕ in2). Although the functionality of the secondary output is relatively limited, it is allowed as a valid connection: it is a fixed part of the cell design, and any available functionality should be exploitable by evolution. A schematic of the molecule is shown in Fig. 1. The functionality of the molecule is controlled by an 11-bit binary string: the first 8 bits constitute the LUT, and the other 3 bits control which value is passed as PI3 and the function executed by SO. When PI3 is selected in a way that results in a 2-input function, only half of the LUT bits are used; which half is used is determined by whether in1 or logic 0 is selected.
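A rough sketch of how a molecule's two outputs could be computed from its 11-bit configuration (8 LUT bits plus 3 control bits). The exact meaning of the three control bits is not specified here, so the decoding below (two bits to select PI3, one bit to select the SO behaviour) is an assumption made purely for illustration.

```python
def molecule(in1, in2, in3, in4, lut_bits, ctrl):
    """lut_bits: 8 LUT entries; ctrl: 3 control bits (assumed encoding)."""
    pi3 = [in3, in4, in1, 0][(ctrl[0] << 1) | ctrl[1]]       # assumed PI3 selection
    po = lut_bits[(in1 << 2) | (in2 << 1) | pi3]             # primary output from the LUT
    if ctrl[2]:                                              # assumed SO selection bit
        so = pi3                                             # route PI3 to SO
    else:
        so = (in1 & in2) | (pi3 & (in1 ^ in2))               # carry: in1.in2 + PI3.(in1 xor in2)
    return po, so
```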
Fig. 1. A schematic of the molecule. LUT1-8 comprise the LUT part of the gene string, C1-3 are the control genes that define the remaining functionality.

4 CGP and Its Extension to MolCGP

Miller and Thomson developed Cartesian Genetic Programming in order to facilitate the evolution of digital circuits [5]. Molecular Cartesian Genetic Programming (MolCGP) is an extension of CGP using more complex nodes than in
the original implementation. In CGP a digital circuit is represented as a directed graph, each node representing some form of digital processing element. It is described as Cartesian because the nodes are laid out in a grid, so the Cartesian coordinates of a node are used to identify the connections of the edges of the graph. A benefit of this type of representation is that the outputs of a given node can be connected to any other, allowing implicit reuse of the processing performed (Fig. 2). CGP has been shown to be most efficient when only a single column (or row) of nodes is used, rather than a grid of nodes as suggested in the original implementation [10]; this single dimension approach is followed here. Additionally, the graph is acyclic as all the functions to be evolved are feed-forward, combinational logic. Thus, a node may only have input connections from preceding nodes in the graph and program inputs; the outputs are further restricted in that they may not be connected (directly) to program inputs.
Fig. 2. Acyclic Directed Graph, 3 nodes each with 2 inputs and 1 output, 2 program inputs (A,B) and one program output (C)
The genotype in MolCGP, as in CGP, is made up of a number of sets of integers, one set for each node in the graph. The genotype length is fixed: a specified number of nodes is defined for every member of the population. However, the genotype-phenotype mapping is such that each node need not necessarily contribute to the value produced at any of the outputs. Thus, although the genotype is bounded, the phenotype is of variable length. The unconnected nodes represent redundant genetic information that may be expressed if a mutation results in their inclusion in an input-to-output path. Therefore the effect of a single point mutation on the genotype can have a dramatic effect on the phenotype; an example of this is shown in Fig. 3. In order to ensure these neutral mutations
influence the evolution, new population members that have the same fitness as the parent are deemed fitter than the parent. This phenomenon is often referred to as neutrality, as mutations in the redundant sections of the genome have no effect on the fitness of the individual; it has been shown to be beneficial to the operation of CGP [10][5][11]. A degree of redundancy as high as 95% is suggested by Miller and Smith [11] as providing the optimum increase in performance. The optimum number of nodes for producing similar levels of improvement in efficiency in MolCGP is investigated in Section 5.
Fig. 3. Point mutation occurs changing which node the output C is connected to. Redundant nodes are indicated by dotted lines.
In CGP, each node consists of one number representing the function of the node and the remaining numbers defining the sources of the input connections; the number of these connections depends on the arity of the function that the node implements, whereas in MolCGP there are always 4 inputs. In CGP the function of each node is represented by an integer, allowing functions to be drawn from a predefined list; in [5] this list consisted of a range of primitive Boolean functions as well as some MUX nodes to allow inversion of various inputs. MolCGP is an extension of CGP in that the functionality of a node is defined by a bit string, which allows generation of arbitrary 3-input logic functions from the primary output, and the full range of possible functionality from the secondary output, through mutation of this bit string; thus the nodes in MolCGP are significantly more flexible in the functions they are able to implement. Additionally, nodes in CGP have only one output, whereas nodes in MolCGP have 2 outputs; hence each connection gene is, instead, a pair of numbers, defining the node connected to and the output of that node (in a similar manner to the genotype used in ECGP and MCGP proposed by Walker and Miller [8]). Typically, CGP uses only mutation as a genetic operator, and that concept is followed in MolCGP. A number of nodes are mutated for each generation, defined by a mutation rate that is a percentage of the number of nodes specified for the current population; a mutation rate of 3% was found to give good performance, and is used throughout the work presented here. For each node mutated, either the function or one of the connection gene pairs may be mutated. If the function is mutated, a random number of bits in the function gene-string are flipped. If a connection is mutated, a new random, valid connection pair is generated.
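A sketch of this mutation operator (an interpretation for illustration; the even split between function and connection mutation is an assumption, as is the node data structure):

```python
import random

def mutate_node(node, node_index, n_inputs):
    """node: {'function': list of 11 bits, 'connections': 4 (source, output) pairs}."""
    if random.random() < 0.5:
        # Flip a random number of bits in the 11-bit function string.
        for _ in range(random.randint(1, len(node["function"]))):
            i = random.randrange(len(node["function"]))
            node["function"][i] ^= 1
    else:
        # Replace one connection pair with a new random, valid source.
        k = random.randrange(4)
        source = random.randrange(n_inputs + node_index)       # a program input or an earlier node
        output = 0 if source < n_inputs else random.randint(0, 1)  # molecules expose 2 outputs
        node["connections"][k] = (source, output)
```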
The fitness function used is simply the total Hamming distance between the bit strings (one from each output) that result from evaluating the truth table of the given problem and those specified by that truth table. Thus a lower fitness score is better, and evolution is stopped when a fitness of zero is reached.
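A minimal sketch of this fitness measure, assuming an evaluate function that returns the circuit's output bits for a given input combination:

```python
def fitness(evaluate, truth_table):
    """Total Hamming distance between the evolved circuit's outputs and the
    target truth table (lower is better; 0 stops evolution)."""
    errors = 0
    for inputs, expected in truth_table:    # expected: tuple of target output bits
        actual = evaluate(inputs)           # tuple of bits produced by the phenotype
        errors += sum(a != e for a, e in zip(actual, expected))
    return errors
```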
5
Evolution Strategy and Population Size Experiments
As a consequence of the increased complexity of the nodes, the design space to be explored is a great deal larger, and mutations on the node functions are likely to have a greater effect on the fitness of an individual. It therefore seems prudent to investigate whether the evolution strategy (1+4, i.e., each new population consists of the best individual from the previous generation and 4 offspring produced by mutating it) and genome redundancy used by [5] are appropriate for MolCGP. In order to investigate this, a 2-bit multiplier was chosen as a sample program to be evolved. It is sufficiently complex that the effects of parameter changes should be observable, while not being so complex that solutions take a very long time to evolve. As a measure of efficiency, to allow direct comparison between the different parameter settings, Individuals Processed to find a Solution (IPS) is used. IPS is calculated using equation (1), where M is the number of individuals in a population and i is the median number of generations needed to find a solution. IPS can be seen to have some similarities to the Computational Effort (CE) proposed by Koza [6]; it is used instead due to the inaccuracies of CE for low-run, high-generation experiments [12]. Each parameter set is used for 50 independent runs, and a Box & Whisker plot is generated for analysis.

IPS = M × i    (1)
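Equation (1) amounts to a one-line calculation; a small Python sketch follows (names are illustrative, and whether the parent's re-evaluation counts towards M is not specified above).

```python
from statistics import median

def individuals_processed_to_solution(generations_per_run, population_size):
    """IPS = M * i: population size M times the median number of
    generations i needed to find a solution over the independent runs."""
    return population_size * median(generations_per_run)

# Example: 50 independent runs of a (1 + 4) strategy, 4 offspring per generation.
# ips = individuals_processed_to_solution([310, 542, 198], population_size=4)
```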
To test evolution strategies (1 + λ), the number of nodes is set at 50. To test genotype lengths, the evolution strategy is set as 1+4. In both cases the evolution is always run until success.
5.1
Discussion
It is clear from Fig. 4 that increasing the number of nodes, and therefore the redundancy, has (as in standard CGP) a beneficial effect on the efficiency of evolution. Pairwise Mann-Whitney U-tests, and Kolmogorov-Smirnov tests, were carried out on the data, and showed that the observed improvement in efficiency is largely not significant (at the 5% level) beyond 20 nodes. This is contrary to the findings in [11], where, as node numbers were increased, so too did the efficiency of evolution (for all values tested). A potential reason for this is that far fewer nodes than in standard CGP are required for the 95% neutrality suggested by Miller; the precise degree of neutrality present is non-trivial to calculate, given the implicit neutrality in nodes, i.e., nodes expressed in the phenotype that do not actually contribute to the functionality. To try to approximate the neutrality in the genome (explicit neutrality) the number of nodes was severely restricted
Fig. 4. Box & Whisker Plot of Variable Numbers of Nodes in Each Individual. 50 Independent Runs Performed for Each Value
Fig. 5. Box & Whisker Plot of Variable Evolution Strategy. 50 Independent Runs Performed for Each Value
and evolution run until success; a solution can be found with as few as 5 nodes, implying that with 20 nodes there is at least 75% neutrality. In addition, there is also a trade-off to be made between the apparent improvement in IPS and the complexity of the individual; individuals in populations with more nodes are likely to have larger phenotypes than those with fewer nodes [11], and thus tend to take longer to process. Therefore, the overall processing time is not necessarily improved with an increased number of nodes. Consequently, selecting the correct number of nodes for a given problem appears critical to shorter evolution times. It is clear from Fig. 5 that a 1+4 strategy gives, as suggested for standard CGP, maximum efficiency. Pairwise Mann-Whitney U-tests, and Kolmogorov-Smirnov tests, were carried out on the data, and showed that the observed improvement in efficiency is only significant (at the 5% level) between the smaller and larger population sizes. However, although the observed improvements in efficiency between the low population size experiments are not significant, the Box & Whisker plot shows that the variance increases as the strategy deviates from 1+4; thus, a 1+4 evolutionary strategy will give consistently more efficient evolution.
6
Applying MolCGP to Benchmark Problems
In order to test the efficacy of MolCGP, 4 benchmark problems that are commonly used to test new techniques have been attempted: the 4- and 8-bit even-parity problems, and the 2- and 3-bit multipliers [13]. Despite the inaccuracies of CE as a measure for MolCGP, it is used as a standard measure for many GP approaches; therefore, to allow comparison of performance on benchmark problems, in particular with standard CGP, it is used in this section. Walker et al. state that CE is a point statistic and that, in order to improve the validity of comparisons with other techniques (mitigating the inaccuracies of CE for our high-generation, low-run approach), a confidence interval (CI) should be calculated [13]. These intervals are calculated here using the methodology presented in [13], and these values are included in Table 1, along with the values for standard CGP taken from [12] and used for comparison. In all cases, 50 independent runs were conducted, with an evolution strategy of 1 + 4, a mutation rate of 3%, and a genotype length of 100 nodes.
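For reference, computational effort is Koza's standard measure; a hedged Python sketch of its usual calculation is given below (success_generation holds, for each independent run, the generation at which a solution was found, and z = 0.99 is the conventional success-probability target). The confidence-interval methodology of [13] is not reproduced here.

```python
from math import ceil, log

def computational_effort(success_generation, population_size, z=0.99):
    """Koza-style computational effort: the minimum, over generations i, of
    M * (i + 1) * R(i), where R(i) is the number of independent runs needed
    to find a solution by generation i with probability z."""
    runs = len(success_generation)               # e.g. 50 independent runs
    best = None
    for i in range(max(success_generation) + 1):
        p = sum(1 for g in success_generation if g <= i) / runs   # estimate of P(M, i)
        if p == 0:
            continue
        r = 1 if p >= 1 else ceil(log(1 - z) / log(1 - p))
        effort = population_size * (i + 1) * r
        best = effort if best is None else min(best, effort)
    return best
```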
6.1
Discussion
Looking at the multiplier problems, it can be seen that CGP outperforms MolCGP by approximately 2 times for the simpler 2-bit multiplier problem, but the reverse is true for the 3-bit multiplier, where a 5-fold decrease in CE can be seen for MolCGP. The results for the even parity problems are even more dramatic; however, the function set for CGP is limited to AND, OR, NAND, and NOR, which are known to require very complex solutions when using only these functions [2], thus the vast improvement is, perhaps, partly attributable
Table 1. The computational effort (in number of generations) for the 4 benchmark problems tested, for both MolCGP and CGP. Also included are the true CE confidence interval (CI) lower and upper bounds.
Benchmark Problem     MolCGP: CI_lower / CE / CI_upper        CGP: CI_lower / CE / CI_upper
2-Bit Multiplier      53,814 / 73,282 / 109,313               24,675 / 33,602 / 50,123
3-Bit Multiplier      4,283,402 / 5,832,962 / 8,700,865       16,448,737 / 24,152,005 / 33,867,501
Even 4-Bit Parity     12,021 / 24,071 / 31,325                106,546 / 151,683 / 210,235
Even 8-Bit Parity     83,687 / 120,324 / 167,575              22,902,612 / 31,187,842 / 46,522,022
to this. However, it is clear that despite the increased design space that is being explored by MolCGP, significant improvements in CE are demonstrated, especially for more complex problems. A caveat to this finding is that one of the limitations of CE as a performance indicator is that the calculation does not take into account the complexity of each individual solution: the nodes in CGP typically require 1 or 2 bitwise operations on the input data, whereas molecules require many times more than this.
7
Examining an Evolved Solution to the 2-bit Multiplier Problem
In order to investigate how the resources of the nodes are being utilised, and how the solution differs from what might have been created by a human designer, one of the evolved solutions to the 2-bit multiplier problem has been examined. The solution was evolved using only 10 nodes, as each node is so complicated that a larger solution would be very difficult to analyse meaningfully; although reducing the neutrality increased the number of individuals that had to be processed, a solution was still found fairly quickly (in less than a minute). Fig. 6 shows how the nodes are connected in the evolved solution; it clearly shows that (as expected) node outputs are being reused: 3 of the nodes have both their primary and secondary outputs connected, resulting in 65% of the available resources being used. The functionality of each node is shown in Table 2. The outputs C3, C2 and C1 are solved in a way that follows fairly closely what would be produced using a Karnaugh map. C3 and C2 use more nodes than is actually necessary, in some places taking advantage of the input-routing nature of the secondary outputs (SO), in others combining inputs in redundant ways; however, the nodes still perform logically obvious operations. What is particularly interesting is the use of the functionality of the secondary outputs to calculate C1 in a way that is very different from standard design techniques. Using a Karnaugh map, the function deduced for C1 (the sum of minterms) is shown in equation (2); it requires 6 nodes (as no previous functionality can be directly reused). Alternatively, the multiplier can be constructed
Fig. 6. Connectivity of the 10 nodes in the examined 2-bit multiplier solution
using half-adders, the function for which is shown in equation (3); it requires 3 nodes (output 4S can be reused). There are 4 nodes unique to C1, but they did not result in producing anything like equation (3); instead, equation (2) is produced using a convoluted combination of nodes (verified through extensive Boolean algebra not reproduced here). This exploitation of the unusual functionality of the secondary outputs, combined with the three-input logic function of the primary outputs, in a way that standard design techniques would not lead to, highlights the benefit of using an evolutionary technique to produce circuits for the proposed array; i.e., exploration of areas of the design space that would not normally be used, giving rise to the potential for more efficient solutions to be evolved than could be designed. However, the circuit produced is not as efficient as that produced when constructing the multiplier using half-adders (which requires 6 nodes); this is due to some undesired exploitation of the routing capabilities of SO, and some redundant recombination of inputs. Thus, alteration of the fitness evaluation to include parsimony should increase functional exploitation, and minimise routing exploitation and redundant nodes; adding parsimony to the fitness function is one of the ideas discussed in section 8.

$\overline{A_1}.A_0.B_1 + A_0.B_1.\overline{B_0} + A_1.\overline{B_1}.B_0 + A_1.\overline{A_0}.B_0$    (2)

$(A_0.A_1.B_1.B_0) \oplus A_1.B_1$    (3)
Table 2. Functions of Nodes in the Examined 2-Bit Multiplier Solution Node Number PO Function SO Function 0 0 A1.B1 ¯ B0 ¯ + B1.B1 ¯ B1. + A1.B0.B1 1 B1 ¯ + A1.A0.B0 ¯ 2 A1.B0 A1.A0 + B0.A1 ⊕ A0 ¯ 3 0PO .A0 0P O ¯ 4 B0.A0 B0.A0 3S¯O + 4SO 5 3SO .4SO 6 5SO .B1 5SO .3PO + B1.5SO ⊕ 3PO 3P¯O .4P¯O .2SO + 3P¯O .4PO .2S¯O + 3PO .4PO .2SO 3PO .4PO + 2SO .3PO ⊕ 4PO 7 8 0PO + 5P¯O 0 ¯ 1P¯O .A0.7S¯O + 7SO .A0 9 7 SO
8
Conclusions and Further Work
In this paper MolCGP, an extension of CGP, has been presented and its capabilities investigated. It has been shown that, for a set of standard benchmark problems, it is able to find solutions with a practicable amount of computational effort, particularly when compared to standard CGP, thus demonstrating that it is a potentially valuable technique for evolving useful digital circuits on a bio-inspired prokaryotic cell array. In order to facilitate scaling MolCGP to more complex problems (than those presented here), further development of the algorithm to improve the efficiency of evolution will be investigated. One possible approach for this is automatic module acquisition, as described in [8]. In this approach, collections of nodes with a useful function (a module) are added to the list of possible functional mutations of any given node; thus a mutation could replace a node with a module instead of making a standard functional change. This facilitates further exploitation of the modular nature of many digital circuits. Owing to the multi-functional nature of the nodes in MolCGP, useful node functions (i.e., gene strings) could also be acquired and added to the list of possible mutations. Having established that MolCGP is capable of evolving useful digital circuits, it will be developed to maximise the exploitation of the functionality of molecules. Upon examining one of the evolved solutions to the 2-bit multiplier problem, it can be seen that the functionality of the nodes is being relatively well exploited, and in a way that would not normally be arrived at by a human designer. This exploitation gives rise to the potential to evolve solutions that are more efficient than those that would typically be designed. Hence, in order to capitalise on this, and to minimise routing exploitation and redundant nodes, the fitness function will be modified to include parsimony. An approach for doing so with CGP is suggested in [4]: successfully evolved solutions (including conventionally designed solutions) are allowed to evolve further to see whether more compact solutions can be found. Alternatively, parsimony could be included in the fitness function from the outset, resulting in multi-objective evolution. Both approaches will be investigated. Should sufficiently efficient solutions be able to
be evolved, they will be examined for possible design techniques that can exploit the functionality of the nodes.

Acknowledgments. This research work is supported by the Engineering and Physical Sciences Research Council of the United Kingdom under Grant Number EP/F062192/1.
References
1. Samie, M., Dragffy, G., Popescu, A., Pipe, T., Melhuish, C.: Prokaryotic bio-inspired model for embryonics. In: NASA/ESA Conference on Adaptive Hardware and Systems, pp. 163–170 (2009)
2. Miller, J.F., Job, D., Vassilev, V.K.: Principles in the evolutionary design of digital circuits—Part I. Genetic Programming and Evolvable Machines 1(1-2), 7–35 (2000)
3. Coello Coello, C.A., Aguirre, A.H.: Design of combinational logic circuits through an evolutionary multiobjective optimization approach. Artif. Intell. Eng. Des. Anal. Manuf. 16(1), 39–53 (2002)
4. Vassilev, V.K., Job, D., Miller, J.F.: Towards the automatic design of more efficient digital circuits. In: EH 2000: Proceedings of the 2nd NASA/DoD Workshop on Evolvable Hardware, vol. 151 (2000)
5. Miller, J.F., Thomson, P.: Cartesian genetic programming. In: Poli, R., Banzhaf, W., Langdon, W.B., Miller, J., Nordin, P., Fogarty, T.C. (eds.) EuroGP 2000. LNCS, vol. 1802, pp. 121–132. Springer, Heidelberg (2000)
6. Koza, J.: Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, Cambridge (1996)
7. Sekanina, L.: Evolutionary design of gate-level polymorphic digital circuits. In: Rothlauf, F., Branke, J., Cagnoni, S., Corne, D.W., Drechsler, R., Jin, Y., Machado, P., Marchiori, E., Romero, J., Smith, G.D., Squillero, G. (eds.) EvoWorkshops 2005. LNCS, vol. 3449, pp. 185–194. Springer, Heidelberg (2005)
8. Walker, J.A., Miller, J.F.: The automatic acquisition, evolution and reuse of modules in Cartesian genetic programming. IEEE Trans. Evolutionary Computation 12(4), 397–417 (2008)
9. Haddow, P.C., Tufte, G., van Remortel, P.: Shrinking the genotype: L-systems for evolvable hardware? In: Liu, Y., Tanaka, K., Iwata, M., Higuchi, T., Yasunaga, M. (eds.) ICES 2001. LNCS, vol. 2210, pp. 128–139. Springer, Heidelberg (2001)
10. Yu, T., Miller, J.F.: Neutrality and the evolvability of Boolean function landscape. In: Miller, J., Tomassini, M., Lanzi, P.L., Ryan, C., Tettamanzi, A.G.B., Langdon, W.B. (eds.) EuroGP 2001. LNCS, vol. 2038, pp. 204–217. Springer, Heidelberg (2001)
11. Miller, J.F., Smith, S.L.: Redundancy and computational efficiency in Cartesian genetic programming. IEEE Transactions on Evolutionary Computation 10(2), 167–174 (2006)
12. Walker, J.A.: The Automatic Acquisition, Evolution and Re-use of Modules in Cartesian Genetic Programming. PhD Thesis
13. Walker, M., Edwards, H., Messom, C.: Confidence intervals for computational effort comparisons. In: Ebner, M., O'Neill, M., Ekárt, A., Vanneschi, L., Esparcia-Alcázar, A.I. (eds.) EuroGP 2007. LNCS, vol. 4445, pp. 23–32. Springer, Heidelberg (2007)
Fault Tolerance of Embryonic Algorithms in Mobile Networks
David Lowe 1, Amir Mujkanovic 1, Daniele Miorandi 2, and Lidia Yamamoto 3
1 Centre for Real-Time Information Networks, University of Technology Sydney, Australia
[email protected], [email protected]
2 CREATE-NET, v. alla Cascata 56/D, 38123, Povo, Trento, IT
[email protected]
3 Computer Science Department, University of Basel, Switzerland
[email protected]
Abstract. In previous work the authors have described an approach for building distributed self–healing systems – referred to as EmbryoWare – that, in analogy to Embryonics in hardware, is inspired by cellular development and differentiation processes. The approach uses “artificial stem cells” that autonomously differentiate into the node types needed to obtain the desired system–level behaviour. Each node has a genome that contains the full service specification, as well as rules for the differentiation process. This approach has inherent self-healing behaviours that naturally give rise to fault tolerance. Previous evaluations of this fault tolerance have however focused on individual node failures. A more systemic fault modality arises when the nodes become mobile, leading to regular changes in the network topology and hence the potential introduction of local node type faults. In this paper we evaluate the extent to which the existing fault tolerance copes with the class of faults arising from node mobility and associated network topology changes. We present simulation results that demonstrate a significant relationship between network stability, node speed, and node sensing rates.
1
Introduction
In this paper, we consider the issue of fault-tolerance in self–healing distributed networks that incorporate mobile devices and hence rapidly changing network topologies. Inspired by related work on Embryonics [1, 2], in our earlier work [3] we proposed EmbryoWare, an "embryonic software" architecture for robust and self-healing distributed systems. Like Embryonics, the EmbryoWare approach is based on the assumption that each node in the system contains a genome that includes a complete specification of the service to be performed, as well as a set of differentiation rules meant to ensure that each node differentiates into the node type needed to provide the required overall system–level behaviour. A particular feature of both Embryonics and EmbryoWare is that there is no
distinction between the fault¹ handling behaviour and the normal behaviour of a node. The ability of a node to restore from a faulty to a normal state is a side-effect of the system's normal process of differentiating into the locally correct node type. Therefore, no special fault-handling routines are needed, which can make the system potentially more robust to unforeseen disruptions. In [3] we examined the general behaviour and performance of the EmbryoWare approach and demonstrated its validity as well as its inherent robustness and self–healing ability. That previous work however focused on individual node failures with a uniform probability distribution of failures occurring in any node. There does, however, exist the likelihood of more complex patterns of node failure. One of the more significant of these occurs when we have mobile nodes, leading to regular changes in the network topology. When the topology changes, the local neighbourhood for nodes is affected. Given that nodes differentiate into different types based, in part, on the sensed information from nodes in their local neighbourhood, when this neighbourhood changes it can mean that the node types are no longer correct. This can be interpreted as the introduction of faults into the system. An example of this situation would be an ad hoc network of mobile devices (such as cell phones) that form a distributed processing network. As devices move, they establish and then lose temporary connections, and hence the network topology is constantly changing. This has implications for ensuring the validity of the system–level functionalities – particularly where the correct behaviour of each node is dependent upon the behaviours in its neighbourhood. In this paper we evaluate the fault–tolerance behaviour of EmbryoWare under mobility, by measuring the extent to which the patterns in EmbryoWare can be maintained in a valid state in spite of mobility. In particular, we are interested in the relationships between the rate of fault generation (which will correspond to the speed of the nodes and hence the rate of change in the network topology) and those factors that affect the rate at which faults are addressed. In essence we are considering how quickly the nodes in an embryonic system can re-differentiate to ensure that the individual nodes are in a valid state. In section 2 we discuss the background to our approach and related work. Then in section 3 we provide a brief overview of the basic EmbryoWare architecture and the changes we have made to incorporate node mobility into our simulations. We then describe our analysis approach and results in section 4. Finally, in section 5 we describe our conclusions and future work.
2
Background
The motivation for our work comes from the increasing utilisation of distributed services, i.e. services whose outcomes depend on the interaction of different components possibly running on different processors. Distributed services typically
We refer to a fault as any circumstance in which a node is not operating in a steady state but rather a state in which subsequent sensing is likely to lead to a differentiation of the node type. This should be distinguished from a node failure, where the node has failed to operate correctly due to some other operational reason.
require complex design with regard to the distribution and coordination of the system components. They are also prone to errors related to possible faults in one (or more) of the nodes where the components execute. This is particularly significant for applications that reside in open, uncontrolled, rapidly evolving and large–scale environments, where the resources used for providing the service may not be on dedicated servers (as is the case in many grid or cloud computing applications) but rather utilise spare resources, such as those present in users' desktops or even mobile devices. (Examples of such scenarios are the various projects making use of the BOINC or similar platforms².) Other examples of distributed applications where each node takes on specific functionality include: peer-to-peer file sharing; distributed databases and network file systems; distributed simulation engines and multiplayer games; pervasive computing [4] and amorphous computing [5]. With all of these applications there is a clear need to employ mechanisms that enhance robustness and reliability, ensuring the system's ability to detect faults and recover automatically, restoring system–level functionalities in the shortest possible time. In this work, we deal with problems arising when the topology changes due to node mobility. While as of today the vast majority of distributed services are meant to run over static nodes, the increasing penetration of powerful mobile devices (smartphones) has the potential of boosting the adoption of similar approaches in the mobile computing field. Even when the devices themselves are not mobile there still exists the potential for changes to the network topology due to approaches such as intelligent routing. We report the following examples of applications, which help in better positioning our work.
Example 1 Wireless Grid Computing: One example of the kind of applications our framework applies to is the so–called wireless grid computing [6, 7, 8]. This applies the same principles underpinning grid computing research to mobile phones. Sharing the load for performing heavyweight computational tasks across a plurality of devices can provide advantages in terms of completion time and load balancing. The possibility that the network topology can change dynamically introduces an additional level of complexity with respect to grid computing scenarios, due to the need to ensure that tasks will get completed even in the presence of disconnections.
Example 2 Distributed Sensing Platforms: Current state-of-the-art smartphones are sensor–rich. They typically include at least a camera (video and image sensor), a microphone (audio sensor) and short–range communication capabilities (such as Bluetooth and WiFi). Smartphones carried around by users could therefore be used as a distributed wireless sensing platform [9, 10]. Such a platform could be used to gather environmental information. An example is the distributed search engine considered in [11].
Example 3 Mobile Data Sharing: As smartphones are commonly equipped with some form of short–range wireless communications, they could be used to exchange data and content in a peer–to–peer fashion [12,13,14]. Going beyond pure
http://boinc.berkeley.edu/
flooding–based strategies (à la Gnutella) requires the introduction of distributed indexing/caching services, which should be able to ensure some system–level performance (related, e.g., to the ability of locating and retrieving given content) even in the presence of device mobility. We are particularly interested in distributed services whereby the desired system–level behaviour (or: system–level configuration, meaning the mapping of devices to 'types', where different node types carry out different behaviours) can be expressed in terms of spatial constraints between the nodes and their types. An example could be "A node of type A has to be no more than two hops away from a node of type B" or "Any node of type C shall have no more than two 3–hop neighbours of type D". Robustness in distributed computing systems is a well–studied topic. Classical fault–tolerance techniques include the use of redundancy (letting multiple nodes perform the same job) and/or the definition of a set of rules triggering a system reconfiguration after a fault has been detected [15]. In many cases however it is not feasible to pre–engineer all possible failure patterns and the consequent self-healing actions to be taken for restoring global functionalities. In previous work by two of the authors [16], we considered the potential for using bottom-up approaches inspired by embryology for the automated creation and evolution of software. In these approaches, complexity emerges from interactions among simpler units. It was argued that this approach can also inherently introduce self–healing as one of the constituent properties, without the need to introduce separate fault–handling behaviours. The ability of a node to restore from a faulty to a normal state is a side-effect of the system's normal process of differentiating into the locally correct node type.
3
EmbryoWare Architecture
EmbryoWare [3] applies concepts inspired by cellular development to the design of self–healing distributed software systems, leveraging previous research conducted in the evolvable hardware domain. Such approaches, which gave rise to the embryonics research field [1, 2], are based on the use of "artificial stem cells" [17, 18], in the form of totipotent entities that can differentiate – based on sensing of the state of neighbouring cells – into any component needed to obtain the desired system–level behaviour. In general, we define an embryonic system as a system composed of networked entities that:
1. Are able to sense the state (or: type) expressed by neighbouring entities, i.e., those immediate neighbours with which direct communication is possible, or those entities for which information is provided by immediate neighbours;
2. Are able to differentiate their behaviour into a given type, depending on the type expressed by neighbouring entities and according to a set of well-defined rules;
3. Are able to replicate to neighbouring entities (i) the definition of all types and (ii) the set of differentiation rules.
Fig. 1. EmbryoWare Architecture, showing two neighbouring nodes
Our specific architecture is shown in Figure 1 for the case of two neighbouring nodes. Nodes are organised in a network, and each node contains the following components:
– Genome: defines the behaviour of the system as a whole, and determines the type to be expressed based on local context (i.e., neighbour cell types).
– Sensing agent: component that periodically communicates with neighbours regarding their current type. We consider in this work pull sensing, in which each node periodically polls its neighbours to inquire about their currently expressed type (as distinct from push sensing, in which each node 'pushes' information on its type to its neighbours).
– Replication agent: component that periodically polls the neighbours about the presence of a genome; if a genome is not present then the current genome is copied to the "empty" cell.
– Differentiation agent: component that periodically decides, based on the cell's current type and the knowledge about the types of the neighbouring cells, which functions should be performed by the node.
In our earlier work we discussed some possible design choices and considered the overall system performance – including the impact of network characteristics such as latency and dropped data packets [3]. However, whilst the algorithms themselves are independent of the network topology, we did not measure the impact of mobile nodes, and hence of a changing network topology. When the topology changes, the local neighbourhood for nodes is affected, and this can mean that the node types are no longer correct. This can be interpreted as the introduction of faults into the system, and hence has significant implications for the ongoing validity of the system.
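To make the division of labour in the component list above concrete, the following is a minimal, illustrative Python sketch of a node holding these agents under pull sensing; the class and method names are assumptions, not the EmbryoWare implementation.

```python
class Node:
    """One node: a genome plus sensing, replication and differentiation agents.
    Pull sensing: the node periodically polls its neighbours itself."""

    def __init__(self, ident, genome=None, node_type="stem"):
        self.ident = ident
        self.genome = genome          # type definitions + differentiation rules
        self.type = node_type
        self.sensed = {}              # neighbour id -> last sensed type

    def sense(self, neighbours):
        """Sensing agent: poll neighbours for their currently expressed type."""
        self.sensed = {n.ident: n.type for n in neighbours}

    def replicate(self, neighbours):
        """Replication agent: copy the genome into 'empty' neighbouring cells."""
        for n in neighbours:
            if n.genome is None:
                n.genome = self.genome

    def differentiate(self, rules):
        """Differentiation agent: choose the type required by the local context,
        according to the rules carried in the genome (passed here as a callable)."""
        self.type = rules(self.type, self.sensed)
```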
3.1
Case Study: Coordinated Data Sensing and Logging
The following example scenario will be used throughout this paper: a number of mobile wireless sensor devices are deployed over an area for the purpose of
environmental monitoring. Each device collects sensor information from its surroundings, and the data collected must be logged within the local neighbourhood (to minimise longer range communication overheads). This means that each monitoring node should be within only a few hops of a logging node. In this case study we set this distance to two hops. When a monitoring node, through sensing its neighbourhood, discovers that it is not within two hops of a logger, then it will probabilistically differentiate into a logger. The differentiation behaviours are given in Algorithm 1 and the pattern that results is illustrated in Figure 2. It is worth remarking that this specific example could be regarded as a clustering problem, in which cluster heads need to be at a maximum distance of four hops. Similar problems have received attention in the ad hoc network community, in particular related to the problem of computing the connected dominating set (CDS) [19]. This problem could be addressed in a traditional way by, e.g., first computing the CDS of the original network and then computing the CDS on the resultant overlay. However we believe that the EmbryoWare solution is much simpler, more compact, and able to handle faults in an intrinsic way. A comparison with existing cluster construction algorithms is a good topic for future work.
– Stem cell:
    with probability P_TtoM: Type ← Monitor
– Monitor cell: no 2-hop logger ⇒
    with probability P_MtoL: Type ← Logger
    with probability P_MtoT: Type ← Stem
– Logger cell: 2-hop logger ⇒
    with probability P_MLtoT: Type ← Stem
    with probability P_LtoT: Type ← Stem
Algorithm 1. Differentiation behaviour for Genome for simple environment logging application
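A hedged Python reading of Algorithm 1 follows; the exact grouping of the probabilistic rules is ambiguous in the flattened listing above, so this sketch assumes one plausible interpretation (the probability names follow the algorithm, and the two-hop logger test is supplied by the caller).

```python
import random

def differentiate(node_type, logger_within_two_hops, p):
    """One probabilistic differentiation decision for the logging case study.
    p maps transition names to probabilities, e.g. p["TtoM"], p["MtoL"],
    p["MtoT"], p["LtoT"]; the rule grouping here is an assumed reading."""
    r = random.random()
    if node_type == "stem":
        if r < p["TtoM"]:
            return "monitor"
    elif node_type == "monitor" and not logger_within_two_hops:
        if r < p["MtoL"]:
            return "logger"
        if r < p["MtoL"] + p["MtoT"]:
            return "stem"
    elif node_type == "logger" and logger_within_two_hops:
        # another logger already covers this neighbourhood
        if r < p["LtoT"]:
            return "stem"
    return node_type
```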
In the subsequent sections, we will evaluate the impact on the system validity (i.e. the ability to return to a correct state from a state that includes faults), in the case of a time–varying network topology due to node mobility, of different choices for the sensing period, i.e., the time elapsed between consecutive polls of a node's neighbours. Furthermore, we will consider two options related to the timing of when a node becomes aware of a change in the topology. The baseline behaviour would be that nodes operate completely independently except for the periodic sensing. In our earlier work, with a fixed topology, this sensing only gave information on the current type of neighbouring nodes. With mobile nodes becoming a possibility, the sensing will give not only information on node types but also node connections – i.e. the local neighbourhood topology. This means that if the topology changes due to node movement (or failure) then each node will only become aware of that, and respond to it through appropriate differentiation, after its next sensing operation. We refer to this sensing behaviour
Fig. 2. Example differentiated node pattern: The red (darker) circles represent monitoring nodes and the green (lighter) circles are logging nodes. Nodes with a black centre are currently in an invalid state
as connection unaware. The alternative is for the node to maintain a continuous awareness of its connections to other nodes (through relevant lower-level communications mechanisms, such as the loss or gain of a carrier signal and/or reception of appropriate beacon messages); it could then become aware of a changed topology much sooner than the next sensing cycle. In this situation it would be able to react much more quickly. We call this mode of operation connection aware. The implications of these two different sensing behaviours will be analysed in the following section.
4
Performance Evaluation under Node Mobility
We now evaluate the impact of mobility on the fault-tolerance properties of the scenario described in Section 3.1. Initially, the overall system may be in a valid state (i.e. all monitoring nodes within 2 hops of a logger). However, as nodes move, and the topology changes, the validity of the system can be affected. Consider the cluster of monitoring (red) nodes around (1, 7) in Figure 2. If these nodes were to move upwards then they would become isolated from the associated logging node at (1, 5), and hence they would be in a fault state. This fault would persist until the nodes were able to sense the lack of a neighbourhood logger, and one of the nodes in this cluster differentiated into a logger. A key performance characteristic to evaluate the system's self-healing ability is the percentage of time that the system is in an invalid state (i.e. a fault in the system is persisting). Two factors will affect this: the frequency with which faults arise, and the speed with which they are then corrected. The former should be predominantly related to the rate of change in the topology, and hence the speed at which the nodes are moving. The latter will be related to the speed with which the fault is detected, and hence the sensing behaviour.
Understanding the relationship between system validity, node speed, and sensing behaviour is important insofar as it allows us to appropriately tune the behaviours of the system. Sensing the state of neighbours (or, as discussed above, the existence of network connections) incurs both processing and bandwidth overheads. If we have a sensing behaviour that performs more rapid sensing than is necessary, then we are wasting resources. To evaluate the extent to which each of these factors plays a role, we extended the Matlab simulations from our previous work in order to incorporate node mobility. The basic algorithms for implementing the embryonic behaviours are outlined in [3]. These were modified in several ways. Firstly, the nodes have been made mobile. They have an initial location and a random (uniform distribution) velocity that only changes when the node reaches the edge of the containing area (a lossless reflection). All nodes continuously move, with connections existing between nodes only when they are within a specified range of each other. The node network shown in Figure 2 was generated using N = 40 nodes initially randomly distributed in a 10 m × 10 m area, with nodes being connected when they are within 2 m of each other. We then undertook two main fault-tolerance evaluations – using each of the two primary sensing behaviours described above. To evaluate the fault-tolerance, we varied the maximum node velocity over the range 0...2 m/s, and the sensing period over the range 0.05...0.8 secs. For each pair of velocity and sensing period values, we ran 10 simulation passes, with each pass running for an initial period to allow node replication to occur, and then a 60 sec evaluation period during which we measured the proportion of time when no fault was present in the network.
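The mobility and connectivity model just described is straightforward to reproduce. The following is an illustrative Python sketch, not the authors' Matlab code: the area, communication range and boundary reflection follow the text, while the time step and the per-component velocity distribution are assumptions.

```python
import random

AREA = 10.0          # 10 m x 10 m containing area
COMM_RANGE = 2.0     # nodes are connected when within 2 m of each other
DT = 0.1             # simulation time step in seconds (assumed)

def make_nodes(n=40, max_speed=1.0):
    """40 nodes, random initial positions and random constant velocities."""
    return [{"x": random.uniform(0, AREA), "y": random.uniform(0, AREA),
             "vx": random.uniform(-max_speed, max_speed),
             "vy": random.uniform(-max_speed, max_speed)} for _ in range(n)]

def step(nodes):
    """Move every node, reflect losslessly at the boundary, recompute links."""
    for n in nodes:
        for axis, v in (("x", "vx"), ("y", "vy")):
            n[axis] += n[v] * DT
            if n[axis] < 0 or n[axis] > AREA:
                n[v] = -n[v]                       # lossless reflection
                n[axis] = min(max(n[axis], 0), AREA)
    links = set()
    for i, a in enumerate(nodes):
        for j in range(i + 1, len(nodes)):
            b = nodes[j]
            if (a["x"] - b["x"]) ** 2 + (a["y"] - b["y"]) ** 2 <= COMM_RANGE ** 2:
                links.add((i, j))
    return links
```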
4.1
Connection Aware versus Connection Unaware Sensing
The first set of analyses were carried out for the two sensing behaviours described previously. Figure 3 graphs the overall system fault rate (i.e. the percentage of time for which the system contains at least one faulty node, i.e. a node that is not within 2 hops of a logging node and hence needs to differentiate to return to a valid node type) against node speed and sensing period for the two cases discussed above (i.e. where the nodes do, and do not, retain awareness of the existence and loss of network connections). As can be seen from these results, in both cases there is a noticeable, though expected, increase in the percentage of time that the system contains at least one faulty node as the node mobility increases. Of interest is that this increase is gradual and relatively linear, and there does not appear to be a point at which the ability of the system to recover collapses. This is an important observation with regard to the range of node speeds that can be tolerated. Somewhat more surprising is the result with regard to variations in the sensing period. In the "connection aware" case, variations in the sensing period appear to have only a marginal effect on the fault recovery. This can be explained as follows: when a connection between two nodes is broken because of node movement, the direct neighbouring nodes will become aware of this immediately and any
Fig. 3. Fault recovery: Results showing the percentage of time that the system contains at least one faulty node for varying node maximum speed and varying sensing times: (a) where the nodes retain awareness of the existence or failure of network connections; and (b) where the nodes do not monitor the state of the network connections. (Generated by Matlab files GenomeTester v3j.m and GenomeTester v3k.m)
information that either node obtained from the other node is removed from its list of sensed data. This means that the node differentiation can then occur immediately, rather than needing to wait for the next sensing period. The details of the implementation of this are given in Algorithm 2. The only occasions when an immediate re-differentiation does not occur are where the directly impacted nodes are still in a valid state, and it is nodes further away in the neighbourhood that are the only ones that enter a faulty state. In this case the re-differentiation that corrects the fault must wait for a sensing cycle to occur. Overall, this particular behaviour leads to a more rapid response to changes in the network topology and a relative independence of the sensing period, but does require that all nodes retain constant awareness of their connectivity to nearby nodes (often this would be available through the presence of a carrier signal), with the associated resource overheads that this implies. In the "connection unaware" case, there is a slightly stronger relationship with the sensing period. As can be seen, as the sensing period gets longer, the percentage

for all i ∈ Nodes do
    if Node movement leaves region then
        reverse Node velocity
    update Node location
for all i, j ∈ Nodes do
    calculate distance(i, j)
    if distance(i, j) < commRange then
        connected(i, j) = true
for all i, j ∈ Nodes do
    if !connected(i, j) then
        delete Node i sensed data obtained from Node j
Algorithm 2. Node movement
Fig. 4. Connection unaware sensing: Results showing the average percentage of time that a node is faulty for varying node maximum speed and varying sensing times, where the nodes do not monitor the state of the network connections. (Generated by Matlab file GenomeTester v3k.m)
of time that the system contains faulty nodes increases. We can understand this relationship more clearly by looking not only at the time that the whole system is valid (i.e. no nodes at all that are in a faulty state), but at the average validity of each individual node. Figure 4 shows these results. As can be seen, there is a much more significant relationship to the sensing period. Several other observations arise from this data. Firstly, it appears that there is a baseline fault rate that even extremely rapid sensing cannot improve – for example, with the system configuration used in these simulations³, at a node maximum speed of 0.25 m/s it does not appear possible to reduce the average percentage of time that nodes are in a fault state below 1%, irrespective of how quickly the sensing occurs. We believe that this is an artifact of the algorithmic sequencing in our simulation – though even if this is the case, similar behaviours would be likely to emerge in real-time code executing on live mobile devices. A second observation arising from the data shown in Figure 4 is the increasing volatility of the average node fault rate as the sensing period increases. The processes being evaluated are inherently stochastic, both in terms of the speed and associated movement of the nodes (and hence the changes to network topology), and in terms of the node differentiation decisions. At low sensing periods the baseline fault rate (as discussed above) tends to dominate the behaviour. At slower sensing rates however the delay in returning to a valid state from a fault state appears to be significantly more variable. This may be an issue that needs to be taken into account with applications that cannot afford extended periods of unavailability of individual nodes – though it is worth acknowledging that embryonic systems are designed explicitly so that they do not rely on the behaviour, or indeed even availability, of individual nodes.
Relevant factors in the configuration are likely to be area size, number of nodes and hence node density, and the probabilities that affect the differentiation behaviours.
5
Conclusions and Further Work
In this paper we report performance measurements with regard to the fault tolerance of a distributed processing architecture, based on embryonic principles, where the nodes in the system are mobile. The node mobility inherently leads to constant changes in the network topology of the system, and hence changes in the local neighbourhood of individual nodes. This in turn can lead to those nodes being temporarily in a fault state. This fault state is inherently rectified by the self-healing differentiation processes in the nodes – but this process does take time. We have evaluated the relationship between node speed, node sensing period, and fault recovery. Interestingly, we found that rather than reaching a "knee" in the performance curve where, above a certain node speed, the system performance collapsed and became unable to recover from the increasing number of faults, the relationship between node speed and fault recovery was relatively linear. This is likely to be an important finding in terms of dynamically adapting the sensing periods of the nodes to ensure that the performance remains above a specified level. We have also shown that the fault recovery performance becomes much less dependent upon the sensing period if nodes are able to continuously monitor the existence (or loss) of the network connections. This monitoring is unlikely to be feasible in systems involving, for example, sensor networks where the communication is intentionally very sporadic in order to minimise resource utilisation (i.e. most commonly power and/or bandwidth). However in other domains where the connection is maintained (or at least there is a constant carrier) this finding is significant in that it indicates a much lower sensing rate, and hence lower processing and bandwidth overheads, will be tolerable. One aspect that we have not considered, and which is a fruitful source for future investigation, is the possibility of replacing (or even supplementing) state sensing with pro-active state broadcasting. In this scenario, when a node changes its state it would broadcast its changed state to its neighbours. This may circumvent the need for monitoring of the connection (as described in the previous paragraph) as a simpler way of making the performance less dependent on the sensing period. However this could also introduce excessive messages when mobility is high, and a compromise would have to be found. Our measurements are performed over a particular case study: the logging scenario. Ideally, one would like to know the general fault-tolerance properties of the EmbryoWare approach. For this purpose, as future work, it would be interesting to evaluate several different cases, and see whether they share common fault handling patterns.
References
1. Ortega-Sanchez, C., Mange, D., Smith, S., Tyrrell, A.: Embryonics: a bio-inspired cellular architecture with fault-tolerant properties. Genetic Programming and Evolvable Machines 1(3), 187–215 (2000)
2. Tempesti, G., Mange, D., Stauffer, A.: Bio-inspired computing architectures: the embryonics approach. In: Proc. of IEEE CAMP (2005)
3. Miorandi, D., Lowe, D., Yamamoto, L.: Embryonic models for self-healing distributed services. In: Proc. ICST Bionetics, Avignon, France (2009)
4. Saha, D., Mukherjee, A.: Pervasive computing: A paradigm for the 21st century. Computer 36(3), 25–31 (2003)
5. Abelson, H., Allen, D., Coore, D., Hanson, C., Homsy, G., Thomas, F., Knight, J., Nagpal, R., Rauch, E., Sussman, G.J., Weiss, R.: Amorphous computing. Communications of the ACM 43(5), 74–82 (2000)
6. McKnight, L.W., Howison, J., Bradner, S.: Wireless grids — distributed resource sharing by mobile, nomadic, and fixed devices. IEEE Internet Computing 8 (2004)
7. Ahuja, S.P., Myers, J.R.: A survey on wireless grid computing. J. Supercomput. 37, 3–21 (2006)
8. Palmer, N., Kemp, R., Kielmann, T., Bal, H.: Ibis for mobility: solving challenges of mobile computing using grid techniques. In: Proc. of HotMobile, pp. 1–6 (2009)
9. Akyildiz, I.F., Melodia, T., Chowdhury, K.R.: A survey on wireless multimedia sensor networks. Computer Networks, 921–960 (2006)
10. Campbell, A., Eisenman, S., Lane, N., Miluzzo, E., Peterson, R., Lu, H., Zheng, X., Musolesi, M., Fodor, K., Ahn, G.S.: The rise of people-centric sensing. IEEE Internet Computing 12, 1–21 (2008)
11. Yan, T., Ganesan, D., Manmatha, R.: Distributed image search in camera sensor networks. In: Proc. of ACM SenSys (2008)
12. Ding, G., Bhargava, B.: Peer-to-peer file-sharing over mobile ad hoc networks. In: Proc. of IEEE PerCom Workshops, pp. 104–108 (2004)
13. Marossy, K., Csúcs, G., Bakos, B., Farkas, L., Nurminen, J.: Peer-to-peer content sharing in wireless networks. In: Proc. of IEEE PIMRC, vol. 1, pp. 109–114 (2004)
14. Kelényi, I., Csúcs, G., Forstner, B., Charaf, H.: Peer-to-peer file sharing for mobile devices. In: Fitzek, F.H.P., Reichert, F. (eds.) Mobile Phone Programming — Application to Wireless Networking, pp. 311–324. Springer, Heidelberg (2007)
15. Coulouris, G., Dollimore, J., Kindberg, T.: Distributed Systems: Concepts and Design. Addison-Wesley Longman, Amsterdam (2005)
16. Miorandi, D., Yamamoto, L., De Pellegrini, F.: A survey of evolutionary and embryogenic approaches to autonomic networking. Computer Networks (2009) (in press), doi:10.1016/j.comnet.2009.08.021
17. Mange, D., Stauffer, A., Tempesti, G.: Embryonics: a microscopic view of the molecular architecture. In: Sipper, M., Mange, D., Pérez-Uribe, A. (eds.) ICES 1998. LNCS, vol. 1478, pp. 185–195. Springer, Heidelberg (1998)
18. Prodan, L., Tempesti, G., Mange, D., Stauffer, A.: Embryonics: artificial stem cells. In: Proc. of ALife VIII, pp. 101–105 (2002)
19. Wan, P.J., Alzoubi, K.M., Frieder, O.: Distributed construction of connected dominating set in wireless ad hoc networks. Mobile Networks and Applications 9(2), 141–149 (2004)
Evolution and Analysis of a Robot Controller Based on a Gene Regulatory Network
Martin A. Trefzer, Tüze Kuyucu, Julian F. Miller, and Andy M. Tyrrell
Department of Electronics, University of York, UK
{mt540,tk519,jfm7,amt}@ohm.york.ac.uk
Abstract. This paper explores the application of an artificial developmental system (ADS) to the field of evolutionary robotics by investigating the capability of a gene regulatory network (GRN) to specify a general purpose obstacle avoidance controller both in simulation and on a real robot. Experiments are carried out using the e-puck robot platform. It is further proposed to use cross-correlation between inputs and outputs in order to assess the quality of robot controllers more accurately than by observing their behaviour alone.
1
Introduction
Biological development encompasses a variety of complex dynamic systems and processes at different levels, ranging from chemical reactions at the molecular level, to single cells or groups of cells dedicated to specific tasks, to complex multicellular organisms that are capable of adapting to changing environments and exhibit remarkable capabilities of scalability, robustness and damage recovery [1]. It is remarkable how this large number of complex mechanisms work together in nature over long periods of time in an effective and productive manner. This makes biological development a source of inspiration for research into modelling its principles and applying them to engineered real-time systems. Current research in the area of artificial developmental systems (ADS), gene regulatory networks (GRNs) and artificial life (ALife) concentrates both on studying developmental processes from a complex dynamic systems point of view and on their versatility in providing an indirect mapping mechanism between genotype and phenotype. In the first case, the properties of ADSs, particularly GRNs, are investigated by identifying transient states and attractors of such systems [2,3]. Hence, these approaches offer a more theoretical approach to modelling biological development. In the second case, there are examples where GRNs are utilised to grow neural networks or nervous systems for artificial agents [4]. Research that is undertaken into growing large, complex organisms that can represent a variety of things, such as patterns [5], morphogenesis in general [6,7] or designs [8], also fits into the second category. A third research thread seeks to exploit inherent properties of ADSs, such as the ongoing interaction between cell/organism and environment, multicellularity, chemical based gene regulation and homoeostasis, in order to achieve
(a) Mechanisms of the GRN
(b) Structure of genes and GRN
Fig. 1. Protein interaction and regulatory feedback mechanisms of the ADS are shown on the left. On the right, it is illustrated how genes are divided into precondition and postcondition. Proteins can occur in both pre- and postcondition whereas molecules can only occur in the precondition.
adaptive, robust and scalable control mechanisms for robotic systems. Research is undertaken into emergent, autonomous, collaborative behaviours [9,10,11] and modular robotics [12]. In [13] it is shown that GRNs are a viable architecture for the on-line, real-time control of a robot. This paper introduces a GRN based robot controller, similar to the one presented in [13]. It is investigated whether the chemical regulation based GRN mechanisms of the ADS introduced in [14] are suitable to specify a general purpose obstacle avoidance controller for the e-puck robot platform¹. Evolutionary experiments have been conducted in simulation using the Player/Stage simulator and are validated on a real robot. Furthermore, the paper proposes using cross-correlation between inputs and outputs of the GRN controller to assess its quality and ability to adapt, beyond observing behaviour alone. Here, the term adaptivity refers to a controller's ability to automatically calibrate itself to perform the task on which it was trained in both known and unknown environments (in simulation and hardware).
2
The Artificial Developmental System
The GRN based model for artificial development used in this paper is based on the one that has been introduced, and is described in more detail, in [14]. The design considerations of the original ADS are retained, namely the use of data structures and operations in the GRN core that are suitable for embedded systems (i.e. Booleans and integers, no division), and keeping the mechanisms of the ADS as close as possible to their biological counterparts within the boundaries of the chosen data types. Whilst not crucial for the experiments in this paper, the choice of the data structures imposes no loss of generality and is therefore unchanged. However, some improvements are made to the ADS for this paper: first, a dedicated diffusion layer is added and only chemicals that are released to this layer by the cells are subject to diffusion. Chemicals need to be absorbed by the cells from the diffusion layer before they affect gene regulation. This is motivated by natural development. Second, a genetic representation that allows
http://www.e-puck.org/
for variable length GRNs is used in this paper, which allows for a more flexible and compact encoding of the genes, as shown in Figure 1(b). An overview of the mechanisms of the ADS is provided in the following sections. The term chemicals refers to both proteins and molecules. As the experiments in this paper are performed using one single cell, the description of cell signalling mechanisms and growth is omitted. A more detailed description of the ADS can be found in [14].
2.1
Representation and Gene Regulation
The core of the developmental model is represented by a GRN, as shown in Figure 1(a). The genotype is implemented as a string of symbols that encode the start and end of genes, the separation of pre- and postcondition within genes, binding sites and chemicals, as shown in Figure 1(b). Genes interact through chemicals and form a regulatory network. There is at least one major difference between the artificial model and biology: in the ADS used, each binding site matches exactly one chemical, whereas in natural genes binding sites are defined by certain upstream and downstream gene sequences that accept a number of proteins to bind and transcribe their genetic code. The binding sites in natural DNA therefore allow for smooth binding, i.e. the probability that a certain chemical (transcription factor) binds to the DNA is given by how well the binding site of the chemical matches that of the DNA. The current GRN works with four proteins (A . . . D) and eight molecules (a . . . h). Proteins are directly produced by the GRN, whereas molecules are only a product of a gene function as a result of a measurement or interaction that is performed by a protein. In addition to gene regulation, proteins implement dedicated functions and mechanisms of the ADS. Protein A (structuring/functional) defines the cell type, B (sensory) translates sensory inputs into molecules, C (diffusion) manages chemical diffusion and D (plasmodesmata) controls chemical sharing/exchange between adjacent cells and growth. Note that the additional roles of chemicals for robot control are described in Sections 2.3 and 3.
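To illustrate the gene structure of Figure 1(b), the following deliberately simplified Python sketch models a gene as a precondition (chemicals that must be present) and a postcondition (proteins it produces). The activation rule and the quantities consumed and produced are assumptions made for illustration only; they are not the regulation dynamics of the actual ADS.

```python
from dataclasses import dataclass

PROTEINS = "ABCD"          # A..D: can be consumed and directly produced by the GRN
MOLECULES = "abcdefgh"     # a..h: can be consumed, but only produced indirectly

@dataclass
class Gene:
    precondition: list     # chemicals (proteins or molecules) that must be present
    postcondition: list    # proteins produced when the gene is active

def regulate(genes, levels, amount=1):
    """One simplified regulation step (illustrative rule): a gene fires if every
    chemical in its precondition is present; firing consumes those chemicals and
    raises the levels of the proteins in its postcondition."""
    produced = {}
    for gene in genes:
        if all(levels.get(c, 0) > 0 for c in gene.precondition):
            for c in gene.precondition:
                levels[c] -= amount
            for p in gene.postcondition:
                produced[p] = produced.get(p, 0) + amount
    for p, q in produced.items():
        levels[p] = levels.get(p, 0) + q
    return levels
```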
2.2
Evolution of the Genotype
The genotype is derived from a genome that is evolved using a 1 + 4 evolutionary strategy (ES). The genome is represented by a string of integers, and mutation takes place by replacing integers with new random values at a rate of 2% of the genome length. The GRN is obtained by mapping the string of integers to GRN symbols using the modulus operation on the genome. Variable-length genes are achieved via (in)active flags encoded in the genes.
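A sketch of this genotype handling is given below: an integer genome, a 2% point mutation, and the modulus mapping onto GRN symbols. The size of the symbol alphabet and the integer range are assumptions; only the mechanisms are taken from the description above.

```python
import random

N_SYMBOLS = 16        # size of the GRN symbol alphabet (assumed)
MUTATION_RATE = 0.02  # 2% of the genome length is mutated

def mutate(genome, max_value=2 ** 16):
    """Replace roughly 2% of the integers in the genome with new random values."""
    child = list(genome)
    for _ in range(max(1, int(MUTATION_RATE * len(child)))):
        child[random.randrange(len(child))] = random.randrange(max_value)
    return child

def to_symbols(genome):
    """Map the integer genome onto GRN symbols using the modulus operation;
    gene start/end markers, binding sites and chemicals are all symbols."""
    return [g % N_SYMBOLS for g in genome]
```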
2.3 Developing Organisms That Control Robots
The application in this paper is to control a robot via a GRN. Therefore, the GRN has to be able to process input signals from the robot’s infra-red (IR) range sensors and the outputs have to be translated into motor commands.
Since the GRN operates on chemical concentrations, this is achieved by mapping the distance measurements to input chemical concentrations and by computing the speed and turning rate of the robot from output chemical concentrations. Molecules are suitable for presenting the sensory inputs to the GRN since they affect gene regulation and can be consumed, but not directly produced. Hence, it is not possible for the GRN to directly generate input signals that are not actually present in the environment. However, molecules can be indirectly produced via the sensory protein B (Section 2.1). In contrast to the molecules, the GRN is able to quickly change the levels of the proteins (A–D), as they can be both consumed and directly produced; proteins therefore naturally represent the outputs of the system. Thus, values for the speed and turning rate of the robot are calculated from protein levels. Furthermore, as proteins occur in the precondition, they provide feedback of the states of the outputs to the GRN, which can be exploited by the organism for adaptation and self-regulation. Since one GRN with a sufficient number of proteins is able to process the inputs of one robot, a single-cell organism is used to control the robot in the experiments described.
3 E-Puck, Player/Stage and GRN
The experiments presented in this paper are carried out using the e-puck robot platform (http://www.e-puck.org/). Evolution of the ADS that controls the robot and testing on different maps are performed using the open-source robot simulation platform Player/Stage (http://playerstage.sourceforge.net/). Verification of the controller is achieved on a real e-puck robot. As described in Section 2.3, sensory inputs and motor signals are mapped to chemical concentrations which can be processed by the GRN. Due to the 16-bit processor available on the e-puck, the maximum protein level is 65535, and the mapping functions for input and output signals are designed in such a way that the full protein value range is utilised. An important and still open question is how the time scales of development and the robot should be related to each other. In biology, for instance, neural networks operate at a greater speed than gene regulation, which inherently constrains those systems to certain tasks. In engineering and computer science, those boundaries do not exist and are therefore a subject of research. In this paper, one developmental step of the controller (GRN) corresponds to one sensor/motor update cycle at 10 Hz. The latter value is given by the e-puck robot and is set accordingly in simulation.
3.1 Mapping Sensory Inputs
The e-puck provides 8 IR distance sensors, which are positioned around the outside of the robot at 10°, 45°, 90°, 150°, −150°, −90°, −45° and −10°. The range of the IR sensors is theoretically about 10 cm for the real robot. However, measuring and calibrating the actual IR sensor ranges of the e-puck used for these experiments shows linear behaviour only up to a maximum range of 5 cm, which
is assumed as an approximation for the Player/Stage simulation, whereas the different sensors are assigned different maximum ranges in the case of the real e-puck, according to the measurements taken (see max_range below). For simplicity, linear behaviour is also assumed in the case of the real e-puck, despite the fact that an accurate calibration would have to take the exponential characteristics of the IR diodes into account. This leads to the following equation for mapping IR sensor readings to input chemical levels:

chem\_level_i = \begin{cases} 65535 \times \left(1 - \dfrac{sensor_i}{max\_range_i}\right) & \text{if } sensor_i < max\_range_i \\ 0 & \text{if } sensor_i \geq max\_range_i \end{cases}   (1)

with max_range_i = 0.05 for all i in the uncalibrated case and max_range_{0..7} = 0.03, 0.05, 0.05, 0.05, 0.005, 0.03, 0.015, 0.005 in the calibrated case.
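A direct transcription of Equation 1 might look as follows; the function and variable names are chosen for illustration only.

  MAX_PROTEIN = 65535

  def sensors_to_chem_levels(sensors, max_range):
      """Map IR distance readings (metres) to input chemical levels (Equation 1)."""
      levels = []
      for s, r in zip(sensors, max_range):
          levels.append(MAX_PROTEIN * (1.0 - s / r) if s < r else 0.0)
      return levels

  # Uncalibrated simulation: the same 5 cm range for every sensor.
  uncalibrated = [0.05] * 8
  # Calibrated ranges measured on the real e-puck (Section 3.1).
  calibrated = [0.03, 0.05, 0.05, 0.05, 0.005, 0.03, 0.015, 0.005]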
3.2 Deriving Motor Command Signals
Both the e-puck and its simulation model provide an interface that enables the speed and turning rate to be set. While speed is a unit-less value between -1 and 1 (maximum reverse/forward speed), the turning rate is expected in radians. Hence, computing values for speed and turning rate from the output chemical levels can be achieved in a straightforward manner:

new\_speed = 0.15 \times \dfrac{protein_A - protein_B}{65535} + 0.05   (2a)

new\_turnrate = 3.0 \times \dfrac{protein_C - protein_D}{65535}   (2b)

where the maximum speeds between −0.1 and +0.2, the forward speed bias of +0.05 and the possible turning rates between −171° and +171° are arbitrarily chosen. The factors (protein_{A,C} − protein_{B,D})/65535 are normalised to [-1, 1].
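In code, the two output mappings amount to a couple of lines (names are illustrative only):

  def proteins_to_motor(protein_a, protein_b, protein_c, protein_d):
      """Derive speed and turning rate from output protein levels (Eqs. 2a and 2b)."""
      speed = 0.15 * ((protein_a - protein_b) / 65535.0) + 0.05   # unit-less, roughly -0.1..+0.2
      turnrate = 3.0 * ((protein_c - protein_d) / 65535.0)        # radians, roughly +-3 rad (~171 deg)
      return speed, turnrate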
4 Evolution and Analysis of a GRN Based Robot Controller
The task is to optimise a GRN based controller for an e-puck robot via an EA. This experiment is carried out using a simulation model of the e-puck in Player/Stage. The aim is to achieve obstacle avoidance and area coverage in the map shown in Figure 2(a). A relatively basic map is chosen, since the aim is to obtain a low-level controller, which interacts directly with the robot's sensors and actuators rather than operating on a higher abstraction level using predefined actions or behaviours. The size of the map is 1.6 m × 1.6 m, with x/y coordinates between −0.8 m and +0.8 m.
4.1 Fitness Function
The fitness is the averaged score of three rounds of a maximum of 1000 time steps (= developmental steps) each. The chemical levels are initialised to 0 before the first round. For subsequent rounds, the state of the developmental system (chemical levels) is retained in order to allow the ADS to adapt to the environment. The fitness is calculated as shown in Algorithm 1. Note that rounds are only terminated during optimisation when hitting a wall, but not when assessing the behavioural performance in different environments later on.

Algorithm 1. Pseudo-code of the fitness function used
for three rounds do
  reset score
  reset previous distance
  randomise starting position and angle of the e-puck with x,y in the range of −0.65..−0.75 m (lower left corner) and angle in the range of 0..360°
  for 1000 time steps do
    perform sensor reading
    map distance values to molecule levels (a-h)
    perform one developmental step
    calculate new speed and turning rate from protein levels (A-D)
    send motor commands
    // stimulating covering distance:
    if current distance to starting point > previous distance to starting point then
      score = score + distance
    end if
    // stimulating obstacle avoidance:
    if robot bumps into obstacle or wall then
      end this round (and the chance to increase score)
    end if
  end for
  add score to fitness
end for
divide fitness by number of rounds
4.2 Assessing Task Based Performance
In the case of a robot controller, it is possible to qualitatively assess its performance by observing the behaviour of the robot for a period of time and counting the number of times it fails to avoid walls or obstacles. The ability of the robot to explore the map and reach the opposite end of the map can be observed by tracking its path. This is easily achieved in simulation by enabling path highlighting, which is a feature of Player/Stage. In the case of the real robot this becomes more difficult, as generally either a video recording or a tracking system is required. A controller with good performance is re-run for 6000 time steps and the resulting path of the robot is shown in Figure 2(a). The starting point of the robot is in the lower left corner of the map.
Fig. 2. The maze that is used to evolve the GRN robot controller, protein levels and correlation matrix: (a) Cave 1, (b) course of protein levels, (c) correlation matrix, (d) course of correlation. a–h are inputs, A–D are outputs.
At the beginning of the run, the robot bumps into walls twice, indicated by the star symbols. After that, it manages to navigate through the cave with no further collisions. It can also be seen from Figure 2(a) that the robot roams the entire cave by following the wall on its left-hand side. However, the controller achieves slightly more than just wall-following, as it automatically starts turning and returns to the left wall in case it loses track of it. As can be seen from the tracks at the turning point in Figure 2(a), where the robot turns left in one round and right in another, it is not the default behaviour to stop and always turn right when approaching a wall. From this it can be concluded that the GRN achieves control of the robot in a manner that satisfies the requirements of the fitness function: the robot avoids walls and navigates as far away as possible from the starting point. The fact that the robot hits a wall only twice, at the beginning of the run, suggests some
kind of adaptivity of the GRN based controller. Hence, the controller's ability to adapt is further investigated in Section 5.
4.3 Measuring Performance Using Cross-Correlation
Although tracking the path of the robot and counting the number of collisions are suitable to verify whether the evolved controller satisfies the behavioural requirements of the fitness function, this provides no information about the complexity of the states and the dynamics of the supposedly adaptive, GRN based controller. It would be particularly useful to have information about how the controller makes use of the input sensor data and in what way the inputs are related to the outputs, i.e. the actions of the robot, since a common problem [15] (although not analysed and published very often) with evolved controllers is that they are likely to ignore inputs to the system but still manage to find partially optimal solutions. In this paper, it is proposed to use cross-correlation as a measure of dependency between sensory inputs and motor outputs. Cross-correlation is a measure of similarity of two continuous functions, which in general also considers a time-lag applied to one of them. In this case, it is assumed that the time-shift between input and output is 0. In order to obtain values in a bounded range, normalised cross-correlation is used for the experiments presented:

(f \star g) = \frac{1}{n-1} \sum_t \frac{(f(t) - \bar{f}) \cdot (g(t) - \bar{g})}{\sigma_f \cdot \sigma_g}   (3)
where f̄ and ḡ are the mean values, σf and σg are the standard deviations and n is the number of samples of the time series. Note that the usage of mean and standard deviation might be problematic, as the statistical distribution of the samples is unknown. However, using Equation 3 is convenient as the output value range is −1 . . . 1, where −1/1 denote maximum negative/positive correlation and 0 means the signals are uncorrelated. The measured input chemical levels (a–h) and output chemical levels (A–D) for 6000 time steps are shown in Figure 2(b) and the development of the cross-correlation of the chemical levels is shown in Figure 2(d). As can be seen from Figures 2(b) and 4(c), input sensors f,g,h (front, left) show almost constant activity, a,b,c (front, right) show only occasional peaks and d,e (rear) are almost always zero. This corresponds to the observed behaviour where the robot follows the left wall and only occasionally encounters a wall on its right-hand side. In order to answer the question of whether the inputs are actually considered by the controller when generating the output chemical levels A, B, C, D (which define the speed and turning rate of the robot according to Equations 2a and 2b), the course of the cross-correlation values over time (at each point in time from 0 . . . t) for each input/output chemical pair is shown in Figure 2(d). At the beginning of the run, the cross-correlation values keep changing before settling to particular values (although there are still slight adjustments taking place at later iterations, e.g. in the case of a,c,h).
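Equation 3 can be transcribed directly, e.g. with NumPy; the use of the sample standard deviation (ddof=1) here is an assumption, since the exact estimator used by the authors is not stated.

  import numpy as np

  def normalised_cross_correlation(f, g):
      """Normalised cross-correlation of two equal-length time series (Equation 3),
      assuming zero time-shift between input and output; result lies in [-1, 1]."""
      f = np.asarray(f, dtype=float)
      g = np.asarray(g, dtype=float)
      n = len(f)
      return np.sum((f - f.mean()) * (g - g.mean())) / (
          (n - 1) * f.std(ddof=1) * g.std(ddof=1))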
Fig. 3. Comparison of the behaviour of the GRN controller in different maps: (a) Cave 2, (b) U-Maze, (c) distributed obstacles, (d) cross-correlation for Cave 2, (e) cross-correlation for the U-Maze, (f) cross-correlation for the obstacle map.
Again, this suggests that, to a certain extent, an adaptation process is taking place at the beginning of the run. However, this needs to be consolidated further in the following sections, where the controller is tested on different maps and on a real robot. For a better overview, the cross-correlation matrix at time step 5000 is shown in Figure 2(c), including the differences A − B (∝ speed) and C − D (∝ turning rate). Looking at the correlation matrix (black = maximum absolute correlation, white = no correlation), it can be confirmed that there is a correlation between speed and turning rate and the sensors at the front and the sides. It is interesting to see that A − B appears to be correlated to B, but not A, and C − D appears to be more correlated to C than D. This suggests that the evolved controller keeps A and D relatively constant and achieves changes in speed and turning rate by adjusting the chemical levels of B and C.
5 Test and Analysis on Different Maps
In order to investigate whether the controller exhibits adaptive behaviour, at least to a certain extent, it is tested in simulation on three different maps, shown in Figure 3(a,b,c). As can be seen from the recorded tracks, the controller successfully navigates the robot through the three maps, which feature different characteristics. The first map (Figure 3(a)) is similar to the one in which the controller is evolved; hence it is expected that the robot does not collide with walls in this case.
Fig. 4. Results from experiments with the real e-puck: (a) trace of the real e-puck, (b) correlation matrix, (c) e-puck sensors.
Since the primary behaviour of the evolved controller is wall-following, the robot explores the additional branches present in this map rather than going straight to the opposite side of the map. The second map (Figure 3(b)) represents a simple U-maze. The important features of this map are the straight edges and the sharp 90° turns, which impose challenges on the controller. It is observed that, although the robot manages to navigate through the entire map, it hits the wall in all cases where it approaches the wall at a right angle. In those cases, the controller is unable to decide in which direction to turn. The third map (Figure 3(c)) is significantly different from the one used for the evolution of the controller. Despite that, the robot successfully navigates around the obstacles and explores the map. It is interesting to see that in this case the robot bumps into obstacles only at the beginning of the run (the starting point is in the lower right corner of the map) and manages to successfully avoid all obstacles as time goes on. This is again a hint that an adaptation process is actually happening. When comparing the cross-correlation matrices in Figures 3(d,e,f), it can be observed that the cross-correlation values at which the controller settles at time step 5000 are different for different environments. Particularly in the case of the third map (Figure 3(c)), the correlation between the outputs and sensors c,d,e,f has increased. This indicates that those sensors play a more important role in the case of the third map. The results show that the cross-correlation matrix looks different for different maps, which indicates that the controller indeed features different states of operation, depending on the environment.
6 Test and Analysis on an E-Puck Robot
In order to show the relevance of the presented experiments for real-world applications, the evolved GRN robot controller is tested on an e-puck robot. The only modification made for the real robot is the use of the calibrated maximum sensor range values rather than the same value for each sensor, as described in Section 3 and Equation 1. The results obtained with the real robot are shown in Figure 4. For visualisation, the path of the robot has been manually traced in Figure 4(a) using Player/Stage. It is observed that the e-puck is trapped for about the first 2500 out of 6000 time steps in the lower-right corner of the map shown in Figure 4(a), before it successfully resumes its primary wall-following behaviour, from then on without
getting stuck in similar situations again, and navigates through the map. The fact that this behaviour, once the robot manages to escape the corner, is similar to the one observed in simulation (see Figure 3(b)) indicates that there might be some kind of adaptation to the new environment (the real e-puck) taking place. It can be seen from the cross-correlation matrix that the controller settles in a state that looks similar to the one from simulation, but the development of the cross-correlation values over time is significantly noisier. In order to quantitatively compare the cross-correlation matrices, it will be necessary to define a distance measure or to visualise the state space given by the cross-correlation matrices as part of future work.
7 Discussion
This paper has explored the application of an ADS to the field of evolutionary robotics by investigating the capability of a GRN to control an e-puck robot. A GRN controller has been successfully evolved that exhibits a general ability to avoid obstacles in different maps as well as when transferred to a real robot. It has been shown that GRN based controllers have the potential to adapt to different environments, due to the fact that the robot successfully managed to navigate through previously unknown maps and could be transferred to a real robot without further modification of the controller. Hence, it is concluded that GRNs are a suitable approach for real-time robot control and can cope with the variations introduced by changing environments and the sensor noise of a real robot. The results further suggest that it is possible to specify a general purpose obstacle avoidance behaviour via a GRN. It is proposed that cross-correlation between inputs and outputs is a suitable measure to quantitatively assess the quality of robot controllers (particularly evolved ones) beyond merely observing whether the robot exhibits the desired behaviour. It has been shown that the cross-correlation settles at different values for different environments. On the one hand, this simply confirms that the level of activity and the importance of the sensors change for different environments. On the other hand, in conjunction with the observation that the robot still exhibits the desired behaviour, different cross-correlation matrices for different environments indicate that the controller features different stable states of operation and show the ability of the controller to autonomously adapt to a certain extent. As the experiments show, this is the case for different maps in simulation and when transferring the controller to a real robot. However, it is an open question and subject to future work how to investigate whether this emergent adaptivity is a general, inherent property of 'soft' controllers, rather than ones based on thresholds and decisions, or whether it is a specific feature of GRN based, developmental controllers like the one introduced in this paper. One of the greatest challenges in evolutionary computation (EC) is the design of the fitness function. This is particularly true in the case of behavioural fitness functions and real-world systems, which can be extremely noisy, i.e. good solutions have a significant probability of being discarded during the optimisation process simply because of unlucky initial conditions at one iteration of the EA. Therefore,
we will explore the possibility of including the cross-correlation measure in the fitness function, in order to provide an additional quality measure which is independent of the behaviour. Even if the robot does not solve the task, it will be possible to emphasise correlation between inputs and outputs, which will prevent evolution from ignoring the inputs and might offer a means to overcome sub-minimally competent controllers, particularly at the beginning of the optimisation process.

Acknowledgement. This work is part of a project that is funded by EPSRC - EP/E028381/1.
References
1. Wolpert, L., Beddington, R., Jessell, T., Lawrence, P., Meyerowitz, E., Smith, J.: Principles of development. Oxford University Press, Oxford (2002)
2. Kauffman, S.A.: Metabolic stability and epigenesis in randomly constructed genetic nets. Journal of Theoretical Biology 22, 437–467 (1969)
3. De Jong, H.: Hybrid modeling and simulation of genetic regulatory networks: a qualitative approach. In: ERCIM News, pp. 267–282. Springer, Heidelberg (2003)
4. Astor, J.C.: A Developmental Model for the Evolution of Artificial Neural Networks: Design, Implementation and Evaluation. Artificial Life 6, 189–218 (1998)
5. Miller, J.: Evolving developmental programs for adaptation, morphogenesis, and self-repair. In: Banzhaf, W., Ziegler, J., Christaller, T., Dittrich, P., Kim, J.T. (eds.) ECAL 2003. LNCS (LNAI), vol. 2801, pp. 256–265. Springer, Heidelberg (2003)
6. Eggenberger, P.: Evolving morphologies of simulated 3d organisms based on differential gene expression. In: Fourth European Conference on Artificial Life, pp. 205–213. The MIT Press, Cambridge (1997)
7. Bentley, P., Kumar, S.: Three ways to grow designs: A comparison of embryogenies for an evolutionary design problem. In: Proc. of the Genetic and Evolutionary Computation Conf., Orlando, Florida, USA, pp. 35–43. Morgan Kaufmann, San Francisco (1999)
8. Hornby, G.: Generative representations for evolving families of designs. In: Cantú-Paz, E., Foster, J.A., Deb, K., Davis, L., Roy, R., O'Reilly, U.-M., Beyer, H.-G., Kendall, G., Wilson, S.W., Harman, M., Wegener, J., Dasgupta, D., Potter, M.A., Schultz, A., Dowsland, K.A., Jonoska, N., Miller, J., Standish, R.K. (eds.) GECCO 2003. LNCS, vol. 2724, pp. 209–217. Springer, Heidelberg (2003)
9. Quick, T., Nehaniv, C.L., Dautenhahn, K., Roberts, G.: Evolving Embodied Genetic Regulatory Networks-driven Control Systems. In: Banzhaf, W., Ziegler, J., Christaller, T., Dittrich, P., Kim, J.T. (eds.) ECAL 2003. LNCS (LNAI), vol. 2801, pp. 266–277. Springer, Heidelberg (2003)
10. Floreano, D., Mondada, F.: Evolution of Homing Navigation in a Real Mobile Robot. IEEE Trans. on Systems, Man, and Cybernetics–Part B, 396–407 (1996)
11. Ziegler, J., Banzhaf, W.: Evolving Control Metabolisms for a Robot. Artificial Life 7, 171–190 (2001)
12. Groß, R., Bonani, M., Mondada, F., Dorigo, M.: Autonomous self-assembly in swarmbots. IEEE Trans. Robot, 1115–1130 (2006)
13. Kumar, S.: A Developmental Genetics-inspired Approach to Robot Control. In: Proc. of the Workshops on Genetic and Evolutionary Computation (GECCO), pp. 304–309. ACM Press, New York (2005)
14. Trefzer, M.A., Kuyucu, T., Miller, J.F., Tyrrell, A.M.: A Model for Intrinsic Artificial Development Featuring Structural Feedback and Emergent Growth. In: Proc. of the IEEE Congress on Evolutionary Computation (CEC), Norway (2009)
15. Tarapore, D., Lungarella, M., Gomez, G.: Quantifying patterns of agent-environment interaction. Robotics and Autonomous Systems 54(2), 150–158 (2006)
A New Method to Find Developmental Descriptions for Digital Circuits

Mohammad Ebne-Alian and Nawwaf Kharma

Computational Intelligence Lab, Electrical and Computer Engineering Department, Concordia University, Montreal, Québec, Canada
[email protected],
[email protected]
Abstract. In this paper we present a new method to find developmental descriptions for gate-level feed-forward combinatorial circuits. In contrast to the traditional description of FPGA circuits, in which an external bit stream explicitly describes the internal architecture and the connections of the circuit, developmental descriptions form the circuit by synchronously running an identical developmental program in each building block of the circuit. Unlike some previous works, the connections are all local here. Evolution is used to find the developmental code for the given problem. We use an innovative fitness function to increase the performance of evolution in the search for solutions, and also relax the position and order of the inputs and output(s) of the circuit to increase the density of solutions in the search space. The results show that the chance of finding a solution can be increased by up to 375% compared to the use of the traditional fitness function. Preliminary studies show that this method is capable of describing basic circuits and is easily scalable for modular circuits.

Keywords: Developmental Program, Evolutionary Hardware Design, Fitness Function, Scalability.
1 Introduction

Evolvable hardware design (EHW) uses Evolutionary Algorithms (EAs) to find an optimum design of digital circuits in terms of surface, speed and fault tolerance. It can also use the physical characteristics of the underlying chip to improve performance [1][2][3]. Miller [4][5] showed that EHW is also capable of finding innovative designs which outperform traditional human designs in terms of used resources. While EHW can address issues like efficient surface usage, fault tolerance and innovation, it suffers from an intrinsic drawback of Evolutionary Algorithms: the solution is usually not scalable. This means that having the solution to a problem of a smaller size usually does not help to find the solution to a problem of a bigger size any faster. Instead, the runtime of the EA usually grows exponentially with a linear increase of the problem size. A solution to overcome the scalability issue in EAs is to break the direct mapping between the genotype and the phenotype. If the genotype has a one-to-one mapping to the phenotype, searching for more complex individuals will be equal to searching a larger and probably higher-dimensional space. This will eventually cause the EAs to
fail to find solutions to large problems unless there exists a very efficient encoding. Developmental Programs that grow into a final circuit do not have this problem. The size of the circuit is not bounded by the size of the developmental program (DP), and it is possible to have one DP growing into fully functional circuits of vastly different sizes. In approaches like CGP [6][7], although the solution is a developmental code which specifies the connections between the cells, it still needs an external module to do the routing between cells on a physical configurable circuit. We try to eliminate this step by making all the connections local (i.e. cells are only allowed to connect to the immediately neighboring cells). If a primary input or an intermediate signal needs to be routed to a cell far away, the neighboring cells themselves should form a router to pass that input or signal to the destination cell. Each cell by itself should decide either to be a router or to perform a logical operation on its inputs. In this paper we present a method to implement any combinatorial digital circuit at gate level on a grid of configurable hardware elements. The main contribution of this work is that the resulting developmental description includes sufficient information to build the functional circuit, including the gate arrangement and the routing. Keeping in mind that a considerable amount of resources on configurable hardware (e.g. FPGAs), as well as circuit compilation time, is dedicated to routing and connections, this property of our method tends to be attractive for practical problems. We also try to improve the traditional fitness function used in EHW (for example the fitness function used in [5] and [8], or the basic component of the fitness function in [9]) in order to move toward the optimum solution more efficiently. The improvement to the fitness function is described in detail in Section 3.3.
2 Circuit Structure and the Developmental Program

2.1 Circuit Structure

A circuit here is a two-dimensional array of configurable cells. The inputs are provided through the leftmost cells and the outputs are read from the rightmost cells. This means that the direction of the signals is from left to right in a high-level abstract view (Fig. 1.a). To implement this, each cell[i][j] (a cell in row i and column j of the circuit) can only accept inputs from cell[i-1][j-1], cell[i][j-1] or cell[i+1][j-1] (Fig. 1.b). This limit on the connections enables the circuit to form without the need for any external processing module for the routing, as is needed in CGP. In CGP, each cell (m) in the row can be connected to each cell (n) as long as n < m. While that description is enough for the circuit to be implemented, there needs to be a routing mechanism for the circuit to physically connect the cell inputs to the other cell outputs. The circuit resulting from our method does not have such a demand. This means that after each cell sets its own function and input connections to the adjacent cells, the routing is already done, without the need for any external central routing mechanism. Each cell in the circuit has an identical developmental program and 5 properties, each of which can be set to an integer. Fig. 2 represents an abstract view of one cell. The cells at the borders of the circuit are named border cells and all their properties are set to -1. For all other cells, the initial value for all properties is 0. Table 1 lists the cell properties and their possible assigned values for non-border cells, and Table 2 lists the equivalent cell function for each value of the "function" property.
Fig. 1. (a) The grid of cells in a circuit; (b) Potential inputs for the cell (i,j); (c) Naming of the neighbors
Fig. 2. An abstract view of one cell (the shared rule base Rule#1 ... Rule#n together with the five properties Function, Input1, Input2, row, col).
Fig. 3. The structure of one rule.
Table 1. The cell properties and their valid values

  Parameter       Function   Input 1   Input 2   row        col
  Maximum Value   0-7        0-2       0-2       No limit   No limit
Table 2. The output of each cell based on its function value

  "function" value   Cell's output
  0                  0
  1                  Input1
  2                  ~Input1
  3                  Input1 AND Input2
  4                  Input1 OR Input2
  5                  Input1 XOR Input2
  6                  Input1 XNOR Input2
  7                  Input1 NAND Input2
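Table 2 translates directly into a small evaluation function, sketched here for single-bit inputs (the function name is an illustration, not part of the authors' implementation):

  def cell_output(function, in1, in2):
      """Output of a cell for the eight 'function' values of Table 2 (1-bit inputs)."""
      if function == 0:
          return 0
      if function == 1:
          return in1
      if function == 2:
          return 1 - in1            # ~Input1
      if function == 3:
          return in1 & in2          # AND
      if function == 4:
          return in1 | in2          # OR
      if function == 5:
          return in1 ^ in2          # XOR
      if function == 6:
          return 1 - (in1 ^ in2)    # XNOR
      if function == 7:
          return 1 - (in1 & in2)    # NAND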
2.2 Developmental Program

The developmental program is stored in the genome. The circuit size is fixed at the beginning and there is no growth in terms of increasing the number of cells in the circuit. The genome is simply a variable number of ordered IF-THEN rules (Fig. 3). The IF part can check any property of any neighboring cell. Based on the value of that property, the rule can set or update any property of the calling cell. The general format of a rule, as shown in Fig. 3, is as follows:
IF the property p of the neighbor n has the relation r to the value a THEN either assign the value a, or do the action s on the value b and assign it to property p' of the cell,

in which p and p' can be any property of a cell (e.g. function, first input connection, etc.), n is the index of the neighbor (0 to 7, for any of the 8 adjacent cells in Fig. 1.c), r is one of the possible relations from Table 3, and a and b are the possible values for p and p', respectively. The list of possible actions on the parameter b is given in Table 4. Only the parameters n, p, r, a, s, b, p' are stored in the genome. For example, the third rule in Table 5 (1 0 1 -1 3 0 0) reads as follows: if the function of neighbor 1 is equal to -1, then the row property of the cell should be set to 0. It is important to keep in mind that the row and col properties of the cell follow the very same regime as the other properties of the cell; i.e. they are initialized to 0 and are only changed by the developmental program. It is possible for them to take any value at the end of the development of the circuit, and they do not necessarily hold the coordinates of the cell in the circuit.

Table 3. The possible relations to be used in each rule of the genome

  Value of "r" in the rule   0   1   2   3
  Corresponding relation     ≠   =   <   >
Table 4. The possible actions in the THEN part of each rule

  Value of "s" in the rule   0, 1, 2    3          4            5
  Corresponding action       Assign b   Assign a   Assign a+1   Assign a-1
There are 4 pre-written rules in the genome which affect the row and col properties of the cell. These rules are manually designed and added to the genome (Table 5). They aim to simulate the protein gradient along the embryo of multicellular organisms at the axis specification step [14]. The rest of the rules in the genome are generated randomly using a uniformly distributed random generator, and are tuned in the course of evolution. The number of rules in a genome is limited to 25 plus the 4 pre-written rules.

Table 5. The 4 pre-defined rules in the genome

  Rule index   Rule
  1            1 3 3 -2 3 4 0
  2            3 4 3 -2 4 4 0
  3            1 0 1 -1 3 0 0
  4            3 0 1 -1 4 0 0
During the development of the circuit, cells update their structure synchronously. A developmental step is composed of updating all the columns of the circuit, starting
from the leftmost column and moving to the next column to the right until reaching the rightmost column. Updating each column is done by updating the topmost cell in the column and then moving to the next cell below, until reaching the lowest cell in the column. A solution is a genome (i.e. rule base) which causes the desired behavior to emerge in the circuit after going through a certain number of developmental steps. The number of developmental steps needed for this is determined by the evolution, as is the genome itself.
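A minimal sketch of this update order is given below. The rule interpreter itself is passed in as a callable, because the exact numbering of properties used by the rules is not fully specified here; the function names are illustrative assumptions.

  def developmental_step(grid, apply_rules):
      """One developmental step: columns are scanned left to right and, within
      each column, cells top to bottom; apply_rules(grid, i, j) updates one
      cell's properties according to the shared IF-THEN rule base (Sect. 2.2)."""
      rows, cols = len(grid), len(grid[0])
      for j in range(cols):          # leftmost column first
          for i in range(rows):      # topmost cell first
              apply_rules(grid, i, j)

  def develop(grid, apply_rules, steps):
      """Run a fixed number of developmental steps on a fixed-size circuit."""
      for _ in range(steps):
          developmental_step(grid, apply_rules)
      return grid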
3 Applying Evolution to Find the Developmental Program

3.1 User Interface and Problem Statement

We apply evolution as a tool to find the solution to the given circuit design problem. As explained in Section 2.2, the solution is a developmental program of the format mentioned in that section, the necessary number of steps for the circuit development, as well as the size of the circuit. Note that the developmental program itself does not provide or care about the size of the circuit. Any developmental program can be run on any circuit of any size. It is evolution's task to find the appropriate circuit size for the developmental program. To define a specific problem, the user has to state the number of inputs, the number of outputs, and the mapping between the input patterns and the output(s). The latter is done by giving the program the set of minterms created on each output pin. No information about the circuit's possible internal architecture is provided by the user. For example, the following lines define a full adder:

Number of inputs: 3; Number of outputs: 2
output[0] = {3, 5, 6, 7} //carry
output[1] = {1, 2, 4, 7} //sum

Evolution also gives the exact position of each input and output signal on the circuit. Unlike some previous works, in which the user had to fix the position and the order of input and output signals, evolution is free to find the optimum placement of the I/O signals on the circuit. It is easy to realize that relaxing the I/O interface in this manner increases the density of the solutions in the search space. The simplest evidence of this is that the horizontal flip of a solution circuit is now a solution circuit itself, something which would not be the case if the inputs were fixed. The inputs are always provided on the left border (cells[i][0]) and the outputs are read from the right border of the circuit (cells[i][N-1]). Fig. 5 shows a sample full adder found by the program for the above description. It is important to remember that the program does not directly find the circuit in Fig. 5, but a generative code which produces the circuit after going through the developmental process.

3.2 Evolutionary Algorithm

Evolution starts by creating a fixed-size population of random individuals. The population size was 500 in most of our experiments.
As Halavati explains in [10], for the evolution of cooperative rule-base systems for static problems in which all the training instances are available, the Pittsburgh approach [11], in which an individual contains the whole rule base, works better than the Michigan approach [12], in which each individual is only one rule and the whole population together forms the rule-base system. Each individual here is therefore a complete circuit, including the developmental program, the number of developmental steps for that program and the size of the circuit. To create a random individual, we first create a random-sized circuit with the following restrictions (Eq. 1):

Number of inputs + 2 <= Number of rows <= 2 × Number of inputs + 2
3 <= Number of columns <= 2 × Number of inputs + 3     (Eq. 1)
This is because the minimum number of rows needed to provide the inputs is equal to the number of inputs, and we also need two rows for border cells. The minimum number of columns is 1, plus two columns for the border cells. The upper limit for the number of rows and columns is constrained by the available processing power. The numbers of rows and columns in the initial population are evenly distributed between the two limits. After setting the size of a circuit, a random genome is created and assigned to the circuit. The random genome is composed of a random number of rules (limited to 29), with the first 4 pre-designed rules listed in Table 5. The rest of the rules are filled with uniformly distributed random parameters. The fitness function is one of the main contributions of this work and is described in detail in Section 3.3.

We use a fixed population size with 2% elitism. Parent selection is a tournament selection of size 3, and each parent goes through either mutation or cross-over with another parent to create the new individuals. The main reason that tournament parent selection is used is to maintain the diversity of the population. Diversity can be measured at both the genome and phenome levels. When fitness-proportional parent selection is used, both genomic and phenomic diversities drop quickly. Figure 4 shows both diversities during one run of the evolutionary algorithm for the full adder. It is observed that the genomic diversity maintains a much higher value than the phenomic diversity. This might be because of redundant rules in the genome which never get activated. If two genomes have the same effective rules and different redundant rules, they will be translated to the same phenome while the genomes differ. The redundant rules can be removed once the solution is found. That will make the implementation of the physical circuit cheaper and more efficient.

Fig. 4. The genomic and phenomic diversity in one run of the evolutionary algorithm

The rate of mutation to cross-over is 1 in this experiment. There are 4 types of mutation, which either add, delete or swap rules, or change the parameters of a rule in the genome. The different mutation types have an equal chance of occurring, as does each parameter of being changed in the parameter-level mutation. Cross-over can be either single-point or shuffle and is always at the rule level. Each individual on average creates two children and adds them to the intermediate pool, which is then sorted, and the remaining 98% of the next generation population are selected through fitness-proportional selection. It is important to remember that the fitness of a circuit is only a measure of the behavior of the circuit (i.e. its response to the provided inputs) and reflects neither the circuit size nor the genome size. The effect of the circuit and genome size is in the tournament parent selection. If the two randomly selected
individuals have the same fitness in the parent selection step, the one with the smaller circuit size is selected. If this does not break the tie, the one with the smaller genome size is selected. Because the smallest possible size of the circuit or the genome is not known to the user, finding an individual with fitness equal to 1 is not enough to stop evolution. The evolution stops if a smaller solution with fitness equal to 1 is not found after a certain number of generations, or after 20000 iterations.

3.3 Fitness Function

The fitness function is a critical and possibly the most important part of any evolutionary algorithm. It is the fitness function which makes the search space smooth or sharp, conducts the search in a more evolvable search space [13] and eventually guides us to the optimum solution. A good fitness function not only maximizes at the target solution point but also increases gradually as we get close to the solution. However, picking a suitable fitness function is not always a trivial task. The "distance" between two solutions is not always very clear, keeping in mind that we usually have no information about the structure of the optimum solution. We can usually rate the fitness based on the performance of individuals and not their structure. This makes our job difficult in problems of digital circuit design. Unlike most biological organisms, in which a slight change in the DNA usually leads to a non-fatal and slight change in functionality, a slight change in the developmental or non-developmental description of a digital circuit very often dramatically changes the behavior of the circuit. The reason for this is the crisp nature of digital circuits, in which changing one single gate might invert all the final outputs. An example of this is when an AND gate from which the final output is produced is replaced with a NAND gate. In the fitness function often used in EHW, such a change will instantly drop the fitness from 1.0 to 0.0.
The fitness function often used in EHW is the number of correct outputs for all possible combinations of inputs [5][8]. This method suffers from the issue mentioned above, i.e. a small change in the architecture of the circuit might dramatically drop the fitness even if the architecture is very close to the correct one. This makes the search space extremely crisp, with sharp changes, in which evolution needs to be very lucky not to miss the optimum solution. To help with this, we have changed the fitness function to reflect not only the number of correct outputs for the combinations of inputs, but also the sensitivity of the outputs to the change of each input. The effect of the latter part is most obvious in the given AND–NAND example. While the AND and NAND gates have complementary outputs, their sensitivities to their corresponding inputs are the same. To explain this, consider a combination of the inputs in which the first input is 0 and the second is 1. The change of the first input from 0 to 1 will change the output of both gates (i.e. the gates are sensitive to the first input in this input combination). The change of the second input from 1 to 0 will not change the output of either of the gates, so they have the same sensitivity on both inputs. This holds true for all other combinations of inputs. Thus, if the final output gate is changed from an AND to a NAND gate, our fitness function still rewards the individual. The traditional fitness function used so far does not consider this similarity and only takes the net outputs into account. Note that our method only examines the sensitivity of the circuit to the primary inputs and not to any intermediate signal. Our experiments show that adding the sensitivity analysis improves the efficiency of the search in terms of decreasing the iterations needed to find the optimum solution. To support this, we tried 50 runs of the evolutionary algorithm, with the population set to 500 and for a maximum of 20000 generations, to find a full adder. Using the traditional fitness function, evolution could find a solution in only 4 runs. Applying our described fitness function, this number was increased to 15. Improving the chance of finding a solution by 375 percent clearly shows the advantage of this new fitness function over the traditional one for EHW. The other property of our method is that it sets development free to locate the inputs and outputs at any desired row. This is in contrast to other works that fix the positions of the inputs and outputs and force the development to read the inputs from, and produce the outputs at, those fixed positions. The fitness function here examines every possible combination of input and output positions and accepts the best combination as the input–output positions. The density of the solutions is thus increased in the search space.
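A sketch of such a sensitivity-aware fitness evaluation is given below. The equal weighting of the two terms, the function signatures and the absence of normalisation are assumptions made only for illustration; they do not reproduce the authors' exact implementation.

  from itertools import product

  def behavioural_fitness(circuit, target, num_inputs):
      """Correct-output count plus agreement of each output's sensitivity to
      flips of each primary input; circuit/target map an input bit tuple to a
      tuple of output bits."""
      correct = 0
      sensitivity = 0
      for bits in product((0, 1), repeat=num_inputs):
          out, ref = circuit(bits), target(bits)
          correct += sum(o == r for o, r in zip(out, ref))
          for i in range(num_inputs):               # flip one primary input
              flipped = bits[:i] + (1 - bits[i],) + bits[i + 1:]
              out_f, ref_f = circuit(flipped), target(flipped)
              # does this input flip change each output in the same way?
              sensitivity += sum((o != of) == (r != rf)
                                 for o, of, r, rf in zip(out, out_f, ref, ref_f))
      return correct + sensitivity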
4 Results

We tried to find the developmental code for 3 different circuits: a full adder, a 2-bit multiplexer and a 4-bit parity generator. Evolution was able to find solutions for all these circuits. Fig. 5 to Fig. 7 show examples of the found solutions. The number of developmental steps for all shown circuits is 2. Fig. 8 illustrates the fitness of the fittest individual and also the average fitness in the population for the run of the evolutionary algorithm whose result is depicted in Fig. 5. We can observe from Fig. 8 that the average fitness in the population quickly follows the highest fitness in the population. This means that in each population there are many equally fit individuals.
Fig. 5. An example of an evolved full adder with the corresponding genome (6130024, 4023047, 5201130, 3 4 3 -2 4 4 0, 3410151, 4111015, 0025101, 0421231). The floating wires are pulled down to 0 by the border cells.
Fig. 6. An example of an evolved 2-bit multiplexer with the corresponding genome (7201027, 0017001, 5020025, 1202101). The floating wires are pulled down to 0 by the border cells.
Fig. 7. An example of an evolved 4-bit parity generator with the corresponding genome (5320221, 5100112, 0001015, 0232440). The floating wires are pulled down to 0 by the border cells.
Fig. 8. The highest and average fitness in the population for the evolution of the circuit in Fig. 5.
The two or three randomly selected parents therefore have a high chance of having the same fitness, leading the individual size to play an important role in parent selection. The experiments showed that considering the individual size in survival selection highly favors small individuals, and the EA eventually fails to find the desired circuit. The scalability of the found solutions to larger problem sizes was also studied. For this, we first ran the evolution 20 times to find 30 3-bit parity generators in a separate experiment. Then we included those solutions in the initial population for the 4-bit parity generator problem, and ran the evolution 20 times to find 20 4-bit parity generators. We repeated the experiment 20 more times from scratch, i.e. without including the solutions of the smaller-sized problem in the initial population, and measured the average of the best fitness in each generation. The results show that the performance of the evolution was greatly increased when it included the solutions to the smaller-sized problem. To compare the results of the two case studies we defined a new measure that includes all three parameters of circuit fitness, circuit size and genome size. The formula to adjust the fitness is shown in Equation 2:

Adjusted\ fitness = k \times fitness + \frac{Max.Circ.Size - Circ.Size}{Max.Circ.Size - Min.Circ.Size} + \frac{Max.Gen.Size - Gen.Size}{Max.Gen.Size - Min.Gen.Size}   (Eq. 2)

in which k is a coefficient to enable the minimum increment of fitness to overcome the negative effect of the resulting growth of the circuit and the genome size and, according to the fitness function explained in Section 3.3 and the detailed code
implementation, is 128/3. Max.Circ.Size and Min.Circ.Size are 110 and 18 respectively (Eq. 1) for a 4-bit parity generator, and Max.Gen.Size and Min.Gen.Size are 29 and 4 respectively (Section 2.2). Fig. 9 shows the "normalized" performance of evolution when it already has knowledge of the smaller-sized solutions, compared to the performance of evolution when it starts from scratch, without any knowledge of the solutions to the smaller-sized problem.
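For reference, Equation 2 with the constants quoted above can be written as a one-line helper (a sketch only; the parameterisation is an assumption):

  def adjusted_fitness(fitness, circ_size, gen_size,
                       k=128 / 3.0, max_circ=110, min_circ=18,
                       max_gen=29, min_gen=4):
      """Adjusted fitness of Eq. 2 with the 4-bit parity generator constants."""
      return (k * fitness
              + (max_circ - circ_size) / float(max_circ - min_circ)
              + (max_gen - gen_size) / float(max_gen - min_gen))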
Fig. 9. Improvement in evolution’s performance by introducing the solutions of the smaller size problems
5 Conclusion and Future Work

In this paper we presented a method for evolving developmental programs to describe gate-level feed-forward digital circuits with local connections. We hope that keeping the connections local will eliminate the routing overhead and difficulties on configurable circuits. We also introduced a new fitness function and showed that evolution is almost 4 times more successful in finding solutions for certain problems if it uses our fitness function instead of the traditional fitness function used in EHW. We also showed that the solutions are scalable for modular circuits such as parity generators. In the next step of this research, we will focus more on scalability and also study the robustness and resolvability of the resulting circuits. We will also focus on enabling evolution to reuse already discovered modules in larger problems.
References
1. Stoica, A., Keymeulen, D., et al.: Evolutionary experiments with a fine-grained reconfigurable architecture for analog and digital CMOS circuits. In: Proceedings of the First NASA/DoD Workshop on Evolvable Hardware, Pasadena, CA, USA, July 19–21. IEEE Comput. Soc., Los Alamitos (1999)
2. Stoica, A., Zebulum, R.S., et al.: Silicon validation of evolution-designed circuits. In: 2003 NASA/DoD Conference on Evolvable Hardware, Chicago, IL, USA. IEEE Comput. Soc., Los Alamitos (2003)
3. Takahashi, E., Kasai, Y., et al.: A Post-Silicon Clock Timing Adjustment Using Genetic Algorithms. In: 2003 Symposium on VLSI Circuits. IEEE Press, Los Alamitos (2003)
4. Miller, J.F., Thomson, P., et al.: Designing electronic circuits using evolutionary algorithms. Arithmetic circuits: a case study. Applications of Computer Systems. In: Proceedings of the Fourth International Conference, Szczecin, Poland, Wydwnictwo i Drukarnia Inst. Inf. Polytech. Szczecinskiej, Szczecin, Poland, November 13–14 (1997)
5. Miller, J.F., Job, D., et al.: Principles in the Evolutionary Design of Digital Circuits - Part I. Genetic Programming and Evolvable Machines 1(1-2), 7–35 (2000)
6. Miller, J.F., Thomson, P.: Cartesian Genetic Programming. In: Poli, R., Banzhaf, W., Langdon, W.B., Miller, J., Nordin, P., Fogarty, T.C. (eds.) EuroGP 2000. LNCS, vol. 1802, pp. 121–132. Springer, Heidelberg (2000)
7. Miller, J.F., Thomson, P.: A Developmental Method for Growing Graphs and Circuits. In: Tyrrell, A.M., Haddow, P.C., Torresen, J. (eds.) ICES 2003. LNCS, vol. 2606, pp. 93–104. Springer, Heidelberg (2003)
8. Hartmann, M., Haddow, P.C.: Evolution of Fault-Tolerant and Noise-Robust Digital Design. In: IEE Proceedings. Computers and Digital Techniques, vol. 151, pp. 287–294 (2004) ISSN 1350-2387
9. Djupdal, A., Haddow, P.: Evolving Redundant Structures for Reliable Circuits – Lessons Learned. In: Proceedings of the Second NASA/ESA Conference on Adaptive Hardware and Systems, pp. 455–462 (2007) ISBN: 0-7695-2866-X
10. Halavati, R., Bagheri Shouraki, S.: Symbiotic Combination to Avoid Linkage Problem. In: Chen, Y.-p., Meng-Hiot, L. (eds.) Linkage in Evolutionary Computation, Series: Studies in Computational Intelligence, vol. 157. Springer, Heidelberg (2008) ISBN: 978-3-540-85067-0
11. Smith, S.F.: Flexible Learning of Problem Solving Heuristics Through Adaptive Search. In: Proc. 8th IJCAI (August 1983)
12. Holland, J.H.: Escaping brittleness: The possibilities of general-purpose learning algorithms applied to parallel rule-based systems. In: Michalski, R.S., Carbonell, J.G., Mitchell, T.M. (eds.) Machine Learning: An Artificial Intelligence Approach, vol. 2. Morgan Kaufman Publishing, San Mateo (1986)
13. Gordon, T., Bentley, P.: Handbook of Nature-Inspired and Innovative Computing, Section II, Ch. 12, pp. 387–432. Springer, US (2006) ISBN: 978-0-387-40532-2
14. Gilbert, S.: Developmental Biology, 4th edn., ch. 15, pp. 542–544. Sinauer Associates, Inc. (1994) ISBN 0878932496
Sorting Network Development Using Cellular Automata

Michal Bidlo, Zdenek Vasicek, and Karel Slany

Brno University of Technology, Faculty of Information Technology, Božetěchova 2, 61266 Brno, Czech Republic
{bidlom,vasicek,slany}@fit.vutbr.cz
Abstract. Sorting network design represents a task that has often been considered as a benchmark for various applications of evolutionary design and optimization techniques. Although the specific structure of this class of circuits allows a simple encoding to be used in combination with additional mechanisms for optimizing the area- and delay-efficiency of the designed sorting networks, the design of large sorting networks remains a difficult task. This paper proposes a novel cellular automaton-based approach for the development of specific instances of sorting networks. In order to explore the area of generative cellular automata applied to this specific circuit structure, two different encodings are introduced: (1) an absolute encoding and (2) a relative encoding. The abilities of both techniques are investigated and a comparative study is provided, considering a variety of experimental settings.

Keywords: cellular automata, sorting networks, evolutionary design.
1 Introduction
In recent years, many approaches have been introduced for the evolutionary design of digital circuits. Probably the most popular approach is Miller's cartesian genetic programming [13]. His approach represents a typical direct mapping between genotypes and phenotypes in a genetic algorithm for the evolution of digital circuits. Developmental systems represent another class of systems that may be utilized for circuit design. For example, Miller's developmental cartesian genetic programming [14], Tufte's FPGA-based approach for evolving functionality in cellular systems [18] and Gordon's developmental approach in evolvable hardware [6] represent instances of evolutionary developmental systems.
1.1 Cellular Automata
Cellular automata (CA), invented by Ulam and von Neumann in 1966 [15], represent a mathematical model originally intended as a formal framework to study the behavior of complex systems, especially the question of whether
computers can self-replicate. Cellular automata may also be considered as a biologically inspired technique to model and simulate cellular development. A cellular automaton consists of a regular structure of cells, each of which can occur in one state from a finite set of states. The states are updated synchronously in parallel according to a local transition function. The synchronous update of all the cells of the CA is called a developmental step. The next state of a cell depends on the combination of states in the cellular neighborhood. In this paper we consider the cellular neighborhood consisting of the cell and its two immediate neighbors. Moreover, cyclic boundary conditions will be considered, i.e. the first and the last cell of the CA are considered to be neighbors and the 1D CA can then be viewed as a circle. The local transition function defines the next state of a cell for all the possible combinations of states in the cellular neighborhood. Let us denote by s1 s2 s3 → sn a rule of the local transition function, where s1 s2 s3 represents the combination of states of the cells in the cellular neighborhood and sn denotes the next state of the middle cell. Cellular automata have been applied to solve many complex problems in different areas. A detailed survey of the principles and analysis of various types of cellular automata and their applications is given in [19]. Sipper [17] investigated the computational properties of CA and proposed an original evolutionary design method for the "programming" of cellular automata called cellular programming. He demonstrated the success rate of this approach on some typical problems related to cellular automata, e.g. synchronization, ordering or random number generation. In recent years, scientists have been interested in the design of cellular automata for solving different tasks using evolutionary algorithms. Miller investigated the problem of evolving a developmental program inside a cell to create a multicellular organism of an arbitrary size and characteristic [12]. Tufte and Haddow utilized an FPGA-based platform of Sblocks [7] for the online evolution of digital circuits. The system actually implements a cellular automaton whose development determines the functions and interconnection of the Sblock cells in order to realize a function [18]. The cellular automata-based developmental approach has successfully been applied to the evolutionary design of combinational circuits [1]. This paper represents a continuation of this kind of research, considering the development of sorting networks. Two different sets of experiments will be presented, utilizing various encodings of the sorting networks in the developmental process of the cellular automaton. An absolute encoding and a relative encoding will be proposed in order to determine how the positional information, represented by the index of a cell in the CA, may influence the evolutionary design process and the properties of the sorting networks generated by the cellular automaton by means of those encodings. Several sets of experiments will be presented considering various setups of the developmental system. Statistical results of the evolutionary process and the properties of the resulting sorting networks are investigated in dependence on the experimental setup.
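The local update with a three-cell neighborhood and cyclic boundary conditions can be written compactly as below; the elementary two-state rule 90 is used here only as a familiar example and is not one of the transition functions evolved later in the paper.

  def ca_step(state, rule):
      """One synchronous developmental step of a 1D CA; 'rule' maps a
      (left, centre, right) state tuple to the next state of the centre cell."""
      n = len(state)
      return [rule[(state[(i - 1) % n], state[i], state[(i + 1) % n])]
              for i in range(n)]

  # Example: rule 90 (next state = XOR of the two neighbors).
  rule90 = {(l, c, r): l ^ r for l in (0, 1) for c in (0, 1) for r in (0, 1)}
  print(ca_step([0, 0, 1, 0, 0], rule90))   # -> [0, 1, 0, 1, 0]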
1.2 Sorting Networks and Their Design
The concept of sorting networks (SN) was introduced in 1954; Knuth traced the history of this problem in his book [11]. A sorting network is defined as a sequence of compare–swap operations (comparators) that depends only on the number of elements to be sorted, not on their values. A compare–swap of two elements (a, b) compares and exchanges a and b so that a ≤ b holds after the operation.

The main advantage of any sorting network is that the sequence of comparisons is fixed. It is therefore suitable for parallel processing and hardware implementation, especially if the number of sorted elements is small. Figure 1 shows an example of a 3-input sorting network.

The number of compare–swap components and the circuit delay are two crucial parameters of any sorting network. By delay we mean the minimal number of groups of compare–swap components that can be executed sequentially. Designers try to minimize the number of comparators, the delay, or both. Some of the best currently known sorting networks were designed (or optimized) using evolutionary techniques [3,5,4,8,10,9]. In most cases the evolutionary approach was based on the direct encoding shown in Fig. 1 (in which each comparator is encoded by a pair of integers denoting its connections).

In order to find out whether an N-input sorting network operates correctly, one would in principle have to test N! input combinations. Thanks to the zero–one principle this number can be reduced: if an N-input sorting network sorts all 2^N input sequences of 0's and 1's into a non-decreasing sequence, it will sort any arbitrary sequence of N numbers into a non-decreasing sequence [11].

Sorting networks are usually designed for a fixed number of inputs, and this approach was also applied in the evolutionary techniques mentioned above. However, such evolved solutions are usually not scalable. Conventional approaches exist for the generic design of sorting networks, with some examples (e.g. straight insertion or selection sort) described in [11]. These generic approaches were improved by evolution using a generative encoding called instruction-based development [16], [2]. However, the sorting networks created using these generic principles are usually not efficient in comparison with appropriate instances designed and optimized for a fixed number of inputs.
Fig. 1. (a) A three-input sorting network consists of three comparators. (b) Alternative symbol. This network can be described using the string (0,1)(1,2)(0,1)
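As a concrete illustration of the zero–one principle, the short Python sketch below (ours, not from the paper) verifies the three-input network (0,1)(1,2)(0,1) of Fig. 1 on all 2^3 binary input vectors:

from itertools import product

def apply_network(network, values):
    """Apply a sequence of compare-swap operations (i, j) with i < j."""
    v = list(values)
    for i, j in network:
        if v[i] > v[j]:
            v[i], v[j] = v[j], v[i]
    return v

def sorts_all_binary_inputs(network, n):
    """Zero-one principle: testing all 2^n binary vectors suffices."""
    return all(apply_network(network, bits) == sorted(bits)
               for bits in product((0, 1), repeat=n))

print(sorts_all_binary_inputs([(0, 1), (1, 2), (0, 1)], 3))   # True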
2 Development of Sorting Networks Using Cellular Automata
In this section, two different encodings will be introduced for the development of sorting networks by means of cellular automata. Each encoding of the sorting network is based on a suitable enhancement of the local transition function of the CA. The fundamental principle of this enhancement is that each rule of the local transition function carries, next to the new cell state, an additional piece of information that represents a prescription for generating a compare–swap component. The meaning of this additional information and the way the compare–swap components are generated are described in the following paragraphs.

In this paper, two different encodings of the sorting networks inside the local transition function are investigated: (1) an absolute encoding and (2) a relative encoding. In both encodings, every cell generates a comparator during each developmental step of the CA, and in both cases the number of cells N of the CA corresponds to the number of inputs of the sorting network to be developed. The comparator to be generated is specified by the rule of the local transition function that is applied to determine the next state of the cell, depending on the combination of states in the cellular neighborhood. Therefore, up to N comparators can be generated in one developmental step of the CA. The conditions for including a generated comparator in the sorting network being developed are specified separately for each encoding.

In order to ensure that the process of generating a sorting network is deterministic, a unique ordering of the cells in the CA is introduced; the series of comparators generated by the cells is then specified by this ordering. The following ordering will be applied in all the experiments presented in this paper. Consider a CA that consists of four cells ordered as c0 c1 c2 c3 and that performs three developmental steps. Then a series of comparators C0,0 C1,0 C2,0 C3,0 C0,1 C1,1 C2,1 C3,1 C0,2 C1,2 C2,2 C3,2 is generated during the development of the CA, where Ci,j represents the comparator generated by cell ci in the j-th developmental step. The initial state of the CA together with the enhanced local transition function is the subject of the evolutionary design process (see Section 3 for details). For the purposes of the experiments presented in this paper, let us denote by S the number of possible cell states of the CA and by T the number of steps of the CA after which the generated comparator sequence is evaluated.
2.1 Absolute Encoding
In fact, the absolute encoding represents a direct comparator-generating technique using a cellular automaton. In order to accomplish this, a pair of non-negative integers (w1, w2) satisfying the relation w1 < w2 is associated with each rule of the local transition function. Therefore, the general form of a rule is s1 s2 s3 → sn : w1 w2,
where the part on the right of the colon describes a comparator which is generated by a cell of the CA if this cell determines its next state according to the given rule. These integers represent the indices of the inputs of the comparator to be generated; the range of both of them is from 0 to N − 1, where N corresponds to the number of cells and to the number of inputs of the sorting network to be developed. For example, consider a 3-cell CA whose behavior is specified by a local transition function containing the rules (1) 010 → 0 : 0 1, (2) 100 → 1 : 1 2, (3) 001 → 1 : 0 1. The initial state of the CA is 100. Let the CA perform a developmental step, i.e. its state afterwards is 011. The first cell has determined its next state according to rule (1) and has therefore generated the comparator (0, 1). The state of the second cell has been calculated using rule (2), so the comparator (1, 2) has been added to the sequence after the previous one. Finally, the last comparator, (0, 1), has been generated by the third cell according to rule (3). In summary, one developmental step of this CA has produced the sequence of comparators (0, 1)(1, 2)(0, 1), which corresponds to the sorting network shown in Figure 1.
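The worked example above can be reproduced by a small Python sketch (our illustration, not code from the paper); each rule maps a neighbourhood to a next state together with the comparator it emits:

# Sketch of the absolute encoding using the three rules of the example above.
rules = {
    (0, 1, 0): (0, (0, 1)),   # rule (1): 010 -> 0 : 0 1
    (1, 0, 0): (1, (1, 2)),   # rule (2): 100 -> 1 : 1 2
    (0, 0, 1): (1, (0, 1)),   # rule (3): 001 -> 1 : 0 1
}

def develop_step(states, rules):
    """One developmental step; every cell also emits a comparator."""
    n = len(states)
    next_states, comparators = [], []
    for i in range(n):
        hood = (states[(i - 1) % n], states[i], states[(i + 1) % n])
        new_state, comp = rules[hood]
        next_states.append(new_state)
        comparators.append(comp)
    return next_states, comparators

print(develop_step([1, 0, 0], rules))
# -> ([0, 1, 1], [(0, 1), (1, 2), (0, 1)])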
2.2 Relative Encoding
The aim of the relative encoding is to utilize the positions of the cells in the cellular automaton for generating the compare–swap elements. The enhanced local transition function consists of rules, each of which has the form s1 s2 s3 → sn : r d, where the part on the right of the colon has the following meaning. The value of r specifies the index of the first comparator input w1 relative to the position (cell index) c of the cell that generates the comparator, i.e. w1 = c + r. The range of r is from −R to R, where R is a positive integer specified as a parameter for a given set of experiments. The value of d represents the "width" of a comparator, i.e. the difference between the indices of its inputs. Therefore, the index of the second input w2 is calculated as w2 = w1 + d. The maximal value of d (let us denote it D) represents the second parameter of the design system; the value of D was determined experimentally as D = 2R for a given set of experiments. If w1 or w2 exceeds the index range of the inputs of the target sorting network, then the comparator is not generated (i.e. it is not included in the comparator sequence generated by the CA). An example of the development of a sorting network using the relative encoding is illustrated in Figure 2 (the initial state of the CA is 0100). The cells of the CA and the inputs (wires) of the target sorting network are indexed by integer values in the range from 0 to 3. The first cell, at position c = 0, generates a comparator using the relative value r = 0 and the comparator width d = 1 (see the pair 0, 1 specified in the rule on the right of the first cell). Therefore, the first input of the comparator is calculated as w1 = c + r = 0 + 0 = 0. The second
input is calculated as w2 = w1 + d = 0 + 1 = 1, and the comparator (0, 1) is generated (see the comparator denoted as 1 in the right part of Figure 2). The same principle is used for generating comparators 2 (2, 3), 3 (0, 2) and 4 (1, 3). After the first step, the CA is in state 1110. During the second step the first cell (at c = 0) generates a comparator using the relative value −1 and width 1. However, after calculating the comparator inputs, the pair (−1, 0) is obtained. This is not a valid comparator for a 4-input network and therefore it is not included in the developed comparator sequence (illustrated by the dashed comparator 5 of the sorting network in Figure 2). Similarly, comparator 7 (3, 4), generated by the cell at c = 2, is also invalid and hence meaningless for the target sorting network. The sorting network shown in the right part of Figure 2 has been created in two developmental steps of the CA. Note that comparator 8 (0, 3) is redundant in this network because it does not swap any values during the complete test of the sorting network; it can therefore be removed from the comparator sequence without loss of functionality.
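The index arithmetic of the relative encoding can be summarised by a short sketch (ours, with the two cases from the example above); comparators whose computed inputs fall outside the range 0..N−1 are simply discarded:

def relative_comparator(c, r, d, n_inputs):
    """Derive a comparator from cell index c, relative offset r and width d.
    Returns None when the comparator would leave the 0..n_inputs-1 range."""
    w1 = c + r
    w2 = w1 + d
    if 0 <= w1 < n_inputs and 0 <= w2 < n_inputs:
        return (w1, w2)
    return None

print(relative_comparator(0, -1, 1, 4))   # None  (would be (-1, 0), invalid)
print(relative_comparator(0,  0, 1, 4))   # (0, 1)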
3 Evolutionary System Setup
A simple genetic algorithm was utilized for the evolutionary design of the cellular automaton that generates a target sorting network. Two sets of experiments will be presented regarding the development of this kind of circuit using the absolute and the relative encoding. In both sets of experiments the initial state of the CA is evolved together with its local transition function. The initial state is encoded in the chromosome as a finite sequence of integers. A rule of the enhanced local transition function consists of the next state and two integer values whose range and meaning differ for the absolute and the relative encoding (see Sections 2.1 and 2.2). The general structure of a chromosome is s0 s1 ... s(N−1) ns0 x0 y0 ns1 x1 y1 ... ns(|Q|^3−1) x(|Q|^3−1) y(|Q|^3−1), where si is the initial state of the i-th cell (i = 0, 1, ..., N − 1), nsj is the new state for the combination of states in the cellular neighborhood expressed by its index j = 0, 1, ..., |Q|^3 − 1 (|Q| is the number of possible cell states), and xj and yj are the additional information according to which the comparators are generated. The index (position in the genome) is specified implicitly by the value of the number representing the combination of states in the
Fig. 2. An example of development of a 4-input sorting network (4-cell CA is used)
cellular neighborhood. Therefore, if we consider the general form of a rule s1 s2 s3 → sn : x y, only the part on the right of the arrow is encoded in the genome. For example, if a cellular automaton with 2 different states and a cellular neighborhood consisting of 3 cells is to be evolved, there are 2^3 = 8 rules in the local transition function. Consider the rule 0 1 1 → 0 : 2 3. Since the combination of states 0 1 1 corresponds to the binary representation of the value 3, this rule will be placed in the chromosome at position 3 of the local transition function. In all the experiments, the population consists of 20 chromosomes which are initialized randomly (with respect to the correct range of each gene) at the beginning of evolution. The chromosomes are selected by means of the tournament operator with base 4. Only the mutation operator is utilized: in each chromosome selected by the tournament operator, 5 genes are chosen randomly and each of them is mutated with probability 0.95. The fitness is calculated as the number of correct output bits of the sorting network over all binary input test vectors. For example, there are 2^4 = 16 test vectors in the case of a 4-input SN; therefore, the fitness value of a perfect solution is Fmax = 4 · 2^4 = 64. If no solution is evolved in 100,000 generations the evolutionary run is terminated.
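To illustrate the implicit rule indexing and the fitness measure just described, here is a small Python sketch (ours, not from the paper); the example network and parameters serve only as a demonstration:

from itertools import product

def rule_index(neighbourhood, num_states):
    """Genome position of a rule: the neighbourhood read as a base-|Q| number."""
    idx = 0
    for s in neighbourhood:
        idx = idx * num_states + s
    return idx

def apply_network(network, values):
    """Apply a sequence of compare-swap operations (i, j) with i < j."""
    v = list(values)
    for i, j in network:
        if v[i] > v[j]:
            v[i], v[j] = v[j], v[i]
    return v

def fitness(network, n_inputs):
    """Correct output bits over all 2^n binary test vectors (perfect: n * 2^n)."""
    score = 0
    for bits in product((0, 1), repeat=n_inputs):
        out, ref = apply_network(network, bits), sorted(bits)
        score += sum(o == r for o, r in zip(out, ref))
    return score

print(rule_index((0, 1, 1), 2))                  # 3, as in the binary example above
print(fitness([(0, 1), (1, 2), (0, 1)], 3))      # 24 = 3 * 2**3, a perfect 3-input SN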
4 Experimental Results and Discussion
The experiments were focused on the evolutionary development of 16-input sorting networks by means of one-dimensional uniform cellular automata; 16-input networks were chosen as a benchmark problem for the proposed developmental encodings. In general, sorting networks exhibit a specific structure in which a comparator represents a basic building block. The comparator approach to the design of sorting networks actually represents a higher level of abstraction of this kind of circuit in comparison with the basic gate-level representation. A specific feature of a comparator is that the function of an SN is not broken if an arbitrary valid comparator is appended to the existing comparator sequence, which may in fact simplify the design process. However, an unsuitable arrangement of the comparators may lead to sorting networks that are inefficient in both area and delay. These properties are therefore investigated with respect to different setups of the developmental system. Note that we do not deal with the optimization of the sorting networks during the evolutionary process at this stage of research.
4.1 Results from the Absolute Encoding
The absolute encoding may be considered the simplest representation of the comparators inside the developmental process of a cellular automaton. The crucial part of the design process is that the evolution searches for an enhanced local transition function of the CA containing a suitable set of comparators that are encoded directly (by the indices of their inputs).
Table 1. Statistical results of the evolutionary process using the absolute encoding

                Success rate in %                 Average number of generations
                CA steps                          CA steps
  states      5     6     7     8     9         5        6        7        8        9
  3           -     -     -     1    33         -        -        -      68,9k    58,6k
  4           -    45    95    98   100         -      58,5k    23,1k    10,5k     5,4k
  5          16    98   100   100   100       49,0k     8,7k     3,3k     2,9k     1,3k
  6          65   100   100   100   100       32,8k     5,5k     1,7k     0,8k     0,8k
The experimental setup of the developmental system includes the setting of the number of cell states and the number of steps of the CA. These values are specified at the beginning of the evolutionary process. Table 1 shows the success rate and the average number of generations for the experiments using the absolute encoding. A hundred independent experiments were performed for each combination of the number of states and the number of steps of the CA. Evidently, it is easy to evolve a fully functional solution if the number of states is greater than 4 and the number of steps is greater than 6. As there are many possible combinations of inputs for a comparator in a 16-input sorting network, three CA states turned out to be the minimum needed to evolve a working solution; for two states, no CA was found within 100,000 generations. If the number of states is sufficient, there is a higher probability that a working solution is found in the given number of generations even for 5 steps (see Table 1). Moreover, the number of generations needed to evolve a working solution is substantially lower in those cases, although the search space is larger due to the number of CA states. This indicates that there are many correct solutions in such a search space, which is probably heavily influenced by the specific structure of sorting networks. On the other hand, more developmental steps are needed if the number of states is low (see Table 1). Table 2 contains the average properties (the number of comparators and the delay) of the resulting sorting networks for different experimental setups. These results show (as one would usually expect) that if more steps are performed to create a working SN, then the SN exhibits worse parameters (more comparators are needed and the delay is higher). However, if the results are compared for a given number of steps, the sorting networks developed with a higher number of states exhibit slightly better properties in most cases (especially from the point of view of the delay).

Table 2. Properties of the resulting SNs obtained from the absolute encoding
                Average number of comparators          Average delay
                CA steps                               CA steps
  states      5      6      7      8      9          5      6      7      8      9
  3           -      -      -    93,0   98,8         -      -      -    28,9   30,8
  4           -    85,4   90,0   92,4   95,7         -    28,0   30,1   26,4   31,8
  5         78,4   85,5   90,1   92,1   94,8       23,1   27,3   26,9   28,6   30,9
  6         78,0   86,1   90,0   92,5   94,3       22,2   29,3   28,6   29,3   30,6
This may be caused by the larger number of different comparators that can be generated in one step, thanks to the higher number of different combinations of states in the cellular neighborhood.
4.2 Results from the Relative Encoding
The goal of this set of experiments is to investigate the development of sorting networks that involves the positional information of cells inside the CA to determine the inputs of the comparators being generated. Since the enhanced transition function of the CA includes the information for calculating the indices of the comparator inputs (i.e. the relative position value and the comparator width), the experimental setup includes, besides the number of states and the number of steps of the CA, the limit value R of the relative position, from which the maximal comparator width is also calculated. These values are specified at the beginning of the evolutionary process.

Table 3 summarizes the success rate for the evolutionary experiments utilizing the relative encoding. It can be observed that, for a given value of R, the dependence of the success rate on the increasing number of states and developmental steps is very similar to the results obtained with the absolute encoding. Interestingly, the success rate decreases with increasing maximal comparator width if the number of states is small (2 and 3 states); this dependency is inverted for 4 states. Although no correct solution was evolved in 100,000 generations for fewer than 6 steps, the evolution succeeded using only 2 states of the CA. This result shows that generating comparators relative to the cell position has a significant influence on the ease of building a sorting network. However, the increasing number of states does not reduce the computational effort (expressed as the average number of generations needed to find a working solution) in comparison with the absolute encoding – see Table 4.

Although the average number of comparators of the resulting SNs decreases with increasing R for a given number of states and developmental steps (see Table 5), it is difficult to observe a significant dependence of the average delay on varying R (Table 6). However, it is possible to say that the properties of the resulting networks are better for the relative encoding (in comparison with the absolute encoding), especially for a higher number of developmental steps.

Table 3. Success rate in % for the evolution using the relative encoding
                             CA steps / relative limit R
             6                 7                 8                 9
  states   2    3    4       2    3    4       2    3    4       2    3    4
  2        -    -    -      41   30   30      99   98   90     100  100  100
  3        -    -    3      45   43   33     100   99   98     100  100  100
  4        1   10   15      76   92   81      97  100   99     100  100  100
Table 4. Average number of generations of the evolutionary process using the relative encoding. The values are measured in thousands of generations
                             CA steps / relative limit R
             6                    7                    8                    9
  states   2     3     4        2     3     4        2     3     4        2     3     4
  2        -     -     -      26,5  26,5  21,3     3,21  4,94  7,13     0,19  0,35  0,57
  3        -     -   58,4     22,2  26,8  26,9     5,94  4,84  7,51     0,48  0,56  1,37
  4      39,1  36,2  42,2     21,0  21,1  27,2     5,12  5,35  8,86     1,23  1,67  3,89
Table 5. Average number of comparators of SNs developed using the relative encoding
                             CA steps / relative limit R
             6                    7                    8                    9
  states   2     3     4        2     3     4        2     3     4        2     3     4
  2        -     -     -      91,2  89,9  88,5     97,5  94,4  93,9     98,0  96,9  97,7
  3        -     -   83,7     91,8  89,5  88,3     95,6  93,1  92,7     98,6  96,9  95,5
  4      83,0  84,1  83,1     90,3  88,3  87,1     93,5  91,7  90,7     96,1  94,5  92,6
Table 6. Average delay of SNs developed using the relative encoding
                             CA steps / relative limit R
             6                    7                    8                    9
  states   2     3     4        2     3     4        2     3     4        2     3     4
  2        -     -     -      20,9  20,3  20,0     22,7  24,6  24,7     27,8  26,9  28,8
  3        -     -   30,0     27,1  28,8  27,8     29,2  28,9  29,8     30,4  30,1  29,9
  4      24,0  26,0  27,8     28,6  28,8  28,0     30,0  29,5  29,6     31,0  31,3  30,4
5 Conclusions
In this paper a developmental method based on a uniform 1D cellular automaton was presented for the design of sorting networks. Two different encodings of the sorting networks in the developmental process of the CA were proposed: (1) an absolute encoding and (2) a relative encoding. The goal was to investigate the influence of utilizing relative positional information on the evolutionary design process and on the properties of the resulting sorting networks. The results showed that the number of states and the number of steps of the CA have a significant influence on the ability of the CA to successfully develop a working sorting network. The relative encoding was shown to be more suitable for the development of SNs using a lower number of states. Moreover, the sorting networks designed by means of this encoding exhibit better properties on average in comparison with the absolute encoding. Evidently, the resulting networks are neither area-efficient nor delay-efficient: the currently best-known 16-input SN consists of 60 comparators and works with a delay of 10, whereas the best result obtained from the absolute encoding contains 75 comparators and its delay is 16. The relative encoding produced the best SN with
92 comparators and a delay of 14. A significant difference of the proposed approach is that we have used a developmental encoding, whilst the best known result was obtained using a direct representation with an explicit area/delay optimization mechanism (see, e.g., [3]). Another example is the generic developmental approach proposed in [16], by means of which a 92-comparator network with a delay of 21 was created. The findings presented herein are interesting especially for future research on the application of cellular automata, in which we are going to focus on advanced encodings able to reduce the cost (number of comparators and delay) of the resulting networks during the developmental process. Moreover, the possibilities of designing regular structures will be investigated with these encodings, which may lead to research on generic design using cellular automata.
Acknowledgement This work was partially supported by the Grant Agency of the Czech Republic under contract No. GP103/10/1517 Natural Computing on Unconventional Platforms, No. GD102/09/H042 Mathematical and Engineering Approaches to Developing Reliable and Secure Concurrent and Distributed Computer Systems, the Grant Fund (GRAFO) of Brno University of Technology (BUT), the internal BUT research project No. FIT-S-10-1 and the Research Plan No. MSM 0021630528 Security-Oriented Research in Information Technology.
References

1. Bidlo, M., Vasicek, Z.: Gate-level evolutionary development using cellular automata. In: Proc. of The 3rd NASA/ESA Conference on Adaptive Hardware and Systems, AHS 2008, pp. 11–18. IEEE Computer Society, Los Alamitos (2008)
2. Bidlo, M., Škarvada, J.: Instruction-based development: From evolution to generic structures of digital circuits. International Journal of Knowledge-Based and Intelligent Engineering Systems 12(3), 221–236 (2008)
3. Choi, S.S., Moon, B.R.: A hybrid genetic search for the sorting network problem with evolving parallel layers. In: Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2001), pp. 258–265. Morgan Kaufmann, San Francisco (2001)
4. Choi, S.S., Moon, B.R.: Isomorphism, normalization, and a genetic algorithm for sorting network optimization. In: GECCO 2002: Proceedings of the Genetic and Evolutionary Computation Conference, pp. 327–334. Morgan Kaufmann Publishers Inc., San Francisco (2002)
5. Choi, S.S., Moon, B.R.: More effective genetic search for the sorting network problem. In: Proc. of the Genetic and Evolutionary Computation Conference, GECCO 2002, pp. 335–342. Morgan Kaufmann, New York (2002)
6. Gordon, T.G.W., Bentley, P.J.: Towards development in evolvable hardware. In: Proc. of the 2002 NASA/DoD Conference on Evolvable Hardware, pp. 241–250. IEEE Press, Washington D.C. (2002)
7. Haddow, P.C., Tufte, G.: Bridging the genotype–phenotype mapping for digital FPGAs. In: Proc. of the 3rd NASA/DoD Workshop on Evolvable Hardware, pp. 109–115. IEEE Computer Society, Los Alamitos (2001) 8. Hillis, W.D.: Co-evolving parasites improve simulated evolution as an optimization procedure. Physica D 42(1–3), 228–234 (1990) 9. Koza, J.R., Bennett, F.H., Andre, D., Keane, M.A.: Genetic Programming III: Darwinian Invention and Problem Solving. Morgan Kaufmann, San Francisco (1999) 10. Juill´e, H.: Evolution of non-deterministic incremental algorithms as a new approach for search in state spaces. In: Proc. of 6th Int. Conference on Genetic Algorithms, pp. 351–358. Morgan Kaufmann, San Francisco (1995) 11. Knuth, D.E.: The Art of Computer Programming: Sorting and Searching, 2nd edn. Addison-Wesley, Reading (1998) 12. Miller, J.F.: Evolving developmental programs for adaptation, morphogenesis and self-repair. In: Banzhaf, W., Ziegler, J., Christaller, T., Dittrich, P., Kim, J.T. (eds.) ECAL 2003. LNCS (LNAI), vol. 2801, pp. 256–265. Springer, Heidelberg (2003) 13. Miller, J.F., Thomson, P.: Cartesian genetic programming. In: Poli, R., Banzhaf, W., Langdon, W.B., Miller, J., Nordin, P., Fogarty, T.C. (eds.) EuroGP 2000. LNCS, vol. 1802, pp. 121–132. Springer, Heidelberg (2000) 14. Miller, J.F., Thomson, P.: A developmental method for growing graphs and circuits. In: Tyrrell, A.M., Haddow, P.C., Torresen, J. (eds.) ICES 2003. LNCS, vol. 2606, pp. 93–104. Springer, Heidelberg (2003) 15. von Neumann, J.: The Theory of Self-Reproducing Automata. In: Burks, A.W. (ed.). University of Illinois Press, US (1966) 16. Sekanina, L., Bidlo, M.: Evolutionary design of arbitrarily large sorting networks using development. Genetic Programming and Evolvable Machines 6(3), 319–347 (2005) 17. Sipper, M.: Evolution of Parallel Cellular Machines. LNCS, vol. 1194. Springer, Heidelberg (1997) 18. Tufte, G., Haddow, P.C.: Towards development on a silicon-based cellular computing machine. Natural Computing 4(4), 387–416 (2005) 19. Wolfram, S.: A New Kind of Science. Wolfram Media, Champaign IL (2002)
Markerless Articulated Human Body Tracking from Multi-view Video with GPU-PSO

Luca Mussi¹, Spela Ivekovic¹,², and Stefano Cagnoni¹

¹ Dept. of Information Engineering, University of Parma, Italy
² Lessells Scholar, Royal Society of Edinburgh, Scotland
Abstract. In this paper, we describe the GPU implementation of a markerless full-body articulated human motion tracking system from multi-view video sequences acquired in a studio environment. The tracking is formulated as a multidimensional nonlinear optimisation problem solved using particle swarm optimisation (PSO). We model the human body pose with a skeleton-driven subdivision-surface human body model. The optimisation looks for the best match between the silhouettes generated by the projection of the model in a candidate pose and the silhouettes extracted from the original video sequence. In formulating the solution, we exploit the inherent parallel nature of PSO to formulate a GPU-PSO, implemented within the nVIDIA CUDA architecture. Results demonstrate that the GPU-PSO implementation recovers the articulated body pose from 10-viewpoint video sequences with significant computational savings when compared to the sequential implementation, thereby increasing the practical potential of our markerless pose estimation approach.
1 Introduction

Articulated human body pose estimation is an active research area with solutions applicable in many domains, including virtual character animation, biometrics, human-computer interaction, gait analysis, video surveillance, and others. While most industrial solutions still tend to rely on marker-based systems, such as Vicon [23], markerless video-based estimation is progressing rapidly [16]. The attraction of markerless pose estimation lies in the reduced preparation time for each capture session as well as the non-invasive nature of the procedure. In markerless capture, the use of tight body suits and magnetic or optical markers is not necessary; instead, the subjects can normally take part in their everyday clothing. Replacing marker-based systems with markerless solutions, such as the one described in this paper, opens the possibility of using motion capture in areas such as medical analysis and home entertainment, where the use of tight body suits and markers is not acceptable. Additionally, the increasing availability and affordability of video cameras makes markerless motion capture an ever more attractive alternative. Modelling the articulated structure of the full human body for the purpose of pose estimation requires a large number of parameters, typically at least 30. The articulated pose estimation problem is therefore usually formulated as a search in a high-dimensional parameter space, which is invariably computationally very complex. In this paper, we address the issue of complexity by exploring the parallel nature of the
Fig. 1. Example pose results shown as skeletons overlaid on the corresponding input image. The examples shown are taken from different sequences (Jon Walk, Tony Kick, Tony Punch and Tony Stance) and different camera views (10 views were used for each sequence), hence the difference in person size as well as orientation.
markerless pose estimation problem at hand and searching the corresponding large parameter space using PSO. We exploit the fact that the PSO solution naturally lends itself to a parallel implementation on the state-of-the-art CUDA architecture, as well as the fact that the multi-view pose estimation, based on silhouette comparison, itself contains a degree of parallelism that can be exploited to design a more efficient solution. This paper is organised as follows. We begin with an overview of the related work in Section 2. In Section 3 we outline the CUDA architecture and present the PSO algorithm developed for it. Our pose estimation algorithm is presented in Section 4. Finally, we report experimental results in Section 5 and conclude with Section 6.
2 Related Work

In this section, we review the related work relevant to our approach. We begin with the related research in articulated human body pose estimation and then review the basics of PSO and relevant research in the area of PSO parallelisation.

2.1 Articulated Human Body Pose Estimation from Video

Articulated 3-D human body pose estimation from video is an active research area [13,19]. The complexity of the human body pose parametrisation has invariably required the pose estimation to be formulated as a high-dimensional space search problem and research has focused on reducing the complexity of the search. Various implementations of particle filters quickly gained popularity [1,4]. Partitioning the search space into smaller, more manageable subspaces is also a popular approach [1,12]. Furthermore, given the complexity of the articulated human body motion, the standard motion models used in the tracking literature did not suffice and attempts were made to learn motion models for particular actions from training data collected in advance [2,21].
The above-mentioned approaches suffer from various setbacks. The particle-filtering solutions critically rely on a high number of particles to adequately represent the posterior distribution, which in turn increases their computational complexity beyond practical use when considering a wide variety of motion. The tendency to rely on pre-trained motion models causes the human body tracking approaches to lose their generalisation abilities. In order to address that, researchers started turning to methods which could reliably provide motion estimates without a pre-trained motion model [5]. In this paper, we explore a similar direction. We use a powerful search algorithm which is capable of recovering the pose without any prior knowledge of the nature of motion. The main advantage of such an approach is in its generality, as it can estimate any kind of body motion when provided with a sufficient number of constraints, in our case image silhouettes. The downside is that it requires a lot of computation time. Previous attempts at reducing the computational complexity have focused on algorithmic improvements [7,8,9]. However, as we show in this paper, exploiting the parallel nature of both the search algorithm and the multi-view pose estimation problem by implementing the approach on a graphical processing unit (GPU) provides a natural alternative solution which is significantly more efficient while being equally general.

2.2 Particle Swarm Optimisation

Particle Swarm Optimisation (PSO) [10] is a powerful optimisation algorithm which searches the optimum of a fitness function following rules inspired by the behaviour of flocks of birds looking for food. As a population-based meta-heuristic, PSO has recently gained popularity due to its robustness, effectiveness, and simplicity. A particle's position and velocity within the domain of the fitness function at time t can be computed using the following equations:

    V(t) = w V(t−1) + C1 R1 [X best(t−1) − X(t−1)] + C2 R2 [X gbest(t−1) − X(t−1)]     (1)
    X(t) = X(t−1) + V(t)                                                              (2)
where V is the velocity of the particle, C1 , C2 are positive constants, R1 , R2 are random numbers uniformly drawn between 0 and 1, w is the so-called ‘inertia weight’, X(t) is the position of the particle at time t, X best (t − 1) and X gbest (t − 1) are, respectively, the best-fitness position reached by the particle and the best-fitness point ever found by the whole swarm up to time t − 1. Many variants of the basic algorithm have been developed [18], some of which define different topologies for particles’ neighbourhoods. A usual variant of PSO substitutes X gbest (t − 1) with X lbest (t − 1), the best position ever found within a pre-set neighbourhood of the particle under consideration. This formulation admits, in turn, several variants, depending on the neighbourhood topology. Another factor that affects the performance of PSO is the order by which X gbest / X lbest are updated. In ‘synchronous’ PSO, during each iteration, positions and velocities of all particles are updated one after another in turn, after which each particle’s fitness is evaluated. Finally, when the fitness of all particles is known, the value
of X gbest / X lbest is updated. The 'asynchronous' version of PSO, instead, updates X gbest / X lbest immediately after evaluation of each particle's fitness, leading to a more 'reactive' swarm which is attracted more promptly by newly-found optima. Despite good convergence properties, PSO is still an iterative process which may require millions of particle updates and fitness evaluations. This makes the design of efficient PSO implementations a problem of great practical relevance, especially for real-time applications to dynamic environments. This is the case, for example, of computer vision applications in which PSO has been used to determine the location and orientation of objects [14,15] or the posture of people [8]. PSO parallelisation has therefore become a popular subject for research. Before GPU-based programming environments were available, PSO was implemented following more traditional parallel computing paradigms, as in [6,20]. Some of the implementations were hybridised with evolutionary algorithm paradigms, such as the so-called 'island model', obtaining a coarse-grained parallelisation [3,24]. Conversely, research on fine-grained parallel PSO algorithms has mainly focused on the swarm topology. One of the first GPU-based PSO implementations followed a fine-grained approach [11] which, however, still relied on 'hand-coded' texture-rendering mapping and not on any GPU-specific programming environment. An overview of published work according to granularity analysis can be found in [24]. An interesting classification of parallel PSO algorithms, based on the best position update strategy, is reported in [27]. The most recent implementations are GPU-based [22,25,28], mostly developed within the CUDA environment, like the parallel PSO algorithm which we have developed and used in this work. Comparisons on the same benchmarks (not yet published) suggest that our approach outperforms these in terms of computation efficiency.
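For reference, Equations (1)–(2) translate directly into a few lines of sequential code. The following Python sketch (ours; the parameter values and the toy set-up are placeholders) performs one synchronous update of a global-best swarm:

import random

def pso_step(positions, velocities, pbest, gbest, w=0.7, c1=2.0, c2=2.0):
    """One synchronous update of all particles following Eqs. (1)-(2)."""
    for p in range(len(positions)):
        for d in range(len(positions[p])):
            r1, r2 = random.random(), random.random()
            velocities[p][d] = (w * velocities[p][d]
                                + c1 * r1 * (pbest[p][d] - positions[p][d])
                                + c2 * r2 * (gbest[d] - positions[p][d]))
            positions[p][d] += velocities[p][d]

# Toy usage with 3 particles in 2-D (values are placeholders):
pos = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(3)]
vel = [[0.0, 0.0] for _ in range(3)]
pso_step(pos, vel, pbest=[list(p) for p in pos], gbest=list(pos[0]))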
3 Parallel PSO Implementation within the CUDA Architecture
CUDA (Compute Unified Device Architecture) is a parallel computing environment by nVIDIA which exploits the massively parallel computation capabilities of its most recent GPUs. CUDA's programming model requires that the problem under consideration be partitioned into many independent sub-tasks (thread blocks) which are solved in parallel by a number of cooperating threads. In the CUDA abstraction a programmer can define a two-dimensional grid of thread blocks; each block is associated with a unique pair of indices that identifies it within the grid. Within each block, as well, the threads that compose it can be organised as a two- or three-dimensional grid within which they are identified by a unique set of indices. This mechanism allows each thread to personalise its access to data structures and to decompose problems effectively. From a hardware viewpoint, a CUDA-compatible GPU is made up of a scalable array of multithreaded Streaming Multiprocessors (SMs), each of which is able to execute several thread blocks at the same time. Each SM embeds eight scalar processing cores and is equipped with a number of fast 32-bit registers, a parallel data cache shared among all cores, a read-only constant cache and a read-only texture cache accessed via a texture unit that provides several different addressing/filtering modes. In addition, SMs can access local and global memory spaces, which are (non-cached) read/write regions of device memory.
Listing 1.1. Synchronous PSO pseudo-code
<Set initial personal/global bests>
for ( i = 0; i < generationsNumber; i++ ) {
    <Update particles' positions and velocities>
    <Evaluate each particle's fitness>
    <Update personal/global bests>
}
These memories are characterised by latency times about two orders of magnitude larger than the registers and the texture cache. Only threads belonging to the same thread block can share data in fast memory; different thread blocks may only share data allocated in slow memory. CUDA's scheduler allocates as many thread blocks at the same time as possible, compatibly with the available resources, which permits a CUDA program to be run on any number of SMs. SMs can manage hundreds of threads running different code segments thanks to an architecture called SIMT (Single Instruction, Multiple Thread), which creates, manages, schedules, and executes groups (warps) of 32 parallel threads. Unlike in a SIMD (Single Instruction, Multiple Data) architecture, the execution and branching behaviour of each single thread is specified. This way it is possible to manage parallel code for independent scalar threads as well as code for parallel data processing, which is executed by coordinated threads¹,².

3.1 Parallelising PSO Using CUDA
The structure of PSO is very close to being intrinsically parallel. In PSO, the only dependence between the processes which update the particles' velocities and positions is related to the information which must be shared among the particles. This information is either only X gbest or the corresponding vector X lbest of the best positions found by any member of each particle's neighbourhood. The most natural way to remove the dependence between particles' updates would consist of implementing synchronous PSO, updating X gbest or X lbest only at the end of each iteration. While this would permit the use of a single thread block (with one thread per particle) to implement a swarm, while avoiding accesses to global memory, it would impose limitations on the implementation of the fitness function and use computing resources inefficiently. To better exploit the capabilities offered by CUDA in developing a parallel PSO algorithm, we considered the main stages of the algorithm as separate tasks, which can be parallelised differently. Listing 1.1 shows the pseudo-code of a synchronous PSO algorithm, regardless of implementation. In our case the three stages of the main loop are implemented as different kernels sequentially scheduled by the GPU. This does not affect execution time since kernel scheduling is very efficient. However, it imposes that each kernel must load all the data it needs initially and store it back at the end of every execution, since in CUDA data can be shared among kernels only through the (slow) global memory.
¹ nVIDIA CUDA C Programming – Best Practices Guide, v. 2.3, nVIDIA Corporation, May 2010.
² nVIDIA CUDA Programming Guide, v. 2.3, nVIDIA Corporation, May 2010.
Despite this, having limited the number of such accesses and organised data in order to exploit the GPU coalescing capability, the multi-kernel approach turned out to be more efficient. The first kernel (PositionUpdateKernel) updates the particles' positions, scheduling a number of thread blocks equal to the number of particles; each block updates the position of one particle, running a number of threads equal to the problem dimension D. The second kernel (FitnessKernel) is used to compute the fitness. Depending on the fitness function structure, i.e. its parallel nature, more than one kernel can be used at this stage to maximise the use of GPU resources. The last kernel (BestUpdateKernel) updates X gbest and X lbest. Since its structure must reflect the swarm topology, the number of thread blocks to be scheduled may vary from one per swarm, in the case of the global-best topology, to many per swarm (to have one thread per particle), in the case of the ring topology. Pseudo-random numbers are generated directly on the GPU using the Mersenne Twister kernel available in the CUDA SDK. Based on the available amount of device memory, we run this kernel every given number of PSO iterations. The pseudo-random numbers are stored in a dedicated array which can be accessed by the other kernels.
4 Pose Estimation Algorithm

In this section we provide a detailed description of the articulated pose estimation problem and its building blocks. We describe the articulated human body model which we use to represent the candidate body poses, formulate the pose estimation as a PSO search and define the cost function used to evaluate the quality of a candidate pose.

4.1 Body Model

To represent the candidate body pose, we use a 3-D layered subdivision-surface body model consisting of two layers, the skeleton and the skin. The skeleton layer is defined as a set of homogeneous 4×4 transformation matrices Ti which encode the information about the position and orientation of every joint with respect to its parent joint in the kinematic tree hierarchy:

    Skeleton = {T0, T1, T2, ..., T20},                                                (3)
where Ti, i = 0...20, is a homogeneous transformation matrix encoding the orientation of the coordinate system of joint i with respect to the coordinate system of the preceding joint, as specified by the kinematic tree shown in Figure 2. The skin layer, which represents the second layer in the model, is connected to the skeleton through the joints' local coordinate systems. Each joint controls a certain area of the skin. Whenever a joint or limb moves, the corresponding part of the skin moves and deforms with it. As the skin is a subdivision surface, only the base mesh has to be specified in the corresponding joint's coordinate system. After the joint's configuration has been specified, the base mesh is subdivided by repeatedly applying the Catmull-Clark subdivision operator until the desired smooth shape of the body is obtained [26].
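To illustrate how a chain of matrices Ti places a joint in the world frame, here is a minimal sketch (ours; rotations about z only, with made-up angles and offsets) composing homogeneous transforms along a kinematic chain:

import math

def rot_z(angle, tx=0.0, ty=0.0, tz=0.0):
    """Homogeneous 4x4 transform: rotation about z plus a translation."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0, tx],
            [s,  c, 0, ty],
            [0,  0, 1, tz],
            [0,  0, 0, 1]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

# The world pose of a joint is the product of the transforms along the kinematic
# chain from the root: T_world = T_0 * T_1 * ... * T_i (angles/offsets are made up).
chain = [rot_z(0.1, ty=0.9), rot_z(-0.3, ty=0.4), rot_z(0.5, ty=0.4)]
T = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
for Ti in chain:
    T = matmul(T, Ti)
print([round(T[i][3], 3) for i in range(3)])   # world position of the last joint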
Fig. 2. Catmull-Clark subdivision surface body model and the corresponding skeletal hierarchy. In the full hierarchy, every joint has 3 rotational and 3 translational degrees of freedom (DOF). For the purpose of our work, we choose a subset of rotational DOF, detailed in Table 1. We also fix the limb lengths and only optimise the global position of the body in space.
4.2 PSO Parametrisation of the Articulated Pose

In PSO, each particle represents a potential solution in the search space. Our search space is the space of all plausible skeleton configurations. The individual particle's position vector in the search space is specified as follows:

    Xi = (rx, ry, rz, α0x, β0y, γ0z, α1x, β1y, γ1z, ..., γMz),                        (4)
where i denotes the index of the particle in the swarm, rx, ry, rz denote the position of the root joint with respect to the reference (world) coordinate system, and αjx, βjy, γjz refer to the rotational degrees of freedom of joint j around the x, y, and z axis, respectively. The total number of joints (the root joint has both translational and rotational degrees of freedom) is M + 1. As not all joints that are used to display the body need to be optimised, the joints and their respective degrees of freedom actually used in our pose estimation algorithm are given in Table 1.

4.3 Search Hierarchy

Searching for the correct articulated pose configuration in a 32-dimensional search space is expensive. Fortunately, the hierarchy in the kinematic structure of the human body allows for the search to be formulated as a sequence of steps in which only a subset of the 32 parameters is optimised at any one time. The hierarchy has the form of a kinematic tree and is illustrated in Figure 2. We formulate the search algorithm as 11 disjoint steps (equivalent to splitting the 32-dimensional search space into 11 disjoint subspaces), detailed in Table 2, where the solution of each step constrains the search space for the steps which follow. The individual steps are chosen so that only one limb segment at a time is optimised.
Table 1. Joints used to describe the configuration of the human body pose and their respective degrees of freedom used in the pose estimation algorithm. There are 32 DOF in total. The numbers in parentheses refer to the transformations in Figure 2.

  JOINT (index)                    # DOF                 JOINT (index)                    # DOF
  Global body position (0)         3  rx, ry, rz         Right elbow orientation (11)     1  γ11z
  Torso orientation (1)            3  α1x, β1y, γ1z      Root left hip orientation (13)   2  α13x, γ13z
  Head orientation (2)             2  α2x, γ2z           Left hip orientation (14)        3  α14x, β14y, γ14z
  Left clavicle orientation (5)    2  α5x, γ5z           Left knee orientation (15)       1  γ15z
  Left shoulder orientation (6)    3  α6x, β6y, γ6z      Root right hip orientation (17)  2  α17x, γ17z
  Left elbow orientation (7)       1  γ7z                Right hip orientation (18)       3  α18x, β18y, γ18z
  Right clavicle orientation (9)   2  α9x, γ9z           Right knee orientation (19)      1  γ19z
  Right shoulder orientation (10)  3  α10x, β10y, γ10z   TOTAL                            32
Table 2. The 11 steps of the hierarchical optimisation. Joint indices are the same as in Figure 2.

  (Step 1)  Global body position:  3 DOF: rx, ry, rz
  (Step 2)  Torso:                 3 DOF: α1x, β1y, γ1z
  (Step 3)  Head:                  2 DOF: α2x, γ2z
  (Step 4)  Left upper arm:        4 DOF: α5x, γ5z, α6x, γ6z
  (Step 5)  Right upper arm:       4 DOF: α9x, γ9z, α10x, γ10z
  (Step 6)  Left lower arm:        2 DOF: β6y, γ7z
  (Step 7)  Right lower arm:       2 DOF: β10y, γ11z
  (Step 8)  Left upper leg:        4 DOF: α13x, γ13z, α14x, γ14z
  (Step 9)  Right upper leg:       4 DOF: α17x, γ17z, α18x, γ18z
  (Step 10) Left lower leg:        2 DOF: β14y, γ15z
  (Step 11) Right lower leg:       2 DOF: β18y, γ19z
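Following the decomposition in Table 2, the hierarchical search can be sketched as follows (our illustration in Python; the step lists are abridged, and the stand-in optimiser and toy fitness merely take the place of the PSO of Section 3 and the silhouette cost of Section 4.4):

import random

STEPS = [
    [0, 1, 2],     # Step 1: global body position (rx, ry, rz)
    [3, 4, 5],     # Step 2: torso orientation
    # ... Steps 3-11 with the parameter indices listed in Table 2
]

def optimise_subspace(pose, indices, fitness, optimiser):
    """Optimise only the selected DOF; write the best values back into the pose."""
    def sub_fitness(sub):
        candidate = list(pose)
        for k, v in zip(indices, sub):
            candidate[k] = v
        return fitness(candidate)
    best = optimiser(sub_fitness, [pose[k] for k in indices])
    for k, v in zip(indices, best):
        pose[k] = v
    return pose

def stand_in_optimiser(f, start, trials=200, step=0.05):
    """Placeholder for the PSO of Section 3: random local search maximising f."""
    best, best_f = list(start), f(start)
    for _ in range(trials):
        cand = [v + random.uniform(-step, step) for v in best]
        if f(cand) > best_f:
            best, best_f = cand, f(cand)
    return best

# Toy fitness: prefer poses close to a made-up target configuration.
target = [0.5] * 32
toy_fitness = lambda pose: -sum((p - t) ** 2 for p, t in zip(pose, target))

pose = [0.0] * 32
for idx in STEPS:
    pose = optimise_subspace(pose, idx, toy_fitness, stand_in_optimiser)
print([round(p, 2) for p in pose[:6]])   # first six DOF after the sketched steps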
4.4 Fitness Function

The fitness function compares the silhouettes generated by the model in its candidate pose with the silhouettes extracted from the original images. The original images can be acquired from N different viewpoints. Each image is foreground-background segmented and binarised to obtain a silhouette. Let the images containing the original silhouettes be denoted as Iio, i = 1...N. Similarly, let Iim, i = 1...N, denote the images of the model silhouettes. The cost function can then be written as follows:

    E = Σ_{i=1}^{N} (1/Zi) Σ_{r=1}^{row} Σ_{c=1}^{col} (Iio & Iim),                   (5)
where row and col denote the image rows and columns, respectively, and & denotes the bitwise AND operation. Coefficients Zi are the normalisation constants obtained by counting the number of silhouette pixels in every original image.
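As an illustration of Eq. (5), a small Python sketch (ours; the toy silhouettes are made up) computing the normalised bitwise-AND overlap on binary masks:

def silhouette_fitness(original, model):
    """Eq. (5): normalised bitwise-AND overlap, summed over all N views.
    'original' and 'model' are lists of binary masks (lists of 0/1 rows)."""
    E = 0.0
    for I_o, I_m in zip(original, model):
        Z = sum(map(sum, I_o)) or 1          # silhouette pixels in the original view
        overlap = sum(o & m for ro, rm in zip(I_o, I_m) for o, m in zip(ro, rm))
        E += overlap / Z
    return E

# Two toy 3x3 views with made-up silhouettes:
orig  = [[[0,1,0],[1,1,1],[0,1,0]], [[1,1,0],[1,1,0],[0,0,0]]]
model = [[[0,1,0],[0,1,1],[0,1,0]], [[1,0,0],[1,1,0],[0,0,0]]]
print(silhouette_fitness(orig, model))   # per-view overlap fractions summed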
5 Experiments

Data. The set of 5 test sequences is a courtesy of the CSSVP, University of Surrey. The sequences were acquired in a dedicated multi-camera acquisition studio and consist of 10 synchronised videos with a resolution of 720 × 576, acquired at 25 fps.
Table 3. This table shows the consistency of the joint pose estimates for each of the 5 test sequences over 10 runs of the pose estimation algorithm. As the mean joint position estimate depends on the pose, which changes through the sequence, we only report, for each joint, the standard deviation (in cm) in the estimate of its 3-D position computed over the entire sequence. Joint numbers correspond to those shown in Figure 2.

  Joint   Jon Walk         Tony Kick        Tony Punch       Tony Stance      Tony Walk
          σx   σy   σz     σx   σy   σz     σx   σy   σz     σx   σy   σz     σx   σy   σz
  1       2.4  1.1  1.9    1.8  1.1  1.6    0.6  0.5  0.6    0.9  0.6  0.9    2.7  0.9  2.3
  2       1.1  1.1  1.0    1.6  0.9  1.2    0.5  0.4  0.8    0.7  0.6  0.7    1.5  0.9  1.1
  3       0.7  1.0  0.7    0.9  0.9  0.7    0.2  0.4  0.4    0.4  0.6  0.4    0.8  0.9  0.6
  4       0.2  1.0  0.4    0.3  0.8  0.4    0.1  0.5  0.2    0.2  0.6  0.3    0.2  0.9  0.2
  6       1.2  1.3  3.1    2.0  1.2  3.6    0.7  0.6  2.1    0.8  0.8  3.0    1.7  1.7  3.0
  7       1.8  4.1  1.8    1.3  1.0  3.0    0.5  0.6  1.0    0.6  0.9  1.0    0.9  2.3  2.0
  8       2.1  5.8  2.7    1.9  3.6  1.8    1.6  4.5  2.6    1.4  5.3  3.8    1.1  2.2  3.0
  10      1.2  1.1  3.1    1.8  1.1  3.2    0.9  0.6  1.6    0.9  0.8  2.2    1.6  1.7  1.5
  11      0.9  1.5  1.7    1.5  1.4  2.4    0.7  0.5  1.5    0.6  0.4  2.2    0.9  2.3  1.6
  12      0.9  1.3  1.6    1.9  1.9  1.4    1.6  1.3  1.1    1.1  1.5  0.6    0.9  2.1  2.3
  14      2.4  1.1  1.7    1.8  1.1  1.3    0.6  0.5  0.6    0.9  0.6  1.2    2.7  0.9  2.1
  15      0.9  1.3  0.7    0.5  1.1  0.6    0.3  0.5  0.3    0.4  0.6  0.6    1.6  1.3  2.0
  16      2.3  1.5  2.5    1.9  1.2  0.8    0.8  0.5  0.3    0.9  0.7  0.5    2.7  1.8  5.3
  18      2.4  1.1  1.9    1.8  1.2  1.7    0.6  0.5  0.6    0.9  0.6  0.9    2.7  0.9  2.4
  19      0.6  1.3  0.6    1.7  2.3  1.6    0.3  0.5  0.2    0.3  0.7  0.3    1.9  1.2  3.4
  20      2.3  1.4  1.4    2.7  3.1  1.7    0.4  0.5  0.3    0.7  0.7  0.5    3.4  2.7  5.3
Algorithm settings. The experiments we report in this paper were run with a swarm containing 10 particles. The PSO inertia parameter decreased over time as in [8,9], that is, according to w = 2.0/e^x, where x has the role of a counter whose starting value was set to x = 1.0 for the first frame and to x = 2.0 for all subsequent frames. Whenever a PSO iteration (one swarm move) did not produce an improved global best estimate, the inertia value was decreased by increasing the counter to x = x + 0.05. The optimisation terminated when the inertia value fell below 0.1. The constants C1 and C2 in Equation (2) were set to 2.0.
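The inertia schedule described above can be written compactly; the following sketch (ours) reproduces the published settings with a dummy swarm move:

import math

def run_frame(swarm_move, first_frame=False):
    """Inertia schedule: w = 2 / e**x, x grows by 0.05 after every non-improving
    iteration; the frame's optimisation stops when w falls below 0.1."""
    x = 1.0 if first_frame else 2.0
    iterations = 0
    while 2.0 / math.exp(x) >= 0.1:
        improved = swarm_move()      # one PSO iteration, True if global best improved
        if not improved:
            x += 0.05
        iterations += 1
    return iterations

# With no improvements at all, a non-first frame performs 20 swarm moves
# before the inertia drops below 0.1.
print(run_frame(lambda: False))      # 20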
GPU. The experiments were run with an nVIDIA Quadro FX 5800 with 4GB Gddr3 RAM on a PC powered by a 64-bit Intel(R) Core(TM) i7 CPU running at 2.67GHz. Human Body Model. The process of pose estimation, as presented in this paper, requires that the particle position vector be rendered as a human body model using Catmull-Clark subdivision and then projected onto the camera image plane(s) to generate candidate silhouettes. To perform the body model subdivision on the GPU, we have adapted the implementation by Patney et al. [17]. The projection of the body model onto camera planes is implemented in OpenGL and is the only operation that has been left to the CPU. As such, it represents a bottleneck of our algorithm, because it incurs a memory transfer between the CPU and GPU every time the body models are rendered to generate silhouettes for the fitness function. In order to minimise the number of transfers, we render all camera views for all particles into one large OpenGL buffer and perform only one transfer for every iteration of the PSO. As the OpenGL buffer size is limited to 8192 × 8192 pix, we can use only 10 particles before exceeding the available buffer size. Porting the camera projection code onto GPU would remove the problem of the CPU-GPU memory transfer and allow the use of larger swarms as the limit would not be imposed by the OpenGL buffer size, but instead by the amount of memory available on the graphics card.
Results. The presented CUDA-PSO-based pose estimation algorithm was developed from the hierarchical PSO reported in [8] for the problem of upper body pose estimation and using a subdivision surface body model. The same algorithm was later adapted to full body pose in [9]; however, it also replaced the subdivision model with a simpler cylinder model to enable a fair comparison of the search method with a competing particle filtering approach. The work was further extended in [7], where an adaptive fullbody hierarchical pose estimation (APSO) was reported which dynamically adjusted the search region size in every frame in an attempt to reduce the computation time. In [7], the APSO algorithm took on average 155 seconds for the Tony Kick and Tony Punch sequence, whereas the algorithm reported here requires only 6.9 seconds per frame for the same sequence. Similarly, in [7], Jon Walk required 176 seconds per frame while our algorithm takes only 7.4 seconds. Not only does the algorithm reported in this paper achieve a 20-fold faster execution time, but it does so with a more complex body model which includes the subdivision process and allows for much more flexibility in modelling the shape of the human body. Figure 1 shows examples of estimated poses for different camera views and different sequences. We performed a quantitative study of the pose estimation accuracy on a 50-frame long synthetic sequence of a kick, the results of which are reported in Figure 3. The plots show that the mean error with respect to the ground truth over 500 estimates are well below 5 cm for individual joints, and below 7 cm for the full pose which is comparable or better than the competing generative pose estimation methods which have been extensively tested in [9]. The main deviation from the ground truth is detected in the right ankle joint (joint 17) in Figure 3 left which is also the reason for the large spread of estimates in Figure 3 right between frames 20 and 25, when the ankle joint is not correctly estimated. In spite of occasional glitches, the optimisation seems to recover from bad estimates without difficulty. The results were obtained with 10 particles and we anticipate that a larger number of particles would further improve the performance; however, this would require the camera projection implementation on GPU and has been left as future work. As we do not have the ground truth available for the real sequences, we instead study the variability in the pose estimates over 10 runs of the algorithm. The results are shown in Table 3 and indicate that, just like in the synthetic sequence, the estimates are generally consistent with occasional imperfections which, however, do not cause algorithm divergence. Unlike the competing approaches, our method handles initialisation automatically. We start from a canonical “T-pose” and use a higher starting inertia value in the first frame of the sequence, which causes the particles to explore a larger region of the search space. In the subsequent frames, the temporal consistency of the human motion is exploited by initialising the search around the final estimate of the previous frame and using a lower starting inertia value to encourage the search around the previous estimate. The performance on the first frame is comparable to the performance on the rest of the sequence and in line with the ability of the algorithm to automatically recover from bad estimates. This ability is due to the global search nature of the PSO approach.
Fig. 3. Algorithm performance on the synthetic sequence. Left: distances (in cm) from the ground truth of each joint estimate in 50 frames over 10 runs. Right: distances from the ground truth of all joint estimates over 10 runs for each of the 50 frames. Means are represented by bullets.
6 Conclusions
In this paper, we described a parallel approach to articulated human body pose estimation from multi-view video sequences, based on the CUDA architecture. The results show that the execution time can be cut down noticeably by formulating the algorithm on the GPU, without sacrificing pose estimation accuracy, thereby exploiting the vast computational resources available on an ordinary desktop PC. The current implementation still combines the computational power of the CPU and the GPU; additional speedup is possible by deploying the complete algorithm on the GPU in order to avoid the communication bottleneck. This would also allow us to increase the size of the swarm, which is likely to lead to better performance. A further improvement is anticipated from exploiting the parallelism in the kinematic structure of the human body. Both improvements have been left as future work.
Acknowledgments
The authors would like to thank Prof. A. Hilton from the CVSSP, University of Surrey, for the test sequences, and Mr A. Patney from the University of California, Davis, for sharing his CUDA implementation of Catmull-Clark subdivision. S. Ivekovic would like to thank the RSE Lessells Scholarship for the financial support that enabled this work.
References
1. Bandouch, J., Engstler, F., Beetz, M.: Evaluation of hierarchical sampling strategies in 3D human pose estimation. In: Proc. British Machine Vision Conference (2008)
2. Caillette, F., Galata, A., Howard, T.: Real-time 3-D human body tracking using learnt models of behaviour. Computer Vision and Image Understanding 109(2), 112–125 (2008)
3. Chang, J.F., Chu, S.C., Roddick, J.F., Pan, J.S.: A parallel particle swarm optimization algorithm with communication strategies. J. Inf. Sci. Eng. 21(4), 809–818 (2005)
4. Deutscher, J., Reid, I.: Articulated body motion capture by stochastic search. International Journal of Computer Vision 61(2), 185–205 (2005)
5. Gall, J., Rosenhahn, B., Brox, T., Seidel, H.P.: Optimization and filtering for human motion capture. International Journal of Computer Vision 87(1–2), 75–92 (2010)
6. Gies, D., Rahmat-Samii, Y.: Reconfigurable array design using parallel particle swarm optimization. In: Intl. Symp. Antennas and Propagation Soc., vol. 1, pp. 177–180 (2003)
7. Ivekovic, S., John, V., Trucco, E.: Markerless multi-view articulated pose estimation using adaptive hierarchical particle swarm optimisation. In: Di Chio, C., Cagnoni, S., Cotta, C., Ebner, M., Ekárt, A., Esparcia-Alcázar, A.I., Goh, C.-K., Merelo, J.J., Neri, F., Preuß, M., Togelius, J., Yannakakis, G.N. (eds.) EvoApplications 2010. LNCS, vol. 6024, pp. 241–250. Springer, Heidelberg (2010)
8. Ivekovic, S., Trucco, E., Petillot, Y.: Human body pose estimation with particle swarm optimisation. Evolutionary Computation 16(4), 509–528 (2008)
9. John, V., Trucco, E., Ivekovic, S.: Markerless human articulated tracking using hierarchical particle swarm optimisation. Image and Vision Computing (in press, 2010)
10. Kennedy, J., Eberhart, R.: Particle swarm optimization. In: Proc. IEEE Int. Conf. on Neural Networks, vol. IV, pp. 1942–1948. IEEE CS Press, Los Alamitos (1995)
11. Li, J., Wang, X., He, R., Chi, Z.: An efficient fine-grained parallel genetic algorithm based on GPU-accelerated. In: IFIP Int. Conf. on Network and Parallel Computing Workshops, pp. 855–862 (2007)
12. MacCormick, J., Isard, M.: Partitioned sampling, articulated objects, and interface-quality hand tracking. In: Vernon, D. (ed.) ECCV 2000. LNCS, vol. 1843, pp. 3–19. Springer, Heidelberg (2000)
13. Moeslund, T., Hilton, A., Krüger, V.: A survey of advances in vision-based human motion capture and analysis. Computer Vision and Image Understanding 104(2–3), 90–126 (2006)
14. Mussi, L., Cagnoni, S.: Particle swarm for pattern matching in image analysis. In: Artificial Life and Evolutionary Computation, pp. 89–98. World Scientific, Singapore (2010)
15. Mussi, L., Daolio, F., Cagnoni, S.: GPU-based road sign detection using particle swarm optimization. In: IEEE Conf. Intelligent System Design and Applications, pp. 152–157 (2009)
16. Organic Motion (2010), http://www.organicmotion.com/
17. Patney, A., Ebeida, M.S., Owens, J.D.: Parallel view-dependent tessellation of Catmull-Clark subdivision surfaces. In: Proc. Conf. on High Performance Graphics, pp. 99–108 (2009)
18. Poli, R., Kennedy, J., Blackwell, T.: Particle swarm optimization: an overview. Swarm Intelligence 1(1), 33–57 (2007)
19. Poppe, R.: Vision-based human motion analysis: An overview. Computer Vision and Image Understanding 108(1–2), 4–18 (2007)
20. Schutte, J.F., Reinbolt, J.A., Fregly, B.J., Haftka, R.T., George, A.D.: Parallel global optimization with the particle swarm algorithm. J. Num. Methods in Eng. 61, 2296–2315 (2003)
21. Urtasun, R., Fleet, D.J., Hertzmann, A., Fua, P.: Priors for people tracking from small training sets. In: Proceedings of IEEE ICCV, pp. 403–410 (2005)
22. Veronese, L.d., Krohling, R.A.: Swarm’s flight: accelerating the particles using C-CUDA. In: IEEE Congress on Evolutionary Computation (CEC 2009), pp. 3264–3270 (2009)
23. Vicon Motion Capture Systems (2010), http://www.vicon.com/
24. Waintraub, M., Schirru, R., Pereira, C.: Multiprocessor modeling of parallel Particle Swarm Optimization applied to nuclear engineering problems. Progress in Nuclear Energy 51, 680–688 (2009)
25. Wang, W., Hong, Y., Kou, T.: Performance gains in parallel particle swarm optimization via nVIDIA GPU. In: Workshop on Computational Mathematics and Mechanics (2009)
26. Warren, J., Schaefer, S.: A factored approach to subdivision surfaces. Computer Graphics and Applications 24(3), 74–81 (2004)
27. Xue, S.D., Zeng, J.C.: Parallel asynchronous control strategy for target search with swarm robots. International Journal of Bio-Inspired Computation 1(3), 151–163 (2009)
28. Zhou, Y., Tan, Y.: GPU-based parallel particle swarm optimization. In: Proc. 2009 IEEE Congress on Evolutionary Computation (CEC 2009), pp. 1493–1500 (2009)
Evolving Object Detectors with a GPU Accelerated Vision System
Marc Ebner
Eberhard-Karls-Universität Tübingen, Wilhelm-Schickard-Institut für Informatik, Abt. Rechnerarchitektur, Sand 1, 72076 Tübingen, Germany
[email protected]
http://www.ra.cs.uni-tuebingen.de/mitarb/ebner/welcome.html
Abstract. Using GPU processing, it is now possible to develop an evolutionary vision system working at interactive frame rates. Our system uses motion as an important cue to evolve detectors which are able to detect an object when this cue is not available. Object detectors consist of a series of high level operators which are applied to the input image. A matrix of low level point operators is used to recombine the output of the high level operators. With this contribution, we investigate which image processing operators are most useful for object detection. It was found that the set of image processing operators could be considerably reduced without reducing recognition performance. Reducing the set of operators led to an increase in speedup compared to a standard CPU implementation.
1
Motivation
In the field of evolutionary computer vision, evolutionary algorithms are used to search for optimal or approximately optimal solutions for computer vision problems [2]. A programmer developing a computer vision algorithm needs to decide in what sequence well known image processing operators have to be arranged to obtain a desired result. In evolutionary computer vision, Genetic Programming [1,13] is used to arrange different image processing operators to obtain a particular output or to perform a given task such as object recognition. This approach is particularly interesting for problems for which the solution is not readily apparent to those skilled in the art. Unfortunately, most experiments in evolutionary computer vision require enormous computational resources because multiple algorithms have to be evaluated over several generations to find an appropriate solution. However, the graphics processing unit (GPU) of a PC can be used for speeding up image processing tasks [7,8]. The GPU is ideal for speeding up image processing as the same operation usually needs to be computed for each image pixel. We have created a GPU accelerated evolutionary image processing system which is able to learn how to detect a user-specified object in an image [3,4].
Fig. 1. Evolutionary object detection system
This is the first system of this type.¹ The system receives an image sequence as input. The user has to tell the system where this object is located using the mouse pointer. The user simply moves the mouse over the object to be detected and then follows the object while pressing the mouse button. The system maintains a population of image processing algorithms. For each new image, all algorithms are run on the input image. Each algorithm transforms the input image into another image, the output image. The largest pixel response of the output image is taken as the position of the detected object. The power of the GPU is required to evaluate multiple algorithms for each incoming image at interactive rates. Figure 1 provides an overview. The evolutionary algorithm uses a generational model. For each new image, a new generation of algorithms is created. Parents are re-evaluated on the new image. Both offspring and parents are considered when selecting the parents of the next generation. Fitness is computed as the Euclidean distance between the desired object position (the position of the mouse pointer) and the position detected by the algorithm. Over the course of several images, the population of algorithms adapts to the problem of detecting the desired object position. Humans are able to locate and detect objects in single images. Similarly, the object detection algorithms only use a single image as input and not multiple images. However, motion is an important cue. That is why this system uses image sequences as input. If the desired object moves around in the scene, then the background constantly changes while there is little change in the object. The evolutionary algorithm discounts all aspects that are irrelevant and retains only those necessary to detect the object successfully. If there is a large change in the object because
¹ A video of this system is available for download from http://www.ra.cs.uni-tuebingen.de/mitarb/ebner/research/publications/uniTu2/EvoCV.m4v
of a change in perspective, then the evolved detector may focus only on color or on color and texture if color is not sufficient by itself. We have recently extended this evolutionary vision system by removing the necessity of having the user interact with the system. Moving objects are automatically detected and used as teaching input [5]. Thus, motion is used as a cue but object detection can still be performed when this cue is not available as the object detectors only use single images. Each object detector consists of several image processing operators known from the computer vision literature [10,21]. The complete list of operators is given by Ebner [4]. The representation of the individuals will be described below. With this contribution, we investigate which of the available operators are actually used and how important they are, i.e. the fraction of their usage. In addition, we rigorously evaluate the speedup obtained compared to standard processing on the CPU.
2
Evolutionary Computer Vision
Evolutionary algorithms are especially useful in computer vision when it is not at all clear what an optimal algorithm should look like. Evolutionary algorithms can be used to find optimal parameters for an existing algorithm but can also be used to evolve an algorithm from scratch. Evolutionary computer vision started in the early 1990s when Lohmann used an Evolution Strategy to find an algorithm which computes the Euler number of an image [15]. Initially, research focused on the automatic generation of low-level operators, e.g. edge detectors [9] or feature detectors [19]. Katz and Thrift [12] applied evolutionary algorithms for target recognition. Using evolutionary algorithms, it is possible to evolve adaptive operators which are optimal or near optimal for a given task [6]. Poli [18] has shown that Genetic Programming is particularly useful for image processing. Johnson et al. [11] have used Genetic Programming to evolve visual routines. Current research focuses on the evolution of low-level detectors [22] and object recognition [14]. A taxonomic tutorial on the field of evolutionary computer vision is given by Cagnoni [2]. Most experiments in evolutionary computer vision are performed off-line because of the enormous computational resources required: each individual of the population has to be evaluated over several generations. In contrast to such off-line experiments, we try to build an adaptive, self-learning vision system which is always on. In our system, multiple alternative image processing algorithms are run on each new input image. This is done using GPU accelerated image processing.
3
GPU Accelerated Image Processing
Consumer graphics cards are specifically optimized to render images at high speeds. A three-dimensional scene consists of numerous triangles which are fed
to the graphics card. In order to obtain photo-realistic images, small programs can be sent to the graphics card to specify computations which should be carried out per vertex (vertex shaders) or per pixel (pixel shaders). The OpenGL shading language (OpenGLSL) [20] has been developed as a standard to program vertex and pixel shaders. This shading language as well as the computations which are carried out on the graphics card are highly optimized for rendering three-dimensional scenes consisting of thousands of triangles. We use this programming paradigm to perform image processing on the graphics card efficiently. Instead of using the OpenGLSL, we could also have used CUDA [17]. CUDA allows for more general image processing algorithms. However, it does not provide Mip Maps. Mip Maps are usually used for texturing polygons. In our case, Mip Maps allow for scale space processing [23]. To fully exploit the power of the GPU, we use exactly the same paradigm which is used when rendering images. Only a single polygon is rendered. This polygon represents the output image of the image processing algorithm. The image processing algorithm is fed to the pixel shader. This pixel shader is then used to compute the correct output color for each pixel. The original input image is supplied to the pixel shader as a texture, thus Mip Mapping is available. The pixel shader we use is actually a universal pixel shader which is able to interpret the genetic material of an individual of the population as an image processing algorithm. Each individual consists of an array of bytes and represents an image processing algorithm. For each input image, all of the individuals of the population are evaluated by sending the arrays of bytes one after the other to the GPU. The input image is loaded into the GPU as a texture only once. Thus, the speedup depends on the number of individuals to evaluate. The more individuals are evaluated, the higher the speedup. The GPU renders the input image on the screen and also shows the output of the three best individuals of the population. The evolutionary algorithm itself is run on the main CPU. The evaluation of the individuals and the display of the images take approximately 81% of the total time. Since the OpenGLSL is used to compute the output image, the code is highly portable and can be run on any graphics card as long as the graphics card supports vertex and pixel shaders, e.g. OpenGLSL 2.0 and up.
4
Representation
A special data structure is used for all image processing algorithms in order to fully exploit the power of the GPU. The representation is shown in Figure 2. This is a variant of the Cartesian Genetic Programming approach [16]. First, n1 high level operators are applied to the input image. Operators include convolution, edge detection, Laplacian or image segmentation. The high level operators can access the pixels of the original image using a mask or they can simply return a constant value. Some of the operators use parameters which specify the scale level at which the operator is to be applied. This is readily possible by using the texture processing operations of the OpenGLSL. The Mip Map mechanism
Fig. 2. The genotype of an individual is simply a byte array which is modified through simulated evolution. Each individual represents a computer vision algorithm. High level operators are located in column 0. Low level operators are located inside a nx × ny matrix. The low level operators are used to recombine the output of the high level operators.
allows us to read out the texture at any scale. Some also allow for a shift of the entire image. The additional parameters required for these operations are all part of the genetic representation. The output of these high level operators is then recombined using an nx × ny processing matrix. We call this the n1−nx×ny representation. It consists of 3(n1 + nx·ny) bytes. The mapping from genotype to phenotype is fully described in Ebner [4]. The operators used as well as the connectivity of the matrix are optimized by the evolutionary algorithm. The input is fed from left to right through this matrix. Arithmetic, threshold or channel selecting operations are used for this task. Gate functions which can be used to implement if-functions are also included. Each row of the right hand side of the matrix can be viewed as a sub-detector which is designed by evolution to extract the desired object. Thus, on the right hand side of the nx × ny processing matrix, the output of ny sub-detectors is available. The output of all sub-detectors is averaged to obtain the overall response to the input image. The object is said to be located at the position with the largest pixel value. If multiple pixels have the same maximum value, then the center of gravity is computed. This representation adheres to the image rendering paradigm as closely as possible in order to use the GPU as efficiently as possible. That is why image processing operators such as convolution and edge detection are applied first and then the output of these operators is recombined to detect the object. It would have also been possible to allow full image processing operators at every
position of the matrix. However, in order to do this, one would have to first read out the results computed by an operator from the graphics card, and then again send this result to the graphics card as a texture because texture access is read only. The current representation reduces the texture transfer between the CPU and the GPU to a minimum because only the input image is transferred.
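The byte-array layout just described can be sketched in C as follows. This is a minimal, hypothetical illustration: the Node and Individual fields and the meaning of the two parameter bytes are assumptions made here for clarity, while the actual genotype-to-phenotype mapping is the one described in Ebner [4].

```c
/* A minimal C sketch of the n1-nx x ny genotype layout described above:
 * three bytes per node (one operator id and two parameter/connection
 * bytes), giving 3(n1 + nx*ny) bytes in total. Field names and the
 * interpretation of the parameter bytes are assumptions for illustration. */
#include <stdint.h>

#define MAX_HIGH 16
#define MAX_DIM   8

typedef struct {
    uint8_t op;        /* which high level or point operator to apply  */
    uint8_t param[2];  /* scale/shift parameters or input connections  */
} Node;

typedef struct {
    int  n1, nx, ny;               /* high level column and matrix size */
    Node highlevel[MAX_HIGH];      /* up to n1 high level operators     */
    Node matrix[MAX_DIM][MAX_DIM]; /* up to nx x ny point operators     */
} Individual;

/* Decode a raw genotype (byte array), high level column first. */
void decode(const uint8_t *genome, Individual *ind)
{
    int g = 0;
    for (int i = 0; i < ind->n1; i++) {          /* high level operators */
        ind->highlevel[i].op       = genome[g++];
        ind->highlevel[i].param[0] = genome[g++];
        ind->highlevel[i].param[1] = genome[g++];
    }
    for (int y = 0; y < ind->ny; y++)            /* point operator matrix */
        for (int x = 0; x < ind->nx; x++) {
            ind->matrix[y][x].op       = genome[g++];
            ind->matrix[y][x].param[0] = genome[g++];
            ind->matrix[y][x].param[1] = genome[g++];
        }
}
```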
5
Experiments
The system was tested on two video sequences. The first sequence shows a radio controlled car, while the second sequence shows a toy train. These two sequences have also been used for previous experiments [5]. The radio controlled car is relatively easy to detect because of its distinct color. No other object in the sequence has the same purple color as shown on the car. The train was mostly yellow and red. However, another stationary object also had the same two colors, albeit in a different arrangement. Thus, the toy train is more difficult to detect. For each sequence, the moving object (the car and the toy train, respectively) was detected automatically. The center of this moving object was used to compute the fitness. Let pm be the position of the moving object and let pd be the object position as detected by the algorithm. The quality or fitness f of the detector is measured by computing the Euclidean distance between these two positions: f = |pm − pd|. A perfect individual would always respond with the exact same position as the moving object. It would have a fitness of zero. Evolution was turned on whenever the fitness increased beyond 25 pixels. Evolution was turned off whenever the fitness decreased below 10 pixels for five consecutive images. Figure 3(a/b) shows the probability that evolution was turned on again for both image sequences. For each of the representations, 10 runs with different seeds were conducted. Figure 3(c/d) shows the percentage of image frames that evolution was turned on for both image sequences. When evolution is turned on, offspring are generated using mutation and crossover. Starting with μ = 3 parent individuals, λo = 20 offspring are generated. An additional number of λr = 20 offspring are generated completely at random. The mutation operator is used to generate 50% of the λo offspring. The mutation operator either increases or decreases one of the parameters by one or applies a per-bit mutation with a probability of pmut = 2/l, where l is the length of the genotype in bits, i.e. on average, we will have two bit changes per mutation. The crossover operator selects two individuals and exchanges parts of the genetic material of one individual with the other individual (two-point crossover with a probability pcross = 0.5). Fitness is computed on the current input image for both parents and offspring. All individuals, parents and offspring, are sorted with respect to fitness. Individuals with the same fitness are considered to be identical. Only the best μ individuals are selected as parents for the next generation. The set of image processing operators is shown in Figure 4. The operators have been fully described in [4]. The individuals usually adapt to the given problem within a relatively short amount of time, i.e. on average within 10 image frames.
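The fitness measure and the switching of evolution on and off can be illustrated with a small self-contained C sketch. The function and state variable names are assumptions made here for illustration; only the distance measure and the 25-pixel/10-pixel/five-frame thresholds are taken from the text.

```c
/* Sketch of the fitness measure f = |pm - pd| and of the hysteresis used
 * to turn evolution on and off, assuming the thresholds given above. */
#include <math.h>

typedef struct { double x, y; } Point;

/* distance between the moving object position and the detected position */
double fitness(Point pm, Point pd)
{
    return hypot(pm.x - pd.x, pm.y - pd.y);
}

/* returns 1 if evolution should run on the next frame, 0 otherwise */
int update_evolution_state(double best_fitness, int evolving, int *good_frames)
{
    if (!evolving) {
        if (best_fitness > 25.0) {   /* detector degraded: restart evolution */
            *good_frames = 0;
            return 1;
        }
        return 0;
    }
    if (best_fitness < 10.0) {
        if (++(*good_frames) >= 5)   /* good for five consecutive frames */
            return 0;
    } else {
        *good_frames = 0;
    }
    return 1;
}
```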
Fig. 3. (a/b) Probability that evolution had to be restarted. (c/d) Percentage of image frames that evolution was required (a/c) Radio controlled car (b/d) Toy train.
High level image processing operators: Image, DX, DY, Lap, Grad, ImageGray, ImageChrom, ImageLogDX, ImageConv1, ImageConv4, ImageConv16, ImageConvd, ImageSeg, 0.0, 0.5, 1.0
Low level point operations: id, abs, dot, sqrt, norm, clamp(0,1), step(0), step(0.5), smstep(0,1), red, green, blue, avg, minChannel, maxChannel, equalMin, equalMax, gateR, gateG, gateB, gateRc, gateGc, gateBc, step, +, -, *, /, min, max, clamp0, clamp1, mix, step, lessThan, greaterThan, dot, cross, reflect, refract
Fig. 4. Set of image processing operators
The evolved detectors may not be perfect but are able to locate the object approximately. Over time, this detector will be refined by evolution. As previous experiments have shown, the evolved detectors are very robust, i.e. the object is still located even though it was distorted or had changed its scale [4]. Figure 5(a) shows the time required to evaluate a single individual depending on the representation used. Results are shown for GPU accelerated processing as well as standard CPU processing. It is clear that the time is linear in the number of high level operators when CPU processing is used.
Fig. 5. (a) time required to evaluate a single individual depending on the representation used. (b) speedup.
For this experiment, the sequence of the radio controlled car was used. All of the individuals used a 2 × 2 processing matrix and only the number of high level operators was varied. The evaluated individuals were generated completely at random to ensure an unbiased sampling of the search space. Each input image had a size of 512 × 288 pixels. Figure 5(b) shows the speedup obtained. This data was measured on a Linux system (Intel Core 2 CPU running at 2.13 GHz) equipped with a GeForce 9600GT/PCI/SSE2 graphics card. Given a more powerful graphics card, the system can easily be scaled up. One can simply increase the number of algorithms which are evaluated for each image or one can increase the size of the images which are processed. With the 3−2×2 representation (see Figure 2), 78.5% of the total computation time is used to perform image processing on the GPU, 2.7% is used for rendering and the remaining 18.8% is used for all other computations on the CPU, including the operations of the evolutionary algorithm. Figure 6 shows how useful the different operators are for detecting the radio controlled car as well as the toy train.
Fig. 6. Operator usage sorted from most often used to least often used. (a) high level operators (b) point operators.
The plot was generated by tabulating how often each operator occurred during all the experiments carried out to produce Figure 3. The most useful operator was ImageChrom. This operator is especially useful for creating object detectors based on color, as it can be used to compute chromaticities. Since some operators are more useful than others, we conducted another set of experiments. The set of high level operators was limited to the 8 most useful high level operators and the set of low level point operators was limited to the 16 most useful point operators. This reduced set of operators is shown in Figure 7.
High level image processing operators: Image, DY, ImageGray, ImageChrom, ImageConv1, ImageConv4, ImageSeg, 0.5
Low level point operations: id, abs, sqrt, norm, clamp(0,1), smstep(0,1), red, green, gateR, gateG, -, *, min, cross, reflect, refract
Fig. 7. Reduced set of image processing operators
Removing unnecessary operators has the advantage that the pixel shaders which are used to map the genotype to an image processing algorithm can be considerably simplified. It is not possible to directly map an operator to an executable function as it is possible in C or C++. With a pixel shader, this needs to be done using a sequence of if-instructions. A case statement is not available. For our initial experiments, we had a total of 56 operators, i.e. 16 high level operators and 40 point operations. Hence, the pixel shader has to run through 56 if-statements for every output pixel that is computed. These if-instructions require considerable time to execute. In comparison, a CPU-only implementation is of course able to omit these operations per pixel. The CPU implementation decides once for each node which image processing operator is applied and then computes all of the output pixels using this image processing operator.
Figure 8 shows the results using the reduced operator set. Compared to Figure 3, we see that the evolved operators are more robust when the reduced operator set is used, i.e. evolution has to be turned on for fewer generations. The difference is statistically significant at the 95% level for half of the representations. For the remaining experiments, the difference is not statistically significant. In other words, when the reduced operator set was used, performance remained the same or improved. We also see that fewer restarts are required (except for representations 2−1×2 and 6−2×6 for the radio controlled car and 2−2×2 for the toy train). On average, good solutions are found within 8 images, compared to 10 images using the full instruction set. Figure 9(a) shows the time required to evaluate a single individual when the reduced operator set was used compared to the full set using GPU acceleration. On average, the time required to evaluate an individual was reduced by a factor of 1.5. Since the complexity of the pixel shader is considerably reduced with the reduced operator set, much less time was required to evaluate a single individual and hence the speedup improved. Figure 9(b) shows the speedup comparing the reduced operator set with and without GPU acceleration.
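The dispatch difference discussed at the start of this section can be sketched in C. The operator names and the Pixel type below are illustrative assumptions; the point is only the contrast between testing the operator id for every pixel (as the universal pixel shader must) and selecting it once per node (as the CPU implementation can).

```c
/* Illustration of per-pixel if-chain dispatch versus per-node dispatch. */
#include <math.h>

typedef float Pixel;

enum { OP_ID, OP_ABS, OP_SQRT /* ... one id per operator ... */ };

/* shader-style: executed once per pixel, if-chain over the operator ids */
Pixel apply_op_per_pixel(int op, Pixel in)
{
    if (op == OP_ID)   return in;
    if (op == OP_ABS)  return fabsf(in);
    if (op == OP_SQRT) return sqrtf(in);
    /* ... 56 such tests with the full operator set, 24 with the reduced set ... */
    return in;
}

/* CPU-style: the operator is chosen once per node, not once per pixel */
void apply_op_cpu(int op, const Pixel *in, Pixel *out, int n)
{
    switch (op) {
    case OP_ABS:
        for (int i = 0; i < n; i++) out[i] = fabsf(in[i]);
        break;
    case OP_SQRT:
        for (int i = 0; i < n; i++) out[i] = sqrtf(in[i]);
        break;
    default:
        for (int i = 0; i < n; i++) out[i] = in[i];
        break;
    }
}
```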
Fig. 8. (a/b) Average number of times that evolution had to be restarted using the reduced operator set. (c/d) Percentage of image frames that evolution was required using the reduced operator set. (a/c) Radio controlled car (b/d) Toy train.
Fig. 9. (a) time required to evaluate a single individual using GPU acceleration. (b) speedup comparing the reduced instruction set with and without GPU acceleration.
6
Conclusion
We have built an evolutionary object recognition system working at interactive rates using GPU processing. Initially, objects had to be manually identified by the user using the mouse pointer. The current system uses motion as a cue to detect moving objects in video sequences. The largest moving object provides
the teaching input. The system evaluates a population of algorithms for each new image. The representation is a variant of the Cartesian Genetic Programming representation. First, a number of high level operators are applied. The output of these high level operators is then recombined using a processing matrix of point operations. The processing matrix provides one or more transformed images as output. If more than one output image is computed, then these are averaged to obtain a single output image. The largest RGB response is the detected object position. Fitness is computed as the Euclidean distance between the detected and the actual object position. Offspring are generated using mutation and crossover operators. Half of the offspring are generated at random for each new image. This allows evolution to restart at any point in time. The evolved object detectors receive only a single image as input. The size of the moving object is unknown to the detectors. However, since the objects move across the background, the surrounding of the object will constantly change. The evolved object detectors need to focus on features which remain constant over the course of several image frames. Object detectors focusing on varying features will be eliminated from the population. Only the detectors using robust detection strategies will remain. With this contribution, we have thoroughly evaluated which operators are most useful for object detection. The speedup obtained through GPU processing was also analyzed. By reducing the number of operators, the speedup could be improved. The current system is a step towards a fully self-adapting/self-learning vision system. Object detectors are evolved using one cue (motion) but detect the object when this cue is not available. Future research will focus on the evolution of concepts.
References
1. Banzhaf, W., Nordin, P., Keller, R.E., Francone, F.D.: Genetic Programming – An Introduction: On The Automatic Evolution of Computer Programs and Its Applications. Morgan Kaufmann Publishers, San Francisco (1998)
2. Cagnoni, S.: Evolutionary computer vision: a taxonomic tutorial. In: 8th Int. Conf. on Hybrid Intelligent Systems, pp. 1–6. IEEE Comp. Soc., Los Alamitos (2008)
3. Ebner, M.: Engineering of computer vision algorithms using evolutionary algorithms. In: Blanc-Talon, J., Philips, W., Popescu, D., Scheunders, P. (eds.) Advanced Concepts for Intelligent Vision Systems, Bordeaux, France, pp. 367–378. Springer, Heidelberg (2009)
4. Ebner, M.: A real-time evolutionary object recognition system. In: Vanneschi, L., Gustafson, S., Moraglio, A., Falco, I.D., Ebner, M. (eds.) Genetic Programming: Proc. of the 12th Europ. Conf., Tübingen, Germany, pp. 268–279. Springer, Berlin (2009)
5. Ebner, M.: Towards automated learning of object detectors. In: Applications of Ev. Computation, Proc., Istanbul, Turkey, pp. 231–240. Springer, Berlin (2010)
6. Ebner, M., Zell, A.: Evolving a task specific image operator. In: Poli, R., Voigt, H.-M., Cagnoni, S., Corne, D., Smith, G.D., Fogarty, T.C. (eds.) Joint Proc. of the 1st Europ. Workshops on Evolutionary Image Analysis, Signal Processing and Telecommunications, Göteborg, Sweden, pp. 74–89. Springer, Berlin (1999)
7. Fung, J., Mann, S., Aimone, C.: OpenVIDIA: Parallel GPU computer vision. In: Proc. of the 13th Annual ACM Int. Conf. on Multimedia, Singapore, vol. 5, pp. 849–852. ACM, New York (2005)
8. Fung, J., Tang, F., Mann, S.: Mediated reality using computer graphics hardware for computer vision. In: Proc. of the 6th Int. Symp. on Wearable Computers, pp. 83–89. ACM, New York (2002)
9. Harris, C., Buxton, B.: Evolving edge detectors with genetic programming. In: Koza, J.R., Goldberg, D.E., Fogel, D.B., Riolo, R.L. (eds.) Genetic Programming, Proc. of the 1st Annual Conf., Stanford University, pp. 309–314. The MIT Press, Cambridge (1996)
10. Jain, R., Kasturi, R., Schunck, B.G.: Machine Vision. McGraw-Hill, New York (1995)
11. Johnson, M.P., Maes, P., Darrell, T.: Evolving visual routines. In: Brooks, R.A., Maes, P. (eds.) Artificial Life IV, Proc. of the 4th Int. Workshop on the Synthesis and Simulation of Living Systems, pp. 198–209. The MIT Press, Cambridge (1994)
12. Katz, A.J., Thrift, P.R.: Generating image filters for target recognition by genetic learning. IEEE Trans. on Pattern Analysis and Machine Intelligence 16(9), 906–910 (1994)
13. Koza, J.R.: Genetic Programming: On the Programming of Computers by Means of Natural Selection. The MIT Press, Cambridge (1992)
14. Krawiec, K., Bhanu, B.: Visual learning by evolutionary and coevolutionary feature synthesis. IEEE Trans. on Evolutionary Computation 11(5), 635–650 (2007)
15. Lohmann, R.: Bionische Verfahren zur Entwicklung visueller Systeme. PhD thesis, Technische Universität Berlin, Verfahrenstechnik und Energietechnik (1991)
16. Miller, J.F.: An empirical study of the efficiency of learning boolean functions using a Cartesian Genetic Programming approach. In: Banzhaf, W., et al. (eds.) Proc. of the Genetic and Evolutionary Computation Conf., pp. 1135–1142. Morgan Kaufmann, San Francisco (1999)
17. NVIDIA: CUDA. Compute Unified Device Architecture, Version 1.1 (2007)
18. Poli, R.: Genetic programming for image analysis. In: Koza, J.R., Goldberg, D.E., Fogel, D.B., Riolo, R.L. (eds.) Proc. of the 1st Annual Conf. on Genetic Programming, Stanford University, pp. 363–368. The MIT Press, Cambridge (1996)
19. Rizki, M.M., Tamburino, L.A., Zmuda, M.A.: Evolving multi-resolution feature detectors. In: Fogel, D.B., Atmar, W. (eds.) Proc. of the 2nd Ann. Conf. on Evolutionary Programming, pp. 108–118. Evolutionary Prog. Society (1993)
20. Rost, R.J.: OpenGL Shading Language, 2nd edn. Addison-Wesley, Upper Saddle River (2006)
21. Shapiro, L.G., Stockman, G.C.: Computer Vision. Prentice Hall, Upper Saddle River (2001)
22. Trujillo, L., Olague, G.: Synthesis of interest point detectors through genetic programming. In: Proc. of the Genetic and Evolutionary Computation Conf., Seattle, Washington, July 8-12, pp. 887–894. ACM, New York (2006)
23. Witkin, A.P.: Scale-space filtering. In: Proc. of the 8th Int. Joint Conf. on Artificial Intelligence, Karlsruhe, Germany, pp. 1019–1022 (1983)
Systemic Computation Using Graphics Processors
Marjan Rouhipour¹, Peter J. Bentley², and Hooman Shayani²
¹ BIHE University (The Bahá'í Institute for Higher Education), Iran
[email protected]
² Department of Computer Science, University College London, Malet Place, London
{p.bentley,h.shayani}@cs.ucl.ac.uk
Abstract. Previous work created the systemic computer – a model of computation designed to exploit many natural properties observed in biological systems, including parallelism. The approach has been proven through two existing implementations and many biological models and visualizations. However to date the systemic computer implementations have all been sequential simulations that do not exploit the true potential of the model. In this paper the first parallel implementation of systemic computation is introduced. The GPU Systemic Computation Architecture is the first implementation that enables parallel systemic computation by exploiting multiple cores available in graphics processors. Comparisons with the serial implementation when running a genetic algorithm at different scales show that as the number of systems increases, the parallel architecture is several hundred times faster than the existing implementations, making it feasible to investigate systemic models of more complex biological systems. Keywords: Bio-inspired computation; systemic computation; GPU; parallel architectures; genetic algorithm.
1 Introduction
In biological modeling and bio-inspired computation, the demand for fast parallel computation has never been greater. In computer science, entire fields now exist that are based purely on the tenets of simulation and modelling of biological processes. In the developing fields of synthetic biology, DNA computing, and living technology, computer modelling plays a vital role in the design, testing and evaluation of almost every stage of the research.¹ There are also many fields of computer science that focus on bio-inspired algorithms, such as genetic algorithms, artificial immune systems, developmental algorithms, neural networks, and swarm intelligence. Almost without exception these computer models and algorithms involve parallelism, although they are usually implemented as serial simulations of parallel processes. While multi-core processors, clusters or networked computers provide one way to parallelise the computation, the underlying computer architectures remain serial, and
¹ Evident from the many publications of the European Center for Living Technology: http://www.ecltech.org/publications.html
so can significantly limit our ability to scale up our models and bio-inspired algorithms and make them practical for real-world problems [1]. Several research groups have focused on this area for many years, resulting in novel bio-inspired architectures such as the POEtic tissue [2] and the Ubichip of the Perplexus project [3], as well as formalisms such as Pi-Calculus, Bigraphs [4] and Brane Computing [5]. Systemic computation is a similar attempt to exploit desirable natural properties such as parallelism within a computer, developed in 2005 [1]. Although not the first such model, SC is the result of considerable research into bio-inspired computation and biological modelling, and has been developed into a working computer architecture [1,6]. To date, two simulations of this architecture have been developed, with corresponding machine and programming languages, compilers and a graphical visualiser [1,6]. Extensive work has shown how this form of computer enables useful biological modeling and bio-inspired algorithms to be implemented with ease [7-11] and how it enables properties such as fault-tolerance and self-repairing code [12]. Research is ongoing on the improvement of the PC-based simulator, refining the systemic computation language and visualiser [10,14]. However, the systemic computation model defines a highly parallel, distributed form of computer. While simulations on conventional computers enable improvement of the model and programming tools, the speed of simulated systemic computation is too slow to be usable for larger models. The work described in this article aims to overcome this problem by making use of Graphics Processing Units (GPUs) to parallelize some of the bottlenecks in systemic computation and thus take the first steps towards a fully parallel systemic computer, capable of high-speed modeling. Graphics Processing Units are multi-core processors designed to process graphical information at high speed. Because of their price and power, the use of GPUs for more general-purpose computation is rapidly becoming something of a revolution in affordable parallel computation [15]. More recently the design of GPUs was changed to support more general computation. Today GPGPU (General Purpose GPU) languages are used widely in scientific computation. They are used in physically-based simulation, signal and image processing, global illumination, and geometric computing [15]. GPGPU is frequently used in biological modelling and visualization that requires large-scale computation and real-time processing. GPUs have been used in molecular modeling applications [17], string matching to find similar protein and gene sequences [18], and in implementations of bio-inspired algorithms [19]. This work describes a novel GPU-based implementation of the bio-inspired computing approach known as systemic computation. The next section summarizes systemic computation. The new GPU Systemic Computation Architecture is then described, followed by an experiment to compare the GPU version with the single-processor implementation.
2 Systemic Computation
Systemic Computation is a model of computation and corresponding computer architecture based on a systemics world-view and supplemented by the incorporation of natural characteristics [1]. This approach stresses the importance of structure and interaction, supplementing traditional reductionist analysis with the recognition that circular causality, embodiment in environments and emergence of hierarchical
organisations all play vital roles in natural systems. Systemic computation makes the following assertions:
• Everything is a system.
• Systems can be transformed but never destroyed or created from nothing.
• Systems may comprise or share other nested systems.
• Systems interact, and interaction between systems may cause transformation of those systems, where the nature of that transformation is determined by a contextual system.
• All systems can potentially act as context and affect the interactions of other systems, and all systems can potentially interact in some context.
• The transformation of systems is constrained by the scope of systems, and systems may have partial membership within the scope of a system.
• Computation is transformation.
In systemic computation, everything is a system, and computations arise from interactions between systems. Two systems can interact in the context of a third system. All systems can potentially act as contexts to determine the effect of interacting systems. Every system is divided into three parts: two schemata and one functional region. These three parts can be used to hold anything (data, typing, etc.) in binary. The functional region defines the transformation of two systems interacting in its context. The two schemata specify through a matching function which subject systems may interact in this context. A system can also contain or be contained by other systems. This enables the notion of scope. Interactions can only occur between systems within the same scope. An SC program therefore comprises systems that are instantiated and positioned within an embedded hierarchy (some inside each other). It defines an initial state from which the systems can then randomly interact, transforming each other through those interactions and following an emergent process rather than a deterministic algorithm. For full details see [1] and [10].
Fig. 1. A basic SC program, which sums the values held in the schemata of systems, leaving one system containing the solution. Initial state is shown on the left, state after several interactions is shown on the right.
Systems can be implemented using representations similar to those used in genetic algorithms and cellular automata. In implementations to date, each system comprises three binary strings: two schemata that define sub-patterns of the two matching systems and one coded pointer to a transformation function. Two systems that match the schemata will be transformed according to the appropriate transformation function (i.e. their binary strings are modified according to the function defined in the context). A simple example of a (partially interpreted) system string (where S11 is the first schema and S12 is the second schema of system 1) might be: “zzx00rzz
[S11=SUM(S11,S21); S12=SUM(S12,S22); S21=0; S22=0]
zzx00rzz”
meaning: for every two systems that have functional part of NOP that interact in the context of this system, add their two S1 values, storing the result in S1 of the first system and add the two S2 values, storing the result in S2 of the first system, then set S1 and S2 of the second system to zero. (The table of schema codes is given in [1].) Given a pool of inert data systems, able to interact but with no ability to act as context, for example (where NOP means “no operation”): “00010111 NOP 01101011” and “00001111 NOP 00010111”
after a sufficient period of interaction, the result will be a single system with its S1 and S2 values equal to the sum of all S1 and S2 values of all data systems, with all other data systems having S1 and S2 values of zero. Figure 1 illustrates the program using SC graph notation. (The program performing this operation was described in [1].) Bentley [1] describes the first implementation of the systemic computer. The initial work included the creation of a virtual architecture, instruction set, machine code and corresponding assembly language with compiler. Over 30 transformation functions were implemented (e.g., arithmetic and logical operations, and basic i/o). The simulation was implemented in ANSI C on a PowerBook Macintosh G4, enabling systemic computation programs to be simulated using conventional computer processors. Later work by Le Martelot created a second implementation on PCs with a higher-level language and visualization tools [2,6-12,14]. Other work provided a discussion on the use of sensor networks to implement a systemic computer [13]. Many systemic computation models have been written, showing that simulations of this parallel computer can perform tasks from investigations of neurogenesis to a self-adaptive genetic algorithm solving a travelling salesman problem [10]. Work on the language and refinements to systemic computation and its use for modelling are underway. However, perhaps the biggest single problem with all implementations to date has been the speed of execution. The simulation of a parallel bio-inspired computer on a conventional serial computer can be excessively slow for models using large numbers of systems. For this reason, we hypothesize that an implementation on GPUs may provide significant speedup.
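To make the three-part system representation and the SUM interaction more concrete, the following is a hypothetical C sketch. The struct fields, sizes and the sum_interaction() helper are illustrative assumptions, not the data structures of the implementations cited above.

```c
/* Hypothetical sketch of a "system" and of the SUM interaction. */
#include <stdint.h>
#include <stdio.h>

#define PART_LEN 16                 /* each part is 16 character codes */

typedef struct {
    uint8_t schema1[PART_LEN];      /* first schema (matches subject 1)  */
    uint8_t function[PART_LEN];     /* transformation function or NOP    */
    uint8_t schema2[PART_LEN];      /* second schema (matches subject 2) */
    int     value1;                 /* interpreted content of schema 1   */
    int     value2;                 /* interpreted content of schema 2   */
} System;

/* SUM context: add the values of the two interacting data systems into
 * the first one and set the second one to zero, as in the example. */
void sum_interaction(System *a, System *b)
{
    a->value1 += b->value1;
    a->value2 += b->value2;
    b->value1 = 0;
    b->value2 = 0;
}

int main(void)
{
    System s1 = { .value1 = 0x17, .value2 = 0x6B };  /* "00010111 NOP 01101011" */
    System s2 = { .value1 = 0x0F, .value2 = 0x17 };  /* "00001111 NOP 00010111" */
    sum_interaction(&s1, &s2);
    printf("s1: %d %d  s2: %d %d\n", s1.value1, s1.value2, s2.value1, s2.value2);
    return 0;
}
```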
3 GPU Implementation of Systemic Computation
Systemic computation is a Turing Complete parallel computer [1], but it is still a significant challenge to exploit the parallel architecture of the GPU to implement such a flexible bio-inspired approach.
The constituent elements of Systemic Computation are systems. Two systems can only interact with a context and within a shared scope, which are also systems. Akin to Bentley's implementation [1], membership of scopes can be implemented as a global scope table, with a row and column for every system, each entry defining the membership of one system within another system. Also akin to Bentley's original implementation [1], in the GPU design we can implement the concept of a system by storing it in three binary parts: two schemata and one function. If the current system is acting as a context, then its two schemata define all possible pairs of systems that could interact within the current context. The system function defines the interaction between the two systems, i.e. it provides a transformation function. In the implementation created by Bentley [1], the transformation function contains a matching threshold for schema 1 and schema 2. The length of each part of the system is 16 character codes. Each code is decoded into three characters, each of which can be 0, 1, or a wildcard. If the difference between a system and the decoded schema part is less than the schema's threshold, that system matches the context system's schema [1]. We call a system that has a valid transformation function a 'context' or 'functional' system. A context system and two interacting systems together are called a triplet. If two interacting systems match the schema parts of a functional system to form a triplet, and all of them are in the same scope, it is defined as a matched or valid triplet. The GPU Systemic Computation architecture has two main tasks: 1) finding a valid triplet, and 2) performing a transformation on the interacting systems. The finding of valid triplets is the biggest bottleneck in systemic computation, so in our design, the producer finds matched triplets and puts them in a shared buffer whilst the consumer picks one of the triplets and performs the interaction between them. The producer and consumer are two threads, running in parallel on the CPU, see Figure 2.
3.1 Consumer: Performing System Interactions
The consumer is a thread running on the CPU. It is responsible for enabling the interactions between systems. This thread selects valid triplets (each a valid context and two matching systems) randomly from the shared buffer. It then uses the transformation function defined in the context system to transform the pair of interacting systems. However, performing the transformation may change the systems' scopes and definitions. As a result, other triplets in shared memory that share one of the current triplet's systems may no longer match (i.e., the transformation of one triplet may invalidate other triplets). In order to solve this problem we check the validity of a triplet before performing the interaction. After selecting a triplet, if the triplet is still a matched triplet, the transformation is performed. Then the flags of the systems' definitions and scopes are set. These flags are necessary to update the data in GPU memory for the other thread, the producer.
3.2 Producer: Finding Matching Systems
General Purpose Graphics Processing Unit (GPGPU) languages are based on shared memory [20][21]. In the GPU Systemic Computation Architecture, systems are shared data and the instructions that check the validity of triplets in a scope are the same for all threads. So, CUDA is a good choice to find matched triplets in parallel.
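The schema-matching test just described can be sketched in C as follows. The names, the decoded schema length and the symbol encoding are assumptions made here for illustration only.

```c
/* Illustrative sketch of threshold-based schema matching with wildcards:
 * a candidate system matches a decoded context schema if the number of
 * mismatching non-wildcard positions is below the schema's threshold. */
#include <stdint.h>

#define SCHEMA_SYMBOLS 48   /* 16 character codes, each decoding to 3 symbols */

enum { SYM_0 = 0, SYM_1 = 1, SYM_WILD = 2 };

/* schema[] holds the decoded context schema, bits[] the candidate system */
int matches_schema(const uint8_t schema[SCHEMA_SYMBOLS],
                   const uint8_t bits[SCHEMA_SYMBOLS],
                   int threshold)
{
    int diff = 0;
    for (int i = 0; i < SCHEMA_SYMBOLS; i++) {
        if (schema[i] != SYM_WILD && schema[i] != bits[i])
            diff++;
    }
    return diff < threshold;   /* matched when the difference is under threshold */
}
```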
Fig. 2. Producer and Consumer flowchart
Finding a list of matched triplets is both a sequential and a parallel procedure. There are six main steps: initializing, updating, finding matched triplets, prefix sum, creating a list of matched triplets, and copying them to the shared buffer. These steps are run sequentially on the CPU. The third, fourth, and fifth steps are the main parts that are run on the GPU. Each of these steps is a kernel, a section of code that is run on the GPU. As only one kernel can be run on the GPU at a time in CUDA, and the output of each step is an input to the next step, these steps run sequentially from the CPU, but each is individually parallel on the GPU. In the summary below we focus on the kernels.
STEP 1: Initialize. Memory is allocated on the GPU for different variables: system definitions, the scope table and decoded systems (including the two decoded schemata and the threshold function).
STEP 2: Update. The other thread, which performs the transformation functions, changes the scope table and system definitions, and so the producer thread always updates the variables on the GPU before checking all possible combinations of triplets. Then new
values of the variables are copied from the host (the CPU side of the hardware) to the device (the GPU side). In addition, functional systems and valid scopes (scopes containing at least three systems and at least one functional system) are found and copied to GPU memory. To find the differences between a schema part and a system, the schema is decoded; therefore, a list of decoded functional systems is prepared on the device and updated after being changed by the consumer thread.
STEP 3: Finding Matched Triplets. In this step a kernel is called that searches through all possible triplets in order to find matched triplets. Before calling the kernel, memory is allocated for a list of flags. Each thread has a flag that is initialized to zero. Next, the GPU grid and block dimensions are set. All threads in a block check one scope and context for some interacting systems. Thus, one thread in each block, usually the first one, calculates the index of the context and scope systems and stores it in the shared memory of the block; meanwhile, the other threads in the block wait. If the context is in the scope, the indexes of the interacting systems are then calculated. After that, if the interacting systems are in the scope and the three systems match, the triplet's flag is set to one. (Memory access is optimized by making use of the short-latency, on-chip memory as a cache.)
STEP 4: Prefix Sum. In this step we want to create a list of the matched triplets whose flags were set in the previous step. In order to do so, the index of each matched triplet in the new list has to be found. The prefix sum kernel is run on the flags to find the indexes of the matched triplets. The prefix sum calculates, for each matched triplet, the number of previously matched triplets in the current list. A parallel implementation of this algorithm is available in the CUDA SDK 2.2.² The current implementation of this algorithm is only run in one grid dimension, but we have changed it to two dimensions to support larger arrays, if sufficient memory is available.
STEP 5: Creating the List of Triplets. The kernel for creating the list of matched triplets is run with the same grid and block dimensions as the find-matched-triplets kernel. The first thread calculates the block's context and scope and stores them in shared memory. Each thread then reads two consecutive entries of the prefix-summed flags. If a thread finds two different values, the thread's indexes indicate the indexes of the matched triplet's systems and scope. Since each flag entry is read twice, as an optimization the first 16 threads of each block load 17 consecutive flag entries into shared memory. This halves the number of reads from global memory.
STEP 6: Copy Matched Triplet List to Buffer. The list of matched triplets is copied to the host. It is then randomized and copied to the shared buffer. For a large number of systems, it is not possible to process all combinations of triplets at once, because there is not enough memory on the GPU for the flags and the list of matched triplets. So, at each time step, a subset of all possible triplets is chosen randomly and processed, and a list of the matched triplets from that subset is created. After all possible triplets have been checked, the variables in GPU memory are updated and the next round starts.
² Available online at: http://www.nvidia.com/object/cuda_get.html
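The flag-and-compact pattern behind Steps 3–5 can be sketched as follows. This is an illustrative reconstruction rather than the authors' code: the matching test is reduced to a placeholder predicate on a precomputed score (the real kernel compares the decoded schemata against the two interacting systems within a scope), and the Step 4 scan can be taken from the SDK sample or, in current CUDA, from thrust::exclusive_scan.

    #include <cuda_runtime.h>

    /* Step 3 (simplified): one thread per candidate triplet sets a flag.
       The real match test checks that the context and interacting systems lie
       in the same scope and that the decoded schemata match the two systems. */
    __global__ void flag_matches(const int *match_score, int n, int *flag)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) flag[i] = (match_score[i] > 0);
    }

    /* Step 5 (simplified): scatter each matched triplet to the position given
       by the exclusive prefix sum of the flags (Step 4), i.e. the number of
       matched triplets that precede it. */
    __global__ void compact_matches(const int *flag, const int *prefix_sum,
                                    const int *triplet_id, int n, int *matched_list)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && flag[i]) matched_list[prefix_sum[i]] = triplet_id[i];
    }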
4 Architecture Testing and Evaluation
In order to assess the new implementation we run a systemic program on the original architecture created by Bentley [1] and on the new GPU Systemic Computation Architecture, and compare the results. In the following section we outline a problem implemented in the Systemic Computation language: the knapsack problem solved using a Genetic Algorithm (GA). The problem is specifically designed to challenge the systemic computer with a complex parallel computation.
4.1 Genetic Algorithm Optimization of Binary Knapsack
In the knapsack problem there are n objects with values v_i > 0 and weights w_i > 0. We want to find the set of objects with maximum total value whose combined weight fits into a knapsack of capacity C. Thus, we wish to maximize

    Σ_{i=1..n} v_i x_i    where    Σ_{i=1..n} w_i x_i ≤ C  and  x_i ∈ {0,1}
Here, we use a Genetic Algorithm (GA) [10] as the bio-inspired algorithm implementing the optimization program. The binary knapsack implementation in the Systemic Computation language is derived from the genetic algorithm model developed for systemic computation in [2]. There are three different types of solution system, as shown in Fig. 3: uninitialized solutions, initialized solutions, and final solutions. The chromosome size equals the schema size (16 in this implementation), so this program supports a knapsack with 16 objects.
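To make the evaluation step concrete, the sketch below scores a 16-bit chromosome (one bit per object) against value and weight tables; the table contents, the capacity argument and the zero-fitness penalty for overweight solutions are assumptions of this example, since the paper does not state its exact evaluation scheme.

    #include <stdint.h>

    /* Illustrative knapsack fitness for a 16-bit chromosome: bit i selects
       object i. Overweight solutions are simply given zero fitness here. */
    double knapsack_fitness(uint16_t chrom,
                            const double v[16], const double w[16], double capacity)
    {
        double value = 0.0, weight = 0.0;
        for (int i = 0; i < 16; i++) {
            if (chrom & (1u << i)) { value += v[i]; weight += w[i]; }
        }
        return (weight <= capacity) ? value : 0.0;
    }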
Fig. 3. Left: The Solution system S. Schema1 stores the chromosome. XY in Schema2 specifies the solution type (00: non-initialized, 10: initialized, 11: final solution). Right: the systemic program (not all non-initialized, initialized solutions and GA systems are shown).
Solution initialisation is random: an initialiser system selects one solution, initialises it randomly and then places it in the scope of the computation system. Three GA operators are used: uniform crossover, one-point crossover and binary mutation [10]. Candidate (evolving) solutions to the problem are implemented as systems, and GA operators are implemented as context systems, defining the interaction of pairs of
solution systems. Each operator performs crossover or mutation to generate a new solution; this is evaluated and the less fit solution is then replaced by the new one. The final solution system is used to store the final chromosome. The output context accepts the final solution and one initialised solution, and updates the final solution if the input solution system has a better fitness than the final solution system.
4.2 Experiments
To assess any performance gain of the GPU Systemic Computation Architecture, comparisons are made with Bentley's original serial implementation [1]. We study the effect of the number of systems. The execution time in both experiments includes the running time for 10,000 interactions. It does not include reading the input and program files, initializing variables on the CPU or storing results, but it does include initializing and updating the GPU memory and allocating and releasing memory on it. The hardware and software specification used in this work is given in Table 1. Bentley's original Systemic Computation Architecture was written in C [1]; we therefore use C with CUDA as the programming language. Setup. The goal of the experiment is to compare the performance of the new parallel implementation and Bentley's original sequential implementation of the systemic computation architecture with an increasing number of Solution systems on the GA binary knapsack program. In this experiment, increasing the number of systems refers to increasing the number of GA solution systems. Each systemic computation program was run 10 times; the reported execution time is the average over the runs. In this experiment, the number of knapsack objects is 16 and the maximum knapsack weight is 80.0 kg. The configuration of the experiment is:
• Context systems: 3 GA systems and 1 output system
• Solution systems: 50 to 4000 systems for the sequential implementation and 50 to 8000 systems for the parallel implementation (each increment is double the previous one, except 800 to 1000 with an increment of 200)
• Final Solution system: 1 system
• Scope: 1 main scope and 1 computation scope
• Initial increment: 50

Table 1. Hardware and operating system specifications used for the experiments

CPU: Intel® dual core™, 2.40 GHz
RAM: 2 GB
OS:  Microsoft Windows XP Professional 2002 SP1
GPU: GeForce 9800 GT, CUDA 1.1, 1 GB global memory, 14 multiprocessors, 112 cores, 1.62 GHz clock rate
Results
Figure 4 shows the execution time for both the parallel and sequential implementations with increasing number of systems. As can be seen in the figure, the execution time of the sequential implementation increases as a 2nd-degree polynomial, characteristic of algorithms and implementations with time complexity O(n²). The parallel implementation appears to increase linearly, characteristic of algorithms and implementations with time complexity O(n). The increase in performance varies according to the number of systems: e.g. the parallel version is 2.3 times as fast for 50 systems, 108 times faster for 800 systems, 256 times faster for 2000 systems, and 465 times faster for 4000 systems. The improvement derives from the efficient division of labour across the parallel resources of the GPU; it is likely that as the number of systems increases beyond the parallel capacity of the GPU, the execution time will again appear as O(n²). Consistent results have since been found with other programs, including those that move systems between scopes.
Fig. 4. Top: Execution time of knapsack problem on both sequential and parallel implementation with increasing number of systems. Bottom left: execution time of parallel implementation alone. Bottom right: improvement as shown by sequential divided by parallel execution times for different numbers of systems in the program.
5 Conclusion
The need for fast bio-inspired computation has never been greater. Systemic computation is a new bio-inspired model of computation that has shown considerable success for biological modeling and bio-inspired computation [6-14]. However, until now it has only been available as a serial simulation running on conventional processors. In this work the first parallel GPU Systemic Computation Architecture was presented. Its performance was assessed by comparing the change in execution time needed when scaling up the number of systems within a genetic algorithm knapsack problem. As the number of systems increased, the parallel GPU architecture became several hundred times faster than the existing serial implementation. These highly successful results will make it possible to investigate systemic models of more complex biological systems in the future. Further improvements are also planned. The GPU SC architecture is just the first step towards creating a fully parallel systemic computer. Future work will continue the development of parallel SC architectures and will investigate the use of reconfigurable hardware such as FPGAs.
References
[1] Bentley, P.: Systemic computation: A model of interacting systems with natural characteristics. J. of Parallel, Emergent and Distributed Systems 22, 103–121 (2007)
[2] Barker, W., Halliday, D.M., Thoma, Y., Sanchez, E., Tempesti, G., Tyrrell, A.M.: Fault tolerance using dynamic reconfiguration on the POEtic tissue. IEEE Transactions on Evolutionary Computation 11(5), 666–684 (2007)
[3] Upegui, A., Thoma, Y., Sanchez, E., Perez-Uribe, A., Moreno, J.-M., Madrenas, J.: The Perplexus bio-inspired reconfigurable circuit. In: Proceedings of the Second NASA/ESA Conference on Adaptive Hardware and Systems (AHS 2007), pp. 600–605 (2007)
[4] Milner, R.: Pure bigraphs: structure and dynamics. Inf. Comput. 204(1), 60–122 (2006)
[5] Paun, G.: P systems with active membranes: Attacking NP-complete problems. Journal of Automata, Languages and Combinatorics 6, 75–90 (1999)
[6] Le Martelot, E., Bentley, P.J., Lotto, R.B.: A Systemic Computation Platform for the Modelling and Analysis of Processes with Natural Characteristics. In: Proceedings of the 9th Genetic and Evolutionary Computation Conference (GECCO 2007) Workshop: Evolution of Natural and Artificial Systems - Metaphors and Analogies in Single and Multi-Objective Problems, July 7-11, pp. 2809–2816 (2007)
[7] Bentley, P.J.: Methods for Improving Simulations of Biological Systems: Systemic Computation and Fractal Proteins. J. R. Soc. Interface 6, S451–S466 (2009)
[8] Le Martelot, E., Bentley, P.J., Lotto, R.B.: Exploiting Natural Asynchrony and Local Knowledge within Systemic Computation to Enable Generic Neural Structures. In: Proceedings of the 2nd International Workshop on Natural Computing (IWNC 2007), Nagoya University, Nagoya, Japan, December 10-13, pp. 122–133 (2007)
[9] Le Martelot, E., Bentley, P.J.: Metabolic Systemic Computing: Exploiting Innate Immunity within an Artificial Organism for On-line Self-Organisation and Anomaly Detection. J. of Mathematical Modelling and Algorithms (JMMA) 8(2), 203–225 (2008)
[10] Le Martelot, E., Bentley, P.J.: Modelling Biological Processes Naturally using Systemic Computation: Genetic Algorithms, Neural Networks, and Artificial Immune Systems. In: Chiong, R. (ed.) Nature-Inspired Informatics for Intelligent Applications and Knowledge Discovery, pp. 204–241. IGI Global (2008)
[11] Le Martelot, E., Bentley, P.J., Lotto, R.B.: Eating Data is Good for Your Immune System: An Artificial Metabolism for Data Clustering using Systemic Computation. In: Bentley, P.J., Lee, D., Jung, S. (eds.) ICARIS 2008. LNCS, vol. 5132, pp. 412–423. Springer, Heidelberg (2008)
[12] Le Martelot, E., Bentley, P.J., Lotto, R.B.: Crash-Proof Systemic Computing: A Demonstration of Native Fault-Tolerance and Self-Maintenance. In: Proceedings of the 4th IASTED International Conference on Advances in Computer Science and Technology (ACST 2008), Langkawi, Malaysia, April 2-4, pp. 49–55 (2008)
[13] Bentley, P.J.: Designing Biological Computers: Systemic Computation and Sensor Networks. In: Liò, P., Yoneki, E., Crowcroft, J., Verma, D.C. (eds.) BIOWIRE 2007. LNCS, vol. 5151, pp. 352–363. Springer, Heidelberg (2008)
[14] Le Martelot, E., Bentley, P.J.: On-Line Systemic Computation Visualisation of Dynamic Complex Systems. In: Proceedings of the 2009 International Conference on Modeling, Simulation and Visualization Methods (MSV 2009), July 13-16, pp. 10–16 (2009)
[15] Owens, J., et al.: A Survey of General-Purpose Computation on Graphics Hardware. Computer Graphics Forum 26, 80–113 (2007)
[16] Buck, I., et al.: Brook for GPUs: stream computing on graphics hardware. ACM Transactions on Graphics 23, 777–786 (2004)
[17] Rodrigues, C., et al.: GPU acceleration of cutoff pair potentials for molecular modeling applications. In: Conference on Computing Frontiers, pp. 273–282 (2008)
[18] Schatz, M., Trapnell, C.: Fast exact string matching on the GPU. Center for Bioinformatics and Computational Biology (2007)
[19] Clayton, T., Patel, L., Leng, G., Murray, A., Lindsay, I.: Rapid evaluation and evolution of neural models using graphics card hardware. In: Genetic and Evolutionary Computation Conference 2008, pp. 299–306 (2008)
[20] NVIDIA CUDA Programming Guide, version 2.2. NVIDIA Corporation, 2701 San Tomas Expressway, Santa Clara, CA 95050 (2009), http://developer.download.nvidia.com/compute/cuda/2_2/toolkit/docs/NVIDIA_CUDA_Programming_Guide_2.2.pdf
[21] NVIDIA GeForce 8800 GPU Architecture Technical Brief, TB-02787-001_v01 (2006)
An Efficient, High-Throughput Adaptive NoC Router for Large Scale Spiking Neural Network Hardware Implementations

Snaider Carrillo¹, Jim Harkin¹, Liam McDaid¹, Sandeep Pande², and Fearghal Morgan²

¹ Intelligent Systems Research Centre (ISRC), University of Ulster, Magee Campus, Derry, Northern Ireland
[email protected]
² Bio-Inspired Electronics and Reconfigurable Computing Research Group (BIRC), National University of Ireland, NUI Galway, Galway, Ireland
[email protected]
Abstract. Recently, a reconfigurable and biologically inspired paradigm based on network-on-chip (NoC) and spiking neural networks (SNNs) has been proposed as a new method of realising an efficient, robust computing platform. However the use of the NoC as an interconnection fabric for large scale SNN (i.e. beyond a million neurons) demands a good trade-off between scalability, throughput, neuron/synapse ratio and power consumption. In this paper an adaptive NoC router architecture is proposed as a way to minimise network delay across varied traffic loads. The novelty of the proposed adaptive NoC router is twofold; firstly, its adaptive scheduler combines the fairness policy of a round-robin arbiter and a first-come first-served priority scheme to improve SNN spike packet throughput; secondly, its adaptive routing scheme (verified using simulated SNN traffic) allows the selection of different NoC router output ports to avoid traffic congestion. The paper presents the performance and synthesis results of the proposed adaptive NoC router operating within the EMBRACE architecture. Results illustrate that the high-throughput, low area and low power consumption of the adaptive NoC router make it feasible for use in large scale SNN hardware implementations.
Keywords: Adaptive Router Architecture, Spiking Neural Networks, Network-on-Chip, EMBRACE.
1 Introduction
In the last sixty years, several computational models, such as the Hodgkin–Huxley, the leaky integrate-and-fire and the Izhikevich models, have been proposed to mimic to a certain degree the biological behaviour of real neurons. These computational neuron models have led to the creation of the interesting and powerful bio-inspired Spiking Neural Network (SNN) computational paradigm [1]. The challenge is to develop a complex, high-performance synapse/neuron interconnection pattern, implemented in an electronic device that exhibits low power consumption, reconfigurable capabilities, intrinsic parallelism and a high level of scalability.
The complexity of inter-neuron connectivity is prohibiting progress in hardware toward biological-scale SNNs, as the rapid increase in the ratio of fixed connections to the number of neurons limits the size of the network. To overcome this issue, initial research has focused on the Network-on-Chip (NoC) interconnect paradigm as a possible mechanism to support scalability. Nevertheless, the use of the NoC as an interconnection fabric for large scale SNNs (i.e. beyond a million neurons) demands a good trade-off between scalability, throughput, neuron/synapse ratio and power consumption. Consequently, the router itself plays an important role, mostly because its hardware architecture has a major impact on the following parameters:
• Power Consumption: the router is the communication point to which synapses and neurons are attached; this implies that the number of routers increases proportionally with the number of neurons. Hence, the power consumption for large scale SNN hardware implementations increases, as the major contributor to this power consumption is the interconnection fabric (i.e. routers, index tables, etc.). The neuron model usually has a power consumption approximately six orders of magnitude smaller than that of the interconnection fabric [2].
• Throughput: the router is also responsible for managing SNN spike events. However, the traffic pattern shown by spiking neurons is highly asynchronous and non-uniform [1]. Hence, an effective arbitration policy is needed. This policy should adapt dynamically to the traffic behaviour and should route and deliver as many SNN spike events as possible in a short period of time without affecting the traffic performance and without incurring any significant hardware overhead.
• Traffic Congestion: the typical firing period of a biological neuron is between 10 ms and 30 ms [1]. However, as the number of neurons increases, the growing number of SNN spike events presents routers with the increasingly difficult task of achieving real-time routing without SNN spike packet loss. Therefore, for large scale SNN hardware implementations, routing algorithms that include traffic congestion management features are required.
In this paper an adaptive NoC router architecture is proposed and its feasibility for large scale EMBRACE [3] SNN hardware implementation is presented. The rest of this paper is organised as follows: Section 2 presents the motivation for this research and summarises current work regarding NoC-based SNN hardware implementation. Section 3 discusses the proposed adaptive NoC router architecture incorporated within the EMBRACE architecture. Section 4 presents results and analysis of the proposed adaptive NoC router architecture in terms of area utilisation, power consumption and spike packet throughput. Finally, Section 5 provides a brief discussion regarding large scale SNN hardware implementations and a conclusion.
2 Motivation and Previous Works
Traditionally, software approaches are too slow to execute long simulations of SNNs and do not scale efficiently [4]. Thus, researchers have explored alternative hardware SNN solutions using FPGAs and GPUs that provide a fine-grained parallel architecture
and a 2D mesh interconnect topology [5], [6]. The authors of [4] highlight that FPGA approaches have several limitations, such as inefficient area utilisation and a Manhattan-style interconnect. In [7] it is indicated that the problem of accessing memory in a coherent manner, together with limited memory bandwidth, is a major drawback for SNNs on GPU platforms. Moreover, neither the FPGA nor the GPU architectures are power efficient and they have limited on-chip weight storage capabilities. Therefore, it is necessary to look to custom hardware to provide the area and power requirements which can support large scale hardware SNN realisation with dense neuron interconnection. Traditional approaches using a shared-bus topology offer a simple and inexpensive channel to interconnect several neurons. However, in [8] the authors compare the performance of the bus topology with different interconnection network topologies, and conclude that a bus topology does not scale to allow interconnection of a large number of SNN processing elements (PEs). Furthermore, a bus topology is not able to guarantee real-time execution, since the latency of the network increases proportionally to the number of PEs connected to the shared bus. In [9] the Network-on-Chip (NoC) interconnect paradigm is introduced as a promising solution to the on-chip communication problems experienced in Systems-on-Chip (SoC) computing architectures, where generally high throughput and high interconnect capability are required. In general, NoC architectures are composed of a set of shared PEs, network interfaces, routers and channels, which are arranged in a topology depending on the application. In the context of SNNs, these PEs are the neuron models attached to the NoC routers placed throughout the network. Channels are analogous to the synapses/axons of spiking neurons. The SNN topology in this case refers to the way spiking neurons are interconnected across the network. The concept of using NoCs in SNNs for large scale hardware implementations has been reported in [10], [11], [12] (summarised in Table 1). Although a few of these projects exhibit good throughput and some others provide a high Quality of Service (QoS), their major drawback is that they do not provide an adequate mechanism to deal with traffic congestion as the SNN size scales. This is important in achieving efficient large scale SNN hardware implementations.

Table 1. SNN hardware implementation examples using NoCs

Project reference          QoS                     Congestion mechanism  Power [mW]  Throughput [Gbps]
Spinnaker [12]             Best Effort             No                    64          14.4
Facets [11]                Best/Guaranteed Effort  No                    NA          6.1
Theocharides et al. [10]   Best Effort             No                    NA          0.1
3 Adaptive NoC Router
In previous work [3] the authors proposed EMBRACE, a custom field programmable neural network architecture that merges the programmability features of FPGAs and the scalable interconnectivity of the NoC router strategy to implement large scale spiking
Fig. 1. EMBRACE architecture overview [3]
neural networks with a custom low-area/low-power programmable synapse cell, which has characteristics similar to real biological synapses [13]. The proposed NoC strategy uses individual routers to group n synapses and the associated neuron into a novel structure referred to as a neural tile. The neural tile is viewed as a macro-block of EMBRACE and its novelty resides in merging analogue synapse/neuron circuitry with NoC digital interconnect to provide a scalable and reconfigurable neural building block. The EMBRACE NoC architecture is a mesh-based two-dimensional array of interconnected SNN neural tiles (each connected in the N, E, S, W directions) as illustrated in Fig. 1. Spike exchange between neural tiles is achieved by routing packet-based spike information through the NoC routers connected to the neural tile ports. Moreover, the EMBRACE architecture supports the programmability of SNN topologies in hardware, providing an architecture that enables accelerated prototyping and hardware-in-the-loop training of SNNs. In this regard, a parallel research project called EMBRACE-FPGA [14] has enabled the development of a 32-neuron, 32-synapses-per-neuron hardware SNN evolution platform, executing a range of applications and allowing refinement of the EMBRACE architecture selection. Although the previous NoC router design [3] exhibits good latency and area performance, i.e. the router can process incoming data packets every 10 clock cycles, with source packet generation requiring 12 cycles and using 234 LUTs on a Virtex 4 device, its non-adaptive architecture makes it difficult to overcome the congestion problem present in a large scale SNN implementation. Therefore, an adaptive router architecture has been proposed and investigated as a way to route packets of spike events efficiently throughout the network whilst balancing congestion in the network (i.e. increasing the throughput). The adaptability of the proposed router can be described in two dimensions:
• An adaptive arbitration policy (AAP) module, which combines the fairness policy of a round-robin arbiter and the priority scheme of a first-come first-served (FCFS) approach, enabling improved router throughput according to the traffic behaviour presented across the network.
• An adaptive routing scheme (ARS) module, which enables the selection of different router paths to avoid traffic congestion, based on the traffic pattern and a channel congestion detector (CCD).
Fig. 2. EMBRACE neural cell and adaptive NoC router structure and connection. (The figure also gives the general packet layout: a 4-bit header — 0000 configuration mode start, 0001 configuration mode end, 0010 unicast mode, 0011 runtime mode, with the remaining values kept for future extensions such as a multicast or debug mode — followed by a 7-bit source address and a 21-bit target address.)
Those modules and their interconnection are explained in more detail in the following sub-sections. Fig. 2 illustrates the proposed adaptive router and its interconnection with the EMBRACE neural cell.
3.1 Adaptive Arbitration Policy
A key property of an arbiter is its fairness, i.e. the way in which it provides equal service to different network traffic requests. Accordingly, several arbitration policies have been proposed [15]. However, almost all of them are based either on strong fairness (i.e. a round-robin arbiter) or on weak fairness (i.e. an FCFS arbiter). A round-robin arbiter exhibits strong fairness in servicing each port, because it allocates the lowest priority to the port that was serviced in the previous round and gives the highest priority to the next port. This approach is well suited to a heavy-load scenario where all router ports are busy, since it gives equal priority to all of them. However, this strong fairness also introduces two factors that affect the latency of the router: the router latency grows in proportion to the number of router ports, and it also depends on the position of the round-robin pointer when a spike arrives [16]. On the other hand, an FCFS arbiter gives the highest priority to the first event that occurs. Thus, an FCFS arbiter is well suited to traffic scenarios where only one port is busy (or at most a few ports are busy at the same time), since the router does not spend extra clock cycles servicing inactive or unused ports. However, contrary to the round-robin approach, this weak fairness is not feasible for heavy traffic load scenarios, since the probability of discarded packets increases when the arbiter's priority is given to the first port that requests its attention. Consequently, the authors propose an adaptive arbitration policy (AAP) which combines the strong fairness policy of the round-robin arbiter and the priority scheme of an FCFS approach. This hybrid approach improves router throughput according to the traffic behaviour presented across the network. The proposed AAP uses a spike event register to store information regarding any new spike event for each port input buffer. Five distributed control units (i.e. one for each port) allow the scheduler to manage thread communication without incurring
task-switching overhead. Therefore only the input buffers that contain information will be serviced, avoiding wasted clock cycles servicing those input buffers that do not contain packets. In the same way, when a heavy load traffic scenario occurs, all of the ports will be serviced based on a round-robin arbitration scheme. The pseudo code used for the proposed arbitration module is illustrated in Fig. 3.
Fig. 3. Pseudo code for the adaptive arbitration policy (AAP) module
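Since Fig. 3 itself is not reproduced here, the following C-style sketch captures the behaviour described above; the port count and the naming are assumptions of this sketch (the actual module is a hardware scheduler, not software).

    /* Behavioural sketch of the adaptive arbitration policy (AAP): a
       round-robin pointer advances as usual, but ports whose input buffer
       holds no pending spike event are skipped, so a lightly loaded router
       effectively gives first-come first-served service, while a fully
       loaded one falls back to plain round-robin arbitration. */
    #define NUM_PORTS 5   /* assumed: N, E, S, W and the local neural-tile port */

    int aap_next_port(int last_served, const int pending[NUM_PORTS])
    {
        for (int k = 1; k <= NUM_PORTS; k++) {
            int p = (last_served + k) % NUM_PORTS;
            if (pending[p]) return p;       /* skip empty buffers entirely */
        }
        return -1;                          /* no port currently has a packet */
    }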
3.2 Adaptive Routing Scheme
The adaptive routing scheme module of the proposed router is composed of three main elements. Firstly, a routing algorithm that is based on an XY routing approach [15] and receives a default output port direction from the AAP module. Secondly, the channel congestion detector (CCD), illustrated in Fig. 4, which uses information received from neighbouring routers to generate an alternative output port direction and passes this information to the adaptive routing decision (ARD) module, see Fig. 5. The ARD module takes the default output port direction given by the XY routing algorithm and, based on the information generated by the CCD, either grants the proposed default output port or selects the alternative output port direction generated according to the traffic information received from the CCD. The CCD and ARD components are described below.
The Channel Congestion Detector (CCD): Fig. 4 illustrates the CCD module, which provides a means of detecting the current state of SNN packet traffic in any given direction. For any given direction the CCD module can detect whether the forward N, E, S or W channel is free, busy or congested, as follows:
a. Free: the input FIFO is empty or less than half-full.
b. Busy: the input FIFO is half-full.
c. Congested: the input FIFO is full.
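A minimal sketch of the per-router contribution to these status lines is given below, assuming a two-flag (busy, congested) encoding; the thresholds follow the free/busy/congested definitions above, while the concrete on-chip signal encoding is not specified in the text.

    /* Per-FIFO contribution to the CCD busy/congested lines: 'busy' is raised
       when the FIFO is at least half-full, 'congested' when it is full. The
       daisy-chained AND/OR gates described next combine these contributions
       along a channel direction. */
    typedef struct { int busy; int congested; } chan_status_t;

    chan_status_t fifo_status(unsigned fifo_count, unsigned fifo_depth)
    {
        chan_status_t s = { 0, 0 };
        if (2 * fifo_count >= fifo_depth) s.busy = 1;      /* half-full or more */
        if (fifo_count == fifo_depth)     s.congested = 1; /* completely full   */
        return s;
    }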
The CCD module uses a combination of logical two-input AND gates and two-input OR gates. Whenever a router FIFO buffer is full, it asserts the FIFO full signal so that the full status can be detected by the CCD and propagates logic ‘1’ to
Fig. 4. Channel congestion detector (CCD). This figure only shows the situation when looking along the east channel for clarity purposes. However, the ‘look-ahead’ facility is replicated in all N, E, S and W directions.
each of its associated AND gates, as illustrated in Fig. 4. Similarly, if a FIFO is half-full it generates logic '1' to its associated OR gate. The AND and OR gates are connected in a daisy chain between each router. The routing decision output signal of each router appears as a two-bit value, one bit belonging to the congested line and the other to the busy line, as shown in Fig. 4. When both lines are '0' the channel is free. Logic '1' on the busy line indicates that a FIFO can take a packet but that it has limited remaining capacity. The output of this logic element is four two-bit values, i.e. two bits for each direction. The traffic information (i.e. the congested, busy or free status per channel) generated by the CCD is then forwarded as an input to the adaptive routing decision (ARD) module in order to select the output port direction. The Adaptive Routing Decision (ARD): this module selects the forwarding port direction for routing spike packet data. The ARD module considers the two bits generated by the XY routing algorithm as the priority/default direction, together with the input from the CCD. The default output port direction given by the XY routing algorithm is used to index the routing table shown in Fig. 5. The directions are '00', '01', '10' and '11', which correspond to N, E, W and S, respectively. The output of this table gives the two alternative adapted directions for the given input. For example, if the input direction is north ('00') then the output of the lookup table is either '01' or '10'; these two outputs represent the E and W directions, respectively. These values are both used as select
Fig. 5. Adaptive routing decision (ARD) module
lines in two individual multiplexers, which are used to compare these values with the busy lines from the congestion detector to check if either of these channels is busy. The proposed ARD occupies a small area, is scalable and is a low power element. It therefore adds very little overhead to the NoC router architecture.
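A behavioural sketch of this decision logic is given below. The direction encoding and the N → {E, W} alternative pair come from the text; the remaining table entries (filled by symmetry) and the exact grant policy are assumptions of the sketch.

    enum { DIR_N = 0, DIR_E = 1, DIR_W = 2, DIR_S = 3 };

    /* Alternative directions per default direction; only the N row is given
       explicitly in the text, the others are assumed by symmetry. */
    static const int alt_dir[4][2] = {
        { DIR_E, DIR_W },   /* default N */
        { DIR_N, DIR_S },   /* default E (assumed) */
        { DIR_N, DIR_S },   /* default W (assumed) */
        { DIR_E, DIR_W }    /* default S (assumed) */
    };

    int ard_select(int xy_dir, const int congested[4], const int busy[4])
    {
        if (!congested[xy_dir])                  /* default port still usable */
            return xy_dir;
        for (int k = 0; k < 2; k++) {            /* otherwise try the alternatives */
            int d = alt_dir[xy_dir][k];
            if (!busy[d] && !congested[d]) return d;
        }
        return xy_dir;                           /* all channels loaded: keep default */
    }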
4 Performance Analysis
This section presents results on the throughput capability of the proposed adaptive router for varied SNN traffic loads, and benchmarks its performance against existing approaches. The area and power requirements of the proposed additional circuitry are also highlighted.
4.1 Methodology
A VHDL implementation of the proposed adaptive router architecture has been created in order to evaluate its performance. The router is characterised by its packet throughput, area utilisation and power consumption parameters. A SystemC spike event counter/generator testbench facilitated measurement of packet throughput. Area and power metrics have been obtained using the Synopsys Design Compiler tool for TSMC 90nm CMOS technology. The measurement setup was inspired by [15] and verified in [16]. This setup proposes the attachment of terminal instruments such as counters and generators at each router port. The spike event generator includes a packet source module to generate spike data according to the spike packet layout illustrated in Fig. 2. The spike event generator also defines the traffic pattern, packet length and the spike injection rate (i.e. the time between spike events). The spike event counter measures the SNN output spike rate and deduces the spike throughput and the number of unsuccessfully routed (dropped) spike packets. The relationship between the depth of the input FIFOs and its impact on the maximum throughput of the router and on the total area and power consumption is also analysed; within the simulation framework, the depth of the input FIFOs varies between 1 and 5. The inter-router packet data width is 32 bits, the width of a spike packet. The router operating frequency is 100 MHz and the spike counter sample window time used is 1 ms. A pre-count stage is applied before each counter window to allow the router to reach steady-state operation.
4.2 Performance Results
Several experiments have been carried out to assess the packet throughput of the proposed adaptive NoC router. These experiments have examined the impact of the spike injection rate (SIR) variation on the average adaptive router packet throughput. Router performance has been compared with that of a non-adaptive router (i.e. a round-robin equivalent) using an input FIFO depth of 5. Fig. 6 illustrates the packet throughput advantages of using the proposed adaptive router strategy. It demonstrates equal throughput performance for both the adaptive and non-adaptive routers when an SIR of 20 is applied (i.e. a spike packet is generated every 20 clock cycles). However, when an SIR of 2 is applied, the adaptive router achieves almost double the throughput of the round-robin-based router. This is a typical traffic scenario for spiking neurons in burst mode. The advantages of the proposed adaptive router approach are as follows:
Fig. 6. Relationship between the spike injection rate and the throughput per router, under different traffic loads
• When not all router ports are used, the adaptive router skips over idle ports. Several clock cycles can be saved compared to the round-robin approach and the overall throughput can be increased. Figures 6a and 6b show the results obtained when one and three ports are used.
• When all router ports are busy, a packet throughput advantage of the proposed adaptive router occurs when the SIR value is less than or equal to the number of ports minus one (i.e. SIR = 4), because the non-adaptive router reaches its saturation level, i.e. it is not able to deliver the packets as fast as they are being generated [15]. It is then impossible for the round-robin arbiter to service all ports adequately, since the SIR is smaller than the time available to the router to process the incoming packets. As a result, the unattended ports drop packets and the throughput saturates.
Table 2. Synthesis summary for the proposed router, obtained from the Synopsys Design Compiler tool based on the TSMC 90nm CMOS technology library

Input FIFO depth   Dynamic power [mW]   Leakage power [mW]   Total power [mW]   Area [mm^2]   Avg. throughput [Gbps]
1                  0.82                 0.13                 0.95               0.039         13.44
2                  1.04                 0.14                 1.19               0.041         14.08
3                  1.24                 0.15                 1.39               0.045         14.72
4                  1.44                 0.17                 1.61               0.048         15.36
5                  1.67                 0.19                 1.86               0.054         16.00
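One way to read the peak figure in the last row (this interpretation is ours, not stated in the paper): with 32-bit spike packets and the 100 MHz router clock, a single port can carry at most 3.2 Gbps, so 16 Gbps corresponds to all five ports each delivering one packet per clock cycle.

    /* Assumed reading of the 16 Gbps peak: five ports, one 32-bit packet per
       port per 100 MHz clock cycle. */
    #include <stdio.h>

    int main(void)
    {
        double per_port_gbps = 32.0 * 100e6 / 1e9;                  /* 3.2 Gbps */
        printf("aggregate peak = %.1f Gbps\n", 5 * per_port_gbps);  /* 16.0 Gbps */
        return 0;
    }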
4.3 Evaluation
Table 2 summarises the power consumption and area utilisation for the proposed adaptive NoC router. These parameters have been obtained using the Synopsys Design Compiler tool for the TSMC 90nm CMOS technology. A router clock frequency of 100 MHz has been used and the dynamic power consumption metrics have been obtained based on a fully loaded traffic scenario, where all neurons spike at the same time. Table 2 also shows the trade-off between the depth of the input FIFO and the maximum throughput per router. In addition, Tables 3 and 4 compare the performance of the proposed router with other existing approaches [10], [11], [12]. Table 3 highlights the routing algorithms used for the NoC routers. Table 4 highlights a high throughput of 16 Gbps for the proposed adaptive NoC router whilst exhibiting a low power overhead of 1.86 mW. The adaptive NoC router achieves a higher throughput performance than existing approaches. The authors are aware that the proposed router does not contain any index table to implement a multicasting scheme as in [12], which would increase the presented area and power metrics for the proposed router. However the throughput would remain the same since the arbitration policy would be independent of a future multicasting approach.

Table 3. Comparison of the proposed router against other existing approaches

Project reference          Neuron model  NoC topology          Routing algorithm
This work                  LI&F          2D Mesh               Adaptive XY routing
Spinnaker [12]             Izhikevich    2D Triangular Torus   Node table routing
Facets [11]                LI&F          2D Torus Mesh         iSLIP
Theocharides et al. [10]   I&F           2D Mesh               XY routing

Table 4. Comparison of router performance against other existing approaches

Project reference          QoS                     Congestion mechanism  Throughput [Gbps]  Power [mW]
This work                  Best Effort             Yes                   16.0               1.86
Spinnaker [12]             Best Effort             No                    14.4               64
Facets [11]                Best/Guaranteed Effort  No                    6.1                NA
Theocharides et al. [10]   Best Effort             No                    0.1                NA
5 Summary and Discussion
The work presented here is part of a long-term vision to create EMBRACE, a mixed-signal hardware platform to advance large scale SNN implementations. The research approaches previously discussed, i.e. [3], [10], [11] and [12], have shown promising results, establishing the motivation to continue using the NoC paradigm as a way to overcome the interconnection problems in hardware SNNs. Nevertheless, different aspects of NoC architectures need to be explored in order to take full advantage of all their capabilities as an interconnect fabric for SNN platforms. In this regard, the authors have proposed a novel adaptive NoC router architecture to alleviate the communication constraints currently experienced in the efficient realisation of SNNs in hardware. The paper demonstrates the advantages of using an adaptive NoC router architecture to improve throughput, area and power consumption. The proposed adaptive NoC router contributes to the plausibility of developing a scalable NoC-based EMBRACE SNN hardware implementation. Although having an efficient, high-throughput adaptive router is important, it is also vital that a balance between increased throughput and minimal area utilisation and power consumption is achieved. The proposed adaptive NoC router is a step forward in this direction.
Acknowledgments. Snaider Carrillo Lindado is supported by a Vice-Chancellor Research Scholarship (VCRS) from the University of Ulster. The authors also thank Henry Carrillo, who helped to set up some of the experiments using the Synopsys Design Compiler tool.
References
1. Gerstner, W.: Spiking neuron models: Single neurons, populations, plasticity. Cambridge Univ. Pr., Cambridge (2002)
2. Livi, P., Indiveri, G.: A current-mode conductance-based silicon neuron for address-event neuromorphic systems. In: 2009 IEEE International Symposium on Circuits and Systems, pp. 2898–2901. IEEE, Los Alamitos (2009)
3. Harkin, J., Morgan, F., McDaid, L., Hall, S., McGinley, B., Cawley, S.: A Reconfigurable and Biologically Inspired Paradigm for Computation Using Network-On-Chip and Spiking Neural Networks. International Journal of Reconfigurable Computing 2009, 1–13 (2009)
4. Maguire, L.P., McGinnity, T.M., Glackin, B., Ghani, A., Belatreche, A., Harkin, J.: Challenges for large-scale implementations of spiking neural networks on FPGAs. Neurocomputing 71, 13–29 (2007)
5. Shayani, H., Bentley, P., Tyrrell, A.: A Cellular Structure for Online Routing of Digital Spiking Neuron Axons and Dendrites on FPGAs. In: Hornby, G.S., Sekanina, L., Haddow, P.C. (eds.) ICES 2008. LNCS, vol. 5216, pp. 273–284. Springer, Heidelberg (2008)
6. Thomas, D., Luk, W.: FPGA Accelerated Simulation of Biologically Plausible Spiking Neural Networks. In: 2009 17th IEEE Symposium on Field Programmable Custom Computing Machines, pp. 45–52. IEEE, Los Alamitos (2009)
7. Nageswaran, J.M., Dutt, N., Krichmar, J.L., Nicolau, A., Veidenbaum, A.: Efficient simulation of large-scale Spiking Neural Networks using CUDA graphics processors. In: 2009 International Joint Conference on Neural Networks, pp. 2145–2152. IEEE, Los Alamitos (2009)
8. Roche, B., Mc Ginnity, T., Maguire, L., Mc Daid, L.: Signalling techniques and their effect on neural network implementation sizes. Information Sciences 132, 67–82 (2001)
9. Benini, L., De Micheli, G.: Networks on chips: a new SoC paradigm. Computer 35, 70–78 (2002)
10. Theocharides, T., Link, G., Vijaykrishnan, N., Irwin, M., Srikantam, V.: A generic reconfigurable neural network architecture implemented as a network on chip. In: Proceedings of IEEE International SOC Conference 2004, pp. 191–194. IEEE, Los Alamitos (2004)
11. Philipp, S., Schemmel, J., Meier, K.: A QoS network architecture to interconnect large-scale VLSI neural networks. In: 2009 International Joint Conference on Neural Networks, pp. 2525–2532. IEEE, Los Alamitos (2009)
12. Plana, L.A., Furber, S.B., Temple, S., Khan, M., Shi, Y., Wu, J., Yang, S.: A GALS Infrastructure for a Massively Parallel Multiprocessor. IEEE Design & Test of Computers 24, 454–463 (2007)
13. McDaid, L., Hall, S., Kelly, P.: A programmable facilitating synapse device. In: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), pp. 1615–1620. IEEE, Los Alamitos (2008)
14. Morgan, F., Cawley, S., Mc Ginley, B., Pande, S., Mc Daid, L., Glackin, B., Maher, J., Harkin, J.: Exploring the evolution of NoC-based Spiking Neural Networks on FPGAs. In: 2009 International Conference on Field-Programmable Technology, pp. 300–303 (2009)
15. Dally, W.J., Towles, B.: Principles and practices of interconnection networks. Morgan Kaufmann, San Francisco (2004)
16. Pande, S., Carrillo, S., Morgan, F., Cawley, S., Harkin, J., Mc Ginley, B., McDaid, L.: EMBRACE-SysC for Analysis of NoC-based Spiking Neural Network Architectures. Technical Report, Bio-Inspired Electronics and Reconfigurable Computing Research Group (BIRC), National University of Ireland, NUI Galway, Galway, Ireland (2010)
Performance Evaluation and Scaling of a Multiprocessor Architecture Emulating Complex SNN Algorithms

Giovanny Sánchez, Jordi Madrenas, and Juan Manuel Moreno

Department of Electronic Engineering, Technical University of Catalunya, Jordi Girona 1-3, Campus Nord UPC, edif. C4, 08034 Barcelona, Catalunya, Spain
{gsanchez,madrenas,moreno}@eel.upc.edu
Abstract. The performance analysis of an efficient multiprocessor architecture that allows accelerating the emulation of large-scale Spiking Neural Networks (SNNs) is reported. After describing the architecture and the complex SNN algorithm mapping, the performance study demonstrates that the system can emulate up to 10,000 300-synapse neurons in real time at 64 MHz with conventional FPGAs. Important improvements can be achieved by using advanced technology and increased clock rate or by means of simple architecture modifications. The architecture is flexible enough to be efficiently applied to any SNN model in general. Keywords: Spiking Neural Networks, SIMD, FPGA, Hardware implementation.
1 Introduction
Bio-inspired systems employ different artificial neural network models to solve complex problems. In particular, Spiking Neural Network (SNN) models have recently attracted great interest. SNNs fall into the third generation of neural network models and they have emerged as a plausible paradigm for characterizing neuron dynamics in the cerebral cortex [1], due to the high level of realism in the neuronal simulation. In the literature, several studies have reported that SNNs require intensive computational effort [1, 11]. The interest in fast simulation of SNNs is two-fold: first, to provide a tool for neuroscientists to evaluate their large-scale models by studying their different dynamics, and second, to allow the implementation of low-power and real-time applications, for instance in robotics or human-computer interaction. Because of the computation-intensive requirements of SNN simulation, it is necessary to select the appropriate electronic device to implement large-scale network models that operate fast enough, usually in real time or close to it. The parallelism that SNN models naturally exhibit makes them suitable for efficient physical hardware implementation [2-5]. Analog implementations are a serious option, since they are very compact, technology-opportunistic and consume very low power, but they suffer from network scalability difficulties for massively parallel implementations [6]. In the fully digital domain, alternatively to general-purpose processors, Graphics Processing Units (GPUs) appear to be a promising solution to implement large-scale SNN models, as
well as parallel processing tasks in general; however, these architectures are inefficient in synapse processing due to bottlenecks in their memory system [7]. From the alternative view of specific or programmable hardware, Field-Programmable Gate Arrays (FPGAs) are flexible and cost-effective programmable devices that allow general-purpose digital systems to be implemented. The ability to reconfigure FPGA blocks and interconnects has attracted research exploring the mapping and implementation of SNNs [6-7]. Still, relatively little work has been carried out on the implementation of SNNs on digital platforms such as FPGAs [2-5, 8, 9]. The open question, however, remains what type of architecture is best suited to efficiently build such neural networks in hardware devices, in order to fulfil the real-time processing requirements with minimum size and power. A multiprocessor architecture specifically designed for the efficient emulation of complex SNNs [11] was recently proposed in the frame of the European Perplexus research project [10]. The architecture was implemented both in FPGA chip prototypes and in a semicustom Application-Specific Integrated Circuit (ASIC) chip. In this paper, we report the performance and scalability results of the Perplexus multiprocessor architecture for a specific bio-inspired SNN model [11]. This model was proposed as a demonstration of complex system simulation within the Perplexus project. Nevertheless, the programmable architecture is not tied to this particular algorithm, but is flexible enough to support the emulation of virtually any spiking model. The paper is organized as follows. In Section 2, the target SNN model and the proposed multiprocessor architecture are reviewed. In Section 3, the SNN model mapping and programming are described. The performance study is presented in Section 4. From the previous results, some architecture improvements are proposed in Section 5, before concluding in Section 6.
2 SNN Algorithm and Multiprocessor Architecture
The implemented SNN algorithm [11] models a spiking neuron from the calculation of its membrane potential V(t), which mainly depends on its own time evolution, on the incoming excitatory and inhibitory spikes received from synapses and on the background noise, as shown in equation (1).
    V_i(t+1) = V_rest^[q] + B_i(t) + (1 − S_i(t)) · ((V_i(t) − V_rest^[q]) · k_mem^[q]) + Σ_j ω_ji(t)        (1)
where V_i is the membrane potential of neuron i, B_i the background noise that models spikes coming from the remaining neurons beyond those considered in the model, S_i the neuron output spike, k_mem an exponential decay factor and ω_ji each incoming weighted synaptic spike from neuron j. In the emulation of the SNN [11], long-term depression (LTD) changes were incorporated by means of spike-timing dependent plasticity (STDP). STDP makes postsynaptic neurons sensitive to the timing of incoming action potentials, which leads to competition among the pre-synaptic neurons. Latencies become shorter, spikes synchronize and information propagation through the network becomes faster. The approach deployed in the neuron model is based on STDP and proposes that pre- before postsynaptic spikes increase the efficiency of synapses
(Hebbian rule). Communication between neurons is by means of low-rate binary spikes, with an average rate between 100 and 200 spikes/s. The complete algorithm description is reported in [11]. To emulate both neurons and synapses, from the operator point of view, the algorithm requires additions, comparisons, logical operations and exponential decays. From the control point of view, the algorithm is common to all neurons and all synapses, but it executes conditionally depending on the neuron and synapse parameter values at a given moment. The communication between neurons is only by means of spikes, making all the other processing local to each Processing Element (PE) that emulates a neuron and its incoming synapses. As stated before, the purpose of emulating the SNN algorithm on hardware is to be able to implement complex neural networks in real time. One of the purposes is to provide researchers in neuroscience with a simulation tool whose results can be compared with biological networks. From this comparison, improved models closer to the biological neural networks can be developed, mapped and executed again on the architecture to provide results each time closer to the biological behaviour. Thus, it is fundamental that the emulation system is flexible enough to permit easy algorithm modification. Taking into account that the algorithm is the same for all neurons and all synapses and that data is processed locally by individual PEs, it appears that Single-Instruction Multiple-Data (SIMD) multiprocessor machines are best suited to efficiently emulate massive SNNs [11]. In Fig. 1, the proposed SIMD multiprocessor architecture is shown. The initial prototype of the target multiprocessor contains a 10×10 array of SIMD PEs (processors based on a 16-register bank and a very simple 16-bit ALU), a sequencer that reads instructions from an external SRAM, broadcasts them to the PE array and reads/writes array data from/to the same SRAM. An AER (Address Event Representation) bus controller reads the spikes calculated in the PE array and broadcasts them to all chips connected to the AER bus.
Fig. 1. The SNN emulation SIMD multiprocessor proposed architecture
External to the SNN multiprocessor there are a simple 9-line AER bus and an AER decoder embedding a Content-Addressable Memory (CAM) that detects the incoming spikes from any PE in the system and sends the hits back to the sequencer. Also, an external SRAM stores the program and data, mainly neuron and synapse parameters. Finally, an XScale microprocessor (Colibri board) configures the chip, loads the data and retrieves the processed information. The target of the proposed architecture is to provide support for networks of up to 10,000 neurons and 300 synapses per neuron, thus making a total of 3,000,000 interconnects. As stated before, in the current design a single standard-cell chip embeds 100 PEs, each one emulating a neuron and all its associated synapses, up to 300 per neuron. Network expansion is achieved by direct chip connection to the AER bus. For this 0.18 micron CMOS design, 100 chips would be necessary to implement the full 10,000-neuron network, which is feasible but complex; however, a chip embedding 1,000 PEs or more can easily be developed, either by removing some additional features of the current chip that are not reported here, or by using a more advanced technology (e.g. 45 nm or below). Fig. 2 shows the two operation phases required for the SNN emulation.
• Phase 1. The SIMD PE array executes a neuron and associated incoming-synapse emulation step. Once done, the sequencer stops.
• Phase 2. The CAM/AER controller broadcasts the spikes that were generated during that emulation step. Those spikes are read by the AER decoders connected to the bus and decoded by the internal CAM. The hits are stored into the corresponding SRAM positions that contain the presynaptic spike information. After all spikes have been processed, the sequencer resumes Phase 1 operation for the next emulation step.
Fig. 2. Handshake sequencer signals between sequencer and CAM/AER controller for operation mode switching.
3 SNN Model Mapping and Implementation
In this section the SNN mapping and the algorithm implementation in multiprocessor assembly code are summarized. More details can be found in [11-13]. Since all neurons and synapses execute the same algorithm, neuron and synapse parallelism is applied. All PEs operate in parallel, and each of them serially executes the neuron and incoming-synapse algorithms. The algorithm implementation has been done in a structured style, using procedures so as to simplify maintenance and model updating. The machine's native low-level programming allows the algorithm execution time to be optimized.
Fig. 3. Execution loop for SNN emulation. Phase 1 main operations are detailed
As shown in Fig. 3, after a short parameter initialization, the SNN algorithm is cyclically emulated by means of an infinite loop that executes Phase 1 to emulate the neural network and stops for the AER controller/decoder to execute the spike broadcasting of Phase 2. In this phase, no instructions are executed, but the AER controller and AER decoder control units perform the required operations by means of finite state machines. When Phase 2 is done, the sequencer resumes Phase 1 execution, in an infinite loop. In Phase 1, when the neuron and synapse algorithms are executed, the neuron parameters are first loaded and the membrane value is calculated. Then, the input synapses of the associated neuron are calculated. Finally, the neuron is updated taking into account input synapses, background noise and refractory period, determining whether it spikes or not.

    .MAIN
        GOTO 00NEURONLOAD
        GOTO 01MEMBRANEVALUE
        LOOP synapses
            GOTO 00SYNAPSELOAD
            GOTO 02SYNAPTICWEIGHT
            GOTO 03REALVALUEDVARIABLE
            GOTO 04ACTIVATIONVARIABLE
            GOTO 05MEMORYOFLASTPRESYNAPTICSPIKE
            GOTO 99SYNAPSESAVE
        ENDL
        GOTO 06MEMORYOFLASTPOSTSYNAPTICSPIKE
        GOTO 07SPIKEUPDATE
        GOTO 08BACKGROUNDACTIVITY
        GOTO 09REFRACTORYP
        GOTO 99NEURONSAVE
        GOTO SPIKESENABLE
        STOP            ; AER/CAM UPDATE OF SPIKES
        GOTO MAIN

(In the figure, the blocks are annotated with their cycle costs in terms of the constants K1..K13, expressed as functions of N, the number of PEs, and S, the number of synapses: K1 + K2×N, K3, K4 + K5×S + K6×N×S, K7, K8 + K9×N, K10 and K11 + K12×N + K13×N².)

Fig. 4. Main program of the SNN emulation assembly code
In Fig. 4 the main loop of the assembler program for the SNN emulation is shown. From a total of 14 subroutines, there are 8 single subroutine calls and a synapse loop which calls 6 additional subroutines inside the loop. These 6 subroutines are executed a number of times equal to the defined number of synapses SYN#. The constants K1..K13 allow the calculation of the number of clock cycles required for the execution of each procedure, as a function of the number of neurons and the number of synapses.
4 Performance Figures
In Table 1, the encoding of each subroutine in Phase 1 is indicated. The initial conditions (IC) (the initialization of Fig. 3) are also considered, although they have no significant impact on the calculation. Cycles per Synapse (CS) represents the synapse calculation loop. The encoding of the subroutines contained in the synapse loop is shown in Table 2. From the number of cycles of these subroutines, the Cycles per Synapse (CS) expression of Table 1 is obtained.
Table 1. Main loop subroutine encoding and number of execution clock cycles
Table 2. Synapse loop routine encoding
Adding all the contributions of Table 1, the number of clock cycles NT required for the initialization and Phase 1 execution in one simulation cycle is obtained in eq. (2):

    NT = 1909 + 10 × N + 1392 × S + 4 × N × S        (2)
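As a quick check, the short program below evaluates eq. (2) for the 100-PE, 300-synapse configuration discussed below and reproduces the 10.81 ms Phase 1 figure of Table 3; the only assumption is the 50 MHz prototype clock, which is stated in the text.

    #include <stdio.h>

    int main(void)
    {
        long N = 100, S = 300;                     /* PEs and synapses per neuron */
        long NT = 1909 + 10*N + 1392*S + 4*N*S;    /* eq. (2): 540509 cycles */
        double phase1_ms = NT / 50.0e6 * 1.0e3;    /* 50 MHz prototype clock */
        printf("NT = %ld cycles, Phase 1 = %.2f ms\n", NT, phase1_ms);
        return 0;
    }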
In Table 3, eq. (2) is used to calculate execution times for different SNN emulation array sizes. The execution time depends on the system clock. Here, the conservative
Table 3. Execution time of one simulation cycle for different SNN sizes (fCK = 50 MHz, fAER = 5 MHz)

Array    #PE N  #Syn S  #Chip C | K1    K2·N  K3·S    K4·N·S  Total  | Phase 1 (ms) | Spiking phase (AER cycles) | Phase 2 (ms) | TOTAL (ms) | Spike rate (s⁻¹)
2×2      4      2       1       | 1909  40    2784    32      4765   | 0.095        | 8                          | 0.0016       | 0.10       | 10320
2×2      4      3       1       | 1909  40    4176    48      6173   | 0.123        | 8                          | 0.0016       | 0.13       | 7996
6×6      36     8       1       | 1909  360   11136   1152    14557  | 0.291        | 40                         | 0.008        | 0.30       | 3343
6×6      36     12      1       | 1909  360   16704   1728    20701  | 0.414        | 40                         | 0.008        | 0.42       | 2370
10×10    100    300     1       | 1909  1000  417600  120000  540509 | 10.810       | 104                        | 0.0208       | 10.83      | 92
100×100  100    300     100     | 1909  1000  417600  120000  540509 | 10.810       | 10202                      | 2.0404       | 12.85      | 78
prototype 50 MHz clock (20 ns period) is assumed; the execution times would be proportionally reduced as the clock period decreases. The Phase 2 (spike propagation) delay is also considered. As shown in the table, even with the slow prototype clock the system performance is very close to real-time emulation, which can be considered achieved when a rate of 100 spike/s is reached. For the proposed case of a 300-synapse, 10000-neuron network using 100 chips, a 78 spike/s emulation rate is obtained, which is very close to the proposed target. Furthermore, the spiking phase is calculated on a worst-case basis, because not all neurons will spike at every simulation step.

In the following figures, the number of clock cycles required for one-step emulation of the SNN algorithm is analyzed. The purpose is to show the delay contribution in clock cycles of every subroutine as a function of the number of neurons and synapses being emulated. The figures have been obtained from simulations and have been verified for consistency with eq. (2). To assess scalability, multiprocessor arrays that emulate 2x2, 4x4, 6x6 and 10x10 neuron networks are considered in the analysis. A 6x6 array has been mapped onto the FPGA used to prototype the final chip, and 10x10 is the array implemented in the standard-cell ASIC.

In Fig. 5, the required number of cycles per emulation step for the four configurations is shown. In this case, a single synapse is considered, to show scaling with the number of neurons only. Consistent with eq. (2), the total execution time increases linearly with the number of neurons. The figure displays both the total number of cycles (in the last column) and its distribution among the main loop subroutines. As can be observed, the delay mostly depends on the synapse calculation cycle, even for a single synapse.

In Fig. 6, the number of cycles for the synapse loop (CS in Fig. 5) is shown distributed among the internal subroutines, again for the neural network arrays indicated above. It can be observed that the only subroutines whose cycle count increases with N (the number of neurons) are SL (synapse load) and SS (synapse save), i.e., those that access the SRAM. Since they are inside the synapse loop, their cost also increases linearly with the number of synapses, so they provide a major contribution to the total delay. The serial emulation of synapses within each neuron implies that each synapse requires one pass through the synapse emulation loop. This is why CS (Cycles per Synapse) is dominant even for a small number of synapses and the remaining subroutines become irrelevant; for the 100-neuron, 300-synapse array, 99.5% of the cycles are dedicated to the synapse loop.
Fig. 5. Required number of cycles for the execution of Iglesias-Villa implementation for 4, 16, 36 and 100 neurons with 1 synapse per neuron
Fig. 6. Required number of cycles for the execution of the synapse loop for 4, 16, 36 and 100 neurons with 1 synapse per neuron
5 Proposed Architecture Improvements

As indicated before, the target of real-time emulation is almost achieved with the current implementation; slightly increasing the current operating frequency (50 MHz) would be enough to fulfil it completely. On the other hand, the previous performance analysis suggests some modifications that would significantly boost the processing power of SNN algorithms based on the proposed architecture. This section is devoted to proposing architecture changes and to evaluating their impact. The current architecture bottlenecks clearly arise when a large SNN is considered, as in the case of 100 neurons and 300 synapses per neuron. Analyzing the synapse cycle operations, the main time-consuming tasks are the memory access (LOAD and STOREC) instructions and the exponential decay algorithm. The latter is based on a software multiplication, since the original PE does not include a hardware multiplier.
Table 4. Clock cycles devoted to LOAD, STOREC, multiplication and all other instructions

Instruction and subroutine   #     # cycles/instruction   Total number of cycles
LOAD NEURONS                 6     100                    600
LOAD SYNAPSES                600   100                    60000
STNC NEURONS                 4     100                    400
STNC SYNAPSES                600   100                    60000
MULT. NEURONS                3     432                    1296
MULT. SYNAPSES               600   432                    259200
REMAINING INSTRUCTIONS                                    159013
TOTAL                                                     540509
In Table 4, the number of clock cycles needed to execute the LOAD, STOREC (conditional store family) and multiplication instructions is shown. The Remaining instructions row accounts for all the other instructions that complete the algorithm.
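As a consistency check, the per-class totals of Table 4 can be summed directly (Python; the percentages below follow from the table and match the proportions discussed for Fig. 7):

load_store = 600 + 60000 + 400 + 60000    # LOAD and STNC, neurons + synapses
multiply = 1296 + 259200                  # software multiplications
remaining = 159013
print(load_store + multiply + remaining)  # 540509, matching eq. (2) for N=100, S=300
# load_store is roughly 22%, multiply roughly 48% and remaining roughly 29% of the total.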
Fig. 7. Clock cycle number distribution of instructions as classified in Table 4
In the Fig. 7 pie chart, the relative execution time of these instruction classes is displayed. Approximately one fourth of the processing time is devoted to LOAD and STOREC instructions (STNC is a particular case of STOREC), i.e., to SRAM access, one half to multiplication instructions, and the remaining one fourth to all other instructions. In order to speed up processing, the following improvements can be considered:

• Hardware multiplier: full parallel multiplier, radix-4 multiplier or parallel-serial multiplier.
• Parallelization of the LOAD and STNC instructions.
In order to parallelize the LOAD and STNC instructions, it would be feasible to implement a memory controller (a finite state machine) operating as a sequencer slave. This memory controller would be responsible for sending the data values to each PE before the PE array executes the LOAD or STOREC parallel instructions. This is possible because there are hundreds of processing cycles between successive LOAD or STOREC instructions, so both tasks could be performed concurrently: the memory controller would access data in the idle cycles of the SRAM.
Regarding hardware multipliers, three types have been considered: a full parallel multiplier, a radix-4 multiplier, and a parallel-serial multiplier. Assuming the current 16-bit precision of the PE registers, 16-bit multipliers would require 1, 8, and 16 clock cycles, respectively. The PE area overhead of a hardware multiplier would be almost insignificant for the radix-4 and parallel-serial multipliers; the full-parallel implementation would require further analysis. In Table 5, the calculated clock cycle number for several architecture modifications is shown.

Table 5. Clock cycle number calculation for modified architectures
The first column indicates the best case, with fully parallel LOAD and STNC instructions and a fully parallel hardware multiplier. The second column replaces the full-parallel multiplier with a radix-4 multiplier, and the third column with a parallel-serial multiplier. The fourth column keeps the parallel-serial multiplier but assumes that the LOAD and STNC instructions are row-serial and column-parallel (using a row cache). Finally, the last column shows the figures for the current implementation.

Table 6. Calculation of the clock cycle number in each subroutine for each proposed architecture change
In Table 6, the clock cycle count of each subroutine is calculated taking into account the proposed improvements. Table 7 indicates the estimated performance improvement ratio for the different architecture modifications.

Table 7. Performance improvement ratio for the proposed architecture changes (100 neurons and 300 synapses)
Considering parallel LOAD and STNC instructions, for any of the multipliers the number of cycles follows the form of eq. (3):

NT = K1 + K2·S    (3)

Notice that NT no longer grows with the number of neurons. Of course, the expression is limited by the assumption that all the synapse parameters can be cached during the synapse loop time. In the case of column-parallel, row-serial LOAD and STNC instructions (row cache) and a parallel-serial multiplier, NT takes the form of eq. (4), where the growth depends on the square root of the number of neurons:

NT = 661 + 10·N + 560·S + 4·S·√N    (4)
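As a rough check against the improvement quoted in the conclusion, eq. (2) and eq. (4) can be compared for the largest configuration. A small sketch (Python; it assumes the reconstructed form of eq. (4) given above):

import math

def nt_current(n, s):                     # eq. (2)
    return 1909 + 10 * n + 1392 * s + 4 * n * s

def nt_row_cache_serial_mult(n, s):       # eq. (4), reconstructed form
    return 661 + 10 * n + 560 * s + 4 * s * math.sqrt(n)

n, s = 100, 300
ratio = nt_current(n, s) / nt_row_cache_serial_mult(n, s)
print(nt_current(n, s), nt_row_cache_serial_mult(n, s), round(ratio, 1))
# 540509 vs about 181661 cycles, i.e. roughly the factor-3 improvement
# mentioned in the conclusion.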
6 Conclusion

A detailed performance analysis has been carried out for the proposed multi-model SNN emulation multiprocessor architecture. The results show that the 100 spike/s real-time emulation objective is reached at a 64 MHz clock. In fact, given the simple architecture of the PE in multiprocessor mode, a much higher operating frequency could easily be achieved with the same VLSI technology, and even higher with currently available CMOS technologies. Furthermore, much larger PE arrays could easily be implemented as well; the performance improvement could be close to two orders of magnitude. Beyond the performance increase obtained by boosting the clock frequency, including hardware multipliers in the architecture and reducing the RAM access bottleneck by means of cache blocks can improve the performance by a factor of 3. In the present work, the performance of the proposed architecture has been analyzed for the specific SNN model of the Perplexus project, but the architecture is flexible enough to efficiently emulate other SNN models as well, such as the one proposed by Izhikevich [14].
Acknowledgement. This work has been partially funded by the Spanish Ministry of Science and Innovation (Project TEC2008-06028/TEC) and by the European Union (PERPLEXUS project, Contract no. 34632). Giovanny Sánchez holds a research fellowship supported by the Catalan Department of Innovation, Universities and Companies, and the European Social Fund.
References

[1] W. Maass, "Computation with spiking neurons," in The Handbook of Brain Theory and Neural Networks, M.A. Arbib, ed., 2nd edition, pp. 1080-1083, MIT Press, Cambridge, 2003.
[2] C. Teuscher, "FPGA Implementations of Neural Networks," IEEE Transactions on Neural Networks, 18(5):1550, 2007.
[3] S. Bellis et al., "FPGA implementation of spiking neural networks - an initial step towards building tangible collaborative autonomous agents," in Proc. 2004 IEEE International Conference on Field-Programmable Technology, pp. 449-452, 2004.
[4] J.M. Moreno, Y. Thoma, and E. Sanchez, "POEtic: A Hardware Prototyping Platform With Bioinspired Capabilities," in Proc. International Conference on Mixed Design of Integrated Circuits and Systems (MIXDES 2006), pp. 363-368, 2006.
[5] J. Harkin, F. Morgan, S. Hall, P. Dudek, T. Dowrick, and L. McDaid, "Reconfigurable platforms and the challenges for large-scale implementations of spiking neural networks," in Proc. International Conference on Field Programmable Logic and Applications (FPL 2008), pp. 483-486, 2008.
[6] X.-C. Li and J.-F. Mao, "An area-efficient very large scale integration architecture for modified Euclidean algorithm with dynamic storage technique," International Journal of Electronics, 96:837-842, 2009.
[7] N.K. Govindaraju, J. Gray, R. Kumar, and D. Manocha, "GPUTeraSort: High Performance Graphics Co-processor Sorting for Large Database Management," in Proc. SIGMOD, 2006.
[8] F. Morgan, S. Cawley, B. McGinley, S. Pande, L.J. McDaid, B. Glackin, J. Maher, and J. Harkin, "Exploring the evolution of NoC-based Spiking Neural Networks on FPGAs," in Proc. International Conference on Field-Programmable Technology (FPT 2009), pp. 300-303, 2009.
[9] L.A. Plana, S.B. Furber, S. Temple, M. Khan, Y. Shi, J. Wu, and S. Yang, "A GALS Infrastructure for a Massively Parallel Multiprocessor," IEEE Design & Test of Computers, 24(5):454-463, Sept.-Oct. 2007.
[10] E. Sanchez, A. Perez-Uribe, A. Upegui, Y. Thoma, J.M. Moreno, A. Napieralski, A. Villa, G. Sassatelli, H. Volken, and E. Lavarec, "PERPLEXUS: Pervasive Computing Framework for Modeling Complex Virtually-Unbounded Systems," in Proc. Second NASA/ESA Conference on Adaptive Hardware and Systems (AHS 2007), pp. 587-591, 2007.
[11] J. Iglesias, "Dynamics of pruning in simulated large-scale spiking neural networks," Switzerland, 2005.
[12] J. Madrenas and J.M. Moreno, "Strategies in SIMD Computing for Complex Neural Bioinspired Applications," in Proc. 2009 NASA/ESA Conference on Adaptive Hardware and Systems (AHS 2009), pp. 376-381, San Francisco, CA, USA, 2009.
[13] J.M. Moreno and J. Madrenas, "A Reconfigurable Architecture for Emulating Large-Scale Bio-inspired Systems," in Proc. IEEE Congress on Evolutionary Computation (CEC 2009), pp. 126-133, Trondheim, Norway, 2009.
[14] E. Izhikevich, "Polychronization: Computation with Spikes," Neural Computation, 18:245-282, 2006.
Evolution of Analog Circuit Models of Ion Channels

Theodore W. Cornforth¹, Kyung-Joong Kim², and Hod Lipson³

¹ Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, USA 14853, [email protected]
² Department of Computer Engineering, Sejong University, 98 Gunja-Dong, Gwangjin-Gu, Seoul 143-747, Republic of Korea, [email protected]
³ Department of Mechanical and Aerospace Engineering, Cornell University, Ithaca, USA 14853, [email protected]
Abstract. Analog circuits have long been used to model the electrical properties of biological neurons. For example, the classic Hodgkin-Huxley model represents ion channels embedded in a neuron’s cell membrane as a capacitor in parallel with batteries and resistors. However, to match the predictions of the model with their empirical electrophysiological data, Hodgkin and Huxley described the nonlinear resistors using a complex system of coupled differential equations, a celebrated feat that required exceptional creativity and insight. Here, we use evolutionary circuit design to emulate such leaps of human creativity and automatically construct equivalent circuits for neurons. Using only direct electrophysiological observations, the system evolved circuits out of basic electronic components that accurately simulate the behavior of sodium and potassium ion channels. This approach has the potential to serve both as a modeling tool to reverse engineer complex neurophysiological systems and as an assistant in the task of hand-designing neuromorphic circuits.
1 Introduction
At least since the work of Lapicque in 1907 [1,23], analog circuits have been used as models to aid in understanding the behavior of biological neurons. Lapicque had knowledge of empirical observations as shown in Fig. 1. On the basis of such observations, he reasoned that the lipid bilayer membrane separating the intracellular space from the extracellular fluid is capable of storing charge and so acts like a capacitor. In addition, measurements of the membrane voltage in response to applied currents suggested the presence of a ‘leak’ conductance that acts like a resistor-battery combination in its tendency to slowly return the membrane voltage to a baseline resting value. An analog circuit matching this description is a capacitor in parallel with a resistor and battery, and is perhaps the simplest circuit capable of reproducing the fundamental electrical behavior
of a neuron (Fig. 2). Such a circuit is usually referred to as an ‘equivalent’ circuit in that it reproduces the essential behavior of a more complex electrical system in relatively simple form.
Fig. 1. Simple electrophysiology setup. Using a micromanipulator and microscope, a glass electrode can be positioned inside an individual neuron in such a way as to minimize damage to the cell membrane. This creates a simple circuit for measuring the potential difference between the intracellular space and the extracellular fluid. This potential difference is called the membrane voltage, VM . A stimulus current (the ‘input’ to the cell) can be delivered through the electrode and the resulting changes in VM (the ‘output’ from the cell) recorded for later study.
The concept of an equivalent circuit for a neuron has proven useful for the development of more sophisticated models of neuron behavior. Lapicque himself used the above circuit as the basis for the widely used integrate-and-fire model, in which all-or-nothing spikes in membrane voltage called action potentials are triggered when the capacitor is charged to some threshold potential [23]. Although useful for many purposes, the integrate-and-fire model does not describe the complex dynamics of the numerous membrane conductances in a typical neuron, of which the above-mentioned leakage conductance is only the simplest. Today, these different conductances are known to correspond to different types of membrane-spanning pores in the cell membrane called ion channels [10]. Ion channels are characterized by the type of ion that flows through them as well as by the factors that influence the degree to which the channel admits that ion. For example, a voltage-gated sodium channel admits only sodium ions and the intrinsic rate of sodium ion passage through the channel depends on the membrane voltage. The behavior of individual channel types and even individual channel molecules can be studied in isolation through different pharmacological and electromechanical techniques [18]. A major breakthrough in computational neuroscience was the detailed mathematical description of the primary conductances involved in action potential generation by Hodgkin and Huxley, for which they were awarded the 1963 Nobel Prize in Physiology or Medicine [15]. The equivalent circuit used by Hodgkin and
Fig. 2. A. Lapicque’s equivalent analog circuit for a neuron. With properly adjusted component parameters, this circuit reproduces the fundamental passive electrical behavior of a neuron. The battery-resistor combination represents the intrinsic driving force on ions comprising the leak current and the resistance to their flow across the membrane. Typical values for the components are on the order of 1 GΩ for the leak resistance (RL ), -70 mV for the leak potential (EL ), 1 pF for membrane capacitance (CM ) and 1 nA for the stimulus current. The node at the top of the circuit represents the extracellular fluid and the node at the bottom represents the intracellular space. A major simplifying assumption in this minimal model is that both the extracellular and intracellular spaces are isopotential compartments. B. Just as with a biological neuron, the circuit’s membrane voltage (top) in response to small step current inputs (bottom) behaves like the charging and discharging of an RC circuit.
Huxley, with components for sodium and potassium conductances in addition to the leakage conductance, is shown in Fig. 3A. In this case, the behavior of the nonlinear resistors is described by a complex system of coupled differential equations (Fig. 3B). This model is frequently cited as perhaps the most remarkable feat of creativity and insight in all of computational neuroscience, in part because Hodgkin and Huxley were unaware at the time of the correspondence between ion channels and membrane conductances [8]. If artificial intelligence were capable of such extraordinary leaps of creativity and insight in the face of little or no mechanistic understanding of the underlying physiology, it would be a very powerful tool in the neuroscientist’s toolbox. Although action potentials are important, they are just one among the hundreds if not thousands of dynamical systems that would need to be modeled to develop a reasonably complete picture of single cell neurophysiology. These include the dynamical systems underlying synaptic transmission, vesicle cycling, axon and dendrite growth, apoptosis and general cell metabolism [18]. This vast undertaking is of great interest not only for purely theoretical reasons, but for biomedical applications that require accurate neurophysiological models [5]. Our goal is to shift much of the burden of creating these models from humans to computers. Here, we apply the established technique of analog circuit evolution [22] to the task of creating equivalent circuit models for neurophysiological systems. The systems are measured through very simple, direct experiments of the kind that can easily be performed in a typical neuroscience ‘wet’ lab. With only this
Fig. 3. A. The Hodgkin-Huxley equivalent circuit. This circuit is similar to Lapicque’s circuit in Fig. 2 except that terms for sodium (RNa , ENa ) and potassium (RK , EK ) conductances have been added. Another significant difference is the nonlinear nature of the RNa and RK resistors, whose dynamics are described in the system of differential equations in B. B. The original Hodgkin-Huxley dynamical system. The variable v corresponds to the observed membrane voltage, the variable i corresponds to the stimulus current, and the hidden variables h, m, n correspond to gating parameters of the sodium and potassium conductances. Although the system is typically not presented in its full form without extensive explanation, we do so here to emphasize the complexity of the task faced by Hodgkin and Huxley and the magnitude of their achievement. For details, see the description in [15] or the accessible introduction in [17].
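Since panel B of the figure is not reproduced here, the standard textbook form of the Hodgkin-Huxley system may help the reader follow the discussion. This is a reference sketch in modern notation (not Hodgkin and Huxley's original sign conventions, and not a reproduction of the panel):

\[
C_M \frac{dv}{dt} = i \;-\; \bar{g}_{Na}\, m^3 h\,(v - E_{Na}) \;-\; \bar{g}_{K}\, n^4 (v - E_{K}) \;-\; \bar{g}_{L}\,(v - E_{L}),
\qquad
\frac{dx}{dt} = \alpha_x(v)\,(1 - x) - \beta_x(v)\, x \quad \text{for } x \in \{m, h, n\}.
\]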
input data and with only the simplest electrical components such as resistors and capacitors, circuit evolution can automatically generate accurate equivalent circuits for ion channels of the type studied by Hodgkin and Huxley.
2 Methods

2.1 Circuit Evolution
We employ a simple version of Koza’s circuit evolution technique [19,22]. Initially, a population of candidate equivalent circuits is created with two randomly selected, randomly connected electrical components. These randomly chosen components are placed in a variable portion of an otherwise invariant embryonic circuit, as shown in Fig. 4. The fitness of individuals in this random population is evaluated by translating each individual into an equivalent Spice netlist, simulating its behavior using NGSpice20 [28], and comparing this behavior to that of the target neurophysiological system (see Neurophysiological Data below). A steady-state population-updating method is used in which the least fit half of the population is subject to mutation before the next generation of fitness evaluation and selection occurs. For all results reported, no recombination was used and populations had a size of 48.

We represent circuits with a direct schematic-based encoding, in which components and their connections are stored as flat lists [6]. Seven low-level electrical components were used as the raw material or ‘building blocks’ with which evolution operates. These components and their variable parameter ranges are shown in Appendix A, Table 1. All models are default NGSpice20 models with the exception of the MOSFET models, which were generously provided by Mario Simoni [32]. From each circuit that is chosen to be mutated, one randomly selected component undergoes one of eight possible mutations. For details on these mutation operations, see our previous paper [19].

For some experiments, hill-climbing and random search controls were used. For hill-climbing, each ‘generation’ consisted of mutating a single circuit and replacing it with the mutated circuit if the mutation improved fitness. Similarly, for each generation in a random search, a circuit was randomly constructed from scratch and used to replace the best previously encountered circuit if the new circuit had higher fitness. For both hill-climbing and random search, the search was continued until the total number of circuits evaluated matched the number evaluated with the corresponding evolutionary procedure.
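A minimal sketch of this steady-state loop (Python). The evaluate and mutate arguments stand in for the SPICE-based fitness evaluation and the eight mutation operators described above; they are placeholders, not the authors' code.

def evolve(pop, generations, evaluate, mutate):
    # pop: list of candidate circuits; evaluate returns the fitness of a circuit
    # (netlist simulation compared to the target trace); mutate applies one of
    # the mutation operators to a circuit and returns the result.
    for _ in range(generations):
        pop.sort(key=evaluate, reverse=True)          # fittest individuals first
        half = len(pop) // 2
        pop[half:] = [mutate(c) for c in pop[half:]]  # mutate the least fit half
    return max(pop, key=evaluate)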
2.2 Neurophysiological Data
For our initial experiments, we simulated the 1952 Hodgkin-Huxley studies in which the contributions of the primary sodium and potassium currents to the squid giant axon action potential were studied in isolation using pharmacological techniques [12–16]. Using the NEURON 7.0 simulation environment [11], we inserted default HH sodium and potassium channels into a single compartment.
Fig. 4. The embryonic circuit. This circuit is essentially a model of a cell membrane with no ion channels embedded except for those responsible for the leak current. It is the task of evolution to find a sub circuit such that the entire circuit reproduces the behavior of a target ion channel. The nodes labeled A* and B* are used to identify connection points for the evolved sub circuits shown in Fig. 6 and Fig. 7 below. We include the intrinsic driving force for the modeled ion channel (EX ) as this is easy to determine experimentally and is not difficult to model, unlike the portion of the circuit that would be equivalent to the nonlinear resistors in Fig. 3.
Such a simulated ‘cell’ has the equivalent circuit shown in Fig. 2A. We then set the conductance value for sodium to 0 to simulate pharmacological blockade of the voltage-gated sodium channels and stimulated the cell with a small step current of 1 nA. The membrane voltage response obtained in this way represents the target behavior of idealized voltage-gated potassium ion channels (Fig. 5). In other words, evolution is tasked with finding a variable sub circuit as in Fig. 4 that behaves in the same way as the Hodgkin-Huxley nonlinear potassium resistance RK. The voltage-gated sodium channel is targeted for evolution in a similar manner.

To evaluate the fitness of a candidate equivalent circuit, we stimulate it with a step current of 1 nA in NGSpice20 and compare the resulting membrane voltage time series with the simulated membrane voltage time series obtained from NEURON (Fig. 5). Spice simulation data is recorded at 0.1 ms resolution for 40 ms, and each of those 401 time points is compared with the corresponding time point in the target NEURON data. Fitness is then defined as the reciprocal of the sum of the absolute differences at each time point. To reduce fitting errors due simply to voltage offset or scaling, both membrane voltage time series involved in the comparison are normalized to the range 0-1. Results below are plotted on this relative voltage scale.
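A compact sketch of this fitness measure (Python with NumPy; it assumes the two voltage traces are already sampled at the same 401 time points):

import numpy as np

def fitness(candidate_vm, target_vm):
    # Normalize both traces to [0, 1] so that offset and scaling are ignored.
    def normalize(v):
        v = np.asarray(v, dtype=float)
        return (v - v.min()) / (v.max() - v.min())
    diff = np.abs(normalize(candidate_vm) - normalize(target_vm))
    return 1.0 / diff.sum()   # reciprocal of the summed absolute differences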
3 Results
Evolution produced compact circuits mimicking the behavior of both an idealized voltage-gated potassium channel and a voltage-gated sodium channel. Fig. 5A shows the step response of a typical evolved equivalent circuit for the
voltage-gated potassium channel compared with the target potassium channel step response; Fig. 6 is the schematic for this evolved equivalent circuit. Similarly, Fig. 5B shows the response of a typical evolved circuit for the voltage-gated sodium channel and Fig. 7 is the schematic for this evolved equivalent circuit. The performance of the evolutionary algorithm for potassium channel and sodium channel equivalent circuit evolution is shown in Fig. 8A and Fig. 8B, respectively. Typical runs showed convergence to final fitness values within 1000 evolutionary generations, which requires about 150 minutes on a modest 2 GHz quad-core machine.
Fig. 5. Step response of the voltage-gated potassium (A) and sodium (B) channels (c.f. Fig. 2B). The responses of idealized NEURON ion channels to a step current of 1 nA are shown with solid black lines and the responses of embryonic circuits plus an evolved sub circuit are shown with dashed gray lines.
Fig. 6. Schematic of an evolved potassium channel equivalent circuit. No postprocessing or simplification of the circuit was performed. To conserve space, only the evolved variable sub circuit is shown. Parameter values are shown in Appendix A, Table 2. The nodes labeled A* and B* connect with the embryonic circuit as shown in Fig. 4.
Fig. 7. Schematic of an evolved sodium channel equivalent circuit. As in Fig. 6, only the evolved variable sub circuit is shown and no post-processing or simplification of the circuit was performed. Parameter values are shown in Appendix A, Table 3.
Fig. 8. Equivalent circuit fitness as a function of the number of evaluated individuals. Potassium (A) or sodium (B) channel equivalent circuits were found with evolutionary search (black), hill-climbing (dark gray), and random search (light gray) for ten independent runs each. In each case, fitness was being maximized as described in Methods above. The mean fitness across runs of the best equivalent circuit is plotted as a solid line with error bars representing mean fitness ± standard error of the mean.
4 Discussion and Conclusion
We propose that circuit evolution can be used to automatically construct equivalent circuits for neurophysiological systems. We tested this idea on two systems,
voltage-gated sodium and potassium ion channels, and found that the evolved circuits can reproduce the behavior of the modeled systems to a large degree. Future work will confirm that the evolved equivalent circuits are accurate and robust models of the sodium and potassium ion channels. The behavior of the evolved sodium channel circuits in particular deviates somewhat from the desired behavior. One possible explanation is that Hodgkin-Huxley voltage-gated sodium channel dynamics involve both activation and inactivation processes, unlike the dynamics of voltage-gated potassium channels, which can be modeled with only activation. We are pursuing the hypothesis that recombination would allow a promising sub circuit for only the activation or only the inactivation process to replicate and then differentiate during evolution. We are also investigating more sophisticated representations for analog circuit evolution such as Analog Genetic Encoding [26] and graph grammar-based approaches [3].

We used step current inputs to the respective systems to characterize their output behavior, but it will be important to confirm that other types of input/output pairings can be reproduced by the equivalent circuits as well. Co-evolution of input functions and equivalent circuit models is one possible way to maintain selection pressure for robustness [35]. Such a co-evolutionary approach was tried in our previous paper [19] with good results.

Our longer-term goal is to move beyond ion channels simulated in NEURON and to use circuit evolution as one component of a closed-loop automated experiment system. Active learning would be used to probe a physical system of interest, such as a single neuron as shown in Fig. 1. The inputs that cause the most disagreement between predicted and observed outputs would be used to evolve equivalent circuits. The accuracy and robustness of the model could then be refined by further cycles of active learning probes and circuit evolution. This approach is proposed and discussed in detail in Bongard and Lipson (2007) [2].

The use of circuit evolution to model neurophysiological systems has potential applications beyond those presented here. For example, the design of neuromorphic circuits that interface living tissue with electrical components is of growing importance in the medical field, yet as with the design of all novel analog circuits, neuromorphic circuit design is largely performed by hand, and then only by experienced electrical engineers [4,33]. The complexity of hand-designed hardware implementations of the Hodgkin-Huxley model attests to the difficulty of the task [21,32]. The assistance of circuit evolution could be invaluable, especially as neuromorphic circuits customized to the needs of individual patients become a reality [31]. Although we use discrete components here, our approach is applicable to the design of integrated circuits, which would almost certainly be used in any practical neuromorphic system.

One drawback of the equivalent circuit approach in general is the lack of a mechanistic correspondence to the underlying neurophysiological system. However, many applications may not require an understanding of the underlying biology in detail. For example, simulations used in drug design might benefit from an accurate and easily produced model of a neurophysiological system with only the requirement that certain observed behavior be reproduced [29].
Given the inherently electrical nature of systems in neurophysiology, the electrical components in equivalent circuits are likely to be natural building blocks with which evolution can construct accurate models. However, equivalent circuits are only one of many ways to represent neurophysiological systems. Others include differential equations as in Hodgkin and Huxley's work, Markov kinetic models [9] and artificial neural networks [25]. Evolution-based search has now been successfully applied to the optimization of models using all those representations [7,27,34]. The more general idea of using software and hardware-based techniques to automatically study systems in biology, chemistry and physics has also had noteworthy successes [20,24,30]. Indeed, we envision that many nontrivial scientific tasks currently requiring significant human effort will begin to be augmented by sophisticated artificial intelligence and robotic systems in the near future.

Acknowledgments. Support was provided by the Tri-Institutional Training Program in Computational Biology and Medicine.
References

1. Abbott, L.F.: Lapicque's Introduction of the Integrate-and-Fire Model Neuron. Brain Res. Bull. 50, 303–304 (1999)
2. Bongard, J., Lipson, H.: Automated Reverse Engineering of Nonlinear Dynamical Systems. PNAS 104, 9943–9948 (2007)
3. Das, A., Vemuri, R.: A Graph Grammar Based Approach to Automated Multi-Objective Analog Circuit Design. In: Proc. Design, Automation, Test Eur. Conf., pp. 700–705 (2009)
4. Douglas, R., Mahowald, M., Mead, C.: Neuromorphic Analog VLSI. Ann. Rev. Neurosci. 18, 255–281 (1995)
5. Ellner, S.P., Guckenheimer, J.: Dynamic Models in Biology. Princeton University Press, Princeton (2006)
6. Floreano, D., Mattiussi, C.: Bio-Inspired Artificial Intelligence: Theories, Methods, and Technologies. MIT Press, Cambridge (2008)
7. Gurkiewicz, M., Korngreen, A.: A Numerical Approach to Ion Channel Modeling Using Whole-cell Voltage-clamp Recordings and a Genetic Algorithm. PLOS Comp. Biol. 3, 1633–1647 (2007)
8. Häusser, M.: The Hodgkin-Huxley Theory of the Action Potential. Nat. Neurosci. 3, 1165 (2000)
9. Hawkes, A.G.: Stochastic Modeling of Single Ion Channels. In: Feng, J. (ed.) Computational Neuroscience: a Comprehensive Approach. CRC Press, London (2003)
10. Hille, B.: Ion Channels of Excitable Membranes. Sinauer Associates, Sunderland (2001)
11. Hines, M.L., Carnevale, N.T.: NEURON: a Tool for Neuroscientists. The Neuroscientist 7, 123–135 (2001)
12. Hodgkin, A.L., Huxley, A.F.: Currents Carried by Sodium and Potassium Ions Through the Membrane of the Giant Axon of Loligo. J. Phys. 116, 449–472 (1952)
13. Hodgkin, A.L., Huxley, A.F.: The Components of Membrane Conductance in the Giant Axon of Loligo. J. Phys. 116, 473–496 (1952)
14. Hodgkin, A.L., Huxley, A.F.: The Dual Effect of Membrane Potential on Sodium Conductance in the Giant Axon of Loligo. J. Phys. 116, 497–506 (1952)
15. Hodgkin, A.L., Huxley, A.F.: A Quantitative Description of Membrane Current and its Application to Conduction and Excitation in Nerve. J. Phys. 117, 500–544 (1952)
16. Hodgkin, A.L., Huxley, A.F., Katz, B.: Measurement of Current-Voltage Relations in the Membrane of the Giant Axon of Loligo. J. Phys. 116, 424–448 (1952)
17. Hoppensteadt, F.C., Peskin, C.S.: Modeling and Simulation in Medicine and the Life Sciences. Springer, New York (2002)
18. Kandel, E.R., Schwartz, J.H., Jessell, T.M.: Principles of Neural Science. McGraw-Hill, New York (2000)
19. Kim, K.-J., Wong, A., Lipson, H.: Automated synthesis of resilient and tamper-evident analog circuits without a single point of failure. Genet. Program. Evolvable Mach. 11, 35–59 (2010)
20. King, R.D., Rowland, J., Oliver, S.G., Young, M., Aubrey, W., Byrne, E., Liakata, M., Markham, M., Pir, P., Soldatova, L.N., Sparkes, A., Whelan, K.E., Clare, A.: The Automation of Science. Science 324, 85–89 (2009)
21. Kohno, T., Aihara, K.: Bottom-Up Design of Class 2 Silicon Nerve Membrane. J. Int. Fuzzy Sys. 18, 465–475 (2007)
22. Koza, J.R., Bennett III, F.H., Andre, D., Keane, M.A., Dunlap, F.: Automated Synthesis of Analog Electrical Circuits by Means of Genetic Programming. IEEE Trans. Evol. Comput. 1, 109–128 (1997)
23. Lapicque, L.: Recherches quantitatives sur l'excitation électrique des nerfs traitée comme une polarisation. J. Physiol. Pathol. Gen. 9, 620–635 (1907)
24. Lindsay, R.K., Buchanan, B.G., Feigenbaum, E.A., Lederberg, J.: Applications of Artificial Intelligence for Organic Chemistry: The DENDRAL Project. McGraw-Hill, New York (1980)
25. Lockery, S.R., Wittenberg, G., Kristan, W.B., Cottrell, G.W.: Function of Identified Interneurons in the Leech Elucidated Using Neural Networks Trained by Back-Propagation. Nature 340, 468–471 (1989)
26. Mattiussi, C., Floreano, D.: Analog Genetic Encoding for the Evolution of Circuits and Networks. IEEE Trans. Evol. Comp. 11, 596–607 (2007)
27. Menon, V., Spruston, N., Kath, W.L.: A State-Mutating Genetic Algorithm to Design Ion-Channel Models. PNAS 106, 16829–16834 (2009)
28. NGSpice Circuit Simulator, http://ngspice.sourceforge.net
29. Noble, D.: Computational Models of the Heart and Their Use in Assessing the Actions of Drugs. J. Pharm. Sci. 107, 107–117 (2008)
30. Schmidt, M., Lipson, H.: Distilling Free-Form Natural Laws from Experimental Data. Science 324, 81–85 (2009)
31. Sicard, G., Bouvier, G., Fristot, V., Lelah, A.: An Adaptive Bio-Inspired Analog Silicon Retina. In: Proc. 25th Eur. Solid-State Cir. Conf., pp. 306–309 (1999)
32. Simoni, M.F., Cymbalyuk, G.S., Sorensen, M.E., Calabrese, R.L., DeWeerth, S.P.: A Multiconductance Silicon Neuron with Biologically Matched Dynamics. IEEE Trans. Biom. Eng. 51, 342–354 (2004)
33. Smith, L.S.: Neuromorphic Systems: Past, Present and Future. Adv. Exp. Med. Biol. 657, 167–182 (2010)
34. Stanley, K.O., Miikkulainen, R.: Evolving Neural Networks Through Augmenting Topologies. Evol. Comp. 10, 99–127 (2002)
35. Torresen, J.: A Scalable Approach to Evolvable Hardware. Genet. Prog. Evol. Mach. 3, 259–282 (2002)
Appendix: Circuit Parameters
Table 1. Components and parameter ranges used in circuit evolution

Component        Parameter range
inductor         L = 1 - 1x10^9 kH
capacitor        C = 1 - 1x10^9 fF
resistor         R = 1 - 1x10^9 kΩ
diode            D
EMF              V = 0 - 20 V
p-type MOSFET    M, length = 10 μm, width = [5, 10, 20] μm
n-type MOSFET    M, length = 10 μm, width = [5, 10, 20] μm
Table 2. Components and parameter values for the evolved potassium channel equivalent circuit

Component   Parameter value
L3          3.50x10^2 kH
M3          width = 10 μm
M4          width = 5 μm
R3          3.93x10^6 kΩ
M5          width = 10 μm
M6          width = 20 μm
L4          1.79x10^5 kH
M7          width = 5 μm
C3          1.01x10^3 fF
L5          4.58x10^5 kH
L6          3.48x10^6 kH
L7          4.38x10^2 kH
C4          2.34x10^3 fF
M8          width = 10 μm
M9          width = 5 μm
L8          1.48x10^1 kH
L9          1.34x10^4 kH
C5          3.14x10^2 fF
C6          1.90x10^1 fF
L10         1.64x10^2 kH
Table 3. Components and parameter values for the evolved sodium channel equivalent circuit

Component   Parameter value
L3          3.52x10^3 kH
M3          width = 5 μm
R3          2.16x10^6 kΩ
M4          width = 10 μm
M5          width = 10 μm
C3          7.83x10^1 fF
R4          5.93x10^7 kΩ
C4          3.24x10^3 fF
M6          width = 5 μm
C5          7.07x10^8 fF
C6          4.03x10^1 fF
L4          2.21 kH
L5          2.76x10^4 kH
L6          3.48 kH
HyperNEAT for Locomotion Control in Modular Robots

Evert Haasdijk, Andrei A. Rusu, and A.E. Eiben

Dept. of Computer Science, Vrije Universiteit Amsterdam, The Netherlands
[email protected], [email protected], [email protected]
http://www.cs.vu.nl/ci/
Abstract. In an application where autonomous robots can amalgamate spontaneously into arbitrary organisms, the individual robots cannot know a priori at which location in an organism they will end up. If the organism is to be controlled autonomously by the constituent robots, an evolutionary algorithm that evolves the controllers can only develop a single genome that will have to suffice for every individual robot. However, the robots should show different behaviour depending on their position in an organism, meaning their phenotype should be different depending on their location. In this paper, we demonstrate a solution for this problem using the HyperNEAT generative encoding technique with differentiated genome expression. We develop controllers for organism locomotion with obstacle avoidance as a proof of concept. Finally, we identify promising directions for further research.
1 Introduction

The research presented in this paper was undertaken as part of the European research project SYMBRION: Symbiotic Evolutionary Robot Organisms (EU grant agreement 216342). As the name suggests, a key objective of the project is the evolution of robot organisms: structures consisting of physically connected individual robots like those in Fig. 1, for tasks that an unconnected group of individual robots cannot cope with. In SYMBRION, individual robots are fully autonomous and viable as individuals, while they have the ability to dock with each other and so aggregate into organisms, becoming modules (cells) within the organism. Once in organism mode, the modules share energy and control, acting autonomously but in co-ordination. Co-ordination is inherently distributed, without central control. Such emergent organisms are not made to last forever: they can separate to become a swarm of individual robots once more. The individual robots are then available for the formation of new, possibly differently shaped, organisms. This high level of flexibility implies challenging requirements for robot controllers. Firstly, an individual robot needs a controller that works appropriately within differently shaped organisms. For instance, the robot should be able to act within a "snake", a twenty-legged body, or a "dog" with four legs, a head and a tail. Furthermore, any robot should be able to function at different positions of any given organism shape, e.g., at the head as well as in the middle of a snake. As an example of a task for an organism that requires co-ordinated control of the robots/modules, consider locomotion; obviously a key ability for the organism to perform meaningful tasks. In this paper
Fig. 1. Illustration of possible SYMBRION organisms
we leave shared energy for what it is and address the challenge of robot controllers that work appropriately at different positions of a given organism shape. In particular, we seek an evolutionary method that can produce such controllers for arbitrary organism shapes (note: not a single controller for arbitrary shapes but a single developmental technique).

Let us first clarify the difference between a robot controller and the evolvable code that represents that controller. In general, a robot controller is a structurally and procedurally complex entity that directly determines the robot's behaviour. When using evolutionary methods for controller design, controllers are seen as phenotypes that are represented by (structurally simpler) pieces of code, called genotypes. The phenotypes are then perceived as expressions of the genetic code in the genotype through a possibly complex mapping. The fitness of an individual (typically: task performance) is then determined by the phenotype. Meanwhile, conforming to biological principles of evolution, it is only the genotypes that undergo evolutionary operators (mutation and/or crossover), not the phenotypes.

Distinguishing in this way between phenotype (the actual controller) and the genotype that encodes it allows us to rephrase the challenge: we seek an evolutionary method that is capable of generating genotypes that give rise to controllers that work appropriately at different positions of a given organism. Because it is unknown a priori at which location in an organism a particular robot will end up, the robots must have a single genotype that encodes appropriate controllers for each location: the group of robots is literally homogeneous. However, they should have the flexibility to show different behaviour depending on their position in an organism. This means that their phenotype should be different, depending on their location. For instance, a module that forms part of a quadruped's backbone has a different role and thus requires a different controller than does a module that makes up, say, a hip joint (in biological terms, the expression of the genotypes must be influenced by the environment).
We argue that an evolutionary algorithm with a generative encoding presents a natural way to meet these requirements. As noted by D'Ambrosio and Stanley in [3], variation on a policy theme distributed across space is reminiscent of the regular spatial patterns for which generative encodings are known [8,11]. For our purposes, generative encodings offer the benefit that the genome can be interpreted multiple times with variations. In our own bodies, this is exactly what happens when our DNA is expressed: for instance, variations in expression cause each of the segments of our spines to be similar yet specifically differentiated for their role within the spine as a whole. Enabling similar differentiation when expressing the genome as controllers for the organism's modules allows the development of varying, specialised functionality. For this procedure of varying the expression of a genotype to create specialised controllers for the organism's modules, we coin the phrase modular differentiation.²

Of course, specialisation can also be achieved by separately evolving specialist controllers for (collections of) joints, vertebrae, etc. and selecting the appropriate controller as needed. While such divide-and-conquer tactics have resulted in successful locomotion, the underlying decomposition is inherently specific to a particular morphology and must be performed manually. Also, it runs the risk of introducing constraints and biases that limit the quality of solutions, cf. [7,2, and citations therein].

Locomotion of an organism that consists of autonomous modules can be viewed as a task of a co-operating team of individual agents, with each module constituting an autonomous agent. Although extending the scope of their findings to this scenario might be tenuous, Waibel et al. have shown that for tasks requiring co-operation, homogeneous teams outperform heterogeneous ones [12]. The modular differentiation approach allows us to enjoy the best of both worlds: it exploits the benefit that homogeneous teams enjoy without sacrificing the advantages of specialisation.

The individual robots that make up the organism also have their own sensory capabilities that allow them, for instance, to steer the organism away from obstacles they detect. Consequently, we seek controllers that put sensor information, specifically obstacle detection, to use: they should implement reactive control in addition to habitual motion patterns such as found in [4,6]. Summarising, the aim of this paper is to present an evolutionary algorithm that combines generative encoding and modular differentiation to evolve reactive, co-ordinated, autonomous modular controllers for organism locomotion.
2 Generative Encoding Description

The generative encoding we use is called HyperNEAT [9], which evolves artificial neural networks with the principles of the widely used NeuroEvolution of Augmenting Topologies (NEAT) algorithm [10]. HyperNEAT evolves a particular type of artificial neural network, called a Compositional Pattern Producing Network (CPPN). While traditional artificial neural networks typically contain only sigmoid functions, CPPNs can employ a mixture of many other functions.
The term modular differentiation was chosen as an analogy to developmental biology’s cellular differentiation, the process by which a less specialised cell becomes a more specialised cell type.
A CPPN defines a function that can be employed, for instance, to assign grayscale values to pixels in an image, as was done to generate the picture in Fig. 2 (see http://picbreeder.org/ for more examples and information). The image highlights important attributes of the CPPNs that evolve in HyperNEAT: they tend to produce designs with a large degree of regularity, symmetry and repetition. Often, patterns are repeated with slight variations and at varying scales. The consequent layout can be perceived as modular with variations.

Fig. 2. A CPPN-generated grayscale image

HyperNEAT uses the CPPNs as an indirect encoding, so the CPPNs do not constitute the controllers for the robot modules themselves. Instead, the CPPNs are used to set up the artificial neural networks that do control the robots. To avoid confusion, these artificial neural nets that form the phenotype are usually referred to as substrates. To define a substrate, the CPPN specifies the weight for every possible connection in the template substrate: the connection weight between two nodes is determined by querying the CPPN with the two nodes' co-ordinates, which then returns the required connection weight. Often, the distance between the nodes is passed into the CPPN as well. This method of generating the substrate assigns meaning to the location of the neural net's nodes, implying that HyperNEAT has the unique ability to exploit the geometry of a problem [9]: if the geometric disposition of the nodes in the substrate represents relevant information, HyperNEAT can use that information.

HyperNEAT has been successfully used in many applications, maybe most pertinently to develop gaits for four-legged robots by Clune et al. [2]. There, Clune et al. used HyperNEAT to develop monolithic, central controllers for a table-shaped robot. That robot did not, in contrast to the organisms considered here, consist of multiple modules, so modular differentiation could not play a role in controller development. Moreover, no obstacle detection was employed and therefore control could not avoid obstacles as we aim to do.
3 Experimental Set-Up

We evolved controllers for locomotion of a quadruped organism consisting of 14 simple modules as a proof of concept. Experiments were conducted in the well-known Webots simulation platform (http://www.cyberbotics.com/).

3.1 Modules and Organism

We based the modules on the YaMoR [6] oscillators. These consist of a solid body and an oscillating arm, offering one degree of freedom. We added two extra connectors, bringing the total to 4, which are situated as follows: one on the joint's arm, one on the opposite side of the module, and two on the left and right of the mobile joint, in the motion plane. See Fig. 3(a) for a rendition of a module. For obstacle detection, we added 6 distance sensors with very limited range, indicated in Fig. 3(a) by the thin lines emerging from the module. They are distributed on all sides of the module.
Fig. 3. Module and organism: (a) basic module; (b) quadruped organism
The module positions in the quadruped organism shown in Fig. 3(b) are inspired by joint disposition in natural insect legs. The central modules allow for mid-body flexibility. The organism is completely symmetrical around its center point. In these experiments, each module's controller has the sole task of setting the target angle value for the actuator in each control step to achieve locomotion for the organism. The individual modules that make up our organism are simpler than those being developed in the SYMBRION project [5], but for the purposes of organism locomotion they have similar degrees of freedom and sensors.

3.2 Control

Each module within the organism operates autonomously and with only local interaction. As described in Sec. 2, each module is controlled by its own neural network, or substrate, controller. The nodes of the substrate are arranged in three layers of nine nodes each, as shown in Fig. 4. Links between nodes run only in one direction and only between consecutive layers. The nodes have sigmoid activation functions.

Fig. 4. The substrate layout for the locomotion task. Connections are shown as illustration; actual connectivity is determined by the CPPN

Inputs consist of processed sensory information: when a new object appears in the range of the sensors, a ‘new presence’ flag in the centre of the input layer (labelled ‘self’ in Fig. 4) is set to −1. To compute the occurrence of a new object, the distance sensors are queried in each control step and the returned values are compared to the values in the previous step. If at least one sensor gives a reading increase above 50% of the maximum activation level, this is interpreted as the detection of a new object in the perceptual range of the module, and the center input layer node is activated (with value −1).
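A minimal sketch of this detection step (Python; the function name and argument names are placeholders, and max_activation stands for the sensor's maximum reading):

def new_presence_flag(current, previous, max_activation):
    # current, previous: lists of distance sensor readings from this control
    # step and the preceding one; returns -1 if a new object appeared, else 0.
    for now, before in zip(current, previous):
        if now - before > 0.5 * max_activation:
            return -1
    return 0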
This scheme is loosely inspired by the biological processing of olfactory information, which triggers strong responses primarily at the initiation of new stimuli, but then develops adaptation (‘fatigue’) [13]. Note that the continued presence or removal of an object from a sensor's range is not signalled. Up to four adjacent controllers (of any directly connected modules) send their own flag: these values are set in the substrate inputs corresponding to the geometric positions of the connectors. In the 3x3 input layer, the central node accounts for the current module, and the nodes above, below, to the left and to the right of that node account for the modules connected using the front, back, left and right connectors, respectively. This very primitive, distributed object detection scheme is intended to allow for simple but effective reactions to obstacles. If no perceptual changes are detected by the sensors of the current module or the modules connected to it, the substrate inputs are 0, allowing for default non-reactive locomotive behaviour as specified by the output layer biases. Note that this default behaviour actually requires no interaction with other modules at all: the organism moves by virtue of the modules acting in splendid isolation. Producing a successful gait with such a reactive framework is harder than with a non-reactive one (which is actually implemented by the output layer's biases), because the modules are subjected to potentially different "perceptual histories" at every evaluation. However, this scheme exposes the changes in behaviour to the evolutionary algorithm and allows for adjustments to the base angle, speed and amplitude as responses to perceived objects.

The output layer provides three values for the computation of the target angle of the joint in each control step: α (reference angle), A (deviation amplitude from the reference angle) and ω (angular speed of the oscillation). The target angle is computed as follows:

αtarget = α + A · sin(πωt + id)    (1)

with t the current time-step and id a number between 1 and 14 which identifies the current module within the organism, with no geometric meaning. This encoding of the joint's motion allows for both static and dynamic joints, with specific oscillation amplitudes and speeds. The modules are out of phase by a number of steps determined by their position in the organism; this is important for generating some motion in the initial stages of the evolutionary process. This encoding scheme was devised for its effective task decomposition into the concepts of speed, amplitude and a base angle.

3.3 Modular Differentiation

To achieve modular differentiation, we extend the information passed to the CPPN when determining the connection weight between two nodes in the substrate. Remember that normally the connection weight is determined by querying the CPPN with the two nodes' co-ordinates, often passing the distance between the nodes into the CPPN as well. In addition, we pass the CPPN inputs locating the module for which we are generating connection weights within the organism. By virtue of these extra inputs, each module in the organism will have a different set of connection weights in its neural net controller, but the underlying genotype (i.e., the CPPN) is the same throughout the organism.
To be precise, the substrate weights are determined by querying the CPPN with the corresponding co-ordinates for the source and destination nodes in three dimensions x, y, z and the relative position of the module in the organism on a two-dimensional plane t1, t2, as illustrated in Fig. 5. We also use four delta inputs: Δx, Δy and Δz are the respective co-ordinate value differences, while Δt is the Euclidean distance to the centre of the organism shape. As an example, consider the link between two nodes at co-ordinates (1, 0, 1) and (0, 1, 0) in the substrate. To determine the weight for that connection, the CPPN would be queried with nine values that pertain to the two nodes themselves: the six original co-ordinate values and three Δ-values that denote the differences for the x, y and z co-ordinates (Δx = 1, Δy = 1, Δz = 1). Additionally, we pass three values to differentiate between modules: for module 6 in Fig. 5, for example, we pass t1 = 0.66, t2 = 0.25 and Δt = √(0.66² + 0.25²), while for module 11 these values are t1 = 0, t2 = −0.25 and Δt = √(0² + 0.25²). Links for which the CPPN returns values below 90% are ignored, so the CPPN's output is interpreted as a link's relevance measure, and only very strong stimulatory and inhibitory links are kept. The 90% threshold was established empirically. The perceptual scheme introduces a lot of noise directly into the values that determine the motion patterns, so only very strong links are worth keeping.

Fig. 5. Distribution of modules in the t1, t2 coordinate space. Modules are labelled with their ids
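A minimal sketch of this weight-generation step (not the authors' code; the CPPN interface, the variable names and the reading of the 90% threshold as an absolute-value cut-off are assumptions) could look as follows:

import math

def substrate_weights(cppn, coords, links, t1, t2, threshold=0.9):
    # cppn maps a tuple of input values to a single output value, coords[node]
    # gives a node's (x, y, z) substrate position, and links lists the candidate
    # (src, dst) pairs (only pairs between consecutive layers). (t1, t2) locate
    # the module within the organism.
    dt = math.sqrt(t1 ** 2 + t2 ** 2)        # Euclidean distance to the organism centre
    weights = {}
    for src, dst in links:
        x1, y1, z1 = coords[src]
        x2, y2, z2 = coords[dst]
        query = (x1, y1, z1, x2, y2, z2,
                 x1 - x2, y1 - y2, z1 - z2,  # delta-x, delta-y, delta-z
                 t1, t2, dt)                 # module-locating extra inputs
        weight = cppn(query)
        if abs(weight) >= threshold:         # keep only very strong links
            weights[(src, dst)] = weight
    return weights

Because the same CPPN is queried with different (t1, t2, Δt) values for each module, every module obtains its own weight set from a single shared genotype.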
Fig. 6. Experimental setting: a corridor with bricks and walls
3.4 Task and Evolution

We ran a series of simulations in the arena depicted in Fig. 6. The task for the organism was to move the whole body along the corridor, of which the walls are too high to scale. The corridor is littered with bricks. The organism starts roughly in the middle of the corridor. Bricks and walls are detected when they are in the (short) range of each module's distance sensors. Bricks can be moved, but walls cannot. This allows for a “perceptual” difference between them, since bricks are more dynamic and will typically activate sensors which walls will not, e.g. underneath the body. The organism needs to adjust its gait to steer away from walls, but not be deterred by mere stacks of bricks. Each evaluation lasts 20 simulated seconds for a total of around 80 control steps. Each CPPN is evaluated 3 times on the same task, to get a better approximation of its fitness. Fitness increases exponentially with the final distance from the origin achieved by the organism and the average height of the middle section; it is computed as follows:

$f(CPPN) = e^{\,d_{origin} \cdot 0.95^{(d_{travelled}/d_{origin} - 1)} + h_{avg}}$   (2)
with d_origin the distance from the origin after 20 seconds, d_travelled the total distance travelled and h_avg the average height from the floor of the body's centre during the 20 second evaluation period. The d_travelled/d_origin − 1 part measures the effectiveness of the overall gait: the final distance from the origin is scaled down to penalise ineffective gaits that do not move in one consistent direction. The inclusion of h_avg promotes individuals that can raise their bodies. As the distance-related part of the fitness formula quickly becomes an order of magnitude larger than the average height, its effects are felt mainly in the initial stages of evolution. We used Jason Gauci's publicly available C++ implementation of HyperNEAT, version 2.6 (http://eplex.cs.ucf.edu/software.html#gaucij_HyperNEAT). Apart from a population size of 10, we used the settings as found in that implementation's TicTacToe experiment. We did not engage in further tuning of parameters or thorough analysis of alternative fitness calculations since the experiments provide a proof of concept rather than a comprehensive analysis.
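Under the reading of Eq. (2) given above (the exact grouping of terms is a reconstruction, so treat this as an assumption rather than the authors' code), the fitness evaluation can be sketched as:

import math

def fitness(d_origin, d_travelled, h_avg):
    # Eq. (2): the net displacement is scaled down by 0.95**(d_travelled/d_origin - 1)
    # to penalise meandering gaits, the average body height is added, and the
    # result is exponentiated.
    if d_origin <= 0.0:
        return math.exp(h_avg)   # no net displacement at all
    straightness_penalty = 0.95 ** (d_travelled / d_origin - 1.0)
    return math.exp(d_origin * straightness_penalty + h_avg)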
4 Results and Analysis

The evolved gaits we observed were smooth and seemed natural, with the organism moving in a controlled, co-ordinated manner using cyclical motion patterns. In the later stages of evolution, motion patterns often exhibit left-right symmetry, replacing the initial phase difference to produce useful gaits. They gave the impression that the organism would happily walk for hours on end without faltering, as it returned to a neatly poised stance after every step. The sensory input was often seen to be used, with the organism lifting a leg higher than normal to avoid a brick, as illustrated in Fig. 7. Because the bricks can also be shoved aside, this kind of behaviour did not always emerge, but that it does at all is a clear indication that reactive controllers do evolve in this set-up. Figure 8 shows the development of fitness over 25 repeats of the experiment. The centre line shows the median of the best of every generation over 25 runs, with the bars
Fig. 7. Locomotion while negotiating an obstacle
Fig. 8. Fitness plot over 150 generations. The centre line shows the median of each generation’s best individual over 25 repeats, the bars extend from the lower to upper quartile
extending from the lower to the upper quartile. Considering the exponential nature of Eq. 2, the median fitness of circa 15 after 150 generations equates to more than 2.5 metres travelled. The lower quartile after 150 generations, at 10, equates to travelling ca. 2.3 metres. For values of 20 or higher, the organism actually reaches the end of the 3-metre corridor. To analyse the effect of modular differentiation in the organism and the reactivity of the controllers, we examine the substrate outputs of a high-fitness individual. The networks use sigmoid functions with outputs between −1 and 1, which are afterwards linearly rescaled to the full ranges of the effectors. For simplicity we omit the scaling here and show raw network output values. Figure 9 shows substrate outputs for the three output nodes (see the substrate in Fig. 4) for all 14 controllers over 80 control steps. The horizontal base lines indicate the substrate output in the unexcited state, i.e., when no obstacles are detected. If the controllers were identical, these lines would obviously overlap for all modules.
Fig. 9. Substrate outputs for all modules of a high-fitness organism over 80 time-steps: (a) base angle (α), (b) angular speed (ω), (c) amplitude (A)
The different levels we see are the result of modular differentiation. Jags in the plots indicate reactions to perceptual input (detecting an obstacle), either direct or via a neighbouring module. Note that not all modules react with the same intensity or at the same time, further proof of modular differentiation. Figure 9(a) shows the outputs for the base angle α; many of the outputs remain constant throughout the experiment, indicating controllers that do not use sensory input to set the base angle. The number of lines we can distinguish indicates that modular differentiation leads to some specialisation. The variation of the three non-constant plots results from obstacle detection, but the magnitude of the changes is small. Figure 9(b) shows that the outputs for angular speed (ω nodes) are almost equal for all modules (note the scale). Moreover, no perceptual information is used, since the outputs are constant. This parameter barely differentiates modules. By far the most diverse behaviour is shown in Fig. 9(c), which depicts the amplitude node outputs. All controllers use perceptual information to set amplitude values, and the magnitude of the changes is as big as 0.3 in absolute difference in some cases. Also, there is a high degree of specialisation, since the default output levels range from −0.6 to 0.4.
5 Conclusion and Future Work

Using HyperNEAT's generative encoding technique and modular differentiation, we have designed an evolutionary algorithm to develop homogeneous yet specialised controllers for modules within a multi-robot organism. We showed that this algorithm can successfully develop a reactive quadruped gait. The individual robots' controllers act autonomously and with only local exchange of information, but in a co-ordinated manner to allow successful locomotion of a given organism. Analysis of the substrate output of all modules over the course of an evaluation showed considerable differences in activation between modules, indicating adaptation of module controllers to their particular position in the organism as the result of modular differentiation. The controllers incorporate sensory feedback from the modules' obstacle sensors, resulting in the CPPN encoding multiple motion patterns. The base pattern is determined by the substrate output layer biases (used when no obstacles are detected and the remaining controller network is not activated). The CPPN also encodes the changes to this default behaviour, different for each perceptual flag combination, which directly activates the network. Instead of exchanging information about the motion pattern, the modules send information about detected obstacles to any directly connected neighbours. This way perceptual information propagates locally and progressively, as the new object also enters the sensory range of adjacent modules. Analysis showed that the primitive “perceptual flag” sensory scheme can successfully switch policies for all modules, for this particular individual the most notable changes affecting amplitude values. Further study of the perceptual scheme described here is required to assess its effectiveness in arenas of different shapes and scales. A promising avenue of further research leads towards an implementation of the HyperNEAT modular differentiation approach for on-line adaptation of controllers for emergent rather than pre-defined organism morphologies. Future research will also combine the use of CPPNs to generate organism morphology as well as controllers for the constituent modules.

Acknowledgements. This work was made possible by the European Union FET Proactive Initiative: Pervasive Adaptation funding the SYMBRION project under grant agreement 216342. The authors would like to thank Nicolas Bredèche and other partners in the SYMBRION consortium, Jeff Clune and Selmar Smit for many inspirational discussions on the topics presented here. Jason Gauci, Ken Stanley and other members of the very active HyperNEAT community have provided invaluable help.
References 1. Proceedings of the 2009 IEEE Congress on Evolutionary Computation, Trondheim, May 18-21. IEEE Press, Los Alamitos (2009) 2. Clune, J., Beckmann, B.E., Ofria, C., Pennock, R.T.: Evolving coordinated quadruped gaits with the hyperneat generative encoding. In: CEC-2009 [1] 3. D’Ambrosio, D.B., Stanley, K.O.: Generative encoding for multiagent learning. In: Ryan, C., Keijzer, M. (eds.) Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2008). ACM, New York (2008)
4. Ijspeert, A.J.: Central pattern generators for locomotion control in animals and robots: A review. Neural Networks 21(4), 642–653 (2008) (Robotics and Neuroscience) 5. Kernbach, S., Meister, E., Scholz, O., Humza, R., Liedke, J., Ricotti, L., Jemai, J., Havlik, J., Liu, W.: Evolutionary robotics: The next-generation-platform for on-line and on-board artificial evolution. In: CEC-2009 [1], pp. 1079–1086 6. Marbach, D., Ijspeert, A.J.: Online optimization of modular robot locomotion. In: Proceedings of the IEEE Int. Conference on Mechatronics and Automation (ICMA 2005), pp. 248– 253. IEEE Press, Los Alamitos (2005) 7. Nolfi, S., Floreano, D.: Evolutionary Robotics: The Biology, Intelligence, and Technology of Self-Organizing Machines. MIT Press, Cambridge (2000) 8. Stanley, K.O.: Compositional pattern producing networks: A novel abstraction of development. Genetic Programming and Evolvable Machines 8(2), 131–162 (June 2007), Special issue on developmental systems 9. Stanley, K.O., D’Ambrosio, D.B., Gauci, J.: A hypercube-based indirect encoding for evolving large-scale neural networks. Artificial Life 15(2), 185–212 (2009) 10. Stanley, K.O., Miikkulainen, R.: Evolving neural networks through augmenting topologies. Evolutionary Computation 10(2), 99–127 (2002) 11. Stanley, K.O., Miikkulainen, R.: A taxonomy for artificial embryogeny. Artificial Life 9(2), 93–130 (2003) 12. Waibel, M., Keller, L., Floreano, D.: Genetic Team Composition and Level of Selection in the Evolution of Cooperation. IEEE Transactions on Evolutionary Computation 13(3), 648–660 (2009) 13. Whitfield, P., Stoddard, D.M.: Hearing, Taste, and Smell; Pathways of Perception. Torstar Books, Inc., New York (1984)
The Use of Genetic Algorithm to Reduce Power Consumption during Test Application Jaroslav Skarvada, Zdenek Kotasek and Josef Strnadel Brno University of Technology, Faculty of Information Technology, Božetěchova 2, 61266 Brno, Czech Republic {kotasek,skarvada,strnadel}@fit.vutbr.cz
Abstract. In this paper it is demonstrated how two issues from the area of testing electronic components can be merged and solved by means of a genetic algorithm. The two issues are the ordering of test vectors and the ordering of scan registers, with the goal of reducing switching activity during test application and, as a consequence, power consumption. The principles of developing an optimizing procedure with the aim of achieving a solution satisfying the required value of power consumption during test application are described here. A basic description of the methodology together with the functions needed to implement the procedures is provided. Experimental results are also discussed. Keywords: test application, power consumption, optimizing procedure, fitness function, genotype, phenotype.
1 Introduction

With the continuing increase in chip density, power dissipation has become one of the major design constraints for today's VLSI circuits. Although there are many techniques for power minimization during normal (functional) operation, power minimization during testing is an emerging research area because power dissipation during testing is becoming a yield and reliability problem. Significantly more switching activity occurs during testing than during functional operation. The increased activity can decrease the reliability of the circuit under test because it causes excessive temperature and current density, which can cause problems in circuits designed with a power minimization requirement. Furthermore, as a result of high activity in circuits employing BIST, the voltage drop that occurs only during testing causes some good circuits to fail the testing process, leading to unnecessary manufacturing yield loss. To summarize, excessive switching activity during scan testing can cause average power dissipation and peak power during testing to be much higher than during normal operation. This can cause problems both with heat dissipation and with current spikes. Various formulas to evaluate power consumption were developed and implemented [1], [2]. They are difficult to use in practical designs, especially for complex circuits and large volumes of input data (test vectors).
1.1 Power Consumption Metrics For the purposes of comparing various optimizing procedures that are aimed at power consumption reduction, power consumption metrics were developed and are used. It is evident that if the sequence of input data is reorganized as a result of applying a particular methodology and the implementation of the component is unchanged, then for the purposes of comparing various methodologies, an NTC (Number of Transition Count) parameter can be used. More precise techniques are based on the use of WNTC (Weighted Number of Transition Count) [3], and WSA (Weighted Switching Activity) [4]. These parameters can be evaluated by the following formulas:
$NTC = \sum_{i=1}^{N_C} n(i)$   (1)

In (1), n(i) represents the number of 0↔1 transitions between two states at the (i−1)/f and i/f instants, and N_C is the total number of clock pulses applied during test application.

$WNTC = \sum_{i=1}^{N_C} \sum_{j=1}^{N_G} n_j(i) \cdot F_j$   (2)

In (2), the meaning of n_j(i) and N_C is the same as in (1), F_j is the fan-out factor of node j, and N_G is the total number of nodes in the component.

$WSA = \sum_{i=1}^{N_C} \sum_{j=1}^{N_G} n_j(i) \cdot C_j$   (3)
In (3), C_j is the normalized node capacity, while the meaning of the other symbols is the same as in (1) and (2).

1.2 Low Power Approaches

Two approaches to low power testing exist: the first is directed at reducing the dynamic portion of power consumption (switching power), while the second group of methodologies aims at reducing its static portion (leakage power). It is important to say that in older implementations the dynamic portion of power consumption was higher than the static one – for example, in [30] it is reported that the dynamic portion is about 90% of the total power consumption. In contrast, in 90 nm technology [7] the dynamic portion of power consumption is only 58% of the total (according to [8], 65 nm technology is seen as the technology in which static power consumption begins to prevail over the dynamic one). It is even more evident in technologies with higher levels of integration (32 nm, 25 nm), in which the static power consumption is much higher than the dynamic one [9]. Thus, to choose proper and effective optimizing procedures to decrease power consumption, the information about the target technology in which the design will be implemented becomes significant. In this paper, attention is paid to the reduction of the dynamic portion of power consumption.
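To make the metrics of Sec. 1.1 concrete, the following Python sketch (not part of the paper; the state representation is an assumption) computes NTC and WNTC from a sequence of circuit states recorded during test application:

def ntc(states):
    # NTC, Eq. (1): total number of 0<->1 transitions between consecutive
    # circuit states; `states` is a list of equal-length bit tuples.
    return sum(a != b
               for prev, curr in zip(states, states[1:])
               for a, b in zip(prev, curr))

def wntc(states, fanout):
    # WNTC, Eq. (2): as NTC, but every transition of node j is weighted by
    # its fan-out factor F_j, given here as fanout[j].
    return sum(fanout[j]
               for prev, curr in zip(states, states[1:])
               for j, (a, b) in enumerate(zip(prev, curr)) if a != b)

WSA from Eq. (3) is obtained in the same way by replacing the fan-out factors with the normalized node capacities C_j.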
1.3 Complexity of the Problem

Modern commercial tools are able to generate high quality sets of test vectors with a high degree of fault coverage, which are not usually optimized for reducing power consumption. Therefore, various methods were developed to optimize the sequences of test vectors to reduce switching activities during test application. In combinational circuits the responses depend only on the set of test vectors being applied; therefore, it is possible to reorganize their sequence. The responses will be the same, but their sequence will be different. It means that the sequence of test vectors can be reorganized with the goal of minimizing power consumption. It can also be stated that fault coverage is the same as with the original sequence of test vectors generated by the test generator. In sequential circuits the situation is different due to the fact that the responses do not depend only on the applied test vector but on those applied in previous steps as well. The test is generated by an SATPG (Sequential Automated Test Pattern Generator). If the sequence is modified in some way, then a completely different test will be gained with different fault coverage. If a scan register chain is inserted into the component, then for the test generation process the component will be seen as combinational and an ATPG (Automated Test Pattern Generator) can be used to generate the test. Fault coverage does not depend on the sequence of test vectors. The sequence of scan registers can be reorganized to reduce power consumption. The problem of identifying the proper sequence of test vectors/scan registers belongs to the category of NP-hard problems [5]; its time complexity is O(n!), where n is the number of elements whose sequence is to be optimized. To model both problems (i.e. the sequences of test vectors and scan registers) separate graph models are often used. To solve the problem, a minimal Hamiltonian path must be identified in the graph. Once found, it represents the solution of the problem, i.e. the sequence of test vectors for which the power consumption during test application is minimal. Many methods exist which utilize the above described approach. For example, in [7], the Hamming distance between test vectors is analyzed in order to optimize their sequence.

1.4 Problems Related to Power Dissipation Estimation

It can be concluded that power consumption during the application of test vectors is in some way associated with the Hamming distance between test vectors. However, examples can be found in which the results do not correlate. The switching activity is difficult to evaluate if the physical implementation of the component is not known. It can be shown that a change in one bit can cause higher switching activity than a change in several other bits (more than one bit). In [6], the problem of reordering scan registers in the scan chain is solved – a greedy search algorithm is used for this purpose. Methods combining BPIC (Best Primary Input Change time) approaches with test vector reordering can achieve an even higher reduction of power consumption. In [3], a method combining these two approaches is described – it uses simulated annealing to investigate the state space. These methods require a special approach for test application which reduces their use in commercial diagnostic tools. Typically, optimizing methods are used sequentially (e.g. the sequence of registers in scan chains is optimized first, and then the same is done for the sequence of test vectors).
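The Hamming-distance-based reordering mentioned above can be illustrated with a small sketch (ours, not taken from [6] or [7]); it greedily approximates the minimal Hamiltonian path over the test vectors:

def hamming(u, v):
    # Number of bit positions in which two equally long test vectors differ.
    return sum(a != b for a, b in zip(u, v))

def greedy_order(vectors):
    # Greedy nearest-neighbour reordering: repeatedly append the unused vector
    # closest (in Hamming distance) to the last one in the sequence.
    remaining = list(range(len(vectors)))
    order = [remaining.pop(0)]                  # start from the first vector
    while remaining:
        last = vectors[order[-1]]
        nearest = min(remaining, key=lambda i: hamming(vectors[i], last))
        remaining.remove(nearest)
        order.append(nearest)
    return order

As Sec. 1.4 points out, such a purely vector-based heuristic ignores the internal structure of the circuit, which is exactly the limitation the method proposed in this paper addresses.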
2 Motivation for the Research

We analyzed the state of the art of existing methodologies with the aim of optimizing power dissipation during test application. To solve this issue, it is necessary to reorganize the sequence of test vectors and the sequence of scan registers. The drawbacks of previously published methodologies can be summarized in the following way: 1) in previous approaches these two issues (the reorganization of the test vector sequence and of the scan register sequence) were solved separately; 2) the results of various methodologies are not evaluated on the platforms on which they will later be implemented. The previously published approaches have test vectors as the only input data to the methodology, without any information about the internal structure of the component under test through which the test vectors will propagate. The propagation of test vectors through the structure represents additional switching activity which can have a significant impact on power dissipation. Most methodologies are based on the evaluation of the Hamming distance between input test vectors, without any coupling with an implementation platform. As a result, the impact of test vector reorganization on power dissipation through switching activity reduction is rather difficult to evaluate precisely. We also see that both procedures (i.e. test vector reorganization and scan chain reorganization) are performed in sequence as two separate procedures in previous methodologies. Our approach is based on concurrent optimization of both procedures. For this purpose a genetic algorithm (GA) was used.
3 Proposed Optimization Method

For the purposes of the methodology, a formal model was developed. It is based on the theory of sets. The model reflects structural (primary interface of the CUA – Circuit Under Analysis, elements in the CUA, the ports of these elements, connections existing in the CUA), diagnostic (topology of scan chains, the list of test vectors and the sequence of their application), and electric (switching model, power consumption during switching) properties of the CUA. Algorithms were developed which operate on the formal model. As already mentioned, a GA was used to find the solution of the problem defined in this paper. In each step, candidate solutions are recognized (phenotypes) and encoded into genotypes which carry genetic information. Genetic operators are applied to genotypes. All solutions must satisfy the required quality. Therefore, principles of evaluating the quality of individual solutions must be defined. The quality evaluation is performed in several steps. First, the genotype is transformed into a phenotype. The quality of the particular phenotype is reflected by a real number; a special function is defined for this purpose. The principle of problem encoding allows one to encode both partial problems (the sequence of test vectors and the order of scan registers) into one structure. The structure is scalable and can encode this information for several CUAs and/or for CUAs containing several scan chains. This principle was used in the methodology whose goal is the identification of testable blocks in a CUA [10].
3.1 Encoding of the Problem

When a GA is used to find a solution of a particular problem, encoding plays an important role. An incorrect problem encoding can possibly bring about a GA malfunction. The goal of a GA application is to gain candidate solutions of a certain (i.e. required) quality, so principles of quantitative evaluation of individual solutions must be clearly defined. The quality evaluation is performed in several steps: 1) the conversion of genotype to phenotype (for this purpose the function Δ was defined); and 2) the fitness function Φ is used to evaluate phenotype quality by means of a real number – for details, see section 3.2. The principle of problem encoding that was applied here allowed us to encode the representation of both tasks in the genotype: the ordering of test vectors together with the ordering of scan registers in the scan chain. The encoding is based on partitioning the chromosome CH = (bi1, bi2, …, bil) into separate blocks; each block encodes the ordering of test vectors during test application or the ordering of scan registers in a scan chain. The number of blocks in the chromosome is equal to the number of CUAs plus the number of scan chains. Each block consists of one or more genes. As an example of the encoding, let a two-block chromosome (bi1, …, bik-1, bik, …, bil) be presented now, which is typical for circuits with one scan chain. The first block, represented by the (bi1, …, bik-1) sequence, reflects the ordering of test vectors to be applied to the CUA, while the second block, represented by the (bik, …, bil) sequence, reflects the ordering of scan registers in the scan chain. The values in blocks are encoded independently. A system of priorities is used in which a gene reflects the priority of either a test vector or a scan register. It must be possible to compare genes; therefore, comparison operators must be defined. By means of an ascending reordering of code sequences (performed separately in each block) we gain the order of entities in each block (the ordering of test vectors and the ordering of scan registers). To demonstrate the mechanism, see the examples in Tables 1 and 2 – a two-block chromosome with 3 test vectors v1, v2, v3 and 3 scan registers sc1, sc2, sc3 is assumed here.

Table 1. Impact of Chromosome Block Sequence on Test Vector Ordering

Ordered code sequences    Test vector ordering
bi1 ≤ bi2 ≤ bi3           v3 after v2 after v1
bi2 ≤ bi3 ≤ bi1           v1 after v3 after v2
bi1 ≤ bi3 ≤ bi2           v2 after v3 after v1
bi3 ≤ bi1 ≤ bi2           v2 after v1 after v3
bi3 ≤ bi2 ≤ bi1           v1 after v2 after v3
bi2 ≤ bi1 ≤ bi3           v3 after v1 after v2
In Table 1, all the alternatives of ordering the code sequences bi1, bi2, bi3, together with the corresponding ordering of test vectors for the chromosome, are demonstrated. The bi1 value determines the order of applying vector v1, the bi2 value determines the order of applying v2, and a similar relation holds between bi3 and the order of applying v3. A vector corresponding to a lower code value will be applied before a vector with a higher code value. For bi2 ≤ bi3 ≤ bi1, v2 will be applied first while v1 will be the last applied vector.
Table 2. Impact of Chromosome Block Sequence on Scan Register Ordering

Ordered code sequences    Scan register ordering
bi4 ≤ bi5 ≤ bi6           sc3 after sc2 after sc1
bi5 ≤ bi6 ≤ bi4           sc1 after sc3 after sc2
bi4 ≤ bi6 ≤ bi5           sc2 after sc3 after sc1
bi6 ≤ bi4 ≤ bi5           sc2 after sc1 after sc3
bi6 ≤ bi5 ≤ bi4           sc1 after sc2 after sc3
bi5 ≤ bi4 ≤ bi6           sc3 after sc1 after sc2
In Table 2, all alternatives of ordering the code sequences bi4, bi5, bi6, together with the corresponding ordering of the scan chain for the chromosome, are shown. The ordering reflects the values of bi4, bi5, bi6; the same mechanism as applied for test vectors is used to identify the sequence of scan registers (the comparison of the bi4, bi5, bi6 values). For a more detailed example of the encoding, see the following section (3.2).

3.2 Fitness Function

The fitness value is evaluated by the Φ function, which first transforms the genotype into a phenotype and then assesses it.

Algorithm Φ(TVS,SRS,CH,K,M)
01 Δ(CH,K,TA,SCS)
02 return 1/pwr(TVS,TA,SRS,SCS,SVS,M)
The phenotype reflects the test vector sequence and the organization of registers within scan chains. A particular solution is assigned a fitness value proportional to the power savings achieved during test application. Power consumption related to a solution is estimated during the test application simulation phase. At the input, Φ takes the following data: TVS – test vector sequence, SRS – scan register sequence, SVS – scan vectors set, CH – chromosome, K – sequence used to divide CH into blocks, M – metric utilized for power consumption evaluation, M ∈ {NTC, WNTC, WSA}. Φ returns a real number from the <0; 1> interval representing the fitness of CH. Φ works as follows: first (see line 01 of Φ's code), CH is transformed into separate TA (i.e., test vector application sequence) and SCS (ordering of registers in scan chains) sequences – the Δ function is utilized for this purpose. Both TA and SCS values are needed for the succeeding phase (line 02), in which the test application process is simulated – as a result, power consumption is quantified by the power consumption metric M. The fitness value is evaluated as the reciprocal of the value produced at the output of the simulation and, after the evaluation, is returned as the output of Φ.

Algorithm Δ(CH,K,TA,SCS)
01 TA := () ; TA init
02 SCS := () ; SCS init
03 KK := K ; save K into KK
04 k1 := 0 ; the starting index of the actual block in CH
05 k2 := l ; the starting index of the next block in CH (|CH|=l)
06 if len(KK)≠0 then ; at least one scan chain exists in CH
07   k2 := car KK ; prepare it for processing
08
09 KK := cdr KK ; remove the first block from KK
10 TA := sort(subset(CH, k1, k2)) ; test application sequence block
11
12 While len(KK)≠0 do ; actualize block pointers k1, k2
13   k1 := k2 ; prepare index of the next block
14   k2 := car KK
15   KK := cdr KK ; remove actual block from KK
16   SCS push sort(subset(CH, k1, k2)) ; add the result to SCS's end
17
18 if len(K)≠0 ; actualize block pointers k1, k2
19   k1 := k2
20   k2 := l ; k2 is the last one
21   SCS push sort(subset(CH, k1, k2)) ; add the result to SCS's end
The Δ function works as follows: after the initialization phase (rows 01 – 09), the first block is processed and the sequence of indexes representing particular test vectors is produced in row 10. In rows 12 to 16, the second to penultimate blocks are processed. If there are more than two blocks within CH, the last block is processed in rows 18 to 21. It should be noted that: 1) len returns the length of a given sequence; 2) car returns the first element within a given sequence; 3) cdr returns a given sequence, except the car; 4) subset returns the sequence of indexes (k1, …, k2-1)∈CH; and 5) sort sorts a given sequence of indexes in an ascending order. An example: suppose CH = (12, 2, 8, 10, 20, 11, 5, 9) and K = (3, 6); thus, CH is composed of 3 blocks:
• Block_#1 starts at index 0 and is completed at 2 (i.e., (car K)-1=3-1=2) of CH. In (12, 2, 8), it encodes an application sequence of 2-0+1=3 test vectors (v1, v2, v3). The succeeding blocks describe a way in which registers are organized in chains.
• Block_#2 starts at 3 and is completed at 5 (i.e., (car (cdr K))-1=6-1=5) of CH. In (10, 20, 11), it encodes the organization of 5-3+1=3 registers in the first scan chain (sc1,1, sc1,2, sc1,3).
• Block_#3 starts at 6 and is completed at 7 (i.e., len(CH)-1) of CH. In (5, 9), it encodes the organization of 7-6+1=2 registers in the 2nd scan chain (sc2,1, sc2,2).
Then Δ(CH, K, TA, SCS) implies
• TA = (2, 3, 1), that is to say test vectors will be applied in the order given by their indexes: (v2, v3, v1). The result was achieved by the following process: the smallest value within Block_#1 is 2, placed at index 1. This corresponds to vector v1+1=v2. Thus, v2 will be applied as the first one. The following higher number within Block_#1 is 8, placed at index 2. So, vector v2+1=v3 will be applied as the next one. 12 is the highest number within the block. It is placed at index 0, so v0+1=v1 will be applied as the last one.
• SCS = ((1, 3, 2), (1, 2)), i.e., the ordering of registers in the first scan chain will be (sc1,1, sc1,3, sc1,2), while in the second scan chain it will be (sc2,1, sc2,2).
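A compact re-implementation of this decoding step in Python (ours, for illustration only) reproduces the worked example above:

def decode(chromosome, k):
    # Split the chromosome into blocks at the indices in k, then order the
    # entities of each block by ascending gene value (1-based numbering, as in
    # the example above). The first block gives TA, the remaining blocks SCS.
    bounds = [0] + list(k) + [len(chromosome)]
    blocks = [chromosome[a:b] for a, b in zip(bounds, bounds[1:])]

    def order(block):
        return tuple(sorted(range(1, len(block) + 1), key=lambda p: block[p - 1]))

    ta = order(blocks[0])                       # test application sequence
    scs = tuple(order(b) for b in blocks[1:])   # register order per scan chain
    return ta, scs

# CH = (12, 2, 8, 10, 20, 11, 5, 9) and K = (3, 6) yield the result shown above.
assert decode((12, 2, 8, 10, 20, 11, 5, 9), (3, 6)) == ((2, 3, 1), ((1, 3, 2), (1, 2)))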
Fig. 1. Illustration of the chromosome encoding example
3.3 Selection Operators

In this section, two selection operators utilized in the methodology for crossover purposes are described in detail: roulette-wheel and tournament. If the roulette-wheel selection is applied, the probability of selecting an individual (p_CHi ∈ <0; 1>) is specified by the formula

$p_{CH_i} = \Phi(TVS,SRS,CH_i,K) \;/\; \sum_{j=1}^{\lambda} \Phi(TVS,SRS,CH_j,K)$   (4)
In the tournament selection case, k1 individuals are selected for a tournament competition. As a result, k2 ≤ k1 individuals are selected according to a predefined probability p. At the output, a set O of k2 individuals is produced, |O| = k2.

3.4 Initialization of the Population

In general, there are two ways in which the initial population of individuals can be created: 1) in a random way or 2) in an intelligent (non-random) way. In the second case, the population can be generated by means of data gained from a design tool which is able to produce both a test vector sequence and an ordering of registers in scan chains – though the sequences are not optimal from the power consumption point of view. The sequences can be transformed into chromosome form by means of the Δ-1 function (because of the limited space of the paper, its description is omitted). At least one chromosome should be generated in this way while the others can be generated randomly. If elitism is activated, it is guaranteed that the best solution found by the method will not be worse than the solution produced by the tool.
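A small sketch of the roulette-wheel operator from Eq. (4) (illustrative only; the fitness values are assumed to be the outputs of Φ):

import random

def roulette_select(population, fitnesses):
    # Pick one individual with probability proportional to its fitness, Eq. (4).
    total = sum(fitnesses)
    r = random.uniform(0.0, total)
    acc = 0.0
    for individual, fit in zip(population, fitnesses):
        acc += fit
        if r <= acc:
            return individual
    return population[-1]   # guard against floating-point rounding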
4 Experimental Results

Unless stated otherwise, experiments presented as a result of our research were performed on a PC equipped with two AMD Opteron 2220 dual core CPUs operating at 2.8 GHz.

Table 3. Relation between optimization type and search space size for a b15 circuit

                                      b15 solution space size (|TVS|=1297, |SC|=416)
Optimization                          Formula                Enumeration
Test vectors ordering only            len(TVS)!              1.44 × 10^3476
Organization of scan chains only      |SC|!                  3.84 × 10^910
Both in sequence                      len(TVS)! + |SC|!      ≈1.44 × 10^3476
Both in parallel                      len(TVS)! × |SC|!      5.54 × 10^4386
4.1 Problem Size

In Table 3, a search space size analysis is summarized for various optimizations related to a b15 circuit from the ITC99 benchmark set. The first column gives the type of optimization. The second column gives a general formula (i.e., valid for any circuit) that can be utilized to evaluate the search space size corresponding to the optimization, as a function of the circuit parameters. The last column gives the enumeration for b15. It is evident that the procedure is the most time consuming if both optimizations are performed in parallel.
Table 4. Times needed to explore b15 search space in an exhaustive way

Number of test vectors   Number of scan registers   Exploration time
8                        3                          17.9 minutes
9                        3                          2.7 hours
10                       3                          26.8 hours
12                       3                          147.4 days
15                       3                          1102.5 years
1297                     416                        4.71 × 10^4374 years
In Table 4, the times needed to explore the complete search space corresponding to various numbers of test vectors and scan registers are summarized. The values presented in the first three rows of the table were measured, while the values in the other rows were obtained by extrapolation from the data presented in Table 3. It is evident that the search space cannot be explored in a reasonable time (Table 4).

4.2 Impact of GA Parameters

During the experiments, the impact of genetic algorithm parameters on both the quality of the produced solution and the convergence speed was also investigated. For the experiments, a b02 circuit from the ITC99 benchmark set was used; the results are summarized below.
Fig. 2. Number of generations impact
Fig. 3. Population size impact
In Fig. 2 (as in Fig. 3), average reduction values gained over 10 GA runs are presented on the vertical axis for various numbers of generations (population sizes) utilized during the runs. In Fig. 2, the impact of a constant population size (100, 500, and 1000) is depicted. It is evident that the GA is able to converge relatively fast, so high-quality results (i.e., those with small r) can be produced during the first several hundred generations using a relatively small population size. Similarly, in Fig. 3 the impact of a constant number of generations (290, 600, and 1200) is shown. It can be seen that the reduction grows with the population size – after a few oscillations, the value becomes stable if a bigger population size (e.g., 3000 or more) is utilized. It can also be seen that the relation depends on the number of generations utilized.
4.3 Scalability of the Solution

For the experiment described below, a computational system composed of two 4-core Intel Xeon X5355 CPUs (i.e., 2×4 = 8 CPU cores in total) running at 2.66 GHz was utilized. The main goal of the experiment was to verify experimentally the scalability of the solved task on a real multiprocessor system. Execution times, speedups and overheads related to the multiprocessor environment are demonstrated in Fig. 4 and Fig. 5. In Fig. 4, the execution time and the speedup are shown as a function of the number of CPUs within the multiprocessor environment, while the corresponding overhead is presented in Fig. 5. Because execution times related to actions such as loading of dynamic libraries, circuit verification, generation of look-up tables utilized during simulation, initial simulation, etc. are included in the overhead, it is evident that the pure communication overhead will be less than or equal to the presented values.
Fig. 4. Speedup
Fig. 5. Parallel execution overhead
4.4 Comparison with Other Approaches

In this section, experimental results gained by our approach are compared with the results of other published methods. It should be noted that an objective comparison is difficult because the parameters of the methods differ a lot – e.g., the circuits analyzed by the methods are mapped onto various platforms, various test pattern generators with different settings are used, or the methods differ in the way they summarize the achieved results. Moreover, some data were not available for some methods, so it was impossible to guarantee equal input conditions for the experiments. In all experiments related to our optimization method, the circuits were mapped onto the AMI 0.5 um library by means of the Leonardo Spectrum tool. While a test vector set is the only input to the optimizing procedure for combinational circuits, the organization of registers in scan chains must also be taken into account if sequential circuits are processed (the circuits were modified to their full-scan versions by the DFTAdvisor tool; for simplification, one scan chain was utilized). Test vectors under the stuck-at fault model were generated by the Flextest tool. For each of the methods, mean values of the best results attained over 20 GA runs are presented.
Fig. 6. Results achieved and compared for ITC99 benchmarks
In Fig. 6, results gained for a subset of ITC99 benchmarks are presented and compared to method A [11], method B [12] and method C [13]. It is evident that in all cases power consumption was reduced more by our method than by the others. In Fig. 7, results gained for a subset of ISCAS85/89 benchmarks are presented and compared to method A [14], method B [13] and method C [6]. It can be seen that (except for the s27, s298, s641, s1488 and c7552 circuits) power consumption was reduced more by our method than by the others.
Fig. 7. Results achieved and compared for ISCAS85/89 benchmarks
5 Conclusions

In our research we analyzed methods used in modern approaches aimed at power consumption reduction. It was recognized that all previous approaches were based on a separate analysis of the test vector and scan chain sequences. Based on this finding, a methodology merging these two tasks was defined, developed and implemented. It was also decided to verify the results on an implementation platform instead of by means of the Hamming distance between input test vectors.
Valuable experimental results were gained which indicate that our approach is better than previous methodologies. Acknowledgements. This work was supported by the Grant Agency of the Czech Republic (GACR) No. 102/09/1668 – SoC circuits reliability and availability improvement, by Research Project MSM 0021630528 – Security-Oriented Research in Information Technology and the grant BUT FIT-S-10-1.
References
1. Raghunathan, A., Jha, N.K., Dey, S.: High-Level Power Analysis and Optimization, p. 175. Kluwer Academic Publishers, Boston (1998) ISBN 0-7923-8073-8
2. Roy, K., Prasad, S.C.: Low-Power CMOS VLSI Circuit Design, p. 359. Wiley-Interscience, Hoboken (2000) ISBN 0-471-11488-X
3. Nicolici, N., Al-Hashimi, B.M.: Power-Constrained Testing of VLSI Circuits, p. 178. Kluwer Academic Publishers, Dordrecht (2003) ISBN 1-4020-7235-X
4. Debjyoti, G., Swarup, B., Kaushik, R.: Multiple Scan Chain Design Technique for Power Reduction during Test Application in BIST. In: 18th IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems, pp. 191–198 (2003)
5. Dabholkar, V., Chakravarty, S., Pomeranz, I., et al.: Techniques for Minimizing Power Dissipation in Scan and Combinational Circuits During Test Application. IEEE Trans. on Computer-Aided Design of Integrated Circuits 17(12), 1325–1333 (1998)
6. Chakravarty, S., Dabholkar, V.: Minimizing Power Dissipation in Scan Circuits During Test Application. In: Proceedings of International Workshop on Low-Power Design (1994)
7. Pangrle, B., Kapoor, S.: Leakage power at 90nm and below [on-line]. EE Times Asia (2005), http://www.eetasia.com/ARTICLES/2005JUN/B/2005JUN01_POW_EDA_TA.pdf
8. Thompson, S., Packan, P., Bohr, M.: MOS Scaling: Transistor Challenges for the 21st Century. Intel Technology Journal 19 (1998)
9. Marongiu, A., et al.: Analysis of Power Management Strategies for a Large-Scale SoC Platform in 65nm Technology. In: Proceedings of the 11th Euromicro Conference on Digital System Design: Architectures, Methods and Tools, pp. 259–266 (2008)
10. Vranken, H., Waayers, T., Fleury, H., Lelouvier, D.: Enhanced Reduced-Pin-Count Test For Full-Scan Design. In: Proceedings of IEEE International Test Conference, pp. 738–747 (2001)
11. Almukhaizim, S., Makris, Y., Yang, Y.-S., Veneris, A.: Seamless Integration of SER in Rewiring-Based Design Space Exploration. In: Proceedings of International Test Conference, pp. 1–9 (2006)
12. Babighian, P., Kamhi, G., Vardi, M.: PowerQuest: Trace Driven Data Mining for Power Optimization. In: Proceedings of Design, Automation & Test in Europe Conference & Exhibition, pp. 1–6 (2007)
13. Girard, P., Landrault, C., Pravossoudovitch, S.: Reducing Power Consumption During Test Application by Test Vector Ordering. In: Proceedings of the IEEE International Symposium on Circuits and Systems, pp. 296–299. IEEE Computer Society, Los Alamitos (1998)
14. Jelodar, M.S., Aavani, A.: Reducing Scan Base Testing Power Using Genetic Algorithm. In: Proc. of 11th Iranian Computer Engineering Conference, vol. 2, pp. 308–312 (2006)
Designing Combinational Circuits with an Evolutionary Algorithm Based on the Repair Technique* Houjun Liang1,2, Wenjian Luo1,2, Zhifang Li1,2, and Xufa Wang1,2 1
Nature Inspired Computation and Applications Laboratory, School of Computer Science and Technology, University of Science and Technology of China, Hefei 230027, Anhui, China 2 Anhui Key Laboratory of Software in Computing and Communication, University of Science and Technology of China, Hefei 230027, Anhui, China {ahlhj,zhifangl}@mail.ustc.edu.cn, {wjluo,xfwang}@ustc.edu.cn

* This paper is based on Houjun Liang's PhD dissertation (in Chinese) in the School of Computer Science and Technology at the University of Science and Technology of China in May, 2009.
Abstract. Evolutionary Algorithms are often expected to design combinational circuits with fault-tolerant and self-repair abilities. However, the repair idea has never been adopted during the evolutionary design of combinational circuits. In this paper, an evolutionary algorithm based on the repair technique, called the EA-Repair, is proposed. Different from existing algorithms, the EA-Repair first evolves an almost, but not completely, correct circuit with an evolutionary algorithm, and then generates the completely correct circuit with the repair technique. The experimental results demonstrate the efficiency of the proposed algorithm. Keywords: evolvable hardware, evolutionary algorithm, combinational logic circuits, repair.
1 Introduction

The EHW (Evolvable Hardware) is a novel technique which can automatically reconfigure the structure of hardware (e.g. an FPGA) to adapt to a dynamic environment with Evolutionary Algorithms (EAs) [1]. Although the ultimate objective of the EHW is to design self-adaptive hardware and systems, recently the EHW is often viewed as an alternative circuit design technique. So far, the circuits that have been evolved in the EHW community are relatively small. The scalability problem is one of the most important problems in the EHW community [1-4]. In order to overcome the scalability problem, several approaches have been proposed in recent years. Typically, Torresen [5] introduced the divide-and-conquer method and with this method he designed circuits for number recognition. Stomeo and Kalganova put forward the BIE (Bidirectional Incremental Evolution) [6] and the GDD (Generalized Disjunction Decomposition) [7], which are effective for the evolutionary design of combinational logic circuits. Higuchi and his colleagues [8, 9] introduced the concept of function-level evolution, which can
reduce the search space greatly by using high-level functions as the building blocks, such as multipliers, dividers, shifters, etc. All the above four methods are designed for digital combinational circuits. Meanwhile, the EHW technique has also been applied to analog circuits [10]. Currently, the divide-and-conquer method is the best way to improve scalability. For example, with the GDD [7], the 17-bit parity circuit, the six-bit multiplier and the alu4 circuit (14 inputs and eight outputs) have been evolved. They are among the most complex circuits evolved so far. The key idea of the divide-and-conquer method is to decompose a large circuit into small subcircuits, and to generate the subcircuits with evolutionary algorithms. However, if the evolvability of the evolutionary algorithms is too weak, a large circuit has to be decomposed into too many small subcircuits. Therefore, to avoid excessive decomposition, it is still very important to improve the performance of evolutionary algorithms. When EAs are used to evolve combinational circuits, the fitness of candidates often increases quickly at the initial stage, but stalls after some evolutionary generations. This phenomenon is called the Stalling effect, as pointed out by Stomeo and Kalganova in [7]. The Stalling effect [7] reflects the characteristics of the evolutionary design of combinational logic circuits. Unfortunately, so far, there is no effective way to overcome this problem. Meanwhile, Evolutionary Algorithms are often expected to generate circuits with fault-tolerant and self-repair abilities. However, so far, the repair idea has never been adopted during the evolutionary design of combinational logic circuits. In this paper, an evolutionary algorithm based on the repair technique, i.e. the EA-Repair, is proposed to enhance the evolvability of EAs. With the EA-Repair, the evolutionary search itself does not need to find a 100% correct solution. Instead, the EA-Repair first generates an almost, but not completely, correct circuit with an evolutionary algorithm, and then generates a completely correct circuit with the repair technique when the Stalling effect occurs. The rest of this paper is organized as follows. Section 2 introduces the repair technique for combinational logic circuits that are not completely correct. Section 3 introduces the EA-Repair algorithm. Experimental results are given in Section 4. Section 5 gives some discussions, and Section 6 briefly summarizes the whole paper.
2 The Repair Technique

In this section, firstly, the Stalling effect [7] in the EHW field is discussed. Secondly, the detail of the repair technique is described.

2.1 The Stalling Effect

The Stalling effect [7] is a common phenomenon when EAs are used to solve complex problems: the fitness increases rapidly at the initial stage, but rarely goes up later. The Stalling effect in the evolutionary design of combinational logic circuits was first reported by Stomeo and his colleagues in [7]. It is a difficult problem in the EHW field. Fig. 1 demonstrates an example of the Stalling effect for a 3×3 multiplier in one run. As for the experimental results in Fig. 1, the adopted algorithm is a (4, 128) ES
(Evolution Strategy), the chromosome encoding is based on the CGP model, and the mutation rate is 5%. The gate array used in this experiment is a 1×100 array. The fitness evaluation method, the detailed chromosome encoding and the mutation operator are the same as those in [2]. It is noted that the fitness in Fig. 1 is that of the best individual in each generation and has been normalized. From Fig. 1, it can be observed that the best fitness increases very rapidly at first, but there are no remarkable changes during a huge number of generations once the best fitness is close to the maximal fitness. It is noted that a satisfactory solution is still not found when the evolutionary generation count reaches 75,000. Therefore, most evolutionary generations contribute nothing after the best fitness is larger than 95%.

Fig. 1. The Stalling effect phenomenon of the 3×3 multiplier

Generally, the fitness values of the candidates increase very quickly at first and very slowly once the fitness reaches some relatively good values. Therefore, it is easy for EAs to design a partially correct circuit. But the evolutionary process should stop when the Stalling effect occurs. At that time, a repair process starts to turn the partially correct circuit into a completely correct one.

2.2 The Principle of the Repair Technique

To repair an incompletely correct circuit into a completely correct circuit, a repair component should be designed and combined with the circuit evolved by EAs. The repair technique is based on the XOR (exclusive-or) operation, which has the following characteristics:

$f(x_1 x_2 \cdots x_n) \oplus 0 = f(x_1 x_2 \cdots x_n)$
$f(x_1 x_2 \cdots x_n) \oplus 1 = \overline{f(x_1 x_2 \cdots x_n)}$   (1)
In formula (1), f(x1 x2 ··· xn) is any Boolean expression of n variables. The XOR of “0” and any Boolean expression is equal to the Boolean expression itself, and the XOR of “1” and any Boolean expression is equal to the inverse of the Boolean expression. For convenience, an example is given to illustrate how to design the repair component. This example is given in Table 1, and its output is only one bit.
Table 1. The truth table of the circuit as an example

Input   Output      Input   Output
000     0           100     1
001     0           101     0
010     1           110     0
011     1           111     1
Two different cases are discussed as follows. One is that a circuit is incorrect only for one input, and the other is that a circuit is incorrect for more than one input.
(1) A circuit is incorrect only for one input. Suppose a candidate solution, which is generated by EAs for the circuit in Table 1, cannot generate a correct output for the input “101”. That is to say, its output is “1”, but the expected output is “0”. Meanwhile, for the other 7 inputs, its outputs are correct. Fig. 2 gives the repair component for a circuit that is incorrect only for one input. The repair component is composed of a “Block_AND” block and an XOR gate. When the input is “101”, the output of “Block_AND” is “1”, and the output of the circuit evolved (i.e. “The original output” in Fig. 2) is also “1”. The output of the XOR gate is “0”, i.e. the output of the whole circuit is “0”. Therefore, the error of the circuit can be corrected by the repair component.
Fig. 2. The repair component for a circuit that is incorrect only for one input
As for the other 7 inputs, the output of “Block_AND” is always “0”. Therefore, the output of the XOR gate will be the same as “The original output”. To sum up, the combination of the evolved circuit and the repair component generates correct outputs for the circuit given in Table 1.
(2) A circuit is incorrect for more than one input. When a circuit is incorrect for more than one input, an example of the corresponding repair component is shown in Fig. 3. In Fig. 3, the output of the evolved circuit is incorrect only for three input-output combinations, i.e. “000---1”, “001---1” and “011---0”. For this example, the repair component is composed of “Block_AND1”, “Block_AND2”, “Block_AND3”, “Block_OR” and an XOR gate. Compared with the former example, a new subcircuit “Block_OR” is added to the repair component.
Fig. 3. The repair component for a circuit that is incorrect for more than one input
When the input is “000”, the output of the evolved circuit is “0”, while the expected value is “1”. The output of “Block_AND1” is “1”. Thus, no matter what the outputs of “Block_AND2” and “Block_AND3” are, the output of “Block_OR” will be “1”. Consequently, considering the function of the XOR gate, the circuit in Fig. 3 will output “1” for the input “000”. Similarly, when the input is “001” or “011”, the repair component can also correct the wrong output of the evolved circuit. As for the other five inputs, the outputs of “Block_AND1”, “Block_AND2” and “Block_AND3” are always “0”, and therefore the output of “Block_OR” will be “0”. Thus, from Fig. 3, it can be observed that the partially correct circuit and the repair component constitute a fully correct circuit. As an example, the above repair component is given for a circuit with a one-bit output. For a circuit with multiple output bits, the repair component can easily be designed by repairing each output bit with the above technique.

2.3 Gates Used in the Repair Component

Generally, for a circuit with In inputs and Out outputs, if there are OutRepair output bits with errors, the number of gates used in the repair component can be calculated by formula (2):
$\sum_{1 \le i \le OutRepair} \Big( \sum_{1 \le j \le Num\_Inp_i} Num\_Zero_{ij} + Num\_Inp_i \times (In - 1) + (Num\_Inp_i - 1) + 1 \Big)$   (2)
In formula (2), Num_Inp_i (1≤i≤OutRepair) means the number of errors of the i-th output, and Num_Zero_ij (1≤i≤OutRepair, 1≤j≤Num_Inp_i) denotes the number of “0” bits in the j-th erroneous input combination of the i-th output; each such bit requires an inverter at the input of the corresponding “Block_AND”.
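The behaviour of the repair scheme can be sketched in a few lines of Python (ours, not the authors' implementation; it models the Block_AND / Block_OR / XOR structure functionally rather than gate by gate):

def make_repaired(evolved, wrong_inputs):
    # evolved: a function mapping an input bit-tuple to a single output bit.
    # wrong_inputs: the input patterns for which the evolved output is wrong.
    wrong = set(wrong_inputs)

    def repaired(bits):
        # The Block_AND detectors and Block_OR collapse to a membership test here;
        # the final XOR gate inverts the evolved output exactly for those patterns.
        detect = 1 if tuple(bits) in wrong else 0
        return evolved(bits) ^ detect
    return repaired

For the single-error example of Fig. 2, wrong_inputs would contain only the pattern (1, 0, 1).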
3 The Evolutionary Algorithm Based on the Repair Technique

The adopted EA is a (1+λ) Evolution Strategy (ES), and the candidate circuit layout is based on the CGP model [7, 11]. The parameter λ is the offspring population size.
The evolutionary algorithm based on the repair technique, i.e. the EA-Repair, is given in Fig. 4.
Fig. 4. The flowchart of the EA-Repair algorithm
As shown in Fig. 4, the EA-Repair consists of two phases:

(1) Design a circuit with the (1+λ) ES. The ES is used to synthesize the circuit. When the stalling effect occurs, the evolution is terminated and a partially correct circuit is obtained.

(2) Repair the partially correct circuit. The repair technique proposed in Section 2 is used to repair the circuit.

In this paper, when the best fitness reaches 95% of the maximal fitness value, the evolutionary process terminates and the repair process is activated.
4 Experiments

To test the efficiency of the EA-Repair, experiments on multipliers and adders are conducted. The EA is a (1+4) ES. Since the parameter settings of the ES are not the primary focus of this paper, the fitness evaluation method, the detailed chromosome encoding and the mutation operator are the same as those in [2]. The mutation rate is 5%. The experiments are run on a desktop PC with a dual-core Intel E7400 at 2.8 GHz and 2 GB of RAM. The program is written in C. For comparison, the (1+4) ES without any repair technique is also run.

4.1 Multiplier

The "3×3 multiplier" is taken as an example, and the candidate circuit layout is 1×100. Each experiment is run independently 20 times. The experimental results are given in Table 2. In Table 2, "Ave. Gen." means the average number of evolutionary generations, "Ave. Time" means the average time cost in seconds,
Table 2. The results of both EA and EA-Repair for the multiplier circuit

Circuit | EA: Ave. Gen.     | EA: Ave. Time (s) | EA: Ave. Gate | EA-Repair: Ave. Gen. | EA-Repair: Ave. Time (s) | EA-Repair: Ave. Gate
3×3 mul | 902,628 (911,589) | 425               | 45.40 (3.36)  | 21,386 (23,912)      | 17                       | 187 (19)
and "Ave. Gate" means the average number of gates used in the final solution. The data in parentheses are standard deviations. From Table 2, it can be observed that, compared with the traditional EA, i.e. the (1+4) ES, the EA-Repair needs far fewer generations and much less time. However, the EA-Repair uses more gates because it needs a repair component.

4.2 Adder

Since adders are easier to evolve than multipliers, besides the "3×3 adder", the "3×4 adder", "4×4 adder" and "4×5 adder" are also tested. The candidate circuit layout is 1×240. Except for the "4×5 adder", each experiment is run independently 20 times; the experiments on the "4×5 adder" are run independently 5 times. From Table 3, it can be observed that the EA-Repair performs much better than the traditional EA in terms of average evolutionary generations and time cost. The disadvantage of the EA-Repair is that more gate resources are needed. For the "4×5 adder", a total of 452,446 generations and 74 gates are needed by the traditional EA, while only 31,260 generations are needed by the EA-Repair, whose solutions use about 3,626 gates.

Table 3. The results of both EA and EA-Repair for the adder

Circuit   | EA: Ave. Gen.     | EA: Ave. Time (s) | EA: Ave. Gate | EA-Repair: Ave. Gen. | EA-Repair: Ave. Time (s) | EA-Repair: Ave. Gate
3×3 Adder | 71,542 (65,257)   | 83                | 61 (11)       | 2,608 (1,779)        | 2.8                      | 239 (54)
3×4 Adder | 99,352 (60,905)   | 210               | 56 (12)       | 12,517 (4,719)       | 26                       | 703 (280)
4×4 Adder | 181,413 (176,078) | 753               | 68 (9)        | 9,447 (5,139)        | 39                       | 1,648 (662)
4×5 Adder | 452,446 (249,112) | 3,896             | 74 (8)        | 31,260 (14,301)      | 262                      | 3,626 (553)
5 Discussion

The evolutionary design of circuits currently faces a scalability problem. Exploiting the stalling effect, the proposed EA-Repair algorithm switches to a repair process after the initial evolutionary process. In order to repair the partially correct circuit obtained by the evolutionary algorithm, a simple and regular approach to designing the additional repair component is given in this paper. The repair component is combined with the (partially incorrect) circuit generated by the evolutionary algorithm to constitute a 100% functionally correct circuit.
The EA-Repair algorithm uses more gates than the traditional EA. However, it is a good approach when the evolved candidate circuit has only a few incorrect outputs: for example, the repair component in Fig. 2 costs only 4 gates, and the repair component in Fig. 3 costs only 15 gates. In this paper, the repair technique is designed for combinational logic circuits; in fact, it can also be used for designing sequential circuits. For other circuits, such as analog circuits and digital filters, how to design the repair components remains an open problem and more work is needed. Additionally, if the repair technique is combined with decomposition methods such as the GDD [7], larger circuits can be evolved.
6 Conclusion

Based on the repair technique, an efficient evolutionary design algorithm for combinational logic circuits is proposed in this paper. The experimental results demonstrate that the proposed algorithm is effective and clearly accelerates the design process. The primary contribution of this paper is the first introduction of the repair technique into the evolutionary design of combinational logic circuits. In the future, more work should be done to analyze the performance of the EA-Repair algorithm and to design more effective repair techniques that need fewer gate resources.

Acknowledgements. This work is partly supported by the National Natural Science Foundation of China (No. 60404004), and the 2006-2007 Excellent Young and Middle-aged Academic Leader Training Program of Anhui Province Research Experiment Bases.
References

1. Yao, X., Higuchi, T.: Promises and Challenges of Evolvable Hardware. IEEE Transactions on Systems, Man and Cybernetics-Part C: Applications and Reviews 29(1), 87–97 (1999)
2. Liang, H., Luo, W., Wang, X.: A Three-Step Decomposition Method for the Evolutionary Design of Sequential Logic Circuits. Genetic Programming and Evolvable Machines 10(3), 231–262 (2009)
3. Stoica, A., Zebulum, R., Keymeulen, D., Ferguson, M.I., Guo, X.: Scalability Issues in Evolutionary Synthesis of Electronic Circuits: Lessons Learned and Challenges Ahead. American Association for Artificial Intelligence (2003)
4. Vassilev, V.K., Miller, J.F.: Scalability Problems of Digital Circuit Evolution - Evolvability and Efficient Designs. In: Proceedings of the 2nd NASA/DoD Workshop on Evolvable Hardware, Los Alamitos, CA, pp. 55–64 (2000)
5. Torresen, J.: A Divide-and-Conquer Approach to Evolvable Hardware. In: Sipper, M., Mange, D., Pérez-Uribe, A. (eds.) ICES 1998. LNCS, vol. 1478, pp. 57–65. Springer, Heidelberg (1998)
6. Kalganova, T.: Bidirectional Incremental Evolution in Extrinsic Evolvable Hardware. In: Proceedings of the Second NASA/DoD Workshop on Evolvable Hardware (EH 2000), Palo Alto, CA, USA, pp. 65–74. IEEE Computer Society, Los Alamitos (2000)
7. Stomeo, E., Kalganova, T., Lambert, C.: Generalized Disjunction Decomposition for Evolvable Hardware. IEEE Transactions on Systems, Man and Cybernetics, Part B 36(5), 1024–1043 (2006)
8. Higuchi, T., Murakawa, M., Iwata, M., Kajitani, I., Liu, W., Salami, M.: Evolvable Hardware at Function Level. In: Proceedings of the IEEE International Conference on Evolutionary Computation, pp. 187–192. IEEE, Los Alamitos (1997)
9. Higuchi, T., Iwata, M., Kajitani, I., Murakawa, M., Yoshizawa, S., Furuya, T.: Hardware Evolution at Gate and Function Levels. In: Proceedings of Biologically Inspired Autonomous Systems: Computation, Cognition and Action, Durham, North Carolina (1996)
10. Gallagher, J.C.: The Once and Future Analog Alternative: Evolvable Hardware and Analog Computation. In: 2003 NASA/DoD Conference on Evolvable Hardware, pp. 43–49 (2003)
11. Sekanina, L.: Evolutionary Design of Gate-Level Polymorphic Digital Circuits. In: Rothlauf, F., Branke, J., Cagnoni, S., Corne, D.W., Drechsler, R., Jin, Y., Machado, P., Marchiori, E., Romero, J., Smith, G.D., Squillero, G. (eds.) EvoWorkshops 2005. LNCS, vol. 3449, pp. 185–194. Springer, Heidelberg (2005)
Bio-inspired Self-testing Configurable Circuits

André Stauffer and Joël Rossier

École polytechnique fédérale de Lausanne (EPFL), Logic Systems Laboratory, CH-1015 Lausanne, Switzerland
Tel.: (+41 21) 693 26 52
[email protected]
Abstract. Inspired by the basic processes of molecular biology, our studies resulted in defining a configurable molecule implementing mechanisms made up of simple processes. The goal of our paper is to demonstrate how these bio-inspired mechanisms and their underlying processes perform on cellular architectures. The hardware description of the molecule with all its bio-inspired mechanisms leads to the simulation of an image processing array and an arithmetic and logic unit.
1 Introduction
Borrowing the structural principles from living organisms, we have already shown how to grow cellular systems [3]. These cellular systems are endowed with bioinspired properties like configuration, cloning, cicatrization, and regeneration. In a previous work [5], the configuration mechanisms (structural and functional growth), the cloning mechanisms (cellular and organismic self-replication), the cicatrization mechanism (cellular self-repair), and the regeneration mechanism (organismic self-repair) were devised as the result of simple processes like growth, load, branching, repair, reset, and kill. The goal of this paper is to demonstrate how they perform on circuits made up of self-testing configurable molecules. Starting with the cellular architecture of the configurable circuits, Section 2 will point out how the bio-inspired properties like cloning, cicatrization, and regeneration apply to these kind of circuits. We define then the detailed molecular architecture of the configurable circuits (Section 3) and introduce digital simulations, based on their hardware description, to illustrate the bio-inspired mechanisms (Section 4). Section 5 applies the bio-inspired mechanisms in the building and maintaining of an image processing array and an arithmetic and logic unit. A brief conclusion (Section 6) summarizes our paper and opens new research avenues.
2 Bio-inspired Properties

2.1 Cellular Architecture
Configurable circuits are frequently made up of identical processing elements [1]. They can be seen as multicellular organisms made up of identical cells.
Fig. 1. Minimal architectures. (a) Cell. (b) Organism. (c) Population.
Fig. 2. Data input selection. (a) Northward. (b) Eastward. (c) Southward. (d) Westward. Molecular modes. (e) Living. (f) Spare. (g) Faulty. (h) Repair. (i) Dead. Molecular types. (j) Internal. (k) Top. (l) Top-left. (m) Left. (n) Bottom-left. (o) Bottom. (p) Bottom-right. (q) Right. (r) Top-right.
Each element processes at least one data bit and corresponds to a cell made up of functionally configurable molecules. The minimal cell consists of two rows of three molecules, with two columns of application-specific molecules to the left and one column of spare molecules to the right (Fig. 1a). The corresponding molecular modes are shown in Fig. 2e-i. Fig. 2j-r represents the molecular types defining the borders of the cell. The minimal multicellular organism is made up of two identical cells and represents a processing column computing at least two data bits (Fig. 1b). Such an organism allows self-repair at the cellular level. The minimal population of organisms is made up of three organisms, with two columns of living cells dedicated to the application specifications to the left and one column of spare cells to the right (Fig. 1c). Such a population computes at least four data bits and allows self-repair at the organismic level.
2.2 Cloning
The cloning or self-replication mechanism can be implemented at the cellular level in order to build a multicellular organism and at the organismic level in order to generate a population of organisms. The cloning of the minimal cell displayed in Fig. 1a thus results in the organism of Fig. 1b. The cloning of this organism defines the population of Fig. 1c.
2.3 Cicatrization
The introduction into the cells of the minimal organism of one column of spare molecules (Fig. 1b), defined by a specific structural configuration, together with the automatic detection of faulty molecules, allows cicatrization, i.e. self-repair at the cellular level.
Fig. 3. Self-repair. (a) Cicatrization of the organism. (b) Regeneration of the population.
Each faulty molecule is deactivated, isolated from the network, and replaced by the nearest molecule to its right, which is itself replaced by its right neighbor, and so on until a spare molecule is reached (Fig. 3a). The number of faulty molecules handled by the cicatrization mechanism is necessarily limited: in the example of Fig. 1b, we tolerate at most one faulty molecule per row.
2.4 Regeneration
In order to implement regeneration, that is self-repair at the organismic level, we need at least one spare organism to the right of the original population of organisms (Fig. 1c). The existence of two faulty molecules in the same row of a given cell identifies the faulty organism which is deactivated (Fig. 3b). The functionality of the configurable circuit is now performed to some extent by the spare cells of the organism to the right.
3 Configurable Molecule

3.1 Configuration Layer
Each molecule of the bio-inspired circuits is made up of a configuration layer and an application layer. The configuration layer, which implements the bio-inspired mechanisms as well as their constituent processes [5], results from the interconnection of the following resources (Fig. 4a): (1) an input multiplexer DIMUX, selecting one of the four northward NDI, eastward EDI, southward SDI or westward WDI configuration input data (Fig. 5); (2) a 2-level stack organized as 18 genotypic registers R1 to R18 (for mobile configuration data) and 18 phenotypic registers R19 to R36 (for fixed configuration data); (3) an output buffer DOBUF producing the configuration output data DO; (4) an encoder ENC for the northward NSI, eastward ESI, southward SSI, and westward WSI input signals; (5) a decoder DEC defining the mode and the type of the molecule; (6) a register I for the memorization of the input selection (Fig. 2a-d); (7) a register S for the transmission of the signals; (8) a register M for the molecular modes (Fig. 2e-i); (9) a register T for the molecular types (Fig. 2j-r); (10) a generator GEN producing the northward NSO, eastward ESO, southward SSO, and westward WSO output signals.
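As a reading aid, the register set just listed can be summarized in a small C data model; the type and field names below are illustrative only and do not come from the authors' VHDL description.

    #include <stdio.h>

    /* Illustrative C model of the configuration-layer state of one molecule. */
    typedef enum { LIVING, SPARE, FAULTY, REPAIR, DEAD } MolecularMode;
    typedef enum { INTERNAL, TOP, TOP_LEFT, LEFT, BOTTOM_LEFT,
                   BOTTOM, BOTTOM_RIGHT, RIGHT, TOP_RIGHT } MolecularType;

    typedef struct {
        unsigned char genotype[18];    /* R1..R18: mobile configuration data     */
        unsigned char phenotype[18];   /* R19..R36: fixed configuration data     */
        unsigned char input_select;    /* register I: N/E/S/W data input choice  */
        unsigned char signals;         /* register S: transmitted signals        */
        MolecularMode mode;            /* register M                             */
        MolecularType type;            /* register T                             */
    } ConfigurationLayer;

    int main(void)
    {
        ConfigurationLayer m = { .mode = SPARE, .type = TOP_RIGHT };
        printf("molecule mode=%d type=%d, stack holds %zu + %zu words\n",
               m.mode, m.type, sizeof m.genotype, sizeof m.phenotype);
        return 0;
    }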
3.2 Application Layer
The application layer, which implements the logic design of the application under development as well as its routing connections between neighboring and distant molecules, results from the interconnection of the following resources (Fig. 4b): (1) an input multiplexer AIMUX, selecting four inputs out of the four northward NAI, eastward EAI, southward SAI, westward WAI application data and the routing data RO; (2) a 16-bit look-up table LUT; (3) a D-type flip-flop DFF for the realization of sequential circuits; (4) an output multiplexer AOMUX selecting the combinational or the sequential data as application output AO; (5) an output multiplexer ROMUX selecting the five outputs NRO, ERO, SRO, WRO, and RO out of the four northward NRI, eastward ERI, southward SRI, westward WRI routing input data and the application output data AO.
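The behavior of this layer can be sketched in a few lines of C: four selected input bits address the 16-bit LUT, and the output multiplexer picks either the combinational value or the one stored in the flip-flop. The model below is a deliberately simplified stand-in (single output, no routing multiplexer) with hypothetical names.

    #include <stdio.h>

    /* Behavioral sketch of the application layer of one molecule.              */
    typedef struct {
        unsigned short lut;   /* F3:0 truth table, one bit per input combination */
        int use_dff;          /* AOMUX selection: 0 = combinational, 1 = sequential */
        int dff;              /* D-type flip-flop state                          */
    } ApplicationLayer;

    static int molecule_step(ApplicationLayer *m, int in0, int in1, int in2, int in3)
    {
        int addr = (in3 << 3) | (in2 << 2) | (in1 << 1) | in0;
        int comb = (m->lut >> addr) & 1;        /* LUT output                    */
        int ao   = m->use_dff ? m->dff : comb;  /* application output AO         */
        m->dff   = comb;                        /* DFF captures the LUT value    */
        return ao;
    }

    int main(void)
    {
        ApplicationLayer m = { .lut = 0x8000, .use_dff = 0, .dff = 0 }; /* 4-input AND */
        printf("AND(1,1,1,1) = %d\n", molecule_step(&m, 1, 1, 1, 1));
        printf("AND(1,0,1,1) = %d\n", molecule_step(&m, 1, 0, 1, 1));
        return 0;
    }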
Using the VHDL language at the register transfer level, we have realized the hardware description of the configuration layer and the application layer of the configurable molecule. These descriptions define an intellectual property block (IP) which can be synthesized and implemented in any integrated circuit. They lead also to the hardware simulations of the basic mechanisms and the applications presented in the following sections.
4 4.1
Bio-inspired Mechanisms Configuration Test
Performed on a given array of molecules, the purpose of the configuration test mechanism is to kill all the columns of molecules having at least a faulty one
A molecule is faulty when the shift operation performed by its configuration registers, the genotypic registers R1 to R18 as well as the phenotypic registers R19 to R36 (Fig. 4a), presents an incorrect behavior. In order to check these registers, we define a test configuration string made up of a test connect flag TC followed by 16 empty data and a test pattern TP (Fig. 6). This test configuration string must be applied twice for each molecule, that is 2 × (CA + RA − 1) for an array of CA columns by RA rows.
Fig. 5. Configuration data
Fig. 6. Test configuration string
During the configuration, the detection of errors related to a malfunction of the registers R1 to R18 is achieved on mobile data. The resulting error signal ME is a function of the data DI delivered by the input multiplexer DIMUX as well as of the data shifted into the registers R1 and R18:

ME = (DI = TC) · (R1 = TP) ⊕ (R18 = TC)

On the other hand, the detection of errors related to a malfunction of the registers R19 to R36 is realized on fixed data. The corresponding error signal FE depends on the data stored in the registers R19 and R36:

FE = (R19 = TP) ⊕ (R36 = TC)

As long as the registers R1 and R19 of the stack perform correctly, the signals ME and FE report the malfunction of any other of its registers R2 to R18 and R20 to R36.
Fig. 7. Configuration test mechanism
The proposed configuration test mechanism is therefore made up of a growth process, followed by a kill process for each detected error, and finally a reset process. Executed using growth signals and according to the predefined test configuration string, the growth process starts building tree-shaped datapaths all over the array until a faulty molecule is detected (Fig. 7a-b). The building of the datapaths resumes after the death of the second left column of molecules (Fig. 7e-g). As soon as a malfunction of its configuration registers occurs, a molecule enters the dead mode and sends kill signals northward and southward in order to trigger the death of the whole column of molecules. Fig. 7c-d illustrates the kill process involved in the configuration test mechanism for the incorrect behavior of the second lower left molecule. At the end of the growth process, all the molecules of any column containing at least one faulty molecule are dead. Performed on the array resulting from the malfunction of the second lower left molecule, the reset process starts from the lower left molecule and propagates reset signals eastward and northward in order to destroy the datapaths built among the healthy molecules (Fig. 7h-k). This tissue, now comprising one column of dead molecules, is ready to be configured.
4.2 Structural Configuration
The goal of the structural configuration mechanism is to define the boundaries of the cell as well as the living or spare mode of its constituent molecules. This mechanism is made up of a growth process followed by a load process. The growth process starts when an external growth signal is applied to the lower left molecule of the cell. This molecule selects the eastward data input (Fig. 8a) and, according to the first flag data of the structural configuration string or structural genome (Fig. 9), generates a northward-oriented growth signal. Depending on the other flag data of the string, each molecule of the cell then successively chooses an input and produces an internal growth signal in order to create a datapath among the molecules of the cell (Fig. 8a-f). For each molecule, the configuration string is made up of a flag data followed by a structural data and 16 empty data.
Fig. 8. Structural configuration mechanism
Fig. 9. Structural configuration string
Fig. 9 represents the string which is applied twice in order to configure the minimal cell made up of six molecules (Fig. 1a). When the connection path between the molecules closes (Fig. 8g), the lower left molecule delivers a close signal to the nearest left neighbor cell. The structural configuration string is now moving around the datapath and is ready to be transmitted to neighboring cells. The load process is triggered by the close signal applied to the lower right molecule of the cell. Load signals then propagate westward and northward through the cell (Fig. 8g-i) and each of its molecules acquires a molecular mode (Fig. 2e-i) and a molecular type (Fig. 2j-r). We finally obtain a homogeneous array of molecules defining both the boundaries of the cell and the position of its living-mode and spare-mode molecules (Fig. 8j). This array is ready to be configured by the functional configuration data.
4.3 Functional Configuration
The goal of the functional configuration mechanism is to store in the homogeneous array, which already contains structural data (Fig. 8j), the functional data needed by the specifications of the current application. This mechanism is a growth process, performed only on the molecules in the living mode while the molecules in the spare mode are simply bypassed. It starts with an external growth signal applied to the lower left living molecule. According to the functional configuration string or functional genome (Fig. 11), the living molecules then successively generate an internal growth signal, select an input, and create a closed path among
Fig. 10. Functional configuration mechanism
Fig. 11. Functional configuration string
them in the cell (Fig. 10a-e). The functional configuration data are now moving around the datapath and are ready to be transmitted to neighboring cells. For each molecule, the configuration string is made up of a flag data followed by the 17 functional data needed to configure the application layer of the molecule (Fig. 11): (1) AO controls the selection of the combinational or sequential data by the output multiplexer AOMUX; (2) NRO, ERO, SRO and WRO control the selection of the routing outputs performed by the output multiplexer ROMUX; (3) P3:0 controls the selection of the look-up table inputs realized by the input multiplexer AIMUX; (4) F3:0 corresponds to the truth table of the application-specific combinational function implemented by the look-up table LUT. Applied twice to the minimal cell of Fig. 8j, the functional configuration string of Fig. 11 ends up with four living molecules generating the logic constant 1 (Fig. 10e).
4.4 Cloning
The cloning or self-replication mechanism is implemented at the cellular level in order to build a multicellular organism (Fig. 1b) and at the organismic level in order to generate a population of organisms (Fig. 1c). This mechanism supposes that there exists a sufficient number of molecules in the array to contain at least one copy of the additional cell or organism. It corresponds to a branching process which takes place when the structural and functional configuration mechanisms deliver northward and eastward growth signals on the borders of the cell during the corresponding growth processes. Fig. 12a and Fig. 12c show respectively the structural and functional branching processes performed in order to self-replicate the initial minimal cell northward. The corresponding eastward self-replication results from the structural and functional branching processes shown in Fig. 12b and Fig. 12d.
Fig. 12. Cloning mechanism
4.5 Control Test
In order to correct deteriorations that could affect the mobile functional configuration data, we define a control test mechanism which is made up of a reset process followed by a functional growth process. The error detection is realized when the datapath within the cell is closed. It is done by comparing the data moving around the cell with the ones moving around its western and southern neighbors. This comparison is performed by the lower left living molecule of the cell when its eastward and northward input data are not empty, or when at least one of them is not empty (≠ 0). The resulting error signal DE is thus a function of the non-empty westward WDI, eastward EDI and northward NDI data inputs of the molecule:

DE = (EDI = NDI) · (WDI = EDI) + (NDI = 0) · (WDI = EDI) + (EDI = 0) · (WDI = NDI)

According to this relation, the signal DE reports that at least one of the data moving around the cell is different from the data delivered by the neighboring cells, these data being considered as the correct ones. Performed on the cell having deteriorated mobile configuration data, the reset process starts from the lower left molecule of the cell and propagates eastward and northward in order to destroy the datapath built among the healthy molecules of the array. Fig. 13a-d displays the reset process applied to the minimal cell made up of two columns of living molecules and one column of spare molecules. Performed according to the functional configuration string corresponding to the specifications of the current application, the growth process rebuilds the datapath among the living molecules of the cell. This process, starting from the lower left molecule, is shown in Fig. 13e-i. It renews the mobile data moving around the cell as well as the fixed data of its molecules.
Fig. 13. Control test mechanism
4.6 Processing Test
The processing test mechanisms are introduced in order to repair a cell having molecules that present an incorrect behavior at the functional application level. Depending on the number of faulty molecules in the same row between two spare columns, the processing test mechanism results either in a cicatrization mechanism or in a regeneration mechanism. In order to introduce error detection at the application layer level, the architecture of this layer (Fig. 4b) has to be doubled. The error detection is then made by comparing the application data AO1 and AO2 of the two output multiplexers AOMUX. The resulting error signal AE depends on these two data:

AE = AO1 ⊕ AO2

Starting with the normal behavior of Fig. 10e, we suppose that the upper left molecule suddenly becomes faulty and triggers a cicatrization mechanism. This mechanism is made up of a repair process involving eastward-propagating repair signals (Fig. 14a-c), followed by a reset process, starting from the upper right molecule, performed with westward- and southward-propagating internal reset signals (Fig. 14d-f). This array, now comprising one molecule in the faulty mode and two molecules in the repair mode, is ready to be reconfigured by the functional configuration data. This implies a growth process bypassing the faulty molecule (Fig. 14g-k).
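A minimal software model of this duplication-based check is sketched below; the two LUT copies and the fault injected into one of them are purely illustrative.

    #include <stdio.h>

    /* Sketch of the processing-test error detection: the application layer is
     * instantiated twice and the two outputs are compared, AE = AO1 xor AO2.
     * lut_output() is a stand-in for one application-layer copy.              */
    static int lut_output(unsigned short lut, int addr) { return (lut >> addr) & 1; }

    int main(void)
    {
        unsigned short copy1 = 0xFFFF;          /* healthy copy: constant 1      */
        unsigned short copy2 = 0xFFFE;          /* faulty copy: wrong for addr 0 */
        for (int addr = 0; addr < 16; addr++) {
            int ao1 = lut_output(copy1, addr);
            int ao2 = lut_output(copy2, addr);
            int ae  = ao1 ^ ao2;                /* error signal AE               */
            if (ae)
                printf("mismatch detected at input %d -> repair is triggered\n", addr);
        }
        return 0;
    }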
Fig. 14. Cicatrization mechanism
Our cell comprises a single spare molecule per row and tolerates therefore only one faulty molecule in each row. A second faulty molecule in the same row will activate a regeneration mechanism and cause the death of the whole cell. Starting with the normal behavior of the cicatrized cell (Fig. 14k), a new molecule, the upper right one, detects an error. Being previously already in the repair mode, this molecule enters the dead mode and triggers kill signals which propagate northward, westward and southward (Fig. 15a-d). Finally, all the molecules of the array are dead as well as the entire cell.
Fig. 15. Regeneration mechanism
5 Applications

5.1 Arithmetic and Logic Unit
The circuit that performs arithmetic and logic operations on two 3-bit data words A and B can be considered as a one-dimensional artificial organism composed of three identical cells. Each cell is made up of six application-specific molecules (Fig. 16a): (1) a C molecule computing the carry output; (2) a G molecule computing the generate-carry signal; (3) a P molecule computing the propagate-carry signal; (4) an R molecule computing the result; (5) an O molecule recovering the result produced by the living organism; (6) a D molecule generating a deactivation signal in order to bypass the cells of the neighboring spare organism to the right. Fig. 16b shows the implementation, with a living organism to the left computing the arithmetic and logic functions while the spare organism to the right recovers the results produced by the living one.
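One possible interpretation of these molecules, for the case where the cell is configured as an adder slice, is sketched below in C; the real cell also implements logic operations selected by the S3:0 and M signals, which are not modelled here.

    #include <stdio.h>

    /* Illustrative reading of the C, G, P and R molecules as an adder slice.  */
    static int bit_slice(int a, int b, int carry_in, int *carry_out)
    {
        int g = a & b;                        /* G molecule: generate carry     */
        int p = a ^ b;                        /* P molecule: propagate carry    */
        *carry_out = g | (p & carry_in);      /* C molecule: carry output       */
        return p ^ carry_in;                  /* R molecule: result bit         */
    }

    int main(void)
    {
        int a = 5, b = 3, carry = 0, sum = 0;   /* A = 101, B = 011             */
        for (int i = 0; i < 3; i++) {
            int r = bit_slice((a >> i) & 1, (b >> i) & 1, carry, &carry);
            sum |= r << i;
        }
        printf("5 + 3 = %d\n", sum | (carry << 3));
        return 0;
    }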
Fig. 16. Arithmetic and logic unit. (a) Basic cell. (b) Implementation.
5.2 Image Processing Array
The circuit that performs thresholding and boolean operations [4] on a 3×3 array of 2-bit pixels P can be considered as a population comprising three artificial organisms. The three cells of each organism are made up of four application-specific molecules (Fig. 17a): (1) a G molecule performing the lower threshold operation P ≥ L; (2) an L molecule performing the upper threshold operation P ≤ H; (3) a B molecule performing the boolean operation; (4) a D molecule controlling the display. Fig. 17b shows the implementation of the population, with the three organisms to the left performing the image processing operations and one spare organism to the right.
Fig. 17. Image processing array. (a) Basic cell. (b) Implementation.
6 Conclusion
This paper is a contribution to the embryonic project [2], which is dedicated to the building of bio-inspired circuits in silicon. It supplies the detailed architecture of a configurable molecule made up of a configuration layer and an application layer. This molecule allows the design of circuits endowed with bio-inspired mechanisms. Using the VHDL description language, we have realized the hardware implementation of the configuration layer and the application layer of the configurable molecule. The hardware simulations of the image processing array and the arithmetic and logic unit presented in the paper are performed on circuits made up of such molecules. The configurable molecule, based on the register transfer level descriptions of its layers, defines an intellectual property (IP) block. Such a block can be synthesized and implemented in any integrated circuit.
References

1. Andrejas, J., Trost, A.: Reusable DSP functions in FPGAs. In: Grünbacher, H., Hartenstein, R.W. (eds.) FPL 2000. LNCS, vol. 1896, p. 456. Springer, Heidelberg (2000)
2. Canham, R., Tyrrell, A.M.: An embryonic array with improved efficiency and fault tolerance. In: Lohn, J., et al. (eds.) Proceedings of the NASA/DoD Conference on Evolvable Hardware (EH 2003), pp. 265–272. IEEE Computer Society, Los Alamitos (2003)
3. Mange, D., Stauffer, A., Petraglio, E., Tempesti, G.: Self-replicating loop with universal construction. Physica D 191(1-2), 178–192 (2004)
4. Russ, J.C.: The Image Processing Handbook. CRC Press, Boca Raton (2007)
5. Stauffer, A., Mange, D., Rossier, J.: Design of self-organizing bio-inspired systems. In: Arslan, T., Stoica, A., Suess, M., Keymeulen, D., Higuchi, T., Magness, R., Aydin, N., Erdogan, T. (eds.) Proceedings of the 2007 NASA/ESA Conference on Adaptive Hardware and Systems (AHS 2007), pp. 413–419. IEEE Computer Society, Los Alamitos (2007)
Evolutionary Design of Reconfiguration Strategies to Reduce the Test Application Time

Jiří Šimáček, Lukáš Sekanina, and Lukáš Stareček

Brno University of Technology, Faculty of Information Technology, Božetěchova 2, 612 66 Brno, Czech Republic
{isimacek,sekanina,starecek}@fit.vutbr.cz
Abstract. Recently, a method has been presented that allows a significant test application time reduction if some of the gates of a digital circuit are reconfigured before the test is applied. The selection of the gates for reconfiguration was performed using a very time-consuming deterministic recursive search algorithm. In this paper, a new method is proposed for the selection of the gates in order to reduce the test application time. The method utilizes an evolutionary algorithm which is able to discover very competitive reconfiguration strategies while the optimization time is considerably reduced with respect to the original algorithm. Moreover, the user can easily balance the trade-off between the number of test vectors and the amount of logic that has to be reconfigured. Experimental results are reported for the ISCAS85 benchmark suite.
1 Introduction
One of the most significant properties of biological organisms is the ability to modify their conformation. An optimally chosen conformation helps biological organisms to perfectly manage elementary functions such as reproduction, sensing, communication, competition with others, etc. Conformity between an organism and its environment constitutes what biologists call adaptation [1]. Reconfigurable architectures in fact implement the same concept in the world of electronic circuits. Different configurations of the same hardware can be activated to optimally perform quite different tasks, thus providing an obvious advantage over single-purpose circuits. Circuit support for diagnostics and testing is one of the functionalities embedded in modern electronic chips. However, this functionality can be considered as time/area overhead because it is not directly utilized by end users. In a typical scenario, a single-purpose "user circuit" is equipped with a diagnosis subsystem that performs diagnostics and testing after fabrication (to identify faulty chips) and during the lifetime of the system (such as BIST, on-line testing etc.). Unfortunately, the costs of electronic chip testing have been growing steadily and typically amount to 40% of today's overall product cost [2]. In particular, the test application time, which strongly depends on the number of test vectors needed to test a digital circuit, significantly influences the product cost. Hence automatic test pattern generator tools have been used to reduce the number of
test vectors for a long time. High-quality test sequences in combination with full/partial scan techniques and other methods have allowed designers to reduce the test application time significantly [3, 4, 5, 6, 7]. The reduction of test data volume is traditionally achieved by means of test data compaction [8, 9]. An unconventional approach to test time reduction was introduced in [10]. The idea is to apply a smart circuit reconfiguration (i.e. to change the organism's conformation, in biological terminology) before the test is applied in order to reduce the number of test vectors. More precisely: it is known that an automatic test pattern generator (ATPG) tool can generate a k-vector test sequence T leading to p% fault coverage, where k and p depend on the structure and properties of a given circuit C, the fault model used and user requirements. However, it was shown in [10, 11] that if the logic function of some gates of C can be changed, then a much shorter test T can be generated, i.e. k can be significantly decreased (by tens of percent, depending on the circuit) while p remains almost unchanged. The second circuit configuration is therefore used only during test application to reduce the test application time. It is important to note that the circuit topology remains unchanged during this reconfiguration. A method was proposed in [11] to find suitable gates for reconfiguration. However, that method is based on a deterministic recursive search which is very time consuming (days for mid-size circuits) and thus impractical for designers and test engineers. The method was validated using only four benchmark circuits of the ISCAS85 benchmark suite [12]. The goal of this paper is to propose a new method for the reduction of test vector volume. The method should lead to a comparable quality of results with respect to the previous approach; however, the time of computation has to be reduced. As there are many successful applications of evolutionary computing to hardware optimization and design [13, 14], the proposed method is based on the evolutionary computing paradigm. Our goal is to generate a new circuit configuration which differs from the original one as little as possible, but possesses better properties in terms of the volume of required test vectors. The proposed algorithm utilizes a simple weight function to allow balancing the trade-off between the number of test vectors and the amount of logic that has to be reconfigured. The rest of the paper is organized as follows. Previous work in the area of test time reduction using circuit reconfiguration is summarized in Section 2. The proposed method for the selection of gates that will be reconfigured before the test is applied is presented in Section 3. Section 4 gives an overview of the experiments performed to evaluate the proposed method and compares the results with paper [11]. Section 5 is devoted to the analysis of the obtained results. Conclusions are given in Section 6.
2 Previous Work
Conventional approaches to the reduction of test application time are well covered in the literature. This section surveys the method that is most relevant to our research.
2.1 Reconfiguration Before Test Application
The principle of the method (initially proposed in [10]) is to identify gates of a circuit whose function has to be reconfigured before the test is applied in order to reduce the number of test vectors. The reconfiguration should have the following properties: (i) the number of test vectors is reduced as much as possible; (ii) the number of reconfigured gates is minimized; (iii) the fault coverage is not influenced significantly (for a given fault model); (iv) circuit connections remain unchanged; (v) the reconfiguration does not change the number of inputs and outputs of gates (for example, a two-input/one-output gate can be replaced only by a two-input/one-output gate). After the test is applied, the circuit is reconfigured back to its original configuration. Only the gates that have to be reconfigured will be implemented as reconfigurable; the other gates remain implemented using a standard library.
2.2 Search Algorithm
Since the number of possible reconfigurations is r^n, where n is the number of gates of the circuit and r is the average number of possible replacements of a gate, an exhaustive search for an optimal configuration is intractable for real-world circuits. Because the original method was based on enumeration, only results for very small circuits (up to 13 gates) were reported in [10]. In order to solve larger problem instances, a recursive search algorithm was proposed in [11]. This algorithm systematically reconfigures gate by gate and measures the resulting test length and fault coverage. When a particular gate reconfiguration leads to an improvement, the configuration is fixed and the algorithm is executed recursively from the next gate. Promising results have been reported for some of the ISCAS85 circuits even if the algorithm is terminated before the end of the complete search-space exploration. In both cases the FlexTest tool was used to generate test vectors and calculate the fault coverage. As discussed in [10, 11], the basic assumption of the proposed method is that gates are considered as black boxes and only the circuit structure is tested, because it is expected that failures in components will propagate outside the component. A possible problem is that the required reconfigurable two-function gates may be functional in one mode and damaged in the other. This could lead to undetectable faults or false alarms. However, this strongly depends on the implementation of the reconfigurable gates. Recall that 100% fault coverage is nowadays not achievable for complex real-world circuits, hence some faults will always remain undetected. Although the method can leave some faults undetectable too, it allows the test vector volume to be reduced at a reasonable cost.
2.3 Example
Figure 1 shows a 3-input/8-output decoder (dec3to8) which consists of eleven gates (seven 3-input NOR gates, three inverters and a 3-input AND gate). FlexTest was utilized to derive the test with 100% fault coverage. A stuck-at-fault
Fig. 1. Circuit dec3to8. Reconfigured gates are shown as boxes
model was considered for the AMI 1.2 µm technology. The resulting test contains eight vectors: 100, 000, 111, 110, 011, 101, 010 and 001, i.e. it is the trivial test. The logic function of four gates of this circuit was modified as shown in Figure 1; the modified gates are shown in boxes. The three inverters were reconfigured to operate as simple wires (buffers) and the AND gate now operates as a NOR gate. Note that in the X/Y notation, X denotes the original function and Y denotes the modified function. Again, FlexTest was used to find a test with 100% fault coverage. The new test contains only four test vectors (100, 000, 010 and 001), which represents a 50% reduction. Other experiments are summarized in papers [10, 11].
2.4 Possible Implementation Scenarios
An open problem (not addressed in this paper) is how to implement the reconfigurable gates. A straightforward approach is to employ multiplexing of the “user” and “test” function for selected gates (Fig. 2). This solution has a reasonable overhead, especially when the reconfigurable gate is optimized at the transistor level as shown in [15]. However, the select inputs of multiplexers are not considered during test pattern generation by ATPG. Hence it is necessary to use additional test vectors to test the select-inputs which increases the overall test time application. This solution is acceptable only in some cases. Another solution could utilize so-called polymorphic gates. Polymorphic gates are unconventional circuit components that are not supported by existing synthesis tools. A polymorphic gate is capable of switching among two or more logic functions. However, the selection of the function is performed unconventionally. The logic function of a polymorphic gate depends on some external factors, e.g. on the level of the power supply voltage (Vdd ) [16, 17, 18, 19]. Figure 3 shows the
Fig. 2. Reconfigurable NAND/NOR gate based on a multiplexer and its optimized transistor-level implementation according to [15]
Fig. 3. Polymorphic NAND/NOR gate controlled by Vdd and its measured behavior according to [19]
NAND/NOR gate controlled by Vdd which was fabricated using AMIS CMOS 0.7 micron technology. In case of reduction of test vectors volume it is assumed that the circuit will operate with a slightly different Vdd in test mode. Selected gates will then perform differently wrt the user mode and the test application time will be shorter (under our assumptions). As there are no select signals for polymorphic gates the problem with their testing does not exist. On the other hand it must be investigated whether all faults of the normal mode of the gate remain detectable in the test mode. There are other issues such as that a
polymorphic circuit may have different timing parameters or power consumption during normal operation and test application.
3 Proposed Method
The proposed method utilizes a steady-state evolutionary algorithm (EA) which operates on chromosomes composed of n integers. New individuals are created using mutation applied to the best-scored individuals of the population. Crossover is not utilized. The fitness function integrates the criteria given in Section 2.1 using weight coefficients.
3.1 Notation
Let G be the set of all gate types which can appear in the target design and let C be a gate classification function such that g1, g2 ∈ G belong to the same class if and only if they have the same number of inputs and outputs and g1 can be replaced by g2 (and vice versa) in the target circuit. We lift C to sequences of gates such that C(g1 ... gn) = C(g1) ... C(gn). Additionally, len(s) denotes the number of symbols in a sequence s and δ(u, v) denotes the number of positions where the strings u and v differ.
3.2 Circuit Configuration
A digital circuit consists of a finite number of gates and an interconnection network. In this work, we consider only the sequence of gate types g1 ... gn (in the order given by the circuit's netlist) as a circuit configuration, because the interconnection network is always fixed. Each chromosome is then composed of just one circuit configuration.
3.3 Fitness Function
Our goal is to generate a new circuit which differs from the original one as little as possible, but whose test length is reduced. The fault coverage should not be modified significantly. Thus, we want to minimize the function

f(nc, oc) = A · (1 − tCov(nc)) + B · vc(nc)/vc(oc) + C · δ(nc, oc)/len(oc),
where nc and oc denote the new and original configuration respectively, tCov(x) ∈ [0, 1] expresses the fault coverage of a given configuration, and vc(x) denotes the volume of required test vectors. The coefficients A, B, C represent the weight of each property.
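A direct C transcription of this fitness function is sketched below; tCov and vc would normally be obtained from the ATPG tool (FlexTest in the paper), so here they are simply passed in as values.

    #include <stdio.h>

    /* Illustrative C version of the fitness f(nc, oc) to be minimized.        */
    static int delta(const int *u, const int *v, int len)    /* differing gates */
    {
        int d = 0;
        for (int i = 0; i < len; i++)
            if (u[i] != v[i]) d++;
        return d;
    }

    static double fitness(double tcov_nc, double vc_nc, double vc_oc,
                          const int *nc, const int *oc, int len,
                          double A, double B, double C)
    {
        return A * (1.0 - tcov_nc)
             + B * (vc_nc / vc_oc)
             + C * ((double)delta(nc, oc, len) / (double)len);
    }

    int main(void)
    {
        int oc[] = { 3, 1, 4, 1, 5 }, nc[] = { 3, 1, 2, 1, 5 };   /* gate types */
        /* e.g. 30 instead of 67 test vectors at full fault coverage            */
        printf("f = %.3f\n", fitness(1.0, 30, 67, nc, oc, 5, 1000, 100, 10));
        return 0;
    }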
Algorithm 1. Evolutionary Algorithm
Input: input configuration c, population size s, and mutation probability pmut
Output: output configuration x

  /* seeding phase */
  P ← ∅;
  while |P| < s do
      P ← P ∪ {modify(c, pmut)};
  while terminating condition not satisfied do
      /* reproduction phase */
      P' ← ∅;
      while |P'| < s do
          select x ∈ P randomly;
          P' ← P' ∪ {modify(x, pmut)};
      /* reduction phase */
      P ← P ∪ P';
      while |P| > s do
          select x ∈ P such that f(x, c) ≥ f(y, c) for any y ∈ P;
          P ← P \ {x};
  return x ∈ P such that f(x, c) ≤ f(y, c) for any y ∈ P;

3.4 Mutation
Mutation takes an input configuration and flips each gate with probability pmut. A new gate is selected randomly from the set of all gates belonging to the same class (according to C).
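The following C sketch illustrates the class-respecting mutation; the gate set and its two classes are toy stand-ins for the 63 library gates and the classification function C of Section 3.1.

    #include <stdio.h>
    #include <stdlib.h>

    /* Sketch of the mutation operator on a toy gate set.                      */
    enum { NOT, BUF, AND2, OR2, NAND2, NOR2, XOR2, NUM_GATES };

    static int gate_class(int g)            /* C(g): 1-input vs. 2-input gates */
    {
        return (g == NOT || g == BUF) ? 0 : 1;
    }

    static void mutate(int *config, int n, double pmut)
    {
        for (int i = 0; i < n; i++) {
            if ((double)rand() / RAND_MAX < pmut) {
                int g;
                do {                         /* draw a gate from the same class */
                    g = rand() % NUM_GATES;
                } while (gate_class(g) != gate_class(config[i]));
                config[i] = g;
            }
        }
    }

    int main(void)
    {
        int config[] = { AND2, NOT, OR2, NAND2, BUF, XOR2 };
        srand(42);
        mutate(config, 6, 0.5);
        for (int i = 0; i < 6; i++) printf("%d ", config[i]);
        printf("\n");
        return 0;
    }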
3.5 Evolutionary Algorithm
The evolutionary algorithm (Algorithm 1) starts with seeding of population P by randomly modified input configuration c. The modifications as well as mutations are performed by function modify. Then, it repeats reduction and reproduction phases. In the former the worst individuals wrt f are being iteratively removed until the size of the population meets the required criterion. The reproduction phase then generates new individuals by modifying configurations which are picked randomly from the original population. The condition which terminates the main loop of EA can be either the program running time or the number of generations being generated. As a result, the algorithm picks the best individual from the last population.
4 Experimental Results
As for experiments reported in [11], G contains 56 standard gates (with up to 4 inputs) which are also supported by the AMIS library. In addition, 7 gates
Table 1. Summary of experiments for c499 with basic setup: A = 1000, B = 100, C = 10, popsize = 1000, pmut = 0.005, 10 independent runs, 1000 generations Modification wrt gates test vectors basic setup min. max. mean min. max. mean 500 gen. 10 32 23.8 29 34 31.2 11 30 31.4 30 34 31.4 250 gen. 100 gen. 19 27 24 30 33 31.4 24 45 33.9 28 32 29.6 250 gen./pmut = 0.01 21 17.5 30 34 32.3 250 gen./pmut = 0.0025 14 250 gen./A = 700 14 40 24.3 29 34 30.8
fault coverage [%] Mean min. max. mean t [h] 99 100 99.6 3.29 100 100 100 1.80 99.73 100 99.97 0.72 99.33 100 99.91 1.75 99.47 100 99.82 1.72 98.94 100 99.77 1.87
(with up to 8 inputs) were included to G to cover all the gates used in the ISCAS85 circuits. The FlexTest tool is used to generate test vectors and calculate the fault coverage. The results of proposed evolutionary algorithm are compared with the recursive search algorithm using the ISCAS85 benchmark suite. Main features of the ISCAS85 circuits (such as the number of gates, test length and fault coverage) are given in Table 2 (column ‘Original circuit’). All experiments were permormed on a server with 2 x Dual Core AMD Opteron 2220. Performing a single experiment is very time consuming because of using the FlexTest tool in the loop. Hence we have firstly investigated different settings of our EA on circuits c499 and c1355 and then performed a final set of experiments with all the benchmark circuits. For the first experiments we have used A = 1000, B = 100, C = 10, the probability of mutation pmut = 0.005 and the population of 1000 individuals. Various modifications of EA were tested using the c499 circuit. Resulting values are given in Table 1. Figure 4 shows the relation between the number of test vectors and the number of reconfigured gates for the c499 circuit. It can be seen that EA produces various solutions and one can easily identify a Pareto front in the figure. Figure 5 shows the results for the c1355 circuit obtained from 10 independent runs. Note that applying the FlexTest on this circuit (which consists of 546 gates) leads to the 108-vector test sequence and 99.49% fault coverage. The fault coverage was slightly reduced after using the proposed method; however, the test length was significantly reduced to 27 – 37 test vectors when 39 – 70 gates are reconfigured. The average runtime is 4.2 hours for 500 generations. Table 2 summarizes the results obtained for the complete set of ISCAS85 circuits using the proposed evolutionary algorithm and the recursive search. EA has been applied with the following setting: A = 1000, B = 100, C = 10, popsize = 1000, pmut = 0.005, 1000 generations, a single run per circuit. The time of evolution depends on the complexity of a particular circuit. An example of EA run is given in Fig. 6 which shows the progress of fitness score and the number of test vectors for the best individual in case of the c7552 circuit. As the recursive algorithm presented in [11] is deterministic we allowed the algorithm to run (i) for the same time as EA and (ii) for the maximum limit of 96 hours. In most cases of (ii), the computation was not terminated within this time limit.
222
ˇ aˇcek, L. Sekanina, and L. Stareˇcek J. Sim´
Fig. 4. Test length vs the number of modified gates for the c499 circuit (60 runs with different setting)
Fig. 5. Results of 10 independent runs for the c1355 circuit
Evolutionary Design of Reconfiguration Strategies
223
Fig. 6. The progress of fitness score and the number of test vectors for the best individual (the c7552 circuit)
Table 2. Parameters of the original ISCAS85 circuits and the circuits modified using the evolutionary algorithm and the recursive search (the time allowed as for EA vs 96 hours allowed). Notation: rg – the number of reconfigured gates, tl – test length, fc – fault coverage [%], t - runtime
c17 c432 c499 c880a c1355 c1908 c2670 c3540 c5315 c6288 c7552
5
Original Circuit gates tl fc 6 9 100 160 102 99.24 202 67 98.94 383 104 100 546 108 99.49 880 163 99.52 1269 189 95.74 1669 252 96.00 2307 190 98.88 2416 46 99.56 3513 371 98.26
Evol. Algorithm rg tl fc t[h] 4 5 100 3.5 22 50 99.07 4.9 31 30 100 4.9 63 49 99.50 5.0 67 31 98.52 7.7 90 48 95.54 7.0 99 92 95.33 8.7 103 171 94.32 9.8 123 111 92.32 9.0 94 35 98.31 9.5 152 207 91.49 34.7
Recursive rg tl 3 5 22 63 34 33 24 74 0 108 26 119 29 155 22 214 7 182 7 39 14 351
(t[h]) fc 100 99.82 100 100 99.49 99.74 96.50 96.48 98.92 99.56 98.40
Recursive (96h) rg tl fc t[h] 3 5 100 0.1 27 57 99.82 96.0 36 28 100 96.0 30 66 100 96.0 0 108 99.49 1.9 30 114 99.79 96.0 59 109 96.73 75.5 57 196 97.20 96.0 58 127 98.99 96.0 10 36 99.56 31.5 36 325 98.41 96.0
Discussion
We can see from Table 2 that the proposed method reduced the number of test vectors by 49.0% and reconfigured 6.4% of the gates on average (all benchmarks counted). Note that only a single run was performed (9.5 hours per circuit on average). The recursive algorithm achieved a reduction of 31.4% of the test vectors and reconfigured only 2.6% of the gates (when 96 hours were allowed). The recursive algorithm increased the fault coverage by 0.4% on average and the EA decreased it by 1.9% with respect to the original fault coverage.
Some particular results are interesting. No result was discovered by the recursive algorithm for the c1355 circuit; however, the EA found a significant reduction of the test vector volume. Both algorithms achieved similar results for the c499 circuit. The recursive algorithm produced much better results for the c6288 circuit, where only 10 gates have to be reconfigured (94 in the case of the EA) to get the same test length. A significant test volume reduction was achieved for the c7552 circuit using the EA; however, the solution is probably not acceptable, as the fault coverage is decreased by 6.77%. We expect that better results would be obtained if the EA were executed multiple times. In summary, the proposed evolutionary algorithm has two main advantages in comparison with the deterministic recursive search algorithm. First, it produces many different solutions, which allows the designer to balance the trade-off between the number of test vectors and the amount of logic that has to be reconfigured. Second, the EA generates a reasonable solution for larger circuits much faster than the recursive algorithm. On the other hand, as the EA does not strictly keep the fault coverage equal to or higher than the original value, the resulting solution can exhibit slightly lower fault coverage. However, this behavior can be eliminated by setting stronger requirements in the fitness function.
6 Conclusions
In this paper, we have presented an alternative method for the selection of gates that have to be reconfigured before the test is applied in order to reduce the test application time. We have shown on the ISCAS85 benchmark suite that the proposed method is able to achieve competitive results while the optimization time is reduced with respect to the deterministic search. In the future, we plan to use a truly multi-objective algorithm to easily discover the Pareto-optimal solutions.
Acknowledgments This work was partially supported by the Czech Science Foundation under contract numbers GP103/10/1517 and GD102/09/H042, the BUT FIT grant FIT10-S-1 and the research plan Security-Oriented Research in Information Technology, MSM 0021630528.
References

[1] Fisher, R.A.: The Genetical Theory of Natural Selection. Clarendon Press, Oxford (1930)
[2] Wang, L.T., Stroud, C.E., Touba, N.A.: System-on-Chip Test Architectures: Nanometer Design for Testability. Morgan Kaufmann, San Francisco (2007)
[3] Park, S.: A partial scan design unifying structural analysis and testabilities. Int. J. Electronics 88(12), 1237–1245 (2001)
[4] Xiang, D., Patel, J.H.: Partial scan design based on circuit state information and functional analysis. IEEE Trans. Computers 53(3), 276–287 (2004)
[5] Efthymiou, A., Bainbridge, J., Edwards, D.A.: Test pattern generation and partial-scan methodology for an asynchronous SoC interconnect. IEEE Trans. VLSI Syst. 13(12), 1384–1393 (2005)
[6] Makris, Y., Orailoglu, A.: Property-based testability analysis for hierarchical RTL designs. In: Proceedings of IEEE ICECS 1999, 6th IEEE International Conference on Electronics, Circuits and Systems, pp. 1089–1092. IEEE Computer Society, Los Alamitos (1999)
[7] Lee, J., Touba, N.A.: Low power test data compression based on LFSR reseeding. In: 22nd IEEE International Conference on Computer Design: VLSI in Computers & Processors (ICCD 2004), pp. 180–185. IEEE Computer Society, Los Alamitos (2004)
[8] Das, S.R., Ramamoorthy, C.V., Assaf, M.H., Petriu, E.M., Wen-Ben, J., Sahinoglu, M.: Fault simulation and response compaction in full scan circuits using HOPE. IEEE Trans. on Instr. and Meas. 54(6), 2310–2328 (2005)
[9] Pomeranz, I., Reddy, S.M.: Static test compaction for multiple full-scan circuits. In: 21st International Conference on Computer Design (ICCD 2003), VLSI in Computers and Processors, pp. 393–396. IEEE Computer Society, Los Alamitos (2003)
[10] Sekanina, L., Starecek, L., Kotasek, Z., Gajda, Z.: Polymorphic gates in design and test of digital circuits. Int. J. of Unconventional Computing 4(2), 125–142 (2008)
[11] Starecek, L., Sekanina, L., Kotasek, Z.: Reduction of test vectors volume by means of gate-level reconfiguration. In: Proc. of 2008 IEEE Design and Diagnostics of Electronic Circuits and Systems Workshop, pp. 255–258. IEEE, Los Alamitos (2008)
[12] Brglez, F., Fujiwara, H.: A neutral netlist of 10 combinational benchmark circuits and a target simulator in Fortran. In: Proceedings International Symposium on Circuits and Systems (ISCAS), Kyoto, Japan, pp. 695–698 (1985)
[13] Drechsler, R.: Evolutionary Algorithms for VLSI CAD. Kluwer Academic Publishers, Boston (1998)
[14] Zebulum, R., Pacheco, M., Vellasco, M.: Evolutionary Electronics – Automatic Design of Electronic Circuits and Systems by Genetic Algorithms. The CRC Press International Series on Computational Intelligence (2002)
[15] Starecek, L., Sekanina, L., Gajda, Z., Kotasek, Z., Prokop, R., Musil, V.: On properties and utilization of some polymorphic gates. In: Proc. of 6th Electronic Circuits and Systems Conference, FIIT STU, Bratislava, pp. 77–81 (2007)
[16] Stoica, A., Zebulum, R.S., Keymeulen, D.: Polymorphic electronics. In: Liu, Y., Tanaka, K., Iwata, M., Higuchi, T., Yasunaga, M. (eds.) ICES 2001. LNCS, vol. 2210, pp. 291–302. Springer, Heidelberg (2001)
[17] Stoica, A., Zebulum, R.S., Keymeulen, D., Lohn, J.: On polymorphic circuits and their design using evolutionary algorithms. In: Proc. of IASTED International Conference on Applied Informatics AI 2002, Innsbruck, Austria (2002)
[18] Stoica, A., Zebulum, R., Guo, X., Keymeulen, D., Ferguson, I., Duong, V.: Taking Evolutionary Circuit Design From Experimentation to Implementation: Some Useful Techniques and a Silicon Demonstration. IEE Proc.-Comp. Digit. Tech. 151(4), 295–300 (2004)
[19] Ruzicka, R., Sekanina, L., Prokop, R.: Physical demonstration of polymorphic self-checking circuits. In: Proc. of 14th IEEE International On-Line Testing Symposium, pp. 31–36. IEEE, Los Alamitos (2008)
Extrinsic Evolution of Fuzzy Systems Applied to Disease Diagnosis
Joël Rossier and Carlos Pena
Reconfigurable and Embedded Digital Systems (ReDS), HEIG-VD, Yverdon, Switzerland
[email protected], [email protected]
Abstract. For quite some time, biologists have been gathering large amounts of biomarker data from patients suffering from specific illnesses and from healthy people. Their problem now lies in processing that huge amount of data so as to extract meaningful information about the links, and thus the rules, enabling a diagnosis based on specific biomarkers. In this paper we propose an approach to this problem that uses fuzzy logic to model the diagnostic systems and evolutionary computing to find such systems. Moreover, the speed of execution of the proposed design, which is based on several Virtex-5 FPGAs, with respect to a standard software computation enables the realization of thousands of successive evolutionary runs within a reasonable time, and thus makes it possible to obtain robust statistical information enabling the selection of meaningful biomarkers for the diagnosis of specific diseases.
1 Introduction
In the process of finding efficient systems to obtain accurate diagnoses of diseases, biologists essentially face two major challenges. The first resides in the selection and extraction of a small set of relevant biomarkers for the actual diagnosis from a usually much larger set of recorded biomarkers. Such a reduction becomes crucial when proposing diagnosis kits that could be economically suitable for mass production, i.e. the fewer biomarkers are used, the cheaper the resulting kit. Secondly, biologists have to find an accurate model, using the selected biomarkers, able to efficiently predict the presence or absence of an illness in individuals. Indeed, to be usable on a large scale, a disease-diagnosis model must be very accurate. In this paper, we present an evolutionary-based method providing an answer to these two challenges using reconfigurable hardware, based on the three following propositions:
1. First of all, we propose to model the diagnosis system using fuzzy logic which, because it does not use precise values but rather extracts general tendencies, can inherently account for uncertainty (noise, variability, imprecision) in the data, something that is usually quite present in biological measurements.
2. Then, in order to find a set of parameters for the fuzzy systems enabling them to accurately diagnose specific diseases, we propose to use an evolutionary algorithm.
3. Finally, to cope with the size of the databases and extract a pertinent subset of variables that could be used to obtain a diagnosis at a reasonably low cost, we propose to run our evolutionary algorithm many times, in such a way that the number of resulting systems enables us to perform robust statistical analyses.
It is obvious that such a methodology requires a huge amount of computation: to obtain statistically representative data, we have to perform thousands of evolutionary runs. Each of these evolutionary attempts must then run for thousands of generations on thousands of individuals to obtain accurate-enough diagnostic systems. Finally, each individual of the evolutionary runs represents a complete fuzzy system, whose fitness must be determined by checking the predictions of the system against the real recorded database data for a large number of patients. As a result, our method would take months if executed on a standard software platform. We thus implement its functionalities using reconfigurable circuits in order to efficiently parallelize and pipeline the computation, so as to accelerate execution and obtain an acceptable computational time. The following section presents a summary of the fuzzy systems paradigm, while Section 3 focuses on the application of such fuzzy systems to disease diagnosis. Section 4 then exposes the general hardware/software architecture and partitioning of our evolutionary system, while the next section details the actual hardware architecture of the computational core. The paper continues with a section describing some of the results obtained, followed by a presentation of the speedup of our system with respect to a full software implementation. Finally, Section 8 presents a short conclusion.
2 Fuzzy Systems
A fuzzy system is a rule-based system that uses fuzzy logic, rather than Boolean logic, to reason about data [1]. Fuzzy logic is a computational paradigm that provides a mathematical tool for representing and manipulating information in a way that resembles human communication and reasoning processes. It is based on the assumption that, in contrast to Boolean logic, a statement can be partially true (or false), and composed of imprecise concepts. A fuzzy system uses the concept of linguistic variables, also called fuzzy variables, which are characterized by their name tag, a set of linguistic values (also known as fuzzy values or labels), and the membership functions of these labels [2]. Figure 1 shows an example of such a fuzzy variable concerning the temperature. As can be seen, a certain value of the actual temperature (e.g. 19°) is assigned a membership value µL(x) for each of its corresponding fuzzy labels L (e.g. Cold, Warm and Hot) according to the membership functions (e.g. µCold(19°) = 0.33, µWarm(19°) = 0.67 and µHot(19°) = 0).
Fig. 1. Example of a membership function for the linguistic variable “Temperature”
Several such fuzzy variables are then combined as inputs to logical expressions, i.e. fuzzy rules, that have the form:
if (Vara is Labelx) and (Varb is Labely) and ... then (Output is Labelz)
Each rule is then assigned an activation level µrule that, when the minimum function is used as the and operator, corresponds to the equation:
µrule = µLabelx(Vara) and µLabely(Varb) and ... = min_j ( µLabelj(Varj) )
Note that, in addition to and, other logical operators can be used to construct the different rules. Also note that, according to their output membership functions, there are different kinds of fuzzy systems: Mamdani [3,4], TSK [5,6] and singleton [7], whose rule outputs are, respectively, classical fuzzy sets, mathematical expressions, or constant values. The outputs of the different rules belonging to the fuzzy system are then aggregated to give the actual fuzzy output of the system. A last operation requires this fuzzy output to be defuzzified in order to obtain a final crisp value representing the answer of the system to its input fuzzy conditions. In the fuzzy system field, one finds many different techniques to actually implement the aggregation and defuzzification processes (MOM, COA, singleton, etc. [8,9]).
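To make the fuzzification and rule-activation steps concrete, the following C++ fragment (a purely illustrative sketch, not the implementation used later in this paper) reproduces the Temperature example of Figure 1 with assumed break points at 13°, 22° and 31°, adds a hypothetical Humidity label, and evaluates one two-antecedent rule using the minimum as the and operator.

    #include <algorithm>
    #include <cstdio>

    // Triangular membership function; the break points below are assumptions
    // chosen only to reproduce the 0.33/0.67/0 example values.
    double triangular(double x, double lo, double peak, double hi) {
        if (x <= lo || x >= hi) return 0.0;
        return (x < peak) ? (x - lo) / (peak - lo) : (hi - x) / (hi - peak);
    }

    int main() {
        double t = 19.0;
        // Fuzzification of Temperature: Z-shaped Cold, triangular Warm, S-shaped Hot.
        double cold = std::max(0.0, std::min(1.0, (22.0 - t) / 9.0));
        double warm = triangular(t, 13.0, 22.0, 31.0);
        double hot  = std::max(0.0, std::min(1.0, (t - 22.0) / 9.0));
        // A second, hypothetical variable just to show a two-antecedent rule.
        double humidityHigh = triangular(0.8, 0.5, 1.0, 1.5);
        // if (Temperature is Warm) and (Humidity is High) then ...
        double rule = std::min(warm, humidityHigh);
        std::printf("Cold=%.2f Warm=%.2f Hot=%.2f rule=%.2f\n", cold, warm, hot, rule);
        return 0;
    }

For 19°, this yields µCold = 0.33 and µWarm = 0.67, matching the values quoted above.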
3 Fuzzy Systems for Disease Diagnosis
A major class of problems in medical science involves the diagnosis of diseases, based on various tests performed upon the patient. A good diagnosis system should possess two characteristics, which are often in conflict. First, the system must attain the highest possible performance, i.e. accuracy in the correct disease prediction. Second, it is highly beneficial for such a diagnosis system to be interpretable. This means that the physician is not faced with a black box that simply gives answers with no explanations; rather, it is better for the system to provide some insight into how it derives its outputs. As a result, given a set of biomarkers measured at a given moment on an individual, we intend to find a system that can accurately predict the presence or absence of a certain illness or biological trait. At the same time, we expect such a system to provide, in some way, an explanation of its predictions. We propose to use fuzzy logic modeling, mainly for the three following reasons:
1. Multivariate: fuzzy logic is a multivariate approach, in contrast with several variance-based approaches which are univariate.
2. Feature extraction versus dimensionality reduction: PCA, SVM and other similar approaches rely on some kind of combination (transformation) of a high-dimensional space into a reduced space, but do not directly use a reduced set of variables. Fuzzy logic directly uses the original variables, allowing for gene-pool selection at the same time as it produces a classification.
3. Interpretability: as explained above, due to the utilization of linguistic values, fuzzy systems can be highly interpretable.
Moreover, the decision to use fuzzy logic came from bioinformatics specialists, as other commonly used techniques were not producing results on this specific data set. We thus want to find a fuzzy system which, taking as input the measured levels of the different biomarkers (Bx), applying specific rules to different combinations of these inputs and aggregating the rule outputs, is able to indicate in an understandable way whether an individual is ill or not (D). An example of such a fuzzy system could be the following:
if (Ba is Low) and (Bb is Medium) and (Bc is Low) then (D is High)
if (Ba is Medium) and (Bc is VeryHigh) then (D is High)
if (Bb is VeryLow) and (Bc is High) then (D is Low)
else (D is Low)
As shown above, each of the biomarker input values must first be fuzzified according to its corresponding membership function; the resulting fuzzy values are then used to state the level of activation of each of the rules. These activation levels are then aggregated and defuzzified to indicate whether the tested individual suffers from a specific illness. Note the presence of the "else ..." line, which represents the default rule, whose activation is inversely proportional to the other rules' activations, i.e. it is highly activated when the other rules are weakly activated. In order to build such a diagnostic system, one has to solve several challenges:
1. Which biomarkers allow obtaining a system that correctly discriminates between illness and healthiness?
2. What are the different membership functions that represent each biomarker, i.e. what are the threshold values for Bx to be considered as VeryLow, Low, Medium, etc.?
3. What biomarker interactions and effects are represented by each rule, i.e. which biomarkers and corresponding linguistic labels have to be taken as inputs for a specific rule, and which is the output of that rule (Low or High)?
4. What is the best rule combination for the fuzzy diagnostic system to be accurate enough?
In order to find a valid answer to these questions, we decided to use an evolutionary method, i.e. to encode the possible systems as genomes and then let an evolutionary algorithm find the best genome and provide the corresponding fuzzy system. The next subsection describes the specificities and the chosen encoding of our fuzzy systems.
3.1 Specification of the Search Space
The basic material to conduct our disease-diagnosis fuzzy system search consists of a database (top left of Figure 2) containing the measured levels of different biomarkers (1 to X in the figure) for a set of patients (1 to N) and their corresponding illness information (Di ).
Fig. 2. Specification of our fuzzy systems
The database might contain a very large number X of biomarkers, e.g. microarray-based gene expression profiles. As mentioned above, one of our main goals is to find a reduced but predictive set of biomarkers; we thus first decided to limit the maximum number of different biomarkers used by each system. Preliminary software tests showed that an upper limit of ten biomarkers was a good tradeoff, enabling a high enough classification performance while keeping the search space relatively small and thus allowing the convergence of the evolutionary process. As a result, each system has to select up to ten biomarkers from the whole database content. This process corresponds to step A in Figure 2. This requirement can be coded into the genome using 10 · log2(X) bits. For example, with a database containing 1000 different biomarkers, the encoding of their selection for our fuzzy system needs 100 bits. Note that if the encoded value points to a biomarker index greater than X, the effective number of variables used by the system is reduced (e.g. Var9 in the figure). Then, we have to define a membership function for each of the selected biomarkers (step B in Figure 2). Based on interpretability considerations [10], and also for the sake of keeping the search space reasonably small, we decided to allow each linguistic variable to have up to five membership functions, e.g. VeryLow, Low, Medium, High and VeryHigh. In the same sense, we decided to use semantically-correct fuzzy variables [10] defined by triangular membership functions bounded by Z- and S-shaped membership functions, as illustrated in Figure 3.
Fig. 3. Two examples of membership functions: left with five linguistic labels, right with three linguistic labels
As a result, the encoding of each membership function contains up to five different threshold values P1 to P5, as shown in Figure 3. Note that, in order to reduce the amount of information within the genome, we limit the precision of the Px values to 6 bits. Moreover, it should be possible for each linguistic variable to be defined using fewer than the maximum of five linguistic labels: this requires coding within the genome the number of membership functions used (i.e. 2, 3, 4 or 5 labels), which needs only two bits per biomarker. Consequently, for our example of 1000 biomarkers, if we fix the maximum number of rules of the system at ten, the encoding of the membership functions requires 10 · (5 · 6 + 2) = 320 bits. For each of the 10 rules, information is then needed to indicate whether or not the variables of the current fuzzy system, i.e. those defined in the previous step, must be used as inputs and, if so, which linguistic label must be applied (step C in Figure 2). This information is encoded using 3 bits per variable per rule: when the 3-bit value points to an existing linguistic value for the variable, it is used (for example, L11 means Medium, L12 means Low, etc.). When the 3-bit value points to a linguistic label index that is not implemented for the specific variable, the latter is not used within the rule inputs (e.g. L22), resulting in a rule shorter than the allowed maximum of ten inputs. For example, if the membership function of a selected biomarker is defined to have two different linguistic labels (Var2) and the 3-bit value for that biomarker in a specific rule is greater than two, the biomarker is not used as an input (as in the second rule). It results from that encoding that the rule input definition needs 3 · 10 · 10 = 300 bits for a complete rule input specification. Finally, we decided to use fuzzy systems with a singleton output type [7,8]. This choice enables a simplification of the defuzzification process, while still offering good performance both numerically (accuracy) and linguistically (interpretability). Concerning the output variable, we decided to enable five distinct singleton values that are each encoded with 8 bits. In consequence, for each rule we have to encode which output value is used as the conclusion of the rule (step D in Figure 2). This is done using 3 bits per rule (for the 10 standard rules plus the default rule). Once again, if this value points to a non-existent output index, the corresponding rule is not used within the fuzzy system. The rule output encoding thus needs 5 · 8 + 11 · 3 = 73 bits.
It results from the chosen encoding that a whole fuzzy system is fully encoded with a genome of 793 bits. Note also that the chosen encoding encourages the emergence of ∅ values, which makes it easier for the evolutionary algorithm to find fuzzy systems with fewer than 10 rules and/or fewer than 10 variables.
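As a quick cross-check of these bit counts, the short C++ fragment below recomputes the genome length for the 1000-biomarker example; the constants simply restate the encoding choices described above.

    #include <cmath>
    #include <cstdio>

    int main() {
        const int maxVars = 10, maxRules = 10, maxLabels = 5, biomarkers = 1000;
        int selection  = maxVars * (int)std::ceil(std::log2(biomarkers));  // 10 * 10  = 100 bits
        int membership = maxVars * (maxLabels * 6 + 2);                    // 10 * 32  = 320 bits
        int ruleInputs = 3 * maxVars * maxRules;                           //          = 300 bits
        int ruleOutput = maxLabels * 8 + (maxRules + 1) * 3;               // 40 + 33  =  73 bits
        std::printf("genome length = %d bits\n",
                    selection + membership + ruleInputs + ruleOutput);     // 793 bits
        return 0;
    }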
4 HW/SW Implementation
The targeted HW platform is the Pico Computing EC7BP board, which contains seven separate Xilinx V5LX50 FPGAs built around a PCIe bridge. The general architecture of our system, shown in Figure 4, consists of two interacting parts: a hardware part implementing the fuzzy computational core and a software part performing, among other functions, the evolutionary algorithm. The hardware part consists of the seven FPGAs, which receive the parameters defining the different fuzzy systems, compute their results according to the sample data within the database and send the results back to the software. On its side, the software part contains a thread responsible for the creation of the successive generations and a thread responsible for the gathering of the individual fitnesses. Moreover, the software part also contains a producer and a consumer thread per FPGA. The producer threads receive the genomes of individual fuzzy systems, decode them to generate the corresponding parameters and then send this information to the hardware part.
Fig. 4. General architecture of the system
The parameters sent to the system essentially consist of the following data: the output constants (Outi in Figure 2) for each of the rules and the indication of whether or not the rule is used within the system that is being computed, and then, for each different variable taking part in the computation of the output, the P1 to P5 values defining its membership function (Figure 3) as well as the type of the membership function, some additional pre-computed values used for
the fuzzification process (see Section 5.1) and finally some data indicating which one of its linguistic labels (if any) takes part in the computation of each rule activation. With these parameters, the fuzzy system is fully defined. The consumer threads, on their side, receive the output values computed within the hardware: the data essentially consists of the result of the system computation for each of the patients of the database, i.e. whether or not the patient suffers from the disease according to the current fuzzy system. The consumer threads then compute the fitness values of the corresponding fuzzy systems and forward them to the gathering thread. As can be seen in the figure, a first level of parallelism is included in the structural architecture of our system, since the decoding of the genome, the computation of the fuzzy system outputs and the computation of the different fitness values occur simultaneously within the different data flows. Moreover, some structural pipelining is also introduced, as the two software threads and the hardware part of each data flow can simultaneously execute their respective tasks on different successive fuzzy systems. The detailed presentation of the hardware architecture is given in the next section.
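The following C++ sketch illustrates the producer/consumer pattern of one such data flow. It is only a schematic view under simplifying assumptions: the genome decoding, the PCIe communication with the EC7BP board and the fitness computation are replaced by placeholder stubs, and standard threads and queues stand in for the actual software infrastructure.

    #include <condition_variable>
    #include <cstdio>
    #include <mutex>
    #include <queue>
    #include <thread>

    // All types and the FPGA call below are placeholders standing in for the real
    // genome decoding and the EC7BP/PCIe interface, which are not shown here.
    struct Genome      { unsigned bits[25]; };            // 793 bits fit in 25 words
    struct FuzzyParams { /* decoded rule and membership parameters */ };
    struct Result      { int correct; };

    FuzzyParams decode(const Genome&)                        { return {}; }
    Result      run_on_fpga(int /*fpga*/, const FuzzyParams&) { return {31}; }
    double      fitness(const Result& r)                     { return r.correct / 31.0; }

    std::mutex m;
    std::condition_variable cv;
    std::queue<Genome> genomes;     // filled by the generation-creation thread
    std::queue<double> fitnesses;   // drained by the fitness-gathering thread
    bool done = false;

    void worker(int fpga) {         // one producer/consumer pair per FPGA
        for (;;) {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [] { return done || !genomes.empty(); });
            if (genomes.empty()) return;
            Genome g = genomes.front(); genomes.pop();
            lk.unlock();
            FuzzyParams p = decode(g);          // producer role: genome -> parameters
            Result r = run_on_fpga(fpga, p);    // hardware evaluation of all patients
            double f = fitness(r);              // consumer role: outputs -> fitness
            lk.lock();
            fitnesses.push(f);
        }
    }

    int main() {
        std::thread t(worker, 0);
        { std::lock_guard<std::mutex> lk(m); genomes.push(Genome{}); done = true; }
        cv.notify_all();
        t.join();
        std::printf("fitness = %.2f\n", fitnesses.front());
        return 0;
    }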
5 Hardware Architecture
The hardware architecture of our fuzzy computational module is shown in Figure 5. It is divided into three main stages: fuzzification, aggregation and defuzzification. The fuzzification stage takes as input the parameters of the current biomarker membership function sent from the software part and computes, in parallel, the membership values for each of its linguistic labels. Moreover, these computations are pipelined and this module can thus generate a whole set of all the fuzzification values for a single biomarker in each clock cycle. Note that the architecture of the fuzzification modules is detailed below.
Fig. 5. Architecture of the hardware part of the system
Remember that the computation of the rule activations follows the equation:
µrule = µLabela(V1) and µLabelb(V2) and ...
Moreover, as stated in Section 3.1, our implementation of a fuzzy system contains at most ten different input biomarkers. We thus propose to compute the membership values of a specific biomarker for all the patients, i.e. we load the parameters corresponding to the first biomarker of the system and compute its fuzzification for all the values of the different patients. These results are subsequently stored within a shift register according to the information about their participation in the computation of the rules (if a variable does not appear in a rule, the default value 1 is stored). Then we load the parameters of the second biomarker and aggregate its results with a min operator (once again following each rule's parameter for the variable selection) with the previously computed activation for each patient. We then repeat this process until all the input membership values have been aggregated within the shift register of each rule. The aggregation level thus consists, for each rule, of an input multiplexer enabling the choice of the biomarker linguistic label taking part in the determination of the rule (selected with the parameters sent from the software part), a shift register containing the aggregated rule-activation value for each patient in the database, and the min operator (corresponding to the and function), which is used to aggregate the successive variable values for each patient. Finally, as we use singleton-type outputs (see Section 3.1), the defuzzification process implies, for each of the different patients, taking all the final rule-activation values from the shift register (if the rule is used, or '0' if it is not, the correct value being selected by a multiplexer controlled by the parameters sent from the software) and then computing the final value following the equation:
Output = ( Σi (µrulei · CSTrulei) + µdef · CSTdef ) / ( Σi µrulei + µdef ),   with   µdef = 1 − max_{i∈{1..10}} (µrulei)
To realize this computation in hardware, our defuzzification module thus needs, as shown in Figure 5, a max operator and a subtractor to compute the default rule activation µdef, eleven multipliers for the products with the constants representing the singleton output values (the CSTi parameters), and finally two adder trees and a divider to obtain the final output value. Note that a final comparison of the division result with a threshold value (not shown in the figure) enables our system to output only one bit (i.e. ill or not) for each of the patients. Note also that the whole defuzzification block is pipelined and can thus give the results for each patient at a one-clock-cycle rate.
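As a software reference for the arithmetic performed by this block (an illustrative sketch only; the hardware uses the shift registers, adder trees and pipelined divider described above, and the numeric values here are invented), the singleton defuzzification of one patient can be written as:

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    // Singleton defuzzification for one patient. 'act' holds the aggregated
    // activation of each used rule, 'cst' the corresponding singleton constants;
    // cstDef is the constant of the default rule.
    double defuzzify(const std::vector<double>& act, const std::vector<double>& cst,
                     double cstDef, double threshold) {
        double muDef = 1.0 - *std::max_element(act.begin(), act.end());
        double num = muDef * cstDef, den = muDef;
        for (size_t i = 0; i < act.size(); ++i) {
            num += act[i] * cst[i];
            den += act[i];
        }
        double out = num / den;
        return out > threshold ? 1.0 : 0.0;   // final one-bit decision: ill or not
    }

    int main() {
        std::vector<double> act = {0.7, 0.2, 0.0};
        std::vector<double> cst = {1.0, 0.8, 0.1};
        std::printf("diagnosis = %.0f\n", defuzzify(act, cst, 0.0, 0.5));
        return 0;
    }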
5.1 Fuzzification Module
The fuzzification module has to generate the membership value µ of a specific linguistic label for a particular input biomarker level V . In our implementation, the membership functions might be of three different types (A, B or C) as shown in Figure 6. They are defined by the parameters L, M, and/or H.
Fig. 6. Detailed architecture of the fuzzification module
As the maximum value of the membership function equals 1, µ(V) becomes:
µ(V) = (V − L) / (M − L)   when (B or C) and (L < V < M),
µ(V) = (H − V) / (H − M)   when (A or B) and (M < V < H),
µ(V) = 1                   when (A and V < M) or (C and V > M),
µ(V) = 0                   otherwise.
If we provide the values 1/(M − L) and 1/(H − M), the computation of the membership value can then be executed with only a floating-point subtractor, a comparator, a multiplier and some multiplexers and selection logic, as shown in Figure 6. Note that, as mentioned above, the whole fuzzification module is pipelined and thus needs some delay elements (the gray blocks in the figure) in order to synchronize the required data at the corresponding pipeline levels.
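A small C++ sketch of this computation is given below. It assumes, as in the hardware, that the reciprocals 1/(M − L) and 1/(H − M) are pre-computed by the software and passed in, so that each membership value costs only a subtraction, a comparison and a multiplication; boundary cases are treated inclusively. The parameter values in main() are arbitrary examples.

    #include <cstdio>

    enum Shape { A, B, C };   // A: Z-shaped, B: triangular, C: S-shaped

    double membership(Shape s, double V, double L, double M, double H,
                      double invML, double invHM) {
        if ((s == B || s == C) && L < V && V < M) return (V - L) * invML;
        if ((s == A || s == B) && M <= V && V < H) return (H - V) * invHM;
        if ((s == A && V <= M) || (s == C && V >= M)) return 1.0;
        return 0.0;
    }

    int main() {
        double L = 10.0, M = 20.0, H = 30.0;
        std::printf("mu = %.2f\n",
                    membership(B, 17.5, L, M, H, 1.0 / (M - L), 1.0 / (H - M)));
        return 0;
    }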
6 Results
The modeling problem involves discriminating two categories of patients based on their gene-expression profiles. It admits a relatively high number of variables and, consequently, a huge search space. An initial, exploratory set of software-based evolutionary fuzzy modeling runs and the subsequent analysis showed that many different systems were capable of satisfactorily solving the pursued discrimination problem. Furthermore, we observed that there exist many, radically different, pools of genes that may lead to highly accurate models (i.e., 100% classification with very few rules and variables). This fact, besides being unusual for a fuzzy modeling project, obliged us to redefine our main modeling goal. We thus focused our experiments on detecting highly frequent models and genes across a large number of fuzzy modeling runs in order to unveil common patterns, which implied performing many evolutionary runs. In addition, we took advantage of the multiple evolutionary runs to perform cross-validation analysis. Finally, in order not to be hampered by an excessively long computational time, we had to apply our hardware diagnosis-finding system to this problem. Concretely, the sample database consisted of 1016 biomarker values for 32 patients suffering (or not) from cancer. To assess overfitting in the data, we conducted our evolutionary runs with only 31 patients and used the remaining one for cross-validation. We thus executed 3200 successive runs, i.e. 100 runs with
each one of the patients used as a cross-validator, and we considered only the resulting systems that had a 100% correct classification on the 31 patients and also a correct prediction on the left-out case (68% of the 3200). The evolutionary runs were conducted with a population size of 300 individuals, an elitism value of 1 (the best individual is copied to the next generation), a rank-based probabilistic selection of the ancestor participating in the crossover and a probability of mutation of 1/400 for each bit of the genome (793 bits). The fitness was defined as (specificity + 0.8 · sensitivity)/1.8 if the system is not yet a perfect classifier, and 1 + f(size) otherwise. Note that in about 400 generations each run gave rise to perfect classifiers, and the remaining generations were thus used to reduce the size of the system (number of used rules and biomarkers). This procedure gave us 2176 perfect classifiers, each of them containing specific biomarker combinations (with 97% of them using fewer than 5 biomarkers). From these data we could study the frequency of appearance of specific biomarkers, or specific pairs of biomarkers, across all the selected runs. This gave us some hints about the significance of the different biomarkers in the diagnosis of the disease, but this discussion lies beyond the scope of this paper.
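A direct transcription of this fitness definition is sketched below; the exact form of the size term f(size) is not given in the text, so the one used here is only a guessed placeholder that rewards smaller systems.

    #include <cstdio>

    // Fitness as defined above: sensitivity/specificity blend while the classifier
    // is imperfect, 1 + f(size) once it classifies all 31 training patients.
    double fitness(int tp, int tn, int fp, int fn, int rulesUsed, int varsUsed) {
        double sensitivity = (double)tp / (tp + fn);
        double specificity = (double)tn / (tn + fp);
        bool perfect = (fp == 0 && fn == 0);
        if (!perfect) return (specificity + 0.8 * sensitivity) / 1.8;
        double fSize = 1.0 / (1.0 + rulesUsed + varsUsed);   // placeholder size term
        return 1.0 + fSize;
    }

    int main() {
        std::printf("%.3f\n", fitness(14, 15, 1, 1, 4, 3));   // imperfect classifier
        std::printf("%.3f\n", fitness(15, 16, 0, 0, 3, 3));   // perfect classifier
        return 0;
    }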
7 Speedup
We have defined our system using different kinds of pipelining and parallelization so that its execution is really fast. At the HW/SW design level, we have several parallel data flows using specific threads and hardware resources, and each of them is structurally pipelined (see Figure 4). One level below, as shown in Figure 5, we have designed our computational cores to be pipelined as well, comprising a fuzzification stage, an implication stage and an aggregation/defuzzification stage. Moreover, as can also be seen in this figure, each of these stages contains some parallelism in itself (e.g. the membership functions, the rules computation and the multiplications in the defuzzification block are all computed concurrently). Finally, at the lowest level of the system, we also implemented some pipelining behaviors (see Figures 5 and 6). With the efficient use of the above-mentioned hardware acceleration techniques, we implemented and exhaustively tested our system with the setup described in the previous section. It exhibited a speedup of about 150 with respect to a standard C++ software implementation. To give an order of magnitude, for this specific modelling project, the computation of 3200 evolutionary runs, each of them consisting of 1000 generations with 300 individuals (i.e. 300 fuzzy systems encoded following the explanations given in Section 3.1) for the 1016-biomarker, 32-patient database, takes about 9.3 hours on our system, a time that compares very well with the roughly two months required to perform the same computation with the software-only implementation.
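As a rough consistency check of these figures: assuming 30-day months, two months correspond to about 60 · 24 = 1440 hours, and 1440 h / 9.3 h ≈ 155, which agrees with the reported speedup of about 150.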
8 Conclusion
In addition to finding fuzzy systems based on a database of biomarker samples and able to give a valid diagnosis for an unknown patient, the computational
speedup exhibited by our implementation enables us to also use our system to generate statistically representative measures. Indeed, when repeating several thousand times the evolutionary process of finding a valid fuzzy system, we end up with several thousand sets of biomarkers allowing a correct disease diagnosis. Analyzing these sets then enables us to propose to the biologists the most significant biomarkers for a specific disease out of all the ones that have been sampled within the database. This information can then be used to give the biologists some hints about particular relations between the specific biomarkers and the disease itself. Moreover, this frequency-based analysis enables us to further reduce the size of the database, diminishing the number of biomarkers that have to be measured to correctly predict the presence of the disease in a patient. To summarize, we can say that the evolutionary process used within our system enables us to quite "easily" find simple and accurate fuzzy systems for the diagnosis of specific diseases. Moreover, the use of fuzzy logic to realize the diagnosis systems leads to linguistically meaningful predictive systems that are understandable by biologists. Finally, the hardware implementation of the computational core of our system and the great resulting speedup allow obtaining statistically representative information in a tractable time (as opposed to a full software implementation). The combination of evolutionary techniques, fuzzy logic systems and specific hardware design has thus proven its great potential in answering several kinds of complex biological questions.
References
1. Zadeh, L.A.: Fuzzy sets. Information and Control 8, 338–353 (1965)
2. Zadeh, L.A.: The concept of a linguistic variable and its applications to approximate reasoning. Information Sciences, Part I 8, 199–249, Part II 8, 301–357, Part III 9, 43–80 (1975)
3. Mamdani, E.H.: Application of fuzzy algorithms for control of a simple dynamic plant. Proc. of the IEE 121(12), 1585–1588 (1974)
4. Mamdani, E.H., Assilian, S.: An experiment in linguistic synthesis with a fuzzy logic controller. Int. Journal of Man-Machine Studies 7(1), 1–13 (1975)
5. Sugeno, M., Kang, G.T.: Structure identification of fuzzy model. Fuzzy Sets and Systems 28(1), 15–33 (1988)
6. Takagi, T., Sugeno, M.: Fuzzy identification of systems and its applications to modeling and control. IEEE Trans. Systems, Man and Cybernetics 15, 116–132 (1985)
7. Zadeh, L.A.: A fuzzy-set-theoretic interpretation of linguistic hedges. Cybernetics and Systems 2(3), 4–34 (1972)
8. Yager, R.R., Filev, D.P.: Essentials of fuzzy modeling and control. John Wiley & Sons, New York (1994)
9. Mendel, J.M.: Fuzzy logic systems for engineering: A tutorial. Proc. of the IEEE 83(3), 345–377 (1995)
10. Pena-Reyes, C.-A., Sipper, M.: Fuzzy CoCo: Balancing accuracy and interpretability of fuzzy models by means of coevolution. In: Accuracy Improvements in Linguistic Fuzzy Modeling. Studies in Fuzziness and Soft Computing, vol. 129, pp. 119–146 (2003)
Automatic Code Generation on a MOVE Processor Using Cartesian Genetic Programming James Alfred Walker, Yang Liu, Gianluca Tempesti, and Andy M. Tyrrell Intelligent Systems Group, Department of Electronics, University of York, Heslington, York, YO10 5DD, UK {jaw500,yl520,gt512,amt}@ohm.york.ac.uk
Abstract. This paper presents for the first time the application of Cartesian Genetic Programming to the evolution of machine code for a simple implementation of a MOVE processor. The effectiveness of the algorithm is demonstrated by evolving machine code for a 4-bit multiplier with three different levels of parallelism. The results show that 100% successful solutions were found by CGP and by further optimising the size of the solutions, it is possible to find efficient implementations of the 4-bit multiplier that have the potential to be “human competitive”. Further analysis of the results revealed that the structure of some solutions followed a known general design methodology.
1 Introduction
In the past decade, evolvable hardware has attracted interest from both the circuit design and the evolutionary computation communities. Generally, there are two main branches of applications: optimising the elementary parameters of a circuit [2] and, more interestingly, creating a circuit from smaller compositional units [4,11]. In the latter, Cartesian Genetic Programming (CGP) [5] has demonstrated great potential to create combinatorial circuits. However, previous evolutionary digital circuit designs are limited by the fine-grained nature of the devices. Gate-level programmability provides the system with high flexibility but also enlarges the overall search space and thus increases the degree of complexity, tending to reduce the overall functionality that can be achieved. Lifting the granularity from the gate level to the processor architecture level provides another form of evolutionary medium (substrate). A typical coarse-grained programmable machine is a general purpose processor (GPP). At this level, Nordin et al. proposed a graph-based genetic programming technique, known as Linear Genetic Programming (LGP), to automatically generate machine code for GPPs [7]. The behaviour of a processor is expressed by executing the evolved instructions. However, except for conventional computer applications, GPPs are not always feasible due to performance reasons or the overall manufacturing cost. Usually, a highly customised computing architecture is more suitable for specific domains. Between the fine-grained and coarse-grained architectures, another architecture exists, which is known as the Application Specific Instruction Processor (ASIP).
As the name suggests, an ASIP usually performs application-oriented functionalities and its instructions are coded for particular operations. In order to reduce the complexity of the decoder logic and the length of the execution pipeline, an operation of an ASIP is usually an atomic control of the internal data flow of the processor. In this paper, a simple implementation of an ASIP, called MOVE, is targeted, focusing on automatic machine code generation using Cartesian Genetic Programming (CGP). Section 2 reviews the transport triggered architecture (TTA). Section 3 describes the CGP algorithm for code generation. Section 4 presents a demonstration of how to create a particular function using a sequence of evolved instructions. Finally, Section 5 concludes the paper and proposes future work.
2 Transport Triggered Architecture
The transport triggered architecture (TTA) was created in the 1980s as a development of the Very Long Instruction Word (VLIW) architecture [1]. A MOVE processor is an instantiation of the TTA. It has been used in other bio-inspired systems, such as the POEtic Project [9,6], because of its simplicity and modularity. It has also been adopted by the SABRE project (Self-healing cellular Architectures for Biologically-inspired highly Reliable Electronic systems) [3], as its high modularity provides intrinsic fault tolerance. In this section, the basic components and typical features of the TTA are briefly reviewed. The TTA usually contains an instruction decoder, an interconnection network and a number of functional units (FU), as shown in Fig. 1. Functional units and the decoder are connected through data and address buses. A single data/address bus pair is called a slot. A processor can have multiple slots to allow parallelism. An input/output interface of a functional unit is called a port. Ports are usually globally addressable. The connection of a port to a slot is called a socket and the number of connections in a socket is flexible. In the TTA, the instruction decoder is a unit that fetches code from the memory, decodes the instructions and controls the interconnecting network. The
Fig. 1. Transport triggered architecture [6]
Fig. 2. Instruction format
decoder has a program counter (PC) which points to the address of next instruction in the memory. The structure of a decoder is generally very simple, because instructions of a TTA only contain two types of information: the source port (SRC) address (or intermediate value) and the destination port (DST) address, as shown in Fig. 2. A source address and a destination address together indicate a data flow from one port to another. From this perspective, the TTA has similar characteristics to both control flow and data flow machines. Compared with conventional RISC, CISC or VLIW architectures, which are also called operation triggered architectures (OTA), TTA only has one instruction, which is: move destination, source. Some destination ports only store the incoming data, whilst others receive the data and also trigger the operation. Operations are not explicitly expressed in the instructions, but implicitly referred to by the address of the destination port. Once the decoder retrieves the addresses from an instruction, it will open the corresponding source and destination ports. In order to illustrate the implicit operations, we compare the TTA and RISC instruction formats. Most single RISC instructions can be represented by: opcode, result, operand0, operand1. It can be decomposed into 3 separate move operations, as shown in Table 1. The ports are presented by the name of a functional unit with subscripts. The triggered inputs of the functions are denoted xxxt and the non-triggered inputs are denoted by a number, for example, xxx0 . The outputs of the functions are denoted as xxxr for the result. There are some interesting features which distinguish TTA from conventional architectures. Firstly, a TTA processor does not need to explicitly transport the result from a FU to a general purpose register, unless the data is still useful after the next operation on the same FU has started. As shown on the right column in Table 1, a result of an FU can be directly transported into another FU as an operand. Therefore, there is a high probability that a TTA processor will use Table 1. Examples of RISC and TTA instructions. The adder, subtractor and registers are denoted by addx , subx , and r1 to r3, respectively) RISC
TTA
add r2, r2, r1
move move move move move move
sub r3, r3, r2
Optimised TTA add0 , r1 addt , r2 r2, addr sub0 , r2 subt , r3 r3, subr
move add0 , r1 move addt , r2 move sub0 , addr move subt , r3 move r3, subr
fewer general purpose registers than conventional OTA processors. Secondly, the granularity of functional units is highly flexible. The bitwidth and the number of data buses can range from 1 up to hundreds, and the functionality can range from a simple boolean operation up to a complicated integration transform. However, the instruction format and the bus control mechanism remains the same.
3 Generating MOVE Code Using CGP
There are various approaches that use GP techniques to evolve machine code for processors. For example, LGP is deliberately designed for OTA machines [7]. As mentioned in section 2, most RISC instructions can be decomposed into 2 or 3 MOVE instructions. Due to the intrinsic relationship of LGP and its target OTA machines, it is appropriate to use LGP to generate a RISC code and then translate this into a series of MOVE instructions. However, it may need an extra optimiser to bypass the redundant use of general purpose registers during or after the translation. Alternatively, CGP does not favour in any specific hardware architecture. Therefore, it is possible to apply CGP directly to a TTA. CGP was originally developed by Miller and Thomson [5] for the purpose of evolving digital circuits. It represents a program as a directed graph (that for feed-forward functions is acyclic). The benefit of this type of representation is that it allows the implicit re-use of nodes, as a node can be connected to the output of any previous node in the graph, thereby allowing the repeated re-use of sub-graphs. This is an advantage over tree-based GP representations (without ADFs) where identical sub-trees have to be constructed independently. Originally, CGP used a program topology defined by a rectangular grid of nodes with a user defined number of rows and columns. However, later work on CGP showed that it was more effective when the number of rows is chosen to be one [12]. This one-dimensional topology is used in this paper. In CGP, the genotype is a fixed length representation consisting of a list of integers which encode the function and connections of each node in the directed graph. However, CGP uses a genotype-phenotype mapping that does not require all of the nodes to be connected to each other, this results in the program (phenotype) being bounded but having variable length. Thus there maybe genes that are entirely inactive, having no influence on the phenotype, and hence the fitness. Such inactive genes therefore have a neutral effect on genotype fitness. This phenomenon is often referred to as neutrality. The influence of neutrality in CGP has been investigated in detail [5,12] and has been shown to be extremely beneficial to the efficiency of the evolutionary process on a range of problems. In this paper, the CGP genotype decodes to a phenotype that resembles a linear string of MOVE instructions. This is similar to the idea of applying CGP to the lawnmower problem [10]. However, the main difference between the work in [10] and this paper is how the linear string of instructions is constructed. In [10], the terminals to the CGP program represent the control instructions (such as move forward, turn right, etc) and the function set consisted of program nodes and other manipulative functions (such as vector addition) to assemble
Fig. 3. A CGP genotype and corresponding phenotype for a string of MOVE instructions. The inactive areas of the genotype and phenotype are shown in grey dashes.
the control instructions. In this paper, the terminals to the CGP program don’t actually represent anything, they simply act as a method for terminating the current branch of execution. It is the function of each node that performs the MOVE operations (using values from a lookup table) and it is how the nodes are decoded that produces the linear string of instructions. The CGP genotype is decoded using a recursive approach, which starts at the output terminal and iterates backwards through the graph, ensuring that the first input connection of the node is always evaluated before the second node input. An example is shown in Fig. 3. Once the linear string of instructions has been constructed, it is evaluated by iterating through the sequence and performing each MOVE instruction, which will transfer values between the input, function, and output registers. Once all of the instructions have been performed, it is the value in the output register that is compared with the perfect solution.
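A minimal sketch of this recursive decoding is shown below; the node layout, function indices and genotype are simplified stand-ins rather than the exact data structures used in this work.

    #include <cstdio>
    #include <vector>

    // Each node has a function gene (an index into the MOVE-operation lookup table)
    // and two connection genes. Indices below numInputs refer to terminals, which
    // only end a branch; larger indices refer to earlier nodes.
    struct Node { int function, in0, in1; };

    void decode(int index, int numInputs, const std::vector<Node>& nodes,
                std::vector<int>& program) {
        if (index < numInputs) return;                 // terminal: stop this branch
        const Node& n = nodes[index - numInputs];
        decode(n.in0, numInputs, nodes, program);      // first input before second
        decode(n.in1, numInputs, nodes, program);
        program.push_back(n.function);                 // emit this node's MOVE op
    }

    int main() {
        // Tiny hand-made genotype: 3 terminals, 3 nodes; 5 is the output gene,
        // pointing at the last node.
        std::vector<Node> nodes = { {5, 0, 1}, {9, 3, 2}, {4, 4, 3} };
        std::vector<int> program;
        decode(5, 3, nodes, program);
        for (int op : program) std::printf("%d ", op); // linear string of MOVE ops
        std::printf("\n");
        return 0;
    }

Note how the re-used node appears twice in the emitted sequence, which is the mechanism by which repeated instruction patterns arise in the phenotype.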
4 Experiment
To demonstrate the approach described in section 3, the algorithm is applied to a “4-bit multiplier” problem. The aim of the CGP algorithm is to find an efficient solution to the 4-bit multiplier problem using a MOVE architecture. In order to achieve this, a two stage fitness function is used. Initially, the fitness function evaluates all possible input combinations and performs a summation of the number of instances where the output of the evolved solution differs from that of the perfect solution. Once a solution is found, the fitness function minimises the length of the instruction sequence produced by evolution whilst keeping the functionality of the solution fixed. This is similar to the approach used in [4] for finding efficient logic circuits.
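The first stage of such a fitness function can be sketched as follows; evaluate() is a stand-in for executing the evolved MOVE instruction sequence (here hard-wired to a correct multiplier so that the fragment runs), and the second stage, not shown, simply counts instructions once the error reaches zero.

    #include <cstdio>

    // Stand-in for executing an evolved MOVE instruction sequence on inputs a and b.
    unsigned evaluate(unsigned a, unsigned b) { return (a * b) & 0xFF; }

    // Stage 1: count output mismatches over all 16 x 16 input combinations.
    int mismatches() {
        int errors = 0;
        for (unsigned a = 0; a < 16; ++a)
            for (unsigned b = 0; b < 16; ++b)
                if (evaluate(a, b) != ((a * b) & 0xFF)) ++errors;
        return errors;
    }

    int main() {
        std::printf("mismatches = %d\n", mismatches());
        return 0;
    }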
Table 2. The parameters used by CGP for the 4-bit multiplier problem
Parameter                          Value
Population size                    5
Genotype length (nodes/genes)      200/600
Mutation rate (% of genes)         2
Run length (generations)           10,000,000
Number of runs                     50
4.1 Parameters
The parameters used for the experiment are shown in Table 2. The CGP algorithm uses the (1 + 4) evolutionary strategy (a population size of 5) that is normally associated with the technique [5,12,10,4]. Each CGP node was allowed two inputs, so that the implicit re-use of nodes in the genotype was possible. The mutation rate was chosen based on previous experience and no crossover operator is used. The run length was chosen to allow the CGP algorithm enough time to find a solution and then optimise the solution size. The permitted MOVE operations for a single slot, which the CGP algorithm is allowed to use for the 4-bit multiplier with this function set, are shown in Table 3. The function set used consists of the functional units: addition (two inputs, one output), shift left with carry (one input, two outputs), shift right with carry (one input, two outputs) and a multiplexer (three inputs, one output). All functional units have a triggered input and possibly other non-triggered inputs depending on the functional unit. In addition to the function set, there are also three input registers (the third providing a constant value of "0") and an output register. As the 4-bit multiplier produces an 8-bit output, all input, output and function registers are 8-bit. In addition to the notations used to describe the MOVE operations, as mentioned in Section 2, xxxc is used to represent the carry out.
4.2 Results
Three versions of the CGP algorithm were run on the 4-bit multiplier problem in order to investigate whether performing MOVE operations in parallel was beneficial to both evolution and the overall solution size. In terms of the number of slots implemented in the MOVE processor, the three different variants of CGP are denoted as sequential for a single-slot, parallel2 for a double-slot and parallel3 for a triple-slot. The parallel2 and parallel3 algorithms were implemented with extended function sets that allowed some of the permitted MOVE operations to occur in parallel. For example, move operations to FUs with two input ports in parallel2 and move operations to FUs with two or three input ports in parallel3. In future work, it is intended to parallelise all possible combinations of permitted MOVE operations.
J.A. Walker et al. Table 3. The permitted move operations for the function set Addition (add) in0 in1 in2 addr shlr shlc shrr shrc muxr in0 in1 in2 addr shlr shlc shrr shrc muxr addr
→ → → → → → → → → → → → → → → → → → →
add0 add0 add0 add0 add0 add0 add0 add0 add0 addt addt addt addt addt addt addt addt addt out0
Shift Left (shl) in0 in1 in2 addr shlr shlc shrr shrc muxr shlr shlc
→ → → → → → → → → → →
shlt shlt shlt shlt shlt shlt shlt shlt shlt out0 out0
Shift Right (shr) in0 in1 in2 addr shlr shlc shrr shrc muxr shrr shrc
→ → → → → → → → → → →
shrt shrt shrt shrt shrt shrt shrt shrt shrt out0 out0
Multiplexer (mux) in0 in1 in2 addr shlr shlc shrr shrc muxr in0 in1 in2 addr shlr shlc shrr shrc muxr in0 in1 in2 addr shlr shlc shrr shrc muxr muxr
→ → → → → → → → → → → → → → → → → → → → → → → → → → → →
mux0 mux0 mux0 mux0 mux0 mux0 mux0 mux0 mux0 mux1 mux1 mux1 mux1 mux1 mux1 mux1 mux1 mux1 muxt muxt muxt muxt muxt muxt muxt muxt muxt out0
All three versions of the CGP algorithm were capable of finding 100% successful solutions to the 4-bit multiplier problem within the generation limit. Fig. 4 is a box and whisker plot showing the time taken for all three algorithms to find a successful solution. From the figure, it can be seen that the sequential algorithm performs best on average and that as more MOVE operations are performed in parallel the performance of the algorithm degrades. This could be attributed to the fact that the search space increases in proportion to the number of parallel MOVE operations, as the function sets scale from 69 MOVE operations for the sequential algorithm to 190 and 638 MOVE operations for the parallel2 and parallel3 algorithms respectively. This highlights the trade-off between improving the run-time performance of the solution and increasing the solution complexity. Fig. 5 shows a comparison between the size of the solution when it was first discovered and the size of the efficient solution after optimisation. From Fig. 5(a), it can be seen that on average, allowing parallel MOVE operations decreases the
Fig. 4. The number of generations required to find a solution for the sequential, parallel2 and parallel3 algorithms
Fig. 5. The number of instructions per solution before (a) and after (b) optimisation for sequential, parallel2 and parallel3
size of the solution found. From Fig. 5(b), it can be seen that all three algorithms reduce the size of the solutions found in Fig. 5(a) between 13.7 and 16.6 times on average, whilst still maintaining the trend that parallelism decreases the size of the solution. This allows for the evolution of efficient and feasible designs for the 4-bit multiplier on the MOVE architecture using both sequential and parallel MOVE operations. The most efficient solutions found for the sequential, parallel2 and parallel3 versions of CGP are shown in Table 4.
Table 4. The best efficient solutions evolved by the sequential, parallel2 and parallel3 versions of CGP Clock
Sequential
Cycle
Slot 1
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
in0 in2 in1 shrr muxr shrr shrr muxr shrr shlr addr shrr muxr shlr shlr in1 addr muxr shlr addr
→ → → → → → → → → → → → → → → → → → → →
mux1 mux0 shrt muxt add0 shrt muxt shlt shrt addt add0 muxt shlt shlt addt muxt shlt add0 addt out0
Clock Cycle 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Parallel2 Slot 1 in1 in0 shrr shrc addr shrc shlr shrr muxr shlr shrc shlr muxr shrr in0 muxr in1 muxr
→ → → → → → → → → → → → → → → → → →
shrt add0 shrt mux0 shlt muxt add0 shrt mux0 shlt muxt add0 mux0 muxt add0 mux0 muxt out0
Slot 2 |
in0 → addt
| addr → mux1 | muxr → addt | addr → mux1 | muxr → addt | addr → mux1 | muxr → addt | addr → mux1
Parallel3 Slot 1 in0 in1 shlc shlr muxr shrr shlr shlr muxr shrr shlr addr muxr muxr
→ → → → → → → → → → → → → →
shrt shlt mux0 add0 mux0 shrt shlt add0 mux0 shrt add0 addt mux0 out0
Slot 2
Slot 3
| in1 → mux1 | shrc → muxt | muxr → addt | addr → mux1 | shrr → muxt | muxr → addt | addr → mux1 | shrr → muxt | muxr → addt | addr → mux1 | shrr → muxt
4.3 Further Analysis of Solutions
In order to determine whether it is possible to extract any general design methodology from the evolved solutions, two results of the sequential algorithm are
Table 5. A Comparison between the structure of an efficient and a left-shift solution discovered by sequential CGP Clock Cycle 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
Efficient Solution in0 in2 in1 shrr muxr shrr shrr muxr shrr shlr addr shrr muxr shlr shlr in1 addr muxr shlr addr
→ → → → → → → → → → → → → → → → → → → →
mux1 mux0 shrt muxt add0 shrt muxt shlt shrt addt add0 muxt shlt shlt addt muxt shlt add0 addt out0
Left-shift Solution in0 shlr shlr shlr shlr in1 shlr addr shlr shlc muxr shlr addr shlr shlc muxr shlr addr shlr shlc muxr shlr addr shlr shlc muxr
→ → → → → → → → → → → → → → → → → → → → → → → → → →
shlt shlt shlt shlt shlt add0 addt mux1 mux0 muxt shlt addt mux1 mux0 muxt shlt addt mux1 mux0 muxt shlt addt mux1 mux0 muxt out0
compared in Table 5. The left column shows the most efficient solution discovered by the sequential algorithm. However, it is hard to determine whether it follows a general design methodology due to its ad hoc structure. The right column of Table 5, shows another solution discovered by the sequential algorithm. Although the number of instructions is larger than the efficient solution, repetitive patterns (also referred to as building blocks) can be observed throughout the solution. On examining the structure of the solution, it was found to follow a general design methodology known as the left-shift algorithm [8], which is one form of the shift-and-add multiplication algorithms. This general design methodology can be used to generate larger multipliers (i.e. 32-bit) for a MOVE processor. Alternatively, it may be possible to implement some of the discovered building blocks in the CGP function set in order to improve the scalability of the algorithm on larger multipliers. This will be investigated in future work. Generally, a repetitive pattern in a CGP design implies CGP node re-use, which usually facilitates the evolution speed. As previously mentioned, the CGP algorithm did not involve any loop branches (creating a cyclic graph) due to
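For reference, the left-shift (shift-and-add) scheme that the evolved solution rediscovered can be written in software as the following textbook sketch; it is not a transcription of the evolved MOVE code.

    #include <cstdio>

    // Left-shift multiplication: the partial product is shifted left each step and
    // the multiplicand is added whenever the current multiplier bit, taken from the
    // most significant end, is 1; this is the add/shift pattern visible in the
    // right column of Table 5.
    unsigned multiply4(unsigned a, unsigned b) {
        unsigned product = 0;
        for (int i = 3; i >= 0; --i) {
            product <<= 1;
            if ((b >> i) & 1u) product += a;
        }
        return product & 0xFF;
    }

    int main() {
        for (unsigned a = 0; a < 16; ++a)
            for (unsigned b = 0; b < 16; ++b)
                if (multiply4(a, b) != a * b) { std::printf("error\n"); return 1; }
        std::printf("all 256 products correct\n");
        return 0;
    }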
248
J.A. Walker et al.
the nature of representation. However, node re-use is effectively equivalent to a “for-loop” structure in higher level programming languages because at runtime a for-loop is also executed sequentially. The practical difference between node re-use and a for-loop only resides in their respective static forms, namely the size of the code. The size of the code is also affected by the number of slots available in the MOVE processor. This is very similar to the VLIW architecture. The total program storage is calculated by multiplying the total number of slots by the number of long word instructions. For instance, in Table 4, slots 2 and 3 in Parallel3 are free at the first clock cycle. However, additional “NOP” instructions have to be inserted into the free slots. Therefore, the actual size of the code for the three algorithms in Table 4 is 20, 36 and 42. Although Parallel3 runs faster than the others, it occupies a larger memory space. Speed and memory space are two significant criteria on which to evaluate a piece of machine code.
5
Conclusions and Future Work
This paper has presented for the first time the application of evolving machine code on a MOVE architecture using CGP. The results show that CGP is capable of evolving machine code that consists of sequential and parallel operations for the 4-bit multiplier. It has also been shown that by modifying the fitness function once a solution is found, it was also possible to discover efficient solutions that could potentially be classed as “human competitive”. In order to further our exploration in generating more effective code, there are a number of directions for future work. Firstly, a multi-objective optimisation technique will be introduced to comprehensively evaluate the result, as this would allow us to optimise the solutions for both performance and memory footprint. Secondly, the scalability of the CGP approach will be investigated on larger multipliers (e.g. 8-bit, 16-bit, 32-bit) and other complicated problems, in order to assess the computational feasibility of the approach for “real world” problems. Finally, the implementation of conditional loops should be investigated, as it can drastically affect the size of the code, especially when the number of loops is very large. Also, as MOVE is a highly customised processor, some performance-critical code (for example, a repetitive pattern of instructions) may also be transformed to a hardware functional unit to speed up the execution time.
References 1. Corporaal, H.: Microprocessor Architectures: From VLIW to TTA. John Wiley & Sons, Inc., New York (1998) 2. Hilder, J., Walker, J., Tyrrell, A.: Optimising variability tolerant standard cell libraries. In: IEEE Congress on Evolutionary Computation, CEC (2009) 3. Liu, Y., Timmis, J., Qadir, O., Tempesti, G., Tyrrell, A.: A developmental and immune-inspired dynamic task allocation algorithm for microprocessor array systems. In: Hart, E. (ed.) ICARIS 2010. LNCS, vol. 6209, pp. 199–212. Springer, Heidelberg (2010)
4. Miller, J.F., Job, D., Vassilev, V.K.: Principles in the evolutionary design of digital circuits - part I. Genetic Programming and Evolvable Machines 1(1), 8–35 (2000) 5. Miller, J.F., Thomson, P.: Cartesian genetic programming. In: Poli, R., Banzhaf, W., Langdon, W.B., Miller, J., Nordin, P., Fogarty, T.C. (eds.) EuroGP 2000. LNCS, vol. 1802, pp. 121–132. Springer, Heidelberg (2000) 6. Mudry, P.A.: A hardware-software codesign framework for cellular computing. Ph.D. thesis, EPFL (2009) 7. Nordin, P.: Evolutionary Program Induction of Binary Machine Code and its Applications. Ph.D. thesis, Universitat Dortmund am Fachereich Informatik (1997) 8. Parhami, B.: Computer Arithmetic: Algorithms and Hardware Designs. Oxford University Press, New York (2000) 9. Rossier, J., Thoma, Y., Mudry, P.A., Tempesti, G.: MOVE processors that selfreplicate and differentiate. In: Ijspeert, A.J., Masuzawa, T., Kusumoto, S. (eds.) BioADIT 2006. LNCS, vol. 3853, pp. 160–175. Springer, Heidelberg (2006) 10. Walker, J.A., Miller, J.F.: Embedded cartesian genetic programming and the lawnmower and hierarchical-if-and-only-if problems. In: Proceedings of the 2006 Genetic and Evolutionary Computation Conference (GECCO). ACM, New York (2006) 11. Walker, J.A., Miller, J.F.: The automatic acquisition, evolution and reuse of modules in cartesian genetic programming. IEEE Transactions on Evolutionary Computation 12, 397–417 (2008) 12. Yu, T., Miller, J.F.: Neutrality and the evolvability of boolean function landscape. In: Miller, J., Tomassini, M., Lanzi, P.L., Ryan, C., Tetamanzi, A.G.B., Langdon, W.B. (eds.) EuroGP 2001. LNCS, vol. 2038, pp. 204–217. Springer, Heidelberg (2001)
Coping with Resource Fluctuations: The Run-time Reconfigurable Functional Unit Row Classifier Architecture Tobias Knieper1 , Paul Kaufmann1 , Kyrre Glette2 , Marco Platzner1 , and Jim Torresen2 1
University of Paderborn, Department of Computer Science, Warburger Str. 100, 33098 Paderborn, Germany {tknieper,paul.kaufmann,platzner}@upb.de 2 University of Oslo, Department of Informatics, P.O. Box 1080 Blindern, 0316 Oslo, Norway {kyrrehg,jimtoer}@ifi.uio.no
Abstract. The evolvable hardware paradigm facilitates the construction of autonomous systems that can adapt to environmental changes and degrading effects in the computational resources. Extending these scenarios, we study the capability of evolvable hardware classifiers to adapt to intentional run-time fluctuations in the available resources, i.e., chip area, in this work. To that end, we leverage the Functional Unit Row (FUR) architecture, a coarse-grained reconfigurable classifier, and apply it to two medical benchmarks, the Pima and Thyroid data sets from the UCI Machine Learning Repository. We show that FUR’s classification performance remains high during changes of the utilized chip area and that performance drops are quickly compensated for. Additionally, we demonstrate that FUR’s recovery capability benefits from extra resources.
1
Introduction
Evolvable hardware (EHW) denotes the combination of evolutionary algorithms with reconfigurable hardware technology to construct self-adaptive and self-optimizing hardware systems. The term evolvable hardware was coined by de Garis [1] and Higuchi [2] in 1993. While the majority of EHW-related work focuses on the evolution of functionally correct circuits or circuits with a high functional quality, some authors investigate the robustness of EHW. The related literature ranges from the offline evolution of fault-tolerant circuits able to withstand defects in silicon [3] without significantly increasing the circuit's size [4], to compensating supply voltage drifts [5] and recurrent re-evolution after a series of deteriorating events such as wide-band temperature changes or radiation beam treatments [6,7]. Evolvable hardware has a variety of applications, one of which is classifier systems. A number of studies report on the use of EHW for classification applications such as character recognition [8], prosthetic hand control [9], sonar
return classification [10,11], and face image recognition [10]. These studies have demonstrated that EHW classifiers can outperform traditional classifiers such as artificial neural networks (ANNs) in terms of classification accuracy. For electromyographic (EMG) signal classification, it has been shown that EHW approaches can perform close to modern state-of-the-art classification methods such as support vector machines (SVMs) [9]. In this work we focus on robust EHW-based classifiers. The novelty is that we investigate classifier systems able to cope with changing resources at run-time and evaluate their classification performance while changing the size of the utilized chip area. To this end, we leverage the Functional Unit Row (FUR) architecture, a scalable and run-time reconfigurable classifier architecture introduced by Glette et al. [12]. During optimization, we increase and decrease the number of pattern matching elements included in FUR and study the development of the resulting classification accuracy and, specifically, the recovery capability of FUR. In contrast to most previous work that studies self-adaptation in response to stimuli from outside the system, we explicitly build our analysis on the assumption of resource competition between different tasks run inside an adaptable system. The paper is structured as follows: Section 2 presents the FUR architecture for classification tasks, its reconfigurable variant and the applied evolutionary optimization method. Benchmarks together with an overfitting analysis as well as the experiments with the reconfigurable FUR architecture are shown in Section 3. Section 4 concludes the paper and gives an outlook on future work.
2
The Reconfigurable Functional Unit Row Architecture
The Functional Unit Row (FUR) architecture for classification tasks was first presented by Glette in [12]. It is an architecture tailored to online evolution combined with fast reconfiguration. To facilitate online evolution, the classifier architecture is implemented as a circuit whose behavior and connections can be controlled through configuration registers, similar to the approach of Sekanina [7]. By writing the genome bitstream produced by a GA to these registers, one obtains the phenotype circuit which can then be evaluated. In [13], it was shown that the partial reconfiguration capabilities of FPGAs can be used to change the architecture's footprint. The amenability of FUR to partial reconfiguration is an important precondition for our work. In the following, we present the organization of the FUR architecture, the principle of the reconfigurable FUR architecture, and the applied evolutionary technique. For details about the implementation of FUR we refer to [12].
2.1
Organization of the FUR Architecture
Fig. 1 shows the overall organization of the FUR architecture. The FUR architecture is rather generic and can be used together with different basic pattern matching primitives [9,10]. It combines multiple pattern matching elements into
Fig. 1. The Functional Unit Row (FUR) Architecture is hierarchically partitioned for every category into Category Detection Modules (CDMs). For an input vector, a CDM calculates the likelihood of a previously trained category by summing up positive answers from basic pattern matching elements: the Category Classifiers (CCs). The CDM with the most activated CCs defines the FUR's decision.
a single module with graded output detecting one specific category. A majority voter decides for a specific category by identifying the module with the highest number of activated pattern matching elements. More specifically, for C categories the FUR architecture consists of C Category Detection Modules (CDMs). A majority vote on the outputs of the CDMs defines the FUR architecture decision. In case of a tie, the CDM with the lower index wins. Each CDM contains M Category Classifiers (CCs), basic pattern matching elements evolved from different randomly initialized configurations and trained to detect the CDM's category. A CDM counts the number of activated CCs for a given input vector, thus the CDM output varies between 0 and M. The architecture becomes specific with the implementation of the CCs. In our case we define a single CC as a row of Functional Units (FUs), shown in Fig. 2. The FU outputs are connected to an AND gate such that in order for a CC to be activated all FU outputs have to be 1. Each FU row is evolved from an initial random bitstream, which ensures a variation in the evolved CCs. The number of FU rows defines the resolution of the corresponding CDM.
Fig. 2. Category Classifier (CC): n Functional Units (FUs) are connected to an n-input AND gate. Multiple CCs with a subsequent counter for activated CCs define a CDM.
Fig. 3. Functional Unit (FU): The data MUX selects which of the input data to feed to the functions “>” and “≤”. The constant c is given by the configuration lines. Finally, a result MUX selects which of the function results to output.
The FUs are reconfigurable by writing the architecture's register elements. As depicted in Fig. 3, each FU behavior is controlled by configuration lines connected to the configuration registers. Each FU has all input bits to the system available at its inputs, but only one data element (e.g., one byte) is selected. This data is then fed to the available functions. While any number and type of functions could be imagined, Fig. 3 illustrates only two functions for clarity. In addition, the unit is configured with a constant value, c. This value and the input data element are used by the function to compute the output of the unit. Based on the data elements of the input, the functions available to the FU elements are greater than and less than or equal. Through experiments these functions have been shown to work well, and intuitively this allows for discriminating signals by looking at the different amplitudes.
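The decision process described in this section can be summarized by the following plain-C model. The configuration sizes are hypothetical and the real architecture is of course register-configured logic on the FPGA rather than software:

#include <stdio.h>

/* Minimal software model of the FUR decision process (hypothetical sizes). */

#define N_CATEGORIES 2    /* CDMs                   */
#define N_CC        10    /* CCs (FU rows) per CDM  */
#define N_FU         4    /* FUs per CC             */

typedef struct {
    int input_index;     /* which input data element the FU looks at       */
    int function;        /* 0: greater than, 1: less than or equal         */
    unsigned char c;     /* constant the selected element is compared with */
} fu_t;

/* One FU: select one input element and compare it with the constant c. */
static int fu_output(const fu_t *fu, const unsigned char *input)
{
    unsigned char a = input[fu->input_index];
    return fu->function == 0 ? (a > fu->c) : (a <= fu->c);
}

/* One CC (FU row): AND of all FU outputs. */
static int cc_output(const fu_t *cc, const unsigned char *input)
{
    for (int i = 0; i < N_FU; i++)
        if (!fu_output(&cc[i], input))
            return 0;
    return 1;
}

/* FUR decision: each CDM counts its activated CCs; the CDM with the highest
 * count wins, and ties are broken in favour of the lower index.           */
static int fur_classify(fu_t cdm[N_CATEGORIES][N_CC][N_FU],
                        const unsigned char *input)
{
    int best_cat = 0, best_count = -1;
    for (int cat = 0; cat < N_CATEGORIES; cat++) {
        int count = 0;
        for (int row = 0; row < N_CC; row++)
            count += cc_output(cdm[cat][row], input);
        if (count > best_count) {   /* strict '>' keeps the lower index on ties */
            best_count = count;
            best_cat = cat;
        }
    }
    return best_cat;
}

int main(void)
{
    static fu_t cdm[N_CATEGORIES][N_CC][N_FU];   /* all-zero demo configuration */
    unsigned char sample[8] = { 5, 0, 0, 0, 0, 0, 0, 0 };
    printf("decision: category %d\n", fur_classify(cdm, sample));
    return 0;
}

Using a strict comparison while searching for the maximum count directly implements the tie-breaking rule in favour of the CDM with the lower index.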
2.2
Reconfigurable FUR Architecture
The notion of Evolvable Hardware is based on circuit optimization and reconfiguration. EHW-type adaptable systems improve their behavior in response to system-internal and external stimuli, offering an alternative to classically engineered adaptable systems. While the adaptation to environmental changes represents the main research line within the EHW community, the ability to balance resources dynamically between multiple concurrent applications is still a rather unexplored topic. On the one hand, an EHW module might run as one out of several applications sharing a system's restricted reconfigurable resources. Depending on the current requirements, the system might decide to switch between multiple applications or run them concurrently, albeit with reduced logic footprints and reduced performance. We are interested in scalable EHW modules and architectures that can cope with such changing resource profiles. On the other hand, the ability to deal with fluctuating resources can be used to support the optimization process, for example by assigning more resources when the speed of adaptation is crucial. The FUR architecture precisely fits this requirement as its structure can be changed (disregarding the register-reconfigurable FUs) along three dimensions, namely the number of
– categories,
– FU rows in a category, and
– FUs in a FU row.
In this work we keep the numbers of categories and FUs per FU row constant and reconfigure the number of FU rows in a CDM. This is illustrated in Fig. 4. For a sequence I = {i1, i2, ..., ik} we evolve a FUR architecture having i1 FU rows per CDM, then switch to i2 FU rows per CDM and re-evolve the architecture without flushing the configuration evolved so far. The key insights we want to gain by this investigation are the sensitivity of the FUR architecture's classification accuracy to changes in the available resources, and the time needed to re-establish near-asymptotic accuracy.
Fig. 4. Reconfigurable Functional Unit Row Architecture: The FUR architecture is configured by the number of categories, FU rows and FUs per FU row. In our work we fix the number of categories and FUs per FU row while changing the number of FU rows per CDM.
2.3
Evolution of FUR Architecture
To evolve a FUR classifier we employ a 1+4 ES scheme. In contrast to previous work [12], we do not use incremental evolution, evolving CDMs and FU rows separately, but evolve the complete FUR architecture in a single ES run. The mutation operator is configured to mutate three genes in every FU row. In preparation for the experiments on the reconfigurable FUR architecture we investigate FUR's general performance by evaluating it on a set of useful configurations of FU rows per CDM and FUs per FU row. The performance is calculated by a 12-fold Cross Validation (CV) scheme.
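A minimal sketch of such a 1+4 ES loop is given below. The genome layout and the fitness evaluation are placeholders (assumptions); only the selection scheme and the mutation of three genes per FU row follow the description above:

#include <stdlib.h>
#include <string.h>

#define N_ROWS            30   /* FU rows in the genome (hypothetical)     */
#define GENES_PER_ROW     12   /* genes encoding one FU row (hypothetical) */
#define GENOME_LEN        (N_ROWS * GENES_PER_ROW)
#define LAMBDA             4   /* offspring per generation                 */
#define MUTATIONS_PER_ROW  3   /* genes mutated in every FU row            */

typedef unsigned char gene_t;

/* Placeholder fitness: stands in for the training-set classification
 * accuracy of the FUR circuit configured with this genome.              */
static double evaluate(const gene_t *genome)
{
    int hits = 0;
    for (int i = 0; i < GENOME_LEN; i++)
        hits += genome[i] > 127;
    return (double)hits / GENOME_LEN;
}

static void mutate(gene_t *genome)
{
    for (int r = 0; r < N_ROWS; r++)
        for (int m = 0; m < MUTATIONS_PER_ROW; m++) {
            int g = r * GENES_PER_ROW + rand() % GENES_PER_ROW;
            genome[g] = (gene_t)(rand() & 0xff);
        }
}

void evolve(gene_t *parent, int generations)
{
    double parent_fit = evaluate(parent);

    for (int gen = 0; gen < generations; gen++) {
        gene_t child[GENOME_LEN], best[GENOME_LEN];
        double best_fit = -1.0;

        for (int i = 0; i < LAMBDA; i++) {
            memcpy(child, parent, GENOME_LEN);
            mutate(child);
            double fit = evaluate(child);
            if (fit > best_fit) {
                best_fit = fit;
                memcpy(best, child, GENOME_LEN);
            }
        }
        /* Keep the parent unless an offspring is at least as fit;
         * accepting equal fitness lets neutral mutations drift.   */
        if (best_fit >= parent_fit) {
            memcpy(parent, best, GENOME_LEN);
            parent_fit = best_fit;
        }
    }
}

int main(void)
{
    gene_t genome[GENOME_LEN] = { 0 };
    evolve(genome, 1000);
    return 0;
}

Whether offspring of equal fitness replace the parent is not specified above; allowing it, as in this sketch, is the usual choice when neutral drift is desired.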
3
Experiments and Results
In this section we present two kinds of results. Initially, we analyze FUR's behavior by successively testing a range of parameter combinations. Combined with an overfitting analysis we are then able to picture FUR's complete behavior for
a given benchmark. Afterwards, we select a good-performing configuration to investigate FUR's performance when being reconfigured during run-time. For this experiment we define multiple FUR architecture configurations with a varying number of FU rows and plot the accuracy development when switching between the configurations.
3.1
Benchmarks
For our investigations we rely on the UCI machine learning repository [14] and specifically on the Pima and the Thyroid benchmarks. Pima, or the Pima Indians Diabetes data set, was collected by the Johns Hopkins University in Baltimore, MD, USA and consists of 768 samples with eight feature values each, divided into a class of 500 samples representing negatively tested individuals and a class of 268 samples representing positively tested individuals. The data of the Thyroid benchmark represents samples of regular individuals and individuals suffering from hypo- and hyperthyroidism. Thus, the samples are divided into 6.666, 166 and 368 samples representing regular, subnormal and hyper-function individuals. A sample consists of 22 feature values. The Pima and the Thyroid benchmarks do not require the high classification speeds of EHW hardware classifiers; however, these benchmarks have been selected because of their pronounced effects in the run-time reconfiguration experiment, revealing FUR's characteristics.
3.2
Accuracy and Overfitting Analyses
We implement FUR’s parameter analysis by a grid search over the number of FU rows and number of FUs. For a single (i, j)-tuple, where i denotes the number
Fig. 5. Overfitting analysis: In this example the test and training accuracies would be roughly 0.76 and 0.76, respectively.
Fig. 6. Pima and Thyroid overfitting analysis: Best generalization and the corresponding termination training accuracies for the Pima (a) (b) and the Thyroid (c) (d) benchmarks, respectively.
For a single (i, j)-tuple, where i denotes the number of FU rows and j the number of FUs, we evolve a FUR classifier by running the evolutionary algorithm for 100.000 generations. As we employ a 12-fold cross validation scheme, the evolution is repeated 12 times while alternating the training and test data sets. During the evolution we log, for every increase in the training accuracy, FUR's performance on the test data set. The test accuracies are not used while the evolution runs. To detect the test accuracy at which the FUR architecture starts to approximate the training set tightly and simultaneously lose its ability to generalize, we average the test accuracies logged during the evolutionary runs and select the termination training accuracy according to the highest average test accuracy. This is shown in Fig. 5 for the Pima benchmark and the (30, 8) configuration. The test accuracy, drawn along the y-axis, rises in relation to the training accuracy, drawn along the x-axis, until the training accuracy reaches 0.76. After this point the test accuracy degrades gradually. Consequently, we note 0.76 and 0.76 as the best combination of test and termination training accuracies. To cover the interesting parameter areas and keep the computational effort low we evaluate the Pima and Thyroid benchmarks for 2, 4, 6, ..., 20 FUs per FU row and for 2, 4, 6, 8, 10, 14, 16, 20, 25, 30, 35, 40, 50, 60, 70, 80 FU rows. Fig. 6 shows the results for both benchmarks. In the horizontal plane the diagrams span the parameter area of FU rows and FUs. The accuracy for each parameter tuple is drawn along the z-axis with a projection of equipotential accuracy lines onto the horizontal plane.
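The selection of the termination point described above can be sketched as follows; the logging data structure and the 1% discretization of the training accuracy are assumptions made for illustration:

#define N_BINS 101   /* training accuracy discretised in 1% steps */

typedef struct {
    double sum[N_BINS];   /* summed logged test accuracies per bin */
    int    cnt[N_BINS];
} acc_log_t;

/* Called whenever the training accuracy of a run improves. */
void log_point(acc_log_t *log, double train_acc, double test_acc)
{
    int bin = (int)(train_acc * (N_BINS - 1) + 0.5);
    log->sum[bin] += test_acc;
    log->cnt[bin]++;
}

/* Termination training accuracy: bin with the highest average test accuracy. */
double termination_accuracy(const acc_log_t *log)
{
    int best = 0;
    double best_avg = -1.0;
    for (int b = 0; b < N_BINS; b++) {
        if (log->cnt[b] == 0)
            continue;
        double avg = log->sum[b] / log->cnt[b];
        if (avg > best_avg) {
            best_avg = avg;
            best = b;
        }
    }
    return (double)best / (N_BINS - 1);
}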
While the test accuracies for the Pima benchmark, presented in Fig. 6(a), are largely independent of the number of FUs and FU rows, with small islands of improved behavior around the (8, 8–10) configurations, the Thyroid benchmark presented in Fig. 6(c) shows a performance loss in regions with a large number of FUs and few FU rows. Tables 1 and 2 compare FUR's results for the Pima and the Thyroid benchmarks to related work. Additionally, we use the data mining tool RapidMiner [15] to generate results for standard and state-of-the-art algorithms and their modern implementations. To this end, we evaluate in a 12-fold cross validation manner the algorithms: Decision Trees (DTs), k-Nearest Neighbor (kNN), Multi-layer Perceptrons (MLPs), Linear Discriminant Analysis (LDA), Support Vector Machines (SVMs) and Classification and Regression Trees (CART). For the Pima benchmark our architecture outperforms any other method. It forms, together with SVMs, LDA, Shared Kernel Models and kNNs, a group of best performing algorithms within a 3% margin. The accuracy range of the Thyroid benchmark is much smaller because of the irregular category data size proportions and a single dominant category accounting for 92.5% of the data. In this benchmark our architecture lies 0.66% behind the best algorithm.
Table 1. Pima benchmark: Error rates and standard deviation in %. We use the data mining toolbox RapidMiner [15] to evaluate the algorithms marked by “*”. As a preliminary step, we identify good performing algorithm parameters by a grid search. Remaining results are taken from [16].

Algorithm                          Error Rate
FUR                                21.35
SVM*                               22.79
LDA*                               23.18
Shared Kernel Models               23.27
kNN*                               23.56
GP with OS, |pop|=1.000            24.47
CART*                              25.00
DT*                                25.13
GP with OS, |pop|=100              25.13
MLP*                               25.26
Enhanced GP                        25.80 – 24.20
Simple GP                          26.30
ANN                                26.41 – 22.59
EP / kNN                           27.10
Enhanced GP (Eggermont et al.)     27.70 – 25.90
GP                                 27.85 – 23.09
GA / kNN                           29.60
GP (de Falco et al.)               30.36 – 24.84
Bayes                              33.40

± Standard Deviation: 4.84, 4.64, 2.56, 3.07, 3.69, 3.61, 4.30, 4.95, 4.50; 1.91 – 2.26; 1.29 – 1.49; 0.29 – 1.30
Table 2. Thyroid benchmark: Error rates and standard deviation in %. We use the data mining toolbox RapidMiner [15] to evaluate the algorithms marked by “*”. As a preliminary step, we identify good performing algorithm parameters by a grid search. Remaining results are taken from [16].

Algorithm                  Error Rate
DT*                        0.29
CART*                      0.42
CART                       0.64
PVM                        0.67
Logical Rules              0.70
FUR                        1.03
GP with OS                 1.24
GP                         1.44 – 0.89
BP + local adapt. rates    1.50
ANN                        1.52
BP + genetic opt.          1.60
GP                         1.60 – 0.73
Quickprop                  1.70
RPROP                      2.00
GP (Gathercole et al.)     2.29 – 1.36
SVM*                       2.35
MLP*                       2.38
ANN                        2.38 – 1.81
PGPC                       2.74
GP (Brameier et al.)       5.10 – 1.80
kNN*                       5.96

± Standard Deviation: 0.18, 0.27; 0.51, 0.62; 0.44

3.3
Reconfigurable FUR Architecture Results
In our second experiment we investigate FUR's classification behavior under changes in the available resources while it is being optimized. We execute for both benchmarks a single experiment where we configure a FUR architecture with 4 FUs per FU row and change the number of FU rows every 40.000 generations. We split the data set into disjoint training and test sets, analogously to the previously used 12-fold cross validation scheme, and start the training of the FUR classifier with 40 FU rows. Then, we gradually change the number of employed FU rows to 38, 20, 4, 3, 2, 1, 20, 30, 40, executing altogether 400.000 generations. Fig. 7 shows the results for the Pima benchmark. We observe the following:
– The training accuracy drops significantly for almost any positive and negative change in the number of FU rows and recovers subsequently.
– While the asymptotic training accuracy is lower when using only few FU rows, the test accuracy tends to reach the usual accuracy level for any FU row configuration. This behavior is visible from generation 120.000 to 280.000 in Fig. 7 and is confirmed by the previous results shown in Fig. 6 (a).
– The recovery rate of the test accuracy depends on the number of FU rows. While for periods with few FU rows the recovery rate is slow, for periods with 20 and more FU rows the evolutionary process manages to recover the test accuracy much faster. Interestingly, the rise of the training accuracy for generations 280.000 to 320.000 results in a falling test accuracy. This could be a statistical effect, where the test accuracy varies in some interval as the classifier is evolved from a randomly initialized configuration.
– The test accuracy is mostly located between 0.6 and 0.7, independent of the changes in the number of FU rows. Thus, and this is the main observation, the FUR architecture shows to a large extent a robust test accuracy behavior under reconfiguration for the Pima benchmark.
Fig. 7. The Reconfigurable Pima benchmark: Changing the classifier's resources (number of FU rows) during the optimization run.
Figure 8 presents the results for the Thyroid benchmark. We observe the following:
– The training accuracy, similar to the Pima results, drops significantly when changing the number of FU rows.
– As anticipated by the previous results shown in Fig. 6 (c), the test accuracy drops for FUR architecture configurations with very few FU rows. This can be observed in Fig. 8 at generations 120.000 to 280.000.
– Because of the uneven distribution of category data sizes, the test accuracy deviation is smaller and follows more tightly the development of the training accuracy.
Fig. 8. Reconfigurable Thyroid benchmark: Changing the classifier's resources (number of FU rows) during the optimization run.
– Analogously to the observations made for the Pima benchmark, more FU rows increase the test accuracy recovery rate.
– The main result is that reconfigurations of the FUR architecture are quickly compensated for in the test accuracy. The limitation in the case of the Thyroid benchmark is a minimum number of FU rows required to obtain robust behavior.
In summary, as long as the FUR configuration contains enough FU rows, FUR's test accuracy behavior is stable during reconfigurations. Additionally, more FU rows enable faster convergence.
4
Conclusion
In this work we propose to leverage the FUR classifier architecture for creating evolvable hardware systems that can cope with fluctuating resources. We describe this reconfigurable FUR architecture and experimentally evaluate it on two medical benchmarks. First, we analyze the overfitting behavior and show that the FUR architecture performs similarly to or better than state-of-the-art classification algorithms. Then we demonstrate that FUR's generalization performance is robust to changes in the available resources as long as a certain number of FU rows is present in the system. Furthermore, FUR's capability to recover from a change in the available resources benefits from additional FU rows.
References 1. de Garis, H.: Evolvable Hardware: Genetic Programming of a Darwin Machine. In: Intl. Conf. of Artificial Neural Nets and Genetic Algorithms, pp. 441–449. Springer, Heidelberg (1993) 2. Higuchi, T., Niwa, T., Tanaka, T., Iba, H., de Garis, H., Furuya, T.: Evolving Hardware with Genetic Learning: a First Step Towards Building a Darwin Machine. In: From Animals to Animats, pp. 417–424. MIT Press, Cambridge (1993) 3. Miller, J., Hartmann, M.: Untidy Evolution: Evolving Messy Gates for Fault Tolerance. In: Liu, Y., Tanaka, K., Iwata, M., Higuchi, T., Yasunaga, M. (eds.) ICES 2001. LNCS, vol. 2210, pp. 14–25. Springer, Heidelberg (2001) 4. Haddow, P.C., Hartmann, M., Djupdal, A.: Addressing the Metric Challenge: Evolved versus Traditional Fault Tolerant Circuits. In: Adaptive Hardware and Systems (AHS), pp. 431–438. IEEE, Los Alamitos (2007) 5. Sekanina, L.: Evolutionary Design of Gate-Level Polymorphic Digital Circuits. In: Rothlauf, F., Branke, J., Cagnoni, S., Corne, D.W., Drechsler, R., Jin, Y., Machado, P., Marchiori, E., Romero, J., Smith, G.D., Squillero, G. (eds.) EvoWorkshops 2005. LNCS, vol. 3449, pp. 185–194. Springer, Heidelberg (2005) 6. Stoica, A., Zebulum, R.S., Keymeulen, D., Daud, T.: Transistor-Level Circuit Ex´ periments Using Evolvable Hardware. In: Mira, J., Alvarez, J.R. (eds.) IWINAC 2005. LNCS, vol. 3562, pp. 366–375. Springer, Heidelberg (2005) 7. Sekanina, L.: Evolutionary Functional Recovery in Virtual Reconfigurable Circuits. Journal of Emerging Technologies in Computing Systems 3(2) (2007) 8. Higuchi, T., Iwata, M., Kajitani, I., Iba, H., Hirao, Y., Manderick, B., Furuya, T.: Evolvable Hardware and its Applications to Pattern Recognition and FaultTolerant Systems. In: Sanchez, E., Tomassini, M. (eds.) Towards Evolvable Hardware 1995. LNCS, vol. 1062, pp. 118–135. Springer, Heidelberg (1996) 9. Glette, K., Gruber, T., Kaufmann, P., Torresen, J., Sick, B., Platzner, M.: Comparing Evolvable Hardware to Conventional Classifiers for Electromyographic Prosthetic Hand Control. In: Adaptive Hardware and Systems (AHS), pp. 32–39. IEEE, Los Alamitos (2008) 10. Yasunaga, M., Nakamura, T., Yoshihara, I.: Evolvable Sonar Spectrum Discrimination Chip Designed by Genetic Algorithm. In: Systems, Man and Cybernetics, vol. 5, pp. 585–590. IEEE, Los Alamitos (1999) 11. Glette, K., Torresen, J.: A Flexible On-Chip Evolution System Implemented on a Xilinx Virtex-II Pro Device. In: Moreno, J.M., Madrenas, J., Cosp, J. (eds.) ICES 2005. LNCS, vol. 3637, pp. 66–75. Springer, Heidelberg (2005) 12. Glette, K., Torresen, J., Yasunaga, M.: An Online EHW Pattern Recognition System Applied to Face Image Recognition. In: Giacobini, M. (ed.) EvoWorkshops 2007. LNCS, vol. 4448, pp. 271–280. Springer, Heidelberg (2007) 13. Torresen, J., Senland, G., Glette, K.: Partial reconfiguration applied in an on-line evolvable pattern recognition system. In: NORCHIP 2008, pp. 61–64. IEEE, Los Alamitos (2008) 14. Asuncion, A., Newman, D.: UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences (2007) 15. Mierswa, I., Wurst, M., Klinkenberg, R., Scholz, M., Euler, T.: YALE: Rapid Prototyping for Complex Data Mining Tasks. In: Intl. Conf. on Knowledge Discovery and Data Mining (KDD), pp. 935–940 (2006) 16. Winkler, S.M., Affenzeller, M., Wagner, S.: Using Enhanced Genetic Programming Techniques for Evolving Classifiers in the Context of Medical Diagnosis. In: Genetic Programming and Evolvable Machines, vol. 10(2), pp. 111–140. 
Kluwer Academic Publishers, Dordrecht (2009)
A Self-reconfigurable FPGA-Based Platform for Prototyping Future Pervasive Systems Jean-Marc Philippe, Benoît Tain, and Christian Gamrat CEA, LIST, Embedded Computing Laboratory, Point Courrier 94, Gif-sur-Yvette, F-91191 France [email protected]
This work was supported and funded by the European Commission under Project ÆTHER No. FP6-2004-IST-4-027611.
Abstract. The progress in hardware technologies has led to the possibility to embed more and more computing power in portable, low-power and low-cost electronic systems. Currently almost any everyday device such as cell phones, cars or PDAs uses at least one programmable processing element. It is forecasted that these devices will be more and more interconnected in order to form pervasive systems, enabling the users to compute everywhere and at every time. This paper presents an FPGA-based self-reconfigurable platform for prototyping such future pervasive systems. The goal of this platform is to provide a generic template enabling the exploration of self-adaptation features at all levels of the computing framework (i.e. application, software, runtime architecture and hardware points of view) using a real implementation. Self-adaptation is provided to the platform by a set of closed loops comprising observation, control and actuators. Based on these loops (providing the platform with introspection), the platform can manage multiple applications (that may use parallelism) together with multiple areas able to be loaded on-demand with hardware accelerators during runtime. It can also be provided with self-healing using a model of itself. Finally, the accelerators implemented in hardware can learn how to perform their computation from a software golden model. Focusing on the low-level part of the computing framework, the paper aims at demonstrating the interest of self-adaptation combined with collaboration between hardware and software to cope with the constraints raised by future applications and systems.
1
Introduction
Thanks to the continuous technology shrink, computer designers are able to embed more and more computing power in almost every object of everyday life. Additionally, these objects are meant to be more and more interconnected, letting people enter the ubiquitous or pervasive computing era (many computers per
person) after the mainframe era (many people, one computer) and the personal computer era (one computer per person) [1]. For example, these communicating resources can be found in modern cars (which can use more than 60 connected embedded CPUs) as well as in cell phones or laptops and even in clothes (e.g. wearable computing). The pervasive systems formed by these networked devices provide the users with invisible services that enable them to compute everywhere and at every time. Based on that, one can observe the evolution of already existing applications and the emergence of new applications requiring more and more portable computing power (such as mobile television). These applications also put a lot of constraints on the underlying computing resources: besides the necessary high computing power, low power consumption as well as fault tolerance and the ability to compute highly heterogeneous data flows are seen as important for future computing devices. One solution to face these different constraints is to take advantage of the dynamic adaptability of modern reconfigurable architectures. Being able to change on the fly the working parameters and structure of a computing device enables it to be adapted to a lot of application domains as well as to the different needs of the surrounding users [2]. Unfortunately, the high numbers of both constraints and possible states of the computing system make it difficult, or even impossible, to manage with traditional control mechanisms. A possible solution to this problem is to embed into the system the necessary abilities and knowledge to enable it to manage itself: a way to observe its state and its environment, a way to take decisions and a way to apply these decisions in order to change its state to better fit the environment. This basic behavior is known as self-adaptation. Based on the fact that pervasive systems are built from a very high number of computing resources, it is obvious to also provide these resources with the ability to share information or even tasks. This collaborative behavior is also seen as very important [3]. The exploration of the different techniques enabling both embedded self-adaptation and collaboration is a very complex research subject. This paper presents a self-reconfigurable FPGA-based platform used to prototype solutions based on closed loops for making hardware and software collaborate seamlessly so as to ease the work of pervasive system designers. The rest of this paper is organized as follows. Section 2 presents different works based on self-adaptation as well as the concept behind the paper. The proposed self-reconfigurable platform is introduced in Section 3 from the hardware and software points of view. Section 4 deals with the chosen test applications for evaluating the platform. Before concluding we present the experimental results in Section 5.
2
Related Works on Self-adaptation and Context
Self-adaptation can be defined as the ability of a system to adapt to its environment by allowing its components to monitor this environment and change their behavior in order to preserve or improve the operation of the system according
to some defined criteria. This definition is related with either the modification of some parameters that define the working point of the device (e.g. the power supply voltage and the clock frequency) or the modification of the structure of the architecture (both at software and hardware levels). For example, self-adaptation can be implemented in order to provide the architecture with fault-tolerance by enabling the monitoring of temperature for the detection of transient hot spots that may damage some parts of a chip [4]. Preventing chip damages as well as self-repairing some runtime defects (e.g. caused by electromigration) is a promising idea for future computing architectures that can monitor some of the variables that characterize their state [5]. At a higher level, a controller can observe the task the architecture has to do so as to select and download the more efficient partial bitstream to implement the requested computation [6]. More advanced self-X features can also be implemented by introducing self-placement and self-routing properties to an architecture which can autonomously modify its structure in order to achieve fault detection and fault recovery [7]. In the ÆTHER project, a general model of a basic computing entity that aims to be networked with other entities of the same type to form complete systems was introduced [3]. Each of these entities is meant to be self-adaptive, which implies that they can change their own behavior to react to changes in their environment or to respect some given constraints. As shown in Fig. 1, the Self-Adaptive Networked Entity, or SANE in short, is a self-contained component composed of mainly four parts. The first one is the computing engine, dedicated to data processing. It can be adapted to the wide range of algorithms that the SANE system is able to compute. The second part is the observer which is responsible for monitoring the computing process and some runtime parameters related to the environment as well as the chip. This observation process enables the SANE to be aware of itself, of its environment, and of its computing performance related to the loaded task. The role of the controller part is to take all the decisions related to the ongoing computation task. The closed loop composed of the monitoring process associated with an adaptation controller provides the SANE with the self-adaptation ability. The last part of the SANE is the communication interface, dedicated to
Fig. 1. Functional view of the SANE
collaboration between the SANEs. The collaboration process is done through a publish/discover mechanism that allows a SANE to publish its abilities and to discover the computing environment formed by the other SANEs in its neighborhood. This mechanism enables the SANEs to exchange their tasks or just to clone their states to other SANEs [8].
3
Description of the Prototyping FPGA-Based Platform
In order to study the properties of the above-mentioned SANE model, a generic physical prototype was implemented. This section describes both the hardware and software sides of the platform prototype. It also gives an overview of the chosen task allocation mechanism (one possible service of the adaptation controller of the SANE) for hiding hardware complexity from the application point of view.
3.1
Hardware Part of the Prototype
The platform is based on a Virtex-4 FPGA (Xilinx ML402 board) which has self-reconfiguration abilities thanks to the Internal Configuration Access Port (ICAP) of some Xilinx FPGAs. The platform is partitioned into one static area containing a Microblaze (32-bit RISC core) for controlling the platform and four dynamically and partially reconfigurable (DPR) areas, as shown in Fig. 2. In the current implementation, each area is composed of 3192 LUT, 20 DSP and 20 RAMB16 blocks (maximum available resources for one operator).
Fig. 2. High-level view of the platform including the static area (Microblaze), four dynamic hardware areas and the floorplan of the platform on Xilinx PlanAhead
Fig. 3. Standardized interface for all operators
The Xilinx Partial Reconfiguration Early Access tools were used with both EDK (Embedded Design Kit) and PlanAhead for generating the static and partial bitstreams. All the input and output ports of the different hardware accelerators cross the boundary between the static part and the dynamic part using bus macros, which are used to lock the routing between dynamic and static regions. For reusability, they were encapsulated into an interface core (see Fig. 2) which allows the designer to create a new operator without caring about bus macros since they are already placed in this interface core. It also provides a standardized interface between the Microblaze and the operators since it is based on FSL (Fast Simplex Link). The interface also provides a direct connection of the operator to external devices of the board (such as a VGA camera in the histogram equalization application shown in Section 4), as shown in Fig. 3. The data transmission protocol consists of a data valid signal for indicating that the data present on the link is ready to be read. Different SANE prototypes can be linked using the Ethernet connection available on the board.
3.2
Software Part of the Prototype
The software part of the platform is managed by the Petalinux distribution of µCLinux [9]. µCLinux is a port of the Linux kernel to support embedded processors without a memory management unit (MMU) (Microblaze and Petalinux have supported the MMU since versions 7 and 0.30 respectively, with the 2.6 Linux kernel). From an application programming perspective, Petalinux offers an interface almost identical to standard Linux, including command shells, C library support and Unix system calls, C and C++ compilers as well as execution of software applications with POSIX threads, and the use of the µCLinux ICAP driver for dynamically setting the configuration of the four DPR areas from the software.
3.3
Management of the Platform: The Allocation Controller
The accesses to the ICAP (for internal self-reconfiguration) as well as the allocation of the DPR areas to the different applications are controlled by an application called the Allocation Controller (AC). It manages the configuration of the platform based on its exclusive access to the ICAP: the reconfigurations are done by the AC based on both requests from the different computing applications and the
Fig. 4. High-level structure of the management of the platform. The AC exchanges commands with the applications and configures the different hardware areas.
availability and configuration of the different DPR areas. For this purpose, the AC has an internal model of the platform which provides it with self-awareness. This internal representation (which, in its first version, indicates if a DPR area is free and the identifier of the loaded partial bitstream) is updated when new accelerators are requested. The AC sends the identifiers of the allocated DPR areas to the requesting applications, enabling their computing threads to seamlessly use dedicated accelerators loaded on demand by the AC. For this purpose, the AC has access both to the ICAP and to a local bitstream repository which is located in the DDR memory of the board (see Fig. 4). The communication between the applications and the AC uses semaphores (to lock the computing resources) and a shared memory. A semaphore on the AC allows only one communication with the AC at a time. In case two threads (possibly from different applications) request a resource from the AC, the second thread is suspended until the AC has finished the first allocation and released the semaphore (the current implementation is based on a first come first served algorithm). There are also semaphores for the management of the shared memory and for the hardware reconfigurable slots (e.g. when all DPR areas are in use, requesting computing threads are suspended). The shared memory is used by the computing applications and the AC to exchange request and allocation commands. Before creating a computing thread, the application sends a request command composed of the identifier of the requested hardware accelerator to be loaded by writing to the shared memory. The allocation controller compares the identifier of the requested accelerator with the identifiers of the accelerators implemented on the FPGA. If the identifiers are different, the AC reconfigures one area with the appropriate bitstream. Then the AC answers the requesting thread by sending the identifier of the assigned hardware area. When the thread has finished, the requesting application releases the DPR area by sending a command to the AC. The commands are composed of 32-bit words (4 octets) comprising the command identifier that is sent by the applications to the AC (request an area to be loaded with a given operator or release the area when the corresponding computing thread has finished). The second octet stores the identifier of a hardware area. This octet is written by the AC to send the identifier of the assigned DPR
area to the applications and by the application to release a hardware area when the computing thread has finished. The third octet is used to store the requested operator identifier. It is written by the application and read by the AC. Finally, the fourth octet is used to store the pid of the requesting application (not used at this time).
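A C sketch of this command word is shown below; the enum values, the helper names and the exact byte ordering within the 32-bit word are assumptions, not the platform's actual definitions:

#include <stdint.h>

/* Sketch of the 32-bit command word exchanged through the shared memory
 * between the applications and the allocation controller (AC), following
 * the octet layout described above.                                      */

enum ac_command {            /* first octet: command identifier               */
    AC_CMD_REQUEST = 1,      /* request an area loaded with a given operator  */
    AC_CMD_RELEASE = 2       /* release the area once the thread has finished */
};

static inline uint32_t ac_pack(uint8_t cmd, uint8_t area_id,
                               uint8_t operator_id, uint8_t pid)
{
    return (uint32_t)cmd
         | ((uint32_t)area_id     << 8)    /* written by the AC on allocation */
         | ((uint32_t)operator_id << 16)   /* written by the application      */
         | ((uint32_t)pid         << 24);  /* requesting process (unused yet) */
}

static inline uint8_t ac_area_id(uint32_t word)
{
    return (uint8_t)((word >> 8) & 0xff);
}

/* Typical exchange: the application writes a request word to the shared
 * memory while holding the AC semaphore, the AC loads the operator into a
 * free DPR area (reconfiguring it through the ICAP if necessary) and writes
 * back the identifier of the assigned area. Synchronization is omitted.   */
uint8_t request_operator(volatile uint32_t *shared_word, uint8_t operator_id)
{
    *shared_word = ac_pack(AC_CMD_REQUEST, 0, operator_id, 0);
    /* ... wait until the AC has answered ... */
    return ac_area_id(*shared_word);
}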
3.4
Access from Software Threads to the Accelerators
Once the allocation is performed, the communication between the computing threads and the DPR areas is done directly through the FSL with nputfsl and ngetfsl macros. The µCLinux FSL drivers are not used since they take a lot of CPU cycles. Due to the static nature of the nputfsl and ngetfsl macros regarding the FSL identifier to be used, one C computing function per hardware area was needed for the applications. The right function is chosen depending on the hardware resource allocated by the AC.
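The consequence of the compile-time FSL identifier can be illustrated as follows; this is a sketch in which the header name, the word-level protocol and the (absent) handling of the non-blocking macros' status are simplifying assumptions:

#include <mb_interface.h>   /* FSL macros of the MicroBlaze toolchain (assumed header) */

/* Because the FSL identifier used by nputfsl/ngetfsl must be a compile-time
 * constant, one transfer function is written per DPR area and the function
 * matching the area assigned by the AC is selected at run time.           */

static unsigned int compute_on_area0(unsigned int word)
{
    unsigned int result;
    nputfsl(word, 0);    /* send the operand to the operator in area 0 */
    ngetfsl(result, 0);  /* read back the result                       */
    return result;
}

static unsigned int compute_on_area1(unsigned int word)
{
    unsigned int result;
    nputfsl(word, 1);
    ngetfsl(result, 1);
    return result;
}

/* Dispatch on the DPR area identifier returned by the allocation controller. */
unsigned int compute(int area_id, unsigned int word)
{
    switch (area_id) {
    case 0: return compute_on_area0(word);
    case 1: return compute_on_area1(word);
    /* ... one case per reconfigurable area ... */
    default: return 0;
    }
}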
4
Test Applications
Two applications were implemented to illustrate the possibilities of the platform. The first one is a simple image contrast processing function and the second one is an optical character recognition (OCR) application customized to use both software recognition and hardware self-adaptable accelerators.
4.1
Histogram Equalization
The first test application is an image enhancement application based on histogram equalization, which is used to increase the global contrast of an image by allowing a better distribution of the pixel intensities on the histogram (see Fig. 5). The external data ports of the histogram equalization operator on the ML402 board are linked to a daughter board with a camera, which is part of the Xilinx Video Starter Kit.
Fig. 5. Pictures showing the original and the badly contrasted pictures at the input of the operator and the enhanced picture at the output: (a) Original picture, (b) Badly contrasted picture, (c) Picture after histogram equalization
The histogram equalization needs one hardware area on the platform. When started, the application requests a DPR area from the AC and then performs proper initialization of the loaded operator through the related FSL. When the user stops the application, it sends a command to the AC to release the hardware resource. This application also features a closed loop based on the observation of the mean of the pixel intensities of the input picture. As shown in Fig. 5, when the mean of pixel intensities is sufficiently high, the histogram equalization is not used (picture a) and when this mean falls below a user-defined threshold, histogram equalization is performed to enhance the input pictures (pictures b and c).
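In software terms, this closed loop amounts to the following sketch; on the platform the equalization itself runs as an operator in a DPR area, and the 8-bit grayscale format and the threshold handling are assumptions made for illustration:

#include <stdint.h>
#include <stddef.h>

#define LEVELS 256

/* Observe the mean intensity of a frame and apply a basic histogram
 * equalization only when it falls below a user-defined threshold.   */
void equalize_if_dark(uint8_t *pixels, size_t n, uint8_t threshold)
{
    if (n == 0)
        return;

    /* Observation: mean pixel intensity of the frame. */
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += pixels[i];
    if (sum / n >= threshold)
        return;                          /* contrast is sufficient */

    /* Actuation: histogram equalization of the frame. */
    size_t hist[LEVELS] = { 0 };
    for (size_t i = 0; i < n; i++)
        hist[pixels[i]]++;

    uint64_t cdf = 0;
    uint8_t map[LEVELS];
    for (int v = 0; v < LEVELS; v++) {
        cdf += hist[v];
        map[v] = (uint8_t)((cdf * (LEVELS - 1)) / n);
    }
    for (size_t i = 0; i < n; i++)
        pixels[i] = map[pixels[i]];
}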
4.2
Optical Character Recognition
Optical Character Recognition (OCR) is used in pervasive applications to provide the computing system with data to process in a user-friendly way (e.g. in future healthcare environments, for collecting important information on business cards, for online translation of foreign languages, etc.). OCR is popular since it enables a computing system to use the same information human beings process: printed letters and numbers (contrary to RFID tags and barcodes, which are not human readable). Another advantage is that using printed information to identify properties of objects is very cheap and uses already available information. The GNU Ocrad OCR application (version 0.17) [10], which was used as a basis for the application, is composed of steps such as pre-processing for enhancing the input picture (binary threshold, picture transformations such as crop, scale, etc.) and analyzing the page layout to find columns, detect text blocks and then, for each text block, detect text lines. For each text line, Ocrad detects the characters and finally recognizes them. The possibility of using hardware accelerators was added to Ocrad for research purposes. The hardware accelerator is composed of an operator that computes the cross-correlation between an input character and a set of masks corresponding to the different letters. If the maximum cross-correlation is above a certain threshold, the letter is recognized. Another modification of the original Ocrad application was the parallelization at the character level using POSIX threads (for each character in a line, the character recognition function is called by a pthread).
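The recognition flow, including the run-time construction of the mask set from the software recognizer described in Sect. 5.2, can be sketched as follows; plain C stands in for the hardware correlation, and the glyph size, the threshold, the simplified matching score and ocrad_recognize() are assumptions:

#include <string.h>

#define GLYPH_PIXELS (16 * 16)
#define MAX_MASKS    128
#define THRESHOLD    0.90

typedef struct {
    unsigned char pixels[GLYPH_PIXELS];   /* binarized glyph                */
    char          label;                  /* character this mask stands for */
} mask_t;

static mask_t masks[MAX_MASKS];
static int    n_masks;

extern char ocrad_recognize(const unsigned char *glyph);  /* software OCR (hypothetical) */

/* Match score between a glyph and a mask (fraction of identical pixels),
 * a simplified stand-in for the hardware cross-correlation.             */
static double match_score(const unsigned char *a, const unsigned char *b)
{
    int same = 0;
    for (int i = 0; i < GLYPH_PIXELS; i++)
        same += (a[i] == b[i]);
    return (double)same / GLYPH_PIXELS;
}

char recognize(const unsigned char *glyph)
{
    /* 1. Try the (modelled) accelerator first. */
    double best = 0.0;
    int best_i = -1;
    for (int i = 0; i < n_masks; i++) {
        double score = match_score(glyph, masks[i].pixels);
        if (score > best) { best = score; best_i = i; }
    }
    if (best_i >= 0 && best >= THRESHOLD)
        return masks[best_i].label;

    /* 2. Fall back to the software recognizer and learn the new glyph. */
    char c = ocrad_recognize(glyph);
    if (c != '\0' && n_masks < MAX_MASKS) {
        memcpy(masks[n_masks].pixels, glyph, GLYPH_PIXELS);
        masks[n_masks].label = c;
        n_masks++;
    }
    return c;
}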
5
Results and Analysis of the Platform
This section presents the measurements that were realized using the above-presented platform as well as an analysis of the memory requirements.
5.1
Size of the Operators and Reconfiguration Time
Table 1 gives the physical resources required for the two hardware operators used in the test applications. The percentages are given relative to the total amount of available resources in one of the DPR areas. Depending on the routing and the placement of the operator on each of the four DPR areas, the
Table 1. Hardware resources required for the two operators

Operator                  LUT          FF           SLICE M/L    DSP16      RAMB16
Histogram Equalization    1607 (50%)   1385 (43%)   491 (62%)    0          2 (10%)
OCR                       876 (27%)    959 (30%)    293 (37%)    6 (30%)    2 (10%)
size of the partial bitstreams may vary slightly. However, for the different partial bitstreams generated by PlanAhead, the size is around 170 kB. The reconfiguration time is considered as the amount of time between the moment the AC receives a request command from one application through the shared memory and the moment it writes the identifier of the allocated area into the shared memory. This time also includes retrieving the requested partial bitstream from the on-board memory and writing it to the allocated area through the ICAP. The mean reconfiguration time was measured to be 13 ms for a 170 kB partial bitstream, which means that the effective reconfiguration bandwidth is around 104 Mbit/s (transfer of the bitstream from the DDR memory to the configuration memory of the chip).
5.2
Speed-Up of OCR Thanks to Hardware Support
In this custom version of OCR, the set of masks is built during runtime: the hardware learns from the software how to recognize letters. The algorithm is the following: each computing thread always calls the hardware accelerator first and then the original software version if no letter is recognized by the hardware. At the beginning of the execution, the set of masks is empty. When the first input letter needs to be recognized, the accelerator is called and returns the fact that no letter was recognized, so the software version is called. It recognizes the letter and the system uses the input picture of the character as the corresponding mask for future executions (it is stored in the set). During execution, new letters are added to the set. If the cross-correlation is high enough, the character is recognized by the hardware and the software version is not called (see Fig. 6), thus speeding up character recognition (see Fig. 7). Fig. 7 shows the accelerator self-adaptation through the learning process. It was obtained by measuring the execution times of the OCR application on a text of five different lines for three configurations of the platform (pure software, one hardware accelerator, four hardware accelerators). One can notice that for the first line, the software version and the hardware version with only one accelerator have the same execution time. For the second line, the execution time of the hardware version with one hardware accelerator is more than two times
Fig. 6. Algorithm of the hardware accelerator based on cross correlation between the input character and a set of masks
Fig. 7. Execution times of the OCR application for different configurations of the platform. The lines come from a standard text in English and are different but comparable in size. The first line enables the hardware accelerators to learn how to recognize letters from the software application: at the beginning of the second line, most of the letters are in the set of masks. For the second line and the following ones, the execution time is decreased.
faster than the software version. In fact, during the execution, the number of masks used by the hardware accelerator increases thanks to the learning process. This implies that the probability of hardware recognition also increases, and so does the recognition speed. This learning property, provided by a closed loop between the software golden model and the hardware accelerators, can be applied to other languages (or to image recognition algorithms), since only pictures of letters are stored in the set of masks. By changing the golden model of computation (i.e. the software version of OCR), the computation can be changed and the hardware accelerators can evolve to a new configuration.
5.3
Using the AC for Self-healing
Another prototyped closed loop is related to self-healing. By providing the hardware areas with an observer which probes their state, the AC is also used as a self-healing enabler since it enables the platform to recover from an external corruption of one of the hardware areas. As a demonstration example, while the system is running, one of the used hardware areas can be reconfigured with a blank bitstream via the JTAG interface of the board (so that µCLinux is not aware of this modification). By reading the loaded operator identifier and by using internal timeouts, the AC can become aware of a problem and automatically reconfigure the problematic area with a freshly requested operator. This property is particularly interesting to prototype other self-healing mechanisms and to either simulate or physically implement some failures on the chip to assess their efficiency. For example, the AC can read back any loaded partial configuration so as to compare it with a reference using a hash function. If the checksum is not correct, the AC can refresh the configuration and test it again. After several different recovery attempts, it can invalidate the DPR area if all tests fail.
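The corresponding recovery loop can be sketched as follows; all of the called functions are hypothetical placeholders, since the real platform accesses the configuration through the µCLinux ICAP driver and compares against its own bitstream repository:

#include <stdint.h>

#define MAX_RETRIES 3

extern uint32_t icap_readback_hash(int area_id);                /* hypothetical */
extern uint32_t repository_hash(int operator_id);               /* hypothetical */
extern void     icap_reconfigure(int area_id, int operator_id); /* hypothetical */
extern void     invalidate_area(int area_id);                   /* hypothetical */

void check_and_heal(int area_id, int operator_id)
{
    for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
        /* Observation: hash of the loaded partial configuration compared
         * with the reference stored in the bitstream repository.         */
        if (icap_readback_hash(area_id) == repository_hash(operator_id))
            return;                            /* configuration is intact */

        /* Actuation: refresh the corrupted area and test it again. */
        icap_reconfigure(area_id, operator_id);
    }
    /* All recovery attempts failed: mark the DPR area as unusable so the
     * AC no longer assigns it to requesting applications.               */
    invalidate_area(area_id);
}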
5.4
Analysis of the Platform
The experiments showed that the main issues to solve concern memory requirements. Due to the use of semaphores for synchronization, the memory footprint of all applications noticeably increased. Future work will consist in finding other ways to implement synchronization mechanisms to improve the efficiency of the platform. The other memory issue is directly linked to the static nature of both the place and route process of FPGAs and the software interface to the operators. For each hardware operator, four partial bitstreams were generated and need to be stored in the memory. As already mentioned, from the software point of view, four software functions needed to be written per application since the nputfsl and ngetfsl macros only take a static FSL identifier as a parameter. This issue is directly linked with FPGA tools and the way they manage the place and route process. Different challenges regarding this problem are tackled by other work such as the Erlangen Slot Machine [11] or through online routing [12]. But this management is out of the scope of our current research on closed loop prototyping since these are more seen as self-adaptation enablers for reconfigurable hardware.
6
Conclusions
In this paper, we presented a self-reconfigurable platform based on an FPGA that aims at prototyping solutions to the issues raised by future pervasive applications (e.g. providing systems with self-adaptation as studied in the ÆTHER FP6 project [13]). It is used to study how to include mechanisms that simplify the use of hardware possibilities from traditional applications. This is done through an allocation mechanism based on a shared memory which enables the running applications to request hardware accelerators from an allocation controller. The
platform also features a low-cost self-healing mechanism thanks to the AC. This platform is used to prototype self-adaptation concepts studied in the ÆTHER project such as hardware adaptation using reconfigurable architectures, monitoring and control loops, task delegation mechanisms, application deployment and information exchange between hardware entities thanks to publish / discovery mechanisms. It was one of the main blocks of the final ÆTHER demonstration where a number of such boards exchanged information and computing tasks through Ethernet and WiFi, enabling this system to optimize its behavior.
References 1. Krikke, J.: T-engine: Japan’s ubiquitous computing architecture is ready for prime time. IEEE Pervasive Computing 4(2), 4–9 (2005) 2. Satyanarayanan, M.: Pervasive computing: vision and challenges. IEEE Personal Communications 8(4), 10–17 (2001) 3. Danek, M., Philippe, J.-M., Bartosinski, R., Honzk, P., Gamrat, C.: Self-Adaptive Networked Entities for Building Pervasive Computing Architectures. In: Hornby, G.S., Sekanina, L., Haddow, P.C. (eds.) ICES 2008. LNCS, vol. 5216, pp. 94–105. Springer, Heidelberg (2008) 4. Mukherjee, R., Mondal, S., Ogrenci Memik, S.K.: Thermal Sensor Allocation and Placement for Reconfigurable Systems. In: Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, ICCAD (2006) 5. Sylvester, D., Blaauw, D., Karl, E.M.: ElastIC: An Adaptive Self-Healing Architecture for Unpredictable Silicon. In: IEEE Design & Test of Computers, November 2006, pp. 484–490 (2006) 6. Lagger, A., Upegui, A., Sanchez, E., Gonzalez, I.: Self-Reconfigurable Pervasive Platform for Cryptographic Application. In: Proceedings of the International Conference on Field Programmable Logic and Applications, FPL 2006 (2006) 7. Soto Vargas, J., Moreno, J.M., Madrenas, J., Cabestany, J.: Implementation of a Dynamic Fault-Tolerance Scaling Technique on a Self-Adaptive Hardware Architecture. In: Proceedings of the International Conference on Reconfigurable Computing and FPGAs, pp. 445–450 (2009) 8. Jesshope, C.R., Philippe, J.-M., van Tol, M.: An Architecture and Protocol for the Management of Resources in Ubiquitous and Heterogeneous Systems Based on the SVP Model of Concurrency. In: Berekovi´c, M., Dimopoulos, N., Wong, S. (eds.) SAMOS 2008. LNCS, vol. 5114, pp. 218–228. Springer, Heidelberg (2008) 9. Williams, J.: Embedded Linux as a platform for dynamically self-reconfiguring systems-on-chip. In: The International Conference on Engineering of Reconfigurable Systems and Algorithm (2005) 10. Diaz Diaz, A.: Ocrad - The GNU OCR, http://www.gnu.org/software/ocrad/ 11. Majer, M., Teich, J., Ahmadinia, A., Bobda, C.: The Erlangen Slot Machine: A Dynamically Reconfigurable FPGA-based Computer. Journal of VLSI Signal Processing Systems 47(1), 15–31 (2007) 12. Paulsson, K., Hbner, M., Becker, J., Philippe, J.-M., Gamrat, C.: On-line Routing of Reconfigurable Functions for Future Self-Adaptive Systems - Investigations within the AETHER Project. In: Proceedings of the International Conference on Field Programmable Logic and Applications (FPL 2008), pp. 415–422 (2008) 13. The AETHER project web page. The AETHER consortium (2006), http://www.aether-ist.org
The X2 Modular Evolutionary Robotics Platform
Kyrre Glette and Mats Hovin
University of Oslo, Department of Informatics, P.O. Box 1080 Blindern, 0316 Oslo, Norway
{kyrrehg,matsh}@ifi.uio.no
Abstract. We present a configurable modular robotic system which is suitable for prototyping of various robotic concepts and a corresponding simulator which enables evolution of both morphology and control systems. The modular design has an emphasis on industrial configurations requiring solidity and precision, rather than rapid (self-)reconfiguration and a multitude of building blocks. As an initial validation, a three-axis industrial manipulator design has been constructed. Evolutionary experiments have been conducted using the simulator, resulting in plausible locomotion behavior for two experimental configurations.
1
Introduction
The construction of a robotic system by assembly of several instances of a general base module (and some auxiliary modules) may have multiple advantages over custom-built systems. By having general components, one may construct a variety of robotic configurations with different functionality without having to redesign the whole system each time. This saves design effort, as well as enabling robot builders without in-depth knowledge of electronics and mechanical design to assemble robotic systems. Such a system is ideal in the case of short-term student projects where the focus is on robot behavior rather than the underlying hardware details. In addition, the reuse of parts from design to design is inherent in the modular robot principle, allowing for potential cost savings. Another cost-saving factor could come from the production of several identical parts, allowing for savings both in purchasing and production processes. However, few demonstrations of real cost savings have so far been shown [1], although some approaches, such as the Molecubes project, attempt to address this issue [2]. Several modular robotic systems offer fast reconfiguration possibilities, from simple manual connection mechanisms [3,2] to more advanced active connection mechanisms allowing self-reconfiguration [4,5,6]. On the other hand, modular robots are also emerging in industry, such as the UR-6-85-5-A industrial robot from Universal Robots [7]. Here, a more fixed structure is employed, while several advantages, such as flexibility, low cost, and ease of development, are retained. Modular robotics can have several advantages when combined with evolutionary methods [8,9]. Firstly, the general robot modules can be suitable as
building blocks for an evolutionary design process, giving the algorithm flexibility in terms of constructing various morphologies. Secondly, constructing a robot from modular building blocks, especially when these are fine-grained, may pose a complex design challenge. In such cases an evolutionary search may be able to find solutions which would be hard to find by traditional manual design methods. Furthermore, a multi-objective evolutionary approach could be helpful in finding solutions which take into account not only performance but also factors such as cost and robustness. An efficient simulation of the robotic system is of high importance for the flexibility and efficiency of an evolutionary search; however, in many cases the gap between the simulator and the real world, often referred to as the reality gap [10], may reduce the validity of the evolved simulated solutions. In this paper we present the early results of a newly developed modular robotic system, coupled with a simulator and an evolutionary framework. The robot design consists of one actuated core module and some static auxiliary modules. For the time being we have not focused on self-reconfiguration or active connection mechanisms between parts, but rather envision the system as a tool for prototyping robotic concepts. This gives a longer lifetime for a given configuration, allowing for solid, fixed connection mechanisms. Moreover, we would like to have a design which is relatively easy to build; thus, in contrast to the above-mentioned systems, we avoid having a custom-designed printed circuit board. Instead we employ cheap and commonly available off-the-shelf components as much as possible, with the exception of the 3D-printed housing. The modules have been designed with an introductory robotics course in mind, emphasizing a revolute joint which allows simple kinematic calculations. However, it is envisioned that the modular system will also allow for more advanced experiments, such as robotic locomotion, with greater ease of use and production than in earlier approaches by the authors [11]. It is also envisioned that the rigid body structure of the proposed robot will be combined with soft polymer parts, with such aspects taken into account in the simulation by utilizing advanced features of the physics engine. The utilization of such features has been explored by Rieffel et al. in [12], and has also been partially implemented in the simulation and explored by the authors in [13]. The paper is structured as follows. Section 2 describes the robotic system, while Section 3 describes the setup for some preliminary experiments. The experimental results are given in Section 4 and discussed in Section 5. Finally, Section 6 concludes the paper.
2
The X2 Robotic System
The X2 robotic system is modular and consists of an actuated core module and some static auxiliary modules. Each core module is independently controlled and has several connection possibilities, which allows for flexible configuration of the robot structure.
Fig. 1. Exploded view of the X2 core module, showing the inner rotational core in yellow and the outer shell in white. The motor can be seen on the underside and the microcontroller board and rotational encoder on top.
2.1
Core Module
The core module is the main building block of the system, and consists of two main parts: an inner core and an outer shell, which together function as a revolute joint. In addition, the module contains a motor, a rotational sensor, and control electronics. See Fig. 1. The inner core of the core module can have a rotational movement which is actuated by the side-mounted motor through a belt. One reason for choosing an inner rotational core is the high stability offered by such a design, with high sideways support from the shell. In addition, the motor axle is offloaded and does not have to directly drive a plastic part. From our experience, driving plastic parts with high local forces may result in tearing of the plastic material. The belt also offers a gear ratio and is in a sense an alternative to a prohibitively expensive harmonic drive gearing mechanism. Further robotic modules can be attached to either the inner core or the shell in order to obtain revolute joint functionality. For motor control an Arduino ATmega328 microcontroller board is fitted, handling sensor input, motor actuation, and communication with a host. For feedback to the control loop an optical incremental rotation encoder is fitted, belt-driven by the inner core. The planetary-geared 12 V DC servo motor is controlled via a motor driver board. Communication with the host is performed over a USB cable, using a USB-to-TTL converter chip. Each module has its own USB and power cable sockets. The core module structural parts are mainly produced by rapid prototyping technology, in this case as direct outputs from a 3D printer. Some auxiliary parts are molded in silicone from 3D-printed molds, such as a flexible pen holder tip and motor sleeves. The rest of the components, like ball bearings, belts, and the electronics, are commercial off-the-shelf parts.
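The control firmware is not described in detail in the paper (and Sec. 4.3 notes that it is not yet fully implemented on the microcontroller). Purely to illustrate how the listed components fit together, a minimal Arduino-style sketch of a proportional position loop might look as follows; all pin numbers, the gain, and the host protocol are hypothetical placeholders, not the authors' design.

```cpp
// Hypothetical sketch: proportional position control of one X2 joint.
// Pin numbers, gain and host protocol are illustrative only.
const int ENC_A = 2;        // encoder channel A (interrupt-capable pin on an ATmega328 board)
const int ENC_B = 4;        // encoder channel B
const int PWM_PIN = 9;      // PWM output to the motor driver board
const int DIR_PIN = 8;      // direction output to the motor driver board
const float KP = 0.8f;      // proportional gain (placeholder)

volatile long encoderTicks = 0;
long targetTicks = 0;

void onEncoderEdge() {      // simple quadrature decoding on one edge of channel A
  if (digitalRead(ENC_B)) encoderTicks++; else encoderTicks--;
}

void setup() {
  pinMode(ENC_A, INPUT); pinMode(ENC_B, INPUT);
  pinMode(PWM_PIN, OUTPUT); pinMode(DIR_PIN, OUTPUT);
  attachInterrupt(digitalPinToInterrupt(ENC_A), onEncoderEdge, RISING);
  Serial.begin(115200);     // host link via the USB-to-TTL converter
}

void loop() {
  if (Serial.available() > 0) {
    targetTicks = Serial.parseInt();   // new target position sent as ASCII by the host
  }
  long error = targetTicks - encoderTicks;
  int command = constrain((int)(KP * error), -255, 255);
  digitalWrite(DIR_PIN, command >= 0 ? HIGH : LOW);
  analogWrite(PWM_PIN, abs(command));
  delay(10);                           // roughly a 100 Hz control loop
}
```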
2.2
Configuration
Because of the modular design a large number of structural configurations are possible, limited mainly by the number of parts available. In addition, since each module is individually powered and controlled, a high number of modules could result in an impractical number of cables. Modules are connected with a number of screws for solidity, while still making reconfiguration possible with a little effort.
Fig. 2. The X2 configured as a standard three-axis manipulator arm. Left: equipped with tool. Upper middle: closeup of the microcontroller board and the rotation encoder. Lower middle: closeup of cable connectors and DC motor. Right: experimental configuration, see Sec. 3.2.
The initial configuration is the standard three-axis manipulator arm shown on the left in Fig. 2, inspired by industrial manipulators. This configuration is being used for educational purposes in a robotics course and is the only configuration actually built so far. The setup includes a base for fixing the robot to the ground, an extension arm module, and a flexible silicone tip for holding a pen, or a hard tip for holding other tools such as a milling tool. One possible configuration could be a four-legged setup as shown in Fig. 3. This robot is planned to be built for experiments on walking and climbing behavior, but at the time of writing it exists only as a simulation model. The configuration allows for up to three degrees of freedom for each of the four legs, and in this case requires twelve motorized core modules, plus some helper modules.
2.3
Simulation
A robot simulator has been developed, based on the PhysX [14] physics library and using OpenGL for visualization (see Fig. 3 for a screenshot). The PhysX library is primarily developed for real-time applications such as computer games, and some features (cloth, soft bodies, fluids) can be hardware-accelerated by a graphics processing unit (GPU) through the CUDA framework. At the moment the X2 robot does not utilize any of the above-mentioned features, but it is
Fig. 3. The X2 configured as a four-legged robot. Image from simulator.
planned to include support for soft bodies in order to simulate soft (silicone-molded) parts of the robot. The cloth feature is supported for the simulation of artificial muscles, as described in [13]. The modules are simulated as dynamic rigid bodies, and are constrained by revolute joints which are limited to rotation along one single axis. Simple primitives, such as boxes, capsules, and spheres, are combined in order to simulate the shapes of the modules; however, a polygonal mesh can be loaded and visualized on top of these primitives in order to improve the visual presentation.
3
Robot Control and Experimental Configuration
This section describes the experimental control system and evolvable robot configurations.
3.1
Controller Model
For the following robot configurations, a relatively simple trigonometry-based function has been chosen for controlling the joint movements. An illustration of one period of the controller curve for one joint can be seen in Fig. 4. The attack parameter determines the time between t0 and t1, pause0 the time between t1 and t2, decay the time between t2 and t3, and pause1 the time between t3 and t4.
Fig. 4. Example period of the controller output
The controller then repeats the same curve in a cyclic fashion, with a given frequency. All joints share the same frequency, but have different curve parameters, as well as individual phase shifts, φ. The frequency of the joint controllers is kept low in the simulator in order not to exceed the angular speed of the real robot joints, as well as to encourage static locomotion. We believe the chance of successfully transferring the control to the real robot is higher when avoiding dynamic locomotion, such as jumping behavior.
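As an illustration, the curve of Fig. 4 could be computed as in the following C++ sketch. The piecewise structure (rise, hold, fall, hold, repeated cyclically with a shared frequency and an individual phase shift) follows the text; the use of half-cosine ramps for the rising and falling segments is an assumption, since the paper only states that the function is trigonometry-based, and the parameter ranges are left unspecified.

```cpp
#include <cmath>

struct JointParams {
  double attack, pause0, decay, pause1; // segment durations (fractions of one period), assumed > 0
  double phase;                         // individual phase shift, in [0, 1)
  double minAngle, maxAngle;            // joint angle limits
};

// Control value for one joint at time t (seconds), given the shared frequency (Hz).
// Rising and falling segments use half-cosine ramps (an assumption); the curve
// then repeats cyclically, as described in the text.
double controlValue(const JointParams& p, double t, double frequency) {
  const double PI = 3.141592653589793;
  double total = p.attack + p.pause0 + p.decay + p.pause1;
  // Position inside the current period, shifted by the joint's phase.
  double u = std::fmod(t * frequency + p.phase, 1.0) * total;
  double s; // normalized curve value in [0, 1]
  if (u < p.attack)                              // t0 -> t1: rise
    s = 0.5 * (1.0 - std::cos(PI * u / p.attack));
  else if (u < p.attack + p.pause0)              // t1 -> t2: hold high
    s = 1.0;
  else if (u < p.attack + p.pause0 + p.decay)    // t2 -> t3: fall
    s = 0.5 * (1.0 + std::cos(PI * (u - p.attack - p.pause0) / p.decay));
  else                                           // t3 -> t4: hold low
    s = 0.0;
  return p.minAngle + s * (p.maxAngle - p.minAngle);
}
```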
3.2
Experimental Configuration 1
Initially, we would like to investigate the possibilities of evolving locomotion behavior in the simulation phase, with transfer to a real robotic setup and a second-phase evolutionary tuning in mind. Therefore, for the first experiment a fixed and simple morphology has been chosen, as it is closest to the existing hardware setup and thus offers the first possibility for real-life validation. The configuration is based on the educational configuration described in Sec. 2.2. However, the robot is fixed to a small platform and the tip is equipped with a high-friction object to make it possible for the robot to pull itself, including the platform, forward (see Fig. 2 for the real configuration and Fig. 5 for the simulator configuration). In addition, only two of the joints are enabled, which makes it impossible for the robot arm to turn around the vertical axis. Evolution is in this case performed on a basic controller curve for each joint, and on the initial angle and amplitude of the movement. This can be described in the genome encoding with 5 controller curve parameters and the minimum and maximum angle of the movement, as follows (number of bits in parentheses): attack (8) pause0 (8) decay (8) pause1 (8) φ (8) min.angle (8) max.angle (4). This is then multiplied by the number of joints and decoded to floating-point values from the binary encoding. Appropriate ranges for each of the parameters have been chosen in order to avoid unstable configurations. The total length of the genome is 107 bits: unused (3) joint0 (52) joint1 (52)
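A sketch of how this 107-bit genome could be decoded is given below. The field order and widths follow the listing above, while the helper names and the linear mapping of each bit field onto its parameter range are assumptions (the paper only states that appropriate ranges were chosen to avoid unstable configurations).

```cpp
#include <array>
#include <bitset>

// Hypothetical decoder for the 107-bit genome of configuration 1:
// unused(3) | joint0(52) | joint1(52), each joint being
// attack(8) pause0(8) decay(8) pause1(8) phi(8) min.angle(8) max.angle(4).
struct JointGenes {
  double attack, pause0, decay, pause1, phi, minAngle, maxAngle;
};

// Reads 'width' bits starting at 'pos' and maps them linearly onto [lo, hi].
double decodeField(const std::bitset<107>& g, int& pos, int width, double lo, double hi) {
  unsigned long value = 0;
  for (int i = 0; i < width; ++i) value = (value << 1) | (g[pos++] ? 1ul : 0ul);
  double maxValue = static_cast<double>((1ul << width) - 1);
  return lo + (value / maxValue) * (hi - lo);
}

std::array<JointGenes, 2> decodeGenome(const std::bitset<107>& g) {
  std::array<JointGenes, 2> joints;
  int pos = 3;                       // skip the 3 unused bits
  for (auto& j : joints) {
    // Parameter ranges below are placeholders, not the ranges used by the authors.
    j.attack   = decodeField(g, pos, 8, 0.05, 1.0);
    j.pause0   = decodeField(g, pos, 8, 0.0, 1.0);
    j.decay    = decodeField(g, pos, 8, 0.05, 1.0);
    j.pause1   = decodeField(g, pos, 8, 0.0, 1.0);
    j.phi      = decodeField(g, pos, 8, 0.0, 1.0);
    j.minAngle = decodeField(g, pos, 8, -1.5, 1.5);
    j.maxAngle = decodeField(g, pos, 4, -1.5, 1.5);
  }
  return joints;
}
```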
Fig. 5. Simulator screenshot of configuration 1, shown with a polygonal mesh representation (a) and the underlying simulation primitives (b).
3.3
Experimental Configuration 2
As a second experiment, we would like to investigate the capabilities of the simulator by introducing a more complex morphology and by letting evolution control morphology parameters in addition to the controller. This is a more interesting scenario in terms of being able to evolve morphology in the first phase of the evolution, as well as for investigating the scalability of the evolutionary algorithm. The configuration is based on the four-legged configuration described in Sec. 2.2; however, the length of the tip as well as the length of the arm between the second and third joints are adjustable. The tip and arm lengths are equal for all legs for stability reasons, but to challenge the evolutionary search no symmetry is taken into account for the joint controllers, giving a total of 8 individual controller parameter sets. The joint coding follows the same style as in the first configuration; however, the bit precision is changed in some places: attack (6) pause0 (6) decay (6) pause1 (6) φ (8) min.angle (6) max.angle (7). The total genome can then be described as follows, counting 379 bits: unused (3) arm length (8) tip length (8) joint0 (45) ... joint7 (45)
3.4
Evolution
For both of the robotic configurations, the fitness function is the average speed at which the robotic phenotypes are able to move along one axis. The phenotypes are evaluated during 3000 simulation steps for the first configuration and 4000 for the second, where one simulation step corresponds to 1/60 s of simulated time. Moving backwards or falling over gives a zero fitness score. For the evolutionary runs, the GAlib library [15] has been employed, running the "simple" genetic algorithm (GA) as described in [16]. The evolutionary runs lasted 250 generations, with a population size of 50 and a two-point crossover probability of 0.4 for all experiments. The bit-flip mutation probability has been set to the reciprocal of the number of bits in the genome.
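Assuming the standard GAlib API, the setup described above could look roughly as follows. The evaluation function shown is only a stand-in for the simulator-based speed measurement; how the genome is wired into the PhysX simulation is not specified in the paper, and the crossover-operator call reflects the GAlib documentation rather than the authors' actual code.

```cpp
#include <ga/ga.h>   // GAlib headers (the library used in the paper)

const int GENOME_BITS = 107;   // configuration 1: unused(3) + 2 x joint(52)

// Stand-in for the simulator-based evaluation of Sec. 3.4: in the real setup
// this would decode the genome, run 3000 simulation steps of 1/60 s each, and
// return the average forward speed (zero for falling over or moving backwards).
float evaluateInSimulator(const GA1DBinaryStringGenome& genome) {
  (void)genome;
  return 0.0f;   // placeholder value
}

float objective(GAGenome& g) {
  GA1DBinaryStringGenome& genome = (GA1DBinaryStringGenome&)g;
  return evaluateInSimulator(genome);
}

int main() {
  GA1DBinaryStringGenome genome(GENOME_BITS, objective);
  genome.crossover(GA1DBinaryStringGenome::TwoPointCrossover); // two-point crossover
  GASimpleGA ga(genome);              // the "simple" GA of Goldberg [16]
  ga.populationSize(50);
  ga.nGenerations(250);
  ga.pCrossover(0.4);
  ga.pMutation(1.0f / GENOME_BITS);   // bit-flip probability = 1 / genome length
  ga.evolve();
  ga.statistics().bestIndividual();   // best evolved parameter set
  return 0;
}
```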
4
Results
This section describes the experimental results and the current status of the hardware development.
4.1
Evolution Runs
Simulated evolution runs were carried out using the settings described in Sec. 3.4. The fitness curves are plotted in Fig. 6. The entire evolutionary process took 3 hours 56 minutes for the first configuration and 22 hours 11 minutes for the second configuration. For the second configuration this corresponds to an average evaluation time of 6.4 seconds per individual (22 h 11 min divided by the 250 × 50 = 12,500 evaluations), as opposed to the 66.7 seconds of simulated time (4000 steps at 1/60 s). Note that this number is somewhat lower than the full simulation time because some evaluations could be cut off at an early stage due to falling or similar behavior.
Fig. 6. Best and average population fitness plots for the first (a) and second (b) configurations. Note the different scales on the vertical axes.
4.2
Locomotion Results
The best individuals from each of the simulated evolution runs have been evaluated qualitatively, as well as by measuring the motor controllers' outputs and the positions of selected parts during the locomotion process. For the first configuration, a successful forward pulling motion was achieved, and a plot of the best controller can be seen in Fig. 7. In the phase where the actual forward movement took place, the tip was pushed to the ground in such a way that most of the platform was elevated from the ground while moving. One can also observe from the figure that a slight backwards movement took place at the end of each advancement, which was due to a slight backwards push in the process of lifting the tip.
Fig. 7. Controller plot of evolved solution for the first configuration. Upper joint refers to the angular control values (from the initial starting angle) for the joint nearest the tip. High values signify the limbs moving towards the ground. x and y position values have been scaled with different constants for readability.
Fig. 8. Second configuration leg tip heights over time, scaled with different factors and translated for readability. In the actual gait some legs were lifted higher than others.
For the second configuration the best individual managed to move forward in a cyclical manner; however, the movement seemed somewhat unstable, and in some cases a slight change in the initial conditions could cause the robot to fall over. A plot of leg positions can be seen in Fig. 8. The evolved morphology parameters, tip length and arm length, are summarized in Table 1.

Table 1. Evolved arm and tip lengths

description   range          best ind.
arm length    [12.0, 18.0]   15.5
tip length    [8.5, 17.0]    15.6
4.3
Hardware System Status
Currently, 3 core modules have been manufactured, as can be seen in Fig. 2, and simple tests have been carried out to verify the functionality of the actuation. Furthermore, the first experimental configuration has been assembled with the necessary auxiliary modules. However, the motor control system and communications have not yet been fully implemented on the microcontroller, and as such the evolved behavior from the simulator cannot yet be tested on the hardware system.
5
Discussion
By observing the elite fitness curves in Fig. 6, it seems that advances are more frequent in the evolution of the first configuration, which is to be expected since the fitness landscape is likely to be significantly less difficult than for the second configuration. This is also strengthened by the observation that the average fitness improves more over time for the first configuration. It is however
a bit unexpected to see that the final evolved solution seems to be suboptimal in the sense that there is a slight backwards movement in each cycle. This may be caused by the evolutionary search getting stuck in a local optimum; however, inaccurate friction simulation may also play a role. The best solution obtained for the second configuration seems slightly awkward and suboptimal by visual inspection; however, a relatively fast movement is obtained. The evolved controllers were not entirely symmetrical with respect to the left and right legs, but still, by looking at the peaks in Fig. 8, one can discern similarities to a crawler gait as described, for instance, in [17]. It is therefore interesting to observe that, even when symmetry is purposefully not built into the controllers, a variation of the static crawler gait is obtained. When observing the evolved tip and arm lengths of the second configuration, one can see that while they are high, the maximum allowed values are not chosen. The reason for this could be that while longer limbs offer potentially faster locomotion, very long limbs make it hard to find a control system which can keep balance. In order to evolve more stable solutions, an individual could be tested under varying conditions, such as walking on slopes and traversing obstacles.

While the proposed control system is simplistic, it does not seem to be a major limitation for the current experiments. However, for further studies in evolving locomotion it would be interesting to look into the use of central pattern generators such as in [9], as well as a tighter coupling between the morphology and the control system, and the addition of sensor inputs. We have so far not been able to test the evolved solutions on the real robotic system; however, it is expected that the reality gap will be present to at least some extent. Even when directing the search towards static locomotion there may be issues such as the distribution of mass, friction, and more, which could perturb the transferred results significantly. Further research should seek to investigate the relation between the simulated models and the real-world performance. Furthermore, we would like to look into more aspects of evolving morphology, both in terms of growing bodies (for instance from L-systems) and in terms of introducing soft parts in both the simulation and the real robotic system. The introduction of soft parts seems particularly interesting since the PhysX engine allows for acceleration of these features through GPUs.

The proposed robotic system has the advantage of a solid design suitable for industrial-like applications, coupled with being easy to build, given that one has access to a 3D printer. This is of particular interest with regard to student projects, where modules can be assembled quickly with very little electronics work. While the cost issue is addressed through the use of simple off-the-shelf electronic and mechanical parts, a current challenge is the amount of plastic material used for the shell. Material for 3D printers is at the moment very expensive, and the size of the X2 parts prohibits mass production through 3D printing. Although this may change in the future, when 3D printers are more commonplace, at the moment one solution could be to modify the shapes so that it would be possible to mold or mill them. However, this is a complicated process, and reducing the size of the core module therefore seems a more viable option.
6
Conclusion
We have developed a modular robotic system and a corresponding simulation environment with the possibility for artificial evolution of morphology and control. The design focuses more on solidity for industrial-like applications than on rapid (self-)reconfiguration. Evolutionary experiments have been conducted with the simulator and static locomotion behavior has been achieved; however, some practical work remains before the evolved solutions can be tested on the real robotic system. While the current design addresses production cost in several ways, it is still necessary to reduce the material cost associated with 3D printing, and we will therefore attempt to design a smaller core module. Future work also includes the evolution of more advanced control and morphology, including soft parts.
References
1. Yim, M., Shen, W., Salemi, B., Rus, D., Moll, M., Lipson, H., Klavins, E., Chirikjian, G.: Modular self-reconfigurable robot systems [grand challenges of robotics]. IEEE Robotics & Automation Magazine 14(1), 43–52 (2007)
2. Zykov, V., Chan, A., Lipson, H.: Molecubes: An open-source modular robotics kit. In: Proc. IROS (2007)
3. Moeckel, R., Jaquier, C., Drapel, K., Upegui, A., Ijspeert, A.: YaMoR and Bluemove – an autonomous modular robot with Bluetooth interface for exploring adaptive locomotion. In: Proceedings CLAWAR 2005, pp. 685–692 (2005)
4. Duff, D., Yim, M., Roufas, K.: Evolution of PolyBot: A modular reconfigurable robot. In: Proc. of the Harmonic Drive Intl. Symposium, Nagano, Japan (November 2001)
5. Kamimura, A., Kurokawa, H., Yoshida, E., Murata, S., Tomita, K., Kokaji, S.: Automatic locomotion design and experiments for a modular robotic system. IEEE/ASME Transactions on Mechatronics 10(3), 314–325 (2005)
6. Sproewitz, A., Billard, A., Dillenbourg, P., Ijspeert, A.: Roombots – Mechanical Design of Self-Reconfiguring Modular Robots for Adaptive Furniture. In: Proceedings of the 2009 IEEE International Conference on Robotics and Automation, pp. 2735–2740. Institute of Electrical and Electronics Engineers Inc. (2009)
7. Universal Robots: UR-6-85-5-A product sheet, http://www.universal-robots.com/Produkter/Produktblad.aspx
8. Hornby, G., Lipson, H., Pollack, J.: Generative representations for the automated design of modular physical robots. IEEE Transactions on Robotics and Automation 19(4), 703–719 (2003)
9. Marbach, D., Ijspeert, A.: Online optimization of modular robot locomotion. In: 2005 IEEE International Conference on Mechatronics and Automation, vol. 1 (2005)
10. Jakobi, N., Husbands, P., Harvey, I.: Noise and the reality gap: The use of simulation in evolutionary robotics. In: Morán, F., Merelo, J.J., Moreno, A., Chacon, P. (eds.) ECAL 1995. LNCS, vol. 929, pp. 704–720. Springer, Heidelberg (1995)
11. Garder, L.M., Hovin, M.E.: Robot gaits evolved by combining genetic algorithms and binary hill climbing. In: GECCO 2006: Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation, pp. 1165–1170. ACM, New York (2006)
12. Rieffel, J., Saunders, F., Nadimpalli, S., Zhou, H., Hassoun, S., Rife, J., Trimmer, B.: Evolving soft robotic locomotion in PhysX. In: GECCO 2009: Proceedings of the 11th Annual Conference Companion on Genetic and Evolutionary Computation, pp. 2499–2504. ACM, New York (2009)
13. Glette, K., Hovin, M.: Evolution of Artificial Muscle-Based Robotic Locomotion in PhysX. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS (to appear, 2010)
14. NVIDIA: PhysX SDK, http://developer.nvidia.com/object/physx.html
15. Wall, M.: GAlib: A C++ library of genetic algorithm components, http://lancet.mit.edu/ga/
16. Goldberg, D.: Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, Reading (1989)
17. Hornby, G., Fujita, M., Takamura, S., Yamamoto, T., Hanagata, O.: Autonomous evolution of gaits with the Sony quadruped robot. In: Proceedings of the Genetic and Evolutionary Computation Conference, vol. 2, pp. 1297–1304 (1999)
Ubichip, Ubidule, and MarXbot: A Hardware Platform for the Simulation of Complex Systems
Andres Upegui1, Yann Thoma1, Héctor F. Satizábal1, Francesco Mondada2, Philippe Rétornaz2, Yoan Graf1, Andres Perez-Uribe1, and Eduardo Sanchez1
1 REDS, HEIG-VD, HES-SO, Yverdon, Switzerland
[email protected]
2 MOBOTS, EPFL, Lausanne, Switzerland
[email protected]
Abstract. This paper presents the final hardware platform developed in the Perplexus project. This platform is composed of a reconfigurable device called the ubichip, which is embedded on a pervasive platform called the ubidule, and can also be integrated on the marXbot robotic platform. The whole platform is intended to provide a hardware platform for the simulation of complex systems, and some examples of them are presented at the end of the paper. Keywords: Reconfigurable computing, bio-inspired systems, collective robotics, pervasive systems, complex systems.
1
Introduction
The simulation of complex systems has gained an increasing importance during the last years. These simulations are generally bounded by the initial constraints artificially imposed by the programmer (e.g., the modeller). These artificial constraints are aimed at mimicking the physical constraints that real complex systems are exposed to. Our approach relies on the principle that, for modelling real complex systems like biological or social systems, models must not be artificially constrained but must be physically constrained by the environment. Biological systems, for instance, evolve in dynamic physical environments which are constantly changing because of their intrinsic properties and their interaction with the system. For instance, the number of parts and their interconnections in a real complex system is neither random nor regular, but follows a set of implicit building rules imposed by physical constraints and the environment in which they evolve. These constraints are a key element for the emergence of behaviours that are unpredictable by analytical methods. Such emergence has a direct impact on the self-organising properties of complex systems and vice versa, given that there is no clear causal relation between these two properties.
Within the framework of the Perplexus project1, our main goal has been to develop a scalable pervasive platform made of custom reconfigurable devices endowed with bio-inspired capabilities. This platform will enable the simulation of large-scale complex systems and the study of emergent complex behaviours in a virtually unbounded wireless network of computing modules. This hardware platform was aimed at modelling complex systems in a more realistic manner thanks to two main aspects: (1) a rich interaction with the environment thanks to sensory elements, and (2) the replacement of artificial constraints, imposed by the programmer, by physical constraints, imposed by the hardware platform and its interaction with the environment. The network of modules is composed of a set of ubiquitous computing modules called ubidules, each of which contains two ubidule bio-inspired chips (ubichips) capable of implementing bio-inspired mechanisms such as growth, learning, and evolution. The ubichip thus constitutes the core of the Perplexus modelling platform and can be integrated on the ubidule or on the marXbot robotic platform, which has also been developed in the framework of this project. This paper presents the complete Perplexus modelling hardware platform. Sections 2, 3, and 4 describe the main hardware components of the project, respectively the ubichip, the ubidule, and the marXbot robot. Then, Section 5 gives an overview of several applications where the platform has been used for modelling different types of complex systems with applications to engineering. Finally, Section 6 summarises the opportunities offered by the platform.
2
Ubichip
The ubichip is the core device of the whole hardware platform, which provides the reconfigurability support for implementing dynamic hardware architectures. Real complex systems are dynamic: their internal components and interactions are constantly changing according to their interaction with the world and to their intrinsic dynamics. This dynamic aspect is precisely the main feature of the ubichip. The ubichip is a reconfigurable digital circuit that allows the implementation of complex systems with dynamic topologies. Fine-grained dynamic partial reconfiguration permits the system to be easily modified from an external processor, but built-in self-reconfiguration mechanisms also permit it to be modified internally in a completely autonomous and distributed way. Moreover, dynamic routing also allows internal connections in the circuit to be created and destroyed. Previous work in this field is the POEtic tissue [11], a reconfigurable hardware platform for rapidly prototyping bio-inspired systems, which was developed in the framework of the European project POEtic. The limitations exhibited by the POEtic tissue suggested several architectural and configurability improvements that led us to the ubichip architecture, better suited for supporting the complex systems that we want to model with our devices.
1 PERvasive computing framework for modeling comPLEX virtually-Unbounded Systems, FP6 European project (http://www.perplexus.org)
Fig. 1. Composition of a Macrocell
The reconfigurable array of the ubichip consists of a two-dimensional regular array of reconfigurable cells called macrocells. A macrocell is composed of a self-replication (SR) unit, a dynamic routing (DR) unit, and four ubicells, the latter being the basic computing unit of the ubichip. Figure 1 depicts a top-level view of a macrocell, which is composed of three layers: a ubicell array layer, a dynamic routing layer, and a self-reconfiguration layer. General-purpose inputs/outputs (GPIOs) of the reconfigurable array are implemented in the form of dynamic routing units that, instead of being connected to ubicells, are directly connected to input and output pins of the circuit. These GPIOs thus allow the array to be extended to form a multi-chip array, connected through dynamically created paths. The next subsections briefly describe the functionality of each layer.
2.1
Ubicell Layer
A ubicell is the basic computing unit of the ubichip; it contains four 4-input LUTs and four flip-flops. The ubicell has two basic operating modes: native mode and SIMD mode. In native mode, a ubicell can be configured in different basic "classical" modes such as counter, FSM, shift register, 4 independent registered and combinatorial LUTs, adder, subtracter, etc. There is a particular configuration mode of the ubicell, very useful in the modelling of complex systems: the 64-bit LFSR mode. In this mode one can use the 64 configuration flip-flops that normally store the four 4-input LUT configurations as a reasonable-quality pseudo-random number generator. This novel feature, very useful for complex systems modelling, allows processes such as probabilistic functions and pseudo-random event triggers to be included at a very low cost in terms of reconfigurable resources. This configuration mode has been used in different models implemented on the architecture, such as ontogenetic neural networks [14] and evolutionary games [7].
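The paper does not give the feedback polynomial used by this LFSR mode, but the principle can be illustrated in software. The sketch below uses a 64-bit Fibonacci LFSR with taps 64, 63, 61, 60 (one commonly cited maximal-length choice); both the tap positions and the probabilistic-trigger helper are assumptions, not details of the chip.

```cpp
#include <cstdint>

// Software illustration of the ubicell's 64-bit LFSR configuration mode: the
// 64 configuration flip-flops of a ubicell are chained into a linear feedback
// shift register and used as a pseudo-random bit source.
struct Lfsr64 {
  uint64_t state;
  explicit Lfsr64(uint64_t seed) : state(seed ? seed : 1) {}  // must not be all zeros

  // Shift by one position and return the new output bit.
  // Taps 64, 63, 61, 60 are an assumed maximal-length polynomial.
  unsigned step() {
    unsigned bit = static_cast<unsigned>(
        ((state >> 63) ^ (state >> 62) ^ (state >> 60) ^ (state >> 59)) & 1u);
    state = (state << 1) | bit;
    return bit;
  }

  // Example use: a probabilistic event trigger, as in the ontogenetic neural
  // network and evolutionary game models mentioned in the text.
  bool eventWithProbability(double p) {
    uint64_t r = 0;
    for (int i = 0; i < 16; ++i) r = (r << 1) | step();   // draw 16 random bits
    return r < static_cast<uint64_t>(p * 65536.0);
  }
};
```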
In SIMD mode (single instruction multiple data), the ubicell layer can be configured as an array of processing elements in order to perform vectorised parallel computation. In this mode, each ubicell can be configured as a minimal 4-bit processor, and 4 ubicells can be put together to form a 16-bit processor. A centralised sequencer can read a program and send instructions to be executed in parallel by the processing units. This configuration mode has been used for modelling complex systems such as neural networks [2] and culture dissemination models.
2.2
Self-reconfiguration Layer
Our self-reconfiguration layer allows a given macrocell to access the configuration bit-string of a neighbour macrocell. In this way, a macrocell can read the configuration of its neighbour, modify it, and reinsert it in order to modify the neighbour's functionality. Another possibility is to recover the configuration bit-string of its neighbour and send it to another (remote) macrocell that will use it to configure its own neighbour. This is what we call replication. Let us now consider the case of two neighbouring initial macrocells A0 and B0. Let A0 read the configuration of B0, copying it to create B1, and then let B0 do the same, reading A0 and copying it to A1. If we consider the tuple [AB] as a cell, we have an initial cell [A0B0] that has created an exact copy of itself, [A1B1]; that is self-replication. Replicating a single macrocell is not very practical, since the functionality implemented on a single macrocell may be very limited. To overcome this limitation, we propose the THESEUS mechanism (standing for THeseus-inspired Embedded SElf-replication Using Self-reconfiguration) [13]. THESEUS includes a set of building flags that allow a larger functional block, composed of several macrocells, to be replicated. The building flags describe how to build the complete block. In this manner, a macrocell can access its neighbour's configuration and, through it, also a chain of other macrocells' configurations described by a predefined building path by means of the building flags. This reconfigurability feature is one of the first steps towards the modelling of more realistic complex systems. Following our approach, the implemented complex system can grow and prune itself through self-replication and self-destruction. In parallel, the system's building blocks can be dynamically connected and disconnected, driven by processes executed internally in each block, all this thanks to the dynamic routing mechanism that will be described in the next subsection.
2.3
Dynamic Routing Layer
As explained before, real complex systems are constantly modifying their topology. The brain, ecological systems, and social networks are just some examples where neurons, species, or people are constantly modifying their interaction channels. In terms of complex systems theory, this can be represented as graph links being created and destroyed. Dynamic routing offers the possibility of implementing such dynamic-topology systems on a hardware substrate, in order to model such changing interactions in a more direct way. The basic idea of the algorithm is to construct paths between sources and targets by dynamically configuring multiplexers, and to let the data follow the same path for each pair of source and target. A path-creation phase executes a distributed breadth-first search algorithm, looking for the shortest
path. Sources and targets can decide to connect to their corresponding unit at any time by launching a routing process. Considering the high silicon overhead due to routing matrices on reconfigurable circuits, which is especially high for dynamic routing, we adopted a solution requiring a small silicon overhead while being flexible enough to deal with the changing topology of our complex networks. Our dynamic routing algorithm is an improvement of the one implemented in the POEtic chip [10]. The risk of congestion has been reduced by means of three features: (1) the new algorithm exploits existing paths by reusing them, (2) an 8-neighbourhood (instead of the 4-neighbourhood of POEtic) allows a dramatic reduction of the congestion risk relative to the amount of logic required, and (3) paths can be destroyed in order to remove unused connections and reuse the resources later. Finally, while in POEtic the circuit execution was frozen during a routing process, in the ubichip the creation of a new path lets the system run without interruption.
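A software analogue of the path-creation phase, under the stated assumptions of an 8-neighbourhood grid and a breadth-first search for the shortest path, could look like the following sketch. The on-chip algorithm is distributed and implemented in hardware; this sequential version only illustrates the principle, with occupied cells standing in for routing resources already in use.

```cpp
#include <array>
#include <queue>
#include <vector>

// Sequential illustration of the breadth-first path-creation phase on a grid
// of routing units with 8-neighbourhood.
struct Cell { int x, y; };

std::vector<Cell> createPath(const std::vector<std::vector<bool>>& occupied,
                             Cell source, Cell target) {
  int h = static_cast<int>(occupied.size());
  int w = static_cast<int>(occupied[0].size());
  std::vector<std::vector<Cell>> parent(h, std::vector<Cell>(w, Cell{-1, -1}));
  std::vector<std::vector<bool>> visited(h, std::vector<bool>(w, false));
  std::queue<Cell> frontier;
  frontier.push(source);
  visited[source.y][source.x] = true;

  const std::array<int, 8> dx = {-1, 0, 1, -1, 1, -1, 0, 1};
  const std::array<int, 8> dy = {-1, -1, -1, 0, 0, 1, 1, 1};

  while (!frontier.empty()) {
    Cell c = frontier.front(); frontier.pop();
    if (c.x == target.x && c.y == target.y) {        // target reached: rebuild the path
      std::vector<Cell> path;
      for (Cell p = target; p.x != -1; p = parent[p.y][p.x]) path.push_back(p);
      return path;                                   // in target -> source order
    }
    for (int k = 0; k < 8; ++k) {                    // expand the 8 neighbours
      int nx = c.x + dx[k], ny = c.y + dy[k];
      if (nx < 0 || ny < 0 || nx >= w || ny >= h) continue;
      if (visited[ny][nx] || occupied[ny][nx]) continue;
      visited[ny][nx] = true;
      parent[ny][nx] = c;
      frontier.push(Cell{nx, ny});
    }
  }
  return {};                                         // congestion: no path found
}
```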
3
Ubidule
The ubidule platform is composed of two electronic boards, mainly featuring two ubichips, an ARM processor running Linux, a 3.4 Mgate FPGA, and support for several peripherals. One of the major features of the ubidule platform is its modularity and flexibility. It is easily customizable for each of the target applications, making it a complete and efficient modelling platform. For the sake of modularity the ubidule has been decomposed into two boards: a mother board containing the CPU, the FPGA, and peripheral support, and a daughter board containing two ubichips, which can also be integrated on the marXbot robot.
3.1
Ubidule’s Mother-Board
Fig. 2. Schematic of the Ubidule platform
Figure 2 depicts the schematic of the ubidule platform. A mini-PCI socket supports a ubichip daughter board including two ubichips and the resources required for running in native and SIMD modes. Even if the current
Fig. 3. Ubidule platform (top view)
Fig. 4. Ubichip daughter board
board contains two ubichips, it can be scaled up to contain up to four ubichips without modifying the current addressing scheme. A second mini-PCI socket supports a CPU board containing an ARM processor, which can be either an XScale PXA270 or PXA320, and enough memory resources for running a GNU/Linux operating system. This CPU board constitutes the first step toward the desired flexibility and modularity of our ubidules, providing the advantages of a powerful processor, a well-supported operating system, gcc tools, and a number of software resources such as application programs, services, and drivers. Figure 3 shows the ubidule board.
3.2
Ubichip Daughter Board
The ubichip daughter board mainly contains two ubichips with their respective external SRAM memories. Figure 4 shows a top view of the board. When using the ubichips in native mode, both ubichips can be used as an extended configurable array. GPIOs can be configured in both chips in order to connect dynamic routing units from one ubichip to the other. The ubichip daughter board can also be directly inserted in the marXbot robot, and both ubichips can be configured from the robot's i.MX microcontroller through the serial configuration interface. Nevertheless, to limit the ubichips' power consumption from the marXbot power supply, the ubichip daughter board provides the possibility of powering only a single ubichip.
4
The MarXbot Robotic Platform
To extend the exploration of complex systems to real-world applications, we decided to embed the ubichip in a robotic platform. We therefore designed the marXbot mobile robot, taking care of several specific aspects: a large number of robots (more than 20), ease of experimentation, the ability to embed the ubidule
Fig. 5. A complete marXbot robot in production (Photo Basilio Noris)
Fig. 6. A group of marXbots during experimentation
as one module, and the possibility to run long experiments. Because this design effort was not feasible within the Perplexus project alone, we designed the marXbot robot in synergy with the Swarmanoid project2. This section presents the particular features of the marXbot (Figures 5 and 6).
4.1
Modularity
The marXbot robot is a flexible mobile robotic platform. It is made of stacked modules of a diameter of about 17 cm. The modularity of the marXbot robot is based on a common CAN bus and a LiIon battery-based power supply, both shared by all modules. In the examples presented in this paper, three main modules have been used:
– The base module includes the wheels, the tracks (together called Treels), proximity sensors, an RFID reader/writer, accelerometers and gyros, and the battery connection. The Treels provide mobility to the marXbot. They consist of two 2 W motors, each associated with a rubber track and a wheel. Motors are driven by dedicated electronic boards situated on each side of the battery (one for each motor). The maximum speed of the marXbot is 30 cm/s. The base of the marXbot includes infrared sensors to act as virtual bumpers and ground detectors. Those sensors have a range of some centimeters and are distributed around the robot: 24 are directed outwards and 8 are directed to the ground. In addition, 4 contact ground sensors are placed under the lowest part of the robot. The base of the marXbot also embeds an RFID reader and writer with an antenna situated on the bottom of the robot, close to the ground.
– The scanner module allows a distance map of the obstacles surrounding the robot to be built [3]. Our design is based on 4 infrared Sharp distance sensors mounted on a rotating platform. These sensors have a limited range and a dead zone close to the device, so we couple two sensors of different
2 http://www.swarmanoid.org
ranges (40–300 mm and 200–1500 mm) to cover distances up to 1500 mm. The platform rotates continuously to make 360° scans. To maximize the lifetime of the scanner, the fixed part transfers energy by induction to the rotating part. They exchange data using infrared light.
– The top module includes the cameras, an RGB LED beacon, the i.MX31 processor and its peripherals such as the WiFi board and SD card reader. Two cameras can be mounted: a front camera and an omnidirectional camera on top. Both are equipped with imagers of 3 Mpixels. The RGB LED beacon allows a high-intensity (1 W) color light to be displayed. Combined with the cameras, this constitutes a localized and easily interpretable communication system. The i.MX31 processor runs Linux and accesses standard peripherals such as WiFi, USB, or flash storage.
4.2
Ubichip Compatibility
A ubidule extension module has been designed to ensure the embodiment of the ubidule in the marXbot. It ensures the following functionalities:
– Mechanical adaptation between the four screws of the marXbot extension system and the four fixation screws of the ubidule.
– A step-up power supply generating 7.5 V, 12 W for the ubidule.
– A microcontroller ensuring the transparent translation of messages between USB (ubidule) and CAN (marXbot).
The final solution has therefore been to develop a ubichip extension module, without a screen or user interface, to be placed within the marXbot robot as a sandwich module. From a control point of view, all microcontrollers within the marXbot robot can be controlled using the ASEBA framework [4]. This framework transmits event messages over the CAN bus to exchange commands, read sensors, etc. We have implemented an ASEBA node in the ubidule, making it compatible with the software architecture of the marXbot. This allows full control of the marXbot from the ubidule.
4.3
Battery Management
The exploration of complex systems requires the use of groups of robots during long periods, for instance under the control of genetic algorithms. Because of the battery-based power supply of the robots, long experiments are problematic. Therefore the marXbot is powered by a 3.7 V, 10 Ah lithium-polymer battery which is hot-swappable. The hot-swapping capability is provided by a supercapacitor which maintains the power supply of the robot for 10 s during battery exchange. A battery exchanger (Figure 7) has been designed to automatically extract the battery from a running marXbot and replace it with a charged one in less than 10 seconds.
Fig. 7. The battery station able to exchange the battery of the marXbot during operation in 10 seconds
5
Complex Systems Simulations
In this section we briefly describe two examples of complex systems models that exploit different aspects of the Perplexus hardware platform. Subsection 5.1 describes an ontogenetic neural network that exploits the ubichip's self-reconfiguration and dynamic routing mechanisms, and Subsection 5.2 describes a collective foraging task set up on the marXbots.
5.1
Neurogenetic and Synaptogenic Networks on the Ubichip
Given its dynamic routing mechanisms, the ubichip is a promising digital hardware platform for implementing connective systems with dynamic topologies, more precisely, in our case, developmental artificial neural networks. The current implementation of the model considers the initial existence of a set of unconnected 4-input neurons, where dendrites (inputs) and axons (outputs) are connected to dynamic routing units which are previously configured to act as targets and sources, respectively. The connectivity pattern is then generated during the neural network's lifetime. We use a simplified neuron model whose implementation on the ubichip requires only six macrocells. Each dendrite includes the logic required for creating and destroying a synapse in a probabilistic way, and is implemented in a single macrocell. Two more macrocells are used for implementing the soma (cell body of the neuron), the axon (the computation of the activation function and the neuron output), and the management of the dynamic routing address modification. Figure 8 illustrates the complete ontogenetic process with a series of screenshots obtained from the ubimanager tool. Initially, a single neuron is configured on the ubichip (top left screen-shot). A replication process can be triggered on this neuron, which can be copied somewhere else in the circuit. A first step is to select where it will be copied and to create a dynamic routing unit at that location
Fig. 8. Sequence of screen-shots of the dynamic routing layer during the development of a neurogenetic and synaptogenic network with 16 4-inputs neurons
(top centre). Then the configuration is sent serially from the initial location to the destination, in order to use this information for creating an exact copy of the initial neuron after a certain number of clock cycles (top right). Now we have two neurons, both of which can again replicate simultaneously, so new target locations are selected (bottom left), yielding two newly created neurons (bottom centre). At the end, we obtain a circuit fully populated with neurons (bottom right), which in parallel have also performed a probabilistic synaptogenic process that permitted their dendrites and axons to be interconnected. We have experimented with two types of developmental processes: random and ambient-driven networks. During the development of random networks, neurogenetic and synaptogenic processes are triggered randomly. The development of ambient-driven networks considers the existence of a set of input stimuli that increase the probability of the most active neurons becoming connected. Moreover, existing unused synapses can be removed. In the end, our resulting networks exhibit similarities with topologies observed in biological neural networks [14].
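As a point of reference, the ambient-driven rule described above can be caricatured in a few lines of sequential code. The probability expressions, the activity measure, and the pruning threshold are all assumptions: on the chip these processes are realised by dedicated macrocell logic and the LFSR-based random source, not by software.

```cpp
#include <random>
#include <vector>

// Toy sequential model of the ambient-driven synaptogenic process: more active
// neurons are more likely to receive new synapses, and unused synapses are
// removed. All rates and thresholds are illustrative assumptions.
struct Neuron {
  double activity = 0.0;                 // driven by the input stimuli
  std::vector<int> inputs;               // indices of presynaptic neurons
};

void developmentStep(std::vector<Neuron>& net, double createRate,
                     double pruneRate, std::mt19937& rng) {
  std::uniform_real_distribution<double> uni(0.0, 1.0);
  std::uniform_int_distribution<int> pick(0, static_cast<int>(net.size()) - 1);
  for (std::size_t i = 0; i < net.size(); ++i) {
    // Synaptogenesis: connection probability grows with the neuron's activity.
    if (uni(rng) < createRate * net[i].activity) {
      int pre = pick(rng);
      if (pre != static_cast<int>(i)) net[i].inputs.push_back(pre);
    }
    // Pruning: synapses from inactive presynaptic neurons may be removed.
    for (std::size_t s = 0; s < net[i].inputs.size(); ) {
      bool unused = net[net[i].inputs[s]].activity < 0.05;   // arbitrary threshold
      if (unused && uni(rng) < pruneRate)
        net[i].inputs.erase(net[i].inputs.begin() + s);
      else
        ++s;
    }
  }
}
```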
Collective Robotics for Target Localization
We have used the marXbot platform for testing a novel approach for the localization of targets in a population of foragers. The control of the population of robots is performed in a distributed way. In this implementation, our robots have two possible states which are “work” and “search”. In the “work” state, robots perform a certain foraging task and are distributed on the arena. In the case of the work presented in this chapter, we have a dummy foraging task
296
consisting of navigating the arena while avoiding obstacles. The main interest is in the “search” state, in which a robot tries to reach a specific target region in the arena. This target region could be a battery charging station, an area for garbage disposal, or the exit of a maze. Whatever the robot may be searching for, the goal is to exploit collective knowledge, given that there are other robots that can estimate how far they are from the target region and will somehow help the searching robot to achieve its goal. The proposed target localization avoids the use of global positioning systems, which might be difficult to deploy in unknown or hostile environments, and also avoids the use of odometry, which is sensitive to accumulated errors over long running periods. Our approach uses colour LEDs and omnidirectional cameras in order to indicate to other robots the shortest path to a desired target, based on the principle of disseminating the information gathered by the robots through the population. The proposed coordination scheme is completely distributed and uses state communication [1] in an intrinsic way, i.e. robots transmit some information about their internal state, but they are not aware of whether other robots receive this information or not. This fact simplifies the communication and endows the system with an intrinsic robustness. The application runs both on real marXbot robots and in the Enki simulator [5]. Figure 9 shows a simulation window and Figure 10 shows a set of robots on an arena running the foraging task. The rich sensory elements present in the marXbot robots make it an excellent modelling platform for the simulation of complex systems that interact with the environment.
Fig. 9. Simulated arena in which the foraging-task experiments are run
Fig. 10. Set of marXbots running a foraging task and searching for specific zones of the arena
We performed a series of five experiments with increasing information about the target position [8]. Each of the five experiments adopts a different strategy for finding the target, and the whole set of robots was constantly searching for targets. The strategies were: (1) random search, (2) the use of static landmarks, (3) the use of static landmarks and the robots as landmarks, (4) the use of only the robots as landmarks, and (5) the use of a gradient of colours, by landmark
propagation, mimicking a social localization. This incremental comparison has shown that the “social” approach, whereby the navigation of the population is guided by a gradient of colours, improved the performance in finding a target. We have also compared our social localization approach with a global positioning system (similar to a GPS system) in which a robot knows its position and the position of the target area. The social approach performs slightly worse without the presence of obstacles. However, when obstacles are placed between the robot and the target area, the social approach largely outperforms the global-knowledge system, since the colour gradient formed by the colony of robots does not indicate the direction of the target but, more usefully, points out the path to follow in order to find the target.
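The colour-gradient strategy (5) can be summarised algorithmically as a hop-count gradient relayed through the population. The following sketch captures that idea; the update rule, the names, and the choice of a hop-count metric are assumptions made for illustration, not the actual marXbot controller code described in [8].

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

// Toy model of the social localization gradient: each robot displays (as a
// colour) its current estimate of the distance to the target, and searching
// robots move towards the neighbour showing the smallest estimate.
struct RobotState {
  bool seesTarget = false;                     // target area directly visible
  int distanceEstimate = std::numeric_limits<int>::max();
};

// Update one robot from the estimates of the robots (and static landmarks)
// currently visible in its omnidirectional camera image.
void updateEstimate(RobotState& self, const std::vector<int>& visibleEstimates) {
  if (self.seesTarget) { self.distanceEstimate = 0; return; }
  int best = std::numeric_limits<int>::max();
  for (int e : visibleEstimates)
    if (e != std::numeric_limits<int>::max()) best = std::min(best, e + 1);
  self.distanceEstimate = best;                // relayed gradient (hop count)
}

// A searching robot heads for the visible neighbour with the lowest displayed
// value; a robot in the "work" state keeps foraging but still displays its own
// estimate, which is what makes the scheme "social".
// Assumes at least one estimate is visible.
std::size_t chooseNeighbourToFollow(const std::vector<int>& visibleEstimates) {
  return static_cast<std::size_t>(
      std::distance(visibleEstimates.begin(),
                    std::min_element(visibleEstimates.begin(),
                                     visibleEstimates.end())));
}
```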
6
Conclusions
In this paper we presented the complete hardware platform resulting from the Perplexus project. The goal of the hardware is to serve as a modelling platform for more realistic complex systems, able to interact with the environment and able to mimic and/or deal with physical constraints. The complete platform is composed of a reconfigurable digital device (the ubichip), a ubiquitous computing module (the ubidule), and a robotic platform (the marXbot). In every case, we have also provided the possibility of simulating the systems in a transparent manner. The ubimanager tool allows a real chip to be configured or a circuit to be simulated from its VHDL description [12]. The ASEBA framework [4] allows programs to be written for the marXbot that can be run either on the real robot or on the Enki simulator [5]. We have also shown two examples where the platform has been successfully used for the simulation of complex systems. The first one, an ontogenetic neural network, uses the ubichip as a substrate for implementing dynamic topology mechanisms such as neurogenesis and synaptogenesis. The second one, a social localization task, uses the marXbot in order to implement a foraging task where robots must eventually find specific areas. There are other examples of complex systems implemented on the Perplexus hardware platform: incremental learning on the ubichip for a marXbot controller [9], artificial neural networks in SIMD mode [2], optimizer swarms of self-replicating particles, protein-based computation [6], and social networks for activity recognition based on wearable systems. All these applications use one or several parts of the Perplexus hardware platform for their implementation. The platform has proven to be an interesting alternative for complex systems modelling. The ubichip's architectural features have provided the required flexibility for modelling the complex processes involved in the formation and evolution of dynamic networks. The sensing and actuating capabilities of the marXbot robot have provided an enhanced interaction with the environment and with other robots. And the ubidule has constituted the base platform for hosting the ubichip, allowing it to interact with the world in a flexible manner.
Acknowledgment The authors would like to thank all the members of the Perplexus project for their valuable work, and their colleagues at the REDS and the MOBOTS groups for their support. This work is funded by the FET programme IST-STREP of the European Community, under grant IST-034632 (PERPLEXUS).
References
1. Balch, T., Arkin, R.C.: Communication in reactive multiagent robotic systems. Auton. Robots 1(1), 27–52 (1994)
2. Hauptvogel, M., Madrenas, J., Moreno, J.M.: SpiNDeK: An integrated design tool for the multiprocessor emulation of complex bioinspired spiking neuronal networks. In: Haddow, et al. (eds.) Proceedings of the IEEE Congress on Evolutionary Computation - CEC 2009, pp. 142–149 (2009)
3. Magnenat, S., Longchamp, V., Bonani, M., Rétornaz, P., Germano, P., Bleuler, H., Mondada, F.: Affordable SLAM through the Co-Design of Hardware and Methodology. In: Proceedings of the 2010 IEEE International Conference on Robotics and Automation. IEEE Press, Los Alamitos (2010)
4. Magnenat, S., Rétornaz, P., Bonani, M., Longchamp, V., Mondada, F.: ASEBA: A Modular Architecture for Event-Based Control of Complex Robots. IEEE/ASME Transactions on Mechatronics (2010)
5. Magnenat, S., Waibel, M., Beyeler, A.: Enki - an open source fast 2d robot simulator, http://home.gna.org/enki/
6. Parra, J., Upegui, A., Velasco, J.: Cytocomputation in a biologically inspired and dynamically reconfigurable hardware platform. In: Haddow, et al. (eds.) Proc. of IEEE Congress on Evolutionary Computation - CEC 2009, pp. 150–157 (2009)
7. Pena, J.C., Pena, J., Upegui, A.: Evolutionary graph models with dynamic topologies on the ubichip. In: Hornby, G.S., Sekanina, L., Haddow, P.C. (eds.) ICES 2008. LNCS, vol. 5216, pp. 59–70. Springer, Heidelberg (2008)
8. Satizábal, H.F., Upegui, A., Pérez-Uribe, A.: Social target localization in a population of foragers. In: NICSO, pp. 13–24 (2010)
9. Satizábal, H.F., Upegui, A.: Dynamic partial reconfiguration of the ubichip for implementing adaptive size incremental topologies. In: Haddow, et al. (eds.) Proceedings of the IEEE Congress on Evolutionary Computation - CEC 2009, pp. 131–141 (2009)
10. Thoma, Y., Sanchez, E., Arostegui, J.M.M., Tempesti, G.: A dynamic routing algorithm for a bio-inspired reconfigurable circuit. In: Cheung, P.Y.K., Constantinides, G.A. (eds.) FPL 2003. LNCS, vol. 2778, pp. 681–690. Springer, Heidelberg (2003)
11. Thoma, Y., Tempesti, G., Sanchez, E., Moreno, J.M.: POEtic: an electronic tissue for bio-inspired cellular applications. Biosystems 76(1-3), 191–200 (2004)
12. Thoma, Y., Upegui, A.: Ubimanager: a software tool for managing ubichips. In: NASA/ESA Conference on Adaptive Hardware and Systems, pp. 213–219 (2008)
13. Thoma, Y., Upegui, A., Perez-Uribe, A., Sanchez, E.: Self-replication mechanism by means of self-reconfiguration. In: Lukowicz, P., Thiele, L., Tröster, G. (eds.) ARCS 2007. LNCS, vol. 4415. Springer, Heidelberg (2007)
14. Upegui, A., Perez-Uribe, A., Thoma, Y., Sanchez, E.: Neural development on the ubichip by means of dynamic routing mechanisms. In: Hornby, G.S., Sekanina, L., Haddow, P.C. (eds.) ICES 2008. LNCS, vol. 5216, pp. 392–401. Springer, Heidelberg (2008)
Implementation of a Power-Aware Dynamic Fault Tolerant Mechanism on the Ubichip Platform
Kotaro Kobayashi1, Juan Manuel Moreno2, and Jordi Madrenas2
1 Delft University of Technology, Delft, The Netherlands [email protected]
2 Universitat Politècnica de Catalunya, Barcelona, Spain [email protected], [email protected]
Abstract. Dynamic fault-tolerant techniques such as Built-in Self Repair (BISR) are becoming increasingly important as new challenges emerge in the deep-submicron era. A dynamic fault-tolerant system was implemented on the Ubichip platform developed in the PERPLEXUS European project, a bio-inspired custom reconfigurable VLSI device. The system is power-aware: the power consumption is monitored dynamically to regulate the number of copies made by a self-replication mechanism. This paper reports the design, implementation, and simulation of the fault-tolerant system. Keywords: Dynamic Fault Tolerance, Self-replication, Reconfiguration, BISR, Bio-inspiration, Ubichip, PERPLEXUS, Power-awareness.
1
Introduction
IC technology scaling, which follows the famous Moore's law, has driven a great deal of advancement in modern electronics over the last few decades. Designers have been able to integrate a greater number of transistors on a limited area of silicon die; modern VLSI systems with multiple function blocks on a single die allow designers to reduce the physical size of the systems and the manufacturing costs. The ITRS predicts in [2] that the gate length of VLSI systems will go below 20 nm in the latter half of this decade, a length sufficient to fit only a few hundred silicon atoms in one line. This deep-submicron paradigm poses new challenges to VLSI design; the intricacy of fabrication will be greater, so manufacturing defects will likely increase, while testing for those defects will be very challenging due to the ever-increasing complexity of the systems. Reliability will also suffer due to phenomena such as gate insulator tunneling, Joule heating, and electromigration. Furthermore, the small feature size will certainly increase the rate of unpredictable errors due to alpha particles, namely soft errors, or Single Event Upsets (SEU) [1], [2].
There have been many advancements in techniques such as Design for Test (DFT) and Built-in Self-Test (BIST) [1]. While these tests can effectively detect faults due to manufacturing defects, they cannot detect unforeseeable faults caused by aging defects or temporal faults such as SEUs. In order to assure reliability while incorporating deep-submicron technologies, a system should have dynamic fault-tolerance capabilities to detect and correct errors at run-time. If a VLSI system can autonomously detect and correct an error situation dynamically, it will increase not only the reliability but also the yield and lifetime of the ICs, resulting in a significant cost reduction [5]. The Ubichip is a bio-inspired custom reconfigurable VLSI system developed in the PERPLEXUS project [6], [10]. The Ubichip offers bio-inspired capabilities such as dynamic routing and self-replication. The operational flexibility provided by these mechanisms makes the Ubichip an ideal platform to implement dynamic fault-tolerant systems with Built-in Self Repair (BISR) capabilities. This paper presents the design, development, and simulation of a power-aware fault-tolerant system implemented on the Ubichip. Section 2 discusses the background and the overall system architecture. Section 3 briefly introduces the Ubichip platform used in this experiment. Section 4 describes the implementation of the design in detail. Section 5 discusses the implementation and simulation results. Finally, future research areas as well as concluding remarks are included in Section 6.
2 A Power-Aware Fault Tolerant System
2.1 Background
In order to protect a system from logic errors during run-time, it can use Built-in Self Repair (BISR). Several different methods of implementing BISR are discussed in [5]. Triple Modular Redundancy (TMR) is a widely known BISR method. Although it is also known to be area-consuming, it is very simple to design and, unlike error correcting codes [3], no specialized static design tools are required; it is more versatile in accommodating different logic circuits. In dynamically reconfigurable systems, Functional Units (FUs) are configured at run-time as required. Unused FUs can simply be deleted to give more space to necessary functions. In such systems, the same TMR circuit can work for different FUs configured in the same area because of the simplicity of the algorithm.
2.2 Power Awareness
The power consumption must be considered when implementing TMR. Having three identical circuits would result in at least three times more power consumption in terms of switching current. Furthermore, power consumption is a major issue to be solved in VLSI today; larger circuits, higher operation frequencies, and smaller feature sizes all contribute to higher power consumption. TMR is intrinsically not a power-efficient design technique. In order to reduce the effect on power consumption, the authors have implemented a power-aware
TMR system based on a previous work presented in [11]; the system monitors its power consumption and eliminates one or both of the TMR copies when the power consumption is above a predefined threshold. While clock gating or power gating can also be used to control the power consumption of TMR designs in the same way, our framework on the Ubichip is capable of dynamic reconfiguration, so the same FU space can be used for different blocks according to the power consumption and operation phase.
2.3 System Description
Figure 1 shows the FSM states of the power-aware design presented in this paper. Initially the system starts with a single functional unit (FU). As our system platform (the Ubichip) does not have current sensing capabilities, the power consumption of the running application is measured by a 'transition counter'. This subsystem estimates the power consumption by means of a 'counter value' and controls the number of FU copies using the self-replication (SR) function of the Ubichip. The counter value is computed by accumulating output values over multiple clock cycles and counting the number of transitions. When the number of transitions from the original FU is highest, meaning in this case that more than 3 bits toggle over 2 consecutive clocks, the counter value is '00' and no copies of the FU are made. When the number of transitions is low, meaning the output of the original FU toggles by 0 or 1 bit over 2 consecutive clocks, the counter value becomes '10', which leads the system to create 2 copies of the FU. Counter value '01' corresponds to an intermediate transition count; when the output from the original FU shows a 2-bit transition for more than 2 clock periods, only one copy of the FU is created.
Fig. 1. FSM State Diagram of the implemented system
302
K. Kobayashi, J.M. Moreno, and J. Madrenas
The system starts in single-FU mode. After a few clock cycles the transition counter estimates the current consumption and indicates it as the 'counter value'. The system constantly monitors its power consumption and changes the number of FU copies accordingly.
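As a rough illustration of this policy, the Python sketch below maps the number of observed output transitions to the two-bit counter value and the resulting number of FU copies. The thresholds follow the description above, while the 4-bit output width, the function names and the handling of the unspecified 3-toggle case are our own assumptions, not details of the Ubichip design.

```python
def count_transitions(prev_output, curr_output):
    """Number of bits that toggled between two consecutive 4-bit FU outputs."""
    return bin(prev_output ^ curr_output).count("1")

def counter_value(transitions):
    """Two-bit activity estimate following the description in the text:
    '00' for high activity (more than 3 bits toggling), '01' for a 2-bit toggle,
    '10' for 0-1 toggles; exactly 3 toggles is not spelled out in the text,
    so it is grouped with the high-activity case here (an assumption)."""
    if transitions >= 3:
        return 0b00
    if transitions == 2:
        return 0b01
    return 0b10

def copies_for(counter):
    """Number of redundant FU copies maintained for a given counter value."""
    return {0b00: 0, 0b01: 1, 0b10: 2}[counter]

# Example: an output that toggles one bit per clock -> low activity -> two copies (full TMR)
print(copies_for(counter_value(count_transitions(0b1010, 0b1011))))  # prints 2
```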
3
A Reconfigurable Framework: PERPLEXUS
The system was designed within a framework developed in the PERPLEXUS European project. The Ubichip is the kernel of this project; a reconfigurable VLSI system endowed with bio-inspired capabilities. Details of the PERPLEXUS project can be found in [10], [6]. 3.1
Ubichip
Ubichips are mounted on a prototype system called the Ubidule, described in [10]. A Ubichip consists of three major blocks: an array of reconfigurable processing elements called Macrocells (MC), the System Manager, and a controller for a Content Addressable Memory (CAM). The system manager block is responsible for configuring the reconfigurable array and for external communication. Each MC is made up of four reconfigurable cells called Ubicells, which are described later in this section. The configuration bit stream for each MC can be recovered and configured dynamically using the Self-Replication (SR) function of the Ubichip. The SR function is used extensively in this project, so its details are briefly explained later in this section. Each MC also contains a Dynamic Routing (DR) control unit, which allows a pair of MCs to establish communication paths dynamically. The DR functionality of the Ubichip is further explained in [7]. Furthermore, a Ubichip can also be configured in multiprocessor mode, in which a SIMD-like parallel machine can be implemented. 3.2
The Ubicell
Figure 2 shows the overall organization of a Ubicell. As explained extensively in [4], a Ubicell can be configured to implement various logic functions in LUT mode or to work as part of a multi-processor machine in ALU mode. In this project all the cells use configurations within LUT mode. 3.3
Inter-cell Connection
Neighboring Ubicells can be connected by selecting the appropriate input/output multiplexers. Figure 3 shows the neighborhood connectivity among Ubicells. The output multiplexers are able to select not only the cell's own output but also raw inputs from neighboring cells. Furthermore, it is possible for any pair of macrocells (4 Ubicells each) to communicate using the Dynamic Routing (DR) capability.
Implementation of a Power-Aware Dynamic Fault Tolerant Mechanism
303
Fig. 2. Organization of a Ubicell (left), Ubicell array and Macrocell (right)
Fig. 3. Inter-Ubicell Connectivity
3.4
Self-Reconfiguration
A group of one or more macrocells (an organism) can be copied to other parts of the Ubicell array using the Self-Replication (SR) mechanism. An organism has the configuration bits of its MCs connected in a chain of shift registers. The configuration bits of the MCs can be recovered through this chain by an SR controller. The SR controller can use this recovered bit stream to configure an empty area during the self-replication process. Details of the SR controller on the Ubichip are explained in [9]. 3.5
Ubimanager
The authors used a software tool called Ubimanager, which was designed in the PERPLEXUS project to manage Ubichips. Ubimanager allows developers to design Ubichip implementations by means of a GUI environment;
developers can configure all three layers of the Ubichip: Ubicells, Dynamic Routing units (DR), and Self-Replication units (SR). It is also capable of simulating the implementation using ModelSim. A detailed description of the Ubimanager tool is provided in [8]. In the Ubimanager environment, the array of Ubicells is represented in a GUI window; a developer can configure each cell by double-clicking it to open the configuration window.
4
Implementation
Figure 4 shows a block diagram of the dynamic fault-tolerant system implemented on a Ubichip. There are three SR controllers; one is responsible for reading the configuration bit stream from the original FU, while the other two are responsible for replicating the copies. The control signals for the SR controllers are created in the 'Control FSM' block. The level of system power consumption is sent to the control FSM from the 'Transition Counter' block as the 'counter value'. According to the power consumption level, the FSM changes the operation mode and drives the SR controllers to maintain the appropriate number of FU copies. Every time new copies of the FU are made, the Control FSM block relies on the signal from the 'SR Timer' block to stop the SR process upon completion. The outputs from the FUs are compared in the 'Output Comparator' block.
Fig. 4. Power Aware Fault Tolerant System: Overall Block Diagram
The functionality of its main building blocks is the following. Functional Unit (FU): As the goal of this experiment is to demonstrate a working proof-of-concept system, the implementation of the Functional Unit (FU) was kept simple; a combination of memory, a counter and a pseudo-random number generator (LFSR) was configured in a single MC as shown in Figure 5.
Fig. 5. Functional Unit (FU)
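For reference, the pseudo-random part of such a functional unit can be modelled in a few lines of Python; the sketch below is a generic 4-bit Fibonacci LFSR, with the taps, width and seed chosen purely for illustration, not the actual LUT configuration used on the Ubichip.

```python
def lfsr4(state=0b1001):
    """Generic 4-bit Fibonacci LFSR (illustrative taps at bit positions 3 and 2).
    Yields the successive 4-bit states."""
    while True:
        feedback = ((state >> 3) ^ (state >> 2)) & 1
        state = ((state << 1) | feedback) & 0xF
        yield state

gen = lfsr4()
print([next(gen) for _ in range(6)])  # first six states: [3, 6, 13, 10, 5, 11]
```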
Output Comparator: In this implementation, the output comparator simply indicates the bit-wise XOR of the outputs from each FU. In a future implementation, the comparator result should be fed back to the controller to implement error correction. SR Controller: While it is possible for one SR controller to remove and make a copy of an organism (a set of MCs), the 'remote configuration' explained in [9] is necessary to control the two copies separately. In this case, a total of 3 SR units is required: one for recovering the configuration bit stream of the original organism, and one each for the configuration of the two copies. The SR mechanism of the Ubichip blocks the outputs from the MCs during the SR process, eliminating the need to filter erroneous outputs during the SR process. Furthermore, the values of the registers in the MCs are incorporated in the configuration bit stream; the state of the circuit is therefore preserved in the newly created copy of an organism. SR Timer: A 4-bit flag called the 'H-flag' contained in each MC defines the shape of an organism. The SR unit does not know the number of MCs included in a single organism, so it is not possible for the SR unit alone to determine the number of cycles required to complete an SR process. A counter is therefore necessary to stop the SR process at the appropriate time. FSM: While Ubimanager provides a GUI environment for design implementation, it cannot compile from high-level languages such as C or VHDL; the entire circuit must be implemented as a combination of the circuits available in the LUT mode of the Ubicells. The authors resorted to commercially available RTL synthesis tools to implement the FSM. First, the state chart was converted to HDL using Mentor Graphics HDL Designer. Next, Precision RTL, also by Mentor Graphics, was used to synthesize the HDL and produce the RTL schematics with look-up tables (LUTs). The contents of the LUTs as well as the connections among them were then configured manually in each Ubicell using Ubimanager. Routing, Floor planning: Figure 6 shows the implemented system. One can see the wiring for routing and the configured Ubicells in this figure. All the routing and floor planning are conducted manually; since no automatic tool is available, planning the location of each functional cell, the connections among cells, and the overall floor plan is a crucial part of design implementation on Ubichips, and should be conducted carefully.
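As noted above, the present comparator only exposes the bit-wise XOR of the FU outputs; a natural extension in the direction suggested for future error correction is a bit-wise majority vote that also identifies the disagreeing FU. The Python sketch below illustrates this idea under our own assumptions; it is not what is currently implemented on the Ubichip.

```python
def compare_outputs(a, b, c, width=4):
    """Bit-wise TMR check of three FU outputs.
    Returns (voted_value, suspect), where suspect is the index of the single FU
    that disagrees with the majority, or None if all three outputs agree."""
    voted = 0
    for bit in range(width):
        ones = ((a >> bit) & 1) + ((b >> bit) & 1) + ((c >> bit) & 1)
        voted |= (1 if ones >= 2 else 0) << bit
    suspects = [i for i, v in enumerate((a, b, c)) if v != voted]
    suspect = suspects[0] if len(suspects) == 1 else None
    return voted, suspect

print(compare_outputs(0b1010, 0b1010, 0b1110))  # -> (10, 2): value 0b1010, FU 2 flagged
```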
Fig. 6. System Implementation on Ubichip
5
Implementation Results
The implementation was tested using the ModelSim tool integrated in the Ubimanager environment. Figures 7 and 8 show screenshots of the system under simulation. One can see how different counter values result in a different number of copies. Each system block was confirmed to be working according to the design intention. After the system was verified by simulation, it was physically implemented in the Ubichip available on the Ubidule board. 5.1
Cell Count, Area Overhead
Table 1 shows the number of cells used for each system block. The fault-tolerant system occupies a total of 64 Ubicells. The area overhead in this application is significant because a very primitive 4-cell single MC was used as the FU. When a larger FU is implemented, the area of the rest of the system remains unchanged. As the Ubicell array in the Ubichip is 10 by 10 MCs (20x20 = 400 Ubicells), the area overhead of this fault-tolerant system is 16% of the total array area. 5.2
Timing Observation
Configuring cells through a serial register means that the time required for configuration increases with the number of MCs to be replicated. Table 2 shows the clock cycles required for the SR unit to complete for different numbers of MCs. The worst-case estimate of the operating frequency of the Ubichip is 50 MHz. Since the operation of the FU must pause during the replication process, the replication time, especially for larger FUs, may become a serious issue for timing-critical applications.
Fig. 7. Simulation View. 2FU mode
Fig. 8. Simulation View. 1FU mode

Table 1. Cell Count of the Design
Block              | # of cells
Control FSM        | 22
SR Controller (x3) | 12
Output Comparator  | 11
SR Timer           | 5
Transition Counter | 14
Functional Unit    | 4
Table 2. Clock cycles and time required for Self-Replication (at 50 MHz)
Number of MCs | Clock Cycles | Time
1             | 547          | 10.9 µs
2             | 1,072        | 21.4 µs
10            | 5,272        | 105 µs
300           | 157,522      | 3.15 ms
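The figures in Table 2 grow linearly with the number of MCs; they are consistent with roughly 547 cycles for the first MC plus about 525 cycles per additional MC, which is an observation inferred from the table rather than a formula given by the authors. The short Python sketch below turns such a cycle count into a replication time at the worst-case 50 MHz clock.

```python
def replication_cycles(num_mcs):
    """Linear fit to Table 2 (our own inference from the four data points):
    ~547 cycles for the first MC plus ~525 cycles per additional MC."""
    return 547 + 525 * (num_mcs - 1)

def replication_time_us(num_mcs, clock_hz=50e6):
    """Self-replication time in microseconds at the worst-case clock estimate."""
    return replication_cycles(num_mcs) / clock_hz * 1e6

for n in (1, 2, 10, 300):
    print(n, replication_cycles(n), round(replication_time_us(n), 1), "us")
# reproduces Table 2: 547 cycles (10.9 us), 1072, 5272, 157522 (~3150 us)
```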
6 Conclusion
6.1 Concluding Remarks
Built-in Self-Repair (BISR) is a technique that is becoming more and more important as the feature size of VLSIs shrinks and the chance of faults such as aging defects and temporal errors increases. In this paper, a conceptual design of a power-aware BISR system using triple modular redundancy (TMR) was implemented on a custom dynamically reconfigurable platform. The motivation for such a system as well as its design and implementation were explained, followed by the simulation and implementation results and observations. The authors have successfully demonstrated how the Ubichip, a bio-inspired reconfigurable custom VLSI, can be used to implement flexible power-aware fault-tolerant systems. 6.2
Future Work
In order to make the fault-tolerant system presented here available for more practical use, the authors have identified several directions for future research: Power estimation: Accurate measurement of the power consumption is necessary for a power-aware system to work correctly. As the Ubichip platform does not offer current measurement capabilities, this experiment used the transitions of the output values to estimate the dynamic power consumption of the functional unit. Research should be conducted to incorporate a system that measures the power consumption more accurately. Error Correction: In this experiment, a simple bit-wise XOR circuit compared the outputs in the TMR system. Further research and development is necessary to implement error correction capabilities. Such a correction system should detect and locate the circuit in error, eliminate the faulty circuit from the TMR trio, and create a new copy of the circuit in a new location. SR Controller: The control mechanism of the system was implemented on reconfigurable cells of the Ubichip, resulting in an area overhead on the reconfigurable fabric. Research should be conducted to study the possibility of implementing the self-replication controller circuit as part of the platform so that developers can easily add this BISR capability to their new designs.
Developing Environment: Ubimanager provides many useful features for designing and implementing functions on the Ubichip platform. However, the lack of a high-level language compiler means that developers must implement LUT contents and routing manually, increasing the development time significantly. Furthermore, the lack of debug tools makes it very time-consuming to detect and correct errors in a design. Design tools such as a floor planner, an interconnect router, a high-level language compiler, and a debugger would make the Ubichip more accessible for practical applications.
Acknowledgements This work has been partially funded by the European Union (PERPLEXUS project, Contract no. 34632).
References
1. Bushnell, M.L., Agrawal, V.D.: Essentials of Electronic Testing for Digital, Memory and Mixed-Signal VLSI Circuits. Kluwer Academic Publishers, Boston (2000)
2. International technology roadmap for semiconductors. 2009 ITRS report, emerging research materials. Technical report (2010)
3. Kleihorst, R.P., Benschop, N.F.: Fault tolerant ICs by area-optimized error correcting codes. In: IOLTW, p. 143. IEEE Computer Society, Los Alamitos (2001)
4. Moreno, J.M., Madrenas, J.: A reconfigurable architecture for emulating large-scale bio-inspired systems. In: IEEE Congress on Evolutionary Computation, CEC 2009, pp. 126–133, 18-21 (2009)
5. Nieuwland, A.K., Kleihorst, R.P.: IC cost reduction by applying embedded fault tolerance for soft errors. J. Electronic Testing 20(5), 533–542 (2004)
6. PERPLEXUS Project. Pervasive computing framework for modeling complex virtually-unbounded (2010), http://www.perplexus.org/
7. Thoma, Y., Sanchez, E., Moreno, J.M., Tempesti, G.: A dynamic routing algorithm for a bio-inspired reconfigurable circuit. In: Cheung, P.Y.K., Constantinides, G.A., de Sousa, J.T. (eds.) Field-Programmable Logic and Applications, pp. 681–690. Springer, Heidelberg (2003)
8. Thoma, Y., Upegui, A.: Ubimanager: A software tool for managing ubichips. In: NASA/ESA Conference on Adaptive Hardware and Systems, AHS 2008, pp. 213–219, 22-25 (2008)
9. Thoma, Y., Upegui, A., Perez-Uribe, A., Sanchez, E.: Self-replication mechanism by means of self-reconfiguration. In: Lukowicz, P., Thiele, L., Tröster, G. (eds.) ARCS 2007. LNCS, vol. 4415. Springer, Heidelberg (2007)
10. Upegui, A., Thoma, Y., Sanchez, E., Pérez-Uribe, A., Moreno, J.M., Madrenas, J., Sassatelli, G.: The PERPLEXUS bio-inspired hardware platform: A flexible and modular approach. KES Journal 12(3), 201–212 (2008)
11. Vargas, J.S., Moreno, J.M., Madrenas, J., Cabestany, J.: Implementation of a dynamic fault-tolerance scaling technique on a self-adaptive hardware architecture. In: Prasanna, V.K., Torres, L., Cumplido, R. (eds.) Proceedings of ReConFig 2009: 2009 International Conference on Reconfigurable Computing and FPGAs, Cancun, Quintana Roo, Mexico, December 9-11, pp. 445–450. IEEE Computer Society, Los Alamitos (2009)
Automatic Synthesis of Lossless Matching Networks
Leonardo Bruno de Sá1, Pedro da Fonseca Vieira2, and Antonio Mesquita3
1 Brazilian Army Technological Center, Av. das Américas, 28705, Guaratiba, Rio de Janeiro, Brazil
2,3 Federal University of Rio de Janeiro, Ilha do Fundão, Electrical Engineering Program, Rio de Janeiro, Brazil
[email protected], [email protected], [email protected]
Abstract. An evolutionary method for the synthesis of impedance matching networks is proposed. The algorithm uses a coding scheme based on the graph adjacency matrix to represent the topology and component values of the circuit. In order to generate realistic solutions the sensitivities of the network parameters are accounted for during the synthesis process. To this end a closed form expression for the Transducer Power Gain sensitivity with respect to the component values of LC lossless matching networks is derived, in such a way that the effects of the components tolerance on the matching network performance can easily be quantified. The evolutionary algorithm efficiency is tested in the synthesis of an impedance matching network and the results are compared with other methods found in the literature. Keywords: circuit synthesis, matching networks, graph representation.
1 Introduction The impedance matching problem consists of finding a linear time-invariant lossless two-port network (also called an equalizer) such that the power available from an input resistive generator is delivered to a complex frequency-dependent load over a prescribed frequency band [1]. Many techniques have been developed for the solution of the impedance matching problem over the last seventy years. The first successful technique, called the analytic gain-bandwidth approach [2-3], is based on a load model that characterizes the matching circuit termination by a prescribed rational transfer function. This technique provides a gain-bandwidth limitation on any lossless infinite-element matching circuit having simple RC or RLC loads. Moreover, it describes a procedure to synthesize practical equalizers for a given load model. The main drawback of this technique is the need for precise load models, requiring the use of lengthy approximation methods. In order to overcome the difficult problem of designing an equalizer by analytical means, the Real Frequency Technique (RFT) emerged as a major breakthrough in equalizer design [4-5]. The main advantage of this technique is that no model of the load is required. However, this method is not quite effortless, since several computational phases involving data-fitting processes and explicit factorization of real polynomials are still necessary.
The methods described previously provide procedures to synthesize an equalizer but present at least one of the following drawbacks: they require an approximate rational model of the load [2-3], require some data-fitting process [4-5], require an initial guess of the equalizer parameters [6], or use general-purpose fixed topologies for the matching network [2-5]. These problems can be conveniently handled if an evolutionary approach is employed to synthesize the matching network. It will be shown that the load characteristics can be directly manipulated by the evolutionary process, improving the synthesis results, since no approximation errors derived from the antenna modeling are introduced and the proposed evolutionary process does not require any data fitting or initial parameter guess. Moreover, since classical synthesis methods use a limited number of fixed topologies, they are unable to explore a wide range of design alternatives. On the other hand, evolutionary techniques may be used to find unconventional network topologies [6-7]. In the present work, a hybrid evolutionary method combining two algorithms is proposed for the matching network synthesis problem. The benefits of hybrid methods combining Genetic Algorithms (GA) and traditional optimization techniques have already been demonstrated for analog synthesis [8-9]. The topology search is provided by a GA based on the adjacency matrix chromosome representation [10], while the component-value tuning is performed by combining the GA with the classical Nelder-Mead Downhill Simplex method [11]. The fitness computation considers practical ranges for the component values and the matching network sensitivity, eliminating the need for a post-synthesis component tolerance analysis.
2 TPG Sensitivity Computation The impedance matching problem is shown schematically in Fig. 1, where the internal nodes of the lossless network are numbered from left to right starting from node 1 up to node n. V1 denotes the voltage on node 1 and Vn denotes the voltage on node n. The values of the load impedance, Zload(ω), are sampled at a discrete set of frequencies along the desired bandwidth. Thus a table containing the real and imaginary parts of Zload(ω) at each sample frequency is available. The matching problem for a lossless network may be formulated in terms of the Transducer Power Gain (TPG), defined as the fraction of the maximum power available from the source which is delivered to the load [1]:
Fig. 1. Impedance matching arrangement
TPG(\omega) = 1 - \left| \rho_1(\omega) \right|^2    (1)
where ρ1 ( ω ) , the reflection coefficient of port 1, is a function of the input impedance Z1 ( ω ) :
\rho_1(\omega) = \frac{Z_1(\omega) - R_{o1}}{Z_1(\omega) + R_{o1}}    (2)
The voltage V1 may be expressed in terms of Z1 and Ro1 as:

V_1 = \frac{Z_1}{Z_1 + R_{o1}}    (3)
where the frequency dependence was omitted to simplify. Combining (1), (2) and (3), the reflection coefficient of port 1 may be expressed in terms of V1 :
\rho_1 = 2V_1 - 1    (4)
Consider a lossless matching network with m parameters p = [p1, p2, ..., pm], where the pi's are the network component values. The TPG sensitivity with respect to each parameter pi is defined as [12]:
S_{p_i}^{TPG} = \frac{p_i}{TPG} \cdot \frac{\partial TPG}{\partial p_i}    (5)
Combining (1), (4) and (5), the TPG sensitivity may be written as:
S_{p_i}^{TPG} = \frac{4 p_i}{\left| 2V_1 - 1 \right|^2 - 1} \cdot \Re\left[ 2V_1 - 1 \right] \cdot \frac{\partial V_1}{\partial p_i}    (6)
where ℜ[•] denotes the real part of a complex quantity. Since the nodal voltages of Fig. 1 can be obtained in an AC analysis, the only unknown term to be determined in (6) is the partial derivative of V1 with respect to pi. This term can be expressed as [13]:

\frac{\partial V_1}{\partial p_i} = \left[ V^a \right]^t \cdot \frac{\partial [Y]}{\partial p_i} \cdot [V]    (7)
where [V^a], [Y] and [V] are, respectively, the adjoint nodal voltage vector, the nodal admittance matrix and the nodal voltage vector. The nodal voltage vector and the adjoint nodal voltage vector can be obtained, respectively, from the AC small-signal analyses of the circuits shown in Fig. 1 and Fig. 2. The last term needed to compute (7) is the partial derivative of the admittance matrix. There are only two possible ways of connecting a two-terminal element in a network, as shown in Table 1: a "floating" connection or a "grounding" connection. The corresponding contributions to the ∂[Y]/∂pi matrix are given in the same table.
Fig. 2. Circuit used to obtain the adjoint voltage vector

Table 1. Partial Derivative of the Admittance Matrix
Connection case | Condition
floating case   | j ≠ gnd, k ≠ gnd
grounding case  | j ≠ gnd, k = gnd
The non-zero entries of the matrices in Table 1 may be expressed as:

\frac{\partial Y_{jj}}{\partial p_i} = -\frac{\partial Y_{jk}}{\partial p_i} = -\frac{\partial Y_{kj}}{\partial p_i} = \frac{\partial Y_{kk}}{\partial p_i} = Y_{LC}    (8)

where Y_LC is given by:

Y_{LC} = \begin{cases} -\dfrac{1}{j\,\omega\, L^2}, & \text{if } p_i \text{ is an inductor} \\ j\,\omega, & \text{if } p_i \text{ is a capacitor} \end{cases}    (9)
where ω ∈ Ω = [ωmin, ωmax]. Replacing (9) and (8) in (7), and (7) in (6):

S_{p_i}^{TPG} = \begin{cases} \dfrac{4 p_i}{\left|2V_1-1\right|^2-1} \cdot \Re\left[2V_1-1\right] \cdot Y_{LC} \cdot \left(V_j - V_k\right)\left(V_j^a - V_k^a\right), & \text{if floating connection} \\[1ex] \dfrac{4 p_i}{\left|2V_1-1\right|^2-1} \cdot \Re\left[2V_1-1\right] \cdot Y_{LC} \cdot V_j \, V_j^a, & \text{if grounding connection} \end{cases}    (10)
Therefore, according to (10), in order to compute the TPG sensitivity for a lossless impedance matching network, two AC analyses must be performed. In the first AC
analysis, the circuit in Fig. 1 is used to compute V1 ,…, Vk . In the second AC analysis, the circuit in Fig. 2 is used to compute the adjoint voltages V1a ,…, Vka .
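A compact way to see how (5)-(10) are used in practice is sketched below in Python (our own function names and made-up numbers, not values from the paper): given the nodal voltages of one element's terminals from the two AC analyses, the sensitivity follows by unfolding the adjoint relation (7) and the chain rule behind (6), so the grouping of the real-part operator is made explicit.

```python
def dY_dp(value, omega, is_inductor):
    """Derivative of a branch admittance w.r.t. its component value, eq. (9):
    -1/(j*w*L^2) for an inductor, j*w for a capacitor."""
    return -1.0 / (1j * omega * value ** 2) if is_inductor else 1j * omega

def tpg_sensitivity(p_i, omega, is_inductor, v1, vj, vja, vk=0.0, vka=0.0):
    """Sensitivity of the TPG to one L or C element connected between nodes j and k
    (set vk = vka = 0 for a grounded element). vj, vk come from the AC analysis of
    Fig. 1 and vja, vka from the adjoint analysis of Fig. 2, at the same frequency."""
    dv1_dp = (vja - vka) * dY_dp(p_i, omega, is_inductor) * (vj - vk)   # eq. (7) for one element
    rho = 2.0 * v1 - 1.0                                                # eq. (4)
    tpg = 1.0 - abs(rho) ** 2                                           # eq. (1)
    dtpg_dp = -4.0 * (rho.conjugate() * dv1_dp).real                    # chain rule on |rho|^2
    return p_i / tpg * dtpg_dp                                          # definition (5)

# Illustrative numbers only; in practice V and V^a come from two SPICE AC runs.
print(tpg_sensitivity(p_i=1.2, omega=1.0, is_inductor=False,
                      v1=0.45 + 0.10j, vj=0.30 + 0.05j, vja=0.20 - 0.12j))
```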
3 Proposed Evolutionary Algorithm

All input parameters of the proposed evolutionary algorithm are listed in Table 2.

Table 2. Control Parameters used in the Proposed Evolutionary Algorithm
Parameter                    | Acronym                   | Values
Maximum Number of Nodes      | N_nodes_max               | 10
Minimum Number of Nodes      | N_nodes_min               | 2
Population Size              | N_ind                     | 200
Number of Generations        | N_gen                     | 100
Probability of Crossover     | PC                        | 0.6
Probability of Mutation      | PM                        | 0.1
Simplex Number of Iterations | N_NM                      | 200
Penalty Constant             | λ                         | 0.01
Weighting Factors            | wTPG, wSens, wPF, wTR     | 2, 2, 1, 1
Capacitors Interval          | [Cmin, Cmax]              | [0, 5] F
Inductors Interval           | [Lmin, Lmax]              | [0, 5] H
3.1 Representation
Let G(v, e) be an oriented graph with no parallel edges and n + 1 nodes numbered from 0 to n, where 0 is the ground node of the circuit. The reduced adjacency matrix A = [a_ij] of the oriented graph G is the n x n matrix defined as [11]:

a_{ij} = \begin{cases} 1, & \text{if } (i,j) \in e \\ 0, & \text{otherwise} \end{cases} \quad \forall\, i \neq j \neq 0; \qquad a_{ii} = 1 \ \text{if } (i,0) \in e    (11)
In the above definition, the self-loops are replaced by the adjacencies to the ground node. Fig. 3 shows an example of a graph representing a typical topology of an analog circuit and its corresponding reduced adjacency matrix. The branches {e1 , e 2 , e3 } are the adjacencies to the ground node represented by the main diagonal entries. It will be shown that the adjacency matrix representation is extremely flexible, representing any lossless network topology. This is an important advantage when compared with other chromosome coding schemes found in the literature [7, 15] that limit
Fig. 3. Adjacency matrix representation for analog circuits (a) oriented graph G (b) adjacency matrix A
the number of topologies generated by the evolutionary process to a small class of circuits such as ladder networks. The reduced adjacency matrix, as stated in (18), allows representing topologies of any type as shown in Fig. 4. In the proposed topology coding scheme, the capacitors, the inductors and the parallel associations (C//L) are, respectively, encoded by the numbers 1, 2 and 3. Note that the parallel association (C//L) allows the adjacency matrix to represent grounded components of different type connected to the same node.
Fig. 4. (a) Ladder and (b) non-conventional topology and their corresponding adjacency matrix representations
It can be observed from the adjacency matrix definition that the proposed encoding scheme can map at most two parallel edges between two vertices. Since elements of the same nature connected in parallel can be replaced by their equivalents, this does not restrict the topology search in a lossless network synthesis. In the particular case of analog circuits, the representation must simultaneously encode topology and component values. To this end, a 3D matrix was implemented, as illustrated in Fig. 5. In this structure, the first matrix dimension defines the network topology according to (11), with the element encoding scheme used in Fig. 4. The other two matrix dimensions represent, respectively, the capacitor and inductor values.
Fig. 5. Proposed representation of an impedance matching network using the adjacency matrix
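A minimal Python sketch of this chromosome structure, under our own data-layout assumptions, is given below: one integer plane holds the topology codes (0 = no branch, 1 = C, 2 = L, 3 = C//L, with the main diagonal encoding connections to ground) and two further 16-bit planes hold the capacitor and inductor codes that a linear normalisation later maps onto the practical ranges of Table 2.

```python
import numpy as np

C_MIN, C_MAX = 0.0, 5.0      # farads, per Table 2
L_MIN, L_MAX = 0.0, 5.0      # henries, per Table 2

def random_chromosome(n_nodes, rng=np.random.default_rng(0)):
    """One individual: an n x n topology plane plus two 16-bit value planes."""
    topo = rng.integers(0, 4, size=(n_nodes, n_nodes))                       # 0..3 element codes
    cap = rng.integers(0, 2 ** 16, size=(n_nodes, n_nodes), dtype=np.uint16)  # capacitor codes
    ind = rng.integers(0, 2 ** 16, size=(n_nodes, n_nodes), dtype=np.uint16)  # inductor codes
    return {"topo": topo, "cap": cap, "ind": ind}

def decode(code16, lo, hi):
    """Linear normalisation of a 16-bit unsigned integer onto [lo, hi]."""
    return lo + (hi - lo) * float(code16) / (2 ** 16 - 1)

individual = random_chromosome(3)
print(individual["topo"])
print(decode(individual["cap"][0, 0], C_MIN, C_MAX), "F")
```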
The component value ranges are defined at the start of the evolutionary process in such a way that the synthesized circuit can be implemented with practical element values, as described in the last two lines of Table 2. A linear normalization procedure is employed to represent the component values using a 16-bit unsigned integer matrix.

3.2 Crossover and Mutation

The proposed crossover strategy, consisting in exchanging two submatrices with randomly chosen dimensions, is illustrated in Fig. 6. Assume two individuals with adjacency matrix representations of dimensions m and n. If m < n, the coordinates of the crossover point (i, j) are chosen on the smaller matrix, where i and j are integers inside the intervals

i \in [0, m-1]    (12a)
j \in [0, m-1]    (12b)

To define the dimensions of the submatrices to be exchanged in the crossover, two integers p and q are randomly chosen in the intervals:

p \in [1, m-i]    (13a)
q \in [1, m-j]    (13b)

Finally, the coordinates of the crossover point on the largest matrix are integers randomly chosen in the intervals:

k \in [0, n-p]    (14a)
l \in [0, n-q]    (14b)

In Fig. 6, the dimensions of the matrices, the coordinates of the crossover points and the dimensions of the submatrices are, respectively, m = 3, n = 4, i = 1, j = k = l = 2, p = 1 and q = 2.
Fig. 6. Example of a crossover between two impedance networks (a) parent 1 (b) parent 2 (c) offspring individual 1 (d) offspring individual 2
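Following the interval definitions (12)-(14), the submatrix exchange can be sketched as below (Python, illustrative only; in the actual chromosome the capacitor and inductor value planes would be swapped together with the topology plane).

```python
import numpy as np

def submatrix_crossover(parent_a, parent_b, rng=np.random.default_rng()):
    """Exchange a randomly sized submatrix between two adjacency matrices,
    following intervals (12)-(14); parent_a is assumed to be the smaller one."""
    a, b = parent_a.copy(), parent_b.copy()
    m, n = a.shape[0], b.shape[0]
    i, j = rng.integers(0, m), rng.integers(0, m)                      # crossover point, eq. (12)
    p, q = rng.integers(1, m - i + 1), rng.integers(1, m - j + 1)      # submatrix size, eq. (13)
    k, l = rng.integers(0, n - p + 1), rng.integers(0, n - q + 1)      # point on larger matrix, eq. (14)
    a[i:i+p, j:j+q], b[k:k+p, l:l+q] = b[k:k+p, l:l+q].copy(), a[i:i+p, j:j+q].copy()
    return a, b

c1, c2 = submatrix_crossover(np.arange(9).reshape(3, 3), np.zeros((4, 4), dtype=int))
print(c1)
print(c2)
```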
Mutation requires only one individual. The proposed mutation operator creates a new individual by randomly replacing an existing submatrix by a new submatrix, also containing randomly chosen entries, as depicted in Fig. 7.
Fig. 7. Example of a mutation (a) before mutation (b) after mutation
Assume one individual with an adjacency matrix representation of dimension m. The coordinates of the mutation point (i, j) are chosen according to (12). The dimensions of the submatrix are always p = q = 1. In Fig. 7, where a capacitor is replaced by a parallel association of a capacitor and an inductor, the matrix dimension and the coordinates of the mutation point are, respectively, m = 3 and i = j = 1. It can be noted that the topology and the component values of the impedance matching network are simultaneously changed by the proposed genetic operations.

3.3 Fitness Computation

Since the main specifications that a practical lossless matching network should fulfill are a TPG close to one over the prescribed frequency band, low sensitivity and practical component values, the proposed evolutionary process takes all these characteristics into account in the fitness computation. The fitness should be maximized and is defined as the inverse of the error function ε given by:

\varepsilon = 1 + w_{TPG}\,\varepsilon_{TPG} + w_{Sens}\,\varepsilon_{Sens} + w_{PF}\,\varepsilon_{PF} + w_{TR}\,\varepsilon_{TR}    (15)
In this equation, ε is the total error, ε_TPG is the error in the impedance matching, ε_Sens is the error in the network sensitivity, ε_PF is the error in the component values, ε_TR is the error for using an ideal transformer and the w_i's are weighting factors. The choice of the weighting factors is generally based on expert knowledge of the optimization problem at hand [17]. Combining (1) and (4), the TPG can be written as a function of the voltage on node 1, which was previously stored by the evolutionary algorithm after the simulator execution:

TPG = 1 - \left| 2V_1 - 1 \right|^2    (16)
Having computed the TPG values along the frequency band of interest, a minimax error criterion is used to obtain ε_TPG:

\varepsilon_{TPG} = \min\left\{ \max_{\omega \in \Omega} \left| TPG - 1 \right| \right\}    (17)

where Ω = [ωmin, ωmax] denotes the frequency band of interest. Low sensitivity with respect to the component values is a necessary condition for practical implementations of evolved matching networks. In this case, the closed-form TPG sensitivity derived in Section 2 is used with a minimax criterion to compose the sensitivity error:
\varepsilon_{Sens} = \min\left\{ \sum_{i=1}^{m} \max_{\omega \in \Omega} \left| S_{p_i}^{TPG} \right| \right\}    (18)

where m is the number of elements in the lossless matching network. The third requirement that a lossless matching network must meet is related to practical component values. A penalty function strategy is used to restrict the component values to practical ranges. The main idea is to transform a constrained optimization problem into an unconstrained one by introducing a penalty term into the total error function [18]:

\varepsilon_{PF} = \lambda \cdot \sum_{i=1}^{m} d_i    (19)

where λ and d_i are, respectively, a user-defined constant and a distance metric for the i-th constraint. The last error, ε_TR, is concerned with the use of an ideal transformer for impedance matching:

\varepsilon_{TR} = \begin{cases} 0.2, & \text{if an ideal transformer is used} \\ 0, & \text{otherwise} \end{cases}    (20)

3.4 Algorithm Overview
The epistatic nature of evolutionary analog circuit synthesis is well known [8, 19]. This means that the behavior of any analog circuit is a combined function of its topology and component values. The use of a GA to modify both the topology and the component values of a circuit may result in an inefficient evolutionary algorithm [8]. In fact, a network topology generated by genetic operations can be properly evaluated only if the component values are tuned. To overcome this problem, an evolutionary algorithm performing the topology search through a GA (crossover and mutation) and the component value tuning through a conventional optimization method should be preferred. The schematic diagram of the proposed evolutionary algorithm is shown in Fig. 8.
Fig. 8. Evolutionary algorithm used in the impedance matching network synthesis including the component values tuning step
In the proposed algorithm all individuals in the initial population are tuned by the Nelder-Mead Downhill Simplex method, which does not require the calculation of derivatives. Since there is the possibility of the Nelder-Mead algorithm getting stuck in local minima, it is combined with the random search of the GA to find the minimum of (15). The iteration-count criterion, N_NM in Table 2, is used to stop the
component value tuning. In this work, a proportional selection operator with elitism is used, as shown in Fig. 8. The inclusion of the component value tuning in the fitness computation demands a considerable computational effort. Except for the initial population, where all individuals must be tuned, in the following populations only part of the individuals must be tuned. This occurs because in the following populations not all individuals are submitted to the genetic operations. A Boolean variable associated with each individual indicates whether the individual underwent crossover or mutation. Thus, only the individuals that were affected by the genetic operators are optimized in the tuning step, reducing the computational effort.
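The interaction between the GA and the local tuner can be summarised as in the Python skeleton below; the structure and names are ours, scipy's Nelder-Mead routine stands in for the tuning step, and selection, crossover and mutation are passed in as abstract operators that set the "modified" flag on the individuals they touch.

```python
from scipy.optimize import minimize

def tune(ind, error_fn, n_nm=200):
    """Nelder-Mead tuning of one individual's component values (N_NM = 200 in Table 2).
    error_fn(topology, values) is assumed to evaluate the total error of eq. (15)."""
    res = minimize(lambda x: error_fn(ind["topo"], x), ind["values"],
                   method="Nelder-Mead", options={"maxiter": n_nm})
    ind["values"], ind["error"], ind["modified"] = res.x, res.fun, False
    return ind

def evolve(population, error_fn, select, crossover, mutate, n_gen=100):
    """Hybrid loop: the GA searches topologies, the simplex tunes component values.
    Only individuals flagged as modified by crossover/mutation are re-tuned."""
    population = [tune(ind, error_fn) for ind in population]      # tune the whole initial population
    for _ in range(n_gen):
        offspring = [mutate(child) for child in crossover(select(population))]
        offspring = [tune(ind, error_fn) if ind.get("modified") else ind for ind in offspring]
        population = select(population + offspring)               # proportional selection with elitism
    return min(population, key=lambda ind: ind["error"])
```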
4 Numerical Results This example is a well-known test case called the Fano LCR load [2]; it was used to verify the effectiveness of the proposed evolutionary method. The load impedance consists of a 1.0Ω resistor in parallel with a 1.2F capacitor, in series with a 2.3H inductor. The gain must be maximized in the bandwidth [0, 1 rad/s]. The LCR load is sampled at a discrete set of 100 frequencies along the desired bandwidth. This example was already solved in [20] using a hybrid algorithm that optimizes ladder topologies and in [21] using the Real Frequency Technique (RFT). Although the proposed algorithm can synthesize any kind of topology, in this particular case the topology found by the proposed algorithm was the same as that found by the other two mentioned approaches, as shown in Fig. 9.
Fig. 9. Lossless matching network topology

Table 3. Parameters of the example
Parameter             | RFT [21] | Hybrid Algorithm [20] | Proposed Method
TPGmin                | 0.848    | 0.852                 | 0.855
Passband Ripple in dB | 0.191    | 0.239                 | 0.264
Sensmax               | 24.892   | 19.876                | 13.366
Transformer Ratio n   | 1.483    | 1.485                 | 1.493
C1 (F)                | 0.352    | 0.386                 | 0.409
L2 (H)                | 2.909    | 2.976                 | 3.023
C3 (F)                | 0.922    | 0.951                 | 0.971
Table 3 summarizes the results. The passband ripple is defined in [21]. The proposed algorithm obtained the best results for TPG and sensitivity, but the worst passband ripple. This is an expected consequence of optimizing the worst-case sensitivity and passband gain regardless of the passband ripple. Fig. 10(a) shows the TPG along the
prescribed frequency band for the three mentioned approaches. The simulations were done using HSPICE. The control parameters used by the proposed method are described in Table 2. The evolution of the best individual's fitness throughout the generations was examined for two different configurations of the proposed evolutionary algorithm. In the first configuration, without the tuning step, the topology and component values are entirely manipulated by the GA. In this case, as shown in the figure, the fitness stays almost constant along the generations, since nothing was done to deal with the epistatic nature of analog circuits. In the second configuration, with the tuning step, the Nelder-Mead Downhill Simplex is used together with the GA. In this case substantial fitness changes between consecutive generations are provided by the tuning step. The algorithm was run 25 times for each case and only the best individual performance of all runs is shown in Fig. 10(b).
Fig. 10. (a) Transducer Power Gain for the three approaches (b) Fitness versus generation for the best individuals in two different configurations of the evolutionary algorithm.
5 Conclusions A closed-form expression to compute the TPG sensitivity with respect to the component values of a lossless impedance matching network was derived. An evolutionary algorithm including the sensitivity as part of the fitness computation was proposed. The representation of lossless impedance matching networks based on the adjacency matrix was presented as an alternative to representations that limit the number of topologies generated by the evolutionary process. In order to deal with the epistasis problem characteristic of analog circuit synthesis, the conventional evolutionary algorithm steps were modified by the insertion of a component value tuning step during the fitness computation. This mechanism proved to be efficient, substantially increasing the best individual's fitness throughout the generations of the evolutionary process. In order to test the algorithm, a well-known LCR load was used as an example, and it was observed that the results obtained by the proposed approach compare favorably with other results found in the literature.
References 1. Balabanian, N., Bickart, T.A., Seshu, S.: Electrical Network Theory. John Wiley & Sons, Chichester (1969) 2. Fano, F.M.: Theoretical limitations on the broadband matching of arbitrary impedances. J. Franklin Inst. 249, 57–83 (1950) 3. Youla, D.C.: A new theory of broadband matching. IEEE Trans. Circuit Theory CT-11, 30–50 (1954) 4. Carlin, H.J.: A New Approach to Gain-Bandwidth Problems. IEEE Trans. on Circ. and Syst. 24(4) (April 1977) 5. Carlin, H.J., Yarman, B.S.: The Double Matching Problem: Analytic and Real Frequency Solutions. IEEE Trans. on Circ. and Syst. 30(1) (April 1983) 6. Koza, J., Bennett, F.H., Andre, D., Keane, M.A.: Genetic Programming III. Darwinian Invention and Problem Solving. Morgan Kaufmann, San Mateo (1999) 7. Lohn, J.D., Colombano, S.P.: A Circuit Representation Technique for Automated Circuit Design. IEEE Trans. Evol. Comp. 3(3), 205–219 (1999) 8. Grimbleby, J.B.: Automatic analogue circuit synthesis using genetic algorithms. IEE Proc. Circuits Devices Syst. 147(6), 319–323 (2000) 9. Damavandi, N., Safavi-Naenini, S.: A Hybrid Evolutionary Programming Method for Circuit Optimization. IEEE Trans. Circ. and Syst. I 52(5) (May 2005) 10. Mesquita, A., Salazar, F.A., Canazio, P.P.: Chromosome representation through adjacency matrix in evolutionary circuits synthesis. In: Proc. of the NASA/DoD Conference on Evolvable Hardware, pp. 102–109 (2002) 11. Nelder, J., Mead, R.: A Simplex Method for Function Minimization. Computer Journal 7, 308–311 (1965) 12. Daryanani, G.: Principles of Active Network Synthesis and Design. John Wiley & Sons, Chichester (1980) 13. Vlach, J., Singhal, K.: Computer Methods for Circuit Analysis and Design, 2nd edn. Van Nostrand Reinhold (1994) 14. Swamy, M.N.S., Thulasiraman, K.: Graphs, Networks and Algorithms. John Wiley & Sons, Chichester (1981) 15. Chang, S., Hou, H., Su, Y.: Automated Passive Filter Synthesis Using a Novel Tree Representation and Genetic Programming. IEEE Trans. Evol. Comp. 10(1), 93–100 (2006) 16. Greenwood, G.W., Tyrrell, A.M.: Introduction to Evolvable Hardware – A Practical Guide for Designing Self-Adaptive Systems. Wiley Interscience, Hoboken (2007) 17. Zebulum, R.S., Pacheco, M.A.C., Vellasco, M.M.B.R.: Evolutionary Electronics - Automatic Design of Electronic Circuits and Systems by Genetic Algorithms. CRC Press, Boca Raton (2001) 18. Smith, A.E., Coit, D.W.: Handbook of Evolutionary Computation. In: De Jong, K., Fogel, L., Schwefel, H. (eds.) C.5.2 (1997) 19. Vieira, P.F., Sa, L.B., Botelho, J.P.B., Mesquita, A.: Evolutionary synthesis of analog circuits using only MOS transistors. In: Proc. of the 2004 NASA/DoD Conference on Evolvable Hardware, pp. 38–45. IEEE Computer Press, USA (2004) 20. Rodríguez, J.L., García-Tuñon, I., Tabeada, J.M., Basteiro, F.O.: Broadband HF Antenna Matching Network Design Using Real-Coded Genetic Algorithm. IEEE Trans. Antennas Propag. 55(3) (March 2007) 21. Carlin, H.J., Amstutz, P.: On optimum broadband matching. IEEE Trans. Circuits and Syst. CAS-28, 401–405 (1981)
A Novel Approach to Multi-level Evolutionary Design Optimization of a MEMS Device
Michael Farnsworth1, Elhadj Benkhelifa1, Ashutosh Tiwari1, and Meiling Zhu2
1 Decision Engineering Centre, 2 Microsystems and Nanotechnology Centre, Cranfield University, College Road, Bedfordshire, MK43 0AL
{m.j.farnsworth,e.benkhelifa,a.tiwari,m.zhu}@cranfield.ac.uk
Abstract. This paper introduces a novel approach to the evolutionary design optimisation of a MEMS bandpass filter, incorporating areas of multi-disciplinary, multi-level and multi-objective design optimisation in the process. In order to demonstrate this approach, a comparison is made to previous attempts to design similar bandpass filters, providing comparable results at a significant reduction in functional evaluations. In this endeavour, a circuit equivalent of the MEMS bandpass filter is evolved extrinsically using the SPICE simulator. Keywords: Multi-Disciplinary Optimisation; Multi-Objective Evolutionary Algorithm; Multi-Level Optimisation; MEMS; Micro-Electro-Mechanical Systems; Extrinsic Evolution.
1
Introduction
Micro-electro-mechanical systems (MEMS), or micro-machines [1,2], are a field that grew out of the integrated circuit (IC) industry, utilizing fabrication techniques from the technology of Very-Large-Scale Integration (VLSI). The goal is to develop smart micro devices which can interact with their environment in some form. The paradigm of MEMS is well established within both the commercial and academic fields. At present encompassing more than just the mechanical and electrical [3], MEMS devices now cover a broad range of domains, including fluidic, thermal, chemical, biological and magnetic systems. This has given rise to a host of applications, from micro-resonators and actuators, gyroscopes and micro-fluidic devices to biological lab-on-chip devices, to name but a few. Normally, designs of such devices are produced in a trial-and-error approach dependent on user experience, which is naturally an antithesis to the goal of allowing designers to focus on device and system design. This approach, nominally coined a 'Build and Break' iteration, is both time-consuming and expensive [2]. Therefore the development of a design optimisation environment [15,16], which can allow MEMS designers to automate the process of modelling, simulation and optimisation at all levels of the MEMS design process, is fundamental to the eventual progress of the MEMS industry [2]. Work in MEMS design automation and optimisation can be seen to fall into two distinct areas: firstly, the more traditional approaches found within numerical methods such as gradient-based search [7]; and
secondly, the use of more powerful stochastic methods such as simulated annealing and/or Evolutionary Algorithms (EAs) [4-6]. There has been a recent shift towards the use of EAs, and more specifically the use of Multi-Objective Genetic Algorithms (MOGA) [17], as these stochastic algorithms allow for a more robust approach to tackling the issues of a complex multi-modal landscape. The bulk of the work utilising Genetic Algorithms (GAs) and MOGA has been undertaken by researchers from the University of California, Berkeley, focusing solely on planar MEMS devices [4-6]. This paper highlights and builds upon past approaches, introducing a novel multi-objective approach to the multi-level and multi-disciplinary design optimisation of a MEMS device. A bandpass filter is chosen as a MEMS case study for this paper in order to demonstrate comparable results to the state of the art in the field. This MEMS device is evolved extrinsically in its equivalent analog circuit form using the SPICE simulator and then physically envisioned using the SUGAR nodal simulator. Results are compared with those within the literature. This paper begins with a brief overview of the hierarchical design environment of MEMS in Section 2, followed by a definition of the bandpass filter problem used in this study in Section 3. Section 4 presents a novel evolutionary design optimisation approach to solving this problem, followed by results in Section 5 and ending with conclusions. 2
Hierarchical MEMS Design
The hierarchical nature of the MEMS design process presents designers with the problem of how best to approach the possible decomposition of the device at the various levels of modelling and analysis abstraction available to them. As outlined by Senturia [14], the four levels (System, Device, Physical, and Process) each harbour their own set of tools and modelling approaches. The system level focuses upon the use of lumped-element circuit models or block diagrams to model device performance, utilising powerful circuit simulators. These provide the possibility to interface with the mechanical elements of the device, either through analytical models, HDL models, reduced-order models or, alternatively, electrical equivalent representations of the mechanical component. Both the device and physical levels provide models of varying granularity. At the device level, a designer can look to build accurate 2D layout models through the use of nodal simulators and various atomic MEMS elements, or by building mathematical analytical representations. The physical level generally utilises more expensive finite element and boundary element methods to simulate and analyse 3D models of the device. The process level looks towards the creation of appropriate mask layouts and the process information needed for the batch process generally employed to fabricate the device. Therefore, by utilising system-level tools it is possible to derive the function of the whole coupled electromechanical device, while the device or physical levels allow the device to be envisioned and thus allow fabrication to follow function. 3
Problem Definition
Analog circuit design for high-pass, low-pass and bandpass filters has been successfully undertaken using evolutionary methods in the past [8] [9], mainly through the use of genetic
programming and a circuit or bond graph representation [10]. These approaches looked to use components associated with circuit design and connect them in various topologies in order to match the target filter response. Recently, MEMS have become a focus upon which to build devices that can provide superior performance to traditional mechanical tank components such as crystal and SAW resonators [11], widely used in bandpass filters within the radio frequency range. A feature of certain MEMS devices is the ability to represent the device as a whole in both mechanical and electrical equivalents. Taking for example a simple folded-flexure resonator [11], the device can be represented as a simple spring-mass-damper system, and equally this system has an equivalent within the electrical domain. Here the values for Mass (mrs), Stiffness (krs), and Damping (crs) of the resonator can be mirrored as Inductance (L), Capacitance (C), and Resistance (R) in the electrical domain. Therefore a mechanical folded-flexure resonator can be represented, and therefore analysed at a system level, by building a simple RLC circuit. The coupling of such resonator units or 'tanks' through the use of mechanical bridges or springs allows the development of devices which can provide certain filter responses. This can also be achieved in the circuit equivalent. The approach for relating the physical parameters of the folded-flexure resonator to the equivalent circuit values has been outlined by Nguyen [11] and the corresponding equations are shown below.
R_x = c_rs / η² = √(k_rs · m_rs) / (Q · η²)    (1)

L_x = m_rs / η²    (2)

C_x = η² / k_rs    (3)

η = V · (∂C/∂x)    (4)

∂C/∂x = 2 ξ N ε h / d    (5)
where V is the dc bias voltage, ξ is a constant that models additional capacitance due to fringing electric fields, ε is the permittivity of air, h is the structural layer thickness, N is the number of comb drive fingers and d is the comb finger gap spacing. Using these equations it is possible to derive resistance, capacitance and inductance values from the damping, stiffness and mass values of the resonator, and equally vice versa. This allows a direct link between the system and device levels and as a result allows the designer to derive both function and fabrication for one particular instance of the MEMS filter design. Figure 1 outlines an approach to decompose a MEMS bandpass filter into separate modelling levels, extract the chosen design variables and construct suitable genotype representations in the case of EAs. In order to assess these two levels, objective functions for evaluation need to be introduced. In the case of filter design, a target response based upon chosen design targets of 'passband', 'stopband' and 'central frequency' can be constructed. Figure 2 shows how to break the filter response into sections of 'stopband' and 'passband' with ideal target values of '20 dB or less' and '0 dB' respectively. A sampling of the frequency response can then be undertaken over a specified range, with the goal of having a filter response in the stopband range equal to or below the target value; in the passband range the goal is simply to match the target value. In both cases the objective function is the sum of the absolute error over the range; however, since the stopband is considerably larger, a weighting factor is used to reduce this value. A second objective, as shown in figure 3, evaluates the distance of the peak filter response of the individual from the target central frequency, the goal being to differentiate between similar filter shapes which may nevertheless lie farther away from the required target. Once a suitable filter response has been found, the circuit model can then be converted to the equivalent mechanical values, which are then used as targets for 2D resonator layout design.

Fig. 1. Filter Design Synthesis Breakdown

Fig. 2. Filter Objective Breakdown

Fig. 3. Central Frequency Objective Breakdown
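As an illustration of the coupling between the two levels, the following Python sketch implements the electrical-to-mechanical conversion of equations (1)-(5). It is a minimal sketch under our own assumptions (the quality factor Q, the function names and the numerical example values are purely illustrative), not the code used in this work.

```python
import math

def electromechanical_coupling(V, xi, eps, h, N, d):
    """eta = V * dC/dx, with dC/dx = 2*xi*N*eps*h/d for the comb drive."""
    dC_dx = 2.0 * xi * N * eps * h / d
    return V * dC_dx

def circuit_to_mechanical(L_x, C_x, eta):
    """Invert the L_x and C_x relations: derive resonator mass and stiffness."""
    m_rs = L_x * eta ** 2          # from L_x = m_rs / eta^2
    k_rs = eta ** 2 / C_x          # from C_x = eta^2 / k_rs
    return m_rs, k_rs

def series_resistance(m_rs, k_rs, eta, Q):
    """Equivalent series resistance R_x of one resonator 'tank'."""
    return math.sqrt(k_rs * m_rs) / (Q * eta ** 2)

# Example: convert one evolved RLC tank back to mechanical targets (dummy values)
eta = electromechanical_coupling(V=50.0, xi=1.2, eps=8.854e-12,
                                 h=2e-6, N=100, d=2e-6)
m, k = circuit_to_mechanical(L_x=1e3, C_x=1e-13, eta=eta)
R = series_resistance(m, k, eta, Q=10_000)
print(m, k, R)
```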
4 Multi-objective Evolutionary Algorithm Filter Design Synthesis
The design and optimisation of a MEMS bandpass filter forms the basis of our multi-level problem. The approach used in this paper couples the multi-objective genetic algorithm NSGAII [17] with an electrical circuit model representation, coined GAECM. Utilising a variable-length, real-valued and integer representation, the goal is to allow the GAECM approach to evolve the topology and parameters of the circuit in order to match the frequency response of a bandpass filter. Once a suitable filter design has been found, its values can then be converted into the equivalent mechanical values for mass and stiffness using the conversion equations given above, and then used as objective targets for the evolution of a 2D layout folded flexure resonator device. Past attempts [12][13] at MEMS filter design optimisation have coupled genetic programming with a bond graph representation, coined GPBG. Though successful, a large number of functional evaluations was required (2.6 million), and no circuit values were reported, so it is not possible to determine whether the resulting designs were physically feasible. Even so, an approach was outlined to allow the automatic synthesis of a physical device, in this case utilising an analytical model of a folded flexure resonator and linking it with GAs [12-13]. The approach proved successful for the set of targets outlined, in this instance matching given values for the mass, stiffness and damping of a single resonator device. However, it was not a true multi-objective algorithm, nor did the target values come from a previously designed filter. In order to solve each design problem, alterations were made to the NSGAII algorithm to improve the overall search ability of the optimizer. The 'SBX' crossover for the GAECM algorithm has been adapted so that it is restricted to occur only within the length of the shorter individual, as shown in figure 4. Included in the mutation operator is the ability to 'clone' or remove tanks from the individual in an attempt to aid topological search, as shown in figures 5 and 6.
Fig. 4. Restricted Crossover for System Level Representation
Fig. 5. Cloning Mutation Operator
Fig. 6. Removal Mutation Operator
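The following Python sketch illustrates, under our own simplifying assumptions, how a variable-length list of RCL tanks can be cloned, removed and recombined within the length of the shorter parent, as described above. It is illustrative only (a plain value swap stands in for the SBX blend), not the GAECM implementation itself.

```python
import random
import copy

# One 'tank' = series RLC values plus the coupling capacitor CS to the next tank.
def random_tank():
    return {"R": random.uniform(1, 200),
            "C": random.uniform(1e-15, 1e-11),
            "L": random.uniform(10, 1e5),
            "CS": random.uniform(1e-15, 1e-11)}

def clone_mutation(genome, rng=random):
    """Duplicate a randomly chosen tank and reinsert it (cf. Fig. 5)."""
    i = rng.randrange(len(genome))
    genome.insert(i + 1, copy.deepcopy(genome[i]))

def remove_mutation(genome, rng=random):
    """Remove a randomly chosen tank, keeping at least one (cf. Fig. 6)."""
    if len(genome) > 1:
        genome.pop(rng.randrange(len(genome)))

def restricted_crossover(p1, p2, rng=random):
    """Exchange tanks only up to the length of the shorter parent (cf. Fig. 4)."""
    cut = rng.randrange(1, min(len(p1), len(p2)) + 1)
    c1, c2 = copy.deepcopy(p1), copy.deepcopy(p2)
    c1[:cut], c2[:cut] = copy.deepcopy(p2[:cut]), copy.deepcopy(p1[:cut])
    return c1, c2

parent1 = [random_tank() for _ in range(3)]
parent2 = [random_tank() for _ in range(5)]
child1, child2 = restricted_crossover(parent1, parent2)
clone_mutation(child1)
remove_mutation(child2)
```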
The design optimisation of the resonator utilises the model representation and simulation of the NODAL analysis tool named 'SUGAR'. This particular approach follows that of previous work [8-10] in design optimisation of MEMS using the SUGAR platform; however, in this instance a completely new folded flexure resonator, as shown in figure 7, is evolved in place of the simpler meandering resonator devices used previously. Utilising a similar hierarchical representation, the whole device, consisting of both the central mass and the supporting springs of the folded flexure, is evolvable. The central mass is made up of ten beam elements, four of which can be designed and then simply mirrored to the other half of the mass. The folded flexure springs are made up of eight individual springs, four at the top and four at the bottom, each connected by three truss beams. Each spring is made up of a number of beam elements, each with its own set of design variables, in this case 'width, length and angle'. In this particular design problem, constraints are placed upon the resonator so as to adhere to a more 'classical' design, with fixed angles for the central mass and folded flexure springs and a simple mirroring along the x and y axes.
Fig. 7. Device Level Representation
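A minimal sketch of how such a hierarchical device-level genotype might be held in code is given below; the field names and bounds are our own illustrative assumptions rather than the representation actually used with SUGAR.

```python
from dataclasses import dataclass, field
import random

@dataclass
class Beam:
    # Design variables of one beam element (bounds used below are illustrative only).
    length: float
    width: float
    angle: float

def random_beam():
    return Beam(length=random.uniform(10e-6, 200e-6),
                width=random.uniform(2e-6, 10e-6),
                angle=0.0)   # fixed angle for the constrained 'classical' design

@dataclass
class ResonatorGenotype:
    # Four designable mass beams, mirrored to the other half of the central mass
    mass_beams: list = field(default_factory=lambda: [random_beam() for _ in range(4)])
    # Eight folded-flexure springs, each a short chain of beam elements
    springs: list = field(default_factory=lambda: [[random_beam() for _ in range(3)]
                                                   for _ in range(8)])

genotype = ResonatorGenotype()
print(len(genotype.mass_beams), len(genotype.springs))
```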
Fig. 8. Whole Spring Crossover
Fig. 9. Inter Beam Crossover
Fig. 10. Central Mass Crossover
Adaptations to the crossover operator were introduced to mimic previous work [4]: when evolving the spring design, the classic 'SBX' operator is replaced with a 'whole spring' crossover and an 'inter beam' crossover, shown in figures 8 and 9 respectively. Central mass crossover (figure 10), however, uses the original 'SBX' crossover operator. The use of SUGAR provides advantages over a single analytical model, as it allows more complex devices to be evolved and, in the future, allows more novel devices to be incorporated. Three case studies, as shown in table 3, form the basis for testing this new approach to filter design. Beginning with a relatively low frequency filter taken from [12,13], two more filter design problems are introduced to test the robustness of the algorithm at higher frequencies. The parameters used by NSGAII to solve both the system and device level design problems are shown in table 1; in this instance the system level uses a higher mutation rate to increase the chance of adding or removing 'RCL' tanks. Also, two population and offspring sizes were run for each case study at the system level. Table 2 holds the variable ranges for the circuit design problem; resistance is worked out from capacitance, inductance and equation (1) and is therefore left blank. Each case study was fixed to a specific range in which points were sampled at specific frequencies and then used to evaluate the two objectives outlined previously for the system level design. These ranges were [0 Hz-10 kHz] for case study 1, resulting in 10,000 sampling points, and [0 Hz-25 kHz] and [85 kHz-110 kHz] for case studies 2 and 3 respectively, resulting in 25,000 sampling points each. As a result, weighting factors for the sum of the stopbands were set to divide the value by 9 and 25 so that the algorithm does not focus too heavily on optimising the stopband.

Table 1. NSGAII Parameters

Parameter                         System     Device
Probability of SBX Crossover      0.8        0.8
Probability of Mutation           0.35       0.10
Distribution Index for crossover  20         20
Distribution Index for mutation   20         20
Population Size                   100 / 20   100
Offspring Size                    100 / 10   100
Selection Size                    100 / 10   100
Generations                       100        100
Tests                             5          -
Table 2. Circuit Design Variable Parameters

Variable Type     Case Study One      Case Study Two      Case Study Three
                  Lower    Upper      Lower    Upper      Lower    Upper
Tank No           1        9          1        9          1        9
Voltage           1        200        1        200        1        200
Resistance (Ω)    -        -          -        -          -        -
Capacitance (F)   1e-15    1e-11      1e-17    1e-14      1e-18    1e-15
Inductance (H)    10       100000     10       100000     10       100000
Finger Number     1        200        1        200        1        200
Thickness (m)     2e-6     3e-5       2e-6     3e-5       2e-6     3e-5
Table 3. Case Study Parameter Ranges

                    Case Study One      Case Study Two        Case Study Three
Passband            312 Hz - 1000 Hz    19.5 kHz - 20.5 kHz   99.5 kHz - 100.5 kHz
Stopband 1          1 Hz - 312 Hz       1 Hz - 19.5 kHz       85 kHz - 99.5 kHz
Stopband 2          1000 Hz - 10 kHz    20.5 kHz - 25 kHz     100.5 kHz - 110 kHz
Central Frequency   656 Hz              20 kHz                100 kHz
5 Results and Comparison

Results for each case study, and for each population set, for the system level filter design problem are given in table 4, with the best result from each test listed and ranked by filter objective. The circuit models for test 4 of case study one (population 100 set) and for test 1 of case study two and test 5 of case study three (both population 20 sets) were converted to their mechanical equivalents, as shown in table 5, and for each resonator 'tank' these were used as objective targets for the design synthesis of a 2D layout resonator device. The filter responses for each of these are shown in figure 11, and the evolved 2D layout designs for these filters are shown in figure 12. In the case of the 2D layout design optimisation, results which had an error of less than 0.1% for each objective were extracted. In comparison with earlier work [12,13], the results presented here show this particular approach to be robust over a set of different case studies, where previous attempts focused only on one. In the course of solving each case study, the GAECM method provided comparable bandpass filter shapes at a relatively small number of functional evaluations compared to the state of the art [12,13]. Finally, the coupling of NSGAII with the NODAL platform SUGAR provided effective and fast design optimisation of the required 2D resonator layouts.

Table 4. Best results for each case study ranked by filter objective

Best Result Case Study 1: Population 100
Test  Filter Objective  Central Frequency Objective  Voltage  Tank Number
1     941.76            110                          112.5    2
2     953.40            86                           161.7    2
3     565.25            293                          66.4     3
4     478.65            24                           43.9     3
5     942.03            256                          159.7    2

Best Result Case Study 1: Population 20
Test  Filter Objective  Central Frequency Objective  Voltage  Tank Number
1     940.47            112                          1        2
2     1974.60           97                           32.70    2
3     476.76            240                          7.28     3
4     2130.29           0                            109.85   2
5     2130.30           1                            108.75   2

Best Result Case Study 2: Population 100
Test  Filter Objective  Central Frequency Objective  Voltage  Tank Number
1     1798.99           230                          84.3     3
2     2259.23           1250                         54.99    5
3     1990.79           30                           16.98    3
4     3085.71           50                           102.68   2
5     2422.73           190                          2.43     3

Best Result Case Study 2: Population 20
Test  Filter Objective  Central Frequency Objective  Voltage  Tank Number
1     988.58            100                          44.16    5
2     1293.24           260                          78.03    5
3     2998.03           10                           45.62    2
4     2095.91           150                          115.56   3
5     1048.50           210                          26.65    3

Best Result Case Study 3: Population 100
Test  Filter Objective  Central Frequency Objective  Voltage  Tank Number
1     1632.81           170                          86.87    6
2     2405.76           40                           31.78    2
3     2712.51           110                          169.61   2
4     1561.27           50                           152.39   2
5     2289.03           30                           197.81   5

Best Result Case Study 3: Population 20
Test  Filter Objective  Central Frequency Objective  Voltage  Tank Number
1     2319.79           40                           127.72   2
2     2181.26           30                           40.30    2
3     1672.20           10                           66.03    3
4     1628.61           20                           27.54    3
5     1304.11           190                          22.17    9
Table 5. Equivalent mass and stiffness (Kx) values for the best results of each case study (individual folded flexure resonator values)

         Quantity                     Case Study 1   Case Study 2   Case Study 3
Tank 1   Equivalent Mass (kg)         5.92e-9        2.34e-10       3.92e-10
         Equivalent Stiffness (N/m)   0.083          3.91           160.52
Tank 2   Equivalent Mass (kg)         4.78e-8        2.50e-10       4.15e-10
         Equivalent Stiffness (N/m)   0.073          3.24           159.72
Tank 3   Equivalent Mass (kg)         3.03e-8        2.67e-10       4.03e-10
         Equivalent Stiffness (N/m)   0.281          3.99           159.74
Tank 4   Equivalent Mass (kg)         -              2.77e-10       3.92e-10
         Equivalent Stiffness (N/m)   -              3.99           160.52
Tank 5   Equivalent Mass (kg)         -              2.26e-10       2.95e-10
         Equivalent Stiffness (N/m)   -              3.92           159.74
Tank 6   Equivalent Mass (kg)         -              -              3.90e-10
         Equivalent Stiffness (N/m)   -              -              159.74
Tank 7   Equivalent Mass (kg)         -              -              4.18e-10
         Equivalent Stiffness (N/m)   -              -              158.80
Tank 8   Equivalent Mass (kg)         -              -              4.11e-10
         Equivalent Stiffness (N/m)   -              -              160.52
Tank 9   Equivalent Mass (kg)         -              -              4.07e-10
         Equivalent Stiffness (N/m)   -              -              159.74
Fig. 11. Filter frequency response for the best result for case study one (a), case study two (b) and case study three (c), ranked by filter response objective
Fig. 12. Folded flexure resonator layout designs for best results from case studies one (a), two (b) and three (c)
6 Conclusions and Future Work

Moving towards a more multi-level approach to the design optimisation of MEMS will prove to be a challenging task. Presented here was a simple approach to coupling system and device level tools with the aim of designing and optimising a MEMS bandpass filter. This involved combining multiple disciplines from the electrical and mechanical domains, utilising separate circuit level modelling and analysis tools such as 'SPICE' together with the mechanical NODAL simulator 'SUGAR'. The new GAECM approach proved successful in evolving designs which gave comparable results to earlier work [12][13], but at a fraction of the cost, needing only 10,000 functional evaluations in comparison to 2.6 million with the GPBG approach. Also, by using the electrical-to-mechanical equivalent conversion method presented in [11], our designs were restricted to bounds which gave rise to feasible and realisable physical targets, unlike previous attempts. This allowed for the creation of filter designs which could be feasible and realisable in terms of fabrication of the resulting 2D layout designs. The design synthesis of the specific 2D folded flexure resonator devices was undertaken through the SUGAR platform, and designs were then evolved with the multi-objective genetic algorithm NSGAII to match the required targets found at the system level. By using NSGAII it is possible to undertake true multi-objective optimisation, and its integration at both system and device level makes the job of coupling the two levels together at a later date far easier than a separate genetic programming and GA approach. The use of a NODAL simulator proved successful in evolving designs that could match the target values
required, proving 100% successful in solving all designs with a 0.1% target error for each objective set. Also, the functional evaluations for each design stood at only 10,000, significantly fewer than the 137,500 of the current state-of-the-art approach [12,13]. The approach presented proved to be robust enough to handle bandpass filter design problems over a wide frequency range; topological search was facilitated by the changes introduced in the GAECM approach, as can be seen in table 5, with the 'cloning' of RCL tanks proving essential to both case studies 2 and 3. Overall, the novel approach proved to be around 260x faster in terms of required functional evaluations for the filter design problem at the system level, and around 14x as effective at the device level, when compared with the current state of the art [12,13]. Future work looks to expand this approach to include more levels of the MEMS design process, specifically the physical level. Here designers utilise finite element and boundary element models to accurately analyse and design MEMS devices at significant computational cost. Therefore any approach which can automate and hasten design optimisation at this level will be of great benefit.
References

[1] Fujita, H.: Two Decades of MEMS – from Surprise to Enterprise. In: Proceedings of MEMS, Kobe, Japan, pp. 21–25 (January 2007)
[2] Benkhelifa, E., Farnsworth, M., Tiwari, A., Bandi, G., Zhu, M.: Design and Optimisation of microelectromechanical systems: A review of the state-of-the-art. International Journal of Design Engineering 3(1), 41–76
[3] Hsu, T.R.: MEMS and Microsystems, 2nd edn. Wiley, Chichester (2008)
[4] Zhou, N., Agogino, A.M., Pister, K.S.: Automated Design Synthesis for Micro-Electro-Mechanical Systems (MEMS). In: Proceedings of the ASME Design Automation Conference, ASME CD ROM, Montreal, Canada, September 29-October 2 (2002)
[5] Kamalian, R.H., Takagi, H., Agogino, A.M.: Optimized Design of MEMS by Evolutionary Multi-objective Optimization with Interactive Evolutionary Computation. In: Proceedings of GECCO 2004 (Genetic and Evolutionary Computation Conference), Seattle, Washington, June 26-30 (2004) CD ROM
[6] Zhang, Y., Kamalian, R., Agogino, A.M., Séquin, C.H.: Design Synthesis of Microelectromechanical Systems Using Genetic Algorithms with Component-Based Genotype Representation. In: Proc. of GECCO 2006 (Genetic and Evolutionary Computation Conference), Seattle, July 8-12, vol. 1, pp. 731–738 (2006) ISBN 1-59593-187-2
[7] Haronain, D.: Maximizing microelectromechanical sensor and actuator sensitivity by optimizing geometry. Sensors and Actuators A 50, 223–236 (1995)
[8] Koza, J.R., Bennett III, F.H., Andre, D., Keane, M.A., Dunlap, F.: Automated Synthesis of Analog Electrical Circuits by Means of Genetic Programming. IEEE Transactions on Evolutionary Computation 1(2), 109–128 (1997)
[9] Lohn, J.D., Colombano, S.P.: A Circuit Representation Technique For Automated Circuit Design. IEEE Transactions on Evolutionary Computation 3(3), 205–219 (1999)
[10] Fan, Z., Hu, J., Seo, K., Goodman, E.D., Rosenberg, R.C., Zhang, B.: A Bond Graph Representation Approach for Automated Analog Filter Design
[11] Wang, K., Nguyen, C.T.-C.: High-Order Medium Frequency Micromechanical Electronic Filters. Journal of MicroElectroMechanical Systems 8(4), 534–556 (1999)
[12] Fan, Z., Seo, K.K., Hu, J., Rosenberg, R.C., Goodman, E.D.: System-level synthesis of MEMS via genetic programming and bond graphs. In: Cantú-Paz, E., Foster, J.A., Deb, K., Davis, L., Roy, R., O'Reilly, U.-M., Beyer, H.-G., Kendall, G., Wilson, S.W., Harman, M., Wegener, J., Dasgupta, D., Potter, M.A., Schultz, A., Dowsland, K.A., Jonoska, N., Miller, J., Standish, R.K. (eds.) GECCO 2003. LNCS, vol. 2724, pp. 2058–2071. Springer, Heidelberg (2003)
[13] Fan, Z., Wang, J., Achiche, S., Goodman, E., Rosenberg, R.: Structured synthesis of MEMS using evolutionary approaches. Applied Soft Computing 8, 579–589 (2008)
[14] Senturia, S.D.: Microsystem Design, 8th edn. Kluwer Academic Publishers, Dordrecht (2001) ISBN 0-7923-7246-8
[15] Benkhelifa, E., Farnsworth, M., Tiwari, A., Zhu, M.: An Integrated Framework for MEMS Design Optimisation using modeFrontier. In: EnginSoft International Conference 2009, CAE Technologies For Industry and ANSYS Italian Conference (2009)
[16] Benkhelifa, E., Farnsworth, M., Tiwari, A., Zhu, M.: Evolutionary Algorithms for Planar MEMS Design Optimisation: A Comparative Study. In: International Workshop on Nature Inspired Cooperative Strategies for Optimization, NICSO 2010 (to be published 2010)
[17] Deb, K., Agrawal, S., Pratap, A., Meyarivan, T.: A fast elitist non-dominated sorting genetic algorithm for multi-objective optimization: NSGA-II. In: Deb, K., Rudolph, G., Lutton, E., Merelo, J.J., Schoenauer, M., Schwefel, H.-P., Yao, X. (eds.) PPSN 2000. LNCS, vol. 1917, pp. 849–858. Springer, Heidelberg (2000)
From Binary to Continuous Gates – and Back Again

Matthias Bechmann¹, Angelika Sebald¹, and Susan Stepney²

¹ Department of Chemistry, University of York, YO10 5DD, UK
[email protected]
² Department of Computer Science, University of York, YO10 5DD, UK
Abstract. We describe how nuclear magnetic resonance (NMR) spectroscopy can serve as a substrate for the implementation of classical logic gates. The approach exploits the inherently continuous nature of the NMR parameter space. We show how simple continuous NAND gates with sin/sin and sin/sinc characteristics arise from the NMR parameter space. We use these simple continuous NAND gates as starting points to obtain optimised target NAND circuits with robust, error-tolerant properties. We use Cartesian Genetic Programming (CGP) as our optimisation tool. The various evolved circuits display patterns relating to the symmetry properties of the initial simple continuous gates. Other circuits, such as a robust XOR circuit built from simple NAND gates, are obtained using similar strategies. We briefly mention the possibility to include other target objective functions, for example other continuous functions. Simple continuous NAND gates with sin/sin characteristics are a good starting point for the creation of error-tolerant circuits whereas the more complicated sin/sinc gate characteristics offer potential for the implementation of complicated functions by choosing some straightforward, experimentally controllable parameters appropriately.
1 NMR and Binary Gates
Nuclear magnetic resonance (NMR) spectroscopy in conjunction with nonstandard computation usually comes to mind as a platform for the implementation of algorithms using quantum computation. Previously we have taken a different approach by exploring (some of) the options to use NMR spectroscopy for the implementation of classical computation [5]. We have demonstrated how logic gates can be implemented in various different ways by exploiting the spin dynamics of non-coupled nuclear spins in a range of solution-state NMR experiments. When dealing with spin systems composed of isolated nuclear spins, the underlying spin dynamics can be described conveniently by the properties of magnetisation vectors and their response to the action of radio-frequency (r.f.) pulses of different durations, phases, amplitudes and frequencies. Together with the integrated intensities and/or phases of the resulting NMR signals, this scenario provides a
Fig. 1. NOR gate implemented using NMR. a) NMR pulse sequence. b) Spectra corresponding to the four possible gate outputs where the integrated spectral intensity is mapped to logic outputs 0 and 1. c) Logic truth table mapping NMR parameters to gate inputs 0 and 1. (adapted from [5]).
rich parameter space and a correspondingly large degree of flexibility regarding choices of input and output parameters for the construction of logic gates. Fig. 1 shows an NMR implementation of a NOR gate, for illustration. The effects of r.f. pulses on a given nuclear spin system are fully under experimental control, and the response of the spin system is fully predictable with no approximations involved. An NMR experiment usually starts from the magnetisation vector in its equilibrium position: aligned with the direction of the external magnetic field (the z-direction in the laboratory frame). An r.f. pulse tips the magnetisation vector away from the z-direction. By choosing the duration, amplitude and frequency of the pulses appropriately, the tip of the magnetisation vector can be used to sample the entire sphere around its origin (Fig. 2).
Fig. 2. Magnetisation vector manipulation by r.f. pulses, e.g. rotation of magnetisation vector S from the z-direction to the −y-direction by a suitable r.f. pulse (a). Structure of an r.f. pulse displaying characterisation parameters for amplitude, frequency, duration and phase as possible gate input controls (b).
Our previous NMR implementations of logic gates [5] exploited special positions on this sphere, such as NMR spectra corresponding to the effects of 90°, 180°, or 45° pulses to create binary input/output values. We have demonstrated that there are many different ways for such implementations of conventional logic gates by slightly less conventional NMR implementations, including many
Fig. 3. 2D function graphs displaying influence of NMR parameters on the output of continuous NAND gates. a) Using the duration τp of the r.f. pulse and the duration of a preacquisition delay τd , resulting in sin dependence of both inputs. b) Using the resonance frequency offset ωp and the r.f. pulse duration τp , a sinc dependence for ωp and a sin dependence for τp is obtained. c) Comparison of experimental and theoretical result for a slice of sinc/sin NAND gate (in b) without mapping to the [0, 1] interval. This corresponds to the region in b) marked by the vertical bar in upper right corner. The deviation between experiment and simulation is always less than 0.5 percent.
different ways to define input and output parameters. There are many more possibilities for NMR implementations of conventional logic gates and circuits. Note that for these discrete logic gates a one-to-one mapping of the NMR parameter(s) to the binary state of the gate is possible in a straightforward manner. In this paper we concentrate on another aspect of NMR implementations of classic logic gates. Whereas previously our main focus was on the multitude of different options for implementing discrete logic gates and circuits by NMR, here we exploit another property of basic NMR experiments. Only a minute fraction of, for example, the space accessible to the magnetisation vector has so far been exploited for the construction of discrete logic gates. Now we lift this restriction and take advantage of the inherent continuous properties of our system and the natural computational power provided by the system itself [6]. The underlying continuous spin dynamics hereby provide the basis to the implementation of continuous logic operations. Compared to [5] this means we no longer restrict the inputs and outputs to be the discrete values 0 and 1, but allow them to be continuous values between 0 and 1.
2 Functions of NMR and Continuous Gates
Depending on the position of the magnetisation vector at the start of signal acquisition, the time-domain NMR signal is composed of sin and cos functions, with an exponentially decaying envelope (the so-called free induction decay, FID). Accordingly, trigonometric and exponential functions are two of the continuous functions inbuilt in any NMR experiment. Most commonly, NMR signals are represented in the frequency domain. Hence, Fourier transformation gives access to, for example, the sinc function ((sin x)/x) if applied to a truncated exponential decay. Fig. 3 illustrates this shift to continuous logic gates: we show the NMR implementation of NAND gates where the inputs have functional dependencies of sin/sin (Fig. 3a) and sin/sinc (Fig. 3b). Note how they have the same
digital NAND gate behaviours at the corners {0, 1} × {0, 1}, but very different behaviours in between. Fig. 3c shows experimental NMR data representing the sinc function used in Fig. 3b. Taking the step to continuous gates, the input/output mapping now applies to the [0, 1] interval and is not as trivial as it is for the discrete logic gates. However, the NMR input parameters and output functions are known in analytical form, giving access to boolean behaviour at the corners of the two-dimensional parameter space, and continuous transitions in between. The digital NAND gate is universal. Here we relax the constraints on the inputs, to form our continuous NAND gates. These continuous gates can serve as starting points for the optimisation of certain properties of the NAND gate itself or, alternatively, for the optimisation of circuits based on NAND gates. We show how to obtain robust NAND gates (ones that still function as digital NAND gates, even if the inputs have considerable errors), by evolving circuits of the continuous single NAND gates with sin/sin (Fig. 3a) and sin/sinc characteristics (Fig. 3b). Then we evolve circuits for a robust XOR gate, constructed from continuous simple NAND gates. Finally, we briefly address the topic of more general continuous gates based on different functions [2] and how the naturally occurring continuous NMR functions may be exploited in such circumstances. Our optimisation tool is Cartesian Genetic Programming (CGP) [3].
3 Evolving Robust Continuous Gates and Circuits

3.1 Continuous NAND Gate with sin/sin Characteristics
This continuous gate is based on the NMR parameters τp (pulse duration) and τd (preacquisition delay) (see Figs. 2b and 3a). It involves the following mapping of the NMR input parameters In1 and In2:

In1 = τp / τp90 ;   In2 = 1 − τd / τd90 ;   In1, In2 ∈ [0, 1]    (1)

where τp90 corresponds to a pulse duration causing a 90° flip of the magnetisation vector and τd90 is the duration of a preacquisition delay causing a 90° phase shift of the magnetisation vector in the xy-plane. The output of the simple sin/sin NAND gate implemented by the NMR experiment is then

Out = 1 − sin(π/2 · In1) · sin(π/2 · In2)    (2)

Our target robust NAND gate is shown in Fig. 4a. It is a continuous gate, with discrete state areas which, accordingly, should represent an error-tolerant, robust gate. The sampling points used to define the fitness function for evolving this robust gate are shown in Fig. 4b. The fitness function f defined over these N sampling points is

f = Σ_{i=1..N} 1 / (1 + |Out_i^evo − Out_i^target|)    (3)
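A small Python rendering of Eqs. (1)-(3) is given below; it is our own illustrative sketch, not the code used in this work, and the sampling points shown are simply the four corners rather than the 16 points of Fig. 4b.

```python
import math

def nand_sin_sin(in1, in2):
    """Continuous sin/sin NAND gate, Eq. (2)."""
    return 1.0 - math.sin(math.pi / 2 * in1) * math.sin(math.pi / 2 * in2)

def fitness(circuit, samples):
    """Eq. (3): sum of 1/(1 + |error|) over the sampling points."""
    return sum(1.0 / (1.0 + abs(circuit(a, b) - target))
               for a, b, target in samples)

# Digital NAND behaviour is recovered at the corners of the input space.
corner_samples = [(0, 0, 1), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
print(fitness(nand_sin_sin, corner_samples))   # maximal (= 4) for a perfect gate
```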
Fig. 4. a) Target robust NAND gate with discrete state areas. This is robust to errors in the inputs, yielding a correct digital NAND gate for inputs rounded to 0 or 1. b) Sampling points used in the fitness function to evolve the robust NAND gate.
Fig. 5. a) Functional behaviour of the array of nine continuous sin/sin NAND gates. b) Optimisation result being a linear array of nine continuous sin/sin NAND gates.
The evolved robust NAND gate is shown in Fig. 5 (see Sect. 7 for the CGP parameters used). It displays the desired feature of well-defined, discrete state areas. The behaviour towards the centre differs from Fig. 4a, but provides no contribution to the fitness function. The evolved circuit for the robust NAND gate is a linear array of nine simple NAND gates (Fig. 5b). With increasing lengths of the NAND-gate chains, the resulting circuit for the robust gate becomes fitter. Odd-length chains converge to the robust NAND gate behaviour, whereas even-length chains converge toward a corresponding robust AND gate. This is illustrated in Fig. 6. The first simple NAND gate in the chain performs the NAND operation; all the remaining gates, with their paired inputs, act as simple NOT gates. The increasing-length chain converges to fitter circuits because of the S-shaped (1 − sin²(π/2 · x)) form of the sin/sin gate along its x = y diagonal: any value passing through a pair of simple NOT gates moves closer to being 0 or 1, and so converges to 0 or 1 as the chain of simple NOT gates lengthens. The maximum displacement of points by a single NOT gate operation towards 0 or 1 is ≈ 0.11. This can be interpreted as a threshold for the convergence and stability of the array. Random fluctuations added numerically to every gate output in the range of [±0.1] do not hinder the convergence of the array (Fig. 6, last column). For rather large error values (> 0.2) the arrays tend to destabilise, especially for longer arrays.
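The chaining argument can be checked numerically with a few lines of Python; the sketch below is our own illustration, feeding noisy corner inputs through a chain of paired-input NAND gates and showing the output being pushed towards the digital value.

```python
import math

def nand(a, b):
    """Continuous sin/sin NAND gate."""
    return 1.0 - math.sin(math.pi / 2 * a) * math.sin(math.pi / 2 * b)

def nand_chain(in1, in2, length):
    """First gate computes NAND; the remaining gates, fed paired inputs, act as NOTs."""
    value = nand(in1, in2)
    for _ in range(length - 1):
        value = nand(value, value)
    return value

# Noisy versions of the digital inputs (1, 0): longer odd-length chains push
# the output ever closer to the digital NAND result 1.
for n in (1, 3, 9):
    print(n, round(nand_chain(0.9, 0.15, n), 4))
```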
Fig. 6. Convergence of theoretical NAND gate arrays. Odd-numbered arrays converge toward target NAND gate (top row), even-numbered arrays (bottom row) converge toward a corresponding AND gate. The final circuit in each row displays the stability of the array convergence under erroneous signal transduction between gates, assuming random fluctuations in the range of [±0.1].
There are two possible sources of experimental imperfection, and therefore of imperfect gate behaviour: the accuracy with which the experimental NMR parameters (ωp, τp, . . .) can be executed by the NMR hardware, and the accuracy with which the NMR spectra can be acquired and analysed (integrated in this case). A comparison shows that the fluctuations caused by the measurement and analog-digital conversion are by far the dominating factors (e.g. pulses used were of duration 2.5 ms ±50 ns [1], while fluctuations in signal intensity were < ±0.5%).

3.2 Continuous NAND Gate with sin/sinc Characteristics
We now consider circuits based on the continuous simple sin/sinc NAND gate (Fig. 3b), again aiming for the target robust NAND gate with discrete state areas (Fig. 4a). Here the mapping of the NMR parameters ωp (r.f. pulse frequency offset) and τp (r.f. pulse duration) is the following:

In1 = τp / τp90 ;   In2 = 1 − ωp / ωpmax    (4)

where ωpmax is the maximum allowed r.f. frequency offset (minimum of the sinc function). The output of the simple sin/sinc NAND gate implemented by the NMR experiment is then

Out = 1 − (|κp90| / ωeff²) · √( κp90² sin²(ωeff τp) + 2 ωp² (1 − cos(ωeff τp)) )    (5)
where ωeff = √(ωp² + κp90²), assuming a perfect π/2 magnetisation flip for an on-resonance r.f. pulse of amplitude and duration κp90 and τp90 respectively. The continuous sin/sinc NAND gate is a more complicated situation because it does not display symmetry along the diagonal, in contrast to the sin/sin NAND gate. We approach the evolution of a robust NAND circuit based on simple sin/sinc NAND gates in a step-wise manner.
Gate Confined to Include only the First Minimum of the sinc Function. To start with, we use a simple sin/sinc NAND gate confined to include only the first minimum of the sinc function (Fig. 7a).
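For reference, the sin/sinc gate of Eqs. (4)-(5) can be sketched in Python as below; the numerical values are placeholders of ours, and the choice of ωpmax (which sets how many sinc minima are included) is only indicated schematically.

```python
import math

def nand_sin_sinc(in1, in2, omega_p_max, kappa_p90, tau_p90):
    """Continuous sin/sinc NAND gate, Eqs. (4)-(5)."""
    tau_p = in1 * tau_p90                      # invert Eq. (4)
    omega_p = (1.0 - in2) * omega_p_max
    omega_eff = math.sqrt(omega_p ** 2 + kappa_p90 ** 2)
    mag = (abs(kappa_p90) / omega_eff ** 2) * math.sqrt(
        kappa_p90 ** 2 * math.sin(omega_eff * tau_p) ** 2
        + 2.0 * omega_p ** 2 * (1.0 - math.cos(omega_eff * tau_p)))
    return 1.0 - mag

# Placeholder values: a 90 degree flip on resonance (kappa_p90 * tau_p90 = pi/2)
tau_p90 = 2.5e-3
kappa_p90 = (math.pi / 2) / tau_p90
print(nand_sin_sinc(1.0, 1.0, omega_p_max=5 * kappa_p90,
                    kappa_p90=kappa_p90, tau_p90=tau_p90))   # digital NAND(1,1) = 0
```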
Fig. 7. a) Initial simple sin/sinc NAND gate with one minimum included. b) CGP evolved result. c) Array of nine simple continuous sin/sinc NAND gates.
Fig. 7b shows the CGP evolved result, a robust arrangement of discrete state areas. The evolved circuit shown at the top of Fig. 7b is more complicated than the linear chain of NAND gates previously found in the circuit based on simple sin/sin NAND gates. If we build such a linear circuit from simple sin/sinc NAND gates we do find an acceptable solution (Fig. 7c), but with slightly poorer fitness. Despite the loss of symmetry of our sin/sinc starting NAND gate, repeated application of linear chains of increasing lengths still converges to the desired behaviour (Fig. 8). Gate Confined to Include the Second Minimum of the sinc Function. Next, we use a simple sin/sinc NAND gate confined to include the first two minima of the sinc function (Fig. 9a). Again, we compare the result of a CGP evolution (Fig. 9b) and the result of applying the linear array of nine simple sin/sinc NAND gates (Fig. 9c). CGP is successful in finding a solution which is fairly well optimised around the 16 sampling points (Fig. 4b), but the areas in between now display less obvious and more complicated characteristics. The linear chain of nine simple sin/sinc NAND gates is here slightly less successful finding a good solution at and around the sampling points, but a pattern relating to the number of minima in the starting gate is emerging. With only one minimum included, there are essentially just two levels in the contour
Fig. 8. Convergence of one-minimum sin/sinc NAND gate chains for increasing (oddnumbered) chain length.
Fig. 9. a) Initial simple sin/sinc NAND gate with two minima included. b) CGP evolved result. c) Array of nine simple continuous sin/sinc NAND gates.
plot (Fig. 7c). Now, with two minima included, we find three distinct levels (around 0, around 0.5, and around 1; see Fig. 9c), separated from each other by steep steps. Fig. 10 shows the results of repeated application of linear arrays of simple sin/sinc NAND gates of increasing length. One can see how for the application of longer chains the terraced structure and step functions converge.
Fig. 10. Convergence of two-minima sin/sinc NAND gate chains for increasing (oddnumbered) chain length.
Fig. 11. a) Initial simple sin/sinc NAND gate with three minima included. b) CGP evolved result. c) Array of nine simple continuous sin/sinc NAND gates.
Gate Confined to Include the Third Minimum of the sinc Function. Fig. 11 summarises the results when we include three minima of the sinc function in our starting sin/sinc NAND gate. CGP again evolves a solution which is optimised around all 16 sampling points (Fig. 11a), but with even more complicated behaviour in between. The (unevolved) linear chain of sin/sinc NAND gates now creates four distinct levels and an overall stepped structure, but is less fit with respect to the fitness function sampling points of Fig. 4b. From these results, we can see that continuous simple sin/sinc NAND gate can act as a good starting point for the implementation of a variety of complicated functions, simply by choosing the number of minima included appropriately for the starting continuous gate, and by defining a suitable number of sampling points.
4 Evolving XOR Circuits Using NAND Gates
Here we briefly demonstrate that the strategies used for evolving robust NAND circuits can also be used to obtain circuits with other functionality built from simple NAND gates. We use the continuous simple sin/sin NAND gate (Fig. 3a) as the starting point. Our target circuit is a robust XOR gate with discrete state areas (Fig. 12a), with the same 16 sampling points as before. An XOR gate constructed from simple sin/sin NAND gates (the grey region of Fig. 12b) gives the continuous behaviour shown in Fig. 13a. If this is followed by our previously discovered strategy of a chain of simple NAND gates (Fig. 12b), we get the result shown in Fig. 13b: a robust XOR gate. If we use CGP to evolve a solution from scratch, we get the more complicated circuit shown in Fig. 12c, with fitter continuous behaviour (Fig. 13c). Note that evolution here rediscovers the chaining strategy, and applies it to the final part of the circuit.
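The textbook four-NAND construction of XOR, followed by a sharpening chain of paired-input NAND gates, can be written out as below; this is our own illustration of the idea using the continuous sin/sin gate, not the circuit of Fig. 12.

```python
import math

def nand(a, b):
    """Continuous sin/sin NAND gate."""
    return 1.0 - math.sin(math.pi / 2 * a) * math.sin(math.pi / 2 * b)

def xor_from_nands(a, b, sharpen=8):
    """Classic 4-NAND XOR, then an even-length NAND chain to push towards 0/1."""
    t = nand(a, b)
    out = nand(nand(a, t), nand(b, t))
    for _ in range(sharpen):                  # paired-input gates act as NOT
        out = nand(out, out)
    return out

# Noisy corner inputs are mapped close to the digital XOR outputs 0, 1, 1, 0.
for a, b in [(0.05, 0.1), (0.05, 0.9), (0.95, 0.1), (0.9, 0.95)]:
    print(a, b, round(xor_from_nands(a, b), 3))
```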
Fig. 12. a) The target XOR gate with discrete state areas. b) Applying the NAND-gate chain approach for optimisation. c) CGP evolved circuit.
Fig. 13. a) XOR gate built from continuous NAND gates without optimisation. b) Result of NAND-gate chain approach. c) CGP evolved XOR gate.
5 Truly Continuous Gates
So far we have been using the continuous behaviour of the simple gates to implement robust, but still essentially digital, gates. In this section we use a different fitness function to evolve circuits with interesting truly continuous behaviour. We can make boolean logic continuous on the interval [0,1] by defining AND(a, b) = min(a, b) and NOT(a) = 1 − a (see [2]). These have the digital behaviour at the extreme values. Then NAND = 1 − min(a, b) (Fig. 14a). We start from the continuous simple sin/sin NAND gate (Fig. 14b). At first glance this seems to be a more straightforward optimisation task than for the robust gates, given that both the starting gate and the target function are continuous in nature, with a similar initial structure. Here we take a fitness function sampled over more points in the space, using a regular grid of 6 × 6 points. The evolved result is shown in Fig. 14c, together with the corresponding, rather elaborate, circuit. Here the more complex circuit yields only modest
Fig. 14. a) The target NAND gate where NAND = 1 − min(a,b). b) The initial simple sin/sin NAND gate. c) The CGP evolved gate. d) The CGP evolved circuit (5% mutation rate, population size 500, best fitness 35.25, 10000 generations).
Fig. 15. Stability and error propagation through CGP evolved gate in Fig. 14c: with random error (a) [±0.5%]; (b) [±1%]; (c) [ ±10%]
improvements over the simple gate, with agreement between target and evolved function improving by about a factor 2 over that of the single simple sin/sin NAND gate. In particular, the evolved circuit does not really help to improve agreement with the most prominent feature of the target function, the sharp diagonal ridge. More work is needed to match the natural properties provided by the NMR system with the desired properties of the continuous gates. Fig. 15 shows the truly error tolerant behaviour of the CGP evolved gate in Fig. 14c.
6 Conclusions and Next Steps
CGP has proved effective at evolving specific continuous circuits from the continuous simple NAND gates provided by our NMR approach. In particular, the simple sin/sinc gates can provide a rich set of discretised behaviours. In these experiments, neither the robust gates nor the truly continuous gates are inspired by the natural properties of the NMR system, but rather by mathematical abstractions. Next steps will involve investigating and exploiting what the simple NAND gates "naturally" provide.
7 Experimental Setup
Evolutionary Setup. We use a modified version of the CGP code of [4]. Our setup uses a linear topology of 60 nodes plus input and output nodes with the maximum number of level-back connections. Optimum results used between nine and 33 nodes. The mutation rate during evolution was varied between 0.5% and 50%, where rates between 5% and 10% performed best. Populations of 50/500 were evolved for 10000 generations. Results presented are the best of 10 evolutionary runs.
NMR Spectroscopy. ¹H NMR spectra of 99.8% deuterated CHCl3 (Aldrich Chemicals) were recorded on a Bruker Avance 600 NMR spectrometer, corresponding to a ¹H Larmor frequency of −600.13 MHz. On-resonant 90° pulse durations were 2.5 ms and recycle delays 3 s. Hardware limitations [1]: duration of r.f. pulses accurate to ±50 ns; pulse rise and fall times 5 ns and 4 ns respectively; pulse amplitude switched in 50 ns with a resolution of 0.1 dB; phases are accurate to ±0.006 degree and switched < 300 ns; r.f. range is 3–1100 MHz with a stability of 3·10⁻⁹/day and 1·10⁻⁸/year and a resolution of < 0.005 Hz. Frequency switching is < 300 ns for 2.5 MHz steps and < 2 μs otherwise. The main source of experimental errors is integration error due to limited digitisation resolution, 0.5% maximum.
Acknowledgements We gratefully acknowledge the Leverhulme Trust for supporting this work. We thank Shubham Gupta, IIT Mumbai, India, for his cooperation in the initial stages of this work, supported by the TRANSIT project (EPSRC grant EP/F032749/1), and John Clark, York for continued discussions and comments.
References

1. Butler, E.: NMR hardware user guide version 001. Tech. rep., Bruker Biospin GmbH, Rheinstetten, Germany (2005)
2. Levin, V.: Continuous logic – I. Basic concepts. Kybernetes 29(16), 1234–1249 (2000)
3. Miller, J.F., Thomson, P.: Cartesian genetic programming. In: Poli, R., Banzhaf, W., Langdon, W.B., Miller, J., Nordin, P., Fogarty, T.C. (eds.) EuroGP 2000. LNCS, vol. 1802, pp. 121–132. Springer, Heidelberg (2000)
4. Miller, J.: Cartesian genetic programming source code (July 2009), http://sites.google.com/site/millerjules/professional
5. Roselló-Merino, M., Bechmann, M., Sebald, A., Stepney, S.: Classical computing in nuclear magnetic resonance. Int. J. of Unconventional Computing 6(3–4) (2010)
6. Stepney, S.: The neglected pillar of material computation. Physica D: Nonlinear Phenomena 237(9), 1157–1164 (2008)
Adaptive vs. Self-adaptive Parameters for Evolving Quantum Circuits

Cristian Ruican, Mihai Udrescu, Lucian Prodan, and Mircea Vladutiu

Advanced Computing Systems and Architectures Laboratory
University “Politehnica” Timisoara, 2 V. Parvan Blvd., Timisoara 300223, Romania
{crys,mudrescu,lprodan,mvlad}@cs.upt.ro
http://www.acsa.upt.ro
Abstract. Setting the values of various parameters for an evolutionary algorithm is essential for its good performance. This paper discusses two optimization strategies that may be used on a conventional Genetic Algorithm to evolve quantum circuits: adaptive (parameters initial values are set before actually running the algorithm) or self-adaptive (parameters change at runtime). The differences between these approaches are investigated, with the focus being put on algorithm performance in terms of evolution time. When taking into consideration the runtime as main target, the performed experiments show that the adaptive behavior (tuning) is more effective for quantum circuit synthesis as opposed to self-adaptive (control). This research provides an answer to whether an evolutionary algorithm applied to quantum circuit synthesis may be more effective when automatic parameter adjustments are made during evolution.
1 Introduction
The continuous pursuit of performance pushes the exploration of new computing paradigms. The acquired experience from classical computation is considerable, as it has been developed over more than half a century, whereas for quantum computing the race started relatively recently, in the 1980s. Even from today's perspective, it cannot be foreseen exactly whether quantum computers will become physically feasible in the next decade. Evolutionary search has already been applied to quantum circuit synthesis, with the focus being on the analysis of the genetic operators and their corresponding performance. The task of implementing the Meta-Heuristic approach on Quantum Circuit Synthesis (MH-QCS) makes use of the ProGA [5] framework, which provides all the necessary support for developing genetic algorithms. Our ProGA framework underpins a robust and optimized environment, its architecture being extended to handle the additional statistical information. The statistical data is processed on-the-fly by the adaptive algorithm and the results are used for adjusting the genetic operators' rates during run-time. We focus on genetic algorithm parameter control by involving statistical information taken from the current state of the search in the algorithm's decisions. Our experiments reveal a higher convergence rate for the genetic evolution and therefore an important runtime
speedup is achieved by using adaptive parameter tuning, as opposed to the self-adaptive parameter tuning approach. The automatic synthesis of a quantum circuit, for a given function, is not an easily achievable task [16][17][18]; in order to solve this problem the genetic algorithm will evolve a possible solution that will be evaluated against other, previously obtained, solutions, and eventually a close-to-optimal solution will be indicated. It is hard, if not impossible, to guess the values used for the tuning of the genetic algorithm, because even a small change in the circuit topology will generate a different quantum logic function; this is the main motivation for adopting an adaptive genetic algorithm.
2 Background
Quantum computation is computation made with coherent atomic scale dynamics. A quantum computer is a physical device able to perform computation driven by quantum mechanical phenomena, such as entanglement and superposition of basis states. For the classical computer, the unit of information is the bit, whereas in quantum computation its counterpart is the so-called qubit. A quantum bit may be represented by using a spin-1/2 particle. For example, a spin-down |↓⟩ and a spin-up |↑⟩ may be used to represent the binary information encoded as |0⟩ and |1⟩. In Bra-Ket notation, a qubit is a normalized vector in a two dimensional Hilbert space, |ψ⟩ = α|0⟩ + β|1⟩, |α|² + |β|² = 1 (α, β ∈ C), where |0⟩ and |1⟩ are the superposed basis states [9]. Genetic Algorithms (GA) are adaptive heuristic search algorithms based on evolutionary ideas of natural selection used to find solutions for optimization and search problems. The new field of Evolvable Quantum Information (EQI) has been established as the merging of quantum computation and evolvable computation [8]. The problem of setting values for different control parameters is crucial in the context of algorithm performance. Each GA parameter is responsible for controlling the evolution path towards the solution. There are two major forms of setting the parameter values for a genetic algorithm [15]:
– Parameter tuning: the parameter values are fixed before the algorithm run and remain as such during run-time. There are several disadvantages to tuning: finding good parameters before the run may be time consuming, and it is possible not to obtain optimal values for all the phases.
– Parameter control: the initial parameter values are changed during the algorithm run, keeping the dynamic spirit of evolution. The adaptation algorithm uses the feedback values from the process to adjust the parameters for better performance.
As presented in Figure 1, the upper part of the hierarchy contains a method that aims at finding optimal parameters for the GA, while the lower part is dedicated to possible problem solutions on the application layer. We use the same approach of splitting the design into several layers. Thus, the quantum
Fig. 1. The 3-layered hierarchy of parameter tuning [15]: (a) control flow; (b) information flow
circuit synthesis genetic algorithm will run in the application layer, while the algorithm responsible for the dynamic adjustment of the operators will run in the design layer.
3 Search Methodology
Evolutionary algorithms relate to probability theory, which is essential for the quantitative analysis of large sets of data, having as a starting point the evolution of any random variable (i.e. representation types, selection methods, different operators used, etc.; as opposed to the selection methods defined over natural values, the operators are in a continuous space). Consider (Ω, S, P) a probability field, where Ω is the set of elementary events, S is the event space and P is a probability measure; then a random variable over Ω is a mapping X: Ω → R characterised through sets of the form

{ω | X(ω) < x}    (1)

where any such subset of Ω is a part of S and x is a real number. We can define the probability measure for x:

P(X < x) = P{ω | X(ω) < x}    (2)
Algorithm convergence is reducible to convergence in probability, which can be demonstrated by using probability values. It is considered that the evolutionary algorithms exhibit increased robustness (they work well on different data sets) largely due to the optimization functions, where the performance function (fitness) is always followed by the optimization function (metaheuristics). This way, evolutionary algorithms provide better results in comparison with other approaches (i.e. gradient type methods). If we consider X as being the solution space (a set of any individual solution states), then each individual is represented by an element from X; f: X → R. Our purpose is to identify max_{x∈X} f, where x is a vector of decision variables that satisfies f(x) = f(x1, ..., xn). The individual fitness is evaluated using a performance function defined as:
eval(x) = f(x) + W × penalty(x)    (3)

where

f = function(evolved circuit) / function(initial circuit)    (4)

and

penalty = 1 − (number of evolved gates − number of initial gates) / (number of initial gates)    (5)
352
C. Ruican et al.
Fig. 2. System’s provided levels
First, a direct relationship between population and adaptive components is present because additional statistical information from the current generation is necessary for parameter adjustment (the decision will later be taken when enough statistical information become available from previous generations). Second, when self-adjustment is used a relationship between the chromosome and the self-adjustment component will be created (the decision is taken by each individual on the applied operator). The quantum circuit representation is crucial for chromosome encoding. Following Nature, where a chromosome is composed of genes, in our chromosome the genes represent circuit sections. This way, we are able to encode a circuit within a chromosome [4], and therefore represent a possible candidate solution (as presented in Fig.3a). A gene will store the specific characteristics of a particular circuit section and genetic operators will be applied either at the gene level or inside the gene. The genome representation is an array of quantum gates that are chosen randomly from a given set, with the only constraint that a quantum gate cannot be split in two genes. The initialization is performed once (at start-up), and is responsible with the genome creation (see Fig. 3b). A gene stores the specific characteristic of a particular quantum circuit section where the mutation operator has the role of producing a change, hence allowing the search algorithm to explore new spaces. The crossover operator will select gates from parents to create offsprings, by copying their contents and properties. 3.1
3.1 Static GA Operators
Parameter tuning is one of the approaches used for optimization problems. The parameter values are set statically before the algorithm run, followed by an evaluation of the results. Tuning becomes complicated when a large number of parameters needs to be adjusted. Thus, “considering four parameters and five values for each of them, one has to test 5^4 = 625 different setups. Performing 100 independent runs with each setup, this implies 62,500 runs just to establish a good algorithm design” [1]. Algorithm parameters are usually not independent, and testing each possible combination proves practically impossible in many cases, while certainly being extremely time-consuming.
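The quoted full-factorial tuning cost can be checked directly:

# Full-factorial cost of static parameter tuning, as quoted from [1].
parameters, values_per_parameter, runs_per_setup = 4, 5, 100
setups = values_per_parameter ** parameters        # 5**4 = 625 setups
total_runs = setups * runs_per_setup                # 62,500 independent runs
print(setups, total_runs)                           # 625 62500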
Fig. 3. Chromosome Encoding (a) and Chromosome Initialization (b)
3.2 Adapting GA Operators
Adaptive methods make use of additional information from the current state of the search. This statistical information is then used by the adaptive component to adjust the algorithm operators. In contrast to static adjustment, for example, large mutation steps are necessary in early generations for good exploration of the problem search space, while only small mutation steps are needed in later generations to home in on the optimal solution. From the meta-heuristic point of view, genetic algorithms are considered to contain all the information necessary for adaptive behavior. Nevertheless, the adaptive behavior improves the circuit synthesis algorithm from the user's point of view, since the manual setting of parameters is far from a trivial task.

Two types of statistical data are used as input for the adaptive algorithm. The first is the fitness results for each population, corresponding to the best, mean and worst chromosomes. The second is the operator performance (see Fig. 4). In reference [6], the performance records are considered essential for deciding on the operator reward. Functions such as Maximum, Minimum, Average and Standard Deviation may be applied to any kind of statistical data. For each generation the maximum, average and minimum fitness values are provided by the genetic algorithm framework and stored within the statistical data, and after each generation the operator performance is updated from this data. Following Rechenberg's 1/5 rule [3], the analysis of the acquired data is performed every 5 generations. The operator reward is updated according to the formula given in Eq. 6. When the genetic evolution is finished (i.e. when a solution has been evolved), other statistical functions are computed.
Fig. 4. Adaptive information flow
Thus, we defined statistical functions on each generation and statistical functions over all generations.

σ(op) = α · Absolute + β · Relative − γ · InRange − δ · Worse     (6)
where the parameters α, β, γ and δ are introduced to rank the operator performance; they are not adjusted during the evolution of the algorithm. In our experiments we used the following values: α = 20, β = 5, γ = 1 and δ = 1.
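Eq. (6) transcribed directly into Python, using the weights reported above; the reading of Absolute, Relative, InRange and Worse as per-operator performance counters follows our interpretation of [6] rather than an explicit definition in the text.

def operator_reward(absolute, relative, in_range, worse,
                    alpha=20, beta=5, gamma=1, delta=1):
    """Operator reward sigma(op) from Eq. (6):
    sigma(op) = alpha*Absolute + beta*Relative - gamma*InRange - delta*Worse.
    The four arguments are assumed to be the operator's performance records
    gathered over the last generations; the weights are the experimental values."""
    return alpha * absolute + beta * relative - gamma * in_range - delta * worse

# e.g. an operator that produced 2 new best individuals, 3 improvements over the
# parents, 4 offspring inside the known fitness range and 1 worse offspring:
print(operator_reward(absolute=2, relative=3, in_range=4, worse=1))   # 50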
3.3 Self-adapting GA Operators
An important view on optimization problems is emphasized by the “No Free Lunch” theorem, which states that any additional performance over one class of problems is exactly paid for in terms of performance over another class [2]. The self-adaptive parameter control algorithm mitigates this limitation through continuous adjustment of the operator probabilities, which evolve together with the algorithm. The dynamic customization of the parameters can properly handle the objective function, the encoding and the constraints. This approach leads to a flexible genetic algorithm, where the tuning is performed automatically during the genetic evolution. An evolutionary computation is considered self-adaptive when it evolves the new values of its adaptive parameters itself. The goal is to dynamically adjust the parameter values so as to bias the evolution of the offspring (i.e. to increase the algorithm's convergence). Following this approach, the chromosome stores additional information about the success rate of the applied operators within the self-adaptive component (see Fig. 5). The success rate is defined in the same way as the performance records of the adaptive approach, and it is used to identify the better operator at the chromosome level. In the adaptive approach the performance records are saved at the population level, whereas in the self-adaptive approach they are saved at the chromosome level. The decision component compares the success rate values of the two GA operators (mutation and crossover) and then decides which operator has the better chance of creating a better offspring. Compared with the adaptive behavior, where the adjustment is made only every 5 generations (and decided at the population
Fig. 5. Self-Adaptive information flow
level), in the self-adaptive approach each chromosome decides, based on its own success rates, which GA operator is applied (there is no probability involved). Even if the GA operator parameters are now removed from the equation, a GA contains other parameters that still need to be adjusted manually (e.g. the population size). A small number of individuals gives a quick ramp-up towards a solution at start-up, but it is possible that no solution is evolved later; if the number of individuals is too high, then evaluating each generation takes a long time. In this paper, evolution is considered optimal when a solution is evolved faster.
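A minimal Python sketch of the per-chromosome decision just described; the tie-breaking rule and the form of the success records are assumptions, since the paper does not specify them.

import random

def choose_operator(success_rates):
    """Per-chromosome operator decision in the self-adaptive approach:
    the chromosome's own success records for mutation and crossover are compared
    and the operator with the better record is applied (no probability involved).
    Ties are broken at random here; the tie-breaking rule is our assumption."""
    if success_rates["mutation"] > success_rates["crossover"]:
        return "mutation"
    if success_rates["crossover"] > success_rates["mutation"]:
        return "crossover"
    return random.choice(["mutation", "crossover"])

print(choose_operator({"mutation": 3, "crossover": 5}))   # crossover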
4 Evaluating Quantum Circuit Evolution

4.1 Experimental Platform
The experiments were conducted on a computer with the following configuration: Intel Core 2 Duo processor at 2 GHz, 4 GB RAM and OpenSuSE 11.2 as operating system. In order to avoid lucky guesses, the experiments were repeated 10 times, the average result being used for comparison in the provided graphics. To measure the performance of an application, it is common to measure the time spent until a solution is evolved. Because the results may appear within a short period of time, a fine granularity for time measurement was necessary. We used the RDTSC (Read Time-Stamp Counter) instruction to count processor ticks, which provides high-resolution timing information. The tick count is independent of the processor platform and accurately measures short-duration events (although on laptops or systems supporting Intel SpeedStep technology the processor frequency will change with CPU utilization when running on batteries). To estimate the time duration, the number of ticks is divided by the processor frequency. Each case study starts from a benchmark quantum circuit (see Figure 6) that is used for the evaluation of the synthesis algorithm. For each benchmark the name of the circuit is given along with its number of qubits (for diversity, we performed the evaluation on three-qubit, four-qubit and five-qubit circuits).
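A minimal sketch of the tick-to-time conversion just mentioned, using the 2 GHz nominal frequency of the test machine; the example tick count is the ham3 adaptive MT value from Table 2.

def ticks_to_seconds(ticks, cpu_frequency_hz):
    """Convert an RDTSC tick count into a duration by dividing by the
    nominal processor frequency (2 GHz in these experiments)."""
    return ticks / cpu_frequency_hz

print(ticks_to_seconds(5.22e8, 2.0e9))   # ~0.26 s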
Fig. 6. Benchmark circuits used for analysis [7]: (a) ham3, (b) rd32, (c) xor5

Table 1. Configuration for Experiments

Configuration Parameter        Adaptive          Self-Adaptive
GA type                        Non-Overlapping   Non-Overlapping
Population size                150               150
Generations                    100               100
Mutation type                  Multiple          Multiple
Crossover type                 Two points        Two points
Selector type                  Roulette Wheel    Roulette Wheel
Elitism percent                10                10
Mutation probability           0.03              NA
Crossover probability          0.3               NA
Adaptive increase/decrease     0.1/0.1           NA
The following configuration (see Table 1) is used to evolve synthesis solutions, the mutation and crossover probabilities being adjusted during the evolution by the adaptive or self-adaptive algorithm. The experimental results are presented in tables (see Table 2); the tests and the software source code are made available on the personal web site [13].

Table 2. Experimental Results

Parameter    ham3                        rd32                        xor5
             Adaptive   Self-Adaptive    Adaptive   Self-Adaptive    Adaptive   Self-Adaptive
MPR          70.00      33.67            64.67      66.00            64.67      57.29
MBF          96.25      96.25            97.81      97.81            97.81      99.38
MT           5.22E+08   1.06E+09         9.03E+08   1.09E+10         9.03E+08   6.61E+10
S            4          6                3          3                3          7
4.2 Comparison Analysis
During the evaluation of the experiments, configurable variables were used to measure and control the application results. The data analysis correlates the adjusted (adaptive or self-adaptive) operators with the algorithm results. The algorithm was tested on different quantum circuits of increasing difficulty, obtained by increasing the number of circuit qubits. Four factors are examined in our experiments (see Table 2):

– MPR (Mean Percentage Runs): average number of generations until a solution is evolved, over all runs
– MBF (Mean Best Fitness): average of the best fitness in the last population, over all runs
– MT (Mean Time): average number of executed ticks until a solution is evolved (measured within the current generation)
– S (Solutions): number of evolved solutions

Figure 7 contains a detailed comparison for each experimental run applied to the xor5 quantum circuit. For the full experimental results the reader is referred to [13].
Fig. 7. MBF experimental results for the xor5 quantum circuit
Before any analysis of our results for these test cases, we note that our quantum synthesis algorithm always converges toward a solution. Considering all aspects, the adaptive approach proves more effective in achieving faster convergence because better offspring are evolved, although the self-adaptive approach should, at least theoretically, be better in terms of the number of evolved solutions. For our purpose, the synthesis of quantum circuits, the main goal was to reduce the evolution time; this justifies our choice of the adaptive approach.
In more detail, the effectiveness of the adaptive genetic algorithm is demonstrated for quantum circuit synthesis. The computational power overhead required by the adaptive component is reasonably small (see the MT values compared with those of the self-adaptive approach); however, the number of evolved solutions is higher for the self-adaptive approach.
5 Conclusions and Perspectives
This paper presented our experimental results on using two different optimization strategies for evolving quantum circuits. For this task, the best performance was achieved by the adaptive rather than the self-adaptive approach. As already shown in [19] [20], metaheuristic approaches are more effective in evolving quantum circuits, being able to provide solutions for 8-qubit circuits, whereas conventional GA approaches are effective only for 4- or 5-qubit circuits. These previously investigated methods employ only the adaptive approach. The implementation and testing of another metaheuristic approach (self-adaptation) is presented herein, with the emphasis on the comparison between the two strategies at the algorithmic level. The experimental results suggest that the adaptive strategy is better than the self-adaptive one for all the considered benchmark circuits. In fairness, it has to be mentioned that the greater experience gained in developing the adaptive metaheuristic suggests that further research may close this gap. The difference in performance can also be explained by more mutations being performed than necessary in many cases (since each individual decides on its own applied genetic operator). Our future work will investigate algorithms with a smaller number of parameters, in order to determine the most effective metaheuristic strategy for evolving quantum circuits.
Acknowledgements

This work was supported in part by the National University Research Council, Romania, under grant PNII-I17/2007.
References

1. Eiben, A.E., Michalewicz, Z., Schoenauer, M., Smith, J.E.: Parameter Control in Evolutionary Algorithms. In: Parameter Setting in Evolutionary Algorithms. Springer, Heidelberg (2007)
2. Wolpert, D.H., Macready, W.G.: No Free Lunch Theorems for Optimization. IEEE Transactions on Evolutionary Computation 1(1), 67–82 (1997)
3. Rechenberg, I.: Evolutionsstrategie - Optimierung technischer Systeme nach Prinzipien der biologischen Evolution. Frommann-Holzboog, Stuttgart (1973)
4. Ruican, C., Udrescu, M., Prodan, L., Vladutiu, M.: Automatic Synthesis for Quantum Circuits using Genetic Algorithms. In: International Conference on Adaptive and Natural Computing Algorithms, pp. 174–183 (2007)
5. Ruican, C., Udrescu, M., Prodan, L., Vladutiu, M.: A Genetic Algorithm Framework Applied to Quantum Circuit Synthesis. In: Nature Inspired Cooperative Strategies for Optimization, pp. 419–429 (2007) 6. Gheorghies, O., Luchian, H., Gheorghies, A.: Walking the Royal Road with Integrated-Adaptive Genetic Algorithms. University Alexandru Ioan Cuza of Iasi (2005), http://thor.info.uaic.ro/~tr/tr05-04.pdf 7. Maslov, D.: Reversible Logic Synthesis Benchmarks Page (2008), http://www.cs.uvic.ca/%7Edmaslov/ 8. Spector, L.: Automatic Quantum Computer Programming. A Genetic Programming Approach, 2nd edn. Springer, Heidelberg (2006) 9. Nielsen, M., Chuang, I.: Quantum Computation and Quantum Information. Cambridge University Press, Cambridge (2000) 10. Yao, X.: An Empirical Study of Genetic Operators in Genetic Algorithms. Microprocessing and Microprogramming 38(1-5), 707–714 (1993) 11. Hilding, F.G., Ward, K.: Automated Operator Selection on Genetic Algorithms. Knowledge-Based Intelligent Information and Engineering Systems, 903–909 (2005) 12. Affenzeller, M., Wagner, S.: Offspring Selection: A New Self-Adaptive Selection Scheme for Genetic Algorithms. Adaptive and Natural Computing Algorithms, 218–221 (2005) 13. Ruican, C.: Projects Web Site Page (2010), http://www.cs.utt.ro/~crys/index_files/public/ices.tar.gz 14. Luke, S.: Essentials of Metaheuristics. Zeroth Edition (2009), http://cs.gmu.edu/~sean/book/metaheuristics/ 15. Smit, S.K., Eiben, A.E.: Comparing Parameter Tuning Methods for Evolutionary Algorithms. In: IEEE Congress on Evolutionary Computation, pp. 399–406 (2009) 16. Maslov, D., Dueck, G.W.: Level Compaction in Quantum Circuits. In: IEEE Congress on Evolutionary Computation, pp. 2405–2409 (2006) 17. Shende, V., Prasad, A.K., Markov, I.L., Hayes, J.P.: Synthesis of Reversible Logic Circuits. IEEE Transaction on CAD 22 22(6), 710–722 (2003) 18. Lukac, M., Perkowski, M.: Evolving quantum circuits using genetic algorithm. In: NASA/DoD Conference on Evolvable Hardware, pp. 177–185 (2002) 19. Ruican, C., Udrescu, M., Prodan, L., Vladutiu, M.: Quantum Circuit Synthesis with Adaptive Parametres Control. In: European Conference on Genetic Programming, pp. 339–350 (2009) 20. Ruican, C., Udrescu, M., Prodan, L., Vladutiu, M.: Genetic Algorithm Based Quantum Circuit Synthesis with Adaptive Parameters. In: IEEE Congress on Evolutionary Computation, pp. 896–903 (2009)
Imitation Programming

Larry Bull

Department of Computer Science, University of the West of England,
Bristol BS16 1QY, U.K.
[email protected]
Abstract. Many nature-inspired mechanisms have been presented for computational design and optimization. This paper introduces a population-based approach inspired by a form of cultural learning - imitation. Imitation is typically defined as learning through the copying of others. In particular, it is used in this paper to design simple circuits using a discrete dynamical system representation – Turing’s unorganised machines. Initial results suggest that the imitation computation approach presented is competitive with evolutionary computation (another class of stochastic population-based search) for designing circuits from such recurrent NAND gate networks. Both synchronous and asynchronous circuits are considered.
1 Introduction

Cultural learning is learning either directly or indirectly from others and imitation is a fundamental form of such adaptation. Dawkins [9] has highlighted the similarity between the copying of behaviours through imitation and the propagation of innate behaviours through genetics within populations. That is, he suggests information passed between individuals through imitation is both selected for by the copier and subject to copy errors, and hence an evolutionary process is at work - consequently presenting the cultural equivalent to the gene, the so-called meme. The term “memetic” has already been somewhat inaccurately adopted by a class of search algorithms which combine evolution with individual learning, although a few exceptions include imitation (e.g., [40]). However, some previous work has explored the use of imitation (or imitation-like) processes as a general approach to computational intelligence, including within reinforcement learning (e.g., [29]) and supervised learning (e.g., [5]). The imitation of humans by machines has been used to design robot controllers (e.g., [6]) and computer game agents (e.g., [13]). Other culture-inspired schemes include the use of artifacts (e.g., [17]) or the use of stored information to guide the production of new evolutionary generations, as in Cultural Algorithms [30]. This paper introduces a new form of imitation computation and applies it to the design of (simple) dynamical circuits consisting of uniform components.

In 1948 Alan Turing produced a paper entitled “Intelligent Machinery” in which he highlighted cultural learning as a possible inspiration for techniques by which to program machines (e.g., see [8] for an overview). In the same paper, Turing also presented a formalism he termed “unorganised machines” by which to represent intelligence within computers. These consisted of two types: A-type unorganised machines,
which were composed of two-input NAND gates connected into disorganised networks (Figure 1, left); and, B-type unorganised machines which included an extra triplet of NAND gates on the arcs between the NAND gates of A-type machines by which to affect their behaviour in a supervised learning-like scheme through the constant application of appropriate extra inputs to the network (Figure 1, right). In both cases, each NAND gate node updates in parallel on a discrete time step with the output from each node arriving at the input of the node(s) on each connection for the next time step. The structure of unorganised machines is therefore very much like a simple artificial neural network with recurrent connections and hence it is perhaps surprising that Turing made no reference to McCulloch and Pitts’ [22] prior seminal paper on networks of binary-thresholded nodes. However, Turing’s scheme extended McCulloch and Pitts’ work in that he also considered the training of such networks with his B-type architecture. This has led to their also being known as “Turing’s connectionism”. Moreover, as Teuscher [35] has highlighted, Turing’s unorganised machines are (discrete) nonlinear dynamical systems and therefore have the potential to exhibit complex behaviour despite their construction from simple elements. Around the same time as Turing was working on artificial intelligence in the 1940’s, John von Neumann, together with Stanislaw Ulam, developed the regular lattice-based discrete dynamical systems known as Cellular Automata (CA) [38]. That is, CAs are discrete dynamical systems which exist on a graph of restricted connectivity but with potentially any logical function at each node, whereas unorganised machines exist on a graph of potentially any connectivity topology but with a restricted logical function at each node. Given their simple structure from universal gates, the current work aims to explore the potential for circuit design using unorganised machines through the use of imitation computation.
Fig. 1. A-type unorganised machine consisting of four two-input NAND gates (left). B-type unorganised machine (right) consisting of four two-input NAND gates. Each connecting arc contains a three NAND gate “interference” mechanism so that external inputs such as S1 and S2 can be applied to affect overall behaviour, i.e., a form of supervised learning.
2 Background

The most common form of discrete dynamical system is the Cellular Automaton which consists of an array of cells where the cells exist in states from a finite set and
update their states in parallel in discrete time. Traditionally, each cell calculates its next state depending upon its current state and the states of its closest neighbours. Packard [26] was the first to use a computational intelligence technique to design CAs such that they exhibit a given emergent global behaviour, using evolutionary computation. Following Packard, Mitchell et al. (e.g., [24]) have investigated the use of a Genetic Algorithm (GA) [16] to learn the rules of uniform one-dimensional, binary CAs. As in Packard’s work, the GA produces the entries in the update table used by each cell, candidate solutions being evaluated with regard to their degree of success for the given task — density and synchronization. Andre et al. [2] repeated Mitchell et al.’s work evolving the tree-based LISP S-expressions of Genetic Programming (GP) [20] to identify the update rules. They report similar results. Sipper [31] presented a non-uniform, or heterogeneous, approach to evolving CAs. Each cell of a one- or twodimensional CA is also viewed as a GA population member, mating only with its lattice neighbours and receiving an individual fitness. He showed an increase in performance over Mitchell et al.’s work by exploiting the potential for spatial heterogeneity in the tasks. The approach was also implemented on a Field-Programmable Gate Array (FPGA) and, perhaps most significantly, the inherent fault-tolerance of such discrete dynamical systems was explored. That is, it appears the behaviour of such systems gives them robustness to certain types of fault without extra mechanisms. This finding partially motivates the current study. Another early investigation into discrete dynamical networks was that by Kauffman (e.g., see [18] for an overview) with his “Random Boolean Networks” (RBN). An RBN typically consists of a network of N nodes, each performing one of the possible Boolean functions with K inputs from other nodes in the network, all updating synchronously. As such, RBN may be viewed as a generalization of A-type unorganised machines (since they only contain NAND gates, with K=2). Again, such discrete dynamical systems are known to display an inherent robustness to faults - with low K (see [1] for related results with such regulatory network models in general). RBN have recently been evolved for (ensemble) computation [28]. A number of representations have been presented by which to enable the evolution of computer programs and circuits. Most relevant to the representation to be explored in this paper is the relatively small amount of prior work on arbitrary graph-based representations. Significantly, Fogel et al. (e.g., [11]) were the first to evolve graphbased (sequential) programs with their use of finite state machines – Evolutionary Programming (EP). Angeline et al. [4] used a version of Fogel et al.’s approach to design highly recurrent artificial neural networks. Teller and Veloso’s [34] “neural programming” (NP) uses a directed graph of connected nodes, each with functionality defined in the standard GP way, with recursive connections included. Here each node executes in synchronous parallelism for some number of cycles before an output node’s value is taken. Luke and Spector [21] presented an indirect, or cellular, encoding scheme by which to produce graphs, as had been used to design artificial neural networks (e.g., [14]), an approach used to design both unorganised machines [35] and automata networks [7]. 
Poli has presented a scheme wherein nodes are connected in a graph which is placed over a two-dimensional grid. Later, recurrent artificial neural networks were designed such that the nodes were synchronously parallel and variants exist in which some nodes can update more frequently than others (see [27] for an overview). Miller (e.g., [23]) has presented a restricted graph-based representation
scheme originally designed to consider the hardware implementation of the evolved program wherein a two-dimensional grid of sequentially (feed forward) updating, connected logic blocks is produced. The implementation of arbitrary graphs onto FPGAs has also been considered [37]. An example of what might be identified as a population-based imitation approach is the class of algorithms known as Particle Swarm Optimization (PSO) [19]. Originally intended as a simulation tool for modelling social behaviour, PSO algorithms typically maintain a population of real-valued individuals which move through the problem space by adjusting their constituent variables based upon both their own best ever solution and the current best solution within a social or spatial group. That is, it can be said individuals imitate aspects of other current individuals to try to improve their fitness, typically using randomly weighted coefficients per variable via vector multiplication. In this paper a related form of imitation computation is presented and used to design synchronous and asynchronous dynamical circuits from variable-sized graphs.
3 Designing Unorganised Machines through Imitation

A-type unorganised machines have a finite number of possible states and they are deterministic, hence such networks eventually fall into a basin of attraction. Turing was aware that his A-type unorganised machines would have periodic behaviour and he stated that since they represent “about the simplest model of a nervous system with a random arrangement of neurons” it would be “of very great interest to find out something about their behaviour” (see [8]). Figure 2 shows the fraction of nodes which change state per update cycle for 100 randomly created networks, each started from a random initial configuration, for various numbers of nodes N. As can be seen, the time taken to equilibrium is typically around 15 cycles, with all nodes changing state on each cycle thereafter, i.e., oscillating. For the smaller networks (N=5, N=50), some nodes remain unchanging at equilibrium however; with smaller networks, the probability of nodes being isolated is sufficient that the basin of attraction contains a degree of node stasis (see [35] for a similar study).
Fig. 2. Showing the average fraction of two-input NAND gate nodes which change state per update cycle of random A-type unorganised machines with various numbers of nodes N
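The network-generation details behind Figure 2 are not spelled out in the text; the following Python sketch assumes each node's two NAND inputs are chosen uniformly at random and reproduces the kind of measurement plotted there.

import random

def random_atype(n):
    """A-type unorganised machine: each node is a two-input NAND gate whose two
    inputs are the previous states of two randomly chosen nodes."""
    connections = [(random.randrange(n), random.randrange(n)) for _ in range(n)]
    states = [random.randint(0, 1) for _ in range(n)]
    return connections, states

def synchronous_update(connections, states):
    """All nodes update in parallel on a discrete time step (two-input NAND)."""
    return [1 - (states[a] & states[b]) for (a, b) in connections]

if __name__ == "__main__":
    random.seed(0)
    conns, states = random_atype(100)
    for cycle in range(20):
        new_states = synchronous_update(conns, states)
        changed = sum(s != t for s, t in zip(states, new_states)) / len(states)
        print(cycle, round(changed, 2))     # fraction of nodes changing state (cf. Fig. 2)
        states = new_states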
Previously, Teuscher [35] has explored the use of evolutionary computing to design both A-type and B-type unorganised machines together with new variants of the latter. In his simplest encoding, an A-type machine is represented by a string of N pairs of integers, each integer representing the node number within the network from which that NAND gate node receives an input. Turing did not explicitly demonstrate how inputs and outputs were to be determined for A-type unorganised machines. Teuscher used I input nodes for I possible inputs, each of which receive the external input only and are then connected to any of the nodes within the network as usual connections. That is, they are not NAND nodes. He then allows for O outputs from a pre-defined position within the network. Thus his scheme departs slightly from Turing’s for B-type unorganised machines since Turing there showed input NAND nodes receiving the external input (Figure 1). Teuscher uses his own scheme for all of his work on unorganised machines, which may be viewed as directly analogous to specifying the source of inputs via a terminal set in traditional tree-based GP. The significance of this difference is not explored here, with Turing’s input scheme used. Teuscher used a GA to design A-type unorganised machines for bitstream regeneration tasks and simple pattern classification. In the former case, the size of the networks, i.e., the number of nodes, was increased by one after every 30,000 generations until a solution was found. That is, an epochal approach was exploited to tackle the issue of not knowing how complex an A-type unorganised machine will need to be for a given task. Or a fixed, predefined size was used. The basic principle of imitation computation is that individuals alter themselves based upon another individual(s), typically with some error in the process. Individuals are not replaced with the descendants of other individuals as in evolutionary search; individuals persist through time, altering their solutions via imitation. Thus imitation may be seen as a directed stochastic search process, thereby combining aspects of both recombination and mutation used in evolutionary computation. In this paper a variable-length representation of pairs of integers, defining node inputs, each with an accompanying single bit defining the node’s start state, is used. On each round of imitations, each individual in the society/population chooses another to imitate. A number of schemes are possible, such as those used in PSO, but the current highest quality solution is used by all individuals for each trait here. To encourage compact solutions, in the case of a quality tie, the smallest high quality solution is used, or a randomly chosen such individual if a further tie occurs. In the general case, for each trait/variable of an individual, a probability that the imitator will replace their own corresponding variable with a copy of that of the imitated solution (pi) could be tested. If satisfied, a further probability (pe) would then be tested to see if an error will occur in that process. For simplicity, in this paper pi is not used on a per variable basis but deterministically set such that one imitation event occurs per round per individual, with pe = 0.5. The possible imitation operations are to copy a connection, copy a start state, or copy solution size, all with or without error. 
For node connection without error, a randomly chosen node has one of its randomly chosen connections set to the same value as the corresponding node and its same connection in the individual it is imitating. When an error occurs, the connection is set to the copied connection’s id +/- 1 (equal probability, bounded by solution size). Imitation can also copy the start state for a randomly chosen node from the corresponding node, or do it with error (bit flip here). Varying
solution size depends upon whether the two individuals are the same size, with perfect and erroneous versions again used. Thus if a change of size imitation event is chosen and if the individual being imitated is larger than the copier, the connections and node start state of the first extra node are copied to the imitator, a randomly chosen node being connected to it. If the individual being imitated is smaller than the copier, the last added node is cut from the imitator and all connections to it re-assigned. If the two individuals are the same size, either event can occur (with equal probability). Node addition adds a randomly chosen node from the individual being imitated onto the end of the copier and it is randomly connected into the network. Node deletion is as before. The operation can also occur with errors such that copied connections are either incremented or decremented within bounds. For a problem with a given number of inputs I and a given number of outputs O, the node deletion operator has no effect if the solution consists of only O + I nodes. Similarly, there is a maximum size defined beyond which the growth operator has no effect.

A process similar to the selection scheme typically used in Differential Evolution [33] is adopted here: each individual in the current population (μ) creates one alternative solution under imitation (μ’) and it is adopted by that individual if it is of higher quality. In the case of ties, the solution with the fewest number of variables/traits is adopted to reduce bloat, otherwise the decision is random. Other imitation algorithms have made the adoption of imitated solutions probabilistic (e.g., [15]), whereas PSO always accepts new solutions but then also imitates from the given individual’s best ever solution per learning cycle. This aspect of the approach, like many others, is open to future investigation.
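A minimal Python sketch of one imitation round under the scheme described above: all individuals imitate the current best solution (smallest on a quality tie), one connection-copy event per individual with a 0.5 error probability, and the DE-style acceptance rule. The data layout and the toy fitness used in the demonstration are illustrative assumptions, not the author's implementation.

import random

# Genotype: list of nodes, each node = [input_a, input_b, start_state].

def best_to_imitate(population, quality):
    """Everyone imitates the current highest-quality solution; quality ties favour
    the smallest such solution, further ties are broken at random."""
    top = max(quality(ind) for ind in population)
    candidates = [ind for ind in population if quality(ind) == top]
    smallest = min(len(ind) for ind in candidates)
    return random.choice([ind for ind in candidates if len(ind) == smallest])

def imitate_connection(copier, model, p_error=0.5):
    """One imitation event: copy one randomly chosen connection of a randomly
    chosen node from the model, with probability p_error of a +/-1 copy error
    (bounded by the copier's solution size here, an assumption)."""
    child = [node[:] for node in copier]
    i = random.randrange(min(len(child), len(model)))
    j = random.randrange(2)
    value = model[i][j]
    if random.random() < p_error:
        value = max(0, min(len(child) - 1, value + random.choice((-1, 1))))
    child[i][j] = value
    return child

def accept(old, new, quality):
    """DE-style acceptance: keep the imitated variant if it is better; on a
    quality tie keep the smaller solution, otherwise decide at random."""
    if quality(new) != quality(old):
        return new if quality(new) > quality(old) else old
    if len(new) != len(old):
        return new if len(new) < len(old) else old
    return random.choice([old, new])

if __name__ == "__main__":
    random.seed(2)
    toy_quality = lambda ind: sum(node[2] for node in ind)    # placeholder fitness
    pop = [[[0, 1, 0], [1, 0, 1]] for _ in range(4)]
    model = best_to_imitate(pop, toy_quality)
    pop = [accept(ind, imitate_connection(ind, model), toy_quality) for ind in pop]
    print(pop)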
4 Experimentation

A simple version of the multiplexer task is used initially in this paper, since multiplexers can be used to build many other logic circuits, including larger multiplexers. These Boolean functions are defined for binary strings of length l = x + 2^x under which the x bits index into the remaining 2^x bits, returning the value of the indexed bit. The correct response to an input results in a quality increment of 1, with all possible 2^l binary inputs being presented per fitness evaluation. Upon each presentation of an input, each node in an unorganised machine has its state set to its specified start state. The input is applied to the first connection of each corresponding I input node. The unorganised machine is then executed for T cycles, where T is typically chosen to enable the machine to reach an attractor. The value on the output node(s) is then taken as the response. It can be noted that Teuscher [35] used the average output node(s) state value over the T cycles to determine the response; again the significance (or not) of this difference is not explored here. All results presented are the average of 10 runs, with a population/society of μ=20 and T=15. Experience found giving initial random solutions N = O+I+30 nodes was useful across all the problems explored here, i.e., with the other parameter/algorithmic settings described. Figure 3 (left) shows the performance of the approach on the 6-bit (x=2) multiplexer problem. Optimal performance (64) is obtained around 5,000 iterations and solutions are eventually two or three nodes smaller than at initialization.
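As an illustration of the fitness evaluation just described, the following Python sketch enumerates all 2^l inputs of the x=2 multiplexer and scores a candidate machine; the `respond` callable stands in for executing an unorganised machine for T cycles and reading its output node, and the address-bit ordering is an assumption.

from itertools import product

def multiplexer_target(bits, x=2):
    """6-bit multiplexer (x=2): the first x bits form an address (assumed MSB-first)
    that selects one of the remaining 2**x data bits."""
    address = int("".join(map(str, bits[:x])), 2)
    return bits[x + address]

def multiplexer_fitness(respond, x=2, T=15):
    """Quality = number of correct responses over all 2**(x + 2**x) inputs.
    respond(bits, T) is assumed to reset the machine to its start states, apply
    the input bits, run T update cycles and return the output node's state."""
    length = x + 2 ** x
    return sum(respond(list(bits), T) == multiplexer_target(list(bits), x)
               for bits in product((0, 1), repeat=length))

if __name__ == "__main__":
    import random
    random.seed(3)
    guess = lambda bits, T: random.randint(0, 1)    # placeholder machine
    print(multiplexer_fitness(guess))               # roughly 32 of a possible 64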
Fig. 3. Performance on multiplexer (left) and demultiplexer (right)
A multiplexer has multiple inputs and a single output. The demultiplexer has multiple inputs and multiple outputs. Figure 3 (right) shows performance of the same algorithm for an x=2 demultiplexer, i.e., one with three inputs and four outputs. Again, quality was determined by feeding each of the possible inputs into the A-type machine. It can be seen that optimal performance (8) is reached around 7,000 iterations and solutions are typically around ten nodes smaller than at initialization. As noted above, A-type machines are similar to RBN. The effects of increasing the logic functions to {AND, NAND, OR, NOR, XOR, XNOR}, with a corresponding extra imitation operation, have been briefly explored on the same tasks. Results (not shown) indicate either no statistically significant difference in performance or a significant reduction in performance is seen: Turing’s simpler scheme appears to be robust. However, significantly smaller solutions were sometimes seen which is potentially useful for circuit design, of course.
5 Asynchrony

Turing’s unorganized machines were originally described as updating synchronously in discrete time steps. However, there is no reason why this should be the case and there may be significant benefits from relaxing such a constraint. Asynchronous forms of CA have been explored (e.g., [25]) wherein it is often suggested that asynchrony is a more realistic underlying assumption for many natural and artificial systems. Asynchronous logic devices are also known to have the potential to consume less power and dissipate less heat [39], which may be exploitable during efforts towards hardware implementations of such systems. Asynchronous logic is also known to have the potential for improved fault tolerance, particularly through delay insensitive schemes (e.g., [10]). This may also prove beneficial for direct hardware implementations. See Thompson et al. [36] for evolving asynchronous hardware.
Fig. 4. Showing the average fraction of two-input NAND gate nodes which change state per update cycle of random asynchronous A-type unorganised machines with various numbers of nodes N.
Asynchronous CAs have also been evolved (e.g., [32]). No prior work on the use of asynchronous unorganized machines is known. Asynchrony is here implemented as a randomly chosen node (with replacement) being updated on a given cycle, with as many updates per overall network update cycle as there are nodes in the network before an equivalent cycle to one in the synchronous case is said to have occurred. Figure 4 shows the fraction of nodes which change state per update cycle for 100 randomly created networks, each started from a random initial configuration, for various numbers of nodes N. As can be seen, the time taken to equilibrium is again typically around 15 cycles, with around 10% of nodes changing state on each cycle thereafter, i.e., significantly different behavior to that seen for the synchronous case shown in Figure 2. For the smaller networks (N=5, N=50), there is some slight variance in this behaviour. Figure 5 shows the performance of the imitation algorithm with the asynchronous unorganized machines for the multiplexer and demultiplexer tasks. The same parameters as before were used in each case. As can be seen, the multiplexer task appears significantly harder, on average IP fails to solve the task on every run with the parameters used, compared to consistent optimality after 5,000 iterations in the synchronous node case (Figure 3). Performance was not significantly improved through a variety of minor parameter alterations tried (not shown). It takes around 150,000 iterations to solve the demultiplexer, again a statistically significant decrease in performance over the synchronous case. Moreover, the use of asynchronous node updating has altered the topology of the graphs evolved with more nodes (T-test, p≤0.05) being exploited. This is perhaps to be expected since redundancy, e.g., through sub-circuit duplication, presumably provides robustness to exact updating order during computation. One of the main motivating factors for exploring such unorganised machines is the potential relevance to designing forms of (nano) technology in un-clocked circuits made from simple, uniform components. However, asynchronous versions of RBN have also been presented (e.g., [12]) and so the same increase in node logic functions has been explored here as in the previous section with similar results (not shown).
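A sketch of the asynchronous update scheme described above, with one "equivalent" cycle made up of N single-node updates chosen at random with replacement; as in the earlier synchronous sketch, the random network construction is an assumption.

import random

def asynchronous_cycle(connections, states):
    """One equivalent update cycle of an asynchronous A-type machine:
    N single-node updates, each on a node chosen at random with replacement."""
    states = states[:]
    n = len(states)
    for _ in range(n):
        i = random.randrange(n)
        a, b = connections[i]
        states[i] = 1 - (states[a] & states[b])     # two-input NAND
    return states

if __name__ == "__main__":
    random.seed(0)
    n = 100
    connections = [(random.randrange(n), random.randrange(n)) for _ in range(n)]
    states = [random.randint(0, 1) for _ in range(n)]
    for cycle in range(20):
        new_states = asynchronous_cycle(connections, states)
        changed = sum(s != t for s, t in zip(states, new_states)) / n
        print(cycle, round(changed, 2))             # cf. Fig. 4
        states = new_states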
Fig. 5. Performance on multiplexer (left) and demultiplexer (right) of asynchronous system
6 A Comparison with Evolution

These initial results therefore indicate that unorganized machines are amenable to (open-ended) design using the imitation algorithm presented. As noted above, one of the earliest forms of evolutionary computation used a graph-based representation – Fogel et al.’s [11] Evolutionary Programming. EP traditionally utilizes five mutation operators to design finite state machines. In this paper EP has been used with the same representation of pairs of integers, defining node inputs, each with an accompanying single bit defining the node’s start state, as above. Similarly, with equal probability, an individual either has: a new NAND node added, with random connectivity; the last added node removed, and those connections to it randomly re-assigned; a randomly chosen connection to a randomly chosen node randomly re-assigned; or, a randomly chosen node has its start state flipped. The same minimum and maximum solution size limits are maintained as before. The (μ + μ’) selection scheme of EP is also used: each individual in the parent population (μ) creates one randomly mutated offspring (μ’) and the fittest μ individuals form the next generation of parents. In the case of ties, the individual with the fewest nodes is kept to reduce bloat, otherwise the decision is random. Fogel et al. used a penalty function to curtail solution complexity, reducing fitness by 1% of size. All other parameters were the same as used above. Figure 6 (left) shows the performance of the EP-A-type system on the 6-bit (x=2) multiplexer problem. Optimal performance (64) is obtained around 200,000 generations and after an initial period of very slight growth, solutions are eventually no bigger than at initialization. Figure 6 (right) shows that optimal performance (8) in the equivalent demultiplexer is reached around 400,000 generations and solutions are typically five or six nodes smaller than at initialization. Hence these results are statistically significantly (T-test, p≤0.05) slower and bigger than those seen above with the imitation algorithm. The same was found to be true for the asynchronous update scheme, where the multiplexer was again unsolved (not shown).
Fig. 6. Performance on multiplexer (left) and demultiplexer (right) by EP (synchronous)
The imitation algorithm described can be viewed as a parallel hill-climber, simultaneously updating a number of solutions, in contrast to the traditional global replacement scheme used in evolutionary computation (hybrids are also possible, e.g., [3]). It is therefore of interest whether the imitation process aids performance in comparison to using random alterations to individuals, under the same selection process. Results (not shown) indicate that no statistically significant difference is seen from using imitation over purely random alterations on the demultiplexer task (T-test, p>0.05), but an improvement is seen on the multiplexer task through imitation (T-test, p≤0.05). With asynchronous updating imitation is better on the demultiplexer (T-test, p≤0.05). Of course, all algorithms are parameter sensitive to some degree: the parameters used here were simply chosen since they typically enabled optimal performance with all of the basic schemes, both evolution and imitation, on all tasks used, over the allotted time. Future work is needed to explore parameter sensitivity, the role of selecting who to imitate, multiple imitations per iteration, etc.
7 Conclusions

This paper has examined a new form of imitation computation and used it to design circuits from discrete dynamical systems. It has also introduced an asynchronous form of the representation. Current work is exploring ways by which to improve the performance of the imitation algorithm for the design of these and other systems. The degree of inherent fault-tolerance of the NAND gate networks due to their dynamical nature is also being explored (e.g., following [18][31]).
References 1. Aldana, M., Cluzel, P.: A natural class of robust networks. PNAS 100(15), 8710–8714 (2003) 2. Andre, D., Koza, J.R., Bennett, F.H., Keane, M.: Genetic Programming III. MIT, Cambridge (1999)
3. Angeline, P.: Evolutionary Optimization vs Particle Swarm Optimization. In: Porto, V.W., Waagen, D. (eds.) EP 1998. LNCS, vol. 1447, pp. 601–610. Springer, Heidelberg (1998) 4. Angeline, P., Saunders, G., Pollock, J.: An Evolutionary Algorithm that Constructs Recurrent Neural Networks. IEEE Transactions on Neural Networks 5, 54–65 (1994) 5. Atkeson, C., Schaal, S.: Robot learning from demonstration. In: Proceedings of the Fourteenth International Conference on Machine Learning, pp. 12–20. Morgan Kaufmann, San Francisco (1997) 6. Billard, A., Dautenhahn, K.: Experiments in Learning by Imitation - Grounding and Use of Communication in Robotic Agents. Adaptive Behavior 7(3/4), 415–438 (1999) 7. Brave, S.: Evolving Deterministic Finite Automata using Cellular Encoding. In: Koza, J.R., et al. (eds.) Procs of the First Ann. Conf. on Genetic Programming, pp. 39–44. MIT Press, Cambridge (1996) 8. Copeland, J.: The Essential Turing, Oxford (2004) 9. Dawkins, R.: The Selfish Gene, Oxford (1976) 10. Di, J., Lala, P.: Cellular Array-based Delay Insensitive Asynchronous Circuits Design and Test for Nanocomputing Systems. Journal of Electronic Testing 23, 175–192 (2007) 11. Fogel, L.J., Owens, A.J., Walsh, M.J.: Artificial Intelligence Through A Simulation of Evolution. In: Maxfield, M., et al. (eds.) Biophysics and Cybernetic Systems: Proceedings of the 2nd Cybernetic Sciences Symposium, pp. 131–155. Spartan Books (1965) 12. Gershenson, C.: Classification of Random Boolean Networks. In: Standish, R.K., Bedau, M., Abbass, H. (eds.) Artificial Life VIII, pp. 1–8. MIT Press, Cambridge (2002) 13. Gorman, B., Humphreys, M.: Towards Integrated Imitation of Strategic Planning and Motion Modeling in Interactive Computer Games. Computers in Entertainment 4(4) (2006) 14. Gruau, F., Whitley, D.: Adding Learning to the Cellular Development Process. Evolutionary Computation 1(3), 213–233 (1993) 15. Hassdijk, E., Vogt, P., Eiben, A.: Social Learning in Population-based Adaptive Systems. In: Procs of the 2008 IEEE Congress on Evolutionary Computation, pp. 1386–1392. IEEE Press, Los Alamitos (2008) 16. Holland, J.H.: Adaptation in Natural and Artificial Systems. Univ. of Mich. Press (1975) 17. Hutchins, E., Hazelhurst, B.: Learning in the Cultural Process. In: Langton, C.G., et al. (eds.) Artificial Life II, pp. 689–706. Addison Wesley, Reading (1990) 18. Kauffman, S.A.: The Origins of Order, Oxford (1993) 19. Kennedy, J., Eberhart, R.: Particle Swarm Optimization. In: Proceedings of IEEE International Conference on Neural Networks, pp. 1942–1948. IEEE Press, Los Alamitos (1995) 20. Koza, J.R.: Genetic Programming. MIT Press, Cambridge (1992) 21. Luke, S., Spector, L.: Evolving Graphs and Networks with Edge Encoding: Preliminary Report. In: Koza, J.R. (ed.) Late Breaking Papers at the Genetic Programming 1996 Conference, pp. 117–124. Stanford University, Standford (1996) 22. McCulloch, W.S., Pitts, W.: A Logical Calculus of the Ideas Immanent in Nervous Activity. Bulletin of Mathematical Biophysics 5, 115–133 (1943) 23. Miller, J.: An Empirical Study of the Efficiency of Learning Boolean Functions using a Cartesian Genetic Programming Approach. In: Banzhaf, W., Daida, J., Eiben, A.E., Garzon, M.H., Honavar, V., Jakiela, M., Smith, R.E. (eds.) Proceedings of the Genetic and Evolutionary Computation Conference – GECCO 1999, pp. 1135–1142. Morgan Kaufmann, San Francisco (1999) 24. Mitchell, M., Hraber, P., Crutchfield, J.: Revisiting the Edge of Chaos: Evolving Cellular Automata to Perform Computations. 
Complex Systems 7, 83–130 (1993) 25. Nakamura, K.: Asynchronous Cellular Automata and their Computational Ability. Systems, Computers, Controls 5(5), 58–66 (1974)
26. Packard, N.: Adaptation Toward the Edge of Chaos. In: Kelso, J., Mandell, A., Shlesinger, M. (eds.) Dynamic Patterns in Complex Systems, pp. 293–301. World Scientific, Singapore (1988) 27. Poli, R.: Parallel Distributed Genetic Programming. In: Corne, D., Dorigo, M., Glover, F. (eds.) New Ideas in Optimisation, pp. 403–431. McGraw-Hill, New York (1999) 28. Preen, R., Bull, L.: Discrete Dynamical Genetic Programming in XCS. In: GECCO-2009: Proceedings of the Genetic and Evolutionary Computation Conference. ACM Press, New York (2009) 29. Price, B., Boutilier, C.: Implicit Imitation in Multiagent Reinforcement learning. In: Procs of Sixteenth Intl Conference on Machine Learning, pp. 325–334. Morgan Kaufmann, San Francisco (1999) 30. Reynolds, R.: An Introduction to Cultural Algorithms. In: Sebald, Fogel, D. (eds.) Procs of 3rd Ann. Conf. on Evolutionary Programming, pp. 131–139. World Scientific, Singapore (1994) 31. Sipper, M.: Evolution of Parallel Cellular Machines. Springer, Heidelberg (1997) 32. Sipper, M., Tomassini, M., Capcarrere, S.: Evolving Asynchronous and Scalable Nonuniform Cellular Automata. In: Proceedings of the Third International Conference on Artificial Neural Networks and Genetic Algorithms, pp. 66–70. Springer, Heidelberg (1997) 33. Storn, R., Price, K.: Differential Evolution - a Simple and Efficient Heuristic for Global Optimization over Continuous Spaces. Journal of Global Optimization 11, 341–359 (1997) 34. Teller, A., Veloso, M.: Neural Programming and an Internal Reinforcement Policy. In: Koza, J.R. (ed.) Late Breaking Papers at the Genetic Programming 1996 Conference, pp. 186–192. Stanford University, Standford (1996) 35. Teuscher, C.: Turing’s Connectionism. Springer, Heidelberg (2002) 36. Thompson, A., Harvey, I., Husbands, P.: Unconstrained Evolution and Hard Consequences. In: Sanchez, E., Tomassini, M. (eds.) Towards Evolvable Hardware 1995. LNCS, vol. 1062. Springer, Heidelberg (1996) 37. Upegui, A., Sanchez, E.: Evolving Hardware with Self-reconfigurable connectivity in Xilinx FPGAs. In: Proceedings of the first NASA/ESA conference on Adaptive Hardware and Systems, pp. 153–162. IEEE Press, Los Alamitos (2006) 38. Von Neumann, J.: The Theory of Self-Reproducing Automata. University of Illinois (1966) 39. Werner, T., Akella, V.: Asynchronous Processor Survey. Comput. 30(11), 67–76 (1997) 40. Wyatt, D., Bull, L.: A Memetic Learning Classifier System for Describing ContinuousValued Problem Spaces. In: Krasnagor, N., Hart, W., Smith, J. (eds.) Recent Advances in Memetic Algorithms, pp. 355–396. Springer, Heidelberg (2004)
EvoFab: A Fully Embodied Evolutionary Fabricator

John Rieffel and Dave Sayles

Union College Computer Science Department
Schenectady, NY 12308 USA
Abstract. Few evolved designs are subsequently manufactured into physical objects – the vast majority remain on the virtual drawing board. We suggest two sources of this “Fabrication Gap”. First, by being descriptive rather than prescriptive, evolutionary design runs the risk of evolving interesting yet unbuildable objects. Secondly, in a wide range of interesting and high-complexity design domains, such as dynamic and highly flexible objects, the gap between simulation and reality is too large to guarantee consilience between design and object. We suggest that one compelling alternative to evolutionary design in these complex domains is to avoid both simulation and description, and instead evolve artifacts directly in the real world. In this paper we introduce EvoFab: a fully embodied evolutionary fabricator, capable of producing novel objects (rather than virtual designs) in situ. EvoFab thereby opens the door to a wide range of incredibly exciting evolutionary design domains.
1 Introduction
Evolutionary algorithms have been used to design a wide number of virtual objects, ranging from virtual creatures [12] to telescope lenses [1]. Recently, with the advent of rapid prototyping 3-D printers, an increasing number of evolved designs have been fabricated in the real world as well. One of the earliest examples of an evolved design crossing the “Fabrication Gap” into reality is Funes’ LEGO structures [4]. In this work, the genotypes were a direct encoding of the physical locations of bricks in the structure - a virtual “blueprint” of the design. Fitness, based upon the weight-bearing ability of the structures, was determined inside a quasi-static simulator. The authors were able to translate the virtual phenotype into a physical object by reading the blueprint and placing physical bricks accordingly. Another notable example of a manufactured evolved design is Lohn’s satellite antenna [6]. The genotype in this case was a generative encoding L-system which, when interpreted by a LOGO-like “turtle”, drew a 3-D model of the antenna. Fitness was determined by measuring the performance of the design within an off-the-shelf antenna simulator. Other evolved designs to cross the Fabrication Gap include robots [7], furniture [5], and tensegrity structures [10]. In each of these latter cases, phenotypes
were 3D CAD models which could then be printed directly by rapid protyping 3D printers. The quality of these examples belies their quantity. The vast majority of evolved designs remain on the virtual “drawing board”, never to be manufactured. A closer analysis of the examples above provides some insight into this “Fabrication Gap”. For Funes work, building a physical LEGO structure from a descriptive blueprint was facilitated, at least in principle, by the close correspondence between virtual and physical LEGO bricks. In practice, however, the blueprints alone didn’t contain sufficient assembly information: particularly for large structures, the evolved designs first had to be assembled on a flat horizontal surface and then tilted into place – an operation that cannot be inferred from a blueprint. In Lohn’s antenna work, the final product was manufactured by hand: using the 3D model as a guide, a skilled antenna engineer manually bent and soldered pieces of conductive wire to the specified lengths and angles. As these 3D antenna models become more complex, this process becomes increasingly intractable. We see two primary sources of this “Fabrication Gap” between evolved virtual design and physical object. The first issue is that, conventionally, evolved designs are purely descriptive. By specifying what to build but not how to build it, evolutionary design runs the risk of evolving interesting yet unbuildable objects. Imagine an evolutionary design system which evolves images of chocolate cakes. The image describes what the final product looks like (which may be delicious), but there is nothing in the image which provides insight into how it should be prepared, or whether it can even be prepared at all. Similarly, a descriptive representation shows a finished product, but contains no information about how to manufacture it. Secondly, the evolutionary design of complex objects requires high fidelity simulation in order to guarantee that the physical manifestation behaves like its virtual counterpart. For static and rigid objects, such as the tables and robot parts mentioned above, fabrication is relatively straight forward: their behavior can be realistically simulated, and their descriptive phenotype is easily translated into a print-ready CAD file. However, for high-complexity design domains, such as dynamic and highly flexible objects, the gap between simulation and reality is too large to reliably manufacture designs evolved in simulation. This begs the question: in these high complexity domains, is it at all possible to dispense with simulation and description entirely, and instead evolve assembly instructions directly within a rapid prototyper? In such an “evolutionary fabrication” scenario the genotype consists of a linear encoding of instructions to the printer, and the evaluated phenotype is the resulting structure. These ideas have been motivated and explored using simulations of rapid prototypers [8] [9], but until now haven’t been instantiated in the real world. On the face of it of course this proposition seems extreme, and the reasons against it are obvious. First of all, rapid prototyping is a slow process, and so an evolutionary run of hundreds (even thousands) of individuals might take days or weeks – not to mention the associated cost in print material. Furthermore, commercial rapid prototypers cost hundreds of thousands of dollars, and do not
allow access to their underlying API, which this approach requires. Finally, commercial prototypers typically only print relatively rigid materials, and so are incapable of producing objects from more interesting design domains. Fortunately, the recent advent of inexpensive desktop fabricators allows for a reexamination of these constraints. Hobbyist-oriented units, such as the Fab@Home and the Makerbot Cupcake, cost only a couple thousand dollars assembled, are open source and “hackable” and, most importantly, are capable of printing a much wider range of print media - from wax and silicone elastomer to chocolate. Furthermore, evolution embedded in the real world has produced some profoundly interesting results in other domains. Consider for instance Thompson’s seminal work on “Silicon Evolution” [13], in which pattern discriminators evolved directly on an FPGA behaved qualitatively differently than those evolved in simulation. In fact, the final product wound up exploiting thermal and analog properties of the FPGA – something well outside the domain of the simulator. Similarly, Watson and Ficici applied “Embodied Evolution” [15] (their term) to a population of simple robots, and produced neural network based control strategies which varied significantly from their simulated-evolution counterparts. In each case, the lesson has been that evolution directly in the real world can produce profound results which would have been impossible to produce via simulation. We draw our inspiration for Evolutionary Fabrication largely from these groundbreaking insights. In this paper we introduce EvoFab: a fully embodied evolutionary fabricator, capable of automatically designing and manufacturing soft and dynamic structures, thereby bridging the “Fabrication Gap”. After describing the design of this unit in detail, we demonstrate proof-of-concept Evolutionary Fabrication of flexible silicone objects. The ability to automatically design and build soft and dynamic structures via EvoFab opens the door to a wide range of exciting and vital design domains, such as soft robots and biomedical devices.
2
EvoFab: An Evolutionary Fabricator
The system capable of embodied evolutionary fabrication (EvoFab) consists of two parts: a Fab@Home desktop rapid prototyper [14], and a Python-based genetic algorithm which interfaces with the Fab@Home. The Fab@Home printer (Figure 1) was developed as a hobbyist desktop rapid prototyper. Its low price, open-source software, and large range of print materials make it ideally suited as an EvoFabber. A print syringe, mounted on an X-Y plotter, extrudes material onto an 8” square Z-axis-mounted print platform. We specify seven operations which the printer may perform:
in, out – move the print head 3 mm in the +/−Y direction
left, right – move the print head 3 mm in the +/−X direction
up, down – move the print head 3 mm in the +/−Z direction
extrude – push a fixed volume of print media through the 0.8 mm syringe
We refer to a linear encoding of these operations as an assembly plan.
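By way of illustration, a minimal Python sketch of this encoding is given below: an assembly plan is simply a linear genotype over the seven operations, and a helper converts a plan into relative print-head moves for bookkeeping. The function names, the random-plan generator, and the assignment of + and − within each movement pair are our own assumptions; this is not the Fab@Home driver itself.

```python
import random

# The seven printer operations named in the text; 3 mm steps and the 0.8 mm
# syringe come from the paper, the +/- sign convention within each pair is assumed.
INSTRUCTIONS = ["in", "out", "left", "right", "up", "down", "extrude"]
STEP_MM = 3.0

def random_assembly_plan(length=20):
    """A random assembly plan: a linear genotype of printer operations."""
    return [random.choice(INSTRUCTIONS) for _ in range(length)]

def to_moves(plan):
    """Translate a plan into relative (dx, dy, dz, extrude) tuples.

    This only models the geometry of a plan; it does not talk to the printer.
    """
    delta = {"in": (0, STEP_MM, 0), "out": (0, -STEP_MM, 0),
             "left": (-STEP_MM, 0, 0), "right": (STEP_MM, 0, 0),
             "up": (0, 0, STEP_MM), "down": (0, 0, -STEP_MM)}
    moves = []
    for op in plan:
        if op == "extrude":
            moves.append((0.0, 0.0, 0.0, True))   # push a fixed volume of media
        else:
            dx, dy, dz = delta[op]
            moves.append((float(dx), float(dy), float(dz), False))
    return moves

if __name__ == "__main__":
    plan = random_assembly_plan()
    print(plan)
    print(to_moves(plan)[:3])
```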
Fig. 1. A Fab@Home desktop prototyper is used as the foundation of the EvoFab
In conventional setups, prototypers produce three-dimensional objects by printing successive layered “slices” of the object on the horizontal plane, lowering the print platform between slices. In the context of Evolutionary Fabrication, however, we prefer a more open-ended freeform approach, and so place no constraints upon the print process. The print head is free to move in almost any direction and to perform any operation during the execution of an assembly plan – even if that means causing the syringe to collide with the object it is printing. We will discuss why this might be beneficial in the last section of this paper. After testing a variety of materials ranging from Play-Doh to alginate (an algae-based plaster), we selected silicone bath caulk (GE Silicone II) because of its relatively short (30-minute) cure time and its viscosity (it is thick enough to remain inside the print syringe until extruded, but thin enough to easily extrude into a single strand). Figure 2 illustrates the extrusion of silicone onto the print surface. Because extruding material from a syringe creates a “thread” which dangles between the syringe and the print platform, an extrude command followed by a directional command such as left will print a 3 mm long line on the print platform. An example assembly plan capable of printing a 3 mm square of silicone might appear as follows: [extrude, left, extrude, in, extrude, right, extrude, out]
Fig. 2. Freeform three dimensional printing of silicone is accomplished by a syringe mounted to an X-Y plotter. The print platform can be moved vertically along the Z axis.
In the context of evolutionary fabrication, these linear encodings of instructions can form a genotype. Mutation and crossover of genotypes are accomplished just as they would be in any other linear encoding. Figure 3 illustrates the results of two assembly plan genotypes which differ by a small mutation.
Fig. 3. Small changes to assembly plan genotypes produce corresponding changes to the resulting silicone phenotype. The image above compares an original (top) with its mutant (bottom) in which a trailing sequence of the assembly plan has been replaced. Each object was printed from right to left.
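The variation operators can be sketched directly on this list-of-instructions genotype. The trailing-sequence replacement below mirrors the mutation shown in Figure 3; the per-instruction mutation rate and the choice of one-point crossover are illustrative assumptions rather than parameters reported for EvoFab.

```python
import random

INSTRUCTIONS = ["in", "out", "left", "right", "up", "down", "extrude"]

def replace_tail(plan):
    """Replace a trailing sequence of the plan with fresh random instructions,
    as in the mutant shown in Figure 3."""
    cut = random.randint(1, len(plan) - 1)
    tail = [random.choice(INSTRUCTIONS) for _ in range(len(plan) - cut)]
    return plan[:cut] + tail

def point_mutate(plan, rate=0.05):
    """Resample each instruction independently with a small probability."""
    return [random.choice(INSTRUCTIONS) if random.random() < rate else op
            for op in plan]

def one_point_crossover(a, b):
    """Swap the tails of two assembly plans at a single random cut point."""
    cut = random.randint(1, min(len(a), len(b)) - 1)
    return a[:cut] + b[cut:], b[:cut] + a[cut:]
```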
It is worth emphasizing that assembly plans are an indirect encoding – the object which results from executing a particular assembly plan can be considered its phenotype. This layer of indirection gives rise to some interesting consequences, the most significant of which is that there is no longer a 1:1 mapping from genotype to phenotype (as there would be in a direct encoding, such as a simple bit-string GA). Rather, there is an N:1 mapping: physically identical phenotypes can arise from distinct underlying genotypes. In fact, when one takes into account the stochastic nature of the fabrication process, it becomes an N:N mapping, meaning that a single genotype can produce slightly different phenotypes when executed multiple times. We explore the consequences of this in our discussion below.
3
Proof of Concept: Interactive Evolution of Shape
We can demonstrate the potential of evolutionary fabrication using a relatively simple Interactive Genetic Algorithm (IGA). Based upon Dawkins’ idea of the “Blind Watchmaker” [3], IGAs replace an objective and automated fitness function with human-based evaluation. IGAs have been successful in a wide range of evolutionary design tasks, most notably in Karl Sims’ seminal work on artificial creatures [12], [11]. We chose as a design task the simple evolution of circular 2-dimensional shapes. A population of size 20 was initialized with random assembly plans, each of which was 20 instructions long. Individuals were then printed onto the platter in batches of four. Once the population was completely printed, the 10 best (most circular) individuals were selected as parents for the subsequent generation. New children were created using cut-and-splice crossover [16]. Each platter of four individuals took roughly 10 minutes to print, corresponding to slightly less than an hour of print time per generation. Figure 4 compares sample phenotypes from the first and ninth generations. After only a small number of generations, the population is already beginning to converge onto more circular shapes.
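The structure of one generation of this interactive GA can be summarised in the following sketch, which uses the population size, genotype length, batch size, and cut-and-splice crossover described above. The callables print_batch and pick_best stand in for the physical printing step and the human evaluation, and any breeding detail beyond cut-and-splice is our own assumption.

```python
import random

OPS = ["in", "out", "left", "right", "up", "down", "extrude"]

def cut_and_splice(a, b):
    """Cut-and-splice crossover [16]: independent cut points in each parent,
    so the children may differ in length from their parents."""
    i = random.randint(1, len(a) - 1)
    j = random.randint(1, len(b) - 1)
    return a[:i] + b[j:], b[:j] + a[i:]

def next_generation(population, print_batch, pick_best):
    """One IGA generation: print phenotypes four at a time, let the user pick
    the ten most circular, then breed a new population from those parents."""
    for k in range(0, len(population), 4):
        print_batch(population[k:k + 4])        # physically print four plans
    parents = pick_best(population, 10)          # human-in-the-loop selection
    children = []
    while len(children) < len(population):
        children.extend(cut_and_splice(*random.sample(parents, 2)))
    return children[:len(population)]

# A population of 20 random plans, each 20 instructions long, as in the text.
population = [[random.choice(OPS) for _ in range(20)] for _ in range(20)]
```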
Fig. 4. Sample individuals from the first (left) and ninth (right) generations of the interactive evolution in which the user is selecting for roundness of shapes. After relatively few generations the population is beginning to converge onto more circular shapes.
4
Discussion
The results presented in our proof-of-concept evolutionary fabrication above are enough to lend credence to the potential of EvoFab for exploring even more interesting and complex design domains. Before discussing these applications in more detail, it is first worth discussing some of the implications and limitations of this approach.
4.1 Fabrication and Epigenetic Traits
One of the more fascinating consequences of embodied evolutionary fabrication is the capacity for the system as a whole to produce epigenetic traits – that is, phenotypic characteristics which arise purely from the mechanics of assembly, and have no underlying genotypic source. Consider for example the phenotypes in Figure 5, in which the user was selecting for shapes resembling the letter ’A’. At a glance one would assume that the “cross” of the A shapes was produced by an explicit set of operations within the underlying genotypes. In fact, they are instead caused by the print head “dragging” an extraneous thread of print material across the phenotype as it moves between print regions. Explorations into simulated evolutionary fabrication have suggested that there may be some interesting benefits to this kind of phenomenon [8]. Consider for instance a print process which extruded two separate subassemblies and then used the syringe head to dynamically assemble them into a larger structure. We hope to use EvoFab to further explore the consequences in embodied systems as well.
4.2 Material Use and Conservation
A natural consequence of evolutionary fabrication is that a significant amount of print material is consumed over the multiple generations of phenotype evaluations. While silicone elastomer is less expensive than the plastics used in high-end commercial rapid prototypers, the costs still add up. In order to address this issue, we are exploring a number of alternative and recyclable materials such as wax and even ice [2]. Ideally, once they are evaluated for fitness, phenotypes could be reduced to their original material for reuse in a subsequent print cycle.
4.3 Design Domains
The domains in which Evolutionary Fabrication holds the most promise are those which are too complex or too inscrutable to realistically simulate. One such area is the design of flexible and dynamical systems, such as the morphology of completely soft robots. In light of recent natural disasters in Haiti and Chile, there is a compelling need for more versatile and robust search and rescue robots. Imagine, for instance, a machine that can squeeze through holes, climb up walls, and flow
around obstacles. Though it may sound like the domain of science fiction, thanks to modern advances in materials such as polymers and nanocomposites, such a “soft robot” is becoming an increasing possibility. Unfortunately, soft and deformable bodies can possess near-infinite degrees of freedom, and elastic pre-stresses mean that any local perturbation causes a redistribution of forces throughout the structure. As a consequence, soft structures are incredibly difficult to realistically simulate, even in non-dynamic regimes. Furthermore, there are no established principles or purely analytical approaches to the problem of soft mechanical design and control – instead the design task involves significant amounts of human-based trial and error. EvoFab allows the power of evolutionary design techniques to be applied to this compelling and vital design domain. Soft bodies could be evolved and evaluated in situ, without resorting to simulation or post-hoc methods. The results of such endeavors could have significant consequences not just for search-and-rescue, but also in biomedical applications such as endoscopy.
Fig. 5. Example of epigenetic traits in a set of phenotypes evolved for likeness to the letter ’A’. In each case, the “crosspiece” which connects the shorter leg to the longer leg is not caused by a genotypic sequence, but is instead caused by the print head dragging extra material across the phenotype as it finishes one print job and moves to the adjacent print region.
References 1. Al-Sakran, S.H., Koza, J.R., Jones, L.W.: Automated re-invention of a previously patented optical lens system using genetic programming. In: Keijzer, M., Tettamanzi, A.G.B., Collet, P., van Hemert, J., Tomassini, M. (eds.) EuroGP 2005. LNCS, vol. 3447, pp. 25–37. Springer, Heidelberg (2005) 2. Barnett, E., Angeles, J., Pasini, D., Sijpkes, P.: Robot-assisted rapid prototyping for ice structures. In: IEEE Int. Conf. on Robotics and Automation (2009)
3. Dawkins, R.: The Blind Watchmaker. W. W. Norton & Company, Inc. (September 1986) 4. Funes, P., Pollack, J.B.: Evolutionary body building: Adaptive physical designs for robots. Artificial Life 4(4), 337–357 (1998) 5. Hornby, G.S., Pollack, J.B.: The advantages of generative grammatical encodings for physical design. In: Proceedings of the 2001 Congress on Evolutionary Computation CEC 2001, COEX, World Trade Center, 159 Samseong-dong, Gangnam-gu, Seoul, Korea, 27-30 2001, pp. 600–607. IEEE Press, Los Alamitos (2001) 6. Lohn, J.D., Hornby, G.S., Linden, D.S.: An Evolved Antenna for Deployment on NASA’s Space Technology 5 Mission. In: O’Reilly, U.-M., Riolo, R.L., Yu, T., Worzel, B. (eds.) Genetic Programming Theory and Practice II. Kluwer, Dordrecht (2005) 7. Pollack, J.B., Lipson, H., Hornby, G., Funes, P.: Three generations of automatically designed robots. Artificial Life 7(3), 215–223 (Summer 2001) 8. Rieffel, J.: Evolutionary Fabrication: the co-evolution of form and formation. PhD thesis, Brandeis University (2006) 9. Rieffel, J., Pollack, J.: The Emergence of Ontogenic Scaffolding in a Stochastic Development Environment. In: Deb, K., et al. (eds.) GECCO 2004. LNCS, vol. 3102, pp. 804–815. Springer, Heidelberg (2004) 10. Rieffel, J., Valero-Cuevas, F., Lipson, H.: Automated discovery and optimization of large irregular tensegrity structures. Computers & Structures 87(5-6), 368–379 (2009) 11. Sims, K.: Interactive evolution of dynamical systems. In: First European Conference on Artificial Life. MIT Press, Cambridge (1991) 12. Sims, K.: Evolving 3d morphology and behavior by competition. In: Brooks, R., Maes, P. (eds.) Artificial Life IV Proceedings, pp. 28–39. MIT Press, Cambridge (1994) 13. Thompson, A.: Silicon evolution, Stanford University, pp. 444–452. MIT Press, Cambridge (1996) 14. Vilbrandt, T., Malone, E., Lipson, H., Pasko, A.: Universal desktop fabrication. Heterogenous Objects Modeling and Applications, 259–284 (2008) 15. Watson, R.A., Ficici, S.G., Pollack, J.B.: Embodied evolution: Embodying an evolutionary algorithm in a population of robots. In: Angeline, P.J., Michalewicz, Z., Schoenauer, M., Yao, X., Zalzala, A. (eds.) Proceedings of the Congress on Evolutionary Computation, Mayflower Hotel, Washington D.C., USA, 6-9 1999, vol. 1, pp. 335–342. IEEE Computer Society Press, Los Alamitos (1999) 16. Whitley, D., Beveridge, J.R., Guerra-Salcedo, C., Graves, C.: Messy genetic algorithms for subset feature selection. In: International Conference on Genetic Algorithms, ICGA 1997 (1997)
Evolving Physical Self-assembling Systems in Two-Dimensions
Navneet Bhalla1, Peter J. Bentley2, and Christian Jacob1,3
1 Dept. of Computer Science, Faculty of Science, University of Calgary, 2500 University Drive N.W., Calgary, Alberta, Canada, T2N 1N4 [email protected]
2 Dept. of Computer Science, Faculty of Engineering Sciences, University College London, Malet Place, London, United Kingdom, WC1E 6BT [email protected]
3 Dept. of Biochemistry & Molecular Biology, Faculty of Medicine, University of Calgary, 3280 Hospital Drive N.W., Calgary, Alberta, Canada, T2N 4Z6 [email protected]
Abstract. Primarily top-down design methodologies have been used to create physical self-assembling systems. As the sophistication of these systems increases, it will be more challenging to deploy top-down design, due to self-assembly being an algorithmically NP-complete problem. Alternatively, we present a nature-inspired approach incorporating evolutionary computing, to couple bottom-up construction (self-assembly) with bottom-up design (evolution). We also present two experiments where evolved virtual component sets are fabricated using rapid prototyping and placed on the surface of an orbital shaking tray, their environment. The successful results demonstrate how this approach can be used for evolving physical self-assembling systems in two-dimensions. Keywords: self-assembly, evolutionary computing, rapid prototyping.
1
Introduction
The plethora of complex inorganic and organic systems seen throughout nature is the result of self-assembly. Complex self-assembled entities emerge from decentralised components governed by simple rules. Natural self-assembly is dictated by the morphology of the components and the environmental conditions they are subjected to, as well as the physical and chemical properties of the components and their environment – their information [1] [2]. Components, their environment, and the interactions among them form a system, which can be described by a set of simple rules. Coupled with this bottom-up construction process (self-assembly), bottom-up design is at work in living organisms, where the process of evolution acts through their genetic rule sets – their DNA. Through transcription to RNA and translation to proteins, these rules are mapped to physical shapes, encapsulating the central dogma of molecular biology [3]. Proteins, the resulting self-assembling shapes, are the primary building blocks of living organisms.
However, designing artificial, physical, self-assembling systems remains an elusive goal. Based on relevant work [4], primarily top-down design methodologies have been used to create physical self-assembling systems. As the sophistication of these systems increases, it will be more challenging to deploy top-down design, due to self-assembly being an algorithmically NP-complete problem [5]. How to design a set of physical components and their environment, such that the component set self-assembles into a target structure, remains an open problem. Evolutionary Computing (EC) [6] is well-suited for such problems. In pursuit of addressing this open problem, we present the incorporation of EC into the three-level approach [7] [8] for designing physical self-assembling systems. The three-level approach comprises specifying a set of self-assembly rules, modelling these rules to determine the outcome of a specific system in software, and translating to a physical system by mapping the set of self-assembly rules using physically encoded information. This is consistent with the definition of self-assembly [9], refined here as a process that involves components that can be controlled through their proper design and their environment, and which are adjustable (components can adjust their position relative to one another). Furthermore, the three-level approach is inspired by the central dogma of molecular biology, in being able to map a set of self-assembly rules directly to physical shapes. This is beneficial in that no knowledge of the target structure’s morphology is required, only its functionality. As a result, incorporating EC into the three-level approach is appropriate. The next section presents background material on which our self-assembly model and evolutionary approach are based. Next, an overview of the three-level approach is presented along with details of an example incorporating EC. Two experiments follow which demonstrate the creation of evolved component sets and their translation, via physically encoded information, to physical systems using rapid prototyping1. We conclude by summarising how this work provides, as a proof of concept, a means of evolving physical self-assembling systems.
2
Background
The abstract Tile Assembly Model (aTAM) [10] was originally developed to model the self-assembly of molecular structures, such as DNA Tiles [11], on a square lattice. These tiles use interwoven strands of DNA to create the square body of a tile (double-stranded) with single strands extending from the edges of the tiles. A tile type is defined by binding domains on the North, West, South, and East edges of a tile. A finite set of tile types is specified (which are in infinite supply in the model). At least one seed tile must be specified to start the self-assembly process. Tiles cannot be rotated or reflected. There cannot be more than one tile type that can be used at a particular assembly location in the growing structure (although the same binding domain is permitted on more than one tile type). All tiles are present in the same environment, a one-pot-mixture.
1 Supplementary resources, including all CAD files and videos, pertaining to the experiments can be found at www.navneetbhalla.com/resources.
Tiles can only bind together if the interactions between binding domains are of sufficient strength (provided by a strength function), as determined by the temperature parameter. The sum of the binding strengths of the edges of a tile must meet or exceed the temperature parameter. For example, if the temperature parameter is two, at least two strength-one bonds must be achieved to assemble a tile, i.e. the temperature parameter dictates co-operative bonding. The seed tile is first placed on the square lattice environment. Tiles are then selected one at a time, and placed on the grid if the binding strength constraints are satisfied. The output is a given shape of fixed size, if the model can uniquely construct it. aTAM has been used to study algorithmic self-assembly complexity, the Minimum Tile Set Problem (MTSP) and the Tile Concentration Problem (TCP) [5]. The goal of MTSP is to find the lowest number of tile types that can uniquely self-assemble into a target structure. The goal of TCP is to find the relative concentrations of tile types that self-assemble into the target structure using the fewest assembly steps. MTSP is an NP-complete problem for general target structures. The algorithmic complexity of TCP has only been calculated for specific classes of target structures. EC has been applied to self-assembly based on aTAM. In [12], EC was used to evolve (in simulation) different co-operative bonding mechanisms between two to five tiles, to create a ten by ten square.
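To make the temperature rule concrete, the sketch below checks whether a tile may attach at a lattice site by summing the binding strengths on its abutting edges. The dictionary-based tile representation and the unit-strength matching function are illustrative choices on our part, not part of the original aTAM formulation.

```python
def strength(a, b):
    """Illustrative strength function: equal, non-empty domains bind with strength 1."""
    return 1 if a == b and a != "-" else 0

def can_attach(tile, neighbours, temperature=2):
    """aTAM attachment rule: the summed binding strength over edges that abut
    already-placed tiles must meet or exceed the temperature parameter.

    `tile` maps an edge ("N", "W", "S", "E") to its binding domain;
    `neighbours` maps an edge to the domain exposed by the neighbour, or None.
    """
    total = sum(strength(domain, neighbours[edge])
                for edge, domain in tile.items()
                if neighbours.get(edge) is not None)
    return total >= temperature

# A tile with domain "x" on two edges attaches at temperature 2 only when both
# of those edges meet matching neighbours (co-operative bonding).
tile = {"N": "x", "W": "x", "S": "-", "E": "-"}
print(can_attach(tile, {"N": "x", "W": None, "S": None, "E": None}))  # False
print(can_attach(tile, {"N": "x", "W": "x", "S": None, "E": None}))   # True
```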
3
Three-Level Approach and Evolution
We extend aTAM to better suit the components (tiles) used in our systems. We also physically realise the results achieved by our EC implementation. The self-assembly design problem we are concerned with is a combination of MTSP and TCP, as well as several other constraints. These three differences are expanded upon in presenting our incorporation of EC into the three-level approach [7] [8]. The three-level approach provides a high-level description of designing self-assembling systems via physically encoded information [1] [2]. The three phases included in our approach are: (1) definition of rule set, (2) virtual execution of rule set, and (3) physical realisation of rule set (Fig. 1). The three-level approach provides a bottom-up method to create self-assembling systems. This is achieved by being able to directly map a set of self-assembly rules to a physical system. Here we present the addition of EC to evolve the level one rules. Results from the level two modelling are used for evaluation by the evolutionary algorithm (Fig. 1). After running the evolutionary algorithm, if the desired results are achieved, the level one rules can be mapped to a physical system.
3.1 Level One: Definition of Rule Set
To demonstrate how the three-level approach and EC can be used, the following example implementation was constructed. Its purpose is to show how to create a set of physical, two-dimensional, components that self-assemble into a set of target structures, created in parallel. Self-assembly rules are divided into three categories, which define a system: component, environment, and system rules.
Fig. 1. Three-level approach (left), and incorporating EC (right). [Figure: flowcharts of Level 1: Definition of Rule Set, Level 2: Virtual Execution of Rule Set, and Level 3: Physical Realisation of Rule Set; in the EC variant, the level two modelling results are evaluated by evolutionary computing and, if the desired result is achieved, the rule set is mapped to physically encoded information.]
Component rules specify primarily shape and information. Components are similar in concept to DNA Tiles [11]. Abstractly, components are all squares of unit size. Each edge of a component serves as an information location, in a four-point arrangement, i.e. North-West-South-East. Information is abstractly represented by a capital letter (A to H). If no information is associated with an information location (a neutral site), the dash symbol (−) is used. The spatial relationship of this information defines a component type (Fig. 2). Environment rules specify environmental conditions, such as the temperature of a system and boundary constraints. The temperature determines the threshold that the assembly protocol must satisfy in order for assembly bonds to occur. Components are confined by the environment boundary, but are permitted to translate and rotate in two dimensions, and to interact with one another and their environment. However, components are not permitted to be reflected. System rules specify the quantity of each component type, component–component information interactions (i.e. assembly interactions), and component–environment information interactions (i.e. transfer of energy and boundary interactions). In this implementation, there are two types of system interaction rules, referred to as fits rules and breaks rules. Abstractly, if two pieces of complementary information come into contact (i.e. they fit together, A fits B), they assemble. This rule type is commutative, meaning if A fits B, then B fits A. Abstractly, if two assembled pieces of information experience a temperature above a certain threshold, their assembly breaks.
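A minimal sketch of these level-one rules in Python is given below: a component type is the (North, West, South, East) arrangement of information letters, the four planar rotations are cyclic shifts of that tuple, and the fits relation is stored symmetrically. The concrete fits pairs are taken from Table 1 below; the data-structure choices are our own.

```python
# A component type: information letters at its (North, West, South, East) edges;
# "-" marks a neutral site.
example_component = ("-", "B", "G", "-")

# Commutative fits rules (A fits B implies B fits A); the pairs follow Table 1.
FITS = {frozenset(p) for p in [("A", "B"), ("C", "D"), ("E", "F"), ("G", "H")]}

def fits(x, y):
    """Two pieces of information assemble if a fits rule relates them."""
    return frozenset((x, y)) in FITS

def rotations(component):
    """The four planar rotations of a component; since N-W-S-E is a cyclic
    order around the square, rotations are cyclic shifts (no reflections)."""
    n, w, s, e = component
    return [(n, w, s, e), (w, s, e, n), (s, e, n, w), (e, n, w, s)]

print(fits("A", "B"), fits("B", "A"), fits("A", "C"))   # True True False
print(rotations(example_component))
```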
3.2 Level Two: Virtual Execution of Rule Set
At level two, a self-assembly rule set is mapped to an abstract model. We present an extension to aTAM [10], the concurrent Tile Assembly Model (cTAM) [7]. cTAM is a modelling technique better suited to the type of physical self-assembling systems we use for demonstration purposes. There are five features to cTAM. (1) There are no seed tiles, meaning any two compatible tiles can start the self-assembly process. (2) Tiles can self-assemble into multiple substructures concurrently. (3) Tiles can be rotated, but cannot be reflected. (4) More than
Fig. 2. Example cTAM steps (left) and assembly violations (right). [Figure: four assembly steps (Step 1–4) on a square lattice, with neutral sites marked, and example assembly violations labelled Boundary Violation, Uncomplementary Information, and No Assembly Path.]
one tile type can be used at a particular assembly location in a growing structure. (5) All tiles are present in the same environment, a one-pot-mixture. In this implementation, the temperature parameter is set to one. The initial set of tiles in cTAM is a multiset (type and frequency). In cTAM (Fig. 2), a single assembly operation is applied at a time, initialised by selecting a single tile/substructure with an open assembly location at random. If no other tile or substructure has an open complementary information location, then the location on the first tile/substructure is labelled unmatchable. If there are tiles/substructures with open complementary information locations, all those tiles/substructures are put into an assembly candidate list. Tiles and substructures are selected at random (from the assembly candidate list) until a tile/substructure can be successfully added. If no such tile/substructure can be added, due to an assembly violation (Fig. 2), then the location is labelled unmatchable. If a tile/substructure can be added, the open assembly locations on the two tiles/substructures are updated, and labelled match (all applicable assembly locations must match when adding two substructures). The algorithm repeats, and halts when all assembly locations are set to match or unmatchable. At the conclusion of the algorithm, the resulting structures are placed in a single grid environment to determine if any environment boundary violations occur. A post-evaluation of environment constraints is sufficient for this implementation, as we are more concerned with the set of self-assembled structures than environmental constraints.
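The assembly loop just described can be summarised by the following skeleton. The helpers for enumerating open locations, finding complementary candidates, and testing an attachment for violations are passed in as functions, because their details depend on how the grid and substructures are represented; this skeleton, including the Site record, is our own paraphrase of the algorithm rather than the authors' code.

```python
import random
from dataclasses import dataclass
from typing import Optional

@dataclass
class Site:
    """An open assembly location exposing one piece of information."""
    info: str
    label: Optional[str] = None   # None (open), "match", or "unmatchable"

def ctam_assemble(pool, open_sites, candidates_for, try_attach):
    """Repeat single assembly operations until every location is resolved."""
    while True:
        unresolved = [s for s in open_sites(pool) if s.label is None]
        if not unresolved:
            return pool                          # halt: all match or unmatchable
        site = random.choice(unresolved)         # random tile/substructure
        candidates = candidates_for(site, pool)  # open complementary locations
        random.shuffle(candidates)
        for other in candidates:
            # try_attach fails on assembly violations, merges the substructures
            # otherwise, and updates any further locations that become matched.
            if try_attach(site, other, pool):
                site.label = other.label = "match"
                break
        else:
            site.label = "unmatchable"
```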
3.3 Level Three: Physical Realisation of Rule Set
Components are mapped to their physical equivalents using rapid prototyping (Fig. 3). Physical components are defined by their design space, the set of physically feasible designs. The design space is a combination of a shape space and an assembly protocol space. A key-lock-neutral concept defines the shape space. A 3-magnetic-bit encoding scheme defines the assembly protocol space. Either one or two magnets are used in each position. Magnets are placed within the sides of the components. The magnets are not flush with the surface, creating an air gap, so that components are adjustable [9] and bonding is selective. Lock-to-lock interactions are guaranteed to never occur. Therefore, this shape characteristic is used to manipulate the designation of the 3-magnetic-bit encodings to keys and locks. One magnet is placed in each position designated to a key,
Fig. 3. Left to right: physical component shape space (solid thick lines represent the base shape, dashed lines represent neutral sites, and thin solid lines represent key shapes), physical component specifications (top and right view in mm), and an example physical component (blue/red paint on top represents magnetic north/south patterns)
and two magnets are placed in each position designated to a lock. This ensures strong binding between keys and locks, and weak binding in key-to-key interactions. Weak binding can be prevented with an appropriate environment temperature setting. Therefore, key-to-key matching errors can be avoided, and key-to-lock matching errors can be reduced through proper designation of the 3-magnetic-bit encodings to keys and locks (Table 1).
Table 1. Key/lock designations to magnetic patterns with abstract label, and interaction rules (’→’ transition, ’+’ assembly, ’;’ disassembly, and ’T2’ temperature 2)

Key/Lock  3-magnetic-bit  Label  Fits Rule        Breaks Rule
Lock      000             A      A fits B → A+B   T2 breaks A+B → A;B
Lock      110             C      C fits D → C+D   T2 breaks C+D → C;D
Lock      011             E      E fits F → E+F   T2 breaks E+F → E;F
Lock      101             G      G fits H → G+H   T2 breaks G+H → G;H
Key       111             B      B fits A → B+A   T2 breaks B+A → B;A
Key       001             D      D fits C → D+C   T2 breaks D+C → D;C
Key       100             F      F fits E → F+E   T2 breaks F+E → F;E
Key       010             H      H fits G → H+G   T2 breaks H+G → H;G
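Expressed in code, the fits relation realised by these magnetic patterns is simply bitwise complementarity between a key and a lock, as the sketch below shows. The function and variable names are ours, and only the pairings listed in Table 1 are assumed.

```python
# 3-magnetic-bit patterns from Table 1: locks carry two magnets per position,
# keys one, and each "X fits Y" pair has bitwise-complementary patterns.
PATTERN = {"A": "000", "C": "110", "E": "011", "G": "101",   # locks
           "B": "111", "D": "001", "F": "100", "H": "010"}   # keys
LOCKS = {"A", "C", "E", "G"}

def fits_magnetically(x, y):
    """A key and a lock fit when every bit position differs between them."""
    complementary = all(a != b for a, b in zip(PATTERN[x], PATTERN[y]))
    return complementary and ((x in LOCKS) != (y in LOCKS))

assert fits_magnetically("A", "B") and fits_magnetically("H", "G")
assert not fits_magnetically("B", "D")   # key-to-key pairs never fully match
```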
3.4 Evolving Self-assembly Rule Sets
The objective of the evolutionary algorithm is to search for the best component set (type and concentration) able to self-assemble into a single target structure. Here we focus on a single structure, as a first step, since being able to effectively evaluate the result of many diverse structures is challenging. Environment and system (fits and breaks) rules are fixed. The following is an overview of the evolutionary algorithm used, genotype and phenotype representations, fitness function, and selection, crossover, and genetic operators used. A generational evolutionary algorithm [6] is used. The evolutionary unit, gene, is a single component. A databank of gene sequences (linear representation of
the North-West-South-East edges, using A to H and the − symbol) is used to identify/compare genes. There are 6,561 total and 1,665 unique genes (when considering two-dimensional shape and rotations). Elitism is used, where the top 10% of individuals are copied to the next generation. An individual’s genotype representation is a variable-length list of genes. At least two genes define a genotype (since this is the minimum for self-assembly to occur). An individual’s phenotype representation is the resulting set of self-assembled structures. A single genotype representation may have more than one phenotype representation, depending on the set of components and assembly steps. Therefore, each individual (genotype) is evaluated three times, at each generation, to help determine the fitness of an individual. A multi-objective fitness function is used to evaluate each individual. The seven objectives can be categorised into evaluating a general and refined solution (Fig. 4). The general solution has five objectives: (1) area (A), (2) perimeter (P), (3) Euler (E), (4) z-axis, and (5) matches. Each of these objectives is used to achieve the shape of the target structure. The area, perimeter, and Euler (connectivity of a shape) are calculated using 2D Morphological Image Analysis [13]. The second moment of inertia in the z-axis [14] is calculated to identify similar, but rotated, structures. To distinguish between reflected structures (which are not permitted), the number of matching components between a self-assembled structure and the target structure is calculated. A refined solution is accounted for by using two objectives: (6) locations and (7) error. We consider a refined solution as one that minimises the number of remaining open assembly locations and potential assembly errors (due to magnet interactions). The combination of these two objectives also reduces the number of unique components required. Each objective is normalised, using the highest and lowest values from a generation. For objectives one to five (i), the average normalised objective (ANO_i) over three cTAM evaluations is calculated and compared to the target objective (TO_i) value. For objective six, the normalised average over the three cTAM evaluations (ANO_6) is calculated. For objective seven, the normalised objective (NO_7) is calculated with respect to a genotype. The objectives are then weighted to give the final fitness score F (Equation 1). The weights were selected from preliminary experiments conducted by the authors.

F = 0.9 × Σ_{i=1}^{5} |TO_i − ANO_i| + 0.1 × ANO_6 + 0.1 × NO_7        (1)
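Equation (1) translates directly into a small helper; the argument names are ours, and the function assumes that all objective values have already been normalised per generation as described above.

```python
def fitness(target_objs, avg_norm_objs, avg_norm_locations, norm_error):
    """Weighted multi-objective fitness of Equation (1); lower is better.

    target_objs and avg_norm_objs hold the five normalised shape objectives
    (area, perimeter, Euler, z-axis moment, matches), the latter averaged over
    three cTAM evaluations; the last two arguments are objectives six (open
    assembly locations) and seven (potential magnetic assembly errors).
    """
    shape_term = sum(abs(t - a) for t, a in zip(target_objs, avg_norm_objs))
    return 0.9 * shape_term + 0.1 * avg_norm_locations + 0.1 * norm_error
```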
The fitness scores for each individual are used during selection. Roulette-wheel selection is used to select two parents (favouring lowest fitness scores). The two parents, using a variable-length crossover operator, are used to create two children. Each common gene (determined by the gene databank) between the two parents is copied to each child. Each uncommon gene, for example the gene from parent one has a 90% probability of being copied to child one (likewise for parent two and child two). After crossover is performed to create two children,
(Figure 4 appears here: example structures I–IV, annotated with component information letters and their 3-magnetic-bit codes, e.g. B (1,1,1), D (0,0,1), F (1,0,0), H (0,1,0). The quantities ns (number of squares), ne (number of edges) and nv (number of vertices) give A = ns, P = −4ns + 2ne and E = ns − ne + nv; panel IV also shows the sliding-window matrix of magnetic matching errors.)
Fig. 4. Fitness objective examples: structure I (A = 5, P = 12, and E = 1); structure II has the same second moment of inertia for its reflected equivalent; number of matches between reflected structure II is 3 (III); number of open locations is 2 (black circles, IV); a sliding window technique is used (matrix) as a sum of magnetic errors (odd number of magnets must match at each position along the sliding window) and is applied to all potential two-component key-to-key interactions in a system, e.g. 2 in IV
the genetic operators duplication, deletion, and mutation are applied to each child. There is a 10% probability that a single gene, chosen at random, is duplicated, and likewise a 10% probability that a randomly chosen gene is deleted. For each information location in a gene, there is a 10% probability of it being mutated (with equal probability over A to H and −).
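Putting the breeding step together, a compact sketch of the variable-length crossover and the three genetic operators is shown below. The handling of duplicate genes in the common-gene test and the symbol set used for mutation are simplifying assumptions on our part.

```python
import random

SYMBOLS = [chr(c) for c in range(ord("A"), ord("H") + 1)] + ["-"]

def crossover(parent1, parent2, keep=0.9):
    """Variable-length crossover: common genes are copied to both children, and
    each uncommon gene is kept by its own parent's child with probability 0.9."""
    common = [g for g in parent1 if g in parent2]
    child1 = common + [g for g in parent1 if g not in parent2 and random.random() < keep]
    child2 = common + [g for g in parent2 if g not in parent1 and random.random() < keep]
    return child1, child2

def apply_operators(child, p=0.1):
    """Duplication, deletion, and per-site mutation, each applied with 10% probability."""
    genes = list(child)
    if genes and random.random() < p:
        genes.append(random.choice(genes))        # duplicate a random gene
    if len(genes) > 2 and random.random() < p:
        genes.pop(random.randrange(len(genes)))   # delete one, keeping at least two genes
    return [tuple(random.choice(SYMBOLS) if random.random() < p else site for site in gene)
            for gene in genes]
```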
4
Experiments and Results
We present two experiments to demonstrate how self-assembling systems can be evolved in two-dimensions. Our hypothesis was that, given the attributes of a target structure, an evolutionary algorithm could be used to evolve a set of component rules, which can be mapped to a physical system consisting of an environment containing components that are able to self-assemble into the target structure. The three-level approach with the addition of EC was used to test our hypothesis. Two desired entities (Fig. 5) were specified: cross-shape (experiment 1) and z-shape (experiment 2). For each experiment, enough components are supplied to create up to three target structures. Five trials are run for each experiment. A virtual trial (level two) is evaluated to be successful if all three target structures are created. A physical trial (level three) is evaluated to be successful if at least one target structure is created. The experimental procedure and results are described in terms of the three phases corresponding to the three-level approach.
4.1 Level One: Definition of Rule Set for Experiments
These two target structures were chosen since they offer degrees of complexity in terms of the number of components and their concentration, and symmetric/asymmetric features in the target structures. Consequently, the two target structures cannot be created by pattern formation exclusively. Therefore, it is appropriate for determining if the information encoded in the components is sufficient to achieve the target structures by self-assembly. The independent variable is the set of components. The set of components is defined by their type and
their concentration. The dependent variable is the resulting self-assembled structures. For each experiment, an evolved component set is generated along with a randomly generated component set, in order to test the independent variable. The evolutionary algorithm used 5,000 generations, with a population size of 50 individuals, for each run. The initial individual (genotype) length was set to the required number of components to create one target structure. Fig. 5 shows the evolutionary algorithm results. Five runs were conducted for each experiment. For experiment one, the two optimal solutions were achieved. The second solution was chosen for these experiments, as components from previous experiments could be reused [7]. For experiment two, the single optimal solution was achieved. For the randomly generated component set, components were created by selecting, with uniform probability, the information assigned to each site. The number of components randomly generated was equal to the required number of components to create one target structure. A summary of the component rules, for each experiment, is provided in Fig. 5. The number of components specified (evolved and random) was multiplied by three in order to create the maximum number of target structures for the experiments.
(Figure 5 appears here: panels I and II show the two target structures, the cross-shape and the z-shape, as arrangements of components with their edge information; panels III, IV and V present the evolutionary results, including the component sets listed below.)

Experiment    Component Set
1 Evolved     (A,A,A,A) × 3, (-,B,-,-) × 12
1 Random      (-,-,B,G) × 3, (-,D,E,E) × 3, (C,-,-,C) × 3, (C,E,-,-) × 3, (-,F,B,H) × 3
2 Evolved     (-,B,G,-) × 3, (-,-,-,A) × 6, (H,-,-,B) × 3
2 Random      (G,H,H,-) × 3, (-,A,-,-) × 3, (-,H,-,-) × 3, (-,-,E,A) × 3
Fig. 5. Target structures (I and II); evolutionary results (III, IV and V); component sets for experiments (represented as ’(North, West, South, East) × #’, where the directions refer to component information locations and the # symbol represents quantity), for each evolved and randomly generated component set
4.2
Level Two: Virtual Execution of Rule Set for Experiments
cTAM was used to virtually evaluate the ability of each self-assembly rule set (evolved or random) to create its respective target structure. Although cTAM is used by the evolutionary algorithm, here it is used to verify the creation of multiple target structures. The level two experimental set-up and results are provided. Experimental Set-up. The component rules from Fig. 5 were mapped to an abstract representation for cTAM. Each component’s shape was a unit square. The size of the environment was represented as a ratio between the size of the base component shape (square with neutral sites at all four information locations) and the boundary of the environment. Because the environment size represents height and width, the environment size used in cTAM was ten units by ten units, for these experiments. Since cTAM selects tiles/substructures at random to step through the self-assembly process, a different random seed was used to initialise cTAM for each trial. Five trials were conducted for each experiment.
Experimental Results. Each evolved component set successfully created three of its applicable target structures. These results show that even without a component acting as a seed, it is still possible to successfully create target structures. Furthermore, these results show that it is possible to create multiples of the same target structure, when appropriate component information is used. In contrast, none of the randomly generated component sets successfully created even one target structure in either experiment. The same reason applies to both random sets. For the first random set, the first and last component types will form substructures that are independent from substructures formed by the second, third and fourth component types. Likewise for the second random set, the first and third component types will form substructures that are independent from substructures formed by the second and fourth component types.
4.3 Level Three: Physical Realisation of Rule Set for Experiments
With the success of each system using an evolved component set at level two, a level three translation was performed to test if the translated component set of each system could self-assemble into its respective target structure. A level three translation was not performed on the systems using a randomly generated component set, since they were not successful. Experimental Set-up. Component mapping followed Table 1. Components were fabricated using an Eden 333 Polyjet rapid prototyping machine, using Vero Grey resin. Neodymium (NdFeB) disc magnets (1/16” × 1/32”, diameter × radius; grade N50) were inserted into the components. Blue/red paint (north/south) was used to mark magnetic patterns. Mapping for the environment size was done in accordance with the base component size, to specify the dimensions of the circular environment tray. The tray was fabricated using a Dimensions Elite rapid prototyping machine, using ABS plastic (sparse-fill option was used to create a rough surface texture). The outer radius of the tray is 135 mm and the inner radius is 125 mm, while the outer wall height is 9 mm and the inner wall height is 6 mm. The tray was mounted to a Maxi Mix II Vortex Mixer (using a tray mounting bracket, also fabricated using the Dimensions printer). A tray lid was cut using a Trotec Speedy 300 Laser Engraver laser cutting machine, using 2 mm clear acrylic sheet. The tray lid was secured to the tray using polycarbonate screws and wing nuts. Materials/methods details are given in [7]. Each physical trial followed seven steps [7]. (1) Set the continuous speed control on the Maxi Mix II Vortex mixer to 1,050 rpm. This speed was found to create an appropriate shaking level (environment temperature) to maintain fits rules, and to mostly break partially matched magnetic codes. (2) Secure the mixer to a table, using a 3” c-clamp and six hex nuts (to help secure the c-clamp to the back of the mixer). (3) Randomly place components on the surface of the tray (trying to ensure that complementary binding sites on the components are not in-line with each other). (4) Secure the tray lid. (5) Run the mixer for 20 minutes. (6) Turn the mixer off. (7) Record the state of the system, observations including: the number of target structures created, the number of matching errors (between conflicting physical information, where no fits rule is applicable),
and the number of assembly errors (partial attachment between corresponding physical information, where a fits rule is applicable). Experimental Results. Each trial, for each experiment, was successful in creating at least one target structure. In the second trial for experiment two, two target structures were created. Fig. 6 shows the final state for the best trial for each experiment. In experiment one, there were no matching errors (as this was not possible due to the 3-magnetic-bit codes present) and no assembly errors. In experiment two, there was only one matching error (trial five) and no assembly errors. As structures self-assembled, the environmental free space was reduced, constraining the rotation of substructures and sometimes preventing single components from reaching assembly locations. Fisher’s Exact Test [15] (one-sided) for analysing binary data was used to determine the statistical significance of creating target structures. For both experiments, the p-value is 0.004, which we consider statistically significant. As a result, these successful experiments confirm our hypothesis that, given the attributes of a target structure, the three-level approach incorporating EC could be used to evolve a set of component rules, which can be mapped to a physical system consisting of an environment containing components that are able to self-assemble into the target structure.
Fig. 6. Results for the best trial of experiment one (left) and experiment two (right)
5
Conclusions
Here we used EC to evolve the component set required to create one target structure. As future work, we aim to evolve multiple target structures simultaneously. We have also extended our physical systems and aim to apply EC to evolving physical self-assembling systems in three dimensions. We envision our approach being applicable to the design of (micro)structures, circuits, and DNA Computing using self-assembly. The work presented here progresses techniques to solve an open problem in self-assembly, of being able to create a set of components and their environment, such that the components self-assemble into a target structure. We presented two proof-of-concept experiments
to demonstrate how bottom-up construction (self-assembly) can be coupled with bottom-up design (evolution). EC was incorporated into the three-level approach for designing physical self-assembling systems. The successful results of the experiments presented demonstrate how the three-level approach, by incorporating EC, can be used for evolving physical self-assembling systems in two-dimensions.
References 1. Ball, P.: The Self-made Tapestry. Oxford University Press, Oxford (1999) 2. Thompson, D.W.: On Growth and Form. Dover Publication, New York (1917) (reprint 1992) 3. Crick, F.H.C.: Central Dogma of Molecular Biology. Nature 227, 561–563 (1970) 4. Groß, R., Dorigo, M.: Self-assembly at the Macroscopic Scale. Proc. IEEE 96(9), 1490–1508 (2008) 5. Adlemna, L., Cheng, Q., Goel, A., Huang, M.-D., Kempe, D., de Espan´es, P.M., Rothemund, P.W.K.: Combinatorial Optimization Problems in Self-assembly. In: 34th ACM International Symposium on Theory of Computing, pp. 23–32. ACM Press, New York (2002) 6. Mitchell, M.: An Introduction to Genetic Algorithms. MIT Press, Cambridge (2002) 7. Bhalla, N., Bentley, P.J.: Programming Self-assembling Systems Via Physically Encoded Information. In: Doursat, R., Sayama, H., Michel, O. (eds.) ME 2010. LNCS. Springer, Heidelberg (2010) 8. Bhalla, N., Bentley, P.J., Jacob, C.: Mapping Virtual Self-assembly Rules to Physical Systems. In: Proceedings of the International Conference on Unconventional Computing, pp. 117–147. Luniver Press, Frome (2007) 9. Whitesides, G.M., Gryzbowski, G.: Self-assembly at all Scales. Science 295, 2418– 2421 (2002) 10. Winfree, E.: Simulations of Computing by Self-assembly. DNA Based Computers IV (1998) 11. Winfree, E., Liu, F., Wenzier, L., Seeman, N.: Design and Self-assembly of Twodimensional DNA crystals. Nature 394(6), 539–544 (1998) 12. Terrazas, G., Gheorghe, M., Kendall, G., Krasnogor, N.: Evolving Tiles for Automated Self-assembly Design. In: Proceeding of the 2007 IEEE Congress on Evolutionary Computation, pp. 2001–2008. IEEE Press, New York (2007) 13. Soille, P.: Morphological Image Analysis, 2nd edn. Springer, Berlin (2003) 14. Johnston Jr., E.R., Eisenberg, E., Mazurek, D.: Vector Mechanics for Engineers: Statics, 9th edn. McGraw-Hill Higher Education, New York (2009) 15. Cox, D.R., Snell, E.J.: Analysis of Binary Data, 2nd edn. Chapman & Hall/CRC, Boca Raton (1989)
Author Index
Bechmann, Matthias 335
Benkhelifa, Elhadj 322
Bentley, Peter J. 121, 381
Bhalla, Navneet 381
Bidlo, Michal 85
Bremner, Paul 37
Bull, Larry 360
Cagnoni, Stefano 97
Carrillo, Snaider 133
Cornforth, Theodore W. 157
Dragffy, Gabriel 37
Ebne-Alian, Mohammad 73
Ebner, Marc 109
Eiben, A.E. 169
Farnsworth, Michael 322
Fonseca Vieira, Pedro da 310
Gajda, Zbyšek 13
Gamrat, Christian 262
Glette, Kyrre 250, 274
Graf, Yoan 286
Haasdijk, Evert 169
Harkin, Jim 133
Hilder, James A. 1
Hovin, Mats 274
Ivekovic, Spela 97
Jacob, Christian 381
Kaufmann, Paul 250
Kharma, Nawwaf 73
Kim, Kyung-Joong 157
Knieper, Tobias 250
Kobayashi, Kotaro 299
Kotasek, Zdenek 181
Kuyucu, Tüze 61
Ledwith, Ricky D. 25
Liang, Houjun 193
Lipson, Hod 157
Liu, Yang 238
Li, Zhifang 193
Lowe, David 49
Luo, Wenjian 193
Madrenas, Jordi 145, 299
McDaid, Liam 133
Mesquita, Antonio 310
Miller, Julian F. 25, 61
Miorandi, Daniele 49
Mondada, Francesco 286
Moreno, Juan Manuel 145, 299
Morgan, Fearghal 133
Mujkanovic, Amir 49
Mussi, Luca 97
Pande, Sandeep 133
Pena, Carlos 226
Perez-Uribe, Andres 286
Philippe, Jean-Marc 262
Pipe, Tony 37
Platzner, Marco 250
Prodan, Lucian 348
Rétornaz, Philippe 286
Rieffel, John 372
Rossier, Joël 202, 226
Rouhipour, Marjan 121
Ruican, Cristian 348
Rusu, Andrei A. 169
Sá, Leonardo Bruno de 310
Samie, Mohammad 37
Sanchez, Eduardo 286
Sánchez, Giovanny 145
Satizábal, Héctor F. 286
Sayles, Dave 372
Sebald, Angelika 335
Sekanina, Lukáš 13, 214
Shayani, Hooman 121
Šimáček, Jiří 214
Skarvada, Jaroslav 181
Slany, Karel 85
Stareček, Lukáš 214
Stauffer, André 202
Stepney, Susan 335
Strnadel, Josef 181
Tain, Benoît 262
Tempesti, Gianluca 238
Thoma, Yann 286
Tiwari, Ashutosh 322
Torresen, Jim 250
Trefzer, Martin A. 61
Tyrrell, Andy M. 1, 37, 61, 238
Udrescu, Mihai 348
Upegui, Andres 286
Vasicek, Zdenek 85
Vladutiu, Mircea 348
Walker, James Alfred 1, 37, 238
Wang, Xufa 193
Yamamoto, Lidia 49
Zhu, Meiling 322