Computational Intelligence
Computational Intelligence for Engineering and Manufacturing
Edited by
Diego Andina, Technical University of Madrid (UPM), Spain
Duc Truong Pham, Manufacturing Engineering Center, Cardiff University, Cardiff
A C.I.P. Catalogue record for this book is available from the Library of Congress.
ISBN-10 0-387-37450-7 (HB)
ISBN-13 978-0-387-37450-5 (HB)
ISBN-10 0-387-37452-3 (e-book)
ISBN-13 978-0-387-37452-9 (e-book)
Published by Springer, P.O. Box 17, 3300 AA Dordrecht, The Netherlands. www.springer.com
Printed on acid-free paper
All Rights Reserved © 2007 Springer No part of this work may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, microfilming, recording or otherwise, without written permission from the Publisher, with the exception of any material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work.
This book is dedicated to the memory of Roberto Carranza E., who inspired the authors with the enthusiasm to jointly prepare this book.
CONTENTS
Contributing Authors
Preface
Acknowledgements
1. Soft Computing and its Applications in Engineering and Manufacture (D. T. Pham, P. T. N. Pham, M. S. Packianather, A. A. Afify)
2. Neural Networks Historical Review (D. Andina, A. Vega-Corona, J. I. Seijas, J. Torres-García)
3. Artificial Neural Networks (D. T. Pham, M. S. Packianather, A. A. Afify)
4. Application of Neural Networks (D. Andina, A. Vega-Corona, J. I. Seijas, M. J. Alarcón)
5. Radial Basis Function Networks and their Application in Communication Systems (Ascensión Gallardo Antolín, Juan Pascual García, José Luis Sancho Gómez)
6. Biological Clues for Up-to-Date Artificial Neurons (Javier Ropero Peláez, Jose Roberto Castillo Piqueira)
7. Support Vector Machines (Jaime Gómez Sáenz de Tejada, Juan Seijas Martínez-Echevarría)
8. Fractals as Pre-Processing Tool for Computational Intelligence Application (Ana M. Tarquis, Valeriano Méndez, Juan B. Grau, José M. Antón, Diego Andina)
CONTRIBUTING AUTHORS
D. Andina, J. I. Seijas, J. Torres-García, M. J. Alarcón, A. Tarquis, J. B. Grau and J. M. Antón work for the Technical University of Madrid (UPM), Spain, where they form the Group for Automation and Soft Computing (GASC). D. T. Pham, P. T. N. Pham, M. S. Packianather and A. A. Afify work for Cardiff University. Javier Ropero Peláez and José Roberto Castillo Piqueira work for the Escola Politécnica da Universidade de São Paulo, Departamento de Engenharia de Telecomunicações e Controle, Brazil. A. Gallardo Antolín, J. Pascual García and J. L. Sancho Gómez work for University Carlos III of Madrid, Spain. A. Vega-Corona, V. Méndez and J. Gómez Sáenz de Tejada work for the University of Guanajuato, Mexico, the Technical University of Madrid and the Universidad Autónoma of Madrid, Spain, respectively.
PREFACE
This book presents a selected collection of contributions offering a focused treatment of important elements of Computational Intelligence. Unlike traditional computing, Computational Intelligence (CI) is tolerant of imprecise information, partial truth and uncertainty. The principal components of CI that currently have frequent application in Engineering and Manufacturing are Neural Networks (NN), Fuzzy Logic (FL) and Support Vector Machines (SVM). In CI, NN and SVM are concerned with learning, while FL is concerned with imprecision and reasoning. This volume mainly covers a key element of Computational Intelligence: learning. All the contributions in this volume have direct relevance to neural network learning, from neural computing fundamentals to advanced networks such as Multilayer Perceptrons (MLP) and Radial Basis Function Networks (RBF), and their relations with fuzzy set and support vector machine theory. The book also discusses different applications in Engineering and Manufacturing. These are among the applications where CI has excellent potential for use. Both novice and expert readers should find this book a useful reference in the field of Computational Intelligence. The editors and the authors hope to have contributed to the field by paving the way for learning paradigms to solve real-world problems.
D. Andina
ACKNOWLEDGEMENTS
This document has been produced with the financial assistance of the European Community, ALFA project II-0026-FA. The views expressed herein are those of the Authors and can therefore in no way be taken to reflect the official opinion of the European Community. The editors wish to thank Dr A. Afify of Cardiff University and Mr A. Jevtic of the Technical University of Madrid for their support and helpful comments during the revision of this text. The editors also wish to thank Nagib Callaos, President of the International Institute of Informatics and Systemics, IIIS, for his permission and freedom to reproduce in Chapters 2 and 4 of this book contents from the book by D.Andina and F.Ballesteros (Eds), “Recent Advances in Neural Networks” Ed. IIIS press, ILL, USA (2000).
CHAPTER 1 SOFT COMPUTING AND ITS APPLICATIONS IN ENGINEERING AND MANUFACTURE
D. T. PHAM, P. T. N. PHAM, M. S. PACKIANATHER, A. A. AFIFY Manufacturing Engineering Centre, Cardiff University, Cardiff CF24 3AA, United Kingdom
D. Andina and D.T. Pham (eds.), Computational Intelligence, 1–38. © 2007 Springer.
INTRODUCTION
Soft computing is a recent term for a computing paradigm that has been in existence for almost fifty years. This chapter reviews five soft computing tools. They are: knowledge-based systems, fuzzy logic, inductive learning, neural networks and genetic algorithms. All of these tools have found many practical applications. Examples of applications in engineering and manufacture will be given in the chapter.
1. KNOWLEDGE-BASED SYSTEMS
Knowledge-based systems, or expert systems, are computer programs embodying knowledge about a narrow domain for solving problems related to that domain. An expert system usually comprises two main elements, a knowledge base and an inference mechanism. The knowledge base contains domain knowledge which may be expressed as any combination of “IF-THEN” rules, factual statements (or assertions), frames, objects, procedures and cases. The inference mechanism is that part of an expert system which manipulates the stored knowledge to produce solutions to problems. Knowledge manipulation methods include the use of inheritance and constraints (in a frame-based or object-oriented expert system), the retrieval and adaptation of case examples (in a case-based expert system) and the application of inference rules such as modus ponens (If A Then B; A Therefore B) and modus tollens (If A Then B; NOT B Therefore NOT A) according to “forward chaining” or “backward chaining” control procedures and “depth-first” or “breadth-first” search strategies (in a rule-based expert system). With forward chaining or data-driven inferencing, the system tries to match available facts with the IF portion of the
IF-THEN rules in the knowledge base. When matching rules are found, one of them is “fired”, i.e. its THEN part is made true, generating new facts and data which in turn cause other rules to “fire”. Reasoning stops when no more new rules can fire.
In backward chaining or goal-driven inferencing, a goal to be proved is specified. If the goal cannot be immediately satisfied by existing facts in the knowledge base, the system will examine the IF-THEN rules for rules with the goal in their THEN portion. Next, the system will determine whether there are facts that can cause any of those rules to fire. If such facts are not available they are set up as subgoals. The process continues recursively until either all the required facts are found and the goal is proved or any one of the subgoals cannot be satisfied, in which case the original goal is disproved.
Both control procedures are illustrated in Figure 1. Figure 1a shows how, given the assertion that a lathe is a machine tool and a set of rules concerning machine tools, a forward-chaining system will generate additional assertions such as “a lathe is power driven” and “a lathe has a tool holder”. Figure 1b details the backward-chaining sequence producing the answer to the query “does a lathe require a power source?”.
In the forward chaining example of Figure 1a, both rules R2 and R3 simultaneously qualify for firing when inferencing starts as both their IF parts match the presented fact F1. Conflict resolution has to be performed by the expert system to decide which rule should fire. The conflict resolution method adopted in this example is “first come, first served”: R2 fires as it is the first qualifying rule encountered. Other conflict resolution methods include “priority”, “specificity” and “recency”. The search strategies can also be illustrated using the forward chaining example of Figure 1a.
Suppose that, in addition to F1, the knowledge base also initially contains the assertion “a CNC turning centre is a machine tool”. Depth-first search involves firing rules R2 and R3 with X instantiated to “lathe” (as shown in Figure 1a) before firing them again with X instantiated to “CNC turning centre”. Breadth-first search will activate rule R2 with X instantiated to “lathe” and again with X instantiated to “CNC turning centre”, followed by rule R3 and the same sequence of instantiations. Breadth-first search finds the shortest line of inferencing between a start position and a solution if it exists. When guided by heuristics to select the correct search path, depth-first search might produce a solution more quickly, although the search might not terminate if the search space is infinite [Jackson, 1999]. For more information on the technology of expert systems, see [Pham and Pham, 1988; Durkin, 1994; Giarratano and Riley, 1998; Darlington, 1999; Jackson, 1999; Badiru and Cheung, 2002; Nurminen et al., 2003]. Most expert systems are nowadays developed using programs known as “shells”. These are essentially ready-made expert systems complete with inferencing and knowledge storage facilities but without the domain knowledge. Some sophisticated expert systems are constructed with the help of “development environments”. The latter are more flexible than shells in that they also provide means for users to implement their own inferencing and knowledge representation methods. More details on expert systems shells and development environments can be found in [Price, 1990].
KNOWLEDGE BASE (Initial State)
  Fact:
    F1 - A lathe is a machine tool
  Rules:
    R1 - If X is power driven Then X requires a power source
    R2 - If X is a machine tool Then X has a tool holder
    R3 - If X is a machine tool Then X is power driven
      |  F1 & R2 match
      v
KNOWLEDGE BASE (Intermediate State)
  Facts:
    F1 - A lathe is a machine tool
    F2 - A lathe has a tool holder
  Rules: R1, R2, R3 as in the initial state
      |  F1 & R3 match
      v
KNOWLEDGE BASE (Intermediate State)
  Facts:
    F1 - A lathe is a machine tool
    F2 - A lathe has a tool holder
    F3 - A lathe is power driven
  Rules: R1, R2, R3 as in the initial state
      |  F3 & R1 match
      v
KNOWLEDGE BASE (Final State)
  Facts:
    F1 - A lathe is a machine tool
    F2 - A lathe has a tool holder
    F3 - A lathe is power driven
    F4 - A lathe requires a power source
  Rules: R1, R2, R3 as in the initial state
Figure 1a. An example of forward chaining
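The sequence of Figure 1a can be sketched in a few lines of Python. This is only a minimal illustration and not the design of any particular expert system shell: the encoding of facts as (subject, property) pairs, the rule tuples and the function name forward_chain are assumptions introduced here. Conflict resolution is “first come, first served” in rule order, as in the figure.

```python
# Rules as (name, IF property, THEN property); this encoding is an assumption
# made for illustration, mirroring R1-R3 of Figure 1a.
RULES = [
    ("R1", "is power driven", "requires a power source"),
    ("R2", "is a machine tool", "has a tool holder"),
    ("R3", "is a machine tool", "is power driven"),
]


def forward_chain(facts, rules=RULES):
    """Fire one qualifying rule at a time until no rule yields a new fact.

    Returns the facts in the order in which they were derived.
    """
    known = list(facts)
    fired = True
    while fired:
        fired = False
        for _name, if_part, then_part in rules:  # first come, first served
            for subject, prop in list(known):
                new_fact = (subject, then_part)
                if prop == if_part and new_fact not in known:
                    known.append(new_fact)  # the THEN part is made true
                    fired = True
                    break
            if fired:
                break  # rescan the rule list from the top after each firing
    return known


derived = forward_chain([("lathe", "is a machine tool")])
for number, (subject, prop) in enumerate(derived, start=1):
    print(f"F{number} - A {subject} {prop}")
```

Starting from F1 alone, the loop derives the tool-holder, power-driven and power-source assertions in the same order as the figure, because R2 is met before R3 in the rule list.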
KNOWLEDGE BASE (Initial State)
  Fact:
    F1 - A lathe is a machine tool
  Rules:
    R1 - If X is power driven Then X requires a power source
    R2 - If X is a machine tool Then X has a tool holder
    R3 - If X is a machine tool Then X is power driven
  GOAL STACK (Goal : Satisfied)
    G1 - A lathe requires a power source : ?
      |  G1 & R1
      v
KNOWLEDGE BASE (Intermediate State)
  Fact: F1
  Rules: R1, R2, R3 as in the initial state
  GOAL STACK (Goal : Satisfied)
    G1 - A lathe requires a power source : ?
    G2 - A lathe is power driven : ?
      |  G2 & R3
      v
KNOWLEDGE BASE (Intermediate State)
  Fact: F1
  Rules: R1, R2, R3 as in the initial state
  GOAL STACK (Goal : Satisfied)
    G1 - A lathe requires a power source : ?
    G2 - A lathe is power driven : ?
    G3 - A lathe is a machine tool : ?
      |
      v
KNOWLEDGE BASE (Intermediate State)
  Fact: F1
  Rules: R1, R2, R3 as in the initial state
  GOAL STACK (Goal : Satisfied)
    G1 - A lathe requires a power source : ?
    G2 - A lathe is power driven : ?
    G3 - A lathe is a machine tool : Yes
      |  F1 & R3
      v
KNOWLEDGE BASE (Intermediate State)
  Facts:
    F1 - A lathe is a machine tool
    F2 - A lathe is power driven
  Rules: R1, R2, R3 as in the initial state
  GOAL STACK (Goal : Satisfied)
    G1 - A lathe requires a power source : ?
    G2 - A lathe is power driven : Yes
      |  F2 & R1
      v
KNOWLEDGE BASE (Final State)
  Facts:
    F1 - A lathe is a machine tool
    F2 - A lathe is power driven
    F3 - A lathe requires a power source
  Rules: R1, R2, R3 as in the initial state
  GOAL STACK (Goal : Satisfied)
    G1 - A lathe requires a power source : Yes
Figure 1b. An example of backward chaining
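The goal-driven sequence of Figure 1b can be sketched in the same spirit. The recursive function prove, the tuple encoding and the trace list are illustrative assumptions rather than part of any real shell; the guard against cyclic rule sets is also an addition that the figure does not need.

```python
# Rules as (name, IF property, THEN property), mirroring R1-R3 of Figure 1b.
RULES = [
    ("R1", "is power driven", "requires a power source"),
    ("R2", "is a machine tool", "has a tool holder"),
    ("R3", "is a machine tool", "is power driven"),
]


def prove(subject, goal, facts, rules=RULES, trace=None, seen=frozenset()):
    """Return True if (subject, goal) can be established from the facts.

    Each goal examined is appended to `trace`; proved goals are added to
    `facts`, as in the intermediate and final states of the figure.
    """
    if trace is not None:
        trace.append(goal)
    if (subject, goal) in facts:  # goal satisfied directly by a stored fact
        return True
    if goal in seen:              # assumption: stop on cyclic rule sets
        return False
    for _name, if_part, then_part in rules:
        # Look for rules with the goal in their THEN portion and set up
        # their IF portion as a subgoal.
        if then_part == goal and prove(subject, if_part, facts, rules,
                                       trace, seen | {goal}):
            facts.add((subject, goal))
            return True
    return False


facts = {("lathe", "is a machine tool")}
trace = []
answer = prove("lathe", "requires a power source", facts, trace=trace)
print(answer, trace)
```

The trace lists the goals in the order G1, G2, G3 of the figure, and the knowledge base ends with the three facts of the final state.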
Among the five tools considered in this chapter, expert systems are probably the most mature, with many commercial shells and development tools available to facilitate their construction. Consequently, once the domain knowledge to be incorporated in an expert system has been extracted, the process of building the system is relatively simple. The ease with which expert systems can be developed has led to a large number of applications of the tool. In engineering, applications can be found for a variety of tasks including selection of materials, machine elements, tools, equipment and processes, signal interpreting, condition monitoring, fault diagnosis, machine and process control, machine design, process planning, production scheduling and system configuring. Some recent examples of specific tasks undertaken by expert systems are:
• identifying and planning inspection schedules for critical components of an offshore structure [Peers et al., 1994];
• automating the evaluation of manufacturability in CAD systems [Venkatachalam, 1994];
• choosing an optimal robot for a particular task [Kamrani et al., 1995];
• monitoring the technical and organisational problems of vehicle maintenance in coal mining [Streichfuss and Burgwinkel, 1995];
• configuring paper feeding mechanisms [Koo and Han, 1996];
• training technical personnel in the design and evaluation of energy cogeneration plants [Lara Rosano et al., 1996];
• storing, retrieving and adapting planar linkage designs [Bose et al., 1997];
• designing additive formulae for engine oil products [Shi et al., 1997];
• carrying out automatic remeshing during a finite-elements analysis of forging deformation [Yano et al., 1997];
• designing of products and their assembly processes [Zha et al., 1998];
• modelling and control of combustion processes [Kalogirou, 2003];
• optimising the transient performances in the adaptive control of a planar robot [De La Sen et al., 2004].
2. FUZZY LOGIC
A disadvantage of ordinary rule-based expert systems is that they cannot handle new situations not covered explicitly in their knowledge bases (that is, situations not fitting exactly those described in the “IF” parts of the rules). These rule-based systems are completely unable to produce conclusions when such situations are encountered. They are therefore regarded as shallow systems which fail in a “brittle” manner, rather than exhibit a gradual reduction in performance when faced with increasingly unfamiliar problems, as human experts would. The use of fuzzy logic [Zadeh, 1965] which reflects the qualitative and inexact nature of human reasoning can enable expert systems to be more resilient. With fuzzy logic, the precise value of a variable is replaced by a linguistic description, the meaning of which is represented by a fuzzy set, and inferencing is carried out based on this representation.
Fuzzy set theory may be considered an extension of classical set theory. While classical set theory is about “crisp” sets with sharp boundaries, fuzzy set theory is concerned with “fuzzy” sets whose boundaries are “grey”. In classical set theory, an element $u_i$ can either belong or not belong to a set $A$, i.e. the degree to which element $u_i$ belongs to set $A$ is either 1 or 0. However, in fuzzy set theory, the degree of belonging of an element $u_i$ to a fuzzy set $\tilde{A}$ is a real number between 0 and 1. This is denoted by $\mu_{\tilde{A}}(u_i)$, the grade of membership of $u_i$ in $\tilde{A}$. Fuzzy set $\tilde{A}$ is a fuzzy set in $U$, the “universe of discourse” or “universe” which includes all objects to be discussed. $\mu_{\tilde{A}}(u_i)$ is 1 when $u_i$ is definitely a member of $\tilde{A}$ and $\mu_{\tilde{A}}(u_i)$ is 0 when $u_i$ is definitely not a member of $\tilde{A}$. For instance, a fuzzy set defining the term “normal room temperature” might be:
$$
\begin{aligned}
\text{normal room temperature} \equiv{} & 0.0/\text{below } 10\,^\circ\mathrm{C} + 0.3/10\,^\circ\mathrm{C}\text{–}16\,^\circ\mathrm{C} \\
& + 0.8/16\,^\circ\mathrm{C}\text{–}18\,^\circ\mathrm{C} + 1.0/18\,^\circ\mathrm{C}\text{–}22\,^\circ\mathrm{C} + 0.8/22\,^\circ\mathrm{C}\text{–}24\,^\circ\mathrm{C} \\
& + 0.3/24\,^\circ\mathrm{C}\text{–}30\,^\circ\mathrm{C} + 0.0/\text{above } 30\,^\circ\mathrm{C}
\end{aligned}
\tag{1}
$$
The values 0.0, 0.3, 0.8 and 1.0 are the grades of membership to the given fuzzy set of temperature ranges below 10 °C (above 30 °C), between 10 °C and 16 °C (24 °C–30 °C), between 16 °C and 18 °C (22 °C–24 °C) and between 18 °C and 22 °C. Figure 2(a) shows a plot of the grades of membership for “normal room temperature”. For comparison, Figure 2(b) depicts the grades of membership for a crisp set defining room temperatures in the normal range.
Knowledge in an expert system employing fuzzy logic can be expressed as qualitative statements (or fuzzy rules) such as “If the room temperature is normal, then set the heat input to normal”, where “normal room temperature” and “normal heat input” are both fuzzy sets. A fuzzy rule relating two fuzzy sets $\tilde{A}$ and $\tilde{B}$ is effectively the Cartesian product $\tilde{A} \times \tilde{B}$ which can be represented by a relation matrix $\tilde{R}$. Element $R_{ij}$ of $\tilde{R}$ is the membership to $\tilde{A} \times \tilde{B}$ of pair $(u_i, v_j)$, $u_i \in \tilde{A}$ and $v_j \in \tilde{B}$. $R_{ij}$ is given by:
$$
R_{ij} = \min\{\mu_{\tilde{A}}(u_i),\ \mu_{\tilde{B}}(v_j)\}
\tag{2}
$$
For example, with “normal room temperature” defined as before and “normal heat input” described by:
$$
\text{normal heat input} \equiv 0.2/1\,\mathrm{kW} + 0.9/2\,\mathrm{kW} + 0.2/3\,\mathrm{kW}
\tag{3}
$$
Figure 2. (a) Fuzzy set of “normal temperature” (b) Crisp set of “normal temperature” [plots of the grade of membership µ against temperature (°C)]
$\tilde{R}$ can be computed as:
$$
\tilde{R} =
\begin{bmatrix}
0.0 & 0.0 & 0.0 \\
0.2 & 0.3 & 0.2 \\
0.2 & 0.8 & 0.2 \\
0.2 & 0.9 & 0.2 \\
0.2 & 0.8 & 0.2 \\
0.2 & 0.3 & 0.2 \\
0.0 & 0.0 & 0.0
\end{bmatrix}
\tag{4}
$$
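As a quick check, the relation matrix of Equation (4) can be reproduced from the membership grades of Equations (1) and (3) using the minimum operator of Equation (2). The list encoding and the function name relation_matrix are assumptions made for this sketch; each list simply holds the grades in the order in which the ranges appear in the equations.

```python
# Grades of membership in the order of Equations (1) and (3).
normal_room_temperature = [0.0, 0.3, 0.8, 1.0, 0.8, 0.3, 0.0]  # seven ranges
normal_heat_input = [0.2, 0.9, 0.2]                            # 1, 2, 3 kW


def relation_matrix(mu_a, mu_b):
    """Relation matrix of the rule "If A Then B": R[i][j] = min(mu_a[i], mu_b[j])."""
    return [[min(a, b) for b in mu_b] for a in mu_a]


R = relation_matrix(normal_room_temperature, normal_heat_input)
for row in R:
    print(row)
```

Each row corresponds to one temperature range and each column to one heat input level, so the middle row (18 °C–22 °C, grade 1.0) simply repeats the grades of “normal heat input”.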
A reasoning procedure known as the compositional rule of inference, which is the equivalent of the modus-ponens rule in rule-based expert systems, enables conclusions to be drawn by generalisation (extrapolation or interpolation) from the qualitative information stored in the knowledge base. For instance, when the room temperature is detected to be “slightly below normal”, a temperature-controlling fuzzy expert system might deduce that the heat input should be set to “slightly above normal”. Note that this conclusion might not be contained in any of the fuzzy rules stored in the system. A well-known compositional rule of inference is the max-min rule. Let $\tilde{R}$ represent the fuzzy rule “If $\tilde{A}$ Then $\tilde{B}$” and $\tilde{a} \equiv \sum_i \mu_i/u_i$ a fuzzy assertion. $\tilde{A}$ and $\tilde{a}$ are fuzzy sets in the same universe of discourse. The max-min rule enables a fuzzy conclusion $\tilde{b} \equiv \sum_j \lambda_j/v_j$ to be inferred from $\tilde{a}$ and $\tilde{R}$ as follows:
$$
\tilde{b} = \tilde{a} \circ \tilde{R}
\tag{5}
$$
$$
\lambda_j = \max_i\,[\min(\mu_i,\ R_{ij})]
\tag{6}
$$
For example, given the fuzzy rule “If the room temperature is normal, then set the heat input to normal” where “normal room temperature” and “normal heat input” are as defined previously, and a fuzzy temperature measurement of
$$
\begin{aligned}
\text{temperature} \equiv{} & 0.0/\text{below } 10\,^\circ\mathrm{C} + 0.4/10\,^\circ\mathrm{C}\text{–}16\,^\circ\mathrm{C} + 0.8/16\,^\circ\mathrm{C}\text{–}18\,^\circ\mathrm{C} \\
& + 0.8/18\,^\circ\mathrm{C}\text{–}22\,^\circ\mathrm{C} + 0.2/22\,^\circ\mathrm{C}\text{–}24\,^\circ\mathrm{C} + 0.0/24\,^\circ\mathrm{C}\text{–}30\,^\circ\mathrm{C} \\
& + 0.0/\text{above } 30\,^\circ\mathrm{C}
\end{aligned}
\tag{7}
$$
the heat input will be deduced as:
$$
\text{heat input} = \text{temperature} \circ \tilde{R} = 0.2/1\,\mathrm{kW} + 0.8/2\,\mathrm{kW} + 0.2/3\,\mathrm{kW}
\tag{8}
$$
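The max-min composition of Equations (5) and (6) is short enough to verify by hand or in code. The sketch below rebuilds the relation matrix from the grades of Equations (1) and (3) and applies it to the measurement of Equation (7); the function name max_min_compose and the list encoding are assumptions introduced for illustration.

```python
# Grades in the order of Equations (1), (3) and (7).
normal_room_temperature = [0.0, 0.3, 0.8, 1.0, 0.8, 0.3, 0.0]
normal_heat_input = [0.2, 0.9, 0.2]
measured_temperature = [0.0, 0.4, 0.8, 0.8, 0.2, 0.0, 0.0]

# Relation matrix of the rule, Equation (2): R[i][j] = min(mu_A(u_i), mu_B(v_j)).
R = [[min(a, b) for b in normal_heat_input] for a in normal_room_temperature]


def max_min_compose(assertion, relation):
    """Max-min rule, Equation (6): lambda_j = max over i of min(mu_i, R[i][j])."""
    n_columns = len(relation[0])
    return [
        max(min(mu_i, row[j]) for mu_i, row in zip(assertion, relation))
        for j in range(n_columns)
    ]


heat_input = max_min_compose(measured_temperature, R)
print(heat_input)  # grades for 1 kW, 2 kW and 3 kW
```

The result, 0.2, 0.8 and 0.2 for 1 kW, 2 kW and 3 kW, agrees with Equation (8): the measurement is a slightly degraded version of “normal room temperature”, so the conclusion is a slightly degraded “normal heat input”.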
For further information on fuzzy logic, see [Kaufmann, 1975; Klir and Yuan, 1995; 1996; Ross, 1995; Zimmermann, 1996; Dubois and Prade, 1998]. Fuzzy logic potentially has many applications in engineering where the domain knowledge is usually imprecise. Notable successes have been achieved in the area of process and machine control although other sectors have also benefited from this tool. Recent examples of engineering applications include:
• controlling the height of the arc in a welding process [Bigand et al., 1994];
• controlling the rolling motion of an aircraft [Ferreiro Garcia, 1994];
• controlling a multi-fingered robot hand [Bas and Erkmen, 1995];
• analysing the chemical composition of minerals [Da Rocha Fernandes and Cid Bastos, 1996];
• monitoring of tool-breakage in end-milling operations [Chen and Black, 1997];
• modelling of the set-up and bend sequencing process for sheet metal bending [Ong et al., 1997];
• determining the optimal formation of manufacturing cells [Szwarc et al., 1997; Zülal and Arikan, 2000];
• classifying discharge pulses in electrical discharge machining [Tarng et al., 1997];
• modelling an electrical drive system [Costa Branco and Dente, 1998];
• improving the performance of hard disk drive final assembly [Zhao and De Souza, 1998; 2001];
• analysing chatter occurring during a machine tool cutting process [Kong et al., 1999];
• addressing the relationships between customer needs and design requirements [Sohen and Choi, 2001; Vanegas and Labib, 2001; Karsak, 2004];
• assessing and selecting advanced manufacturing systems [Karsak and Kuzgunkaya, 2002; Bozdağ et al., 2003; Beskese et al., 2004; Kulak and Kahraman, 2004];
• evaluating cutting force uncertainty in turning [Wang et al., 2002];
• reducing defects in automotive coating operations [Lou and Huang, 2003].
3. INDUCTIVE LEARNING
The acquisition of domain knowledge to build into the knowledge base of an expert system is generally a major task. In some cases, it has proved a bottleneck in the construction of an expert system. Automatic knowledge acquisition techniques have been developed to address this problem. Inductive learning is an automatic technique for knowledge acquisition. The inductive approach produces a structured representation of knowledge as the outcome of learning. Induction involves generalising a set of examples to yield a selected representation which can be in terms of a set of rules, concepts or logical inferences or a decision tree. An inductive learning program usually requires as input a set of examples. Each example is characterised by the values of a number of attributes and the class to which it belongs. In one approach to inductive learning, through a process of “dividing-and-conquering” where attributes are chosen according to some strategy (for example, to maximise the information gain) to divide the original example set into subsets, the inductive learning program builds a decision tree that correctly classifies the given example set. The tree represents the knowledge generalised from the specific examples in the set. This can subsequently be used to handle situations not explicitly covered by the example set. In another approach known as the “covering approach”, the inductive learning program attempts to find groups of attributes uniquely shared by examples in given classes and forms rules with the IF part as conjunctions of those attributes and the THEN part as the classes. The program removes correctly classified examples from consideration and stops when rules have been formed to classify all examples in the given set. A new approach to inductive learning, “inductive logic programming”, is a combination of induction and logic programming. 
Unlike conventional inductive learning which uses propositional logic to describe examples and represent new concepts, inductive logic programming (ILP) employs the more powerful predicate
logic to represent training examples and background knowledge and to express new concepts. Predicate logic permits the use of different forms of training examples and background knowledge. It enables the results of the induction process, that is the induced concepts, to be described as general first-order clauses with variables and not just as zero-order propositional clauses made up of attribute-value pairs. There are two main types of ILP systems: the first is based on the top-down generalisation/specialisation method, and the second on the principle of inverse resolution [Muggleton, 1992; Lavrac, 1994].
A number of inductive learning programs have been developed. Some of the well-known programs are CART [Breiman et al., 1998], ID3 and its descendants C4.5 and C5.0 [Quinlan, 1983; 1986; 1993; ISL, 1998; RuleQuest, 2000] which are divide-and-conquer programs, the AQ family of programs [Michalski, 1969; 1990; Michalski et al., 1986; Cervone et al., 2001; Michalski and Kaufman, 2001] which follow the covering approach, the FOIL program [Quinlan, 1990; Quinlan and Cameron-Jones, 1995] which is an ILP system adopting the generalisation/specialisation method and the GOLEM program [Muggleton and Feng, 1990] which is an ILP system based on inverse resolution. Although most programs only generate crisp decision rules, algorithms have also been developed to produce fuzzy rules [Wang and Mendel, 1992; Janikow, 1998; Hang and Chen, 2000; Baldwin and Martin, 2001; Wang et al., 2001; Baldwin and Karale, 2003; Wang et al., 2003].
Figure 3 shows the main steps in RULES-3 Plus, an induction algorithm in the covering category [Pham and Dimov, 1997] and belonging to the RULES family of rule extraction systems [Pham and Aksoy, 1994; 1995a; 1995b; Pham et al., 2000; Pham et al., 2003; Pham and Afify, 2005a]. The simple problem of detecting the state of a metal cutting tool is used to explain the operation of RULES-3 Plus.
Three sensors are employed to monitor the cutting process and, according to the signals obtained from them (1 or 0 for sensors 1 and 3; −1, 0, or 1 for sensor 2), the tool is inferred as being “normal” or “worn”. Thus, this problem involves three attributes which are the states of sensors 1, 2 and 3 and the signals that they emit constitute the values of those attributes. The example set for the problem is given in Table 1.
Table 1. Training set for the Cutting Tool problem
Example   Sensor_1   Sensor_2   Sensor_3   Tool State
1         0          −1         0          Normal
2         1          0          0          Normal
3         1          −1         1          Worn
4         1          0          1          Normal
5         0          0          1          Normal
6         1          1          1          Worn
7         1          −1         0          Normal
8         0          −1         1          Worn
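Before turning to the covering algorithm, the attribute-selection step of the divide-and-conquer approach mentioned earlier can be illustrated on the same data. The sketch below computes the information gain of each sensor over the eight examples of Table 1, assuming the usual entropy-based definition of information gain; the function names are hypothetical and this is not the code of ID3 or C4.5.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ("Sensor_1", "Sensor_2", "Sensor_3")
# Rows of Table 1: (Sensor_1, Sensor_2, Sensor_3, tool state).
EXAMPLES = [
    (0, -1, 0, "Normal"), (1, 0, 0, "Normal"), (1, -1, 1, "Worn"),
    (1, 0, 1, "Normal"), (0, 0, 1, "Normal"), (1, 1, 1, "Worn"),
    (1, -1, 0, "Normal"), (0, -1, 1, "Worn"),
]


def entropy(rows):
    """Shannon entropy (in bits) of the class labels of the given rows."""
    counts = Counter(row[-1] for row in rows)
    return -sum(n / len(rows) * log2(n / len(rows)) for n in counts.values())


def information_gain(rows, attribute):
    """Entropy reduction obtained by splitting rows on one attribute index."""
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - remainder


gains = {name: information_gain(EXAMPLES, i) for i, name in enumerate(ATTRIBUTES)}
best = max(gains, key=gains.get)
for name, gain in gains.items():
    print(f"{name}: {gain:.4f}")
print("split first on", best)
```

Under these assumptions Sensor_2 gives the largest gain, so a decision tree built this way would test Sensor_2 at its root and then divide the remaining mixed subset further.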
Step 1. Take an unclassified example and form array SETAV.
Step 2. Initialise arrays PRSET and T_PRSET (PRSET and T_PRSET will consist of m_PRSET expressions with null conditions and zero H measures) and set n_co = 0.
Step 3. IF n_co < n_a
        THEN n_co = n_co + 1 and set m = 0;
        ELSE the example itself is taken as a rule and STOP.
Step 4. DO
          m = m + 1;
          Specialise expression m in PRSET by appending to it a condition from SETAV that differs from the conditions already included in the expression;
          Compute the H measure for the expression;
          IF its H measure is higher than the H measure of any expression in T_PRSET
          THEN replace the expression having the lowest H measure with the newly formed expression;
          ELSE discard the new expression;
        WHILE m < m_PRSET.
Step 5. IF there are consistent expressions in T_PRSET
        THEN choose as a rule the expression that has the highest H measure and discard the others;
        ELSE copy T_PRSET into PRSET; initialise T_PRSET and go to step 3.
Figure 3. Rule forming procedure of RULES-3 Plus
Notes: n_co – number of conditions; n_a – number of attributes; m_PRSET – number of expressions stored in PRSET (m_PRSET is user-provided); T_PRSET – a temporary array of partial rules of the same dimension as PRSET
In step 1, example 1 is used to form the attribute-value array SETAV which will contain the following attribute-value pairs: [Sensor_1 = 0], [Sensor_2 = −1] and [Sensor_3 = 0]. In step 2, the partial rule set PRSET and T_PRSET, the temporary version of PRSET used for storing partial rules in the process of rule construction, are initialised. This creates for each of these sets three expressions having null conditions and zero H measures. The H measure for an expression is defined as:
$$
H = \sqrt{\frac{E^c}{E}}\left[\,2 - 2\sqrt{\frac{E_i^c}{E^c}\,\frac{E_i}{E}} - 2\sqrt{\left(1-\frac{E_i^c}{E^c}\right)\left(1-\frac{E_i}{E}\right)}\,\right]
\tag{9}
$$
where $E^c$ is the number of examples covered by the expression (the total number of examples correctly classified and misclassified by a given rule), $E$ is the total number of examples, $E_i^c$ is the number of examples covered by the expression and belonging to the target class $i$ (the number of examples correctly classified by a given rule), and $E_i$ is the number of examples in the training set belonging to the target class $i$. In Equation (9), the first term
$$
G = \sqrt{\frac{E^c}{E}}
\tag{10}
$$
relates to the generality of the rule and the second term
$$
A = 2 - 2\sqrt{\frac{E_i^c}{E^c}\,\frac{E_i}{E}} - 2\sqrt{\left(1-\frac{E_i^c}{E^c}\right)\left(1-\frac{E_i}{E}\right)}
\tag{11}
$$
indicates its accuracy. In steps 3 and 4, by specialising PRSET using the conditions stored in SETAV, the following expressions are formed and stored in T_PRSET:
1. Sensor_3 = 0 ⇒ Alarm = OFF    H = 0.2565
2. Sensor_2 = −1 ⇒ Alarm = OFF    H = 0.0113
3. Sensor_1 = 0 ⇒ Alarm = OFF    H = 0.0012
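The three H values listed above can be reproduced directly from Table 1. The sketch below evaluates the H measure as the product of a generality term, the square root of E^c/E, and the accuracy term of Equation (11). It assumes that “Alarm = OFF” in the expressions corresponds to the tool state “Normal” in Table 1 (and “Alarm = ON” to “Worn”); the function name h_measure and the dictionary encoding of conditions are likewise illustrative assumptions.

```python
from math import sqrt

# Rows of Table 1: (Sensor_1, Sensor_2, Sensor_3, tool state).
EXAMPLES = [
    (0, -1, 0, "Normal"), (1, 0, 0, "Normal"), (1, -1, 1, "Worn"),
    (1, 0, 1, "Normal"), (0, 0, 1, "Normal"), (1, 1, 1, "Worn"),
    (1, -1, 0, "Normal"), (0, -1, 1, "Worn"),
]


def h_measure(conditions, target, examples=EXAMPLES):
    """H measure of an expression.

    conditions: {attribute index: required value}; target: class label.
    """
    covered = [ex for ex in examples
               if all(ex[i] == v for i, v in conditions.items())]
    if not covered:
        return 0.0
    E, Ec = len(examples), len(covered)
    Eic = sum(1 for ex in covered if ex[-1] == target)   # covered and in class i
    Ei = sum(1 for ex in examples if ex[-1] == target)   # all examples in class i
    p, q = Eic / Ec, Ei / E
    accuracy = 2 - 2 * sqrt(p * q) - 2 * sqrt((1 - p) * (1 - q))
    generality = sqrt(Ec / E)
    return generality * accuracy


# Assumption: "Alarm = OFF" is read as tool state "Normal".
for label, conditions in [("Sensor_3 = 0", {2: 0}),
                          ("Sensor_2 = -1", {1: -1}),
                          ("Sensor_1 = 0", {0: 0})]:
    print(f"{label}: H = {h_measure(conditions, 'Normal'):.4f}")
```

For the first expression, for instance, three examples are covered and all are Normal, so H = sqrt(3/8) × [2 − 2·sqrt(5/8)] ≈ 0.2565.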
In step 5, a rule is produced as the first expression in T_PRSET applies to only one class:
Rule 1: IF Sensor_3 = 0 THEN Alarm = OFF    H = 0.2565
Rule 1 can classify examples 2 and 7 in addition to example 1. Therefore, these examples are marked as classified and the induction proceeds. In the second iteration, example 3 is considered. T_PRSET, formed in step 4 after specialising the initial PRSET, now consists of the following expressions:
1. Sensor_3 = 1 ⇒ Alarm = ON    H = 0.0406
2. Sensor_2 = −1 ⇒ Alarm = ON    H = 0.0079
3. Sensor_1 = 1 ⇒ Alarm = ON    H = 0.0005
As none of the expressions cover only one class, T_PRSET is copied into PRSET (step 5) and the new PRSET has to be specialised further by appending the existing expressions with conditions from SETAV. Therefore the procedure returns to step 3 for a new pass. The new T_PRSET formed at the end of step 4 contains the following three expressions:
1. Sensor_2 = −1, Sensor_3 = 1 ⇒ Alarm = ON    H = 0.3876
2. Sensor_1 = 1, Sensor_3 = 1 ⇒ Alarm = ON    H = 0.0534
3. Sensor_1 = 1, Sensor_2 = −1 ⇒ Alarm = ON    H = 0.0008
As the first expression applies to only one class, the following rule is obtained:
Rule 2: IF Sensor_2 = −1 AND Sensor_3 = 1 THEN Alarm = ON    H = 0.3876
Rule 2 can classify examples 3 and 8, which again are marked as classified. In the third iteration, example 4 is used to obtain the next rule:
Rule 3: IF Sensor_2 = 0 THEN Alarm = OFF    H = 0.2565
This rule can classify examples 4 and 5 and so they are also marked as classified. In iteration 4, the last unclassified example 6 is employed for rule extraction, yielding:
Rule 4: IF Sensor_2 = 1 THEN Alarm = ON    H = 0.2741
There are no remaining unclassified examples in the example set and the procedure terminates at this point.
Due to its requirement for a set of examples in a rigid format (with known attributes and of known classes), inductive learning has found rather limited applications in engineering as not many engineering problems can be described in terms of such a set of examples. Another reason for the paucity of applications is that inductive learning is generally more suitable for problems where attributes have discrete or symbolic values than for those with continuous-valued attributes as in many engineering problems.
Some recent examples of applications of inductive learning are:
• controlling a laser cutting robot [Luzeaux, 1994];
• controlling the functional electrical stimulation of spinally-injured humans [Kostov et al., 1995];
• modelling job complexity in clothing production systems [Hui et al., 1997];
• analysing the constructability of a beam in a reinforced-concrete frame [Skibniewski et al., 1997];
• analysing the results of tests on portable electronic products to discover useful design knowledge [Zhou, 2001];
• accelerating rotogravure printing [Evans and Fisher, 2002];
• predicting JIT factory performance from past data that includes both good and poor factory performance [Mathieu et al., 2002];
• developing an intelligent monitoring system for improving the reliability of a manufacturing process [Peng, 2004];
• analysing data in a steel bar manufacturing company to help intelligent decision making [Pham et al., 2004].

More information on inductive learning techniques and their applications in engineering and manufacture can be found in [Pham et al., 2002; Pham and Afify, 2005b].
4. NEURAL NETWORKS
Like inductive learning programs, neural networks can capture domain knowledge from examples. However, they do not archive the acquired knowledge in an explicit form such as rules or decision trees, and they can readily handle both continuous and discrete data. Like fuzzy expert systems, they also have a good generalisation capability.

A neural network is a computational model of the brain. Neural network models usually assume that computation is distributed over several simple units called neurons, which are interconnected and operate in parallel (hence, neural networks are also called parallel-distributed-processing systems or connectionist systems). Figure 4 illustrates a typical model of a neuron. The output signal $y_j$ is a function $f$ of the sum of weighted input signals $x_i$. The activation function $f$ can be a linear, simple threshold, sigmoidal, hyperbolic tangent or radial basis function. Instead of being deterministic, $f$ can be a probabilistic function, in which case $y_j$ will be a binary quantity, for example, +1 or −1. The net input to such a stochastic neuron, that is, the sum of weighted input signals $x_i$, will then give the probability of $y_j$ being +1 or −1.

How the inter-neuron connections are arranged and the nature of the connections determine the structure of a network. How the strengths of the connections are adjusted or trained to achieve a desired overall behaviour of the network is governed by its learning algorithm. Neural networks can be classified according to their structures and learning algorithms.

In terms of their structures, neural networks can be divided into two types: feedforward networks and recurrent networks. Feedforward networks can perform a static mapping between an input space and an output space: the output at a given instant is a function only of the input at that instant.
The most popular feedforward neural network is the multi-layer perceptron (MLP): all signals flow in a single direction from the input to the output of the network. Figure 5 shows an MLP with three layers: an input layer, an output layer and an intermediate or hidden layer. Neurons in the input layer only act as buffers for distributing the input signals $x_i$ to neurons in the hidden layer. Each neuron $j$ in the hidden layer operates according to the model of Figure 4. That is, its output $y_j$ is given by:

$$y_j = f\left(\sum_i w_{ji}\, x_i\right) \tag{12}$$

Figure 4. Model of a neuron

Figure 5. A multi-layer perceptron
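The neuron model of Equation (12) and the layer-by-layer flow of signals in an MLP can be sketched in a few lines. The function names, the choice of a sigmoid activation and the weight values below are assumptions made purely for illustration.

```python
import math


def sigmoid(net):
    """Sigmoidal activation, one of the choices for f listed in the text."""
    return 1.0 / (1.0 + math.exp(-net))


def neuron_output(weights, inputs, f=sigmoid):
    """Equation (12): y_j = f(sum_i w_ji * x_i)."""
    net = sum(w * x for w, x in zip(weights, inputs))
    return f(net)


def mlp_forward(x, hidden_weights, output_weights, f=sigmoid):
    """Forward pass of a three-layer MLP.

    The input layer only distributes x; each row of a weight matrix holds
    the incoming weights w_ji of one neuron j.
    """
    hidden = [neuron_output(w_j, x, f) for w_j in hidden_weights]
    return [neuron_output(w_k, hidden, f) for w_k in output_weights]


# Hypothetical network: 2 inputs, 2 hidden neurons, 1 output neuron.
x = [1.0, 0.0]
hidden_weights = [[0.5, -0.5], [0.0, 0.0]]
output_weights = [[1.0, -1.0]]
print(mlp_forward(x, hidden_weights, output_weights))
```

Because the output depends only on the current `x`, the sketch also shows the static input-to-output mapping that characterises feedforward networks.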
The outputs of neurons in the output layer are computed similarly. Other feedforward networks [Pham and Liu, 1999] include the learning vector quantisation (LVQ) network, the cerebellar model articulation control (CMAC) network and the group-method of data handling (GMDH) network.

Recurrent networks are networks where the outputs of some neurons are fed back to the same neurons or to neurons in layers before them. Thus signals can flow in both forward and backward directions. Recurrent networks are said to have a dynamic memory: the output of such networks at a given instant reflects the current input as well as previous inputs and outputs. Examples of recurrent networks [Pham and Liu, 1999] include the Hopfield network, the Elman network and the Jordan network.

Figure 6 shows a well-known, simple recurrent neural network, the Grossberg and Carpenter ART-1 network. The network has two layers, an input layer and an output layer. The two layers are fully interconnected, with connections in both the forward (or bottom-up) direction and the feedback (or top-down) direction. The vector $W_i$ of weights of the bottom-up connections to an output neuron $i$ forms an exemplar of the class it represents. All the $W_i$ vectors constitute the long-term memory of the network. They are employed to select the winning neuron, the latter again being the neuron whose $W_i$ vector is most similar to the current input pattern. The vector $V_i$ of the weights of the top-down connections from an output neuron $i$ is used for vigilance testing, that is, determining whether an input pattern is sufficiently close to a stored exemplar. The vigilance vectors $V_i$ form the short-term memory of the network. $V_i$ and $W_i$ are related in that $W_i$ is a normalised copy of $V_i$, viz.

$$W_i = \frac{V_i}{\varepsilon + \sum_j V_{ji}} \tag{13}$$

where $\varepsilon$ is a small constant and $V_{ji}$ is the $j$th component of $V_i$ (i.e. the weight of the connection from output neuron $i$ to input neuron $j$).

Figure 6. An ART-1 network (input layer and output layer linked by bottom-up weights W and top-down weights V)

Implicit “knowledge” is built into a neural network by training it. Neural networks are trained and categorised according to two main types of learning algorithms: supervised and unsupervised. In addition, there is a third type, reinforcement learning, which is a special case of supervised learning.

In supervised training, the neural network is trained by being presented with typical input patterns and the corresponding expected output patterns. The error between the actual and expected outputs is used to modify the strengths, or weights, of the connections between the neurons. The backpropagation (BP) algorithm, a gradient descent algorithm, is the most commonly adopted MLP training algorithm. It gives the change $\Delta w_{ji}$ in the weight of a connection between neurons $i$ and $j$ as follows:

$$\Delta w_{ji} = \eta\, \delta_j\, x_i \tag{14}$$

where $\eta$ is a parameter called the learning rate and $\delta_j$ is a factor depending on whether neuron $j$ is an output neuron or a hidden neuron. For output neurons,

$$\delta_j = \frac{\partial f}{\partial net_j}\left(y_j^{(t)} - y_j\right) \tag{15}$$

and for hidden neurons,

$$\delta_j = \frac{\partial f}{\partial net_j} \sum_q w_{qj}\, \delta_q \tag{16}$$
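A single pattern-based weight update following Equations (14) to (16) might look as follows for a three-layer MLP. This is a minimal sketch under stated assumptions: a sigmoid activation is assumed, so that the derivative of f with respect to the net input is y(1 − y), and the function name, learning rate and weight values are hypothetical.

```python
import math


def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))


def backprop_step(x, target, w_hidden, w_out, eta=0.5):
    """One weight update for a single training pattern.

    w_hidden[j][i] is the weight from input i to hidden neuron j;
    w_out[q][j] is the weight from hidden neuron j to output neuron q.
    Returns the new weights and the squared error before the update.
    """
    # Forward pass.
    h = [sigmoid(sum(w * xi for w, xi in zip(w_j, x))) for w_j in w_hidden]
    y = [sigmoid(sum(w * hj for w, hj in zip(w_q, h))) for w_q in w_out]

    # Equation (15): delta for output neurons (sigmoid derivative is y(1 - y)).
    d_out = [yq * (1.0 - yq) * (t - yq) for yq, t in zip(y, target)]

    # Equation (16): delta for hidden neurons, using the deltas of the
    # output neurons q fed by hidden neuron j.
    d_hid = [
        h[j] * (1.0 - h[j]) * sum(w_out[q][j] * d_out[q] for q in range(len(w_out)))
        for j in range(len(h))
    ]

    # Equation (14): weight change = eta * delta_j * (input to the connection).
    new_w_out = [[w + eta * d_out[q] * h[j] for j, w in enumerate(w_q)]
                 for q, w_q in enumerate(w_out)]
    new_w_hidden = [[w + eta * d_hid[j] * x[i] for i, w in enumerate(w_j)]
                    for j, w_j in enumerate(w_hidden)]

    error = sum((t - yq) ** 2 for t, yq in zip(target, y))
    return new_w_hidden, new_w_out, error


# Hypothetical pattern and starting weights.
x, target = [1.0, 0.5], [1.0]
w_hidden = [[0.1, -0.2], [0.3, 0.4]]
w_out = [[0.2, -0.1]]
errors = []
for _ in range(50):
    w_hidden, w_out, err = backprop_step(x, target, w_hidden, w_out)
    errors.append(err)
print(errors[0], errors[-1])
```

Repeating the step on the same pattern drives the squared error down, which is the gradient descent behaviour the text attributes to BP.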
In Equation (15), $net_j$ is the total weighted sum of input signals to neuron $j$ and $y_j^{(t)}$ is the target output for neuron $j$. As there are no target outputs for hidden neurons, in Equation (16), the difference between the target and actual output of a hidden neuron $j$ is replaced by the weighted sum of the $\delta_q$ terms already obtained for neurons $q$ connected to the output of $j$. Thus, iteratively, beginning with the output layer, the $\delta$ term is computed for neurons in all layers and weight updates determined for all connections.

The weight updating process can take place after the presentation of each training pattern (pattern-based training) or after the presentation of the whole set of training patterns (batch training). In either case, a training epoch is said to have been completed when all training patterns have been presented once to the MLP. For all but the most trivial problems, several epochs are required for the MLP to be properly trained. A commonly adopted method to speed up the training is to add a “momentum” term to Equation (14) which effectively lets the previous weight change influence the new weight change, viz:

$$\Delta w_{ji}(k+1) = \eta\, \delta_j\, x_i + \mu\, \Delta w_{ji}(k) \tag{17}$$
where $\Delta w_{ji}(k+1)$ and $\Delta w_{ji}(k)$ are the weight changes in epochs $k+1$ and $k$ respectively and $\mu$ is the “momentum” coefficient.

Some neural networks are trained in an unsupervised mode, where only the input patterns are provided during training and the networks learn automatically to cluster them in groups with similar features. For example, training an ART-1 network involves the following steps:
(i) initialising the exemplar and vigilance vectors $W_i$ and $V_i$ for all output neurons by setting all the components of each $V_i$ to 1 and computing $W_i$ according to Equation (13). An output neuron with all its vigilance weights set to 1 is known as an uncommitted neuron in the sense that it is not assigned to represent any pattern classes;
(ii) presenting a new input pattern $x$;
(iii) enabling all output neurons so that they can participate in the competition for activation;
(iv) finding the winning output neuron among the competing neurons, i.e. the neuron for which $x \cdot W_i$ is largest; a winning neuron can be an uncommitted neuron, as is the case at the beginning of training or if there are no better output neurons;
(v) testing whether the input pattern $x$ is sufficiently similar to the vigilance vector $V_i$ of the winning neuron. Similarity is measured by the fraction $r$ of bits in $x$ that are also in $V_i$, viz.

$$r = \frac{x \cdot V_i}{\sum_j x_j} \tag{18}$$

$x$ is deemed to be sufficiently similar to $V_i$ if $r$ is at least equal to the vigilance threshold $\rho$ ($0 < \rho \le 1$);
(vi) going to step (vii) if $r \ge \rho$ (i.e. there is resonance); else disabling the winning neuron temporarily from further competition and going to step (iv), repeating this procedure until there are no further enabled neurons;
(vii) adjusting the vigilance vector $V_i$ of the most recent winning neuron by logically ANDing it with $x$, thus deleting bits in $V_i$ that are not also in $x$; computing the bottom-up exemplar vector $W_i$ using the new $V_i$ according to Equation (13); activating the winning output neuron;
(viii) going to step (ii).

The above training procedure ensures that if the same sequence of training patterns is repeatedly presented to the network, its long-term and short-term memories are unchanged (i.e. the network is stable). Also, provided there are sufficient output neurons to represent all the different classes, new patterns can always be learnt, as a new pattern can be assigned to an uncommitted output neuron if it does not match previously stored exemplars well (i.e. the network is plastic).

In reinforcement learning, instead of requiring a teacher to give target outputs and using the differences between the target and actual outputs directly to modify the weights of a neural network, the learning algorithm employs a critic only to evaluate the appropriateness of the neural network output corresponding to a given input. According to the performance of the network on a given input vector, the critic will issue a positive or negative reinforcement signal. If the network has produced an appropriate output, the reinforcement signal will be positive (a reward). Otherwise, it will be negative (a penalty). The intention of this is to strengthen the tendency to produce appropriate outputs and to weaken the propensity for generating inappropriate outputs. Reinforcement learning is a trial-and-error operation designed to maximise the average value of the reinforcement signal for a set of training input vectors.
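The ART-1 training procedure of steps (i) to (viii) can be sketched as below for binary patterns. The function names, the default values of the vigilance threshold and of the small constant in Equation (13), and the handling of an all-zero pattern are assumptions made for this illustration.

```python
def exemplar(v, eps):
    """Equation (13): W_i is V_i divided by (eps + sum of its components)."""
    total = sum(v)
    return [vj / (eps + total) for vj in v]


def art1_train(patterns, n_outputs, rho=0.7, eps=0.5, passes=1):
    """Cluster binary patterns; returns (V, W, winners of the last pass)."""
    n_inputs = len(patterns[0])
    V = [[1] * n_inputs for _ in range(n_outputs)]   # step (i): uncommitted
    W = [exemplar(v, eps) for v in V]
    winners = []
    for _ in range(passes):
        winners = []
        for x in patterns:                           # step (ii)
            enabled = list(range(n_outputs))         # step (iii)
            winner = None
            while enabled:
                # Step (iv): the enabled neuron with the largest x . W_i wins.
                i = max(enabled,
                        key=lambda k: sum(w * xj for w, xj in zip(W[k], x)))
                # Step (v), Equation (18): fraction of bits in x also in V_i.
                ones = sum(x)
                r = sum(xj & vj for xj, vj in zip(x, V[i])) / ones if ones else 1.0
                if r >= rho:                         # step (vi): resonance
                    winner = i
                    break
                enabled.remove(i)                    # disable and try again
            if winner is not None:                   # step (vii)
                V[winner] = [xj & vj for xj, vj in zip(x, V[winner])]
                W[winner] = exemplar(V[winner], eps)
            winners.append(winner)                   # step (viii): next pattern
    return V, W, winners


# Hypothetical 4-bit patterns forming two obvious groups.
patterns = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 1, 0, 0]]
V, W, winners = art1_train(patterns, n_outputs=3)
print(winners)
```

Calling `art1_train` with `passes=2` on the same patterns leaves `V` unchanged, which mirrors the stability property described in the text, while the third output neuron remains uncommitted and available for a new class.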
An example of a simple reinforcement learning algorithm is a variation of the associative reward-penalty algorithm [Hassoun, 1995]. Consider a single stochastic neuron $j$ with inputs $(x_1, x_2, x_3, \ldots, x_n)$. The reinforcement rule may be written as [Hassoun, 1995]

$$w_{ji}(k+1) = w_{ji}(k) + l\, r(k)\left[y_j(k) - E(y_j(k))\right] x_i(k) \tag{19}$$
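Equation (19) can be tried out on a single stochastic neuron with outputs +1 or −1. The sketch below assumes that the probability of a +1 output is a logistic function of twice the net input, so that the expected output is tanh(net); this probability model, the critic, the learning coefficient and all names are assumptions for illustration rather than details given in the text.

```python
import math
import random


def expected_output(w, x):
    """E(y_j) under the assumed model P(y = +1) = 1 / (1 + exp(-2 * net))."""
    net = sum(wi * xi for wi, xi in zip(w, x))
    return math.tanh(net)


def sample_output(w, x, rng):
    """Draw a stochastic output of +1 or -1."""
    p_plus = (1.0 + expected_output(w, x)) / 2.0
    return 1 if rng.random() < p_plus else -1


def reinforce_step(w, x, y, r, l=0.1):
    """Equation (19): w(k+1) = w(k) + l * r * (y - E(y)) * x."""
    e_y = expected_output(w, x)
    return [wi + l * r * (y - e_y) * xi for wi, xi in zip(w, x)]


# Hypothetical task: the critic rewards an output of +1 for this input.
rng = random.Random(0)
w = [0.0, 0.0]
x = [1.0, 1.0]
for _ in range(200):
    y = sample_output(w, x, rng)
    r = 1 if y == 1 else -1          # reward or penalty from the critic
    w = reinforce_step(w, x, y, r)
print(expected_output(w, x))
```

In this toy setting both a rewarded +1 and a penalised −1 push the net input upwards, so the expected output moves towards +1 and the neuron behaves more and more deterministically.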
In Equation (19), $w_{ji}$ is the weight of the connection between input $i$ and neuron $j$, $l$ is the learning coefficient, $r$ (which is +1 or −1) is the reinforcement signal, $y_j$ is the output of neuron $j$, $E(y_j)$ is the expected value of the output, and $x_i(k)$ is the $i$th component of the $k$th input vector in the training set. When learning converges, $w_{ji}(k+1) = w_{ji}(k)$ and so $E(y_j(k)) = y_j(k) = +1$ or $-1$. Thus, the neuron effectively becomes deterministic. Reinforcement learning is typically slower than supervised learning. It is more applicable to small neural networks used as controllers where it is difficult to determine the target network output.

For more information on neural networks, see [Michie et al., 1994; Hassoun, 1995; Pham and Liu, 1999; Yao, 1999; Jiang et al., 2002; Duch et al., 2004].

Neural networks can be employed as mapping devices, pattern classifiers or pattern completers (auto-associative content addressable memories and pattern associators). Like expert systems, they have found a wide spectrum of applications in almost all areas of engineering, addressing problems ranging from modelling, prediction, control, classification and pattern recognition, to data association, clustering, signal processing and optimisation. Some recent examples of such applications are:
• predicting the tensile strength of composite laminates [Teti and Caprino, 1994];
• controlling a flexible assembly operation [Majors and Richards, 1995];
• choosing sheet metal working conditions [Lin and Chang, 1996];
• determining suitable cutting conditions in operation planning [Park et al., 1996; Schultz et al., 1997];
• recognising control chart patterns [Pham and Oztemel, 1996];
• analysing vibration spectra [Smith et al., 1996];
• deducing velocity vectors in uniform and rotating flows by tracking the movement of groups of particles [Jambunathan et al., 1997];
• setting the number of kanbans in a dynamic JIT factory [Wray et al., 1997; Markham et al., 2000];
• generating knowledge for scheduling a flexible manufacturing system [Kim et al., 1998; Priore et al., 2003];
• modelling and controlling dynamic systems including robot arms [Pham and Liu, 1999];
• acquiring and refining operational knowledge in industrial processes [Shigaki and Narazaki, 1999];
• improving yield in a semiconductor manufacturing company [Shin and Park, 2000];
• identifying arbitrary geometric and manufacturing categories in CAD databases [Ip et al., 2003];
• minimising the makespan in a flowshop scheduling problem [Akyol, 2004].

5. GENETIC ALGORITHMS
Conventional search techniques, such as hill-climbing, are often incapable of optimising non-linear or multimodal functions. In such cases, a random search method is generally required. However, undirected search techniques are extremely inefficient for large domains. A genetic algorithm (GA) is a directed random search technique, invented by Holland [Holland, 1975], which can find the global optimal solution in complex multi-dimensional search spaces. A GA is modelled on natural evolution in that the operators it employs are inspired by the natural evolution process. These operators, known as genetic operators, manipulate individuals in a population over several generations to improve their fitness gradually. Individuals in a population are likened to chromosomes and usually represented as strings of binary numbers.

The evolution of a population is described by the “schema theorem” [Holland, 1975; Goldberg, 1989]. A schema represents a set of individuals, i.e. a subset of the population, in terms of the similarity of bits at certain positions of those individuals. For example, the schema 1∗0∗ describes the set of individuals whose first and third bits are 1 and 0, respectively. Here, the symbol ∗ means any value would be acceptable. In other words, the values of bits at positions marked ∗ could be either 0 or 1. A schema is characterised by two parameters: defining length and order. The defining length is the distance between the first and last bits with fixed values. The order of a schema is the number of bits with specified values. According to the schema theorem, the distribution of a schema through the population from one generation to the next depends on its order, defining length and fitness.

GAs do not use much knowledge about the optimisation problem under study and do not deal directly with the parameters of the problem. They work with codes which represent the parameters. Thus, the first issue in a GA application is how to code the problem, i.e. how to represent its parameters. As already mentioned, GAs operate with a population of possible solutions. The second issue is the creation of a set of possible solutions at the start of the optimisation process as the initial population. The third issue in a GA application is how to select or devise a suitable set of genetic operators. Finally, as with other search algorithms, GAs have to know the quality of the solutions already found to improve them further. An interface between the problem environment and the GA is needed to provide this information. The design of this interface is the fourth issue.
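The two schema parameters can be made concrete with a short sketch. The helper names are hypothetical, and an ASCII `*` stands in for the ∗ symbol used in the text.

```python
def schema_order(schema):
    """Number of positions with a specified (non-*) value."""
    return sum(1 for c in schema if c != "*")


def defining_length(schema):
    """Distance between the first and last positions with fixed values."""
    fixed = [i for i, c in enumerate(schema) if c != "*"]
    return fixed[-1] - fixed[0] if fixed else 0


def matches(schema, individual):
    """True if the individual belongs to the set described by the schema."""
    return len(schema) == len(individual) and all(
        s == "*" or s == b for s, b in zip(schema, individual)
    )


schema = "1*0*"
print(schema_order(schema), defining_length(schema))
print([ind for ind in ["1100", "1001", "0100", "1110"] if matches(schema, ind)])
```

For the schema 1∗0∗ of the text, the order is 2 and the defining length is 2, and the individuals 1100 and 1001 are members while 0100 and 1110 are not.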
5.1 Representation
The parameters to be optimised are usually represented in a string form since this type of representation is suitable for genetic operators. The method of representation has a major impact on the performance of the GA. Different representation schemes might cause different performances in terms of accuracy and computation time.

There are two common representation methods for numerical optimisation problems [Blickle and Thiele, 1995; Michalewicz, 1996]. The preferred method is the binary string representation method. This method is popular because the binary alphabet offers the maximum number of schemata per bit compared to other coding techniques. Various binary coding schemes can be found in the literature, for example, uniform coding and Gray scale coding. The second representation method is to use a vector of integers or real numbers, with each integer or real number representing a single parameter.

When a binary representation scheme is employed, an important step is to decide the number of bits to encode the parameters to be optimised. Each parameter should be encoded with the optimal number of bits covering all possible solutions in the solution space. When too few or too many bits are used, the performance can be adversely affected.
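A minimal sketch of a binary representation is given below: a bit string is decoded into a real parameter within given bounds, and converted to and from Gray code. The function names and the parameter range are assumptions for illustration.

```python
def decode(bits, lower, upper):
    """Map a binary string linearly onto the interval [lower, upper]."""
    n = int(bits, 2)
    return lower + (upper - lower) * n / (2 ** len(bits) - 1)


def binary_to_gray(bits):
    """Convert a plain binary string to its Gray-coded equivalent."""
    n = int(bits, 2)
    return format(n ^ (n >> 1), "0{}b".format(len(bits)))


def gray_to_binary(gray):
    """Invert binary_to_gray."""
    n = int(gray, 2)
    mask = n >> 1
    while mask:
        n ^= mask
        mask >>= 1
    return format(n, "0{}b".format(len(gray)))


# A hypothetical parameter in [0, 15] encoded with 4 bits.
print(decode("1000", 0.0, 15.0))
print(binary_to_gray("0111"), binary_to_gray("1000"))
```

With this decoding, the resolution of a parameter is (upper − lower) / (2^n − 1) for n bits, which is one way to see why the number of bits matters. In plain binary the neighbouring values 7 and 8 differ in all four bits, whereas their Gray codes differ in a single bit.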
5.2 Creation of Initial Population
At the start of optimisation, a GA requires a group of initial solutions. There are two ways of forming this initial population. The first consists of using randomly produced solutions created by a random number generator, for example. This method is preferred for problems about which no a priori knowledge exists or for assessing the performance of an algorithm. The second method employs a priori knowledge about the given optimisation problem. Using this knowledge, a set of requirements is obtained and solutions which satisfy those requirements are collected to form an initial population. In this case, the GA starts the optimisation with a set of approximately known solutions and therefore convergence to an optimal solution can take less time than with the previous method.

5.3 Genetic Operators
The flowchart of a simple GA is given in Figure 7. There are basically four genetic operators: selection, crossover, mutation and inversion. Some of these operators were inspired by nature. In the literature, many versions of these operators can be found. It is not necessary to employ all of these operators in a GA because each operates independently of the others. The choice or design of operators depends on the problem and the representation scheme employed. For instance, operators designed for binary strings cannot be directly used on strings coded with integers or real numbers.

5.3.1 Selection
The aim of the selection procedure is to reproduce more copies of individuals whose fitness values are high than of those whose fitness values are low. The selection procedure has a significant influence on driving the search towards a promising area and finding good solutions in a short time. However, the diversity of the population must be maintained to avoid premature convergence and to reach the global optimal solution. In GAs there are mainly two selection procedures: proportional selection, also called stochastic selection, and ranking-based selection [Whitely, 1989]. Proportional selection is usually called “Roulette Wheel” selection, since its mechanism is reminiscent of the operation of a Roulette Wheel. Fitness values of individuals represent the widths of slots on the wheel. After a random spinning of the wheel to select an individual for the next generation, slots with large widths, representing individuals with high fitness values, will have a higher chance of being selected.

One way to prevent premature convergence is to control the range of trials allocated to any single individual, so that no individual produces too many offspring. The ranking system is one such alternative selection algorithm. In this algorithm, each individual generates an expected number of offspring which is based on the rank of its performance and not on the magnitude [Baker, 1985].

Figure 7. Flowchart of a basic genetic algorithm (Initial Population, Evaluation, Selection, Crossover, Mutation, Inversion)

5.3.2 Crossover
This operation is considered the one that makes the GA different from other algorithms, such as dynamic programming. It is used to create two new individuals (children) from two existing individuals (parents) picked from the current population by the selection operation. There are several ways of doing this. Some common crossover operations are one-point crossover, two-point crossover, cycle crossover and uniform crossover. One-point crossover is the simplest crossover operation. Two individuals are randomly selected as parents from the pool of individuals formed by the selection procedure and cut at a randomly selected point. The tails, which are the parts after the cutting point, are swapped and two new individuals (children) are produced. Note that this operation does not change the values of bits. An example of one-point crossover is shown in Figure 8.

Parent 1:     100|010011110
Parent 2:     001|011000110
New string 1: 100|011000110
New string 2: 001|010011110
Figure 8. Crossover

5.3.3 Mutation

In this procedure, all individuals in the population are checked bit by bit and the bit values are randomly reversed according to a specified rate. Unlike crossover, this is a monadic operation. That is, a child string is produced from a single parent string. The mutation operator forces the algorithm to search new areas. Ultimately, it helps the GA to avoid premature convergence and find the global optimal solution. An example is given in Figure 9.

Old string: 1100|0|1011101
New string: 1100|1|1011101
Figure 9. Mutation
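The selection, crossover and mutation operators described in Sections 5.3.1 to 5.3.3 can be sketched on binary strings as follows. The function names and the use of a seeded `random.Random` instance are choices made for this illustration, and the roulette wheel sketch assumes non-negative fitness values.

```python
import random


def roulette_select(population, fitnesses, rng):
    """Proportional selection: slot widths are the (non-negative) fitnesses."""
    spin = rng.random() * sum(fitnesses)
    cumulative = 0.0
    for individual, fit in zip(population, fitnesses):
        cumulative += fit
        if spin < cumulative:
            return individual
    return population[-1]  # guard against rounding at the end of the wheel


def one_point_crossover(parent1, parent2, point):
    """Swap the tails of two parents after the cutting point."""
    return (parent1[:point] + parent2[point:],
            parent2[:point] + parent1[point:])


def mutate(bits, rate, rng):
    """Reverse each bit independently with probability `rate`."""
    return "".join(
        ("1" if b == "0" else "0") if rng.random() < rate else b
        for b in bits
    )


rng = random.Random(42)
# The parents and cutting point of Figure 8.
print(one_point_crossover("100010011110", "001011000110", 3))
print(mutate("110001011101", 0.05, rng))
print(roulette_select(["a", "b", "c"], [1.0, 3.0, 6.0], rng))
```

With the cutting point after the third bit, `one_point_crossover` reproduces the two new strings of Figure 8. Crossover only rearranges existing bits, so new bit values enter the population through mutation alone.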
5.3.4 Inversion
This operator is employed for a group of problems, such as the cell placement problem, layout problem and travelling salesman problem. It also operates on one individual at a time. Two points are randomly selected from an individual and the part of the string between those two points is reversed (see Figure 10).
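Inversion of a string segment can be sketched as follows. The function names and the way the two points are drawn are assumptions for illustration; `invert` takes the segment as a half-open range of positions.

```python
import random


def invert(bits, start, end):
    """Reverse the part of the string between positions start and end."""
    return bits[:start] + bits[start:end][::-1] + bits[end:]


def random_inversion(bits, rng):
    """Pick two distinct cut points at random and invert the segment between them."""
    start, end = sorted(rng.sample(range(len(bits) + 1), 2))
    return invert(bits, start, end)


# The string of Figure 10, with the segment 1100 between the two points.
print(invert("10110011101", 2, 6))
print(random_inversion("10110011101", random.Random(7)))
```

The first call reproduces the new string of Figure 10. Inversion reorders bits without changing how many 0s and 1s the string contains, which suits ordering problems such as those named in the text.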
5.4 Control Parameters
Important control parameters of a simple GA include the population size (number of individuals in the population), crossover rate, mutation rate and inversion rate. Several researchers have studied the effect of these parameters on the performance of a GA [Schaffer et al., 1989; Grefenstette, 1986; Fogarty, 1989; Mahfoud, 1995; Smith and Fogarty, 1997]. The main conclusions are as follows. A large population size means the simultaneous handling of many solutions and increases the computation time per iteration; however, since many samples from the search space are used, the probability of convergence to a global optimal solution is higher than with a small population size. The crossover rate determines the frequency of the crossover operation. It is useful at the start of optimisation to discover promising regions in the search space. A low crossover frequency decreases the speed of convergence to such areas. If the frequency is too high, it can lead to saturation around one solution. The mutation operation is controlled by the mutation rate. A high mutation rate introduces high diversity in the population and might cause instability. On the other hand, it is usually very difficult for a GA to find a global optimal solution with too low a mutation rate.

Old string: 10|1100|11101
New string: 10|0011|11101
Figure 10. Inversion of a binary string segment

5.5 Fitness Evaluation Function
The fitness evaluation unit in a GA acts as an interface between the GA and the optimisation problem. The GA assesses solutions for their quality according to the information produced by this unit and not by directly using information about their structure. In engineering design problems, functional requirements are specified to the designer who has to produce a structure which performs the desired functions within predetermined constraints. The quality of a proposed solution is usually calculated depending on how well the solution performs the desired functions and satisfies the given constraints. In the case of a GA, this calculation must be automatic and the problem is how to devise a procedure which computes the quality of solutions.

Fitness evaluation functions might be complex or simple depending on the optimisation problem at hand. Where a mathematical equation cannot be formulated for this task, a rule-based procedure can be constructed for use as a fitness function or in some cases both can be combined. Where some constraints are very important and cannot be violated, the structures or solutions which do so can be eliminated in advance by appropriately designing the representation scheme. Alternatively, they can be given low probabilities by using special penalty functions.

For further information on genetic algorithms, see [Holland, 1975; Goldberg, 1989; Davis, 1991; Mitchell, 1996; Pham and Karaboga, 2000; Freitas, 2002].

Genetic algorithms have found applications in engineering problems involving complex combinatorial or multi-parameter optimisation. Some recent examples of those applications are:
• configuring transmission systems [Pham and Yang, 1993];
• designing the knowledge base of fuzzy logic controllers [Pham and Karaboga, 1994];
• generating hardware description language programs for high-level specification of the function of programmable logic devices [Seals and Whapshott, 1994];
• planning collision-free paths for mobile and redundant robots [Ashiru et al., 1995; Wilde and Shellwat, 1997; Nearchou and Aspragathos, 1997];
• scheduling the operations of a job shop [Cho et al., 1996; Drake and Choudhry, 1997; Lee et al., 1997; Chryssolouris and Subramaniam, 2001; Pérez et al., 2003];
• generating dynamic schedules for the operation and control of a flexible manufacturing cell [Jawahar et al., 1998];
• optimising the performance of an industrially designed inventory control system [Disney, 2000];
• forming manufacturing cells and determining machine layout information for cellular manufacturing [Wu et al., 2002];
• optimising assembly process plans to improve productivity [Li et al., 2003];
• improving the convergence speed and reducing the computational complexity of neural networks [Öztürk and Öztürk, 2004].
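To tie the components of this section together, the sketch below runs a simple GA with a random initial population, roulette wheel selection, one-point crossover, bit mutation and a penalty-based fitness function. Everything specific in it is hypothetical: the objective (maximise the number of 1 bits), the made-up constraint of at most eight 1 bits, the penalty weight and the control parameter values were chosen only to keep the example small. Inversion is omitted, which the text allows since the operators work independently.

```python
import random


def fitness(bits):
    """Hypothetical objective with a penalty function (Section 5.5).

    Reward the number of 1 bits, but penalise strings that violate the
    made-up constraint 'at most 8 ones'. Fitness is kept non-negative so
    that it can be used as a slot width on the roulette wheel.
    """
    ones = bits.count("1")
    violation = max(0, ones - 8)
    return max(0.0, ones - 3.0 * violation)


def select(pop, fits, rng):
    """Roulette wheel selection."""
    total = sum(fits)
    if total == 0:
        return rng.choice(pop)
    spin = rng.random() * total
    acc = 0.0
    for ind, f in zip(pop, fits):
        acc += f
        if spin < acc:
            return ind
    return pop[-1]


def run_ga(n_bits=12, pop_size=20, generations=30,
           p_cross=0.8, p_mut=0.02, seed=0):
    rng = random.Random(seed)
    # Random initial population (first method of Section 5.2).
    pop = ["".join(rng.choice("01") for _ in range(n_bits))
           for _ in range(pop_size)]
    best = max(pop, key=fitness)
    history = [fitness(best)]          # best fitness found so far
    for _ in range(generations):
        fits = [fitness(ind) for ind in pop]
        new_pop = []
        while len(new_pop) < pop_size:
            a, b = select(pop, fits, rng), select(pop, fits, rng)
            if rng.random() < p_cross:                 # one-point crossover
                cut = rng.randint(1, n_bits - 1)
                a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
            for child in (a, b):                       # bit mutation
                child = "".join(
                    ("1" if c == "0" else "0") if rng.random() < p_mut else c
                    for c in child
                )
                new_pop.append(child)
        pop = new_pop[:pop_size]
        candidate = max(pop, key=fitness)
        if fitness(candidate) > fitness(best):
            best = candidate
        history.append(fitness(best))
    return best, history


best, history = run_ga()
print(best, history[-1])
```

The GA never inspects the structure of a string directly; it only sees the numbers returned by `fitness`, which is the interface role the text assigns to the fitness evaluation unit.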
6. SOME APPLICATIONS IN ENGINEERING AND MANUFACTURE
This section briefly reviews five engineering applications of the aforementioned soft computing tools.

6.1 Expert Statistical Process Control
Statistical process control (SPC) is a technique for improving the quality of processes and products through closely monitoring data collected from those processes and products and using statistically-based tools such as control charts. XPC is an expert system for facilitating and enhancing the implementation of statistical process control [Pham and Oztemel, 1996]. A commercially available shell was employed to build XPC. The shell allows a hybrid rule-based and pseudo object-oriented method of representing the standard SPC knowledge and process-specific diagnostic knowledge embedded in XPC. The amount of knowledge involved is extensive, which justifies the adoption of a knowledge-based systems approach.

XPC comprises four main modules. The construction module is used to set up a control chart. The capability analysis module is for calculating process capability indices. The on-line interpretation and diagnosis module assesses whether the process is in control and determines the causes for possible out-of-control situations. It also provides advice on how to remedy such situations. The modification module updates the parameters of a control chart to maintain true control over a time-varying process. XPC has been applied to the control of temperature in an injection moulding machine producing rubber seals. It has recently been enhanced by integrating a neural network module with the expert system modules to detect abnormal patterns in the control chart (see Figure 11).

6.2 Fuzzy Modelling of a Vibratory Sensor for Part Location
Figure 12 shows a six-degree-of-freedom vibratory sensor for determining the coordinates of the centre of mass xG yG and orientation of bulky rigid parts. The sensor is designed to enable a robot to pick up parts accurately for machine feeding or assembly tasks. The sensor consists of a rigid platform (P) mounted on a flexible column (C). The platform supports one object (O) to be located at a time. O is held firmly with respect to P. The static deflections of C under the weight of O and the natural frequencies of vibration of the dynamic system comprising O, P and C are measured and processed using a mathematical model of the system to determine xG , yG and for O. In practice, the frequency measurements have low repeatability, which leads to inconsistent location information. The problem worsens when is in the region 80 -90 relative to a reference axis of the sensor because the mathematical model becomes ill-conditioned. In this “ill-conditioning” region, an alternative to using a mathematical model to compute is to adopt an experimentally derived fuzzy model. Such a fuzzy model has to be obtained for
26
CHAPTER 1
Range Chart UCL : 9
15 Mean : 4.5
CL : 4
30
45
Mean Chart LCL : 0.00
60
75
98 PCI: 1.7
St. Dev : 1.5
State of the process: in-control
UCL : 93
15
CL : 78
30
Mean : 72.5
45
60
St. Dev : 4.4
LCL : 63
75
98 PSD : 4.0
State of the process: in-control
Warning !!!!!! Process going out of control!
press any key to continue
the pattern is normal the pattern is inc. trend the pattern is dec. trend the pattern is up. shift the pattern is down. shift the pattern is cyclic
(%) (%) (%) (%) (%) (%)
: 0.00 : 0.00 : 100.00 : 0.00 : 0.00 : 0.00 press 999 to exit
Figure 11. XPC output screen
each specific object through calibration. A possible calibration procedure involves placing the object at different positions xG yG and orientations and recording the periods of vibration T of the sensor. Following calibration, fuzzy rules relating xG , yG and T to could be constructed to form a fuzzy model of the behaviour of the sensor for the given object. A simpler fuzzy model is achieved by observing that xG
SOFT COMPUTING AND ITS APPLICATIONS IN ENGINEERING AND MANUFACTURE
Platform P
27
Object O Z
Orientation y
z yG
Y
Column C
End of robot arm Figure 12. Schematic diagram of a vibratory sensor mounted on a robot wrist
and yG only affect the reference level of T and, if xG and yG are employed to define that level, the trend in the relationship between T and is the same regardless of the position of the object. Thus, a simplified fuzzy model of the sensor consists of rules such as “IF T-Tref is small THEN -ref is small” where Tref is the value of T when the object is at position xG yG and orientation ref . ref could be chosen as 80 , the point at which the fuzzy model is to replace the mathematical model. Tref could be either measured experimentally or computed from the mathematical model. To counteract the effects of the poor repeatability of period measurements which are particularly noticeable in the “ill-conditioning” region, the fuzzy rules are modified so that they take into account the variance in T. An example of a modified fuzzy rule is: “IF T-Tref is small and T is small, THEN − ref is small” In the above rule, T denotes the standard deviation in the measurement of T. Fuzzy modelling of the vibratory sensor is detailed in Pham and Hafeez (1992). Using a fuzzy model, the orientation can be determined to ±2 accuracy in the region 80 -90 . The adoption of fuzzy logic in this application has produced a compact and transparent model from a large amount of noisy experimental data. 6.3
6.3 Induction of Feature Recognition Rules in a Geometric Reasoning System for Analysing 3D Assembly Models
Pham et al. (1999) have described a concurrent engineering approach involving generating assembly strategies for a product directly from its 3D CAD model.
A feature-based CAD system is used to create assembly models of products. A geometric reasoning module extracts assembly-oriented data for a product from the CAD system after creating a virtual assembly tree that identifies the components and sub-assemblies making up the given product (Figure 13a). The assembly information extracted by the module includes: placement constraints and dimensions used to specify the relevant position of a given component or sub-assembly; geometric entities (edges, surfaces, etc.) used to constrain the component or sub-assembly; and the parents and children of each entity employed as a placement constraint. An example of the information extracted is shown in Figure 13b.
Feature recognition is applied to the extracted information to identify each feature used to constrain a component or sub-assembly. The rule-based feature recognition process has three possible outcomes:
1. The feature is recognised as belonging to a unique class.
2. The feature shares attributes with more than one class (see Figure 13c).
3. The feature does not belong to any known class.
Cases 2 and 3 require the user to decide the correct class of the feature and the rule base to be updated. The updating is implemented via a rule induction program. The program employs RULES-3 Plus, which automatically extracts new feature recognition rules from examples provided to it in the form of characteristic vectors representing different features and their respective class labels. Rule induction is very suitable for this application because of the complexity of the characteristic vectors and the difficulty of defining feature classes manually.
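The three outcomes listed above can be illustrated with a minimal Python sketch. It is not RULES-3 Plus and does not reproduce the rule format of the system described; the attribute names (shape, through, bottom), the rule contents and the class label HOLE are hypothetical, and only the labels BSL_1 and BSL_2 are borrowed from Figure 13c.

```python
# Each rule pairs a set of required attribute values with a class label.
# All rule contents below are hypothetical and serve only as an example.
RULES = [
    ({"shape": "slot", "through": False}, "BSL_1"),
    ({"shape": "slot", "bottom": "round"}, "BSL_2"),
    ({"shape": "hole"}, "HOLE"),
]


def recognise(feature, rules):
    """Return (outcome, classes) for a feature given as an attribute dict.

    outcome is 'unique' when exactly one class matches, 'ambiguous' when
    several classes match, and 'unknown' when no rule matches.
    """
    matched = sorted({label for conditions, label in rules
                      if all(feature.get(key) == value
                             for key, value in conditions.items())})
    if len(matched) == 1:
        return "unique", matched
    if matched:
        return "ambiguous", matched
    return "unknown", matched
```

In such a scheme, an "ambiguous" or "unknown" result would be the point at which the user supplies the correct class and the labelled example is passed to the rule induction program.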
Figure 13a. An assembly model
Bolt:
• Child of Block
• Placement constraints:
  1: alignment of two axes
  2: mating of the bottom surface of the bolt head and the upper surface of the block
• No child part in the assembly hierarchy
Block:
• No parents
• No constraints (root component)
• Next part in the assembly: Bolt
Figure 13b. An example of assembly information
Figure 13c. An example of feature recognition (a new form feature and the similar feature classes detected for it: Partial Round Nonthrough Slot (BSL_2) and Rectangular Nonthrough Slot (BSL_1))
6.4 Neural-network-based Automotive Product Inspection
Figure 14 depicts an intelligent inspection system for engine valve stem seals [Pham and Oztemel, 1996]. The system comprises four CCD cameras connected to a computer that implements neural-network-based algorithms for detecting and classifying defects in the seal lips.
Figure 14. Valve stem seal inspection system (labelled components: host PC, Ethernet link, vision system, 4 CCD cameras of 512 x 512 resolution, databus, lighting ring, material handling and lighting controller, bowl feeder, indexing machine, seal, good chute, rework, reject)
Faults on the lip aperture are classified by a multilayer perceptron. The input to the network is a 20-component vector, where the value of each component is the number of times a particular geometric feature is found on the aperture being inspected. The outputs of the network indicate the type of defect on the seal lip aperture. A similar neural network is used to classify defects on the seal lip surface. The accuracy of defect classification in both perimeter and surface inspection is in excess of 80%. Note that this figure is not the same as that for the accuracy in detecting defective seals, that is, differentiating between good and defective seals. The latter task is also implemented using a neural network, which achieves an accuracy of almost 100%. Neural networks are necessary for this application because of the difficulty of describing precisely the various types of defects and the differences between good and defective seals. The neural networks are able to learn the classification task automatically from examples.
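The input encoding described above, a 20-component vector of feature counts fed to a multilayer perceptron, can be sketched as follows. This is a minimal illustration under stated assumptions: the integer feature codes, the single hidden layer, the sigmoid activation and all weights are hypothetical, and nothing here reproduces the trained networks of Pham and Oztemel (1996).

```python
import math
from collections import Counter

N_FEATURES = 20  # length of the input vector stated in the text


def count_vector(observed_codes, n=N_FEATURES):
    """Count how often each geometric feature code (0 .. n-1) was observed.

    Codes outside that range are ignored.
    """
    counts = Counter(observed_codes)
    return [counts.get(code, 0) for code in range(n)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs, weights, biases):
    """One fully connected layer with a sigmoid activation."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def mlp_forward(x, w_hidden, b_hidden, w_out, b_out):
    """Forward pass of a perceptron with one hidden layer (assumed topology).

    Returns one activation per defect class.
    """
    return layer(layer(x, w_hidden, b_hidden), w_out, b_out)
```

In use, the index of the largest output would be read as the predicted defect type; the weights would come from training on labelled examples rather than being set by hand.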
6.5 GA-based Conceptual Design
TRADES is a system using GA techniques to produce conceptual designs of transmission units [Pham and Yang, 1993]. The system has a set of basic building blocks, such as gear pairs, belt drives and mechanical linkages, and generates conceptual designs to satisfy given specifications by assembling the building blocks into different configurations. The crossover, mutation and inversion operators of the GA are employed to create new configurations from an existing population of configurations. Configurations are evaluated for their compliance with the design specifications. Potential solutions should provide the required speed reduction ratio and motion transformation while not containing incompatible building blocks or exceeding specified limits on the number of building blocks to be adopted. A fitness function codifies the degree of compliance of each configuration. The maximum fitness value is assigned to configurations that satisfy all functional requirements without violating any constraints. As in a standard GA, information concerning the fitness of solutions is employed to select solutions for reproduction, thus guiding the process towards increasingly fitter designs as the population evolves.
In addition to the usual GA operators, TRADES incorporates new operators to avert premature convergence to non-optimal solutions and facilitate the generation of a variety of design concepts. Essentially, these operators reduce the chances of any one configuration or family of configurations dominating the solution population by avoiding crowding around very fit configurations and preventing multiple copies of a configuration, particularly after it has been identified as a potential solution.
TRADES is able to produce design concepts from building blocks without requiring much additional a priori knowledge. The manipulation of the building blocks to generate new concepts is carried out by the GA in a stochastic but guided manner. This enables good conceptual designs to be found without the need to search the design space exhaustively. Due to the very large size of the design space and the quasi-random operation of the GA, novel solutions not immediately evident to a human designer are sometimes generated by TRADES. On the other hand, impractical configurations could also arise. TRADES incorporates a number of heuristics to filter out such design proposals.
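The kind of fitness evaluation and operators described above can be sketched in Python. This is an illustrative toy, not the TRADES implementation: the building-block ratios, the incompatible pair, the form of the fitness function and the one-point crossover are all assumptions made for the example, and the inversion and anti-crowding operators are omitted.

```python
import random

# Hypothetical building blocks with assumed speed-reduction ratios.
RATIOS = {"gear_pair": 4.0, "belt_drive": 2.0, "worm_gear": 20.0}
# Hypothetical pair of blocks that may not be adjacent.
INCOMPATIBLE = {("belt_drive", "worm_gear")}


def overall_ratio(config):
    """Product of the ratios of the blocks in a configuration."""
    ratio = 1.0
    for block in config:
        ratio *= RATIOS[block]
    return ratio


def fitness(config, target_ratio, max_blocks):
    """Degree of compliance in [0, 1]; 1.0 means all requirements are met.

    Constraint violations (empty or oversized configuration,
    incompatible neighbours) give zero fitness.
    """
    if not config or len(config) > max_blocks:
        return 0.0
    if any(pair in INCOMPATIBLE for pair in zip(config, config[1:])):
        return 0.0
    error = abs(overall_ratio(config) - target_ratio) / target_ratio
    return 1.0 / (1.0 + error)


def crossover(parent_a, parent_b, rng):
    """One-point crossover: head of parent_a joined to tail of parent_b."""
    cut_a = rng.randint(0, len(parent_a))
    cut_b = rng.randint(0, len(parent_b))
    return parent_a[:cut_a] + parent_b[cut_b:]


def mutate(config, rng):
    """Replace one randomly chosen block with a random building block."""
    child = list(config)
    if child:
        child[rng.randrange(len(child))] = rng.choice(sorted(RATIOS))
    return child
```

Passing an explicit random.Random instance as rng keeps runs repeatable, which helps when comparing operator settings.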
7. CONCLUSION
Over the past fifty years, the field of soft computing has produced a number of powerful tools. This chapter has reviewed five of those tools, namely, knowledge-based systems, fuzzy logic, inductive learning, neural networks and genetic algorithms. Applications of the tools in engineering and manufacture have become more widespread due to the power and affordability of present-day computers. It is anticipated that many new applications will emerge and that, for demanding tasks, greater use will be made of hybrid tools combining the strengths of two or more of the tools reviewed here [Michalski and Tecuci, 1994; Medsker, 1995]. Other technological developments in soft computing that will have an impact in engineering include data mining, or the extraction of information and knowledge from large databases [Limb and Meggs, 1994; Witten and Frank, 2000; Braha, 2001; Han and Kamber, 2001; Pham and Afify, 2002; Klösgen and Żytkow, 2002; Giudici, 2003], and multi-agent systems, or distributed self-organising systems employing entities that function autonomously in an unpredictable environment concurrently with other entities and processes [Wooldridge and Jennings, 1994; Rzevski, 1995; Márkus et al., 1996; Tharumarajah et al., 1996; Bento and Feijó, 1997; Monostori, 2002]. The appropriate deployment of these new soft computing tools and of the tools presented in this chapter will contribute to the creation of more competitive engineering systems.
8. ACKNOWLEDGEMENTS
This work was carried out within the ALFA project “Novel Intelligent Automation and Control Systems II” (NIACS II), the ERDF (Objective One) projects “Innovation in Manufacturing Centre” (IMC), “Innovative Technologies for Effective Enterprises” (ITEE) and “Supporting Innovative Product Engineering and Responsive Manufacturing” (SUPERMAN) and within the project “Innovative Production Machines and Systems” (I*PROMS).
REFERENCES
Akyol D E, (2004), “Application of neural networks to heuristic scheduling algorithms”, Computers Ind. Engng, 46, 679–696.
Ashiru I, Czanecki C and Routen T, (1995), “Intelligent operators and optimal genetic-based path planning for mobile robots”, Proc. Int. Conf. on Recent Advances in Mechatronics, Istanbul, Turkey, 1018–1023.
Badiru A B and Cheung J Y, (2002), Fuzzy Engineering Expert Systems with Neural Network Applications, John Wiley & Sons, New York.
Baker J E, (1985), “Adaptive selection methods for genetic algorithms”, Proc. 1st Int. Conf. on Genetic Algorithms and Their Applications, Pittsburgh, PA, 101–111.
Baldwin J F and Karale S B, (2003), “New concepts for fuzzy partitioning, defuzzification and derivation of probabilistic fuzzy decision trees”, Proc. 22nd Int. Conf. of the North American Fuzzy Information Processing Society (NAFIPS-03), Chicago, Illinois, USA, 484–487.
Baldwin J F and Martin T P, (2001), “Towards inductive support logic programming”, Proc. Joint 9th IFSA World Congress and 20th NAFIPS Int. Conf., Vancouver, Canada, 4, 1875–1880.
Bas K and Erkmen A M, (1995), “Fuzzy preshape and reshape control of Anthrobot-III 5-fingered robot hand”, Proc. Int. Conf. on Recent Advances in Mechatronics, Istanbul, Turkey, 673–677.
Bento J and Feijó B, (1997), “An agent-based paradigm for building intelligent CAD systems”, Artificial Intelligence in Engineering, 11 (3), 231–244.
Beskese A, Kahraman C and Irani Z, (2004), “Quantification of flexibility in advanced manufacturing systems using fuzzy concepts”, Int. J. Production Economics, 89 (1), 45–56.
Bigand A, Goureau P and Kalemkarian J, (1994), “Fuzzy control of a welding process”, Proc. IMACS Int. Symp. on Signal Processing, Robotics and Neural Networks (SPRANN 94), Villeneuve d’Ascq, France, 379–342.
Blickle T and Thiele L, (1995), “A comparison of selection schemes used in genetic algorithms”, Computer Engineering and Communication Networks Lab (TIK)-Report, No. 11, Version 1.1b, Swiss Federal Institute of Technology (ETH), Zurich, Switzerland.
Bose A, Gini M and Riley D, (1997), “A case-based approach to planar linkage design”, Artificial Intelligence in Engineering, 11 (2), 107–119.
Bozdağ C E, Kahraman C and Ruan D, (2003), “Fuzzy group decision making for selection among computer integrated manufacturing systems”, Computers in Industry, 15 (1), 13–29.
Braha D, (2001), Data Mining for Design and Manufacturing: Methods and Applications, Kluwer Academic Publishers, Boston.
Breiman L, Friedman J H, Olshen R A and Stone C J, (1984), Classification and Regression Trees, Belmont, Wadsworth.
Cervone G, Panait L A and Michalski R S, (2001), “The development of the AQ20 learning system and initial experiments”, Proc. 10th Inter. Symposium on Intelligent Information Systems, Poland.
Chen J C and Black J T, (1997), “A fuzzy-nets in-process (FNIP) system for tool-breakage monitoring in end-milling operations”, Int. J. Machine Tools Manufacturing, 37 (6), 783–800.
Cho B J, Hong S C and Okoma S, (1996), “Job shop scheduling using genetic algorithm”, Proc. 3rd World Congress on Expert Systems, Seoul, Korea, 351–358.
Chryssolouris G and Subramaniam V, (2001), “Dynamic scheduling of manufacturing job shops using genetic algorithms”, J. Intelligent Manufacturing, 12, 281–293.
Costa Branco P J and Dente J A, (1998), “An experiment in automatic modelling an electrical drive system using fuzzy logic”, IEEE Trans on Systems, Man, and Cybernetics, 28 (2), 254–262.
Da Rocha Fernandes A M and Cid Bastos R, (1996), “Fuzzy expert systems for qualitative analysis of minerals”, Proc. 3rd World Congress on Expert Systems, Seoul, Korea, February, 673–680.
Darlington K W, (1999), The Essence of Expert Systems, Prentice Hall.
Davis L, (1991), Handbook of Genetic Algorithms, Van Nostrand, New York, NY.
De La Sen M, Miñambres J J, Garrido A J, Almansa A and Soto J C, (2004), “Basic theoretical results for expert systems: Application to the supervision of adaptation transients in planar robots”, Artificial Intelligence, 152 (2), 173–211.
Disney S M, Naim M M and Towill D R, (2000), “Genetic algorithm optimisation of a class of inventory control systems”, Inter. J. Production Economics, 68, 259–278.
Drake P R and Choudhry I A, (1997), “From apes to schedules”, Manufacturing Engineer, 76 (1), 43–45.
Dubois D and Prade H, (1998), “An introduction to fuzzy systems”, Clinica Chimica Acta, 270 (1), 3–29.
Duch W, Setiono R and Zurada J M, (2004), “Computational intelligence methods for rule-based data understanding”, Proc. IEEE, 92 (5), 771–805.
Durkin J, (1994), Expert Systems Design and Development, Macmillan, New York.
Evans B and Fisher D, (2002), “Using decision tree induction to minimize process delays in printing industry”, In: Handbook of Data Mining and Knowledge Discovery (W. Klösgen and J.M. Żytkow (Eds.)), Oxford University Press.
Kong F, Yu J and Zhou X, (1999), “Analysis of fuzzy dynamic characteristics of machine cutting process: Fuzzy stability analysis in regenerative-type-chatter”, Int. J. Machine Tools and Manufacture, 39 (8), 1299–1309.
Ferreiro Garcia R, (1994), “FAM rule as basis for poles shifting applied to the roll control of an aircraft”, SPRANN 94 (ibid), 375–378.
Fogarty T C, (1989), “Varying the probability of mutation in the genetic algorithm”, Proc. Third Int. Conf. on Genetic Algorithms and Their Applications, George Mason University, 104–109.
Freitas A A, (2002), Data mining and knowledge discovery with evolutionary algorithms, Springer-Verlag, Berlin, New York.
Giarratano J C and Riley G D, (1998), Expert Systems: Principles and Programming, 3rd Edition, PWS Publishing Company, Boston, MA.
Giudici P, (2003), Applied Data Mining: Statistical Methods for Business and Industry, John Wiley & Sons, England.
Goldberg D E, (1989), Genetic Algorithms in Search, Optimisation and Machine Learning, Addison Wesley, Reading, MA.
Grefenstette J J, (1986), “Optimization of control parameters for genetic algorithms”, IEEE Trans on Systems, Man and Cybernetics, 16 (1), 122–128.
Han J and Kamber M, (2001), Data Mining: Concepts and Techniques, Academic Press, USA.
Hassoun M H, (1995), Fundamentals of Artificial Neural Networks, MIT Press, Cambridge, MA.
Holland J H, (1975), Adaptation in Natural and Artificial Systems, The University of Michigan Press, Ann Arbor, MI.
Hong T P and Chen J B, (2000), “Processing individual fuzzy attributes for fuzzy rule induction”, Fuzzy Sets and Systems, 112 (1), 127–140.
Hui P C L, Chan K C K and Yeung K W, (1997), “Modelling job complexity in garment manufacturing by inductive learning”, Inter. J. Clothing Science and Technology, 9 (1), 34–44.
Ip C Y, Regli W C, Sieger L and Shokoufandeh A, (2003), “Automated learning of model classification”, Proc. 8th ACM Symposium on Solid Modeling and Applications, Seattle, Washington, USA, ACM Press, 322–327.
ISL, (1998), Clementine Data Mining Package. SPSS UK Ltd., 1st Floor, St. Andrew’s House, West Street, Woking, Surrey GU21 1EB, United Kingdom.
Jackson P, (1999), Introduction to Expert Systems, 3rd Edition, Addison-Wesley, Harlow, Essex.
Jambunathan K, Fontama V N, Hartle S L and Ashforth-Frost S, (1997), “Using ART 2 networks to deduce flow velocities”, Artificial Intelligence in Engineering, 11 (2), 135–141.
Janikow C Z, (1998), “Fuzzy decision trees: Issues and methods”, IEEE Trans on Systems, Man, and Cybernetics, 28 (1), 1–14.
Jawahar N, Aravindan P, Ponnambalam S G and Karthikeyan A A, (1998), “A genetic algorithm-based scheduler for setup-constrained FMC”, Computers in Industry, 35, 291–310.
Jiang Y, Zhou Z-H and Chen Z-Q, (2002), “Rule learning based on neural network ensemble”, Proc. Inter. Joint Conf. on Neural Networks, Honolulu, HI, 1416–1420.
Kalogirou S A, (2003), “Artificial Intelligence for modelling and control of combustion processes: A review”, Progress in Energy and Combustion Science, 29 (6), 515–566.
Kamrani A K, Shashikumar S and Patel S, (1995), “An intelligent knowledge-based system for robotic cell design”, Computers Ind. Engng, 29 (1–4), 141–145.
Karsak E E, (2004), “Fuzzy multiple objective programming framework to prioritize design requirements in quality function deployment”, Computers Ind. Engng, (Submitted and accepted).
Karsak E E and Kuzgunkaya O, (2002), “A fuzzy multiple objective programming approach for the selection of a flexible manufacturing system”, Int. J. Production Economics, 79 (2), 101–111.
Kaufmann A, (1975), Introduction to the Theory of Fuzzy Subsets, Vol.1, Academic Press, New York.
Kim C-O, Min Y-D and Yih Y, (1998), “Integration of inductive learning and neural networks for multi-objective FMS scheduling”, Inter. J. Production Research, 36 (9), 2497–2509.
Klir G J and Yuan B, (1995), Fuzzy Sets and Fuzzy Logic: Theory and Applications, Prentice Hall, Upper Saddle River, NJ.
Klir G J and Yuan B, (Eds.), (1996), Fuzzy Sets, Fuzzy Logic, and Fuzzy Systems – selected papers by L A Zadeh, World Scientific, Singapore.
Klösgen W and Żytkow J M, (2002), Handbook of Data Mining and Knowledge Discovery, Oxford University Press, New York.
Koo D Y and Han S H, (1996), “Application of the configuration design methods to a design expert system for paper feeding mechanism”, Proc. 3rd World Congress on Expert Systems, Seoul, Korea, February, 49–56.
Kostov A, Andrews B, Stein R B, Popovic D and Armstrong W W, (1995), “Machine learning in control of functional electrical stimulation systems for locomotion”, IEEE Trans on Biomedical Engineering, 44 (6), 541–551.
Kulak O and Kahraman C, (2004), “Multi-attribute comparison of advanced manufacturing systems using fuzzy vs. crisp axiomatic design approach”, Int. J. Production Economics, (Submitted and accepted).
Lara Rosano F, Kemper Valverde N, De La Paz Alva C and Alcántara Zavala J, (1996), “Tutorial expert system for the design of energy cogeneration plants”, Proc. 3rd World Congress on Expert Systems, Seoul, Korea, February, 300–305.
Lavrac N and Dzeroski S, (1994), Inductive Logic Programming: Techniques and Applications, Ellis Horwood, New York.
Lee C-Y, Piramuthu S and Tsai Y-K, (1997), “Job shop scheduling with a genetic algorithm and machine learning”, Inter. J. Production Research, 35 (4), 1171–1191.
Li J R, Khoo L P and Tor S B, (2003), “A Tabu-enhanced genetic algorithm approach for assembly process planning”, J. Intelligent Manufacturing, 14, 197–208.
Limb P R and Meggs G J, (1994), “Data mining tools and techniques”, British Telecom Technology Journal, 12 (4), 32–41.
Lin Z-C and Chang D-Y, (1996), “Application of a neural network machine learning model in the selection system of sheet metal bending tooling”, Artificial Intelligence in Engineering, 10, 21–37.
Lou H H and Huang Y L, (2003), “Hierarchical decision making for proactive quality control: System development for defect reduction in automotive coating operations”, Engineering Applications of Artificial Intelligence, 16, 237–250.
Luzeaux D, (1994), “Process control and machine learning: rule-based incremental control”, IEEE Trans on Automatic Control, 39 (6), 1166–1171.
Mahfoud S W, (1995), Niching Methods for Genetic Algorithms, Ph.D. Thesis, Department of General Engineering, University of Illinois at Urbana-Champaign.
Majors M D and Richards R J, (1995), “A topologically-evolving neural network for robotic flexible assembly control”, Proc. Int. Conf. on Recent Advances in Mechatronics, Istanbul, Turkey, August, 894–899.
Markham I S, Mathieu R G and Wray B A, (2000), “Kanban setting through artificial intelligence: A comparative study of artificial neural networks and decision trees”, Integrated Manufacturing Systems, 11 (4), 239–246.
Márkus A, Kis T, Váncza J and Monostori L, (1996), “A market approach to holonic manufacturing”, CIRP Annals, 45 (1), 433–436.
Mathieu R G, Wray B A and Markham I S, (2002), “An approach to learning from both good and poor factory performance in a kanban-based just-in-time production system”, Production Planning & Control, 13 (8), 715–724.
Medsker L R, (1995), Hybrid Intelligent Systems, Kluwer Academic Publishers, Boston, 298 pp.
Michalewicz Z, (1996), Genetic Algorithms + Data Structures = Evolution Programs, 3rd Edition, Springer-Verlag, Berlin.
Michalski R S, (1990), “A theory and methodology of inductive learning”, in Readings in Machine Learning, Eds. Shavlik J W and Dietterich T G, Kaufmann, San Mateo, CA, 70–95.
Michalski R S and Kaufman K A, (2001), “The AQ19 system for machine learning and pattern discovery: A general description and user guide”, Reports of the Machine Learning and Inference Laboratory, MLI 01-2, George Mason University, Fairfax, VA, USA.
Michalski R S and Larson J B, (1983), “Incremental generation of VL1 hypotheses: The underlying methodology and the descriptions of program AQ11”, ISG 83–5, Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois.
Michalski R S, Mozetic I, Hong J and Lavrac N, (1986), “The multi-purpose incremental learning system AQ15 and its testing application to three medical domains”, American Association of Artificial Intelligence, Los Altos, CA, Morgan Kaufmann, 1041–1045.
Michalski R and Tecuci G, (1994), Machine Learning: A Multistrategy Approach, 4, Morgan Kaufmann Publishers, San Francisco, CA, USA.
Michie D, Spiegelhalter D J and Taylor C C, (1994), Machine Learning, Neural and Statistical Classification, Ellis Horwood, New York.
Mitchell M, (1996), An Introduction to Genetic Algorithms, MIT Press.
Monostori L, (2002), “AI and machine learning techniques for managing complexity, changes and uncertainties in manufacturing”, Proc. 15th Triennial World Congress, Barcelona, Spain, 119–130.
Muggleton S (ed), (1992), Inductive Logic Programming, Academic Press, London, 565 pp.
Muggleton S and Feng C, (1990), “Efficient induction of logic programs”, Proc. 1st Conf. on Algorithmic Learning Theory, Tokyo, Japan, 368–381.
Nearchou A C and Aspragathos N A, (1997), “A genetic path planning algorithm for redundant articulated robots”, Robotica, 15 (2), 213–224.
Nurminen J K, Karonen O and Hatonen K, (2003), “What makes expert systems survive over 10-years – empirical evaluation of several engineering applications”, Expert Systems with Applications, 24 (2), 199–211.
Ong S K, De Vin L J, Nee A Y C and Kals H J J, (1997), “Fuzzy set theory applied to bend sequencing for sheet metal bending”, Int. J. Materials Processing Technology, 69, 29–36.
Öztürk N and Öztürk F, (2004), “Hybrid neural network and genetic algorithm based machining feature recognition”, J. Intelligent Manufacturing, 15, 278–298.
Park M-W, Rho H-M and Park B-T, (1996), “Generation of modified cutting conditions using neural networks for an operation planning system”, Annals of the CIRP, 45 (1), 475–478.
Peers S M C, Tang M X and Dharmavasan S, (1994), “A knowledge-based scheduling system for offshore structure inspection”, Artificial Intelligence in Engineering IX (AIEng 9), Eds. Rzevski G, Adey R A and Russell D W, Computational Mechanics, Southampton, 181–188.
Peng Y, (2004), “Intelligent condition monitoring using fuzzy inductive learning”, J. Intelligent Manufacturing, 15 (3), 373–380.
Pérez E, Herrera F and Hernández C, (2003), “Finding multiple solutions in job shop scheduling by niching genetic algorithms”, J. Intelligent Manufacturing, 14, 323–339.
Pham D T and Afify A A, (2002), “Machine learning: Techniques and trends”, Proc. 9th Inter. Workshop on Systems, Signals and Image Processing (IWSSIP – 02), Manchester Town Hall, UK, World Scientific, 12–36.
Pham D T and Afify A A, (2005a), “RULES-6: A simple rule induction algorithm for handling large data sets”, Proc. of the Institution of Mechanical Engineers, Part C, 219 (10), 1119–1137.
Pham D T and Afify A A, (2005b), “Machine learning techniques and their applications in manufacturing”, Proc. of the Institution of Mechanical Engineers, Part B, 219 (5), 395–412.
Pham D T, Afify A A and Dimov S S, (2002), “Machine learning in manufacturing”, Proc. 3rd CIRP Inter. Seminar on Intelligent Computation in Manufacturing Engineering – (ICME 2002), Ischia, Italy, III–XII.
Pham D T and Aksoy M S, (1994), “An algorithm for automatic rule induction”, Artificial Intelligence in Engineering, 8, 277–282.
Pham D T and Aksoy M S, (1995a), “RULES: A rule extraction system”, Expert Systems with Applications, 8, 59–65.
Pham D T and Aksoy M S, (1995b), “A new algorithm for inductive learning”, Journal of Systems Engineering, 5, 115–122.
Pham D T, Bigot S and Dimov S S, (2003), “RULES-5: A rule induction algorithm for problems involving continuous attributes”, Proc. of the Institution of Mechanical Engineers, 217 (Part C), 1273–1286.
Pham D T and Dimov S S, (1997), “An efficient algorithm for automatic knowledge acquisition”, Pattern Recognition, 30 (7), 1137–1143.
Pham D T, Dimov S S and Salem Z, (2000), “Technique for selecting examples in inductive learning”, ESIT 2000 European Symposium on Intelligent Techniques, Erudit Aachen Germany, 119–127.
Pham D T, Dimov S S and Setchi R M, (1999), “Concurrent engineering: a tool for collaborative working”, Human Systems Management, 18, 213–224.
Pham D T and Hafeez K, (1992), “Fuzzy qualitative model of a robot sensor for locating three-dimensional objects”, Robotica, 10, 555–562.
Pham D T and Karaboga D, (1994), “Some variable mutation rate strategies for genetic algorithms”, SPRANN 94 (ibid), 73–96.
Pham D T and Karaboga D, (2000), Intelligent Optimisation Techniques: Genetic Algorithms, Tabu Search, Simulated Annealing and Neural Networks, Springer-Verlag, London, Berlin and Heidelberg, 2nd printing, 302 pp.
Pham D T and Liu X, (1999), Neural Networks for Identification, Prediction and Control, Springer-Verlag, London, Berlin and Heidelberg, 4th printing, 238 pp.
Pham D T and Oztemel E, (1996), Intelligent Quality Systems, Springer-Verlag, London, Berlin and Heidelberg, 201 pp.
Pham D T, Packianather M S, Dimov S, Soroka A J, Girard T, Bigot S and Salem Z, (2004), “An application of data mining and machine learning techniques in the metal industry”, Proc. 4th CIRP Inter. Seminar on Intelligent Computation in Manufacturing Engineering (ICME-04), Sorrento (Naples), Italy.
Pham D T and Pham P T N, (1988), “Expert systems in mechanical and manufacturing engineering”, Int. J. Adv. Manufacturing Technology, Special Issue on Knowledge Based Systems, 3 (3), 3–21.
Pham D T and Yang Y, (1993), “A genetic algorithm based preliminary design system”, Proc. IMechE, Part D: J. Automobile Engineering, 207, 127–133.
Price C J, (1990), Knowledge Engineering Toolkits, Ellis Horwood, Chichester.
Priore P, Fuente D, Pino R and Puente J, (2003), “Dynamic scheduling of manufacturing systems using neural networks and inductive learning”, Integrated Manufacturing Systems, 14 (2), 160–168.
Quinlan J R, (1983), “Learning efficient classification procedures and their application to chess endgames”, In: Machine Learning: An Artificial Intelligence Approach (Michalski R S, Carbonell J G and Mitchell T M (Eds.)), I, Tioga Publishing Co., 463–482.
Quinlan J R, (1986), “Induction of decision trees”, Machine Learning, 1, 81–106.
Quinlan J R, (1990), “Learning logical definitions from relations”, Machine Learning, 5, 239–266.
Quinlan J R, (1993), C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, CA.
Quinlan J R and Cameron-Jones R M, (1995), “Induction of logic programs: FOIL and related systems”, New Generation Computing, 13, 287–312.
Ross T J, (1995), Fuzzy Logic with Engineering Applications, McGraw-Hill, New York.
RuleQuest, (2001), Data Mining Tools C5.0, Pty Ltd, 30 Athena Avenue, St Ives NSW 2075, Australia. Available from: http://www.rulequest.com/see5-info.html.
Rzevski G, (1995), “Artificial intelligence in engineering: past, present and future”, Artificial Intelligence in Engineering X, Eds Rzevski G, Adey R A and Tasso C, Computational Mechanics, Southampton, 3–16.
Schaffer J D, Caruana R A, Eshelman L J and Das R, (1989), “A study of control parameters affecting on-line performance of genetic algorithms for function optimisation”, Proc. Third Int. Conf. on Genetic Algorithms and Their Applications, George Mason University, 51–61.
Schultz G, Fichtner D, Nestler A and Hoffmann J, (1997), “An intelligent tool for determination of cutting values based on neural networks”, Proc. 2nd World Congress on Intelligent Manufacturing Processes and Systems, Budapest, Hungary, 66–71.
Seals R C and Whapshott G F, (1994), “Design of HDL programmes for digital systems using genetic algorithms”, AI Eng 9 (ibid), 331–338.
Shi Z Z, Zhou H and Wang J, (1997), “Applying case-based reasoning to engine oil design”, Artificial Intelligence in Engineering, 11 (2), 167–172.
Shigaki I and Narazaki H, (1999), “A machine-learning approach for a sintering process using a neural network”, Production Planning & Control, 10 (8), 727–734.
Shin C K and Park S C, (2000), “A machine learning approach to yield management in semiconductor manufacturing”, Inter. J. Production Research, 38 (17), 4261–4271.
Skibniewski M, Arciszewski T and Lueprasert K, (1997), “Constructability analysis: machine learning approach”, ASCE J of Computing in Civil Engineering, 12 (1), 8–16.
Smith J E and Fogarty T C, (1997), “Operator and parameter adaptation in genetic algorithms”, Soft Computing, 1 (2), 81–87.
Smith P, MacIntyre J and Husein S, (1996), “The application of neural networks in the power industry”, Proc. 3rd World Congress on Expert Systems, Seoul, Korea, February, 321–326.
Sohen S Y and Choi I S, (2001), “Fuzzy QFD for supply chain management with reliability consideration”, Reliability Eng. and Systems Safety, 72, 327–334.
Streichfuss M and Burgwinkel P, (1995), “An expert-system-based machine monitoring and maintenance management system”, Control Eng. Practice, 3 (7), 1023–1027.
Szwarc D, Rajamani D and Bector C R, (1997), “Cell formation considering fuzzy demand and machine capacity”, Int. J. Advanced Manufacturing Technology, 13 (2), 134–147.
Tarng Y S, Tseng C M and Chung L K, (1997), “A fuzzy pulse discriminating systems for electrical discharge machining”, Int. J. Machine Tools and Manufacture, 37 (4), 511–522.
Teti R and Caprino G, (1994), “Prediction of composite laminate residual strength based on a neural network approach”, AI Eng 9 (ibid), 81–88.
Tharumarajah A, Wells A J and Nemes L, (1996), “Comparison of the bionic, fractal and holonic manufacturing system concepts”, Int. J. Computer Integrated Manufacturing, 9 (3), 217–226.
Vanegas L V and Labib A W, (2001), “A fuzzy quality function deployment (FQFD) model for deriving optimum targets”, Int. J. Production Research, 39 (1), 99–120.
Venkatachalam A R, (1994), “Automating manufacturability evaluation in CAD systems through expert systems approaches”, Expert Systems with Applications, 7 (4), 495–506.
Wang L X and Mendel M, (1992), “Generating fuzzy rules by learning from examples”, IEEE Trans on Systems, Man and Cybernetics, 22 (6), 1414–1427.
Wang W P, Peng Y H and Li X Y, (2002), “Fuzzy-grey prediction of cutting force uncertainty in turning”, J Materials Processing Technology, 129, 663–666.
Wang C-H, Tsai C-J, Hong T-P and Tseng S-S, (2003), “Fuzzy Inductive Learning Strategies”, Applied Intelligence, 18 (2), 179–193.
38
CHAPTER 1
Wang X Z, Wang Y D, Xu X F, Ling W D and Yeung D S, (2001), “A new approach to fuzzy rule generation: Fuzzy extension matrix”, Fuzzy Sets and Systems, 123 (3), 291–306. Whitely D, (1989), “The GENITOR algorithm and selection pressure: why rank-based allocation of reproductive trials is best”, Proc. Third Int. Conf. on Genetic Algorithms and Their Applications, George Mason University, 116–123. Wilde P and Shellwat H, (1997), “Implementation of a genetic algorithm for routing an autonomous robot”, Robotica, 15 (2), 207–211. Witten I H and Frank E, (2000), Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann Publishers, USA. Wooldridge M J and Jennings N R, (1994), “Agent theories, architectures and languages : a survey”, Proc. ECAI 94 Workshop on Agent Theories, Architectures and Languages, Amsterdam, 1–32. Wray B A, Rakes T R and Rees L, (1997), “Neural network identification of critical factors in dynamic just-in-time kanban environment”, J. Intelligent Manufacturing, 8, 83–96. Wu X, Chu C-H, Wang Y and Yan W, (2002), “A genetic algorithm for integrated cell formation and layout decisions”, Proc. of the 2002 Congress on Evolutionary Computation (CEC-02), 2, 1866–1871. Yano H, Akashi T, Matsuoka N, Nakanishi K, Takata O and Horinouchi N, (1997), “An expert system to assist automatic remeshing in rigid plastic analysis”, Toyota Technical Review, 46 (2), 87–92. Yao X, (1999), “Evolving artificial neural networks”, Proceedings of the IEEE, 87 (9), 1423–1447. Zadeh L A, (1965), “Fuzzy Sets”, Information Control, 8, 338–353. Zha X F, Lim S Y E and Fok S C, (1998), “Integrated knowledge-based assembly sequence planning”, Int. J. Adv. Manufacturing Technology, 12 (3), 211–237. Zha X F, Lim S Y E and Fok S C, (1999), “Integrated knowledge-based approach and system for product design and assembly”, Int. J. Computer Integrated Manufacturing, 14, 50–64. 
Zhao Z Y and De Souza R, (1998), “On improving the performance of hard disk drive final assembly via knowledge intensive simulation”, J. Electronics Manufacturing, 1, 23–25. Zhao Z Y and De Souza R, (2001), “Fuzzy rule learning during simulation of manufacturing resources”, Fuzzy Sets and Systems, 122, 469–485. Zhou C, Nelson P C, Xiao W, Tirpak T M and Lane S A, (2001), “An intelligent data mining system for drop test analysis of electronic products”, IEEE Trans on Electronics Packaging Manufacturing, 24 (3), 222–231. Zimmermann H-J, (1996), Fuzzy Set Theory and its Applications, 3nd Edition, Kluwer Academic Publishers, Boston. Zülal G and Arikan F, (2000), “Application of fuzzy decision making in part-machine grouping”, Int. J. Production Economics, 63, 181–193.
CHAPTER 2 NEURAL NETWORKS HISTORICAL REVIEW
D. ANDINA¹, A. VEGA-CORONA², J. I. SEIJAS³, J. TORRES-GARCÍA¹
¹ Departamento de Señales, Sistemas y Radiocomunicaciones (SSR), Universidad Politécnica de Madrid (UPM), Ciudad Universitaria C.P. 28040, Madrid, España. [email protected]
² Facultad de Ingeniería, Mecánica, Eléctrica y Electrónica (FIMEE), Universidad de Guanajuato (UG), Salamanca, Gto., México. [email protected]
³ Departamento de Señales, Sistemas y Radiocomunicaciones (SSR), Universidad Politécnica de Madrid (UPM), Ciudad Universitaria C.P. 28040, Madrid, España. [email protected]
Abstract:
This chapter starts with a historical summary of the evolution of Neural Networks, from the first models, which were very limited in their application capabilities, to the present ones, which make it possible to consider automating tasks formerly reserved for human intelligence. After the historical review, Neural Networks are treated from a computational point of view. This perspective helps to compare Neural Systems with classical Computing Systems and leads to a formal and common presentation that will be used throughout the book.
D. Andina and D.T. Pham (eds.), Computational Intelligence, 39–65. © 2007 Springer.

INTRODUCTION

Present-day computers can perform a great variety of well-defined tasks at a higher speed and with more reliability than human beings. None of us is able, for example, to solve complex mathematical equations at the speed of a personal computer. Nevertheless, the mental capacity of human beings still exceeds that of machines in a wide variety of tasks. No artificial image recognition system is able to compete with the capacity of a human being to discern between objects of diverse forms and orientations; in fact, it would not even be able to compete with the capacity of an insect. In the same way, whereas a computer needs an enormous amount of computation and restrictive conditions to recognize, for example, phonemes, an adult human effortlessly recognizes words pronounced by different people, at different speeds and with different accents and intonations, even in the presence of environmental noise.

It is observed that, by means of rules learned from experience, the human being is much more effective than computers at solving imprecise (ambiguous) problems, or problems that require a great amount of information. Our brain reaches these objectives by means of thousands of millions of simple cells, called neurons, which are interconnected with each other. However, it is estimated that operational amplifiers and logic gates can perform operations several orders of magnitude faster than neurons. If the processing technique of biological elements were implemented with operational amplifiers and logic gates, one could construct relatively cheap machines able to process at least as much information as a biological brain. Of course, we are still far from knowing whether such machines will ever be built.

Therefore, there are strong reasons to consider tackling certain problems by means of parallel systems that process information and learn using principles taken from the brain systems of living beings. Such systems are called Artificial Neural Networks, connectionist models or parallel distributed processing models. Artificial Neural Networks (ANNs or, simply, NNs) thus arise from the intention of simulating the biological brain system in an artificial way.

1. HISTORICAL PERSPECTIVE
The science of Artificial Neural Networks made its first significant appearance during the 1940s. Researchers who tried to emulate the functions of the human brain developed physical models (later, simulations by means of programs) of biological neurons and their interconnections. As neurobiologists deepened their knowledge of the human neural system, these first models came to be considered increasingly rudimentary approximations. Nevertheless, some of the results obtained in those early days were impressive, which encouraged further research and the development of sophisticated and powerful Artificial Neural Networks.

1.1 First Computational Model of Nervous Activity: The Model of McCulloch and Pitts
McCulloch and Pitts published the first systematic studies of artificial neural networks [McCulloch and Pitts, 1943] [Pitts and McCulloch, 1947]. Their work was presented as a computational model of the nervous activity of the cells of the human nervous system. Most of it focuses on the behavior of a single neuron, whose mathematical or computational model is shown in Figure 1. Inside the artificial neuron, each input $x_i$ is multiplied by a scale factor (or weight $w_i$) and the products are summed. The inputs emulate the excitations received by biological neurons. The weights represent the strength of the synaptic connection: a positive weight represents an excitatory effect, and a negative weight an inhibitory effect. If the result of the sum is higher than a certain threshold value or bias (represented by the weight $w_0$), the cell activates, providing a positive value (normally +1); otherwise, the output takes a negative value (usually −1) or zero. Therefore, the output is binary.

[Figure 1: neuron diagram with input vector $X = [x_1, x_2, \ldots, x_M]^T$, weights $w_0, w_1, \ldots, w_M$, weighted sum $Z = \sum w_i x_i$ and activation function $O = f(Z)$ with output levels 1 and −1.]
Figure 1. Artificial model [McCulloch and Pitts, 1943] of a biological neuron. As can be observed, the relation between the input and the output follows a nonlinear function called the activation function. In the first model shown in this figure, the activation is a hard threshold function

In general, the model follows neurobiological behavior: nerve cells produce nonlinear responses when excited by a certain input. In particular, McCulloch and Pitts proposed an activation function, representing the nonlinearity of the model, called the hard threshold function (see Figure 1). Although this first model can only perform very simple tasks, as will be described below, the potential of neural systems lies essentially in the interconnection of neurons to form networks. This interconnection is normally arranged in layers of nodes (artificial neurons). Neural networks of this kind are called Multi-Layer Perceptrons (MLP). In general, one can distinguish Feed Forward Neural Networks, in which information is always transmitted in the direction from the input layer to the output layer, from Feedback Neural Networks, in which information can be transmitted in both directions; that is, connections from nodes of higher layers to nodes of lower layers are allowed. Figure 2 shows a Feed Forward Neural Network of two layers: a hidden layer (located right after the input layer) and the output layer. The input layer is usually not considered to be a proper layer of the network. Each component of the input vector $\mathbf{x} = [1, x_1, \ldots, x_M]^T$ is connected to all the nodes of the first hidden layer. The strength of each connection is determined by its associated weight. When the same philosophy is applied to the rest of the network’s layers, it is said that full connectivity exists.

To proceed chronologically, we will leave multilayer networks (MLP) aside for the moment. The first artificial neural network proposals were networks of a single layer, like the one shown in Figure 2 but without the output layer.
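The threshold unit and the fully connected layer described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function names, the strict comparison `z > 0` and the weight values (chosen so that the unit behaves like a logical AND on ±1 inputs) are not taken from the chapter.

```python
def threshold_neuron(x, w, w0):
    """Hard-threshold unit: weighted sum of the inputs plus the bias w0, then +1 or -1."""
    z = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return 1 if z > 0 else -1


def layer(x, weights, biases):
    """Fully connected layer: every input component feeds every node."""
    return [threshold_neuron(x, w, w0) for w, w0 in zip(weights, biases)]


# Assumed example weights: the unit fires only when both inputs are +1.
and_w, and_w0 = [1.0, 1.0], -1.5
patterns = [(1, 1), (1, -1), (-1, 1), (-1, -1)]
outputs = [threshold_neuron(x, and_w, and_w0) for x in patterns]
print(outputs)
```

A single unit of this kind splits the input plane with one straight line; placing several of them side by side in `layer` gives the single-layer networks mentioned in the text.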
[Figure 2: network diagram with the INPUT vector $X = [1, x_1, \ldots, x_M]^T$, a hidden layer of $m$ nodes and the OUTPUT layer.]
Figure 2. Two Layer Feed Forward Neural Networks
This arrangement of several units of the first neuron model (see Figure 1) in parallel was suggested in order to overcome the limitations of a neuron acting alone. It is easy to verify that the model of McCulloch and Pitts divides the input space into two parts by means of the hyperplane described by the equation

$$h(\mathbf{x}) = \sum_{j=1}^{M} w_j x_j + w_0 = 0 \qquad (1)$$
This effect can be observed in Figure 3, which shows this hyperplane for the particular case of $M = 2$. A single neuron can solve two-class classification problems of $M$-dimensional data, provided that the classes are linearly separable. That is, it can assign an output equal to 1 to all the data of class “A” (which fall on the same side of the hyperplane), whereas it assigns a value equal to −1 to the rest of the data, which fall on the opposite side. Mathematically, we can express this classification as

$$\sum_{j=1}^{M} w_j x_j \; \underset{C_A}{\overset{C_B}{\gtrless}} \; -w_0 \qquad (2)$$
where $C_A$ and $C_B$ denote class A and class B, respectively. We now have a very simple model of neuron behavior, which does not consider many of the biological characteristics that it tries to emulate. For example, it does not consider the real delays existing in all inter-neural transmission (which have an important effect on the dynamics of the system) or, more importantly, it does not include synchronism or frequency modulation effects, which many researchers consider crucial.

[Figure 3: the $(x_1, x_2)$ plane divided by the line $h(\mathbf{x}) = 0$ into the region $h(\mathbf{x}) > 0$ (Class 1) and the region $h(\mathbf{x}) < 0$ (Class 2), with annotations $w_0/w_2$ and $w_1/w_2$.]
Figure 3. The hyperplane determined by the McCulloch and Pitts neuron model for the case of two-dimensional inputs. This hyperplane depends on the neuron’s parameters (weights $w_j$ and threshold value $w_0$) according to the mathematical expression $\sum_{j=1}^{M} w_j x_j + w_0 = 0$
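As a worked instance of equation (1), the following sketch takes the two-dimensional case of Figure 3 and solves the boundary for $x_2$. The numerical weights are illustrative assumptions, not values from the chapter.

```latex
% Boundary for M = 2, assuming w_2 \neq 0
w_1 x_1 + w_2 x_2 + w_0 = 0
\quad\Longrightarrow\quad
x_2 = -\frac{w_1}{w_2}\,x_1 - \frac{w_0}{w_2}

% Assumed example: w_1 = w_2 = 1, w_0 = -1.5
x_2 = -x_1 + 1.5, \qquad
h(1, 1) = 1 + 1 - 1.5 = 0.5 > 0, \qquad
h(-1, -1) = -1 - 1 - 1.5 = -3.5 < 0
```

The slope $-w_1/w_2$ and the intercept $-w_0/w_2$ appear to correspond to the quantities annotated in Figure 3; with the assumed weights, the points (1, 1) and (−1, −1) fall on opposite sides of the line.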
In spite of their limitations, networks designed in this way show characteristics classically restricted to biological systems. Perhaps researchers have been able to capture the main operations of the biological neuron in this model, or perhaps the similarities in some applications are mere coincidence. Only time and further research will settle this question.
1.2 Training of Neural Networks: Hebbian Learning
The equation of the hyperplane boundary that characterizes the operation of the artificial neuron depends on the synaptic weights $w_1, \ldots, w_M$ and on the threshold value $w_0$, which is normally considered as another weight of the network. The remaining problem is how to choose, determine or search for the values of these weights that solve the problem at hand. This task is called learning or training of the network. From a historical point of view, Hebbian learning is the oldest and one of the most studied learning procedures. In 1949, Hebb proposed a learning model that has given rise to many of the learning systems that exist nowadays for training neural networks. Hebb proposed that the value of the synaptic connection should be increased whenever the input and the output of a neuron are simultaneously active. In this way, the network connections that are used most frequently are reinforced, emulating the biological phenomena of habit and learning through repetition. A neural network is said to use Hebbian learning when it increases the value of its weights in proportion to the product of the excitation levels of the source and destination neurons. Hebbian learning is performed through successive iterations using only the information of the network’s input and output; it never uses the desired output or target. For this reason, this type of learning is called unsupervised learning. This distinguishes it from other learning models that use the additional information of the desired output values, acting as a teacher, which we describe next.
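The Hebbian principle described above, a weight change proportional to the product of input and output activity, can be sketched as follows. The linear output, the learning rate `eta` and the starting weights are assumptions made for this illustration only; the chapter does not fix any of them.

```python
def hebbian_step(w, x, eta=0.1):
    """One unsupervised Hebbian update: each weight grows by eta * output * input."""
    y = sum(wj * xj for wj, xj in zip(w, x))  # linear output (assumed for this sketch)
    return [wj + eta * y * xj for wj, xj in zip(w, x)]


w = [0.1, 0.1]                       # assumed small starting weights
for _ in range(5):
    w = hebbian_step(w, [1.0, 1.0])  # the same pattern presented repeatedly
print(w)
```

No target value appears anywhere in the update, which is what makes the rule unsupervised. In this plain form the weights of a repeatedly used connection keep growing, so practical variants usually add some form of normalization.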
1.3 Supervised Learning: Rosenblatt and Widrow
Although many learning methods following the Hebbian model have been developed, it seems logical to expect that the most efficient results can be achieved by methods that use information about the desired network output (supervised learning). Learning is thus guided towards performing a given function. Around 1960, Rosenblatt [Rosenblatt, 1962] dedicated his efforts to developing supervised learning algorithms to train a neural network that he called the perceptron. A perceptron is a Feed Forward neural network like the one shown in Figure 2, where the nonlinearities of the neurons are of the hard threshold type. Some common functions used as alternatives to the hard threshold function will be shown later on. In this way, the McCulloch and Pitts model can be considered the simplest kind of hard threshold perceptron.
Specifically, Rosenblatt showed that a one-layer perceptron is able to learn many practical functions. He proposed a learning rule for the perceptron called the perceptron rule. Let us consider the simplest case of a one-layer perceptron composed of a single neuron, that is, the model proposed by McCulloch and Pitts. If a set of input patterns and their corresponding desired outputs is known, $D_N = \{(\mathbf{x}_1, d_1), (\mathbf{x}_2, d_2), \ldots, (\mathbf{x}_N, d_N)\}$, then, for a given input pattern $\mathbf{x}_k$ of the data set, the perceptron rule updates the network weights $\mathbf{w} = [w_0, w_1, \ldots, w_M]^T$ in the following way

$$\mathbf{w}(k+1) = \mathbf{w}(k) + \eta\,[d_k - o_k]\,\mathbf{x}_k \qquad (3)$$
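The update of equation (3) can be sketched as a short training loop. This is a minimal illustration, not Rosenblatt's original procedure: the function names, the zero initial weights (used here instead of small random values so that the run is reproducible), the learning rate of 0.5 and the AND-like training set are all assumptions.

```python
def predict(w, x):
    """Hard-threshold output; w[0] is the bias w0, paired with a constant input of 1."""
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
    return 1 if z > 0 else -1


def train_perceptron(data, eta=0.5, max_epochs=100):
    """Apply w(k+1) = w(k) + eta * (d_k - o_k) * x_k, one pattern at a time."""
    w = [0.0] * (len(data[0][0]) + 1)
    for _ in range(max_epochs):
        errors = 0
        for x, d in data:
            o = predict(w, x)
            if d != o:                      # weights change only on a misclassification
                errors += 1
                delta = eta * (d - o)
                w[0] += delta               # bias input is the constant 1
                for j, xj in enumerate(x):
                    w[j + 1] += delta * xj
        if errors == 0:                     # every pattern classified correctly
            break
    return w


# Assumed, linearly separable training set: output +1 only for (1, 1).
and_data = [((1, 1), 1), ((1, -1), -1), ((-1, 1), -1), ((-1, -1), -1)]
w = train_perceptron(and_data)
print(w, [predict(w, x) for x, _ in and_data])
```

The `max_epochs` cap plays the role of the forced stop that is needed when the training set is not linearly separable.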
The parameter $\eta$ controls the magnitude of the updates, and hence the convergence speed of the algorithm. It is called the learning rate and it usually takes values between 0 and 1. The set $D_N$ is called the learning set and, as it includes the values of the desired outputs, it is of the supervised type. If the training data set is linearly separable, Rosenblatt showed that the algorithm always converges in a finite number of steps, independently of the value of $\eta$. On the contrary, if the problem is not linearly separable, the algorithm has to be forced to stop, as there will always be at least one misclassified pattern. Usually, training starts by giving small random values to the perceptron weights. In each step of the algorithm, a new input $\mathbf{x}_k$ is applied to the network, the corresponding output $o_k$ is calculated, and the weights are updated only if the error $d_k - o_k$ is not zero. It is interesting to note that if the learning rate is close to 0, the weights vary little with each new input and learning is slow; if the value is close to 1, the weights can differ considerably from one iteration to the next, reducing the influence of past iterations, and the algorithm may fail to converge. This problem is called instability. Therefore, the learning rate should be adapted to the changes in the distribution of the input patterns, balancing training time against stable updating of the weights.

Also in the early 1960s, Widrow and Hoff [Widrow and Hoff, 1960] performed several demonstrations on perceptron-like systems that they called ADALINE (“ADAptive LINear Elements”), proposing a learning rule called the LMS (“Least Mean Square”) algorithm or Widrow-Hoff algorithm. This rule minimizes the Sum of Square Errors (SSE) between the desired output and the output given by the perceptron before the hard threshold activation function.
That is, it minimizes the error function

$$E(\mathbf{w}) = \frac{1}{2} \sum_{j=1}^{N} (d_j - z_j)^2 \qquad (4)$$
through a gradient algorithm. The linear output $z$ can be observed in Figure 1. When the gradient of Equation (4) with respect to $\mathbf{w}$ is computed and the weights are updated in the direction opposite to the gradient, the LMS rule is obtained:

$$\mathbf{w}(k+1) = \mathbf{w}(k) + \eta \sum_{j=1}^{N} [d_j - z_j(k)]\,\mathbf{x}_j \qquad (5)$$
where $z_j(k) = \mathbf{w}^T(k)\,\mathbf{x}_j$. This “block” version of the LMS (in the sense that it uses all training patterns in each iteration) is usually replaced by a “stochastic approximation” (pattern by pattern), as shown in the equation

$$\mathbf{w}(k+1) = \mathbf{w}(k) + \eta\,[d_k - z_k]\,\mathbf{x}_k \qquad (6)$$
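The pattern-by-pattern rule of equation (6) and the error function of equation (4) can be sketched as below. The function names, the zero initial weights, the learning rate, the number of epochs and the training set are assumptions chosen for a reproducible illustration.

```python
def lms_train(data, eta=0.01, epochs=500):
    """Pattern-by-pattern LMS: w <- w + eta * (d_k - z_k) * x_k, with z_k the linear output."""
    w = [0.0] * (len(data[0][0]) + 1)
    for _ in range(epochs):
        for x, d in data:
            xa = [1.0] + list(x)                       # constant 1 pairs with the bias w0
            z = sum(wj * xj for wj, xj in zip(w, xa))  # output before any threshold
            w = [wj + eta * (d - z) * xj for wj, xj in zip(w, xa)]
    return w


def sse(w, data):
    """E(w) = 1/2 * sum_j (d_j - z_j)^2, as in equation (4)."""
    total = 0.0
    for x, d in data:
        z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
        total += (d - z) ** 2
    return 0.5 * total


# Assumed training set: output +1 only for (1, 1).
and_data = [((1, 1), 1), ((1, -1), -1), ((-1, 1), -1), ((-1, -1), -1)]
w = lms_train(and_data)
print(w, sse(w, and_data))
```

For this particular data set the least-squares weights can be worked out by hand as $(w_0, w_1, w_2) = (-0.5, 0.5, 0.5)$, and the sketch should settle close to them. Note that the error is measured on the linear output, so it does not reach zero even though thresholding that output classifies every pattern correctly.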
Unlike the perceptron rule, the LMS rule delivers reasonable results (the best that can be achieved by a linear discriminator in the SSE sense) when the training set is not linearly separable. During these years, researchers all around the world became enthusiastic about the application possibilities that these systems promised.

1.4 Partial eclipse of Neural Networks: Minsky and Papert
The initial euphoria aroused in the early sixties gave way to disappointment when Minsky and Papert [Minsky and Papert, 1969] rigorously analyzed the problem and showed that there exist severe restrictions on the class of functions that a perceptron can perform. One of their results shows that a one-layer perceptron with two inputs and one output is unable to perform a function as simple as the exclusive-or (Xor). The inputs of this function take the values 1 or −1, and the output is −1 when the two inputs are different and 1 when they are equal. This problem is illustrated in Figure 4. It can be observed that a linear discriminator is unable to separate the patterns of the two classes.

[Figure 4: the four input patterns in the $(x_1, x_2)$ plane, labeled Class A and Class B, together with the table of patterns X and desired outputs d.]
X          d
(1, 1)     1
(1, −1)    −1
(−1, 1)    −1
(−1, −1)   1
Figure 4. The or-exclusive (Xor) problem

This limitation was well known by the end of the sixties, and it was also known that the problem could be solved by adding more layers to the system. As an example, let us analyze a two-layer perceptron. The first layer can classify input vectors separated by hyperplanes. The second layer can implement the logical functions AND and OR, because both problems are linearly separable. In this way, a perceptron like the one shown in Figure 5 (a) can implement boundaries like the one shown in Figure 5 (b) and so solve the Xor problem. In the general case, it can be shown that a two-layer perceptron can implement convex, connected regions (a region is said to be convex if any straight line segment joining two points of its boundary passes only through points of the region enclosed by the boundary). Convex regions are bounded by the (hyper)planes implemented by the nodes of the first layer, and can be open or closed.

It has to be noted that the capabilities of Multi-Layer Perceptrons rely on the nonlinearities of their neurons. If the activation function of these neurons were linear, the capabilities of the MLP would be the same as those of the single-layer perceptron. For example, consider a two-layer perceptron with threshold value $w_0 = 0$ and a linear activation function $f(z) = z$ (see Figure 1). In this case, the outputs of the first layer can be expressed in matrix form as $O_1 = W_1^T X$, and those of the second layer as $O_2 = W_2^T O_1$. Then, the output as a function of the input is obtained as

$$O_2 = W_2^T O_1 = W_2^T W_1^T X = W_{total}^T X \qquad (7)$$
[Figure 5: (a) a two-layer perceptron with input $X = [x_1, x_2]^T$, hidden nodes 1 and 2, and output O; (b) the $(x_1, x_2)$ plane with the decision boundaries of node 1 and node 2 separating the Class A and Class B patterns.]
Figure 5. (a) Two layers perceptron, able to solve the Xor problem, implementing a boundary as shown in (b)
This function could be performed by a single-layer perceptron whose weights were $W_{total}$. Therefore, if the nodes are linear elements, the performance of the structure is not improved by adding new layers, as an equivalent one-layer perceptron can always be found. In spite of the possibilities opened up by the MLP, Minsky and Papert, prestigious scientists of their time, emphasized that algorithms to train such structures were not known, and expressed their scepticism that they would ever be developed. The book by Minsky and Papert [Minsky and Papert, 1969], which showed some critical examples of the disadvantages of NNs versus classical computers in terms of their capabilities for storing information, dealt a strong blow to the enthusiasm for NN research, eclipsing its development for the next twenty years.
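Two points made in this section can be checked numerically with a small sketch: a two-layer threshold network reproduces the Xor table of Figure 4, and two linear layers collapse into a single one as in equation (7). All weights and matrices below are hand-picked assumptions for illustration; they are not the values behind Figure 5.

```python
def step(z):
    """Hard threshold with outputs +1 and -1."""
    return 1 if z > 0 else -1


def two_layer_xor(x1, x2):
    h1 = step(x1 + x2 - 1.5)     # hidden node 1: +1 only for (1, 1)
    h2 = step(-x1 - x2 - 1.5)    # hidden node 2: +1 only for (-1, -1)
    return step(h1 + h2 + 1.5)   # output node: logical OR of the hidden nodes


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


patterns = [(1, 1), (1, -1), (-1, 1), (-1, -1)]
xor_outputs = [two_layer_xor(*p) for p in patterns]

# Linear layers without thresholds: O2 = W2^T W1^T X = W_total^T X, with W_total = W1 W2.
W1 = [[1, 2], [3, 4]]
W2 = [[1], [-1]]
X = [[5], [6]]
two_layers = matmul(transpose(W2), matmul(transpose(W1), X))
one_layer = matmul(transpose(matmul(W1, W2)), X)
print(xor_outputs, two_layers, one_layer)
```

The first part works only because of the hard thresholds in the hidden layer; the second part shows that removing them leaves nothing that a single layer could not do.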
1.5 Backpropagation algorithm: Werbos, Rumelhart et al. and Parker
It is true that the single-layer perceptron has the limitation of being a simple linear discriminator. There are reasons to affirm that it is only able to solve “toy” problems. Although these limitations are reduced as the number of layers increases, it is difficult to find adequate weights to solve a given problem. This problem was solved by incorporating “soft”, differentiable nonlinearities in the neurons in place of the classical hard threshold. Specifically, the sigmoidal function is very appropriate (Figure 6). Among others, there is an especially relevant theorem on the capabilities of the MLP with soft activation functions, Cybenko’s Theorem [Cybenko, 1989]: a two-layer perceptron whose first-layer nodes (in unlimited number) perform sigmoidal activation functions is sufficient to approximate any continuous correspondence between $\mathbb{R}^{N_0}$ and $[-1, 1]^{N_L}$ (therefore, it will also be possible to establish any classification). For a first review of the capabilities of perceptrons as “approximators” in the case of soft nonlinearities, it is worth mentioning the work of Hornik et al. [Hornik et al., 1989, Hornik et al., 1990].

[Figure 6: network diagram with input vector $X = [x_1, x_2, \ldots, x_m]^T$, hidden nodes 1 to m and output nodes, with a detail of an MLP node applying a sigmoidal nonlinearity to $\mathbf{w}_m^T \mathbf{x}$.]
Figure 6. Multilayer Perceptron with sigmoidal nonlinearities

But, again, we must return to the question of how to train the network weights. In complete analogy with the LMS algorithm previously described, the backpropagation algorithm updates the network weights (in this case those of an MLP) in the direction opposite to the gradient of the error function that we aim to minimize (e.g., the SSE). For that purpose, the chain rule is applied as many times as required to calculate that gradient for all the weights in the network. As the output is a differentiable function, this calculation is relatively easy [Haykin, 1994]. The backpropagation algorithm was proposed independently and successively by Werbos [Werbos, 1974], Rumelhart et al. [Rumelhart et al., 1986] and Parker [Parker, 1985]. It can be said that the pessimism aroused by the book by Minsky and Papert had its counterpart twenty years later in the development of the backpropagation algorithm.
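The chain-rule computation described above can be sketched for a very small network. This is a hedged illustration, not the formulation of any of the cited authors: the 2-2-1 layout, the tanh nonlinearity, the starting weights, the learning rate and all function names are assumptions.

```python
import math


def forward(params, x):
    """Small network with tanh units; params = (W1, b1, W2, b2)."""
    W1, b1, W2, b2 = params
    h = [math.tanh(b + sum(w * xi for w, xi in zip(row, x))) for row, b in zip(W1, b1)]
    o = math.tanh(b2 + sum(w * hi for w, hi in zip(W2, h)))
    return h, o


def sse(params, data):
    return 0.5 * sum((d - forward(params, x)[1]) ** 2 for x, d in data)


def backprop(params, data):
    """Gradient of the SSE with respect to every weight, obtained with the chain rule."""
    W1, b1, W2, b2 = params
    gW1 = [[0.0] * len(row) for row in W1]
    gb1 = [0.0] * len(b1)
    gW2 = [0.0] * len(W2)
    gb2 = 0.0
    for x, d in data:
        h, o = forward(params, x)
        delta_o = -(d - o) * (1 - o * o)               # error term at the output node
        gb2 += delta_o
        for i, hi in enumerate(h):
            gW2[i] += delta_o * hi
            delta_h = delta_o * W2[i] * (1 - hi * hi)  # error propagated back to hidden node i
            gb1[i] += delta_h
            for j, xj in enumerate(x):
                gW1[i][j] += delta_h * xj
    return gW1, gb1, gW2, gb2


def gradient_step(params, grads, eta):
    """Move every weight against its gradient component."""
    W1, b1, W2, b2 = params
    gW1, gb1, gW2, gb2 = grads
    return ([[w - eta * g for w, g in zip(r, gr)] for r, gr in zip(W1, gW1)],
            [b - eta * g for b, g in zip(b1, gb1)],
            [w - eta * g for w, g in zip(W2, gW2)],
            b2 - eta * gb2)


xor_data = [((1, 1), 1), ((1, -1), -1), ((-1, 1), -1), ((-1, -1), 1)]
params = ([[0.5, -0.4], [0.3, 0.8]], [0.1, -0.2], [0.7, -0.6], 0.05)  # arbitrary start
before = sse(params, xor_data)
for _ in range(200):
    params = gradient_step(params, backprop(params, xor_data), eta=0.05)
after = sse(params, xor_data)
print(before, after)
```

The hidden-node term `delta_h` is the output error carried backwards through the weight `W2[i]` and the derivative of the hidden nonlinearity, which is where the name of the algorithm comes from. The sketch only shows that the error decreases; whether it reaches a good Xor solution depends on the starting weights.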
2. NEURAL NETWORKS VS CLASSICAL COMPUTERS
Classical digital computers process information at two basic levels: hardware and software. The computations performed are algorithmic and sequential. Each problem is solved through an algorithm coded in a program that is physically located in the computer memory. Problems are solved one after the other. Algorithms are executed as many times as needed, with the same reliability and at electronic speed.

Nevertheless, there are many real problems to which computers cannot yet be successfully applied. For example, think of a little mosquito finding its way to survive in the world. Such a problem is an unsolved challenge for any automatic device. The difference probably lies in the fact that living beings do not follow the processing scheme of computers. Biological brains process information in a massively parallel, non-sequential way. Problems are solved by the cooperative participation of millions of highly interconnected elementary processors, called neurons. Neurons do not need to be programmed. From the stimuli they receive from other neurons, they are able to modify, adapt or learn their functioning. The system does not need a central processing unit to control its activities. It is interesting to note that biological neural systems work at a speed several orders of magnitude lower than electronic systems. The brain is, therefore, an adaptive, non-linear, sophisticated processing system. Knowledge is distributed in the activation state of the neurons, and memory is not addressed through fixed labels.

The architecture of Artificial Neural Networks tries to emulate the basic neural features of brains, and the networks are designed by learning from examples. They could be defined as networks that massively connect simple units (usually adaptive), hierarchically organized, that try to interact with real-world objects in the fashion of biological systems. The advantages of NNs over classical computers are:
1 Adaptive Learning: they are able to learn and to perform tasks by an adaptive procedure.
2 Self-organization: they are able to build their own internal organization or representation of the information provided in a learning phase.
3 Fault Tolerance: the ability to perform the same function despite the partial destruction of the network.
4 Real-time operation: their hardware architecture is oriented towards massively parallel processing of information.
5 Simplicity of integration with present technology: these systems can be easily simulated on present computers and are also implemented in specific neural hardware, which allows their modular integration in present systems.

3. BIOLOGICAL AND ARTIFICIAL NEURONS
3.1 The Biological Neuron
The biological neuron, whose basic operation is not yet completely known or understood, is composed of a cell body and a series of branching ramifications, called dendrites. One of these branches is particularly long and receives the name of axon. It starts from the cell body and ends in another series of ramifications. These nerve terminals are used by neurons to contact each other by means of synaptic connections. When a cell receives signals from other cells (these can be excitatory or inhibitory) and their global effect is an excitation that exceeds a certain threshold value, it responds by transmitting a nerve signal through the axon to the adjacent cells by means of the synapses of its nerve terminals. The human nervous system is made up of these cells and is of fascinating complexity. It is estimated that $10^{11}$ neurons participate in more than $10^{15}$ interconnections over channels that can measure more than a meter. Studies of the anatomy of the human brain conclude that there are more than 1000 synapses at the input and output of each neuron. It is important to note that, although the switching time of a neuron (a few milliseconds) is almost a million times slower than that of present computer elements, biological neurons have a much higher connectivity (thousands of times) than present supercomputers.

Neurons are composed of the cell core, or soma, and several branches called the axon and the dendrites. The branches of different neurons are connected at what are called synapses, which establish the connection with neighboring neurons in order to make communication among them possible. Each neuron has two basic states: activation and rest. When a neuron is activated, it emits through the axon a train of electrical impulses, whose frequency depends on its level of activation. Information is coded in the frequency of the impulses and not in their amplitude. The signal produced in the neuron body propagates through the axon to other neurons by means of chemical exchanges that take place at the synapses. The chemical substances released at the synapses are called neurotransmitters, and they contribute to increasing or inhibiting the activation level of the neuron that receives them. Due to the action of the neurotransmitters, which are basically chemical signals, ionic channels are opened in the receiving neuron and ions are received, contributing to the overall electrical charge of the neuron, or excitation level. When the excitation level surpasses a certain activation level, the neuron is activated. The efficiency of the synapse depends on several factors: the number of neurotransmitter glands, the concentration in the membrane of the neighboring neuron, the efficiency of the ionic channels and other physical and chemical variables.

As to the learning procedure, recent discoveries suggest that it is also electrochemical in nature, taking place among neighboring neurons that are hierarchically close in a layered structure. The chemical released in the learning process seems to be nitric oxide (NO). Its molecules are able to cross the membrane and travel to neighboring neurons, controlling the efficiency of the connection through reactions with other chemicals in the receiving neuron. This regulation of the efficiency of the electrochemical connection among neighboring neurons is responsible for the learning procedure.

3.2 The Artificial Neuron
The simplest model of the artificial neuron, as presented in Figure 1, is obtained by approximating the action of all the neuron inputs by a linear function. Let us call this function the Base Function, $u(\cdot)$. In this case, the Base Function is a weighted sum

$$u = w_0 + \sum_{j=1}^{n_i} w_j x_j$$

where $w_0$ is a threshold and $w_j$ are the synaptic weights, which determine the effect of the inputs on the activation function. The output function of an artificial neuron can be expressed as

$$y = f(\mathbf{x}) = f\left(w_0 + \sum_{j=1}^{n_i} w_j x_j\right)$$
In an artificial neuron, this function is computed in successive steps: first the value of the base function $u(\cdot)$ is calculated as the sum of the input values $x_j$ weighted by the synaptic weights $w_j$, plus the threshold value $w_0$; then the non-linear activation function $f(u)$ is applied. Typical activation functions, some of which are shown in Figure 7, are:
• Step function: $u(t) = \begin{cases} 0 & \text{if } t < 0 \\ 1 & \text{if } t \geq 0 \end{cases}$
• Sign function: $\mathrm{sgn}(t) = \begin{cases} -1 & \text{if } t < 0 \\ 1 & \text{if } t \geq 0 \end{cases}$
• Gaussian function: $f(x) = a\,e^{-x^2/2}$
• Exponential function: $f(x) = \dfrac{1}{1 + e^{-\beta x}}, \quad \beta > 0$
• Hyperbolic function: $f(x) = \tanh(\beta x), \quad \beta > 0$

[Figure 7: plots of the Sign function, $f(x-a) = 1$ if $x \geq a$ and $-1$ if $x < a$, and of the Hyperbolic function, $f(x) = \tanh(\beta x)$, $\beta > 0$.]
Figure 7. Some typical activation functions

Hyperbolic and exponential functions are classified as sigmoids or sigmoidal functions. They are real-valued, bounded and monotonic ($f'(x) > 0$). For sigmoidal functions, the slope at the origin is called the gain, and its value is a measure of the steepness of the transition. Therefore, if the gain tends to infinity, the sigmoid tends to a Sign function. Accordingly, the Exponential and Hyperbolic functions have gains of $\beta/4$ and $\beta$, respectively. As assumed in the previous point, the activation function of a neuron is nonlinear. If the function $f(u)$ is linear, $f(u) = u$, the artificial neuron is called a Linear Neuron or Linear Node of the NN.
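The base function and the activation functions listed above can be sketched together, along with a numerical check of the gain values. The function names, the default parameters and the finite-difference estimate of the slope are assumptions of this sketch; the Gaussian follows the form $a\,e^{-x^2/2}$ as read from the text.

```python
import math


def neuron_output(x, w, w0, f):
    """Base function (weighted sum plus threshold) followed by the activation f."""
    u = w0 + sum(wj * xj for wj, xj in zip(w, x))
    return f(u)


def step(t):
    return 1 if t >= 0 else 0


def sign(t):
    return 1 if t >= 0 else -1


def gaussian(x, a=1.0):
    return a * math.exp(-x * x / 2)


def exponential(x, beta=1.0):
    """The sigmoid that the text calls the Exponential function."""
    return 1 / (1 + math.exp(-beta * x))


def hyperbolic(x, beta=1.0):
    return math.tanh(beta * x)


def slope_at_origin(f, h=1e-6):
    """Central-difference estimate of the gain f'(0)."""
    return (f(h) - f(-h)) / (2 * h)


beta = 2.0
print(slope_at_origin(lambda x: exponential(x, beta)))  # close to beta / 4
print(slope_at_origin(lambda x: hyperbolic(x, beta)))   # close to beta
print(neuron_output([1, -1], [0.5, 0.5], 0.2, sign))
```

Raising `beta` makes the hyperbolic function steeper around the origin, which is the sense in which a sigmoid with a very large gain behaves like the Sign function.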
NEURAL NETWORKS: CHARACTERISTICS AND TAXONOMY
A Neural Network can be represented as an ordered pair $(G, E)$, composed of a set of basic processing elements $G$, also called processing units, artificial neurons or nodes, and a set of interconnections $E$ among them. The node set $G$ is partitioned into different subsets called layers. Each processing unit always has a transfer function and can also have a local memory. The output $y$ is computed by applying this function to the weighted input values and the values stored in the local memory. There are four main aspects that characterize all NNs:
a) Data Representation. According to the form of their inputs and outputs, NNs can be classified as continuous NNs, digital NNs or hybrid NNs. In the continuous type, input-output data are of analog nature: their values are real and continuous. In digital NNs, input-output data are of digital nature. In the hybrid case, inputs are analog and outputs are binary.
b) Topology. The architecture or topology of the NN refers to the way the nodes are physically arranged in the network. The nodes form layers, or groups of nodes that share a common input and feed their outputs to common nodes. Only neurons in the input and output layers interact with external systems. The remaining nodes have only internal connections and form what are called hidden layers. Therefore, the topology of a NN is characterized by the number of layers, the number of neurons in each layer, the degree of connectivity and the type of connections among the nodes.
c) Input-Output Association. With respect to the type of input-output association, NNs can be classified as heteroassociative or autoassociative. Heteroassociative NNs implement a certain function, frequently one that is difficult to express analytically. They associate a set of inputs with a set of outputs in such a way that each input has a corresponding output. In autoassociative networks, the purpose of the outputs is to rebuild input information that has been corrupted, by associating each input with the most similar stored data.
d) Learning Procedure. Every connection or synapse between nodes in a NN has an associated synaptic weight or efficiency factor. The connection between node i and node j is weighted by $w_{ji}$. These weights are responsible for the learning of the neural network. In the learning phase, the NN modifies its weights in response to new input information. The weights are modified following a convergent algorithm; when all the weight values have stabilized and the learning phase ends, the NN is said to have “learnt”. For the learning process, it is crucial to establish the weight updating algorithm so that the NN correctly learns the new input information. According to the learning criterion, NNs can be classified as supervised learning NNs or unsupervised learning NNs.
Figure 8 shows the most common classification of NNs.
5.
FEED FORWARD NEURAL NETWORKS: THE PERCEPTRON
First presented in Section 1.1, Feed Forward Neural Networks are generally defined as those networks composed of one or more layers whose nodes are connected in such a way that their inputs come only from nodes in the previous layer and their outputs connect exclusively to neurons of the following layer. Their name comes from the fact that the output of each layer feeds the units of the following layer. Of all feed forward NNs, the most popular is the Multilayer Perceptron, developed as an extension of the Perceptron proposed by Rosenblatt in 1962 [Rosenblatt, 1962]. In this type of network, learning is supervised because it uses information about the output that the network must provide for the current input. The learning phase
Figure 8. Neural Networks basic taxonomy
or training phase consists in presenting to the network a set of input-output pairs, called training patterns, $D_N = \{(\mathbf{x}_1, \mathbf{d}_1), (\mathbf{x}_2, \mathbf{d}_2), \ldots, (\mathbf{x}_N, \mathbf{d}_N)\}$, with $\mathbf{x}_i \in \mathbb{R}^p$ and $\mathbf{d}_i \in \mathbb{R}^k$, $i = 1, 2, \ldots, N$, in such a way that the weights are adjusted. Once the training phase is completed, the network is designed and ready to work in what is called the direct mode phase. In this phase, the network classifies the
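inputs by the following binary decision rule: $g = 1$ if $\phi(\mathbf{x}, \mathbf{w}) > 0$ and $g = 0$ if $\phi(\mathbf{x}, \mathbf{w}) < 0$, where $\phi(\mathbf{x}, \mathbf{w})$ is the discriminating function; that is, the space $\mathbb{R}^p$ is divided into two regions by the decision boundary $\phi(\mathbf{x}, \mathbf{w}) = 0$. Logically, the choice of the discriminating function $\phi(\mathbf{x}, \mathbf{w})$ depends on the distribution of the training patterns. 5.1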
One Layer Perceptron
It basically consists of a set of nodes whose activation is produced by the weighted sum of the input values and, consequently, the discriminating function takes the form
(8) $\phi(\mathbf{x}, \mathbf{w}) = \sum_{i=1}^{p} w_i x_i + \theta = 0$
Also, if we set $\theta = w_0$ and consider the inputs in the space $\mathbb{R}^{p+1}$, such that $\mathbf{x} = [x_1, x_2, \ldots, x_p, 1]$ and $\mathbf{w} = [w_1, w_2, \ldots, w_p, w_0]$, Equation (8) can be expressed as
$\phi(\mathbf{x}, \mathbf{w}) = \mathbf{w}\mathbf{x}^T = 0$
Among other things, it serves to perform the pattern classification task, through a discriminating function of the form [Karayiannis and Venetsanopoulos, 1993], [Hush and Horne, 1993]:
$u_k(\mathbf{x}_n) = \sum_{j=0}^{N} w_{kj} x_{nj}$
The classification rule assigns class k to the input pattern if the k-th network output is the highest of all outputs:
$u_k(\mathbf{x}_n) \ge u_j(\mathbf{x}_n) \;\; \forall j \ne k \;\Rightarrow\; \mathbf{x}_n \in W_k$
The network must be trained, following an appropriate algorithm, to produce the desired output for each pattern. This decision rule is sometimes replaced by a binary decision rule with a decision threshold. The Perceptron is a system that operates in such a way. After learning or training, the Perceptron structure can separate the classification space into regions, one region for each class. The decision boundaries are composed of hyperplane segments defined as
$u_k(\mathbf{x}_n) - u_j(\mathbf{x}_n) = 0$
The Perceptron was initially proposed by Rosenblatt and a group of his students. Their work showed the versatility of the Perceptron. Unfortunately, the problem of linear separability caused interest in its use to fade.
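The highest-output classification rule described above can be sketched as follows. This is an illustrative sketch only: the weight values are hypothetical, and the convention that the constant component $x_0 = 1$ is placed first (so that column 0 of the weight matrix plays the role of the threshold) is an assumption of this example.

```python
def discriminants(W, x):
    """Return u_k(x) = sum_j w_kj * x_j for every class k.

    The input is augmented with a constant x_0 = 1 (assumed convention),
    so W[k][0] acts as the threshold of output k.
    """
    x_aug = [1.0] + list(x)
    return [sum(w * xj for w, xj in zip(row, x_aug)) for row in W]

def classify(W, x):
    """Assign the class whose discriminant output is the highest."""
    u = discriminants(W, x)
    return max(range(len(u)), key=lambda k: u[k])

# Hypothetical weights for three classes in a two-dimensional input space.
W = [
    [0.0, 1.0, 0.0],    # class 0: large first component
    [0.0, 0.0, 1.0],    # class 1: large second component
    [0.5, -1.0, -1.0],  # class 2: points close to the origin
]

print(classify(W, [2.0, 0.5]))
print(classify(W, [0.2, 3.0]))
print(classify(W, [0.1, 0.1]))
```

The boundary between two classes k and j in this sketch is the set of points where the two corresponding entries of `discriminants` are equal, which is a hyperplane segment as stated in the text.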
5.1.1
Perceptron Training
It can be summarized in five steps:
1. Weights and threshold initialization. Each of the weights $w_i$ is initialized to a small random value ($w_0 = \theta$).
2. Presentation of a training pattern. A new input/output training pair is presented, composed of a new input $X_p = (x_1, x_2, \ldots, x_N)$ and its corresponding desired output $d(t)$.
3. Computation of the present output:
$y_i(t) = f\left(\sum_{j=1}^{M} w_{ij} x_j(t) - \theta_i\right) = f\left(\sum_{j=0}^{M} w_{ij} x_j(t)\right) = f(\mathrm{Net}_i)$
where, in the second expression, the threshold is absorbed as the weight $w_{i0} = \theta_i$ of a constant input $x_0 = -1$.
4. Weight adaptation: $\Delta W_i = \eta\,[d(t) - y(t)]\,x_i(t)$.
• $\eta$: learning rate, $0 < \eta < 1$.
• $d(t)$: desired output; $y(t)$: present output.
• This process is repeated until the error $e(t) = d(t) - y(t)$ for each of the patterns is zero or less than a preset value.
5. Back to step 2.
The convergence of the perceptron training is established by the following theorem: if the training set of a multiple classification problem is linearly separable, then the perceptron training algorithm converges to a correct solution in a finite number of iterations. The mathematical proof of this theorem can be found in [Rosenblatt, 1962], and its significance lies in the fact that a multiple class problem can be reduced to a binary classification. Two typical examples of this situation are shown in Figure 9. 6.
LMS LEARNING RULE
Figure 9. Logical functions OR and AND reduced to a binary classification problem (axes $x_1$, $x_2$; input points 00, 01, 10, 11)
Nevertheless, even with the simple Perceptron structure, a reasonable solution can be achieved for a set that does not satisfy the linear separability property, by
the use of the Least Mean Square (LMS) convergence algorithm to update the NN weights during the learning phase. In general, the error function of Equation (4), also called cost function or objective function, to be minimized by the LMS algorithm can be expressed as follows [Hush and Horne, 1993]:
$E = \sum_{k=1}^{M} \sum_{\mathbf{x}_n \in W_k} \left\| \mathbf{u}(\mathbf{x}_n) - \boldsymbol{\delta}_k \right\|^2$
where $\boldsymbol{\delta}_k$ is a vector with all its components equal to zero except the k-th one, which corresponds to the correct classification. Therefore, for a given training set $D_N$, if $d_k$ is the desired output for the k-th input vector and $y_k$ is the computed value, the Mean Square Error (MSE) over the input-output pairs is given by
$\langle \varepsilon_k^2 \rangle = \frac{1}{N} \sum_{k=1}^{N} \varepsilon_k^2 = \frac{1}{N} \sum_{k=1}^{N} (d_k - y_k)^2$
or, in vector notation with $y_k = \mathbf{w}^T \mathbf{x}_k$,
$\langle \varepsilon_k^2 \rangle = \langle d_k^2 \rangle - 2 \langle d_k \mathbf{x}_k^T \rangle \mathbf{w} + \mathbf{w}^T \langle \mathbf{x}_k \mathbf{x}_k^T \rangle \mathbf{w}$
The minimum square error corresponds to the weight vector $\mathbf{w}$ that satisfies the equation
$\frac{\partial \langle \varepsilon_k^2 \rangle}{\partial \mathbf{w}} = 0$
In the case $N = 2$, the error surface is a paraboloid, as shown in Figure 10. From Figure 10 it can be observed that the optimum value for the weights of the network is the one that makes the gradient null. A possible search procedure is steepest descent. The gradient direction is perpendicular to the contour lines at each point of the error surface. At the starting point of the algorithm, the weight vector does not head directly towards the minimum, except in the case of spherical level curves. The weight updates in each iteration step must be small, or the weight vector could wander over the hypersurface without ever reaching the minimum. 6.1
The Multilayer Perceptron
A Perceptron of n layers is composed of n + 1 layers $L_l$, $l = 0, 1, \ldots, n$, each with several processing units, where $L_0$ corresponds to the input layer, $L_n$ to the output layer and $L_l$, $l = 1, \ldots, n - 1$, to the hidden layers. The nodes in the hidden and output layers are individual processing units. The output of each unit is obtained by adding all its weighted inputs and passing the result through a non-linear function of sigmoidal type (see Figure 6).
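The layer-by-layer computation just described can be sketched as a forward pass. The sketch below is an illustration under stated assumptions, not the authors' implementation: each layer is represented by a hypothetical pair `(W, b)` of weights and thresholds, a logistic sigmoid with unit parameter is assumed, and the numeric values are chosen only so that the result is easy to check by hand.

```python
import math

def sigmoid(u):
    """Logistic sigmoid, assumed as the non-linear function of each unit."""
    return 1.0 / (1.0 + math.exp(-u))

def forward(layers, x):
    """Propagate input x through layers L1..Ln.

    layers: list of (W, b) pairs, where W[j][i] connects node i of the
    previous layer to node j of the current layer and b[j] is the threshold
    term of node j. Returns the outputs of every layer, L0 (the input)
    included.
    """
    outputs = [list(x)]
    for W, b in layers:
        prev = outputs[-1]
        outputs.append([
            sigmoid(bj + sum(wji * ui for wji, ui in zip(row, prev)))
            for row, bj in zip(W, b)
        ])
    return outputs

# Hypothetical 2-2-1 network: the hidden layer has zero weights, so both
# hidden outputs equal sigmoid(0); the output unit then sees 0.5 - 0.5 = 0.
layers = [
    ([[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0]),
    ([[1.0, -1.0]], [0.0]),
]
outs = forward(layers, [1.0, 2.0])
print(outs)
```

Keeping the outputs of every layer, rather than only the last one, is a design choice made here because gradient-based training needs the intermediate outputs.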
Figure 10. Error Paraboloid of the LMS learning (error surface plotted over the axes x and y)
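The search for the minimum of the error paraboloid of Figure 10 can be illustrated with a small numerical sketch. The code below assumes a linear neuron with two weights and uses the on-line (pattern-by-pattern) Widrow-Hoff form of the LMS update, which is one common way of following the negative gradient with small steps; the data set, learning rate and number of passes are hypothetical values chosen for the example.

```python
def lms_train(samples, eta=0.1, epochs=200):
    """On-line LMS for a linear neuron y = w0 + w1 * x.

    Each update moves the weights a small step against the gradient of the
    squared error of the current pattern.
    """
    w0, w1 = 0.0, 0.0
    for _ in range(epochs):
        for x, d in samples:
            y = w0 + w1 * x          # present output
            err = d - y              # error for this pattern
            w0 += eta * err          # threshold weight (constant input 1)
            w1 += eta * err * x
    return w0, w1

def mse(samples, w0, w1):
    """Mean square error over the whole training set."""
    return sum((d - (w0 + w1 * x)) ** 2 for x, d in samples) / len(samples)

# Hypothetical noise-free training set generated by d = 2x + 1.
samples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

w0, w1 = lms_train(samples)
print(w0, w1, mse(samples, w0, w1))
```

With a much larger learning rate the same loop would overshoot and diverge, which is the behaviour the text warns about when the weight updates are not small.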
Usually, in a Multilayer Perceptron, the nodes in each layer are fully interconnected with the neurons in the adjacent layer. This pattern is repeated layer by layer throughout the network. 6.1.1
Learning Algorithm (“Backpropagation”)
Before detailing the learning algorithm, let us introduce the following nomenclature:
$u_{l,j}$: output of the j-th node in layer l.
$w_{l,ji}$: weight that connects the i-th node in layer l − 1 to the j-th node in layer l.
$\mathbf{x}_p$: p-th training pattern.
$u_{0,i}$: i-th component of the input vector.
$d_j(\mathbf{x}_p)$: desired output of the j-th node in the output layer when the p-th pattern is presented at the network input.
$N_l$: number of nodes in layer l.
L: number of layers.
P: number of training patterns.
Obviously, in a Perceptron-like structure, the outputs depend on the synaptic weights that connect neurons in the different layers. These weights are updated in the following way:
1. Associating a set of input patterns with a set of desired outputs. In a pattern classification problem, this amounts to a prior classification of the patterns by the designer (supervised training).
2. Presenting all training patterns to the network. The network then processes all patterns and presents an output. The classification offered by the network can be erroneous, and the error is easily quantified.
3. Defining an objective function. For example, the Mean Square Error (MSE) between the desired and actual outputs of the units in the output layer [Hush and Horne, 1993]:
$J_p(\mathbf{w}) = \frac{1}{2} \sum_{q=1}^{N_L} \left[ u_{L,q}(\mathbf{x}_p) - d_q(\mathbf{x}_p) \right]^2$
This objective function represents an error surface in the parametric hyperspace of the weights. Training or learning then consists in searching for the minimum of that surface through a gradient descent algorithm, which moves in the direction opposite to the surface gradient towards a set of weights that minimizes the error. Each weight is adapted in each iteration step by an amount proportional to the partial derivative of the objective function with respect to that weight:
(9) $w_{l,ji}(k+1) = w_{l,ji}(k) - \eta \frac{\partial J_p(\mathbf{w})}{\partial w_{l,ji}}$
In Equation (9), the constant $\eta$ is the learning rate. The speed of convergence of the algorithm depends on $\eta$, because the weight modification in each iteration step is proportional to the gradient in the weight direction, scaled by the constant value of the learning rate. At this point, the training algorithm can be designed if we know how to calculate the partial derivative with respect to each weight of the network. This derivative can be easily calculated using the chain rule:
$\frac{\partial J_p(\mathbf{w})}{\partial w_{l,ji}} = \frac{\partial J_p(\mathbf{w})}{\partial u_{l,j}} \frac{\partial u_{l,j}}{\partial w_{l,ji}}$
that is,
$\frac{\partial J_p(\mathbf{w})}{\partial w_{l,ji}} = \frac{\partial J_p(\mathbf{w})}{\partial u_{l,j}} \, f'\!\left( \sum_{m=0}^{N_{l-1}-1} w_{l,jm} u_{l-1,m} \right) u_{l-1,i}$
where $f(\cdot)$ represents the sigmoidal function previously defined. This function has a very simple derivative:
$f'(x) = \frac{d f(x)}{dx} = f(x)\,(1 - f(x))$
when the parameter $\beta$ is of unit value. In this expression we can observe that the sensitivity of the objective function to each weight depends on the sensitivity of this function to the output of the neuron that is fed by that synaptic weight.
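The chain-rule computation can be checked numerically on a very small network. The sketch below is illustrative and rests on several assumptions that are not in the original text: a hypothetical 3-2-1 network, a logistic sigmoid with unit parameter, no separate threshold terms (the first input component is fixed to 1 and plays that role for the hidden layer), and arbitrary weight values. It compares the analytic derivatives with central finite differences and then applies one update step of the form of Equation (9).

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(W1, W2, x):
    """Outputs of the hidden layer (h) and of the output layer (y)."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    y = [sigmoid(sum(w * hi for w, hi in zip(row, h))) for row in W2]
    return h, y

def cost(W1, W2, x, d):
    """Objective J_p = 1/2 * sum_q (output_q - d_q)^2 for one pattern."""
    _, y = forward(W1, W2, x)
    return 0.5 * sum((yq - dq) ** 2 for yq, dq in zip(y, d))

def gradients(W1, W2, x, d):
    """Partial derivatives of J_p with respect to every weight (chain rule)."""
    h, y = forward(W1, W2, x)
    # Output layer: dJ/du times f'(net), using f' = f * (1 - f).
    delta2 = [(yq - dq) * yq * (1.0 - yq) for yq, dq in zip(y, d)]
    g2 = [[dq * hi for hi in h] for dq in delta2]
    # Hidden layer: output-layer terms propagated back through W2.
    delta1 = [
        hj * (1.0 - hj) * sum(delta2[q] * W2[q][j] for q in range(len(W2)))
        for j, hj in enumerate(h)
    ]
    g1 = [[dj * xi for xi in x] for dj in delta1]
    return g1, g2

def numeric_gradient(W, j, i, W1, W2, x, d, eps=1e-6):
    """Central finite difference for W[j][i]; W must be W1 or W2 itself."""
    old = W[j][i]
    W[j][i] = old + eps
    plus = cost(W1, W2, x, d)
    W[j][i] = old - eps
    minus = cost(W1, W2, x, d)
    W[j][i] = old
    return (plus - minus) / (2.0 * eps)

# Hypothetical pattern and weights.
x = [1.0, 0.5, -1.0]
d = [1.0]
W1 = [[0.1, -0.2, 0.3], [0.4, 0.1, -0.3]]
W2 = [[0.2, -0.4]]

g1, g2 = gradients(W1, W2, x, d)
diffs = [abs(g1[j][i] - numeric_gradient(W1, j, i, W1, W2, x, d))
         for j in range(2) for i in range(3)]
diffs += [abs(g2[0][i] - numeric_gradient(W2, 0, i, W1, W2, x, d))
          for i in range(2)]
max_diff = max(diffs)

# One gradient-descent step with a small learning rate.
eta = 0.1
before = cost(W1, W2, x, d)
W1 = [[w - eta * g for w, g in zip(rw, rg)] for rw, rg in zip(W1, g1)]
W2 = [[w - eta * g for w, g in zip(rw, rg)] for rw, rg in zip(W2, g2)]
after = cost(W1, W2, x, d)
print(max_diff, before, after)
```

A finite-difference comparison of this kind is a common sanity check when implementing gradient formulas by hand; it is used here only to make the derivation concrete.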
This last sensitivity can in turn be calculated from the sensitivities of the objective function with respect to the node outputs of the following layer, and so on [Hush and Horne, 1993]. This process is repeated until the output layer is reached. The sensitivity of the objective function to each node output can therefore be calculated recursively, starting from the output layer. The sensitivities to the outputs of nodes in hidden layers are also called “errors”, although, strictly speaking, they do not represent a real error. In order to calculate the error in the hidden layers, the error in the output layer must be computed and backpropagated to the previous layers. That is what the Backpropagation algorithm does. In this algorithm, training is usually started with small random values of the synaptic weights, in order to give the backpropagation algorithm a safe starting point. Once the structure of the network is chosen, the key parameter to be controlled is the learning rate. Too small a value slows the learning process. Too high a value accelerates learning, but can cause the algorithm to miss the minimum of the error surface. The optimal value of this parameter has to be found empirically. Once learning has started, it must continue until a minimum error is found, or until the weight values no longer change. At that point, the network is said to have finished learning. It is not always practical to wait until this point, and several other stopping criteria are adopted, among them:
1. When the gradient of the error surface is sufficiently small, the gradient learning algorithm has found a set of weight values in a local minimum of the error surface.
2. When the error between the actual network output and the desired one is below a value that is tolerable for the application. Obviously, this case requires knowledge of the maximum tolerable error for the given application.
3. In pattern classification problems, when all the learning patterns have been correctly classified, the training procedure can be stopped.
4. Training can be stopped after a fixed number of iterations.
5. Finally, a more appropriate and developed procedure is to train the network with one set of patterns and monitor the error over a different set, called the test set. The training phase is stopped when a minimum error on the test set is found. This last method prevents the overspecialization of the network on the training set, a phenomenon that occurs when the error on the training set is lower than the error over another set of patterns of the same application, showing that the network has lost generalization capability. The method requires twice as many patterns, which can be expensive or even impossible. Therefore, in order to apply neural networks efficiently to real problems, it is very important to have a sufficient number of patterns. 6.2
Acceleration of the training procedure
The training procedure described in the previous section presents two main problems: on the one hand, the convergence or training phase is very slow and, on the other hand, it is not easy to select the appropriate learning rate precisely. A simple solution
to accelerate the network training is the use of second order methods, which exploit the information contained in the matrix of second derivatives (Hessian). These methods reduce the number of iterations needed in the training phase to reach a local or global minimum of the error surface. Nevertheless, they require more computation per iteration, and this increases the training time. For this reason, only the diagonal of the Hessian is usually used. Another solution is to increase the effective gradient value by adding a term that is a fraction of the past changes in the weights. This term, usually known as the momentum term, is weighted by a new constant value, usually designated by $\alpha$:
$w_{kj}(k+1) = w_{kj}(k) - \eta \frac{\partial J(\mathbf{w})}{\partial w_{kj}} + \alpha \, \Delta w_{kj}(k)$
This term tends to smooth the changes in the weights, which increases the learning speed by avoiding divergent learning fluctuations. It has been shown that adding noise to the training patterns decreases the training time and helps to avoid local minima in the learning process. Another way to decrease the training time consists in the use of alternative transfer functions in the network nodes. When a function is allowed to take positive and negative values in a symmetric dynamic range, it is probable that several activations will be close to zero and their corresponding weights will not need to be updated. An example of this type of activation function is the hyperbolic one. In Table 1, typical parameters of this kind of network and their influence on the processing are summarized. 6.3
On-Line and Off-Line training
Table 1. Design Properties of NNs
Transfer Function | Derivative of Transfer Function | Learning rate | Effects on the NN | Momentum
Sign | $f'(x) =$ | $\eta = 1$ | Learning not guaranteed | With a small value, the weight increment vectors take very divergent directions
Exponential ($\beta = 1$) | $f'(x) = f(x)(1 - f(x))$ | $\eta = 0.1$ | Quick but not precise convergence | With a big value, the weight increment vectors take similar directions, helping the convergence of the training
Hyperbolic ($\beta = 1$) | $f'(x) = 1 - f^2(x)$ | $\eta = 0.01$ | Precise and slow convergence |
During the training, the weight update can be carried out in two different ways [Bourland and Morgan, 1994]:
• Off-line or “Block training”: in this case, the modifications of the weights over the whole training set are accumulated. The weights are modified only when all the training patterns have been presented to the network.
• On-line training: the network weights are modified each time a new training pattern is presented to the network. It can be proved that this method leads to the same result as off-line training [Widrow and Stearns, 1985]. In practice, this method shows some advantages that make it much more attractive: it converges much more quickly to a minimum of the error surface and usually avoids local minima. A possible explanation is that on-line training introduces some “noise” over the set of training patterns. 6.4
Selection of the Network size
The selection of the appropriate network size is a task of the utmost importance. If the network is too small, it will not be able to achieve an efficient solution for the problem it represents. If it is too big, the network may be able to represent many solutions that fit the training patterns, none of which is optimal for the application problem. If there is no preliminary experience, choosing the network size is a trial and error problem. One option is to start with a small network and increase its size progressively until an efficient dimension is found. The other option is to start with a big network and reduce its size progressively, removing the nodes or weights that have no significant effect on the overall output of the NN. Several studies have established size limits that should not be exceeded. In this sense, one proposal is that the number of nodes in the hidden layer should not exceed the number of training patterns. In practice, this condition is always met, as the number of nodes will always be much lower than the number of training patterns. In fact, big networks may memorize the whole training set, losing generalization capability. 7.
KOHONEN NETWORKS
A main principle of biological brain organization is that neurons group in such a way that those that are physically close collaborate in processing the same stimulus. That is the way nerve connections are organized. For example, at each level of the auditory pathway, nerve cells and fibers are arranged according to the frequency that produces the highest output in each neuron [Lipmann, 1987]. Therefore, the physical arrangement of the neurons in the brain structure is somehow related to the function they perform. Kohonen [Kohonen, 1984] proposed an algorithm to adjust the weights of a network whose input is a vector of N components and whose output is another vector of a different dimension M ($M < N$). In this way, the dimension of the input subspace is reduced, physically grouping the data. Vectors defined over a continuous variable are used as input to the network. The network is trained without supervision, in such a way that the network itself establishes the criteria for grouping the input data, extracting regularities and correlations. When a sufficient number of input vectors has been presented, the weights are self-organized in
a way that, topologically speaking, close nodes are sensitive to similar inputs. Nodes that are physically far away remain completely inactive. Clusters are produced that have their topological equivalent in the network. For this reason, this kind of network is known as a Self-Organized Feature Map (SOFM). The algorithm that assigns values to the synaptic weights is based on the concepts of neighborhood and competitive learning. The distance between the input and the weights of each node is computed, and the closest node is established as the winner node. The weights are updated for this node and its neighbor nodes. The rest are not updated, which favors a concrete physical organization. This kind of network always has two layers: the input layer and the output layer. The dimension of the input vector establishes the number of nodes of the input layer: one node for each component of the input vector. The input neurons drive the input vectors to the output layer, controlled by the connection weights. In this type of network it is very important to establish a neighborhood and a distance measure in the network. In the example of Figure 11, the nodes are configured in a two-dimensional structure. The algorithm used to compute the output is designed in such a way that only one output neuron is activated when one input vector is applied to the network. The fired node corresponds to the classification category of the input vector. Similar input vectors activate the same output, while different vectors activate different neurons. Only the neuron with the minimum difference between the input vector and its weight vector is activated. When the training algorithm starts, the adjustment is done in a wide zone surrounding the fired node or winner node. As the training progresses, the neighbor area is progressively reduced.
Figure 11. Structure of the Kohonen Network (labels: Input layer, Outputs)
Through this little adjustment, the network follows any systematic change in the input vectors: the network self-organizes. Therefore, this algorithm behaves as a vector quantizer when the number of desired clusters can be specified a priori and a sufficient amount of data relative to the number of the desired clusters is
known. However, the results depend on the order of presentation of the input data, especially when the amount of input data is small. 7.1
Training
Training of the SOFM network can be summarized in the following steps:
1. Weights initialization: The network structure has N input nodes and M output nodes. Random values are assigned to each of the connection weights $w_{ij}$. An initial neighbor radius is fixed for the neighbor mask.
2. Presentation of a new input pattern: A new pattern is presented at the input, $X_p(t) = (x_1(t), x_2(t), \ldots, x_N(t))$.
3. Computation of the distance $d_j$ between the input and each one of the output nodes:
$d_j = \sum_{i=0}^{N-1} \left( x_i(t) - w_{ij}(t) \right)^2$
where $x_i(t)$ is the input to node i in iteration t, and $w_{ij}(t)$ is the weight from input i to output j in iteration t.
4. Selection of the winner output node: node $j^*$ is selected as the node with the minimum distance $d_j$.
5. Updating node $j^*$ and its neighbors: weights are updated for node $j^*$ and all its neighbors in the vicinity defined by $NE_{j^*}(t)$. The new weights are
$w_{ij}(t+1) = w_{ij}(t) + \eta(t) \left( x_i(t) - w_{ij}(t) \right)$ for $j \in NE_{j^*}(t)$, $0 \le i \le N - 1$.
The term $\eta(t)$ is a gain term ($0 < \eta(t) < 1$) that decreases with time.
6. Back to step 2.
A standard example introduced by Kohonen illustrates the capacity of self-organized networks to learn random distributions of the input vectors presented to the network. For example, if the input is a two-dimensional vector with uniformly distributed components and the output is designed as two-dimensional, then the network weights will organize in a reticular fashion, as shown in Figure 12.
Figure 12. Kohonen Map for the two-dimensional case
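The training steps above, applied to the two-dimensional uniform example, can be sketched as follows. This is a minimal illustration under stated assumptions rather than Kohonen's reference procedure: the grid size, the square neighborhood, the linear decay of the gain and radius, and all numeric values are hypothetical choices made for the example.

```python
import random

def winner(weights, x):
    """Index j* of the node minimizing d_j = sum_i (x_i - w_ij)^2."""
    def dist(w):
        return sum((xi - wi) ** 2 for xi, wi in zip(x, w))
    return min(range(len(weights)), key=lambda j: dist(weights[j]))

def train_sofm(rows, cols, n_inputs, data, epochs=10,
               eta0=0.5, radius0=2, seed=0):
    """Train a rows x cols map; returns one weight vector per output node."""
    rng = random.Random(seed)
    # Step 1: random initial weights.
    weights = [[rng.random() for _ in range(n_inputs)]
               for _ in range(rows * cols)]
    total = epochs * len(data)
    t = 0
    for _ in range(epochs):
        for x in data:                          # step 2: present a pattern
            frac = t / total
            eta = eta0 * (1.0 - frac) + 0.01    # gain decreases, stays in (0, 1)
            radius = int(round(radius0 * (1.0 - frac)))  # shrinking radius
            j_star = winner(weights, x)         # steps 3 and 4
            r0, c0 = divmod(j_star, cols)
            for j, w in enumerate(weights):     # step 5: winner and neighbors
                r, c = divmod(j, cols)
                if max(abs(r - r0), abs(c - c0)) <= radius:
                    for i in range(n_inputs):
                        w[i] += eta * (x[i] - w[i])
            t += 1
    return weights

# Hypothetical data: 200 points uniformly distributed in the unit square.
data_rng = random.Random(1)
data = [[data_rng.random(), data_rng.random()] for _ in range(200)]
weights = train_sofm(4, 4, 2, data)
print(winner(weights, [0.1, 0.1]), winner(weights, [0.9, 0.9]))
```

Because each update is a weighted average of the old weight and the input, the weight vectors stay inside the region covered by the data, which is why the trained map spreads over the unit square.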
8.
FUTURE PERSPECTIVES
Artificial neural networks are inspired by the biological operation of the human brain, which they attempt to emulate. This is the main link between biological and artificial neural networks. From this starting point, the two disciplines follow separate paths. The present understanding of brain mechanisms is so limited that the systems designer does not have sufficient data to emulate its operation. Therefore, the engineer has to go one step beyond biological knowledge, searching for and devising useful algorithms and structures that efficiently solve given problems. In the vast majority of cases, this search delivers a result that diverges completely from biological reality, and the similarities with the brain become metaphors. Despite this faint and usually nonexistent analogy between biology and artificial neural networks, the results of the latter frequently evoke comparisons with the former, because they are often reminiscent of the operation of the brain. Unfortunately, these comparisons are not benign and produce unrealistic expectations that lead to disappointment. Research based on false expectations can evaporate when illuminated by the light of reality, as happened in the sixties. This promising research field could be eclipsed again if we do not resist the temptation of comparing our results with those of the brain. It has been said that NNs can be applied to all the activities specific to the human brain. Currently, they are considered an alternative for all those tasks where conventional computation does not achieve satisfactory results. There has been speculation about a near future in which NNs will take their place alongside classical computation. However, this will only happen if researchers achieve sufficient knowledge for that development. Currently, the theoretical knowledge is not robust enough to justify such predictions.
REFERENCES
W.S. McCulloch and W. Pitts, A Logical Calculus of the Ideas Immanent in Nervous Activity, Bulletin of Mathematical Biophysics, 5:115–133, 1943.
W. Pitts and W.S. McCulloch, How We Know Universals, Bulletin of Mathematical Biophysics, 9:127–147, 1947.
D.O. Hebb, Organization of Behaviour, Science Editions, New York, 1961.
F. Rosenblatt, Principles of Neurodynamics, Science Editions, New York, 1962.
B. Widrow and M.E. Hoff, Adaptive Switching Circuits, In IRE WESCON Convention Record, pages 96–104, 1960.
M. Minsky and S. Papert, Perceptrons, MIT Press, Cambridge, MA, 1969.
G. Cybenko, Approximation by Superposition of a Sigmoidal Function, Mathematics of Control, Signals, and Systems, 2:303–314, 1989.
K. Hornik, M. Stinchcombe and H. White, Multilayer Feedforward Networks are Universal Approximators, Neural Networks, 2(5):359–366, 1989.
K. Hornik, M. Stinchcombe and H. White, Universal Approximation of an Unknown Mapping and Its Derivatives using Multilayer Feedforward Networks, Neural Networks, 3:551–560, 1990.
S. Haykin, Neural Networks. A Comprehensive Foundation, Macmillan College Publishing, Ontario, 1994.
P.J. Werbos, Beyond Regression: New Tools for Prediction and Analysis in the Behavioural Sciences, PhD thesis, Harvard University, Boston, 1974.
D.E. Rumelhart, G.E. Hinton and R.J. Williams, Learning Internal Representations by Error Propagation, In D.E. Rumelhart, J.L. McClelland and the PDP Research Group, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, volume 1: Foundations, pages 318–362, MIT Press, Cambridge, MA, 1986.
D.B. Parker, Learning Logic, Technical Report TR-47, Cambridge, MA: MIT Center for Research in Computational Economics and Management Science, 1985.
D.R. Hush and B.G. Horne, Progress in Supervised Neural Networks. What’s new since Lippman?, IEEE Signal Processing Magazine, 2:721–729, January, 1993.
N.B. Karayiannis and A.N. Venetsanopoulos, Artificial Neural Networks, Learning Algorithms, Performance Evaluation and Applications, Kluwer Academic Publishers, Boston, MA, 1993.
H.A. Bourland and N. Morgan, Connectionist Speech Recognition. A Hybrid Approach, Kluwer Academic Publishers, Boston, MA, 1994.
B. Widrow and S.D. Stearns, Adaptive Signal Processing, Prentice-Hall, Signal Processing Series, Englewood Cliffs, NJ, 1985.
R.P. Lipmann, An Introduction to Computing with Neural Nets, IEEE ASSP Magazine, 328–339, April, 1987.
T. Kohonen, Self-Organization and Associative Memory, Springer-Verlag, Berlin, 1984.
CHAPTER 3 ARTIFICIAL NEURAL NETWORKS
D. T. PHAM, M. S. PACKIANATHER, A. A. AFIFY Manufacturing Engineering Centre, Cardiff University, Cardiff CF24 3AA, United Kingdom
INTRODUCTION Artificial neural networks are computational models of the brain. There are many types of neural networks representing the brain’s structure and operation with varying degrees of sophistication. This chapter provides an introduction to the main types of networks and presents examples of each type. 1.
TYPES OF NEURAL NETWORKS
Neural networks generally consist of a number of interconnected processing elements (PEs) or neurons. How the inter-neuron connections are arranged and the nature of the connections determine the structure of a network. How the strengths of the connections are adjusted or trained to achieve a desired overall behaviour of the network is governed by its learning algorithm. Neural networks can be classified according to their structures and learning algorithms. 1.1
Structural Categorisation
In terms of their structures, neural networks can be divided into two types: feedforward networks and recurrent networks. Feedforward networks: In a feedforward network, the neurons are generally grouped into layers. Signals flow from the input layer through to the output layer via unidirectional connections, the neurons being connected from one layer to the next, but not within the same layer. Examples of feedforward networks include the multi-layer perceptron (MLP) [Rumelhart and McClelland, 1986], the radial basis function (RBF) network [Broomhead and Lowe, 1988; Moody and Darken, 1989], the learning vector quantization (LVQ) network [Kohonen, 1989], the cerebellar 67 D. Andina and D.T. Pham (eds.), Computational Intelligence, 67–92. © 2007 Springer.
model articulation control (CMAC) network [Albus, 1975a], the group-method of data handling (GMDH) network [Hecht-Nielsen, 1990] and some spiking neural networks [Maass, 1997]. Feedforward networks can most naturally perform static mappings between an input space and an output space: the output at a given instant is a function only of the input at that instant. Recurrent networks: In a recurrent network, the outputs of some neurons are fed back to the same neurons or to neurons in preceding layers. Thus, signals can flow in both forward and backward directions. Examples of recurrent networks include the Hopfield network [Hopfield, 1982], the Elman network [Elman, 1990] and the Jordan network [Jordan, 1986]. Recurrent networks have a dynamic memory: their outputs at a given instant reflect the current input as well as previous inputs and outputs.
1.2
Learning Algorithm Categorisation
Neural networks are trained by two main types of learning algorithms: supervised and unsupervised learning algorithms. In addition, there exists a third type, reinforcement learning, which can be regarded as a special form of supervised learning.

Supervised learning: A supervised learning algorithm adjusts the strengths or weights of the inter-neuron connections according to the difference between the desired and actual network outputs corresponding to a given input. Thus, supervised learning requires a teacher or supervisor to provide desired or target output signals. Examples of supervised learning algorithms include the delta rule [Widrow and Hoff, 1960], the generalised delta rule or backpropagation algorithm [Rumelhart and McClelland, 1986] and the LVQ algorithm [Kohonen, 1989].

Unsupervised learning: Unsupervised learning algorithms do not require the desired outputs to be known. During training, only input patterns are presented to the neural network, which automatically adapts the weights of its connections to cluster the input patterns into groups with similar features. Examples of unsupervised learning algorithms include the Kohonen [Kohonen, 1989] and Carpenter-Grossberg Adaptive Resonance Theory (ART) [Carpenter and Grossberg, 1988] competitive learning algorithms.

Reinforcement learning: As mentioned before, reinforcement learning is a special case of supervised learning. Instead of using a teacher to give target outputs, a reinforcement learning algorithm employs a critic only to evaluate the goodness of the neural network output corresponding to a given input. An example of a reinforcement learning algorithm is the genetic algorithm (GA) [Holland, 1975; Goldberg, 1989].
2.
NEURAL NETWORKS EXAMPLE
This section briefly describes the example neural networks and associated learning algorithms cited previously.
2.1
Multi-layer Perceptron (MLP)
MLPs are perhaps the best-known type of feedforward network. Figure 1a shows an MLP with three layers: an input layer, an output layer and an intermediate or hidden layer. Neurons in the input layer only act as buffers for distributing the input signals $x_i$ to neurons in the hidden layer. Each neuron j (Figure 1b) in the hidden layer sums up its input signals $x_i$ after weighting them with the strengths of the respective connections $w_{ji}$ from the input layer and computes its output $y_j$ as a function f of the sum, viz.

(1) $y_j = f\left(\sum_i w_{ji} x_i\right)$

f can be a simple threshold function or a sigmoidal, hyperbolic tangent or radial basis function (see Table 1). The output of neurons in the output layer is computed similarly.

Figure 1a. A multi-layer perceptron
Figure 1b. Details of a neuron

Table 1. Activation functions
Type of function        Function
Linear                  $f(s) = s$
Threshold               $f(s) = +1$ if $s > s_t$, $-1$ otherwise
Sigmoid                 $f(s) = 1/(1 + \exp(-s))$
Hyperbolic tangent      $f(s) = (1 - \exp(-2s))/(1 + \exp(-2s))$
Radial basis function   $f(s) = \exp(-s^2/\beta^2)$

The backpropagation (BP) algorithm, a gradient descent algorithm, is the most commonly adopted MLP training algorithm. It gives the change $\Delta w_{ji}$ in the weight of a connection between neurons i and j as follows:

(2) $\Delta w_{ji} = \eta\, \delta_j\, x_i$

where $\eta$ is a parameter called the learning rate and $\delta_j$ is a factor depending on whether neuron j is an output neuron or a hidden neuron. For output neurons,

(3) $\delta_j = \frac{\partial f}{\partial net_j}\left(y_j^{(t)} - y_j\right)$

and for hidden neurons,

(4) $\delta_j = \frac{\partial f}{\partial net_j} \sum_q w_{qj}\, \delta_q$

In Equation (3), $net_j$ is the total weighted sum of input signals to neuron j and $y_j^{(t)}$ is the target output for neuron j. As there are no target outputs for hidden neurons, in Equation (4), the difference between the target and actual output of a hidden neuron j is replaced by the weighted sum of the $\delta_q$ terms already obtained for neurons q connected to the output of j. Thus, iteratively, beginning with the output layer, the $\delta$ term is computed for neurons in all layers and weight updates determined for all connections.

The weight updating process can take place after the presentation of each training pattern (pattern-based training) or after the presentation of the whole set of training patterns (batch training). In either case, a training epoch is said to have been completed when all training patterns have been presented once to the MLP. For all but the most trivial problems, several epochs are required for the MLP to be properly trained. A commonly adopted method to speed up the training is to add a "momentum" term to Equation (2), which effectively lets the previous weight change influence the new weight change, viz.

(5) $\Delta w_{ji}(k+1) = \eta\, \delta_j\, x_i + \mu\, \Delta w_{ji}(k)$

where $\Delta w_{ji}(k+1)$ and $\Delta w_{ji}(k)$ are weight changes in epochs k + 1 and k respectively and $\mu$ is the "momentum" coefficient.
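The update rules of Equations (2) to (5) can be illustrated with a minimal pure-Python sketch of pattern-based training. It assumes sigmoid activations in both layers (so that the derivative of f is y(1 - y)), no bias terms, small random initial weights, and a momentum term applied per pattern presentation rather than per epoch. The class name `MLP` and its attributes are hypothetical and not taken from the chapter.

```python
import math
import random


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


class MLP:
    """Three-layer perceptron: input buffers, one hidden layer, one output layer."""

    def __init__(self, n_in, n_hidden, n_out, eta=0.5, mu=0.0, seed=0):
        rng = random.Random(seed)
        # w_hid[j][i]: weight from input i to hidden neuron j
        self.w_hid = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
        # w_out[q][j]: weight from hidden neuron j to output neuron q
        self.w_out = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_out)]
        # previous weight changes, needed for the momentum term of Eq. (5)
        self.dw_hid = [[0.0] * n_in for _ in range(n_hidden)]
        self.dw_out = [[0.0] * n_hidden for _ in range(n_out)]
        self.eta, self.mu = eta, mu

    def forward(self, x):
        # Eq. (1) applied to the hidden layer, then to the output layer
        self.x = list(x)
        self.h = [sigmoid(sum(w * xi for w, xi in zip(row, self.x))) for row in self.w_hid]
        self.y = [sigmoid(sum(w * hj for w, hj in zip(row, self.h))) for row in self.w_out]
        return self.y

    def train_pattern(self, x, target):
        """Present one pattern, update all weights, return the squared error."""
        y = self.forward(x)
        # Eq. (3): delta of the output neurons
        d_out = [yq * (1.0 - yq) * (t - yq) for yq, t in zip(y, target)]
        # Eq. (4): delta of the hidden neurons (uses the weights before updating)
        d_hid = [hj * (1.0 - hj) * sum(self.w_out[q][j] * d_out[q] for q in range(len(d_out)))
                 for j, hj in enumerate(self.h)]
        # Eq. (5): weight change with momentum, hidden-to-output connections
        for q, row in enumerate(self.w_out):
            for j in range(len(row)):
                dw = self.eta * d_out[q] * self.h[j] + self.mu * self.dw_out[q][j]
                row[j] += dw
                self.dw_out[q][j] = dw
        # Eq. (5): input-to-hidden connections
        for j, row in enumerate(self.w_hid):
            for i in range(len(row)):
                dw = self.eta * d_hid[j] * self.x[i] + self.mu * self.dw_hid[j][i]
                row[i] += dw
                self.dw_hid[j][i] = dw
        return sum((t - yq) ** 2 for yq, t in zip(y, target))
```

Calling `train_pattern` once for every pattern in the training set corresponds to one epoch of pattern-based training; batch training would instead accumulate the weight changes over all patterns before applying them.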
Another learning method suitable for training MLPs is the genetic algorithm (GA). This is an optimisation algorithm based on evolution principles. The weights of the connections are considered genes in a chromosome. The goodness or fitness of the chromosome is directly related to how well trained the MLP is. The algorithm starts with a randomly generated population of chromosomes and applies genetic operators to create new and fitter populations. The most common genetic operators are the selection, crossover and mutation operators. The selection operator chooses chromosomes from the current population for reproduction. Usually, a biased selection procedure is adopted which favours the fitter chromosomes. The crossover operator creates two new chromosomes from two existing chromosomes by cutting them at a random position and exchanging the parts following the cut. The mutation operator produces a new chromosome by randomly changing the genes of an existing chromosome. Together, these operators simulate a guided random search method which can eventually yield the optimum set of weights to minimise the differences between the actual and target outputs of the neural network. Further details of genetic algorithms can be found in the chapter on Soft Computing and its Applications in Engineering and Manufacture.
2.2
Radial Basis Function (RBF) Network
Large multi-layer perceptron (MLP) networks take a long time to train. This has led to the construction of alternative networks such as the Radial Basis Function (RBF) network [Cichocki and Unbehauen, 1993; Hassoun, 1995; Haykin, 1999]. After the MLP, the RBF network is the most widely used network. Figure 2 shows the structure of an RBF network, which consists of three layers. The input layer neurons receive the inputs $x_1, \ldots, x_M$. The hidden layer neurons provide a set of activation functions that constitute an arbitrary "basis" for the input patterns in the input space to be expanded into the hidden space by way of non-linear transformation. At the input of each hidden neuron, the distance between the centre of each activation or basis function and the input vector is calculated. Applying the basis function to this distance produces the output of the hidden neuron. The RBF network output y is formed by the neuron in the output layer as a weighted sum of the hidden layer neuron activations.

Figure 2. The RBF network
Figure 3. The Radial Basis Function (K(x) plotted against x, with peak value 1.0 at x = 0 and standard deviation σ = 1)
The basis function is generally chosen to be a standard function which is positive at its centre x = 0 and then decreases uniformly to zero on either side as shown in Figure 3. A common choice is the Gaussian distribution function:

(6) $K(x) = \exp\left(-\frac{x^2}{2}\right)$

This function can be shifted to an arbitrary centre, x = c, and stretched by varying its standard deviation $\sigma$ as follows:

(7) $K\left(\frac{x - c}{\sigma}\right) = \exp\left(-\frac{(x - c)^2}{2\sigma^2}\right)$

The output of the RBF network y is given by:

(8) $y = \sum_{i=1}^{N} w_i\, K\left(\frac{\|x - c_i\|}{\sigma_i}\right) \quad \forall x$

where $w_i$ is the weight of the hidden neuron i, $c_i$ the centre of basis function i and $\sigma_i$ the standard deviation of the function. $\|x - c_i\|$ is the norm of $(x - c_i)$. There are various ways to calculate the norm. The most common is the Euclidean norm given by:

(9) $\|x - c_i\| = \sqrt{(x_1 - c_{i1})^2 + (x_2 - c_{i2})^2 + \cdots + (x_M - c_{iM})^2}$

This norm gives the distance between the two points x and $c_i$ in M-dimensional space. All points x that are the same radial distance from $c_i$ give the same value for the norm and hence the same value for the basis function. Hence the basis functions are called Radial Basis Functions.

Obtaining the values for $w_i$, $c_i$ and $\sigma_i$ requires training the RBF network. Because the basis functions are differentiable, back-propagation could be used as with MLP networks. Training of a multiple-input single-output RBF network can proceed as follows:

(i) choose the number N of hidden units; There is no firm guidance available for this. The selection of N is normally made by trial and error. In general, the smallest N that gives the RBF network an acceptable performance is adopted.
(ii) choose the centres, $c_i$; Centre selection could be performed in three different ways [Haykin, 1999]:
a) Trial and error: Centres can be selected by trial and error. This is not always easy if little is known about the underlying functional behaviour of the data. Usually, the centres are spread evenly or randomly over the M-dimensional input space.
b) Self-organized selection: An adaptive unsupervised method can be used to learn where to place the centres.
c) Supervised selection: A supervised learning process, commonly error correction learning, can be deployed to fix the centres.

(iii) choose stretch constants, $\sigma_i$; Several heuristics are available. A popular way is to set $\sigma_i$ equal to the distance to the nearest neighbour. First the distances between centres are computed, then the nearest distance is chosen to be the value of $\sigma_i$.

(iv) calculate weights, $w_i$. When $c_i$ and $\sigma_i$ are known, the outputs of hidden units $(O_1, \ldots, O_N)^T$ can be calculated for any pattern of inputs $x = (x_1, \ldots, x_M)$. Assuming there are P input patterns x in the training set, there will be P sets of hidden unit outputs that can be calculated. These can be assembled in an N × P matrix:

(10) $O = \begin{bmatrix} O_1^{1} & O_1^{2} & \cdots & O_1^{P} \\ O_2^{1} & O_2^{2} & \cdots & O_2^{P} \\ \vdots & \vdots & & \vdots \\ O_N^{1} & O_N^{2} & \cdots & O_N^{P} \end{bmatrix}$

If the output $y_i$ of the RBF network corresponding to training input pattern $x_i$ is $y_i = O_1^{i} w_1 + O_2^{i} w_2 + \cdots + O_N^{i} w_N$, the following equation can be obtained:

(11) $y = \begin{bmatrix} y_1 \\ \vdots \\ y_P \end{bmatrix} = \begin{bmatrix} O_1^{1} & \cdots & O_N^{1} \\ \vdots & & \vdots \\ O_1^{P} & \cdots & O_N^{P} \end{bmatrix} \cdot \begin{bmatrix} w_1 \\ \vdots \\ w_N \end{bmatrix} = O^T \cdot w$

y is the vector of actual outputs corresponding to the training inputs x. Ideally, y should be equal to d, the desired/target outputs. Unknown coefficients $w_i$ can be chosen to minimise the sum-squared-error of y compared with d. It can be shown that this is achieved when:

(12) $w = (O\, O^T)^{-1}\, O\, d$
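Steps (iii) and (iv) can be sketched in pure Python as follows. The sketch assumes that the centres have already been chosen, uses the Gaussian of Equation (7) with the Euclidean norm of Equation (9), sets each stretch constant by the nearest-neighbour heuristic, and obtains the weights by solving the normal equations $(O\,O^T)\,w = O\,d$, which is equivalent to Equation (12) when $O\,O^T$ is invertible. All function names are hypothetical helpers introduced for illustration.

```python
import math


def dist(a, b):
    # Euclidean norm of (a - b), Eq. (9)
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def gaussian(r, sigma):
    # Eq. (7), with r the distance between the input and the centre
    return math.exp(-r * r / (2.0 * sigma * sigma))


def nearest_neighbour_sigmas(centres):
    # Step (iii): sigma_i = distance from centre i to its nearest other centre
    return [min(dist(c, o) for k, o in enumerate(centres) if k != i)
            for i, c in enumerate(centres)]


def hidden_outputs(x, centres, sigmas):
    # Outputs O_1 .. O_N of the hidden units for one input pattern
    return [gaussian(dist(x, c), s) for c, s in zip(centres, sigmas)]


def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (m[r][n] - sum(m[r][k] * z[k] for k in range(r + 1, n))) / m[r][r]
    return z


def train_weights(patterns, targets, centres, sigmas):
    # Step (iv): each entry of cols is one column of the N x P matrix O of Eq. (10)
    cols = [hidden_outputs(x, centres, sigmas) for x in patterns]
    n = len(centres)
    oot = [[sum(col[i] * col[j] for col in cols) for j in range(n)] for i in range(n)]
    od = [sum(col[i] * d for col, d in zip(cols, targets)) for i in range(n)]
    return solve(oot, od)  # w = (O O^T)^-1 O d, Eq. (12)


def predict(x, centres, sigmas, w):
    # Eq. (8): weighted sum of the hidden unit outputs
    return sum(wi * oi for wi, oi in zip(w, hidden_outputs(x, centres, sigmas)))
```

Solving the linear system instead of forming the inverse explicitly is a design choice of this sketch; it gives the same weights whenever the inverse in Equation (12) exists.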
2.3
Learning Vector Quantization (LVQ) Network
Figure 4 shows an LVQ network which comprises three layers of neurons: an input buffer layer, a hidden layer and an output layer. The network is fully connected between the input and hidden layers and partially connected between the hidden and output layers, with each output neuron linked to a different cluster of hidden neurons. The weights of the connections between the hidden and output neurons are fixed to 1. The weights of the input-hidden neuron connections form the components of reference vectors (one reference vector is assigned to each hidden neuron). They are modified during the training of the network. Both the hidden neurons (also known as Kohonen neurons) and the output neurons have binary outputs. When an input pattern is supplied to the network, the hidden neuron whose reference vector is closest to the input pattern is said to win the competition for being activated and thus allowed to produce a "1". All other hidden neurons are forced to produce a "0". The output neuron connected to the cluster of hidden neurons that contains the winning neuron also emits a "1" and all other output neurons a "0". The output neuron that produces a "1" gives the class of the input pattern, each output neuron being dedicated to a different class.

Figure 4. Learning Vector Quantization network

The simplest LVQ training procedure is as follows:
(i) initialise the weights of the reference vectors;
(ii) present a training input pattern to the network;
(iii) calculate the (Euclidean) distance between the input pattern and each reference vector;
(iv) update the weights of the reference vector that is closest to the input pattern, that is, the reference vector of the winning hidden neuron. If the latter belongs to the cluster connected to the output neuron in the class that the input pattern is known to belong to, the reference vector is brought closer to the input pattern. Otherwise, the reference vector is moved away from the input pattern;
(v) return to (ii) with a new training input pattern and repeat the procedure until all training patterns are correctly classified (or a stopping criterion is met).

For other LVQ training procedures, see for example [Pham and Oztemel, 1994].
2.4
CMAC Network
CMAC (Cerebellar Model Articulation Control) [Albus, 1975a, 1975b, 1979a, 1979b; An et al., 1994] can be considered a supervised feedforward neural network with the characteristics of a fuzzy associative memory. A basic CMAC module is shown in Figure 5. CMAC consists of a series of mappings:

(13) $S \xrightarrow{e} M \xrightarrow{f} A \xrightarrow{g} u$

where
S = {input vectors}
M = {intermediate variables}
A = {association cell vectors}
u = output of CMAC ≡ h(S)
h ≡ g·f·e

Figure 5. A basic CMAC module

(a) Input encoding (S → M mapping)
The S → M mapping is a set of submappings, one for each input variable:

(14) $S \to M = \begin{bmatrix} s_1 \to m_1 \\ s_2 \to m_2 \\ \vdots \\ s_n \to m_n \end{bmatrix}$
The range of $s_1$ is coarsely discretised using the quantising functions $q_1, q_2, \ldots, q_k$. Each function divides the range into k intervals. The intervals produced by function $q_{j+1}$ are offset by one kth of the range compared to their counterparts produced by function $q_j$. $m_i$ is a set of k intervals generated by $q_1$ to $q_k$ respectively.

An example is given in Figure 6 to illustrate the internal mappings within a CMAC module. The S → M mapping is shown in the leftmost part of the figure. In Figure 6, two input variables $s_1$ and $s_2$ are represented with unity resolution in the range of 0 to 8. The range of each input variable is described using three quantising functions. For example, the range of $s_1$ is described by functions $q_1$, $q_2$ and $q_3$. $q_1$ divides the range into intervals A, B, C and D. $q_2$ gives intervals E, F, G and H and $q_3$ provides intervals I, J, K and L. That is,

$q_1 = \{A, B, C, D\}$
$q_2 = \{E, F, G, H\}$
$q_3 = \{I, J, K, L\}$

For every value of $s_1$, there exists a set of elements, $m_1$, which are the intersection of the functions $q_1$ to $q_3$, such that the value of $s_1$ uniquely defines set $m_1$ and vice versa. For example, value $s_1 = 5$ maps to set $m_1 = \{B, G, K\}$ and vice versa. Similarly, value $s_2 = 4$ maps to set $m_2 = \{b, g, j\}$ and vice versa.

The S → M mapping gives CMAC two advantages. The first is that a single precise variable $s_i$ can be transmitted over several imprecise information channels. Each channel carries only a small part of the information of $s_i$. This increases the reliability of the information transmission. The other advantage is that small changes in the value of $s_i$ have no influence on most of the elements in $m_i$. This leads to the property of input generalisation, which is important in an environment where random noise exists.
Figure 6. Internal mappings within a CMAC module (inputs s1 and s2 on axes from 0 to 8; s1 intervals A to D, E to H and I to L; s2 intervals a to d, e to h and i to l; the addressed weights X1, X2 and X3 are summed to give u)
(b) Address computing (M → A mapping)
A is a set of address vectors associated with weight tables. A is obtained by combining the elements of $m_i$. For example, in Figure 6, the sets $m_1 = \{B, G, K\}$ and $m_2 = \{b, g, j\}$ are combined to give the set of elements $A = \{a_1, a_2, a_3\} = \{Bb, Gg, Kj\}$.

(c) Output computing (A → u mapping)
This mapping involves looking up the weight tables and adding the contents of the addressed locations to yield the output of the network. The following formula is employed:

(15) $u = \sum_i w_i(a_i)$

That is, only the weights associated with the addresses $a_i$ in A are summed. For this given example, these weights are:

$w(Bb) = x_1$, $w(Gg) = x_2$, $w(Kj) = x_3$

Thus the output is:

(16) $u = x_1 + x_2 + x_3$
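The chain of mappings S → M → A → u can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reproduction of Figure 6: the interval boundaries of the figure cannot be recovered from the text, so the sketch assumes integer inputs, intervals of width k, and quantising function j shifted by j resolution steps. Intervals are identified by integer indices rather than letters, and the weight table is a plain dictionary. The names `encode`, `addresses` and `output` are hypothetical.

```python
K = 3  # number of quantising functions per input variable (k in the text)


def encode(s, k=K):
    """S -> M: indices of the k intervals (one per quantising function) containing s."""
    return [(s + j) // k for j in range(k)]


def addresses(inputs, k=K):
    """M -> A: combine the intervals of all input variables, one address per function."""
    m = [encode(s, k) for s in inputs]
    return [(j,) + tuple(mi[j] for mi in m) for j in range(k)]


def output(inputs, weights, k=K):
    """A -> u: sum the weights stored at the addressed locations, as in Eq. (15)."""
    return sum(weights.get(a, 0.0) for a in addresses(inputs, k))


# Usage: the point (s1, s2) = (5, 4) addresses exactly three weights.
weights = {a: 0.5 for a in addresses((5, 4))}
u = output((5, 4), weights)
# Neighbouring inputs share most of their addresses (input generalisation).
shared = set(addresses((5, 4))) & set(addresses((6, 4)))
```

Under these assumptions, moving s1 from 5 to 6 changes only one of the three addresses, which is the generalisation property described for the S → M mapping.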
Training a CMAC module consists of adjusting the stored weights. Assuming that f is the function that CMAC has to learn, the following training steps could be adopted:
(i) select a point S in the input space and obtain the current output u corresponding to S;
(ii) let $\bar{u}$ be the desired output of CMAC, that is, $\bar{u} = f(S)$;
(iii) if $|\bar{u} - u| \le \varepsilon$, where $\varepsilon$ is an acceptable error, then do nothing; the desired value is already stored in CMAC. However, if $|\bar{u} - u| > \varepsilon$, then add to every weight which contributed to u the quantity

(17) $\Delta = \eta\, \frac{\bar{u} - u}{|A|}$

where $|A|$ = the number of weights which contributed to u and $\eta$ is the learning rate.
2.5
Group Method of Data Handling (GMDH) Network
Figure 7 shows a GMDH network and the details of one of its neurons. Unlike the feedforward neural networks previously described, which have a fixed structure, a GMDH network has a structure which grows during training. Each neuron in a GMDH network usually has two inputs $x_1$ and $x_2$ and produces an output y that is a quadratic combination of these inputs, viz.

(18) $y = w_0 + w_1 x_1 + w_2 x_1^2 + w_3 x_1 x_2 + w_4 x_2^2 + w_5 x_2$

Figure 7a. A trained GMDH network. Note: Each GMDH neuron is an N-Adaline, which is an Adaptive Linear Element with a nonlinear preprocessor
Figure 7b. Details of a GMDH neuron
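As a small illustration of Equation (18), the output of a single GMDH neuron can be computed as below. The function names are hypothetical; the weights are ordered $w_0$ to $w_5$ as in the equation.

```python
def gmdh_features(x1, x2):
    """Terms of Eq. (18), in the order that matches the weights w0..w5."""
    return [1.0, x1, x1 * x1, x1 * x2, x2 * x2, x2]


def gmdh_neuron(weights, x1, x2):
    """Quadratic combination of the two inputs, Eq. (18)."""
    return sum(w * t for w, t in zip(weights, gmdh_features(x1, x2)))
```

Because the output is linear in the weights once the quadratic terms have been formed, each neuron can be trained with a linear learning rule applied to those terms.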
Training a GMDH network consists of configuring the network starting with the input layer, adjusting the weights of each neuron, and increasing the number of layers until the accuracy of the mapping achieved with the network deteriorates.
The number of neurons in the first layer depends on the number of external inputs available. For each pair of external inputs, one neuron is used. Training proceeds with presenting an input pattern to the input layer and adapting the weights of each neuron according to a suitable learning algorithm, such as the delta rule (see for example [Pham and Liu, 1994]), viz.

(19) $W_{k+1} = W_k + \alpha\, \frac{X_k}{\|X_k\|^2} \left(y_k^d - W_k^T X_k\right)$

where $\alpha$ is a learning-rate constant and $W_k$, the weight vector of a neuron at time k, and $X_k$, the modified input vector to the neuron at time k, are defined as

(20) $W_k = [w_0\; w_1\; w_2\; w_3\; w_4\; w_5]^T$
(21) $X_k = [1\; x_1\; x_1^2\; x_1 x_2\; x_2^2\; x_2]^T$

and $y_k^d$ is the desired network output at time k. Note that, for this description, it is assumed that the GMDH network only has one output. Equation (19) shows that the desired network output is presented to each neuron in the input layer and an attempt is made to train each neuron to produce that output. When the sum of the mean square errors SE over all the desired outputs in the training data set for a given neuron reaches the minimum for that neuron, the weights of the neuron are frozen and its training halted. When the training has ended for all neurons in a layer, the training for the layer stops. Neurons that produce SE values below a given threshold when another set of data (known as the selection data set) is presented to the network are selected to grow the next layer. At each stage, the smallest SE value achieved for the selection data set is recorded. If the smallest SE value for the current layer is less than that for the previous layer (that is, the accuracy of the network is improving), a new layer is generated, the size of which depends on the number of neurons just selected. The training and selection processes are repeated until the SE value deteriorates. The best neuron in the immediately preceding layer is then taken as the output neuron for the network.
2.6
Hopfield Network
Figure 8 shows one version of a Hopfield network. This network normally accepts binary and bipolar inputs (+1 or −1). It has a single "layer" of neurons, each connected to all the others, giving it a recurrent structure, as mentioned earlier. The training of a Hopfield network takes only one step, the weights $w_{ij}$ of the network being assigned directly as follows:

(22) $w_{ij} = \begin{cases} \frac{1}{N} \sum_{c=1}^{P} x_i^c x_j^c & i \ne j \\ 0 & i = j \end{cases}$

where $w_{ij}$ is the connection weight from neuron i to neuron j, $x_i^c$ (which is either +1 or −1) is the ith component of the training input pattern for class c, P the number of classes and N the number of neurons (or the number of components in the input pattern). Note from Equation (22) that $w_{ij} = w_{ji}$ and $w_{ii} = 0$, a set of conditions that guarantee the stability of the network.

Figure 8. A Hopfield network

When an unknown pattern is input to the network, its outputs are initially set equal to the components of the unknown pattern, viz.

(23) $y_i(0) = x_i, \quad 1 \le i \le N$

Starting with these initial values, the network iterates according to the following equation until it reaches a minimum energy state, i.e. its outputs stabilise to constant values:

(24) $y_i(k+1) = f\left(\sum_{j=1}^{N} w_{ij}\, y_j(k)\right), \quad 1 \le i \le N$

where f is a hard limiting function defined as

(25) $f(x) = \begin{cases} -1 & x < 0 \\ 1 & x > 0 \end{cases}$
2.7
Elman and Jordan Nets
Figures 9a and 9b show an Elman net and a Jordan net, respectively. These networks have a multi-layered structure similar to the structure of MLPs. In both nets, in addition to an ordinary hidden layer, there is another special hidden layer sometimes called the context or state layer. This layer receives feedback signals from the ordinary hidden layer (in the case of an Elman net) or from the output layer (in the case of a Jordan net). The Jordan net also has connections from each neuron in the context layer back to itself. With both nets, the outputs of neurons in the context layer are fed forward to the hidden layer. If only the forward connections are to be adapted and the feedback connections are preset to constant values, these networks can be considered ordinary feedforward networks and the BP algorithm used to train them. Otherwise, a GA could be employed [Pham and Karaboga, 1993b; Karaboga, 1994]. For improved versions of the Elman and Jordan nets, see [Pham and Liu, 1992; Pham and Oh, 1992].

Figure 9a. An Elman network
Figure 9b. A Jordan network
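The role of the context layer in an Elman net can be illustrated with a minimal forward-pass sketch. It assumes sigmoid neurons, feedback connections preset to a constant weight of 1 (so the context layer simply stores the previous hidden outputs), and context units initialised to zero. The class name `ElmanNet` and the weight layout are hypothetical.

```python
import math


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


class ElmanNet:
    def __init__(self, w_in, w_ctx, w_out):
        self.w_in = w_in      # w_in[j][i]: input i -> hidden neuron j
        self.w_ctx = w_ctx    # w_ctx[j][c]: context unit c -> hidden neuron j
        self.w_out = w_out    # w_out[q][j]: hidden neuron j -> output neuron q
        self.reset()

    def reset(self):
        # Context units start at zero (an assumption of this sketch)
        self.context = [0.0] * len(self.w_in)

    def step(self, x):
        """Process one input vector and update the context layer."""
        hidden = [
            sigmoid(sum(w * xi for w, xi in zip(wi, x))
                    + sum(w * c for w, c in zip(wc, self.context)))
            for wi, wc in zip(self.w_in, self.w_ctx)
        ]
        # Feedback from the hidden layer to the context layer with fixed weight 1
        self.context = hidden[:]
        return [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in self.w_out]
```

Presenting the same input twice in succession generally gives two different outputs, because the second presentation also sees the stored hidden state; this is the dynamic memory that distinguishes recurrent from feedforward networks. A Jordan net would instead copy the network outputs into the context layer and add a self-feedback term to each context unit.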
2.8
Kohonen Network
A Kohonen network or self-organising feature map has two layers, an input buffer layer to receive the input pattern and an output layer (see Figure 10). Neurons in the output layer are usually arranged into a regular two-dimensional array. Each output neuron is connected to all input neurons. The weights of the connections form the components of the reference vector associated with the given output neuron.

Training a Kohonen network involves the following steps:
(i) initialise the reference vectors of all output neurons to small random values;
(ii) present a training input pattern;
(iii) determine the winning output neuron, i.e. the neuron whose reference vector is closest to the input pattern. The Euclidean distance between a reference vector and the input vector is usually adopted as the distance measure;
(iv) update the reference vector of the winning neuron and those of its neighbours. These reference vectors are brought closer to the input vector. The adjustment is greatest for the reference vector of the winning neuron and decreased for reference vectors of neurons further away. The size of the neighbourhood of a neuron is reduced as training proceeds until, towards the end of training, only the reference vector of a winning neuron is adjusted.

In a well-trained Kohonen network, output neurons that are close to one another have similar reference vectors. After training, a labelling procedure is adopted where input patterns of known classes are fed to the network and class labels are assigned to output neurons that are activated by those input patterns. As with the LVQ network, an output neuron is activated by an input pattern if it wins the competition against other output neurons, that is, if its reference vector is closest to the input pattern.
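Training steps (ii) to (iv) can be sketched as follows. The sketch stores the reference vectors in a dictionary keyed by the (row, column) position of each output neuron. The particular neighbourhood rule (a gain of rate/(1 + d) for neurons at grid distance d within the current radius) and the linear shrinking of the radius are assumptions chosen for illustration; the text does not prescribe them. All names are hypothetical.

```python
def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def winner(refs, x):
    """Step (iii): grid position of the reference vector closest to x."""
    return min(refs, key=lambda pos: squared_distance(refs[pos], x))


def train_step(refs, x, rate, radius):
    """Step (iv): move the winner and its grid neighbours towards x."""
    win = winner(refs, x)
    for pos, ref in refs.items():
        d = max(abs(pos[0] - win[0]), abs(pos[1] - win[1]))  # distance on the grid
        if d <= radius:
            gain = rate / (1 + d)  # assumed rule: largest for the winner itself
            refs[pos] = [r + gain * (xi - r) for r, xi in zip(ref, x)]
    return win


def train(refs, patterns, epochs=20, rate=0.5, radius=1):
    """Repeat steps (ii)-(iv), shrinking the neighbourhood as training proceeds."""
    for epoch in range(epochs):
        current_radius = round(radius * (1 - epoch / epochs))
        for x in patterns:
            train_step(refs, x, rate, current_radius)
```

After training, the labelling procedure described above can reuse `winner` to find which output neuron each labelled pattern activates.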
Figure 10. A Kohonen network
2.9
ART Networks
There are different versions of the ART network. Figure 11 shows the ART-1 version for dealing with binary inputs. Later versions, such as ART-2, can also handle continuous-valued inputs.

ART-1
As illustrated in Figure 11, an ART-1 network has two layers, an input layer and an output layer. The two layers are fully interconnected; the connections are in both the forward (or bottom-up) direction and the feedback (or top-down) direction. The vector $W_i$ of weights of the bottom-up connections to an output neuron i forms an exemplar of the class it represents. All the $W_i$ vectors constitute the long-term memory of the network. They are employed to select the winning neuron, the latter again being the neuron whose $W_i$ vector is most similar to the current input pattern. The vector $V_i$ of the weights of the top-down connections from an output neuron i is used for vigilance testing, that is, determining whether an input pattern is sufficiently close to a stored exemplar. The vigilance vectors $V_i$ form the short-term memory of the network. $V_i$ and $W_i$ are related in that $W_i$ is a normalised copy of $V_i$, viz.

(26) $W_i = \frac{V_i}{\varepsilon + \sum_j V_{ji}}$

where $\varepsilon$ is a small constant and $V_{ji}$ is the jth component of $V_i$ (i.e. the weight of the connection from output neuron i to input neuron j).

Figure 11. An ART-1 network
Training an ART-1 network occurs continuously when the network is in use and involves the following steps:
(i) initialise the exemplar and vigilance vectors $W_i$ and $V_i$ for all output neurons, setting all the components of each $V_i$ to 1 and computing $W_i$ according to Equation (26). An output neuron with all its vigilance weights set to 1 is known as an uncommitted neuron in the sense that it is not assigned to represent any pattern classes;
(ii) present a new input pattern x;
(iii) enable all output neurons so that they can participate in the competition for activation;
(iv) find the winning output neuron among the competing neurons, i.e. the neuron for which $x \cdot W_i$ is largest; a winning neuron can be an uncommitted neuron as is the case at the beginning of training or if there are no better output neurons;
(v) test whether the input pattern x is sufficiently similar to the vigilance vector $V_i$ of the winning neuron. Similarity is measured by the fraction r of bits in x that are also in $V_i$, viz.

(27) $r = \frac{x \cdot V_i}{\sum_j x_j}$

x is deemed to be sufficiently similar to $V_i$ if r is at least equal to the vigilance threshold $\rho$ ($0 < \rho \le 1$);
(vi) go to step (vii) if $r \ge \rho$ (i.e. there is resonance); else disable the winning neuron temporarily from further competition and go to step (iv), repeating this procedure until there are no further enabled neurons;
(vii) adjust the vigilance vector $V_i$ of the most recent winning neuron by logically ANDing it with x, thus deleting bits in $V_i$ that are not also in x; compute the bottom-up exemplar vector $W_i$ using the new $V_i$ according to Equation (26); activate the winning output neuron;
(viii) go to step (ii).

The above training procedure ensures that if the same sequence of training patterns is repeatedly presented to the network, its long-term and short-term memories are unchanged (i.e. the network is stable). Also, provided there are sufficient output neurons to represent all the different classes, new patterns can always be learnt, as a new pattern can be assigned to an uncommitted output neuron if it does not match previously stored exemplars well (i.e. the network is plastic).

ART-2
The architecture of an ART-2 network [Carpenter and Grossberg, 1987; Pham and Chan, 1998; 2001] is depicted in Figure 12. In this particular configuration, the "feature representation" field F1 consists of 4 loops. An input pattern will be circulated in the lower two loops first. Inherent noise in the input pattern will be suppressed (this is controlled by the parameters a and b and the feedback function $f(\cdot)$) and prominent features in it will be accentuated. Then the enhanced input pattern will be passed to the upper two F1 loops and will excite the neurons in the "category representation" field F2 via the bottom-up weights. The "established class" neuron in F2 that receives the strongest stimulation will fire. This neuron will read out a "top-down expectation" in the form of a set of top-down weights sometimes referred to as class templates. This top-down expectation will be compared against the enhanced input pattern by the vigilance mechanism. If the vigilance test is passed, the top-down and bottom-up weights will be updated and, along with the enhanced input pattern, will circulate repeatedly in the two upper F1 loops until stability is achieved. The time taken by the network to reach a stable state depends on how close the input pattern is to passing the vigilance test. If it passes the test comfortably, i.e. the input pattern is quite similar to the top-down expectation, stability will be quick to achieve. Otherwise, more iterations are required. After the top-down and bottom-up weights have been updated, the current firing neuron will become an established class neuron. If the vigilance test fails, the current firing neuron will be disabled. Another search within the remaining established class neurons in the F2 layer will be conducted. If none of the established class neurons has a top-down expectation similar to the input pattern, an unoccupied F2 neuron will be assigned to classify the input pattern. This procedure repeats itself until either all the patterns are classified or the memory capacity of F2 has been exhausted.
The basic ART-2 training algorithm can be summarised as follows:
(i) initialising the top-down and bottom-up long term memory traces;
(ii) presenting an input pattern from the training data set to the network;
(iii) triggering the neuron with the highest total input in the category representation field;
(iv) checking the match between the input pattern and the exemplar in the top-down filter (long term memory) using a vigilance parameter;
(v) starting the learning process if the mismatch is within the tolerance level defined by the vigilance parameter and then going to step (viii); otherwise, moving to the next step;
(vi) disabling the current active neuron in the category representation field and returning to step (iii); go to step (vii) if all the established classes have been tried;
(vii) establishing a new class for the given input pattern;
(viii) repeating (ii) to (vii) until the network stabilises or a specified number of iterations are completed.

In the recall mode, only steps (ii), (iii), (iv) and (viii) will be utilised.

Dynamics of ART-2: The dynamics of the ART-2 network illustrated in Figure 12 are controlled by a set of mathematical equations. They are as follows:

(28) $w'_i = I_i + a u'_i$
(29) $x'_i = \frac{w'_i}{\|W'\|}$
(30) $v'_i = f(x'_i) + b f(q'_i)$
(31) $u'_i = \frac{v'_i}{\|V'\|}$
(32) $p'_i = u'_i$
(33) $q'_i = \frac{p'_i}{\|P'\|}$
(34) $w_i = q'_i$
(35) $x_i = \frac{w_i}{\|W\|}$
(36) $v_i = f(x_i) + b f(q_i)$
(37) $u_i = \frac{v_i}{\|V\|}$
(38) $p_i = u_i + \sum_j g(Y_j)\, z_{ji}$
(39) $q_i = \frac{p_i}{\|P\|}$

Figure 12. Architecture of an ART-2 network
The symbol
X represents the L2 norm of the vector X. If X = x1 x2 xn , then X = x12 + x22 +
+ xn2 . The output of the jth neuron in the classification layer is denoted by gYj . The L2 norm is used in the equations for the purpose of normalising the input data. The function f· used in Equations (30) and (36) is a non-linear function, the purpose of which is for suppressing the noise in the input pattern down to a prescribed level. The definition of f· is 0 if 0 ≤ x < (40) fx = x if x ≥ where is a user defined parameter, it has a value between 0 and 1. Learning Mechanism of ART-2: When an input pattern is applied to the ART-2 network, it will pass through the 4 loops comprising F 1 and then stimulate the classification neurons in F 2. The total excitation received by the jth neuron in the classification layer is equal to Tj where (41)
Tj =
pi zij
i
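As a rough illustration of Equations (40) and (41), the following Python sketch implements the noise-suppression function f, the L2 normalisation used throughout F1, and the selection of the F2 neuron with the largest total input. The function names, the value of theta and the example weights are illustrative assumptions and are not taken from the ART-2 literature.

```python
import math


def f(x, theta):
    """Noise suppression of Equation (40): values below theta are set to 0."""
    return x if x >= theta else 0.0


def l2_normalise(vec):
    """Divide a vector by its L2 norm (returned unchanged if the norm is zero)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def winning_neuron(p, z_bottom_up):
    """Return (j, T_j) for the F2 neuron with the largest T_j = sum_i p_i * z_ij.

    z_bottom_up[j][i] holds the bottom-up weight z_ij from F1 node i to F2 neuron j.
    """
    totals = [sum(p_i * z_i for p_i, z_i in zip(p, row)) for row in z_bottom_up]
    j = max(range(len(totals)), key=totals.__getitem__)
    return j, totals[j]


# Hypothetical example: a 3-component pattern and two F2 neurons.
theta = 0.2
x = l2_normalise([3.0, 0.1, 4.0])
v = [f(x_i, theta) for x_i in x]      # the small middle component is suppressed
p = l2_normalise(v)
j, t_j = winning_neuron(p, [[0.1, 0.9, 0.1], [0.6, 0.0, 0.8]])
```

In this example the second F2 neuron wins because its bottom-up weights are aligned with the cleaned pattern.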
The neuron which is stimulated by the strongest total input signal will fire by generating an output with the constant value d. Therefore, for the winning neuron, $g(Y_j)$ equals d. When a winning neuron is determined, all the other neurons will be prohibited from firing. The value d will be used to multiply the top-down expectation of the firing class before the top-down expectation pattern is read out for comparison in the vigilance test. When the winning neuron fires, all the other neurons are inhibited from firing so it can be inferred that when there is a firing neuron (say j), Equation (38) becomes:

(42) $p_i = u_i + d z_{ji}$

otherwise, if there is no winning neuron, it can be simplified as:

(43) $p_i = u_i$

The top-down expectation pattern is merged with the enhanced input pattern at point $r_i$ before they enter the vigilance test (see Figure 12). $r_i$ is defined by

(44) $r_i = \dfrac{q_i + c p_i}{\lVert Q \rVert + c \lVert P \rVert}$
The vigilance test is failed and the firing neuron will be reset if the following condition is true:

(45) $\dfrac{\rho}{\lVert R \rVert} > 1$

where $\rho$ is the vigilance parameter. On the other hand, if the vigilance test is passed (in other words, the current input pattern can be accepted as a member of the firing neuron), the top-down and the bottom-up weights are updated so that the special features present in the current input pattern can be incorporated into the class exemplar represented by the firing neuron. The updating equations are as follows:

(46) $\dfrac{d}{dt} z_{ji} = d\,[p_i - z_{ji}]$
(47) $\dfrac{d}{dt} z_{ij} = d\,[p_i - z_{ij}]$
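The vigilance test and the weight update can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: the differential equations (46) and (47) are integrated with an explicit Euler step of size dt while p is held fixed, and the function names and all numerical values (c, d, rho, dt) are hypothetical.

```python
import math


def norm(vec):
    """L2 norm of a vector."""
    return math.sqrt(sum(v * v for v in vec))


def match_vector(q, p, c):
    """Equation (44): r_i = (q_i + c p_i) / (||Q|| + c ||P||)."""
    denom = norm(q) + c * norm(p)
    return [(q_i + c * p_i) / denom for q_i, p_i in zip(q, p)]


def vigilance_failed(r, rho):
    """Equation (45): the firing neuron is reset when rho / ||R|| > 1."""
    return rho / norm(r) > 1.0


def euler_update(z, p, d, dt):
    """One explicit Euler step of dz/dt = d (p_i - z_i), the form shared by (46) and (47)."""
    return [z_i + dt * d * (p_i - z_i) for z_i, p_i in zip(z, p)]


# Identical patterns give ||R|| = 1, so the test is passed for any rho < 1.
passed = not vigilance_failed(match_vector([0.6, 0.8], [0.6, 0.8], c=0.1), rho=0.9)

# Orthogonal patterns shrink ||R||, so a strict vigilance parameter triggers a reset.
reset = vigilance_failed(match_vector([1.0, 0.0], [0.0, 1.0], c=0.1), rho=0.95)

# Starting from zero weights, repeated steps move z towards p.
z = [0.0, 0.0]
for _ in range(50):
    z = euler_update(z, [0.6, 0.8], d=0.9, dt=0.5)
```

The closer the top-down expectation is to the enhanced input pattern, the closer the norm of R is to 1, which is why a larger vigilance parameter makes the network more selective.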
The bottom-up weights are denoted by $z_{ij}$ and the top-down weights by $z_{ji}$. According to the recommendations in [Carpenter and Grossberg, 1987], all the top-down weights should be initialised with the value 0 at the beginning of the learning process. This can be expressed by the following equation:

(48) $z_{ji}(0) = 0$
This measure is designed to prevent a neuron from being reset when it is allocated to classify an input pattern for the first time. The bottom-up weights are initialised using the equation:

(49) $z_{ij}(0) = \dfrac{1}{(1 - d)\sqrt{M}}$
where M is the number of neurons in the input layer. This number is equal to the dimension of the input vectors. This arrangement ensures that after all the neurons with the top-down expectations similar to the input pattern have been searched, it would be easy for the input pattern to access a previously uncommitted neuron.

2.10 Spiking Neural Network
Experiments with biological neural systems have shown that they use the timing of electrical pulses or “spikes” to encode and transmit information. Spiking neural networks, also known as pulsed neural networks, are attempts at modelling the operation of biological neural systems more closely than is the case with other artificial neural networks. An example of a spiking neural network is shown in Figure 13. Each connection between neurons i and j may consist of multiple sub-connections, each associated with its own weight and delay [Natschläger and Ruf, 1998].
Figure 13. Spiking neural network topology showing a single connection composed of multiple weights $w_{ij}^k$ with corresponding delays $d_{ij}^k$

Figure 14. Different shapes of response functions $\varepsilon_{ij}(t - s)$: a) excitatory post synaptic potential (EPSP) function; b) inhibitory post synaptic potential (IPSP) function
In the leaky integrate-and-fire model proposed by Maass [Maass, 1997], a neuron is regarded as a homogeneous unit that generates spikes when the total excitation exceeds a threshold value. Consider a network that consists of a finite set $V$ of such spiking neurons, a set $E \subseteq V \times V$ of synapses, a weight $w_{u,v} \ge 0$ and a response function $\varepsilon_{u,v}: R^+ \to R$ for each synapse $\langle u, v \rangle \in E$, where $R^+ = \{x \in R : x \ge 0\}$, and a threshold function $\Theta_v: R^+ \to R$ for each neuron $v \in V$. If $F_u \subseteq R^+$ is the set of firing times of a neuron u, then the potential at the trigger zone of each neuron v at time t is given by:

(50) $P_v(t) = \sum_{u:\, \langle u, v \rangle \in E} \;\; \sum_{s \in F_u:\, s < t} w_{u,v}\, \varepsilon_{u,v}(t - s)$
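To make Equation (50) concrete, the following Python sketch evaluates the potential of a single neuron v from the firing times of its presynaptic neurons. It is only a sketch under stated assumptions: the response function epsilon is a hypothetical EPSP-shaped curve chosen for illustration, the same response function is used for every synapse, the threshold is taken as constant in time, and the neuron names and parameter values are invented.

```python
import math


def epsilon(s, tau=7.0):
    """Hypothetical EPSP-shaped response: zero for s <= 0, peak value 1 at s = tau."""
    return (s / tau) * math.exp(1.0 - s / tau) if s > 0 else 0.0


def potential(t, firing_times, weights, response=epsilon):
    """P_v(t): sum over presynaptic neurons u and firing times s < t of w_uv * eps(t - s).

    firing_times[u] is the set F_u of firing times of neuron u;
    weights[u] is the weight w_uv of the synapse from u to v.
    """
    return sum(
        weights[u] * response(t - s)
        for u, times in firing_times.items()
        for s in times
        if s < t
    )


def first_firing_time(firing_times, weights, threshold, t_end, dt=0.1):
    """Scan time in steps of dt; return the first t with P_v(t) >= threshold, or None."""
    for k in range(int(round(t_end / dt)) + 1):
        t = k * dt
        if potential(t, firing_times, weights) >= threshold:
            return t
    return None


# Two presynaptic neurons, each firing once.
spikes = {"u1": [0.0], "u2": [2.0]}
w = {"u1": 0.5, "u2": 0.5}
p_at_7 = potential(7.0, spikes, w)
t_fire = first_firing_time(spikes, w, threshold=0.5, t_end=20.0)
t_none = first_firing_time(spikes, w, threshold=2.0, t_end=20.0)
```

With these values the potential never reaches 2.0, so the last call reports that the neuron does not fire in the scanned interval.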
In the simplest model of a spiking neuron, a neuron v fires whenever its potential $P_v(t)$ reaches a certain threshold $\Theta_v(t)$. This potential $P_v(t)$ is the sum of the so-called excitatory post synaptic potentials (EPSPs) and inhibitory post synaptic potentials (IPSPs), which result from the firing of other neurons u that are connected through a synapse to neuron v. The firing of a presynaptic neuron u at time s contributes to the potential $P_v(t)$ at time t an amount that is modelled by the term $w_{u,v}\, \varepsilon_{u,v}(t - s)$, where $w_{u,v}$ is the weight of the connection between neurons u and v, and $\varepsilon_{u,v}(t - s)$ is the response function. Some biologically realistic shapes of the post synaptic potentials (PSPs) are shown in Figure 14. The change in threshold as a function of time is illustrated in Figure 15. Learning can be achieved in spiking neural networks as in traditional neural networks for tasks such as classification, pattern recognition and function approximation [Iannella and Back, 2001; Bohte et al., 2002a; 2002b]. Different learning algorithms have been proposed for the training of spiking neural networks [Maass and Bishop, 1998; Gerstner and Kistler, 2002]. A supervised learning algorithm has been proposed [Bohte et al., 2002b], where it is shown that a feedforward network of spiking neurons can be trained for classification tasks by means of error backpropagation. Unsupervised learning can also be achieved for spiking neural networks [Bohte et al., 2002a] with self-organisation as for Radial Basis Function (RBF) networks.

Figure 15. Firing threshold of a neuron (threshold plotted against time)

3. SUMMARY
This chapter has presented the main types of existing neural networks and has described examples of each type. For an overview of the different systems engineering applications of these neural networks, see the chapter on Soft Computing and its Applications in Engineering and Manufacture for example.

4. ACKNOWLEDGEMENTS
This work was carried out within the ALFA project “Novel Intelligent Automation and control systems 11” (NIACS 11), the ERDF (Objective One) projects “Innovation in Manufacturing Centre” (IMC), “Innovative Technologies for Effective Enterprises” (ITEE) and “Supporting Innovative Product Engineering and Responsive Manufacturing” (SUPERMAN) and within the project “Innovative Production Machines and Systems” (I*PROMS).

REFERENCES

Albus J S, (1975a), “A new approach to manipulator control: cerebellar model articulation control (CMAC)”, Trans. ASME, J. of Dynamics Syst., Meas. and Contr., 97, 220–227.
Albus J S, (1975b), “Data storage in the cerebellar model articulation controller (CMAC)”, Trans. ASME, J. of Dynamics Syst., Meas. and Contr., 97, 228–233.
Albus J S, (1979a), “A model of the brain for robot control”, Byte, 54–95.
Albus J S, (1979b), “Mechanisms of planning and problem solving in the brain”, Math. Biosci., 45, 247–293.
An P E, Brown M, Harris C J, Lawrence A J and Moore C J, (1994), “Associative memory neural networks: adaptive modelling theory, software implementations and graphical user”, Engng. Appli. Artif. Intell., 7 (1), 1–21.
Bohte S M, La Poutre H and Kok J N, (2002a), “Unsupervised clustering with spiking neurons by sparse temporal coding and multilayer RBF networks”, IEEE Trans. on Neural Networks, 13 (2), 415–425.
Bohte S M, La Poutre H and Kok J N, (2002b), “Error-back propagation in temporally encoded networks of spiking neurons”, Neurocomputing, 17–37.
Broomhead D S and Lowe D, (1988), “Multivariable functional interpolation and adaptive networks”, Complex Systems, 2, 321–355.
Carpenter G A and Grossberg S, (1987), “ART2: Self-organisation of stable category recognition codes for analog input patterns”, Appl. Optics, 26 (23), 4919–4930.
Carpenter G A and Grossberg S, (1988), “The ART of adaptive pattern recognition by a self-organising neural network”, Computer, 77–88.
Cichocki A and Unbehauen R, (1993), Neural Networks for Optimisation and Signal Processing, Chichester: Wiley.
Elman J L, (1990), “Finding structure in time”, Cognitive Science, 14, 179–211.
Gerstner W and Kistler W M, (2002), Spiking Neuron Models: Single Neurons, Populations and Plasticity, Cambridge University Press, UK.
Goldberg D, (1989), Genetic Algorithms in Search, Optimisation and Machine Learning, Reading, MA: Addison-Wesley.
Hassoun M H, (1995), Fundamentals of Artificial Neural Networks, MIT Press, Cambridge, MA.
Haykin S, (1999), Neural Networks: A Comprehensive Foundation, 2nd Edition, Upper Saddle River, NJ: Prentice Hall.
Hecht-Nielsen R, (1990), Neurocomputing, Reading, MA: Addison-Wesley.
Holland J H, (1975), Adaptation in Natural and Artificial Systems, Ann Arbor, MI: University of Michigan Press.
Hopfield J J, (1982), “Neural networks and physical systems with emergent collective computational abilities”, Proc. National Academy of Sciences, 79, 2554–2558.
Iannella N and Back A D, (2001), “Spiking neural network architecture for nonlinear function approximation”, Neural Networks, Special Issue, 14 (6), 922–931.
Jordan M I, (1986), “Attractor dynamics and parallelism in a connectionist sequential machine”, Proc. 8th Annual Conf. of the Cognitive Science Society, 531–546.
Karaboga D, (1994), Design of Fuzzy Logic Controllers Using Genetic Algorithms, PhD thesis, University of Wales, Cardiff, UK.
Kohonen T, (1989), Self-Organising and Associative Memory (3rd ed.), Berlin: Springer-Verlag.
Maass W, (1997), “Networks of spiking neurons: The third generation of neural network models”, Neural Networks, 10, 1659–1671.
Maass W and Bishop C M, (1998), Pulsed Neural Networks, Cambridge: MIT Press.
Moody J and Darken C J, (1989), “Fast learning in networks of locally-tuned processing units”, Neural Computation, 1 (2), 281–294.
Natschläger T and Ruf B, (1998), “Spatial and temporal pattern analysis via spiking neurons”, Network: Computation in Neural Systems, 9 (3), 319–332.
Pham D T and Chan A J, (1998), “Control chart pattern recognition using a new type of self-organising neural network”, Proc. of the Institution of Mechanical Engineers, 212 (Part I), 115–127.
Pham D T and Chan A J, (2001), “Unsupervised adaptive resonance theory neural networks for control chart pattern recognition”, Proc. of the Institution of Mechanical Engineers, 215 (Part B), 59–67.
Pham D T and Karaboga D, (1993), “Dynamic system identification using recurrent neural networks and genetic algorithms”, Proc. 9th Int. Conf. on Mathematical and Computer Modelling, San Francisco.
Pham D T and Liu X, (1992), “Dynamic system modelling using partially recurrent neural networks”, Journal of Systems Engineering, 2 (2), 90–97.
Pham D T and Liu X, (1994), “Modelling and prediction using GMDH networks of Adalines with nonlinear preprocessors”, Int. J. Systems Science, 25 (11), 1743–1759.
Pham D T and Oh S J, (1992), “A recurrent backpropagation neural network for dynamic system identification”, Journal of Systems Engineering, 2 (4), 213–223.
Pham D T and Oztemel E, (1994), “Control chart pattern recognition using learning vector quantization networks”, Int. J. Production Research, 32 (3), 721–729.
Rumelhart D and McClelland J, (1986), Parallel distributed processing: explorations in the microstructure of cognition, volumes 1 and 2, Cambridge: MIT Press.
Widrow B and Hoff M E, (1960), “Adaptive switching circuits”, Proc. 1960 IRE WESCON Convention Record, Part 4, IRE, New York, 96–104.
CHAPTER 4
APPLICATION OF NEURAL NETWORKS

D. ANDINA1, A. VEGA-CORONA2, J. I. SEIJAS3, M. J. ALARCÓN1

1 Departamento de Señales, Sistemas y Radiocomunicaciones (SSR), Universidad Politécnica de Madrid (UPM), Ciudad Universitaria C.P. 28040, Madrid, España.
2 Facultad de Ingeniería, Mecánica, Eléctrica y Electrónica (FIMEE), Universidad de Guanajuato (UG), Salamanca, Gto., México.
3 Departamento de Señales, Sistemas y Radiocomunicaciones (SSR), Universidad Politécnica de Madrid (UPM), Ciudad Universitaria C.P. 28040, Madrid, España.

Abstract: This chapter discusses the factors that should be considered when deciding whether a Neural Network (NN) solution is suitable for a given problem. This is followed by a detailed example of a successful and useful application: a Neural Binary Detector.
INTRODUCTION

There are three main criteria which need to be applied when deciding whether a given problem lends itself to a neural solution.

a) Unknown Algorithm. The solution to the problem cannot be explicitly described by an algorithm, a set of equations or a set of rules. Making the decision between a conventional and a neural computing solution on the basis of this criterion is not always clear-cut. There are problems where both conventional and neural computing may be able to provide appropriate solutions. The choice then depends on the resources available and the ultimate goals of the designer.

b) Conviction of the existence of a solution. There must exist some evidence that an input-output mapping exists between a set of input variables x and corresponding output data y, such that y = f(x). The form of f, however, is not known. For example, let us suppose that we have input data patterns and corresponding desired output patterns. We intend to use a supervised algorithm to train a NN that will establish high-order (non-linear) correlations between the input features x and the output data y in order to minimize the output error. However, the fact that a database with a set of input and output pairs can be assembled to train the NN does not necessarily mean that a mapping can be constructed between the input variables and the desired data. Also, no matter how good the algorithm is, efficient results will not be achieved unless relevant features are used as the input to the network.

c) Availability of data. There should be a large amount of data available, i.e. many different examples with which to train the network. If there are any doubts about their availability then, in all probability, the NN will not solve the problem efficiently. Lack of suitable data is one of the main causes of problems in neural computing applications. Such data has to be collected and compiled in a computer-readable form. The designer of the NN application must take into account that there may be considerable practical problems associated with collecting and processing data; special instrumentation and recording facilities may be required, and specific experiments may be needed to ensure that the data cover the necessary range of conditions. For example, if a NN is to be applied to detect signals in noisy environments, data corresponding to all expected values of noise and signal should be used to train the network.

D. Andina and D.T. Pham (eds.), Computational Intelligence, 93–108. © 2007 Springer.
1. FEASIBILITY STUDY

If a given application satisfies the three main criteria stated above, the designer has to choose the type of NN to apply. A thorough literature search for references to related work should be carried out at this design stage. Nowadays, neural computing has a large applications literature and it is highly likely that there are papers on applications similar to the one under consideration. Once the literature search has been carried out, an outline design should be prepared as early as possible in the feasibility study stage. This may only be a paper exercise, but it could extend to the collection of a sample of data and even some prototype work. Especially relevant are the consideration of pre-processing the input to extract and select the features that characterize the problem, and a preliminary assessment of the required neural network architecture. If the prototype work is successful, and the designer has sufficient training data available, the design of the NN can start; this work involves a lot of empirical research, as will be shown in the following section.
2. APPLICATION OF NNs TO BINARY DETECTION

It is expected that neural hardware will provide crucial tools for the enhancement and development of algorithms applied to automatic classification. In this sense, some relevant studies about the possibilities of neural nets [Kim and Guest, 1990, Decatur, 1992, Roth, 1990] have selected four principal areas of application: automatic target recognition, speech recognition, seismic signal processing and sonar signal processing. The suitability of Neural Networks (NNs) as detectors is based on their capability to incorporate a great quantity of information from several classes, their ability to generalize from noisy or incomplete information, and their robustness to ambiguities. Another advantage of NNs is the computational simplicity of their nodes. Although they are computationally intensive on general-purpose computers, they can be implemented in specific massively parallel processing hardware that can outperform implementations on the most powerful computers.

Neural networks can have interesting robust capabilities when applied as binary detectors. This type of network has proved its abilities in classification problems, and the binary detection problem can be reduced to deciding whether an input has to be classified as one of two outputs, 0 or 1. However, NN detectors have some typical drawbacks: slow and unpredictable training, and difficulty in adding new information to the net, which requires time-consuming retraining. These problems become more critical as the size of the net increases.

This application example deals with the possibilities of applying a Multilayer Perceptron (MLP) Neural Network to binary detection. After an optimization process, the performance of the proposed neural detector is compared with the optimal one, whose performance is given by the Optimum Neyman-Pearson detector, commonly used in radar detection applications. The detector input is modeled as a signal with additive noise (given by its complex envelope). The binary output is 1 or 0. As the MLP output can accurately estimate the a posteriori probabilities of the input classes [Ruck et al., 1990], a threshold T is established at the NN output to satisfy the Neyman-Pearson requirements [Van Trees, 1968]. The performance is evaluated by Monte-Carlo simulations over several input models. Their Receiver Operating Characteristics (ROC) and detection curves are obtained. The main topics in designing the network are:

(a) The network structure design. Variations in performance are analyzed for different numbers of inputs and hidden layers, and different numbers of nodes in these hidden layers.

(b) The network training. Using the BackPropagation (BP) algorithm with momentum term, we study the training parameters: initial set of weights, threshold value for training, momentum value, and the training method. The preparation of the training set is discussed; according to preliminary experimental results, its key parameter is the Training Signal-to-Noise Ratio (TSNR). Also, it is shown how to find the TSNR value that maximizes the detection probability Pd for a given false alarm probability Pfa, as is required by the Neyman-Pearson criterion.

(c) The dependence of the training and the performance on the criterion function to be minimized by the BP algorithm. The following criterion functions are analyzed: Least Mean Square (LMS) Error [Hush and Horne, 1993], Minimum Misclassification Error (MME) [Telfer and Szu, 1994, Telfer and Szu, 1992] and El-Jaroudi and Makhoul (JM) criterion [El-Jaroudi and Makhoul, 1990].

(d) The results (training time, threshold values, error probability curves, ROC and detection curves) are analyzed, and appropriate conclusions are extracted.
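The role of the output threshold T can be illustrated with a short Python sketch. It shows one simple empirical way, assumed here purely for illustration and not necessarily the procedure followed by the authors, of choosing T from detector outputs recorded on noise-only patterns so that the estimated false alarm probability does not exceed a target value. The output values in the example are invented.

```python
import math


def threshold_for_pfa(noise_outputs, target_pfa):
    """Smallest threshold T whose estimated Pfa on the noise-only outputs is <= target_pfa.

    An output y is counted as a false alarm when y >= T.
    math.nextafter requires Python 3.9 or later.
    """
    ordered = sorted(noise_outputs, reverse=True)
    allowed = int(math.floor(target_pfa * len(ordered)))   # false alarms tolerated
    # Place T just above the (allowed + 1)-th largest output.
    return math.nextafter(ordered[allowed], math.inf)


def estimate_rate(outputs, threshold):
    """Fraction of outputs that lead to decision D1 (output >= threshold)."""
    return sum(1 for y in outputs if y >= threshold) / len(outputs)


# Hypothetical detector outputs for ten noise-only patterns.
noise_outputs = [0.05, 0.10, 0.20, 0.15, 0.70, 0.30, 0.25, 0.12, 0.08, 0.40]
T = threshold_for_pfa(noise_outputs, target_pfa=0.2)
pfa_hat = estimate_rate(noise_outputs, T)
```

Once T is fixed in this way, the same `estimate_rate` function applied to outputs for signal-plus-noise patterns gives an estimate of the detection probability Pd.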
3. THE NEURAL DETECTOR
The detector under consideration is a modified envelope detector [Andina and Sanz-González, 1995, Root, 1970], as shown in Figure 1. The binary detection problem is reduced to deciding whether an input complex value (the complex envelope involving signal and noise) has to be classified as one of two outputs, 0 or 1. Processing complex signals with an all-real coefficient NN requires splitting the input into its real and imaginary parts (so the number of inputs is twice the number of integrated pulses); then, a threshold T is established at the NN output. The input r(t) is a band-pass signal, and the complex envelope $x(t) = x_C(t) + j x_S(t)$ is sampled every $T_0$ seconds. Then

(1) $x(kT_0) = x_C(kT_0) + j x_S(kT_0), \quad k = 1, \ldots, M; \quad j = \sqrt{-1}$

At the neural network output, values in (0, 1) are obtained. A threshold value $T \in (0, 1)$ is chosen so that output values in [0, T) will be considered as binary output 0 (decision $D_0$) and values in [T, 1] will represent 1 (decision $D_1$). The two hypotheses $H_0$ (target absent) and $H_1$ (target present) are defined as follows

(2a) $H_0: \; x(kT_0) = n(kT_0)$
(2b) $H_1: \; x(kT_0) = S(kT_0) \cdot e^{j\theta(kT_0)} + n(kT_0)$
where $T_0$ is the pulse repetition period, k varies from 1 to the number of integrated pulses M, $S(kT_0)$ is the signal amplitude sequence, $\theta(kT_0)$ is the signal phase and $n(kT_0)$ is the complex envelope of the noise sequence, i.e. $n(kT_0) = n_C(kT_0) + j n_S(kT_0)$.

3.1 The Multi-Layer Perceptron (MLP)
The NNs have clear advantages over classical detectors when used as nonparametric detectors. In this case, the statistical description of the input is not available. The only information available in the design process is the performance of the detector on a group of patterns called training patterns. For this task, the BP algorithm builds an implicit histogram of the input distribution, adapting freely to the distributions of each class; so, NNs add a new level of sophistication to the classical techniques of nonparametric detection. For detection purposes, the Multi-Layer Perceptron (MLP) trained with the BackPropagation (BP) algorithm has been found more powerful than other types of NNs [Kim and Guest, 1990]. While other types of NNs can learn topological maps using lateral inhibition and Hebbian learning, a MLP trained with BP can also discover topological maps. Furthermore, the MLP can even be superior to parametric Bayesian detectors when the input distribution departs from the assumptions.

Figure 1. The Neural Detector (block diagram: the band-pass input r(t) is demodulated into $x_c(t)$ and $x_s(t)$, low-pass filtered, sampled and passed through delay lines to the MLP, whose output is compared with the threshold T to give decision $D_0$ or $D_1$)

3.2 The MLP Structure
Since mathematical methods to calculate the dimensions of the net are not yet available, one main question is: “How should the net be designed in order to obtain the desired decision regions in a reasonable amount of time?”. Generally, each problem requires a different capacity of the net and it is not clear that a general rule to calculate its size will be found. If the net is too small, it may not be capable of forming a good model of the problem; if it is too large, it could implement several solutions, and many of them would probably be suboptimal. For the majority of problems, at least, only one hidden layer has proved to be necessary, and this seems to be the case for binary detection. After a thorough study of different MLP structures, which included growing and pruning of the NN, no performance improvement has been found by adding more than one hidden layer, while the training time increases critically. So, the structure we have chosen is a MLP with one hidden layer, and we write 2M/N/1 for a MLP with 2M input nodes, N hidden layer nodes and one output node. Also, there is no way of establishing a priori the number of nodes in the hidden layer. For realizing exactly the input-output relation, it has been demonstrated that an upper bound for the number of nodes in the hidden layer of a MLP is of the order of the number of training patterns [Hush and Horne, 1993]. But, for practical purposes, the number of hidden nodes should be much lower than the number of training patterns; otherwise, the net simply “memorizes” the training set, losing its generalization capability. In general, the size of the net has to be determined by means of a trial and error procedure. Therefore, to choose the size of the hidden layer, empirical curves such as the one presented in Figure 2 have been used [Andina, 1995].
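The 2M/N/1 notation can be made concrete with a minimal Python sketch of the forward pass of such a network. Several details are assumptions made only for illustration: the logistic sigmoid is used as node activation function, each node has a bias weight, the weights are drawn at random, and the sizes M = 8 and N = 8 are example values.

```python
import math
import random


def sigmoid(a):
    """Logistic activation (an assumed choice), with output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))


def init_mlp(n_in, n_hidden, rng, scale=0.1):
    """Random weights in [-scale, scale]; the last element of each row is the bias."""
    hidden = [[rng.uniform(-scale, scale) for _ in range(n_in + 1)]
              for _ in range(n_hidden)]
    output = [rng.uniform(-scale, scale) for _ in range(n_hidden + 1)]
    return hidden, output


def forward(mlp, complex_samples):
    """Forward pass of a 2M/N/1 MLP on M complex samples."""
    hidden, output = mlp
    # The M complex samples are split into 2M real inputs (real parts, then imaginary parts).
    x = [s.real for s in complex_samples] + [s.imag for s in complex_samples]
    h = [sigmoid(sum(w * v for w, v in zip(row[:-1], x)) + row[-1]) for row in hidden]
    return sigmoid(sum(w * v for w, v in zip(output[:-1], h)) + output[-1])


rng = random.Random(0)
mlp = init_mlp(n_in=16, n_hidden=8, rng=rng)       # M = 8 pulses, N = 8 hidden nodes
y_hat = forward(mlp, [complex(1.0, -0.5)] * 8)     # eight identical example samples
decision = 1 if y_hat >= 0.5 else 0                # output compared with a threshold of 0.5
```

An untrained network of this kind gives outputs close to 0.5 for any input; training is what shapes the decision regions.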
The parameter Training Signal-to-Noise Ratio (TSNR) is the Signal-to-Noise Ratio used to generate the training set patterns, and is one of the key design parameters of the NN. If $\sigma^2$ is the noise power and A is the received signal amplitude (Marcum model [Marcum, 1960]), the Signal-to-Noise Ratio (SNR) is defined as

(3) $SNR = \dfrac{A^2}{\sigma^2}$

More details will be given later.

Figure 2. Error probability Pe (%) vs. number of nodes (N) in the hidden layer for an MLP, depending on Training Signal-to-Noise Ratio (TSNR); curves are shown for TSNR = 0 dB, 3 dB, 6 dB and 12 dB
In Figure 2, each curve presents a knee where the most effective trade-off between complexity and performance is found. After a thorough study of the values of M, N and TSNR, the structure 16/8/1 has been chosen as the most efficient one. In section 2 of this chapter, it will be shown that there is a range of the number of integrated pulses M (half the number of inputs) where the NN works efficiently.

4. THE TRAINING ALGORITHM

4.1 The BackPropagation (BP) Algorithm

Even if the size of the net has been precisely determined, finding the adequate weights is a difficult problem. The BackPropagation (BP) algorithm modifies the output layer weights during the training as (see Appendix A, section 1)

(4) $w_{ij}^{L}(t+1) = w_{ij}^{L}(t) - \eta\, \dfrac{\partial \mathcal{E}(W(t))}{\partial w_{ij}^{L}(t)}$

where $w_{ij}^{L}$ is the weight connecting the output of the i-th node of layer L − 1 to the j-th node in the output layer L; the iteration counter is t, $\eta$ is the learning rate and $\mathcal{E}(W)$ is the criterion or error function, which depends on the weights matrix, W.
In order to improve the learning time, it is common to include the momentum term within the basic BP algorithm. This term is $\alpha\,[w_{ij}^{l}(t) - w_{ij}^{l}(t-1)]$, where $0 < \alpha < 1$ and $w_{ij}^{l}(t)$ is the weight that connects the i-th output of layer l − 1 to the j-th node in layer l. By means of the inclusion of this term, the current search direction is an exponentially weighted average of past directions, helping to keep the weights moving across flat portions of the error surface, after they have descended from the steep portions. In the case of our binary detector, $\alpha = 0.8$ has been used. This value has been chosen from empirical results [Andina, 1995]. The learning procedure used is cross-validation, a method that monitors the generalization capabilities of the net. This method requires splitting the learning data into two sets: a training set, employed to modify the net weights, and a testing set, which is used to measure its generalization capability. During the training, the performance of the net on the training set will continue improving, but its performance on the test set will only improve up to a point, beyond which it will begin to degrade. It is at this point that the net begins to be overtrained (it is excessively specialized on the training set), losing its capacity for generalization; consequently, the training should be finished. In practice, the training has been stopped when a typical number of iterations has been carried out, choosing the net with the smallest error probability Pe over the test set. Although there is no guarantee that an absolute minimum has been reached, for the optimized network the smallest Pe is typically achieved in the order of 1,500 iterations. The Least-Mean-Squares (LMS) criterion is the most widely used for training a Multi-Layer Perceptron (MLP). However, depending on the application, there is no reason to think that this criterion is the optimal one [Barnard and Casasent, 1989].
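The effect of the momentum term can be sketched in Python on a toy problem. The sketch below applies the update "gradient step plus momentum term" to a hypothetical two-weight quadratic error function instead of a real MLP; the error function, the learning rate of 0.05 and the number of iterations are assumptions chosen only for illustration, while the momentum value 0.8 is the one quoted in the text.

```python
def momentum_step(w, w_prev, grad, eta, alpha):
    """w(t+1) = w(t) - eta * dE/dw(t) + alpha * (w(t) - w(t-1)), applied per weight."""
    return [w_i - eta * g_i + alpha * (w_i - wp_i)
            for w_i, wp_i, g_i in zip(w, w_prev, grad)]


def toy_gradient(w):
    """Gradient of the toy error E(w) = 0.5*(w0 - 1)^2 + 5*(w1 + 2)^2, minimum at (1, -2)."""
    return [w[0] - 1.0, 10.0 * (w[1] + 2.0)]


w_prev = [0.0, 0.0]
w = [0.0, 0.0]
for _ in range(200):
    # The right-hand side is evaluated first, so w_prev receives the old value of w.
    w, w_prev = momentum_step(w, w_prev, toy_gradient(w), eta=0.05, alpha=0.8), w
```

Because each step reuses a fraction of the previous weight change, the search keeps moving along the shallow direction of the error surface (here the first weight) instead of slowing down as the gradient becomes small.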
With the purpose of using an adequate criterion function for our detector, we have analyzed the following criterion functions (see also Appendix A, section 2):

(a) Least-Mean-Squares (LMS). This criterion is the most widely used in the BackPropagation (BP) learning algorithm. It has been proved that it approximates the Bayes optimal discriminant function and yields a least-squares estimate of the a posteriori probability of the class given the input [Ruck et al., 1990]. It minimizes the expression

(5) $E_{LMS} = \dfrac{1}{P} \sum_{p=1}^{P} \dfrac{1}{2} (y_p - \hat{y}_p)^2$

where P is the number of training pairs, p is the training pair counter, $\hat{y}_p \in (0, 1)$ is the net output (the neural detector has only one output, as mentioned in Section 3), and $y_p$ is the desired output, 0 or 1 for the binary case. One of the LMS drawbacks is that a least squares estimate of the probability can lead to gross errors at low probabilities [El-Jaroudi and Makhoul, 1990].

(b) Minimum Misclassification Error (MME). This criterion minimizes the number of misclassified training samples. It approximates class boundaries directly from the criterion function, and it could perform better than LMS for less complex networks [Telfer and Szu, 1992, Telfer and Szu, 1994]. In our detector, it minimizes the expression

(6) $E_{MME} = \dfrac{1}{P} \left[ P - \sum_{p=1}^{P} f\!\left( (2 y_p - 1)\, W_p^T Y_p^{L-1} \right) \right]$

where $Y_p^{L-1}$ is the output vector of layer L − 1, $W_p^T$ is the weight vector of the output layer L, and $f(\cdot)$ is the node activation function (see Appendix A, section 3).

(c) El-Jaroudi and Makhoul (JM) criterion. It is similar to the Kullback-Leibler information measure and results in superior probability estimates when compared to least squares [El-Jaroudi and Makhoul, 1990]. For the binary detection case it minimizes

(7) $E_{JM} = -\dfrac{1}{P} \sum_{p=1}^{P} \ln\left( 1 - |y_p - \hat{y}_p| \right)$
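The three criterion functions of Equations (5) to (7) can be written directly as short Python functions. This is only a sketch: the function names are invented, the logistic sigmoid is assumed as the node activation f in the MME criterion, and the desired outputs, net outputs and output-node net inputs in the example are hypothetical values.

```python
import math


def sigmoid(a):
    """Logistic function, assumed here as the node activation f."""
    return 1.0 / (1.0 + math.exp(-a))


def e_lms(y, y_hat):
    """Equation (5): mean over the patterns of half the squared error."""
    return sum(0.5 * (t - o) ** 2 for t, o in zip(y, y_hat)) / len(y)


def e_mme(y, net_inputs, f=sigmoid):
    """Equation (6); net_inputs[p] stands for the product W^T Y_p^{L-1} of pattern p."""
    total = sum(f((2 * t - 1) * a) for t, a in zip(y, net_inputs))
    return (len(y) - total) / len(y)


def e_jm(y, y_hat):
    """Equation (7): mean of -ln(1 - |y_p - y_hat_p|), for net outputs strictly inside (0, 1)."""
    return -sum(math.log(1.0 - abs(t - o)) for t, o in zip(y, y_hat)) / len(y)


# Hypothetical desired outputs, net outputs and output-node net inputs for four patterns.
y = [1, 0, 1, 0]
y_hat = [0.9, 0.1, 0.6, 0.4]
net_inputs = [2.0, -2.0, 0.5, -0.5]

lms, mme, jm = e_lms(y, y_hat), e_mme(y, net_inputs), e_jm(y, y_hat)
```

For a desired output of 0 or 1, the term inside the logarithm of Equation (7) is the probability that the net assigns to the correct class, so the JM criterion penalises confident wrong answers much more heavily than the LMS criterion does.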
Details about how the JM criterion has been implemented are given in Appendix A, section 4.

4.2 The Training Sets
The training set has been formed by an equal number of signal plus noise patterns and only noise patterns (i.e., PH0 = PH1 during the training), presented alternatively, so the desired output varies from D0 (detecting noise) to D1 (detecting target) in each iteration. Other choices of presenting the training pairs, as in a random sequence, are also suitable. The pattern sets are classical input models for radar detection [Marcum, 1960, Swerling, 1960]. The net inputs are samples of the complex envelope of the signal. There are NNs with complex coefficients [Kim and Guest, 1990] to process complex signals, but it seems more convenient to sample its “in-phase” and “quadrature” components, obtaining an all real coefficient net. This provides generality to this study, because the same NN can be utilized with a different preprocessing [Andina, 1995]. There are no methods that indicate the exact number of training patterns to be used. Assuming that the test and training patterns have the same distribution, there have been found limits to the number of training patterns necessary to achieve a given error probability: this number is approximately the number of weights divided by the desired error. If we utilize this upper limit, the number of patterns necessary for training becomes prohibitive. After an empirical study, an upper limit of 2,000 training patterns has proved to be sufficient for any criterion function or target model. The fact of using a model for the laboratory experiments is, partially, obligated. The acquisition of real patterns representative of the environment is expensive,
difficult, and sometimes even impossible. We must not overestimate the value of real data, since it is very difficult to obtain data under all propagation conditions. Learning from a model lets the designer train the MLP easily. The robustness of the system can then be sufficient to achieve quasi-optimal results (see Section 5.2) over a real input distribution; or, if there is time for on-line training, the NN can be refined afterwards.
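As a rough illustration of why the theoretical bound on the number of training patterns is prohibitive, consider the 16/8/1 structure used later in Figure 6. Counting one bias weight per node and taking a desired error probability of $10^{-3}$ are both assumptions made here only for the sake of the arithmetic:

```latex
W = (16 + 1)\cdot 8 + (8 + 1)\cdot 1 = 145,
\qquad
P \approx \frac{W}{P_e} = \frac{145}{10^{-3}} = 145{,}000
```

Under these assumptions the bound is roughly 70 times larger than the 2,000 patterns found sufficient in the empirical study.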
5. COMPUTER RESULTS
In order to make the resulting network independent of the initial weight values, each network has been trained four times, with random initial weights between $-0.1$ and $0.1$, selecting as the final network the one that provides the smallest probability of error. The training threshold has been set to 0.5. The signal input model in the following figures corresponds to the Marcum model [Marcum, 1960], that is

(8) $x(kT_0) = A \cdot e^{j\theta_0} + n(kT_0)$

where
$A$: signal amplitude of constant value.
$\theta_0$: initial phase, constant for each input pattern and uniformly distributed in $(0, 2\pi)$ between patterns.
$n(kT_0)$: complex white Gaussian noise of zero mean and variance $\sigma^2$ in each component.

The signal-to-noise ratio for training or testing is defined as

(9) $\mathrm{TSNR\ (or\ SNR)} = \frac{A^2}{2\sigma^2}$
The parameters for the study are the following:
Training signal-to-noise ratio: TSNR
Input signal-to-noise ratio: SNR
Structure: $N_0/N_1/1$ ($N_0$ inputs / $N_1$ nodes in the hidden layer / 1 output)

Let us call probability of detection, $P_d$, the probability that the detector decides $D_1$ (target present) under hypothesis $H_1$ (target present), $P_d = \Pr(D_1/H_1)$, and false alarm probability, $P_{fa}$, the probability of deciding $D_1$ under $H_0$ (target absent), $P_{fa} = \Pr(D_1/H_0)$. The performance of the detector is measured by detection curves ($P_d$ vs. SNR, for a fixed $P_{fa}$) and ROC curves (Receiver Operating Characteristic curves, $P_d$ vs. $P_{fa}$ for a fixed SNR).
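The input model of Equation (8) and the definitions of $P_d$ and $P_{fa}$ can be sketched as follows. This is a minimal illustration, not the code used for the reported experiments: the function names, the ordering of the real inputs (all in-phase samples followed by all quadrature samples) and the use of NumPy are assumptions.

```python
import numpy as np

def marcum_patterns(n_patterns, n_samples, snr_db, sigma=1.0, target=True, rng=None):
    """Hypothetical helper: real-valued input patterns for the Marcum model.

    Each row holds n_samples in-phase values followed by n_samples
    quadrature values (this layout is an assumption).
    """
    rng = np.random.default_rng() if rng is None else rng
    # Complex white Gaussian noise, zero mean, variance sigma^2 per component.
    noise = sigma * (rng.standard_normal((n_patterns, n_samples))
                     + 1j * rng.standard_normal((n_patterns, n_samples)))
    x = noise
    if target:
        # Equation (9): SNR = A^2 / (2 sigma^2)  =>  A = sigma * sqrt(2 * SNR)
        snr = 10.0 ** (snr_db / 10.0)
        amplitude = sigma * np.sqrt(2.0 * snr)
        # Initial phase: constant within a pattern, uniform between patterns.
        theta = rng.uniform(0.0, 2.0 * np.pi, size=(n_patterns, 1))
        x = amplitude * np.exp(1j * theta) + noise   # Equation (8)
    return np.hstack([x.real, x.imag])

def estimate_pd_pfa(outputs_h1, outputs_h0, threshold):
    """Empirical Pd = Pr(D1/H1) and Pfa = Pr(D1/H0) for a given threshold."""
    pd = float(np.mean(outputs_h1 > threshold))
    pfa = float(np.mean(outputs_h0 > threshold))
    return pd, pfa
```

Here `outputs_h1` and `outputs_h0` stand for the detector outputs obtained on target-present and noise-only test patterns; sweeping `threshold` for a fixed SNR would trace one ROC curve.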
5.1 The Criterion Function
In this section we present significant results of the comparison of criterion functions.

5.1.1 Convergence of the algorithm
The best results are those of the JM criterion, followed by the LMS one. The MME criterion requires, in general, more training iterations than the others. In Figure 3, the results for the JM and LMS criteria are presented for a Training Signal-to-Noise Ratio (TSNR) of 13 dB, showing that significant convergence improvements are obtained by using JM instead of LMS. This conclusion does not depend on the value of TSNR. Another general result is that raising the TSNR decreases the number of training iterations, as the decision regions to be separated by the MLP become more distinct.

5.1.2 Detection curves
First, we compare the detection characteristics of the networks with a fixed value of $P_{fa}$. In Figure 4, we present the best results for each criterion (these results depend on TSNR [Andina et al., 1995]) for two $P_{fa}$ values: $10^{-2}$ and $10^{-3}$, respectively. As we can see, the performance differences become clearer as the design conditions become more restrictive (i.e., lower values of the false alarm probability, $P_{fa}$). The results show that the JM criterion is the best for our detection problem. These results support the idea suggested in [El-Jaroudi and Makhoul, 1990] that the estimation of the a posteriori probabilities carried out by the JM criterion is more accurate than that of the LMS criterion. The error surface for the JM criterion is also more appropriate for the gradient search (faster convergence of learning). The value of TSNR that achieves the best performance is 13 dB for the JM criterion and 6 dB for LMS. The MME criterion presents the worst characteristics (see also Figure 5). This criterion does not suit the application, because it minimizes the classification error for a training threshold different from the threshold $T$ used to achieve the design value of $P_{fa}$. Unless the value of $T$ can be estimated before training, the network is forced into suboptimal performance in the direct operating mode.

Figure 3. Error probability Pe vs. iteration for TSNR = 13 dB (a) JM criterion (b) LMS criterion.

Figure 4. Detection probability Pd vs. Signal-to-Noise Ratio (SNR) for a MLP trained with different criterion functions and Training Signal-to-Noise Ratio (TSNR). LMS6dB means LMS criterion and TSNR = 6 dB. MME6dB means MME criterion and TSNR = 6 dB, and so on. False alarm probability, Pfa = (a) 0.01 (curves LMS6dB, MME6dB, JM13dB) (b) 0.001 (curves LMS13dB, MME13dB, JM13dB).

Figure 5. Detection probability Pd vs. false alarm probability Pfa for a MLP trained with all criteria in study (curves LMS6dB, MME13dB, JM13dB). SNR = 6 dB.

5.1.3 Receiver Operating Characteristic (ROC) curves
Now, the performance of the networks, expressed as $P_d$ vs. $P_{fa}$ under a fixed SNR, is presented. In Figure 5, it is observed that JM provides the best results. The MME criterion presents the worst characteristics, as in the case of the detection curves.

5.2 Robustness Under Different Target Models
For more complex target models, such as the classical Swerling I (SWI) target model [Swerling, 1960], where S in Equation (2b) has a Rayleigh distribution, the results are also very near those of the optimal detector (Neyman-Pearson detector), as shown in Figure 6. In Figure 6, an interesting phenomenon is observed. In this case, to evaluate the robustness of the network, a net trained with the simple Marcum model is evaluated over the SWI case. The network is not only able to generalize and obtain good results over the more complicated input model, but its results are also superior to those of the network trained over the same SWI distribution. This result suggests that the neural network may achieve better results when it is trained over simple hypotheses and then left to generalize over more complicated ones. This characteristic may be very useful in real cases, where input distributions will generally differ from those assumed in the design stage.
Figure 6. Detection probability Pd vs. Signal-to-Noise Ratio (SNR, in dB) for a MLP of structure 16/8/1 (16 inputs, 8 nodes in the hidden layer and one output) and different TSNR (6, 13 and 15 dB), compared with the optimum detector. Pfa = 0.01 (a) Swerling I for training and testing (b) Net trained with Marcum model and tested over Swerling I target model.
APPENDIX: ON BACKPROPAGATION AND THE CRITERION FUNCTIONS

1. THE BACKPROPAGATION ALGORITHM
The BackPropagation algorithm modifies the weight of input $i$ in node $j$, $w_{ij}$, following the expression

(A.1) $w_{ij}^{l}(t+1) = w_{ij}^{l}(t) - \mu\,\nabla_{ij}^{l}(t)$

where
$l$ is the layer counter: $l = 0, \cdots, L$; $l = 1$ is the first hidden layer and $l = L$ is the output layer.
$t$ is the iteration counter.
$\mu$ is the learning rate.
$\nabla$ is the gradient of the criterion function, $\varepsilon(W)$.

Let us define the error term for each node as

(A.2) $\delta_j^{l} = -\frac{\partial \varepsilon(W)}{\partial \hat{y}_j^{l}} \cdot f_j^{\prime\,l} = -\frac{\partial \varepsilon(W)}{\partial w_{ij}^{l}} \cdot \frac{1}{\hat{y}_i^{l-1}}$

where $\hat{y}_j^{l}$ is the actual output of node $j$ in layer $l$, $f_j^{\prime\,l}$ is the derivative of the activation function of the same node, and

(A.3) $W_j^{l} = \left[w_{0j}^{l}\; w_{1j}^{l}\; \cdots\; w_{N_{l-1}j}^{l}\right]$

then, we have

(A.4) $\nabla_{ij}^{l} = \frac{\partial \varepsilon(W)}{\partial w_{ij}^{l}} = -\delta_j^{l} \cdot \hat{y}_i^{l-1}$

Applying the chain rule, the error gradient for each node can also be expressed as a function of the error terms in the previous layers, and Equation (A.1) can be expressed as

(A.5) $w_{ij}^{l}(t+1) = \begin{cases} w_{ij}^{L}(t) + \mu\,\hat{y}_i^{L-1}(t)\,\delta_j^{L}(t), & l = L \\ w_{ij}^{l}(t) + \mu\,\hat{y}_i^{l-1}(t)\,\delta_j^{l}(t), & l = L-1, L-2, \ldots, 1 \end{cases}$

with $0 \le i \le N_{l-1} - 1$, $0 \le j \le N_l - 1$, where $N_l$ is the number of nodes in layer $l$ and

(A.6) $\delta_j^{l}(t) = f_j^{\prime\,l} \sum_{n=0}^{N_{l+1}-1} \delta_n^{l+1}(t)\,w_{jn}^{l+1}(t)$

Now, to adapt the algorithm to each criterion, it is only necessary to calculate the value of $\delta_j^{L}$ and apply Equation (A.5) (bias values, always set to "1", must be added to properly train the network).

2. LEAST MEAN SQUARES (LMS)
The criterion function to minimize is the mean square error between the actual output $\hat{y}$ and the desired output $y$ over the training set, i.e.

(A.7) $E_{LMS} = \frac{1}{P}\sum_{p=1}^{P}\sum_{j=1}^{N_s}\frac{1}{2}\left(y_{jp} - \hat{y}_{jp}\right)^2$

where $P$ is the number of training pairs and $N_s$ the number of output nodes. Approximating Equation (A.7) by its value over one pattern,

(A.8) $\varepsilon_{LMS} = \sum_{j=1}^{N_s}\frac{1}{2}\left(y_j - \hat{y}_j\right)^2$

As $L$ is the number of layers of the MLP, $\hat{y}_j = \hat{y}_j^{L}$. For the binary case, $N_s = 1$ and from Equations (A.8) and (A.2)

(A.9) $\frac{\partial \varepsilon(W)}{\partial \hat{y}^{L}} = -(y - \hat{y}^{L}) \;\Rightarrow\; \delta^{L} = (y - \hat{y}^{L}) \cdot f^{\prime\,L}$

3. MINIMUM MISCLASSIFICATION ERROR (MME)
This criterion minimizes the classification error on the training set and is proposed in [Telfer and Szu, 1994, Telfer and Szu, 1992]. The criterion function, when the network has one output node whose output values are in (0,1) and whenever the final values of the components of the weight matrix are large in modulus (which is usually the case [Telfer and Szu, 1994, Telfer and Szu, 1992]), is

(A.10) $E_{MME} = \frac{1}{P}\left[P - \sum_{p=1}^{P} f\left((2y_p - 1)\,W^{T}\hat{Y}_p^{L-1}\right)\right]$

where $f(\cdot)$ is the node activation function. Minimizing Equation (A.10) in step-by-step training is equivalent to minimizing

(A.11) $\varepsilon_{MME}(W) = 1 - f\left((2y - 1)\,W^{T}\hat{Y}^{L-1}\right)$

For the binary case,

(A.12) $\delta^{L} = (2y - 1) \cdot f^{\prime\,L}$

4. THE JM CRITERION
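This criterion is similar to the Kullback-Leibler information measure, and it minimizes

(A.13) $H(Q/P) = \sum_i P_i \cdot \ln\frac{P_i}{Q_i}$

where $P_i = y_i = \Pr(H_i|X)$ is the a posteriori probability of hypothesis $H_i$ ($i = 0, 1$) and $Q_i = \hat{y}_i = \hat{\Pr}(H_i|X)$ is its estimate. When there are enough training vectors, minimizing Equation (A.13) is equivalent to minimizing

(A.14) $E_{JM} = -\sum_i \int_{\forall x} p(x) \cdot \ln \hat{y}_i(x, H_i, W) \cdot dx$

where $x$ is the input vector. For the binary case,

(A.15) $E_{JM} = -\frac{1}{P}\left[\sum_{X_p \in H_0} \ln \hat{y}_0(x, H_0, W) + \sum_{X_p \in H_1} \ln \hat{y}_1(x, H_1, W)\right]$

where $P$ is the number of input patterns. As there is just one output node with values in (0, 1), $\hat{\Pr}(H_1|X) = \hat{y}$, $\hat{\Pr}(H_0|X) = 1 - \hat{y}$ and Equation (A.15) can be written as

(A.16) $E_{JM} = -\frac{1}{P}\left[\sum_{X_p \in H_0} \ln\left(1 - \hat{y}(x, H_0, W)\right) + \sum_{X_p \in H_1} \ln \hat{y}(x, H_1, W)\right]$

Since the desired output is $y_p = 0$ for patterns in $H_0$ and $y_p = 1$ for patterns in $H_1$, Equation (A.16) can be written as a function of the output $\hat{y}$

(A.17) $E_{JM} = -\frac{1}{P}\sum_{p=1}^{P} \ln\left(1 - |y_p - \hat{y}_p|\right)$

and, for each iteration step

(A.18) $\varepsilon_{JM}(W) = -\ln\left(1 - |y - \hat{y}(W)|\right)$

Finally, we have

(A.19) $\delta^{L} = \frac{\mathrm{sgn}(y - \hat{y})}{1 - |y - \hat{y}|} \cdot f^{\prime\,L}$

where $\mathrm{sgn}(\cdot)$ is the sign function.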
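The output-layer error terms of Equations (A.9), (A.12) and (A.19), together with one pattern-by-pattern update following Equations (A.5) and (A.6), can be sketched as below for an $N_0/N_1/1$ network. This is only an illustration under stated assumptions: a logistic sigmoid activation (so that $f' = \hat{y}(1 - \hat{y})$, and $f'$ takes the same value at the arguments of Equations (A.11) and (A.12)), bias weights stored as the last column or element of each weight array, and hypothetical function names.

```python
import numpy as np

def sigmoid(a):
    """Logistic activation (an assumption; the text only requires outputs in (0, 1))."""
    return 1.0 / (1.0 + np.exp(-a))

def output_delta(criterion, y, y_hat, f_prime):
    """Output-layer error term delta^L for one pattern, binary case."""
    if criterion == "LMS":   # Equation (A.9)
        return (y - y_hat) * f_prime
    if criterion == "MME":   # Equation (A.12)
        return (2.0 * y - 1.0) * f_prime
    if criterion == "JM":    # Equation (A.19)
        return np.sign(y - y_hat) / (1.0 - np.abs(y - y_hat)) * f_prime
    raise ValueError(f"unknown criterion: {criterion}")

def backprop_step(x, y, W1, W2, mu=0.1, criterion="JM"):
    """One update of an N0/N1/1 MLP. W1: (N1, N0 + 1), W2: (N1 + 1,)."""
    x_b = np.append(x, 1.0)            # bias input, always set to 1
    h = sigmoid(W1 @ x_b)              # hidden-layer outputs
    h_b = np.append(h, 1.0)            # bias input for the output node
    y_hat = sigmoid(W2 @ h_b)          # network output
    delta_out = output_delta(criterion, y, y_hat, y_hat * (1.0 - y_hat))
    # Equation (A.6): hidden error terms from the output error term.
    delta_hid = h * (1.0 - h) * delta_out * W2[:-1]
    # Equation (A.5): weight updates for the output and hidden layers.
    W2_new = W2 + mu * delta_out * h_b
    W1_new = W1 + mu * np.outer(delta_hid, x_b)
    return W1_new, W2_new, y_hat
```

With the sigmoid assumed here, the JM error term reduces to $y - \hat{y}$, which is one way to see why its error surface is better suited to gradient search than the LMS one, whose error term is additionally scaled by $f'$.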
REFERENCES
M.S. Kim, C.C. Guest, Modification of backpropagation networks for complex-valued signal processing in frequency domain, IEEE Proc. Int. Conf. Neural Networks, IJCNN, vol. III, pp. 27–31, San Diego, June 1990.
S.E. Decatur, Application of neural networks to terrain classification, Proc. Int. Conf. Neural Networks, pp. 283–288, 1989.
N. Miller, M.W. McKenna, T.C. Lau, Office of Naval Research Contributions to Neural Networks and Signal Processing in Oceanic Engineering, IEEE Journal of Oceanic Engineering, vol. 17, no. 4, Oct. 1992.
M.W. Roth, Survey of neural network technology for automatic target recognition, IEEE Trans. Neural Networks, vol. 1, no. 1, pp. 28–43, Mar. 1990.
D.W. Ruck, S.K. Rogers, M. Kabrisky, M.E. Oxley, B.W. Suter, The Multilayer Perceptron as an Approximation to a Bayes Optimal Discriminant Function, IEEE Trans. on Neural Networks, vol. 1, no. 4, pp. 296–298, Dec. 1990.
H.L. Van Trees, Detection, Estimation and Modulation Theory, Part I, Eds. Wiley and Sons, New York, 1968.
D.R. Hush and B.G. Horne, Progress in Supervised Neural Networks. What’s new since Lippmann?, IEEE Signal Processing Magazine, pp. 8–51, Jan. 1993.
B.A. Telfer, H.H. Szu, Implementing the Minimum-Misclassification-Error Energy Function for Target Recognition, Neural Networks, vol. 7, no. 5, pp. 809–818, 1994.
B.A. Telfer, H.H. Szu, Energy Functions for Minimizing Misclassification Error With Minimum-Complexity Networks, Proc. of Int. Joint Conf. Neural Networks, IJCNN, vol. IV, pp. 214–219, 1992.
A. El-Jaroudi, J. Makhoul, A New Error Criterion For Posterior Probability Estimation With Neural Nets, Proc. of Int. Joint Conf. Neural Networks, IJCNN, vol. I, no. 5, pp. 185–192, 1990.
D. Andina, J.L. Sanz-González, On the problem of Binary Detection with Neural Networks, Proc. of 38 Midwest Symposium on Circuits and Systems, Rio de Janeiro, Brazil, vol. I, pp. 554–557, Aug. 1995.
W.L. Root, An Introduction to the Theory of the Detection of Signals in Noise, Proc. of the IEEE, vol. 58, pp. 610–622, May 1970.
D. Andina, Optimización de Detectores Neuronales: Aplicación a Radar y Sonar, Ph.D. Dissertation (in Spanish), ETSIT-M, Polytechnic University of Madrid, Spain, Dec. 1995.
J.L. Marcum, A Statistical Theory of Target Detection by Pulsed Radar, IRE Trans. on Information Theory, vol. IT-6, no. 2, pp. 59–144, Apr. 1960.
E. Barnard, D. Casasent, A Comparison Between Criterion Functions for Linear Classifiers, with Application to Neural Nets, IEEE Trans. Systems, Man, and Cybernetics, vol. 19, no. 5, pp. 1030–1040, Oct. 1989.
P. Swerling, Probability of detection for fluctuating targets, IRE Trans. on Information Theory, vol. IT-6, no. 2, pp. 269–308, Apr. 1960.
D. Andina, J.L. Sanz-González, J.A. Jiménez-Pajares, A Comparison of Criterion Functions for a Neural Network Applied to Binary Detection, Proc. of Int. Conf. Neural Networks, ICNN, Perth, Australia, vol. I, pp. 329–333, Nov. 1995.
CHAPTER 5
RADIAL BASIS FUNCTION NETWORKS AND THEIR APPLICATION IN COMMUNICATION SYSTEMS

ASCENSIÓN GALLARDO ANTOLÍN1, JUAN PASCUAL GARCÍA2, JOSÉ LUIS SANCHO GÓMEZ3

1 Departamento de Teoría de la Señal y Comunicaciones, EPS-Universidad Carlos III de Madrid, Avda. de la Universidad, 30, 28911-Leganés (Madrid), SPAIN.
2 Departamento de las Tecnologías de la Información y las Comunicaciones, Universidad Politécnica de Cartagena, Campus de la Muralla del Mar, s/n, 30202, Cartagena (Murcia), SPAIN.
3 Departamento de las Tecnologías de la Información y las Comunicaciones, Universidad Politécnica de Cartagena, Campus de la Muralla del Mar, s/n, 30202, Cartagena (Murcia), SPAIN.

Abstract: Among the different types of Neural Networks (NN), the most popular and frequently used architectures are the Multilayer Perceptron (MLP) and Radial Basis Function (RBF) networks, due to their approximation capabilities. In this chapter we discuss the use of RBF networks to solve problems in different areas related to communication systems. In the first part of the chapter, we review the structure of RBF networks and the main procedures to train them. In the second part, the main applications of RBF networks in communication systems are presented and described. In particular, we focus on antenna array signal processing (direction-of-arrival estimation and beamforming) and channel equalization (intersymbol interference and co-channel interference). Other applications such as coding/decoding, system identification, fault detection in access networks and automatic recognition of wireless standards are also mentioned.

Keywords: Neural networks, radial basis functions, communication systems, channel equalization, antenna array signal processing
1. RADIAL BASIS FUNCTION NETWORKS
Radial Basis Function (RBF) networks are a type of layered feedforward neural network (NN) capable of approximating any continuous function to a given precision. The operation of an RBF network can be viewed as function approximation, that is, as a multidimensional curve-fitting problem.

D. Andina and D.T. Pham (eds.), Computational Intelligence, 109–130. © 2007 Springer.
The function approximation is realized by means of a first nonlinear mapping from the input space to a high-dimensional hidden space and a second linear mapping from the hidden space to the output space. This operation is used in complex pattern classification problems because, once the input space has been mapped nonlinearly into a high-dimensional space, the patterns are more easily separated by a linear transformation. The support for this operation is given by Cover's theorem on the separability of patterns [Cover, 1965]. According to this theorem, a complex pattern-classification problem is more likely to be linearly separable if it is cast nonlinearly into a high-dimensional space. The aforementioned mapping used in the pattern classification problem can be viewed as a surface construction that can also be used in interpolation problems. This approach to RBF network design was first taken in [Broomhead and Lowe, 1988]. In RBF networks the function that interpolates the data is of the form [Powell, 1988]:

(1) $F(x) = \sum_{i=1}^{N} w_i\,\varphi(\|x - c_i\|)$

where $\varphi(\cdot)$ is the radial basis function, $\|\cdot\|$ is the Euclidean norm, $N$ is the number of training data pairs, and $c_i$ is the center of the $i$-th radial basis function. In this approximation there are as many centers as training inputs, and their values are the input vectors themselves, i.e., $c_i = x_i$. During the training stage a known set of input and output data pairs is delivered to the RBF network to select the centers and compute the output layer weights. The function $F$ has to satisfy the interpolation condition $F(x_i) = d_i$, where $d_i$ is the desired training output value. The method used to compute the output layer weights is easier to explain if the RBF network operation is expressed in matrix form:

(2) $d = \Phi w$

where $d = [d_1, d_2, \ldots, d_N]^T$ is the desired output vector, $\Phi$ is an $N$-by-$N$ matrix with elements $\varphi_{ji} = \varphi(\|x_j - c_i\|)$ with $j, i = 1, 2, \ldots, N$, and $w = [w_1, w_2, \ldots, w_N]^T$ is the weight vector. This weight vector is computed as follows:

(3) $w = \Phi^{-1} d$
where $\Phi^{-1}$ is the inverse of the matrix $\Phi$. Micchelli's theorem [Micchelli, 1986] gives the conditions under which the RBF network produces an output surface or function $F$ that passes through all the training points. To measure the generalization capability of the trained network, i.e., its behavior with patterns that have not been used in the training phase, new points are delivered to the RBF network. A strict interpolation surface usually implies poor generalization, especially when the number of training points is high. The overfitting produced by strict interpolation is undesirable in most cases because the network outputs for new inputs, different from the training ones, will not be correct. Due to the noise present in the training data or the lack of data in some regions of the input and output space, the construction of the interpolation surface is an ill-posed problem. Regularization theory was proposed by Poggio and Girosi to solve the ill-posed surface reconstruction problem [Poggio and Girosi, 1990]. It can be proved that this regularization network is a universal approximator and that the approximation performed is optimal. Because training the regularization network is computationally very demanding, a suboptimal solution was proposed in [Poggio and Girosi, 1990]. This proposal allows different RBFs, so that the function approximation is realized by:

(4) $F(x) = \sum_{i=1}^{m} w_i\,\varphi_i(\|x - x_i\|)$

where the number of radial basis functions is $m \le N$. Typical examples of radial basis functions are inverse multiquadric functions and Gaussian functions. In [Park and Sandberg, 1991], the universal approximation theorem for RBF networks is proved. This theoretical support allows the design of RBF networks capable of approximating any continuous function when a set of appropriate parameters is selected.

2. ARCHITECTURE
An RBF network consists of two layers, as seen in Figure 1. The input level (not considered a layer) is responsible for data acquisition. The first layer is called the hidden layer because it lies between the input and the output layer; it performs a nonlinear transformation by means of the radial basis functions. Finally, the output layer produces the output of the network by means of a linear transformation. Figure 1 shows how each input is delivered to every neuron (RBF) in the hidden layer. A bias term is included as a constant function ($\varphi = 1$) multiplied by its own linear weight ($w_0 = b$). Every radial basis function output is multiplied by the corresponding linear weight and all the products are summed to obtain the RBF network output. The architecture depicted in Figure 1 refers to an output space of one dimension. It is straightforward to generalize the model to the case of a multidimensional output space.

Figure 1. Radial Basis Function Network structure (inputs x1, x2, ..., xn; hidden layer of radial basis functions ϕ plus a constant unit ϕ = 1; output layer with weights w0 = b, w1, w2, ..., wm producing F(x)).

Among all possible radial basis functions, one of the most used is the Gaussian function. Thus:

(5) $\varphi_i(x) = \exp\left(-\frac{1}{2\sigma_i^2}\|x - c_i\|^2\right)$
where $c_i$ is the radial basis center, $\sigma_i^2$ is the variance of the Gaussian function, and $x$ is the input vector. The variances are usually chosen to be common to all the Gaussian functions, although other alternatives have been proposed; one of them is described in the next section. The Gaussian radial basis functions considered above can be generalized to allow for arbitrary covariance matrices $\Sigma_i$. Thus, the basis functions take the form:

(6) $\varphi_i(x) = \exp\left(-\frac{1}{2}(x - c_i)^T \Sigma_i^{-1} (x - c_i)\right)$

It is sometimes useful to consider the covariance matrices equal and introduce a weighted norm of the form:

(7) $\frac{1}{2}\Sigma^{-1} = C^T C$
(8) $C = \mathrm{diag}\left(\frac{1}{\sigma_1}, \frac{1}{\sigma_2}, \cdots, \frac{1}{\sigma_m}\right)$

where $C$ is the norm weighting matrix and $\Sigma^{-1}$ is the inverse of the covariance matrix. If this weighting matrix is diagonal with different values along the diagonal, the Gaussian function has different variances along the different dimensions of the input space.

3. TRAINING ALGORITHMS
The design of an RBF network is composed of two phases: the training process and the testing phase. During the first one, the network learns from a known set of input-output data pairs called the training set. After the training process is completed, the generalization capability of the trained network is checked in the testing phase. To do this, a set of input-output data pairs different from those used in the training phase is delivered to the RBF network to measure its performance. The aim of a training algorithm is to choose the centers, the variances (or the covariance matrix if norm weighting is used) and the weights of the linear output layer. The most important issue in RBF network design is the selection of the centers. Usually, training strategies combine supervised and unsupervised algorithms. Supervised algorithms take into account the desired output values to compute the weights. An unsupervised algorithm uses only the input data to calculate the radial basis centers and variances. Four different training algorithms are presented below.
3.1 Fixed Centers Selected at Random
This unsupervised strategy consists of selecting the centers of the Gaussian radial basis functions randomly from the input data of the training set. Each center, $c_k$, is equal to one training input pattern. The variance of each Gaussian function is calculated as:

(9) $\sigma^2 = \frac{d_{max}^2}{2m}$

where $d_{max}$ is the maximum distance between the selected centers, and $m$ is the number of centers [Haykin, 1999]. Gaussian functions that are too peaked or too flat should be avoided, since extreme variance values produce poor performance. The output layer weights can be computed with the pseudo-inverse procedure. The RBF network operation can be expressed in matrix form as:

(10) $\begin{pmatrix} d_1 \\ d_2 \\ \vdots \\ d_N \end{pmatrix} = \begin{pmatrix} \varphi(x_1, c_1) & \varphi(x_1, c_2) & \cdots & \varphi(x_1, c_m) \\ \varphi(x_2, c_1) & \varphi(x_2, c_2) & \cdots & \varphi(x_2, c_m) \\ \vdots & \vdots & & \vdots \\ \varphi(x_N, c_1) & \varphi(x_N, c_2) & \cdots & \varphi(x_N, c_m) \end{pmatrix} \begin{pmatrix} w_1 \\ w_2 \\ \vdots \\ w_m \end{pmatrix}$

(11) $d = \Phi w$

where $d$ is the vector of the desired training outputs, $w$ is the weight vector, and $\Phi$ is a non-square matrix whose components are the outputs of the radial basis functions evaluated at the training points. The weights are calculated as:

(12) $w = \Phi^{+} d$
(13) $\Phi^{+} = (\Phi^T \Phi)^{-1} \Phi^T$

where $\Phi^{+}$ is the pseudo-inverse of the matrix $\Phi$. If the training set is large, the pseudo-inverse calculation requires extensive computation; nevertheless, there exist many efficient algorithms to compute the pseudo-inverse matrix [Golub and Van Loan, 1996]. Another problem arises when $\Phi^T \Phi$ is singular or nearly singular. In this case, the direct solution given by Equation (12) can lead to numerical difficulties. In practice, such problems are best resolved by using the technique of Singular Value Decomposition (SVD) to find a solution for the weights [Golub and Van Loan, 1996].
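The fixed-centers strategy can be sketched in a few lines. This is a minimal illustration, assuming Gaussian basis functions with the common variance of Equation (9), no bias term, and hypothetical function names; `numpy.linalg.pinv` computes the pseudo-inverse through an SVD, which matches the remedy suggested for the nearly singular case.

```python
import numpy as np

def design_matrix(X, centers, sigma2):
    """Phi[j, i] = exp(-||x_j - c_i||^2 / (2 sigma^2)) for all patterns and centers."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma2))

def train_fixed_centers(X, d, m, rng=None):
    """Pick m centers at random from the training inputs and solve for the weights."""
    rng = np.random.default_rng() if rng is None else rng
    centers = X[rng.choice(len(X), size=m, replace=False)]
    # Equation (9): common variance from the maximum distance between centers.
    diff = centers[:, None, :] - centers[None, :, :]
    d_max = np.sqrt((diff ** 2).sum(axis=2)).max()
    sigma2 = d_max ** 2 / (2.0 * m)
    Phi = design_matrix(X, centers, sigma2)
    # Equation (12): w = Phi^+ d, with the pseudo-inverse computed via SVD.
    w = np.linalg.pinv(Phi) @ d
    return centers, sigma2, w

def rbf_predict(X, centers, sigma2, w):
    """Network output F(x) for every row of X."""
    return design_matrix(X, centers, sigma2) @ w
```

When `m` equals the number of training patterns, every input becomes a center and the solution reduces to the strict interpolation of Equation (3).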
3.2 Self-Organized Selection of Centers
Selecting fixed centers at random is a simple and fast training strategy, but it usually leads to poor performance or to large RBF networks. Moreover, if the chosen centers are close together, the pseudo-inverse problem becomes ill-conditioned. The training algorithms in the self-organized selection of centers category try to avoid these problems by means of clustering procedures. In this section two algorithms are explained: k-means clustering and self-organizing map clustering. The RBF network training is completed with the estimation of the last-layer weights. Procedures such as the pseudo-inverse or a gradient descent algorithm may be used for this last part of the training. The most frequently used gradient descent algorithm is the Widrow-Hoff or Least Mean-Square (LMS) rule, due to its simplicity, its ability to operate satisfactorily in an unknown environment, and its ability to track variations of the input statistics [Haykin, 2002].

3.2.1 K-means clustering algorithm
This learning strategy allocates the centers in the regions of the input space where the relevant data are present. The algorithm begins with the random selection of $m$ centers. At each iteration of the algorithm the training input pattern $x$ is assigned to the $k$-th center if $\|x - c_k\| < \|x - c_j\|$ for $j = 1, 2, \ldots, m$ and $j \neq k$. The new center is computed as $c_k = \frac{1}{N_k}\sum_{x \in S_k} x$, where $N_k$ is the number of input patterns that belong to the cluster $S_k$, defined as:

(14) $S_k = \{x : \|x - c_k\| < \|x - c_j\|,\ j = 1, 2, \ldots, m,\ j \neq k\}$

The procedure stops when there is no variation in the position of the centers. After the centers have been calculated, there are several ways to calculate the variance of the Gaussian functions. Normally, the variance of each Gaussian function is the mean of the distances to a certain group of neighboring centers or input patterns; for example, we may select as the value of $\sigma_k$ for the $k$-th Gaussian function the mean of the distances to the $p$ nearest centers, $\sigma_k = \frac{1}{p}\sum_{i=1}^{p} l_i$, where $l_i$ is the distance between the $k$-th center and the $i$-th nearest center and $l_1 \le \cdots \le l_i \le \cdots \le l_m$. Usually, the Euclidean norm is used to compute the distances. Another alternative to select $\sigma_k$ consists of computing the distance between the current center and the nearest center belonging to another class and establishing $\sigma_k = \alpha \min_j \|c_j - c_k\|$ with $j = 1, 2, \ldots, m$, $j \neq k$, where $\alpha$ is a scale factor. $\sigma_k$ can also be estimated from the mean of the input patterns belonging to the cluster $S_k$.
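A minimal sketch of the procedure follows. It assumes the batch form of k-means (all patterns are reassigned before the centers are recomputed), keeps the center of an empty cluster unchanged, and uses hypothetical function names; the second function implements the "mean distance to the $p$ nearest centers" choice of $\sigma_k$.

```python
import numpy as np

def kmeans_centers(X, m, rng=None, max_iter=100):
    """Batch k-means: returns the m centers and the cluster index of each pattern."""
    rng = np.random.default_rng() if rng is None else rng
    # Random selection of m distinct training patterns as initial centers.
    centers = X[rng.choice(len(X), size=m, replace=False)].astype(float)
    for _ in range(max_iter):
        # Assign every pattern to its nearest center (Equation 14).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = centers.copy()
        for k in range(m):
            members = X[labels == k]
            if len(members) > 0:          # an empty cluster keeps its center
                new_centers[k] = members.mean(axis=0)
        if np.allclose(new_centers, centers):   # no variation: stop
            break
        centers = new_centers
    return centers, labels

def widths_from_neighbors(centers, p):
    """sigma_k as the mean distance from center k to its p nearest centers."""
    diff = centers[:, None, :] - centers[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # Column 0 of each sorted row is the zero distance of a center to itself.
    nearest = np.sort(dist, axis=1)[:, 1:p + 1]
    return nearest.mean(axis=1)
```

The centers and widths returned here would then be combined with a pseudo-inverse or LMS estimate of the output weights to complete the training.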
3.2.2 Self-Organizing map clustering: SOM
The Self-Organizing Map is a method to classify the input patterns into groups or clusters, each one characterized by a particular center. This iterative algorithm, developed by T. Kohonen, can be divided into two steps [Kohonen, 1990]. First, an input training pattern is randomly selected and assigned to the closest ("winner") cluster; this is called the competitive phase. In a second step, the center of the winner cluster, together with those of a predefined neighborhood, is updated so that they move towards the input vector. This is called the updating phase. The centers are initialized to random values. All the centers have to be different, and it is advisable that they have small Euclidean norms. In the neural network structure, a neighborhood $N_c$ is defined around each neuron. At the beginning of the algorithm, because of the random initialization, the centers of the neurons belonging to each neighborhood set $N_c$ can be located in distant positions in the input space. The aim of the algorithm is therefore to place the centers of neurons in the same neighborhood set in nearby positions in the input space. During the $i$-th iteration of the competitive phase, all the Euclidean distances between the given input pattern and the cluster centers are calculated. The winner center, $c_k$, is the one at minimum distance, i.e., $k(x_i) = \arg\min_j \|x_i - c_j(i)\|$ for $j = 1, 2, \ldots, m$. During the $i$-th iteration of the updating phase, the centers belonging to the neighborhood, $N_c$, of the winner center, $c_k$, are adjusted by moving them towards the input; the rest of the centers are not updated. Thus, the updating equations are:

(15) $c_k(i+1) = \begin{cases} c_k(i) + \eta(i)\,[x - c_k(i)] & \text{if } k \in N_c \\ c_k(i) & \text{if } k \notin N_c \end{cases}$

The parameter $\eta(i)$ is an adaptation rate that takes values between 0 and 1. It is related to the gain used in stochastic approximation processes and, as in those methods, it should decrease with time. Once the centers are calculated according to Equation (15), the final structure is such that similar input vectors activate the same output (center or neuron), while different vectors activate different neurons.
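The two phases can be sketched as follows. The sketch assumes a one-dimensional chain of neurons, a neighborhood made of the neurons whose index is within `radius` of the winner, and a linearly decreasing adaptation rate; these choices and the function names are illustrative assumptions, not prescribed by the text.

```python
import numpy as np

def som_step(centers, x, eta, radius):
    """One competitive phase plus one updating phase (Equation 15)."""
    # Competitive phase: the winner is the center closest to the input.
    winner = int(np.argmin(((x - centers) ** 2).sum(axis=1)))
    # Neighborhood N_c: neurons within `radius` positions of the winner.
    lo, hi = max(0, winner - radius), min(len(centers), winner + radius + 1)
    updated = centers.copy()
    # Updating phase: move the neighborhood towards the input.
    updated[lo:hi] += eta * (x - centers[lo:hi])
    return updated, winner

def som_centers(X, m, n_iter=2000, eta0=0.5, radius=1, rng=None):
    """Train m centers on the patterns in X with a decreasing adaptation rate."""
    rng = np.random.default_rng() if rng is None else rng
    # Small random initial centers (almost surely all different).
    centers = 0.01 * rng.standard_normal((m, X.shape[1]))
    for i in range(n_iter):
        x = X[rng.integers(len(X))]              # randomly selected pattern
        eta = eta0 * (1.0 - i / n_iter)          # decreases with time, stays in (0, 1)
        centers, _ = som_step(centers, x, eta, radius)
    return centers
```

In practice the neighborhood radius is often reduced during training as well; it is kept fixed here for brevity.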
3.3 Supervised Selection of Centers
One approach to developing a supervised selection of centers is to apply a gradient descent algorithm to a network error function. The supervised selection of centers tries to avoid the problems inherent in the previous training strategies: heuristic methods such as selecting the centers at random or k-means clustering usually lead to large networks for a given error, or do not generalize well. The time to converge to a solution is longer in the supervised selection than in the algorithms previously explained. The supervised procedure can also be extended to the calculation of the Gaussian variances and the output layer weights. In the most general form of the supervised training algorithm, the gradient descent procedure is also applied to the evaluation of the inverse covariance matrices $\Sigma_k^{-1}$. Once the error function has been defined, its gradients with respect to the corresponding parameters are calculated. The change in every iteration is proportional to the evaluated gradient. One possible error function is the Sum of Squared Errors (SSE), defined as:

(16) $E = \frac{1}{2}\sum_{j=1}^{N} e_j^2$

where $N$ is the number of training patterns, and $e_j$ is the error committed by the RBF network when the $j$-th pattern is presented. This error can be written as:

(17) $e_j = d_j - \sum_{i=1}^{m} w_i\,\varphi(\|x_j - c_i\|)$
The following equations implement the updating step of this supervised training method for the case of a one-dimensional output space [Haykin, 1999]. The generalization to a multidimensional output space is straightforward. The initial values of the centers, covariance matrices, and weights are chosen in a useful region of the parameter space. This initial constraint reduces the probability of getting stuck in a local minimum. The update of the $k$-th radial basis center in the $j$-th iteration is calculated as follows:

(18) $\frac{\partial E(j)}{\partial c_k(j)} = 2\,w_k(j)\sum_{i=1}^{N} e_i(j)\,\varphi'(\|x_i - c_k(j)\|)\,\Sigma_k^{-1}\,[x_i - c_k(j)]$
(19) $c_k(j+1) = c_k(j) - \eta_1\frac{\partial E(j)}{\partial c_k(j)}$

where $\varphi'(\cdot)$ denotes the derivative of $\varphi(\cdot)$ with respect to its argument. The linear weights are updated by means of the equations:

(20) $\frac{\partial E(j)}{\partial w_k(j)} = \sum_{i=1}^{N} e_i(j)\,\varphi(\|x_i - c_k(j)\|)$
(21) $w_k(j+1) = w_k(j) - \eta_2\frac{\partial E(j)}{\partial w_k(j)}$

Finally, the inverse covariance matrix $\Sigma_k^{-1}$ is adjusted with the following equations:

(22) $\frac{\partial E(j)}{\partial \Sigma_k^{-1}(j)} = -w_k(j)\sum_{i=1}^{N} e_i(j)\,\varphi'(\|x_i - c_k(j)\|)\,Q_{ik}(j)$
(23) $Q_{ik}(j) = [x_i - c_k(j)]\,[x_i - c_k(j)]^T$
(24) $\Sigma_k^{-1}(j+1) = \Sigma_k^{-1}(j) - \eta_3\frac{\partial E(j)}{\partial \Sigma_k^{-1}(j)}$

In the above equations, the parameters are updated once all the training patterns have been presented to the network. This is a batch update. Each network parameter can also be adjusted continuously; in this second case, the parameter updates are applied after every pattern presentation. This procedure is known as sequential or pattern-based updating. The terms $\eta_1$, $\eta_2$ and $\eta_3$ are the learning rate parameters of the centers, weights, and covariance matrices, respectively.
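A batch update of this kind can be sketched for a simplified case. The sketch below assumes the isotropic Gaussian of Equation (5) with a fixed common variance `sigma2`, so the covariance update is dropped, and it obtains the gradients by differentiating Equations (16) and (17) directly with $e_j = d_j - F(x_j)$; signs and constant factors are therefore those of that derivation. Function names and default learning rates are hypothetical.

```python
import numpy as np

def rbf_forward(X, centers, w, sigma2):
    """Hidden-layer outputs Phi (N x m) and network outputs F (N,)."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    Phi = np.exp(-d2 / (2.0 * sigma2))
    return Phi, Phi @ w

def sse(X, d, centers, w, sigma2):
    """Sum of squared errors, Equation (16)."""
    _, F = rbf_forward(X, centers, w, sigma2)
    return 0.5 * np.sum((d - F) ** 2)

def sse_gradients(X, d, centers, w, sigma2):
    """Gradients of the SSE with respect to the centers and the linear weights."""
    Phi, F = rbf_forward(X, centers, w, sigma2)
    e = d - F                                    # Equation (17)
    grad_w = -Phi.T @ e                          # dE/dw_k = -sum_j e_j phi_jk
    diff = X[:, None, :] - centers[None, :, :]   # diff[j, k] = x_j - c_k
    # dE/dc_k = -(w_k / sigma^2) sum_j e_j phi_jk (x_j - c_k)
    grad_c = -(w[None, :, None] * (e[:, None] * Phi)[:, :, None] * diff).sum(axis=0) / sigma2
    return grad_c, grad_w

def batch_update(X, d, centers, w, sigma2, eta_c=0.01, eta_w=0.01):
    """One batch gradient-descent step on the centers and the weights."""
    grad_c, grad_w = sse_gradients(X, d, centers, w, sigma2)
    return centers - eta_c * grad_c, w - eta_w * grad_w
```

Calling `batch_update` once per pass over the training set gives the batch mode described above; applying the same gradients pattern by pattern would give the sequential mode.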
3.4 Orthogonal Least Squares (OLS)
The Orthogonal Least Squares algorithm can be applied to the problem of selecting the RBF network centers [Chen et al., 1991]. The objective of OLS is to select an appropriate set of centers in a rational way, maximizing the contribution of the selected centers to the desired response. This iterative algorithm completes the training process by evaluating the linear output weights. From the point of view of this procedure, the RBF network can be considered as a special case of the linear regression model. In matrix notation, the output of the RBF network is given by Pw, so we can write:

$$d = Pw + E \qquad (25)$$
where $d = [d_1, d_2, \ldots, d_N]^T$ is the desired output vector of the training data set, $P = [p_1, \ldots, p_M]$ is the regression matrix, $w = [w_1, \ldots, w_M]^T$ is the weight vector, and $E = [e_1, e_2, \ldots, e_N]^T$ is the error vector of the training patterns. In this case, the output space is one-dimensional. The objective of the OLS algorithm is to find w and the regressors $p_j$, with M as small as possible, such that the energy of E is minimized. Each column of the regression matrix is a regressor vector obtained by evaluating the radial basis function with the corresponding center on every input pattern: $p_i = [\varphi(\|x_1 - c_i\|), \varphi(\|x_2 - c_i\|), \ldots, \varphi(\|x_N - c_i\|)]^T$. We can consider that the regressor vectors $p_i$ form a set of basis vectors. The least-squares solution $\hat{w}$ of the above problem satisfies the condition that $P\hat{w}$ is the projection of d onto the space spanned by the regressors. The squared norm of this projection is part of the desired output energy. Since the initial regressors are generally correlated, and in order to measure their individual contributions to the output energy, OLS transforms the initial set of M regressors into a set of m ≤ M orthogonal basis vectors. The individual contribution of each orthogonal vector to the desired output energy is then easy to find. It is usual to select as the initial set of regressors the ones that result from using all the training input data as centers. In this case, we have N regressor vectors. The orthogonalization of the initial regressor vectors is carried out by means of the well-known Gram-Schmidt method. This method decomposes the regression matrix into a product of two matrices U and A:

$$P = UA \qquad (26)$$
where the matrix U is composed of orthogonal columns and A is a triangular matrix. At each iteration of the algorithm, a column of U and the corresponding column of A are calculated. Each column of U is the result of the orthogonalization of a selected regressor p. An error reduction ratio is calculated to select the regressor that has to be orthogonalized in the corresponding iteration. Each column of U therefore represents the selection of a certain center. The elements of the corresponding column of the matrix A are calculated from the selected regressor and the already calculated orthogonal columns of U. The algorithm stops when a maximum pre-set error is
reached. The triangular matrix A is used in the computation of the output weights. In this procedure the variance is common to all the radial basis functions, and its value has to be set before the algorithm starts. In [Shertinsky and Picard, 1996], it is shown that OLS does not produce the smallest set of centers if a nonorthogonal basis is used. When the basis is nonorthogonal, the energy contributions of the basis vectors are not independent. In this case, OLS is not able to determine the regressor that produces the maximal alignment in a global sense. Despite this suboptimal performance, OLS has proved its validity in numerous applications, and it usually leads to a more compact set of centers than the previous learning strategies [Haykin, 1999]. In recent years several versions of OLS have been developed. The next sections give a brief explanation of two OLS variants.

3.4.1 Recursive Orthogonal Least Squares (ROLS)
The recursive orthogonal least squares algorithm is useful in time-variant problems [Gomm and Yu, 2000]. In this proposal, the classical OLS is transformed into a set of matrix equations in which the orthogonal decomposition is made in a recursive way. The matrices in each iteration depend on the previous ones. A recursive residual equation is introduced to measure the accuracy of the RBF network. Lower values of the residual represent better accuracy levels. In [Gomm and Yu, 2000], two operation modes of ROLS are described. In the backward mode, a center is removed at each iteration: the center that causes the smallest increase in the residual is removed from the network. In the forward mode, a center is added at each iteration: the center that allows for the maximum reduction in the residual is selected. In [Gomm and Yu, 2000], Givens rotations are used to achieve an efficient implementation of the forward and backward techniques.

3.4.2 Locally Regularised Orthogonal Least-Squares (LROLS)
The locally regularised orthogonal least-squares algorithm was developed to design RBF networks that generalize well in situations with high noise levels [Chen, 2002]. Moreover, the local regularization enforces the sparsity of the subset model obtained by the original OLS training algorithm. The LROLS technique combines the local regularization approach with OLS training. To achieve the local regularization, each orthogonal regressor $u_j$ has an associated regularization parameter $\lambda_j$. This new parameter is included in a new error reduction ratio equation. Just as in the classical OLS, this new error reduction ratio is evaluated to select the orthogonal regressor $u_j$. Since the optimal value of the regularization parameter is unknown, an iterative procedure is carried out. At each iteration a subset of regressors is selected with the current values of the regularization parameters. The iterative procedure stops when there are no changes in the regularization parameters.
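The forward selection at the core of the classical OLS procedure of Section 3.4 can be sketched as follows. This is a simplified illustration rather than the published algorithm: it uses classical Gram-Schmidt, recomputes the orthogonalization of every candidate at each step, and omits both the recursion of ROLS and the regularization of LROLS. The function name `ols_select`, the tolerance `tol` and the toy data are assumptions made for the example.

```python
import numpy as np

def ols_select(P, d, tol=0.01, max_centers=None):
    """Greedy selection of columns of P by error reduction ratio (ERR)."""
    N, M = P.shape
    max_centers = M if max_centers is None else max_centers
    selected, U, g = [], [], []
    dd = d @ d                      # energy of the desired output
    err_left = 1.0                  # fraction of energy still unexplained
    while len(selected) < max_centers and err_left > tol:
        best = (-1.0, None, None, None)
        for i in range(M):
            if i in selected:
                continue
            u = P[:, i].astype(float).copy()
            for uj in U:            # orthogonalize against the chosen columns
                u -= (uj @ P[:, i]) / (uj @ uj) * uj
            uu = u @ u
            if uu < 1e-12:          # candidate is numerically dependent
                continue
            gi = (u @ d) / uu
            err = gi ** 2 * uu / dd # error reduction ratio of this candidate
            if err > best[0]:
                best = (err, i, u, gi)
        if best[1] is None:
            break
        err, i, u, gi = best
        selected.append(i)
        U.append(u)
        g.append(gi)
        err_left -= err
    if not selected:
        return [], np.array([])
    # P_sel = U A with A unit upper triangular; the weights solve A w = g
    Usel = np.column_stack(U)
    A = np.triu((Usel.T @ P[:, selected]) / np.sum(Usel ** 2, axis=0)[:, None])
    w = np.linalg.solve(A, np.array(g))
    return selected, w

# Toy usage: every training input is a candidate center, common width 1.0
X = np.linspace(0.0, 2.0 * np.pi, 50)
d = np.sin(X)
P = np.exp(-(X[:, None] - X[None, :]) ** 2 / (2.0 * 1.0 ** 2))
selected, w = ols_select(P, d, tol=0.01)
residual_ratio = np.sum((d - P[:, selected] @ w) ** 2) / (d @ d)
```

The indices in `selected` identify the training inputs retained as centers, and `residual_ratio` is the fraction of the desired output energy left unexplained when the loop stops.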
4. RELATION WITH SUPPORT VECTOR MACHINES (SVM)
Support Vector Machines are another kind of feedforward neural network, developed by Vapnik [Boser, Guyon and Vapnik, 1992], [Cortes and Vapnik, 1995], [Vapnik, 1995] and [Vapnik, 1998]. The aim of an SVM is to construct an optimal hyperplane that provides maximum-margin classification in the case of linearly separable patterns. When the patterns are nonseparable, the constructed hyperplane minimizes the probability of classification error. The hyperplane is found by applying the method of structural risk minimization, so that the optimal hyperplane is the one that minimizes the Vapnik-Chervonenkis dimension. As seen in the introduction, Cover's theorem assures that a set of nonlinearly separable patterns becomes linearly separable with high probability if the input space is mapped into a space of higher dimension. This new feature space must have a high enough dimension and the mapping has to be nonlinear. In an SVM, the optimal hyperplane is not constructed in the input space but in the new feature space. The hyperplane construction involves the evaluation of an inner-product kernel. Mercer's theorem [Hochstadt, 1989] makes it possible to determine whether a certain kernel is an inner-product kernel and therefore whether it is acceptable in the design of an SVM. Once the kernel has been chosen, the optimum SVM design implies the maximization of an objective function subject to certain constraints by means of Lagrange multipliers. The design finishes when a weight matrix and a set of support vectors have been evaluated. The SVM output calculation consists of several operations. First, the inner-product kernel between the input and each support vector is evaluated. Each kernel value is then multiplied by the corresponding weight. Finally, all these products are summed to produce the desired output. It is easy to implement the SVM operation as a neural network structure.
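The three steps of the output calculation can be written down directly. The sketch below is only illustrative: a Gaussian kernel is assumed, and the support vectors, weights and bias are hand-picked hypothetical values rather than the result of solving the constrained optimization problem.

```python
import numpy as np

def gaussian_kernel(x, xi, sigma=1.0):
    # Inner-product kernel assumed for this example
    return np.exp(-np.sum((x - xi) ** 2) / (2.0 * sigma ** 2))

def svm_output(x, support_vectors, weights, bias=0.0, kernel=gaussian_kernel):
    # 1) kernel between the input and each support vector
    k = np.array([kernel(x, sv) for sv in support_vectors])
    # 2) multiply by the weights, 3) sum the products (plus an optional bias)
    return float(weights @ k + bias)

# Hypothetical support vectors and weights, one per class
support_vectors = np.array([[1.0, 1.0], [-1.0, -1.0]])
weights = np.array([1.0, -1.0])

out_pos = svm_output(np.array([0.9, 1.1]), support_vectors, weights)
out_neg = svm_output(np.array([-1.0, -1.0]), support_vectors, weights)
```

Seen as a network, each support vector plays the role of a hidden unit and the weights form the linear output layer.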
Depending on the inner-product kernel, there are different types of support vector machines. The polynomial learning machine uses the function $(x^T x_i + 1)^p$. A two-layer perceptron is constructed if the function $\tanh(\beta_0 x^T x_i + \beta_1)$ is applied. If the Gaussian function $\exp\left(-\frac{1}{2\sigma^2}\|x - x_i\|^2\right)$ is used, then an RBF network is obtained. From this point of view, RBF networks are a special case of SVM. The SVM design algorithm makes it possible to construct RBF networks while avoiding the heuristics needed in the conventional RBF network training algorithms.

5. APPLICATIONS OF RADIAL BASIS FUNCTION NETWORKS TO COMMUNICATION SYSTEMS
The general aim of this section is to cover the applications of radial basis function (RBF) networks to different areas related to communication systems. In particular, we will focus our attention on antenna array signal processing and channel equalization. We will also mention other applications such as coding/decoding, system identification, fault detection in access networks and automatic recognition of wireless standards.
5.1 Antenna Array Signal Processing
Antenna array signal processing (AASP) has recently been receiving growing attention from researchers as a possible means of improving the performance and capacity of many communication systems, such as mobile radio systems. AASP basically comprises two research areas: direction-of-arrival (DoA) estimation and beamforming. Conventional methods for AASP are usually linear. However, they can be improved by using non-linear techniques, typically neural networks. Interested readers can find a complete review of neural methods for antenna array signal processing in [Du et al., 2002]. Radial basis function networks stand out from other neural approaches for several reasons. Firstly, RBFs have the property of universal approximation, i.e. they can approximate continuous functions arbitrarily well when a large number of neurons is considered. The importance of this property will be highlighted in the subsection devoted to RBF-based DoA estimators. Secondly, it has been demonstrated that RBF methods are very robust against noise, so they perform well over a wide range of signal-to-noise ratios (SNRs). Finally, their training process is faster than that of other neural networks such as the multilayer perceptron (MLP), which allows an online adjustment of the network. This last property is very important because DoA estimation and beamforming are usually applications in which both adaptive adjustment to time-varying conditions and real-time processing are required. The rest of the section is devoted to the description of some of the most relevant contributions in the field of RBF-based methods applied to AASP problems.

5.1.1 Direction-of-Arrival (DoA) estimation
The purpose of DoA algorithms is to obtain the direction of arrival of signals from the information contained in the measurements of the antenna array outputs. The antenna array can be viewed as a device which performs a non-linear mapping, $G: \Theta \rightarrow S$, from the space of angles of arrival $\Theta$ to the space of sensor outputs $S$. So, one method for estimating the values of $\theta$ is to approximate the inverse mapping $F: S \rightarrow \Theta$. Owing to their universal approximation property, RBFs can be used for this purpose. Figure 2 is a block diagram of a DoA estimator based on RBF networks. As can be observed, this system is composed of the sensor array, a preprocessing stage and an RBF network. The array output S is passed through the preprocessing module to remove irrelevant information. This step is fundamental because it helps to minimize the required size of the RBF network. The preprocessed data is the input of the RBF network, which performs the non-linear mapping by approximating the function F. Finally, the RBF output is the estimate of the directions of arrival.

Figure 2. Block diagram of a DoA estimator based on RBF networks (sensor array with outputs s1, s2, ..., sk; preprocessing; RBF network; DoA estimation θ)

Several RBF-based DoA estimators have been developed following the generic scheme in Figure 2. Two representative examples are [Lo et al., 1994] and [El Zooghby et al., 1997]. In both cases, the preprocessing stage computes the normalized covariance matrix of the sensor output in order to eliminate the initial phase and gain of the array output. A similar scheme was used in [Southall et al., 1995], but in this case the RBF input consisted of the sines and cosines of the phase differences between the signals measured at the array elements and at the reference element. All these approaches outperform one of the most widely used algorithms in this field, the MUSIC (Multiple Signal Classification) algorithm [Schmidt, 1986], in both accuracy and speed. Another example of a tracking system based on RBF networks is proposed in [Mukai et al., 2002]. In this article, the authors used an adaptive RBF in conjunction with an array feed compensation system for the acquisition and continuous tracking of a Deep Space Network (DSN) antenna for communications at the Ka-band frequency of 32 GHz.

However, the extension of these methods to multiple-source tracking is not straightforward. In fact, for detecting more than three sources, the resulting RBF network becomes large and its training is impracticable [Du et al., 2002]. In addition, knowledge of the number of sources is required. The solution proposed in [El Zooghby et al., 2000] overcomes these limitations. In this work, El Zooghby et al. developed a new DoA algorithm (the N-MUST algorithm) with application to smart antennas for wireless terrestrial and satellite mobile communications. N-MUST consists of two stages. In the first stage (detection), a coarse detection of sources is performed. This result is refined in the second stage (estimation), in which several RBF networks are trained for different ranges of angles of arrival.
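The covariance-based preprocessing mentioned above can be sketched as follows. The details are assumptions made for illustration and are not taken from the cited papers: a uniform linear array with half-wavelength spacing, a single narrowband source, and a feature vector built from the real and imaginary parts of the upper triangle of the sample covariance matrix, scaled to unit norm. The names `steering_vector` and `preprocess` are hypothetical.

```python
import numpy as np

def steering_vector(theta_deg, n_sensors, spacing=0.5):
    """Response of a uniform linear array; spacing is given in wavelengths."""
    k = np.arange(n_sensors)
    return np.exp(-2j * np.pi * spacing * k * np.sin(np.deg2rad(theta_deg)))

def preprocess(snapshots):
    """Map snapshots (n_sensors, n_snapshots) to a real, unit-norm feature vector."""
    n_sensors, n_snap = snapshots.shape
    R = snapshots @ snapshots.conj().T / n_snap   # sample covariance matrix
    iu = np.triu_indices(n_sensors)               # R is Hermitian: keep upper triangle
    b = np.concatenate([R[iu].real, R[iu].imag])
    return b / np.linalg.norm(b)                  # normalization removes the gain

# Simulated array output for one source at 20 degrees (illustrative values)
rng = np.random.default_rng(0)
n_sensors, n_snap = 4, 200
a = steering_vector(20.0, n_sensors)
s = rng.standard_normal(n_snap) + 1j * rng.standard_normal(n_snap)
noise = 0.05 * (rng.standard_normal((n_sensors, n_snap))
                + 1j * rng.standard_normal((n_sensors, n_snap)))
S = np.outer(a, s) + noise

features = preprocess(S)
# The same snapshots with a different overall gain and initial phase
features_scaled = preprocess(3.0 * np.exp(0.7j) * S)
```

Because the covariance cancels a common phase factor and the normalization cancels a common gain, `features` and `features_scaled` coincide, so the RBF network only has to learn the dependence on the angle of arrival.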
5.1.2 Beamforming
The main objective of beamforming techniques is to acquire and reconstruct the original signal of the desired source while rejecting the remaining, non-desired sources. Beamforming is of special importance in modern mobile satellite communication systems and global positioning systems (GPS) because they require smart antennas capable of distinguishing signals from multiple sources. These systems must adapt the radiation pattern of the antenna in order to cancel the interfering signals and emphasize the desired ones. In [El Zooghby et al., 1998] it was demonstrated that RBF networks can be used for this purpose. In this case, RBF-based beamforming was applied to one- and two-dimensional adaptive arrays, obtaining results very close to the optimum Wiener solution. In the context of digital beamforming (DBF) antenna arrays, Xu et al. proposed the combination of the so-called Constant Modulus Algorithm (CMA) and a novel RBF-based beamformer for cancelling interference as well as Gaussian and non-Gaussian noise [Xu et al., 2000]. The presence of non-Gaussian noise (for example, atmospheric noise) is typical of satellite communications. In these noisy conditions, the authors showed that their approach outperformed the conventional CMA algorithm.

5.2 Channel Equalization
Digital communication systems are often impaired by several types of distortion which produce significant changes in both the amplitude and the phase of the transmitted signals. One of the most common distortions is intersymbol interference (ISI), which causes an overlap of the transmitted symbols over successive time intervals. ISI is usually a result of the restricted bandwidth of the communication channel, and it is also produced when several versions of the transmitted signal arrive at the receiver due to multi-path propagation. Other sources of distortion which degrade the performance of the communication system are noise, the intrinsic characteristics of the transmission channel, co-channel and adjacent-channel interference, non-linear distortions (for example, those produced by memoryless non-linear devices such as the travelling-wave tube (TWT) amplifiers included in satellite communication systems), fading, time-varying characteristics, etc. Figure 3 illustrates a digital communication system in which the channel includes the effects of the transmitter filter, the transmission medium and the receiver filter. Stationary linear dispersive channels can be characterized by an N-tap finite impulse response (FIR) digital filter, given by the coefficient vector $h = [h_0, h_1, \ldots, h_{N-1}]$. As can be observed in Figure 3, the digital data sequence $x(k)$ is passed through this FIR filter and is corrupted by additive zero-mean Gaussian noise, $n(k)$, producing the distorted received sequence, $y(k)$. The relationship between the transmitted signal, $x(k)$, and the received signal, $y(k)$, can be expressed as

$$y(k) = \sum_{i=0}^{N-1} h_i\,x(k-i) + n(k) \qquad (27)$$
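As a small illustration of the linear channel model in (27), the following sketch passes a binary sequence through an FIR channel and adds Gaussian noise. The tap values, noise level and the function name `fir_channel` are hypothetical choices for the example, and symbols before the start of the sequence are taken to be zero.

```python
import numpy as np

def fir_channel(x, h, noise_std, rng):
    """Return y(k) = sum_i h_i x(k-i) + n(k), assuming x(k) = 0 for k < 0."""
    y_clean = np.convolve(x, h)[:len(x)]          # FIR filtering of the symbols
    return y_clean + noise_std * rng.standard_normal(len(x))

rng = np.random.default_rng(0)
h = np.array([0.5, 1.0])                          # illustrative two-tap channel
x = rng.choice([-1.0, 1.0], size=10)              # binary transmitted symbols

y_noiseless = fir_channel(x, h, noise_std=0.0, rng=rng)
y_noisy = fir_channel(x, h, noise_std=0.1, rng=rng)
```

With the noise switched off, each received sample takes one of a finite number of values determined by the last N transmitted symbols, which is the property exploited by the non-linear equalizers discussed in this section.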
Figure 3. Digital communications system (transmitted signal x(k); channel; additive noise n(k); received signal y(k); equalizer; estimated signal x̂(k))
For a more general model of the channel (for example, non-linear channels), the received sequence, $y(k)$, is related to the channel input, $x(k)$, through the expression

$$y(k) = f_h(x(k), x(k-1), \ldots, x(k-N+1)) + n(k) \qquad (28)$$
in which $f_h$ is some linear or non-linear function which models the behavior of the channel. In this context, adaptive channel equalization is a major issue in digital communication systems. The role of channel equalization is to recover the transmitted symbols from the distorted received signal. Therefore, the equalizer is the part of the receiver that is employed for mitigating channel disturbances such as noise, intersymbol interference and the other distortions previously mentioned. In addition, in most practical communication systems, the channel characteristics vary with time. Therefore, it is necessary to build adaptive equalizers. Equalizers can be classified into two groups: sequence equalizers and symbol-by-symbol equalizers. Estimation theory suggests that the best performance for symbol detection is obtained by using equalizers belonging to the first group [Proakis, 2001]. Such equalizers are maximum likelihood sequence estimators (MLSE), and they are usually implemented by means of the well-known Viterbi algorithm. However, the entire transmitted sequence and a certain knowledge of the channel characteristics (i.e. a channel estimator) are required for decoding the optimum sequence of symbols. In addition, sequence equalizers present a high computational complexity and a large decision delay. All these requirements seriously limit their use in many practical communication systems. Symbol-by-symbol equalizers, in which only one output symbol is estimated in each symbol period, are the most popular alternative to MLSE equalizers. There exist three common symbol-decision architectures: transversal equalizers (TE), decision feedback equalizers (DFE) and growing memory structures [Mulgrew, 1996] [Mulgrew, 1998]. In TE architectures, the equalizer reconstructs each input symbol $x(k)$ using M consecutive channel outputs $y(k), y(k-1), \ldots, y(k-M+1)$. This way, the equalizer output, $\hat{x}(k)$, is an estimate of the channel input.
The integer M is known as the equalizer order and determines the behavior of the detector. In fact, the performance and the complexity of the system increase with the number M of
received symbols used for making the decision. Often, the equalizer operates with a decision delay d, so that at time k the equalizer produces an estimate of the input symbol $x(k-d)$. The DFE architecture is an extension of the TE one. In this case, the estimation of $x(k-d)$ uses not only the M most recent channel observations but also the P past detected symbols $\hat{x}(k-d-1), \hat{x}(k-d-2), \ldots, \hat{x}(k-d-P)$, where P is the equalizer feedback order. Finally, equalizers with a growing memory structure are usually implemented using recurrent networks. Recurrent networks offer the possibility of improving the quality of symbol-by-symbol decisions by recursively using all the previously received symbols instead of only the last M channel outputs as in the TE architecture, while the memory requirements of the equalizer are kept as small as possible. Thus, recurrent networks offer a compromise between performance and complexity. A traditional method of adaptively compensating for ISI is to construct an approximation to the inverse of the channel [Qureshi, 1985]. From this perspective, equalization is viewed as a deconvolution problem and the equalizer can be implemented as a linear adaptive filter. Linear transversal equalizers (LTE) are the most common example of this approach. However, several shortcomings of this method have been reported [Chen et al., 1993a], [Mulgrew, 1996]: firstly, adaptive filters cannot effectively mitigate the additive noise component and, secondly, the linear approach completely ignores the fact that, in the absence of noise, a received sequence $y(k)$ can only take values belonging to a predefined finite alphabet of symbols. It is precisely the latter consideration that allows channel equalization to be viewed as a decision problem. This is the reason why non-linear approaches (many of them based on neural networks) have attracted the interest of several researchers in the field of channel equalization.
In fact, from Bayesian decision theory [Proakis, 2001], it can be seen that the optimal solution for the symbol-detection decision corresponds to a non-linear classification problem. Several neural networks have been proposed for channel equalization, such as multilayer feedforward neural networks (MLP) and radial basis function networks (RBF) [Ibnkahla, 2000]. The main drawback of MLP networks is their training process, which requires a great number of examples and is very time-consuming [Mulgrew, 1996]. This fact restricts the use of MLP networks in scenarios in which adaptation is a fundamental issue, as occurs in channel equalization problems. On the other hand, RBF networks have a structure closely related to that of the optimal Bayesian equalizer [Chen et al., 1993a], and they are well suited for solving non-linear classification problems. For these reasons, in recent years, RBFs have been considered an attractive alternative to linear approaches. In the next subsections, we summarize some of the most important research works related to the application of RBF networks to channel equalization. In particular, we will focus on the methods developed for the mitigation of intersymbol and co-channel interference.
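The relationship between the RBF structure and the Bayesian symbol-by-symbol decision can be made concrete with a small sketch. It assumes a known two-tap channel, binary equiprobable symbols and a known noise variance, so that no training is involved; the centers are the noise-free channel output vectors (channel states) and each weight is simply the symbol associated with its state. The channel taps, the equalizer order, the decision delay and the function names are assumptions made for the example.

```python
import itertools
import numpy as np

def channel_states(h, order, delay):
    """Noise-free vectors [y(k), ..., y(k-order+1)] and the symbol x(k-delay) of each."""
    n_in = len(h) + order - 1                 # number of symbols affecting the vector
    centers, labels = [], []
    for bits in itertools.product([-1.0, 1.0], repeat=n_in):
        x = np.array(bits)                    # x[0] = x(k), x[1] = x(k-1), ...
        centers.append([np.dot(h, x[m:m + len(h)]) for m in range(order)])
        labels.append(x[delay])
    return np.array(centers), np.array(labels)

def rbf_equalize(Y, centers, labels, noise_var):
    """Sign of a Gaussian RBF expansion whose centers are the channel states."""
    d2 = np.sum((Y[:, None, :] - centers[None, :, :]) ** 2, axis=2)
    return np.sign(np.exp(-d2 / (2.0 * noise_var)) @ labels)

rng = np.random.default_rng(1)
h = np.array([0.5, 1.0])                      # illustrative channel
order, delay, noise_std = 2, 1, 0.1

x = rng.choice([-1.0, 1.0], size=2000)
y = np.convolve(x, h)[:len(x)] + noise_std * rng.standard_normal(len(x))

centers, labels = channel_states(h, order, delay)
Y = np.column_stack([y[2:], y[1:-1]])         # rows [y(k), y(k-1)] for k >= 2
x_hat = rbf_equalize(Y, centers, labels, noise_std ** 2)
ber = np.mean(x_hat != x[1:-1])               # compare with x(k-1)
```

In an adaptive equalizer the centers and the noise variance are not known in advance and have to be estimated from the received signal.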
5.2.1 Intersymbol interference
Numerous transversal adaptive equalizers based on RBF networks have been introduced in the literature to overcome ISI and nonlinear distortions in communication systems. In all cases, the training phase is the most important process involved in the equalizer design. Many communication channels transmit, at certain time intervals, a training sequence which can be used for estimating the network parameters. Early adaptive RBF-based equalizers were designed for real-valued binary channels. In these systems, the training of the RBF networks is usually carried out in two stages [Chen et al., 1993a]. The first stage consists of a clustering algorithm which computes the optimal centers of the network. If the training symbol sequence provided by the communication system is available, the learning of the network centers can be done in a supervised manner. If not, an unsupervised clustering algorithm must be used (generally, the unsupervised k-means algorithm). In the second stage the network weights are updated using, for example, a least mean square (LMS) algorithm. Chen et al. [Chen et al., 1994] developed a complex-valued version of their previous equalizer which can be used for two-dimensional signalling such as quadrature amplitude modulation (QAM) channels. In [Cha, 1995] another approach for training complex-valued RBF network equalizers was proposed: a stochastic gradient (SG) algorithm. The main advantage of this algorithm is the possibility of simultaneously training all the free parameters of the network (centers and weights). In both cases, it has been shown that the proposed approach is capable of successfully equalizing slowly time-varying nonlinear channels and outperforms classical linear equalizers. Obtaining RBF networks of minimum size (and complexity) is one of the main issues considered in this application area. Gan et al.
[Gan et al., 1999] developed an equalizer in which the number of centers depended only on the decision delay and not on the equalizer order as in previous approaches. They obtained very good results in the equalization of fast time-varying complex-valued channels (4-QAM). In this context, it is worth mentioning the application of Minimal Resource Allocation Networks (MRAN) to the equalization of real-valued [Chandra Kumar et al., 1998] or complex-valued channels [Jianping et al., 2002]. MRAN is a network which uses a scheme for adding or removing RBF centers depending on the channel conditions, leading to a minimal network structure. Satellite mobile communication channels are characterized by the presence of linear distortions due to the emitter and receiver filters, and of nonlinearities caused by on-board power amplifiers and multipath propagation. Bouchired et al. reported that RBF networks have also been applied to satellite channel equalization, outperforming LTE and MLP-based equalizers [Bouchired et al.,1998a]. In the same application area, Bouchired et al. obtained very successful results in the equalization of satellite channels by using Kohonen's self-organizing maps (SOMs) to improve the clustering stage in an RBF network equalizer [Bouchired et al.,1998b]. The application of RBF networks to decision feedback equalizers has also been considered. Chen et al. [Chen et al., 1993b] obtained significant improvements
in both performance and reduction of computational complexity by combining RBF networks with DFE structures. They applied this novel equalizer to highly non-stationary channels (such as fading mobile radio channels) and demonstrated the superior performance of their approach in comparison to a conventional Viterbi-based equalizer. As mentioned before, higher-order equalizers perform better than low-order ones. In RBF-based equalization, an increase in the order implies an exponential growth in the number of centers of the RBFs involved. To overcome this drawback, recurrent RBF networks have been proposed in [Cid-Sueiro, 1994], [Mimura and Furukawa, 2001] for linear and non-linear channels.

5.2.2 Co-channel interference
Many communication systems such as cellular mobile channels are impaired by the so-called Multiple Access Interference (MAI). It originates from the frequency reuse plan which allows several cells to share the same set of frequencies. This way, the signal received at the mobile station consists of the sum of all signals sent by the base station in the same cell, and those in the other cells transmitting in the same frequency band or in adjacent frequency bands (co-channel interference, CCI, or adjacent-channel interference, ACI). Figure 4 illustrates a digital communication system with co-channel interference.

Figure 4. Digital communication system with co-channel interference (transmitted signal x(k) through the own channel; interfering signals xc1(k), ..., xcN(k) through co-channels #1 to #N; additive noise n(k); received signal y(k))

Although the main objective of channel equalization has traditionally been to reduce the effects of ISI, several works have shown that equalizers are able to mitigate the distortions due to co-channel and adjacent-channel interference. Adaptive RBF networks have been applied successfully to overcome CCI. In [Chen and Mulgrew, 1992] an RBF-based transversal equalizer was constructed for this purpose. This approach exploits the differences between the interfering signals and noise to improve the performance of a conventional equalizer. Chen et al. extended their previous work to an equalizer with decision feedback architecture [Chen et al., 1996] and adaptive implementation. They showed that in the presence of severe CCI, the RBF approach produced significant improvements in the performance of the system in comparison to the conventional MLSE approach. Finally, in [Chen et al., 2003] RBF-based adaptive equalization in the presence of intersymbol interference, additive Gaussian noise and co-channel interference was proposed. The novelty of this study was the definition of a new algorithm (LBER) for training the RBF network. LBER is a stochastic gradient adaptive algorithm which tries to minimize the bit error rate (BER) instead of the mean square error (MSE), which is the criterion used in conventional equalizers. As the real goal of equalization is the minimum BER, this method significantly improved on the results obtained with a linear MMSE equalizer.

5.2.3 Blind equalization
The equalization techniques presented in the previous subsections rely on the existence of a training signal which is transmitted at regular intervals. This allows the regular adjustment of the RBF network parameters in order to improve its performance in the presence of time-varying channels. However, when the training sequence is not available or the channel is highly time-varying (as, for example, in the Rician fading channels typical of outdoor mobile communication environments), it is necessary to develop more sophisticated techniques, generally known as blind equalization methods. Most blind equalization methods include a channel estimator in order to adapt the centers and weights of the RBF network using non-supervised training techniques. For example, in [Cid-Sueiro, 1993] the channel response was estimated through a non-supervised, non-decision-directed algorithm applied to a recurrent RBF network equalizer. However, other approaches do not require a channel estimator. For example, Gomes and Barroso modified the RBF equalizer proposed by Chen et al. [Chen et al., 1993a] for blind equalization. In this case, a non-supervised clustering algorithm was used for updating the RBF centers and radii, while the network weights were adjusted by minimizing an appropriate cost function [Gomes and Barroso, 1997]. Finally, in [Lin and Yamashita, 2002] the RBF equalizer was designed directly using only the received signal and a novel cluster map algorithm.

5.3 Other RBF Networks Applications
In the previous sections, some of the most common applications of radial basis function networks in the field of digital communication systems have been described. However, in this context, there are other important areas in which RBFs have demonstrated their usefulness and good performance. These areas will be summarized in the next paragraphs.
Several authors have applied RBFs to coding-decoding systems. In [Kaminsky and Deshpande, 2003], the functional and structural similarities between the directed graphs used in conventional Viterbi decoding and RBF networks were exploited to develop an adaptive RBF-based decoder for a trellis coded modulation (TCM) system, which was able to take advantage of its learning capabilities to improve the decoding decisions. Müller and Elmirghani employed RBFs to construct a novel chaotic-based coding/decoding method [Müller and Elmirghani, 2002], and they showed that this new strategy presented high noise robustness compared to other conventional schemes for the same problem.

System modelling and identification is another fundamental issue in many communication systems. It can be used for channel estimation as a part of a blind equalization system or for computer-based simulation of communication systems. Two good examples are the studies conducted in [Yingwei et al., 1996], [Leong et al., 2002], in which RBFs were used for adaptive identification of non-linear systems, taking advantage of their universal approximation property.

RBFs can also be applied successfully to the problem of fault detection in access networks, as shown in [Zhou and Austin, 1999], in which RBF networks were used for automatically learning the relationship between several telephone line parameters and fault occurrences.

Finally, RBF networks can be used for automatic recognition of modulation schemes or identification of wireless standards. These issues have a direct application in the design of wireless reconfigurable receivers [Palicot and Roland, 2003]. The interest of this novel kind of device is clear, given the fast growth of services which may have to be carried over several heterogeneous wireless networks such as the Global System for Mobile Communications (GSM) or the Universal Mobile Telecommunication System (UMTS) standards.
REFERENCES
Cortes, C. and Vapnik, V.N., Support vector networks. Machine Learning, 20:273–297, 1995.
Boser, B., Guyon, I. and Vapnik, V.N., A training algorithm for optimal margin classifiers. Fifth annual workshop on computational learning theory, 144–152, San Mateo, CA, 1992.
Bouchired, S., Ibnkahla, M., Roviras, D. and Castanié, F., Equalization of satellite mobile communication channels using combined self-organizing maps and RBF networks, In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’98), 3377–3380, Seattle, WA, USA, April 1998b.
Bouchired, S., Ibnkahla, M., Roviras, D. and Castanié, F., Equalization of satellite UMTS channels using RBF networks, In Proceedings of IEEE Workshop on Personal Indoor and Mobile Radio Communications (PIMRC), Boston, USA, September 1998a.
Broomhead, D.S. and Lowe, D., Multivariable functional interpolation and adaptive networks. Complex Systems, 2:321–355, 1988.
Cid-Sueiro, J. and Figueiras-Vidal, A. R., Recurrent radial basis function networks for optimal blind equalization, In IEEE-SP Workshop on Neural Networks for Signal Processing, 562–571, Baltimore, MD (USA), September 1993.
Cid-Sueiro, J., Artés-Rodríguez, A. and Figueiras-Vidal, A. R., Recurrent radial basis function networks for optimal symbol-by-symbol equalization, In Signal Processing, vol. 40, no. 1, 53–63, October 1994.
Cha, I. and Kassam, S. A., Channel equalization using adaptive complex radial basis function networks, In IEEE Journal on Selected Areas in Communications, vol. 13, no. 1, 122–131, January 1995.
Chandra Kumar, P., Saratchandran, P. and Sundararajan, N., Non-linear channel equalisation using minimal radial basis function neural networks, In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’98), 3373–3376, Seattle, WA, USA, April 1998.
Chen, S., Cowan, C. F. N. and Grant, P. M., Orthogonal least squares learning algorithm for radial basis function networks, In IEEE Transactions on Neural Networks, vol. 2, 302–309, March 1991.
Chen, S. and Mulgrew, B., Overcoming co-channel interference using an adaptive radial basis function equalizer, In EURASIP Signal Processing Journal, vol. 28, no. 1, 91–107, 1992.
Chen, S., Mulgrew, B. and Grant, P. M., A clustering technique for digital communications channel equalization using radial basis function networks, In IEEE Transactions on Neural Networks, vol. 4, no. 4, 570–579, July 1993a.
Chen, S., Mulgrew, B. and McLaughlin, S., Adaptive bayesian equalizer with decision feedback, In IEEE Transactions on Signal Processing, vol. 41, no. 9, 2918–2927, September 1993b.
Chen, S., McLaughlin, S. and Mulgrew, B., Complex-valued radial basis function networks, Part II: Application to digital communication channel equalization, In Signal Processing, vol. 36, no. 2, 175–188, March 1994.
Chen, S., McLaughlin, S., Mulgrew, B. and Grant, P. M., Bayesian decision feedback equalizer for overcoming co-channel interference, In Proc. Inst. Elect. Eng., vol. 143, 219–225, August 1996.
Chen, S., Multi-output regression using a locally regularised orthogonal least-squares algorithm. IEE Proceedings - Vision, Image and Signal Processing, 149 (4):185–195, 2002.
Chen, S., Mulgrew, B. and Hanzo, L., Least bit error rate adaptive nonlinear equalisers for binary signalling, In IEE Proc. Communications, vol. 150, no. 1, 29–36, February 2003.
Cover, T.M., Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Transactions on Electronic Computers, EC-14:326–334, 1965.
Du, K.-L., Lai, A. K. Y., Cheng, K. K. M. and Swamy, M. N. S., Neural methods for antenna array signal processing: a review, In Signal Processing, vol. 82, 547–561, 2002.
El Zooghby, A. H., Christodoulou, C. G. and Georgiopoulos, M., Performance of radial basis function with antenna arrays, In IEEE Transactions on Antennas and Propagation, vol. 45, no. 11, 1611–1617, 1997.
El Zooghby, A. H., Christodoulou, C. G. and Georgiopoulos, M., Neural network-based adaptive beamforming for one- and two-dimensional antenna arrays, In IEEE Transactions on Antennas and Propagation, vol. 46, no. 12, 1891–1893, 1998.
El Zooghby, A. H., Christodoulou, C. G. and Georgiopoulos, M., A neural network-based smart antenna for multiple source tracking, In IEEE Transactions on Antennas and Propagation, vol. 48, no. 5, 768–776, May 2000.
Gan, Q., Saratchandran, P., Sundararajan, N. and Subramanian, K. R., A complex valued radial basis function network for equalization of fast time varying channels, In IEEE Transactions on Neural Networks, vol. 10, no. 4, 958–960, July 1999.
Golub, G.H. and Van Loan, C.G., Matrix computations. Johns Hopkins University Press, 1996.
Gomes, J. and Barroso, V., Using a RBF network for blind equalization: design and performance evaluation, In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’97), vol. 4, 3285–3288, April 1997.
Gomm, B.J. and Yu, D.L., Selecting radial basis function network centers with recursive orthogonal least squares training. IEEE Transactions on Neural Networks, 11 (2):306–314, 2000.
Haykin, S., Neural Networks: a comprehensive foundation. Prentice-Hall, New York, 1999.
Haykin, S., Adaptive Filter Theory. Prentice Hall, New Jersey, 2002.
Hochstadt, H., Integral equations. Wiley Classics Library. John Wiley and Sons Inc., New York, 1989. ISBN 0-471-50404-1. Reprint of the 1973 original, A Wiley-Interscience Publication.
Ibnkahla, M., Applications of neural networks to digital communications - a survey, In Signal Processing, vol. 80, 1185–1215, 2000.
Jianping, D., Sundararajan, N. and Saratchandran, P., Communication channel equalization using complex-valued radial basis function neural networks, In IEEE Transactions on Neural Networks, vol. 13, no. 3, 687–696, May 2002.
Kaminsky, E. J. and Deshpande, N., TCM decoding using neural networks, In Engineering Applications of Artificial Intelligence, vol. 16, 473–489, 2003.
Kohonen, T., The Self-Organizing Map. Proceedings IEEE, 78 (9):1464–1480, 1990.
Leong, T. K., Saratchandran, P. and Sundararajan, N., Real-time performance evaluation of the minimal radial basis function network for identification of time varying nonlinear systems, In Computers and Electrical Engineering, vol. 28, 103–117, 2002.
Lin, H. and Yamashita, K., Blind RBF equalizer for received signal constellation independent channel, In Proceedings of 8th International Conference on Communication Systems (ICCS’02), 82–86, 2002.
Lo, T., Leung, H. and Litva, J., Radial basis function neural network for direction-of-arrivals estimation, In IEEE Signal Processing Letters, vol. 1, no. 2, 45–47, February 1994.
Micchelli, C.A., Interpolation of scattered data: distance matrices and conditionally positive definite functions, 2:11–22, 1986.
Mimura, M. and Furukawa, T., A recurrent RBF network for non-linear channel, In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’01), 1297–1300, 2001.
Müller, A. and Elmirghani, J. M. H., Novel approaches to signal transmission based on chaotic signals and artificial neural networks, In IEEE Transactions on Communications, vol. 50, no. 3, 384–390, March 2002.
Mukai, R., Vilnrotter, V. A., Arabshahi, P. and Jamnejad, V., Adaptive acquisition and tracking for deep space array feed antennas, In IEEE Transactions on Neural Networks, vol. 13, no. 5, 1149–1162, September 2002.
Mulgrew, B., Applying radial basis functions, In IEEE Signal Processing Magazine, vol. 13, no. 2, 50–65, March 1996.
Mulgrew, B., Nonlinear signal processing for adaptive equalisation and multi-user detection, In Proc. IX European Signal Processing Conference (EUSIPCO’98), 537–544, Rhodes, Greece, September 1998.
Palicot, J. and Roland, C., A new concept for wireless reconfigurable receivers, In IEEE Communication Magazine, 124–132, July 2003.
Park, J. and Sandberg, I.W., Universal approximation using radial basis function networks. Neural Computation, 3:246–257, 1991.
Poggio, T. and Girosi, F., Networks for approximation and learning. Proceedings of the IEEE, 78:1481–1497, 1990.
Powell, M.J.D., Radial basis function approximations to polynomials. Numerical Analysis 1987 proceedings, 223–241, 1988.
Proakis, J. G., Digital communications, McGraw-Hill, Boston, 4th edition, 2001.
Qureshi, S. U. H., Adaptive equalization, In Proc. IEEE, vol. 73, 1349–1387, 1985.
Schmidt, R., Multiple emitter location and signal parameter estimation, In IEEE Transactions on Antennas and Propagation, vol. 34, no. 3, 276–280, March 1986.
Shertinsky, A. and Picard, R.W., On the efficiency of the orthogonal least squares training method for radial basis function networks. IEEE Transactions on Neural Networks, 7 (1):195–200, 1996.
Southall, H. L., Simmers, J. A. and O’Donnel, T. H., Direction finding in phased arrays with a neural network beamformer, In IEEE Transactions on Antennas and Propagation, vol. 43, no. 12, 1369–1374, 1995.
Vapnik, V.N., The nature of statistical learning theory. Wiley, New York, 1995.
Vapnik, V.N., Statistical learning theory. Wiley, New York, 1998.
Xu, C. Q., Law, C. L. and Yoshida, S., Interference rejection in non-Gaussian noise for satellite communications using non-linear beamforming, In International Journal of Satellite Communications and Networking, vol. 21, 13–22, 2003.
Yingwei, L., Sundararajan, N. and Saratchandran, P., Adaptive nonlinear system identification using minimal radial basis function neural networks, In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’96), vol. 6, 3521–3524, 1996.
Zhou, P. and Austin, J., Neural network approach to improving fault location in local telephone networks, In Proc. Artificial Neural Networks, 958–963, 1999.
CHAPTER 6 BIOLOGICAL CLUES FOR UP-TO-DATE ARTIFICIAL NEURONS
JAVIER ROPERO PELÁEZ, JOSE ROBERTO CASTILLO PIQUEIRA
Escuela Politécnica da Universidade de Sao Paulo, Departamento de Engenharia de Telecomunicações e Controle
Abstract: The McCulloch-Pitts neuron is a simple yet powerful building block of most present-day neural networks. However, recent advances in the neurosciences show that this classical paradigm can be improved. For example, biological synaptic weights have properties such as synaptic normalization and metaplasticity that are crucial for developing new neural network architectures. Other peculiar biological mechanisms, such as the synchronization among neurons, which allows the neuron with maximal activation to be identified, and the dual (high/low frequency) behaviour of some biological neurons, can be used to improve the performance of artificial neural networks. As an example, a new neural network that mimics the human thalamus is analyzed and tested.
D. Andina and D.T. Pham (eds.), Computational Intelligence, 131–146. © 2007 Springer.
INTRODUCTION
In 1943 Warren McCulloch and Walter Pitts published the pioneering work entitled “A Logical Calculus of the Ideas Immanent in Nervous Activity” [McCulloch and Pitts, 1943], in which they proposed that neurons are essentially binary units performing Boolean computations. Their proposal relied on the limited knowledge about the nervous system available at that time. Their work was the source of inspiration for many neural network designers: from Frank Rosenblatt, the creator of the Perceptron [Rosenblatt, 1956], a neural network for recognizing characters, to Rumelhart, Hinton and Williams [Rumelhart et al., 1986], who developed the backpropagation algorithm for adjusting the synaptic weights in multi-layered neural networks [McClelland, 1986] [McClelland and Rumelhart, 1986]. Many other neural networks inspired by the McCulloch and Pitts model were developed, such as Kohonen’s self-organizing map [Kohonen, 1982], Hopfield’s auto-associative model [Hopfield, 1982] and Grossberg’s ART [Carpenter and Grossberg, 1988]. All these neural networks essentially use the McCulloch-Pitts neuron model, although they lack many recently discovered properties of real neurons. For these networks to function properly, a great deal of mathematical machinery has to compensate for the missing properties.
The purpose of this work is to review some of the recently found properties of individual neurons that can improve the conventional model of the neuron. The following section introduces these properties in bottom-up order: first the synaptic plasticity properties, then the properties of neurons, and finally the network properties that are relevant to the functioning of individual neurons. Afterwards an updated paradigm is presented in which synapses, neurons and networks are revisited. Finally, we integrate the updated neuron model into a model of a real brain architecture, the thalamus, at the core of the brain. In the following sections the essentials are explained in the text, while the fine details are given in the figures, which are self-explanatory.
1. BIOLOGICAL PROPERTIES
1.1 Synaptic Plasticity Properties
Synaptic plasticity, the occurrence of sustained changes in synapses, was envisioned in 1894 by Santiago Ramón y Cajal, who pointed out that learning could produce changes in the communication between neurons and that these changes could be the essential mechanism of memory. In 1948 Konorski alluded to persistent plastic changes in memory, and in 1949 Hebb postulated that during learning synaptic connections are strengthened by the correlated activity of presynaptic and postsynaptic neurons. The mathematical formalization of the Hebb rule is:

$$\Delta w(n) = \eta \cdot A(n)\,B(n) \qquad (1)$$

in which the increment of weight at instant n is proportional (through the factor $\eta$) to the presynaptic activity at time n, A(n), and to the postsynaptic activity, B(n). Unfortunately this rule cannot produce reversible changes in synapses. Reversible changes appeared in later models such as Sejnowski’s covariance rule and the BCM (Bienenstock, Cooper and Munro) model [Bienestock et al., 1982]. None of these models was formulated considering the real nature of plasticity in terms of molecular interactions (see Figure 1) [Bear et al., 2001]. From this point of view it is relevant to notice that events A and B, which appear equally important in the Hebbian learning rule, are not equally important during the plastic changes that take place in real synapses. We will call this property directionality. The presynaptic event A, a presynaptic action potential, and the postsynaptic event B, a postsynaptic increase of potential above a certain value, do not play the same role in real synapses. For example, in the Hebbian learning rule, when event A is present and event B is not, no alteration of synaptic strength takes place. In these same circumstances, however, real synapses are depressed, because the low calcium concentration in the postsynaptic area leads to synaptic depression (see Figure 1).
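The behaviour of Equation (1) can be illustrated with a minimal Python sketch. The function name `hebb_delta_w`, the name `eta` for the proportionality factor and its value 0.1 are assumptions made here for illustration only; the activities are taken as binary (1 = event present, 0 = event absent).

```python
def hebb_delta_w(a, b, eta=0.1):
    """Weight increment of Eq. (1): proportional to A(n) * B(n).

    eta is a hypothetical proportionality factor chosen for illustration.
    """
    return eta * a * b


# Binary activities: 1 = event present, 0 = event absent.
cases = [(1, 1), (1, 0), (0, 1), (0, 0)]
deltas = {case: hebb_delta_w(*case) for case in cases}

for (a, b), dw in deltas.items():
    print(f"A={a} B={b} -> delta_w={dw:.2f}")
```

With non-negative activities the increment is never negative, and it is zero when A is present and B is absent, which is exactly the case in which real synapses are depressed.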
[Figure 1: schematic of a synapse with its presynaptic and postsynaptic sides, the NMDA and AMPA channels, the Na+, Ca2+ and Mg2+ ions, and the LTP and LTD outcomes.]
Figure 1. Potentiation and depression are accomplished in biological synapses in the following way. The neurotransmitter glutamate is released into the synaptic cleft from the presynaptic neuron. This neurotransmitter acts like a key that opens the postsynaptic gates (channels), which are of two types: NMDA and AMPA. To open the NMDA gates it is also necessary to expel the magnesium that blocks the NMDA channels, which requires a certain voltage to be attained in the postsynaptic area. Once the channels are open, Na+ ions can enter through both types of channel. The hydrated Ca2+ ion is large and for this reason only passes through the NMDA channels. A large amount of Ca2+ entering the postsynaptic space increases the number and size of postsynaptic channels (synaptic potentiation or LTP), while a moderate amount decreases them (synaptic depression or LTD)
When event A takes place without event B, little activation is expected in the postsynaptic neuron. When events A and B both occur, a high activation of the postsynaptic neuron is likely. In the BCM model just mentioned [Bienestock et al., 1982] these two intervals of low and high activation correspond respectively to regions of depression and increment of synaptic strength. Artola et al. [Artola et al., 1990] corroborated the BCM model in slices of rat visual cortex. This model is explained in more detail in Figure 2, where two thresholds separate the different regions: the LTD (Long Term Depression) threshold separates the region of no change of synaptic strength from the region of synaptic depression, and the LTP (Long Term Potentiation) threshold separates the region of synaptic depression from the region of synaptic potentiation.
1.1.1 Synaptic metaplasticity
Synaptic metaplasticity [Abraham and Tate, 1997] [Pompa and Friedrich, 1998] which some authors consider the plasticity of synaptic plasticity [Abraham and Bear, 1996] consists in the dependence of the LTP threshold on the initial synaptic weight. If the initial weight is low the LTP threshold is also low and when the initial weight is high the LTP threshold is placed over higher values of postsynaptic activation (see detailed explanation in Figure 3).
[Figure 2: curve of the change in synaptic strength versus postsynaptic activity, with the LTD threshold and the LTP threshold marked on the activity axis.]
Figure 2. Change of synaptic strength due to the postsynaptic activity in biological neurons. If postsynaptic activity is high, a positive change of synaptic strength takes place. If the postsynaptic activity is between the LTD and LTP thresholds, a negative change of synaptic strength takes place. For lower values of postsynaptic activity neither an increment nor a decrement of synaptic strength is observed
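The three regions of Figure 2 can be expressed as a small piecewise rule. The following Python sketch only classifies the sign of the change; the function name and the two threshold values (0.3 and 0.6, on an arbitrary activity scale) are hypothetical and are not taken from the chapter.

```python
def plasticity_region(activity, ltd_threshold=0.3, ltp_threshold=0.6):
    """Classify the change in synaptic strength for a postsynaptic activity.

    Both threshold values are illustrative assumptions.
    """
    if activity < ltd_threshold:
        return "no change"
    if activity < ltp_threshold:
        return "depression (LTD)"
    return "potentiation (LTP)"


for activity in (0.1, 0.45, 0.9):
    print(activity, "->", plasticity_region(activity))
```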
[Figure 3: two plots of the change in synaptic strength (Δw, with potentiation and depression regions) versus postsynaptic activity (voltage), one for a lower and one for a higher initial weight. The shift of the LTP threshold between them is labelled "Metaplasticity = LTP threshold variation".]
Figure 3. Synaptic metaplasticity consists in the shift of the LTP threshold according to the initial weight of the synapse. The two plots above show this idea graphically. For higher values of the initial synaptic weight the curve is elongated so that the LTP threshold corresponds to higher values of postsynaptic activity
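The threshold shift of Figure 3 can be sketched numerically. The chapter does not give a functional form at this point, so the linear dependence below, together with the names `ltp_threshold`, `base` and `slope` and their values, is purely an assumption for illustration. The postsynaptic activity is assumed to lie above the LTD threshold, so the only possible outcomes are depression or potentiation.

```python
def ltp_threshold(initial_weight, base=0.4, slope=0.4):
    """Hypothetical linear shift: the LTP threshold rises with the initial weight."""
    return base + slope * initial_weight


activity = 0.6  # same postsynaptic activity applied to both synapses
outcomes = {}
for w0 in (0.1, 0.9):
    theta = ltp_threshold(w0)
    outcomes[w0] = "potentiation" if activity >= theta else "depression"
    print(f"initial weight {w0}: LTP threshold {theta:.2f} -> {outcomes[w0]}")
```

Under these assumed values the same activity potentiates the weak synapse and depresses the strong one, which is the homeostatic effect attributed to metaplasticity.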
None of the models of synaptic weight alteration presented so far considers the property of synaptic metaplasticity. In section 2.1 we propose an alternative model that also exhibits this important property.
1.1.2 Synaptic normalization
Synaptic normalization is a property of biological synapses [Turrigiano, 1998] that consists in the normalization of the synaptic weights of a neuron. It is equivalent to the mathematical operation of dividing the synaptic weights by their norm, calculated here as the overall sum of the weights. For example, if the weights of a neuron are expressed as the vector W = (0.4, 0.5, 0.1, 0.2, 0.7, 0.1), its norm is norm(W) = 0.4 + 0.5 + 0.1 + 0.2 + 0.7 + 0.1 = 2. Dividing W by this norm, norm(W) = 2, gives the new vector W’ = (0.2, 0.25, 0.05, 0.1, 0.35, 0.05). In biological neurons the norm can be multiplied by an arbitrary factor k, so that it is calculated as $\mathrm{norm}(W) = k \sum_{i=1}^{n} w_i$, where $w_i$ is each one of the n synaptic weights. The biological counterpart of synaptic normalization is explained in more detail in Figure 4.
Synaptic metaplasticity and synaptic normalization are considered homeostatic properties of synapses. Homeostasis is the tendency of a system to return to equilibrium when it is forced away from it. In this sense, metaplasticity favours synaptic depression when the initial synaptic weight is high and synaptic potentiation when it is low. Synaptic normalization is also a homeostatic property because, when the total synaptic weight of a set of nearby synapses is altered, the tendency is to bring this set back to its original overall weight.
[Figure 4: two nearby synapses (pre-synaptic and post-synaptic terminals) shown in three stages: moderate activity, high activity (weight increase) and moderate activity again (normalization).]
Figure 4. Synaptic normalization. The figure shows the evolution of two nearby synapses before, during and after the stimulation of the second one. In the initial stage each synapse has two ionic channels, giving four ionic channels in total. During the stimulation (center) the second synapse increases its number of ion channels to six, while the first synapse keeps its two channels. The weight of the second synapse therefore becomes three times that of the first, in the proportion 6:2 = 3. After some hours the total number of channels in both synapses returns to the initial value of four: one channel in the first synapse and three in the second. The relative proportion of ion channels reached during the stimulation, 6:2 = 3, is nevertheless maintained, being 3:1 = 3 in the last stage
Up to this point some homeostatic properties of synapses have been explained. The following section starts with a homeostatic property of neurons: spike threshold adaptation.
1.2 Neuron’s Properties
1.2.1 Spike threshold adaptation
The firing rate of a biological neuron depends on the value of its synaptic inputs according to the solid line in Figure 5. If the synaptic input is high the neuron fires at its maximal rate, and if it is low the neuron stops firing. In an intermediate range the firing rate is almost proportional to the synaptic input. This is the only useful range for the activity of the neuron: if the synaptic inputs vary within either of the extreme ranges, the variation produces no change at the neuron’s output. For this reason it is very important that the neuron is tuned to the intermediate region, which is the region of maximal firing rate variation. This tuning actually occurs in real neurons according to the studies of Desai and colleagues [Desai et al., 1999]. When the synaptic input activity is very high the sigmoidal activation function shifts to the right, and when the synaptic input activity is very low the function shifts to the left (see Figure 5). The shift of the neuron’s activation function was a characteristic of Rosenblatt’s Perceptron [Rosenblatt, 1956], which was a pioneering neural network for character recognition.
[Figure 5: sigmoidal curves of firing rate versus synaptic input, shifted for low activity and for high activity.]
Figure 5. The activation function of a biological neuron can shift to the right or to the left depending on the synaptic input of the neuron. Suppose that the activation function is given by the solid line. If synaptic input levels become very high the curve shifts to the right to avoid saturation of the firing rate. Conversely, if synaptic input levels are very low the activation function shifts to the left in order to prevent the neuron from staying silent
1.2.2 Burst and tonic firing
The purpose of this section is to illustrate that neurons can have a dual frequency behaviour. Thalamo-cortical neurons, which are located in the thalamus, at the core of the brain, can fire either in tonic or in burst mode, as shown in Figure 6. The main characteristic of the tonic mode is that the spiking frequency is proportional to the stimulus, in the range of 10 to 165 Hz. In the burst mode, however, the frequency is not related to the input activation and lies in the range of 150 to 320 Hz. The burst mode is very interesting because it takes place only after a precise sequence of preliminary events: the thalamo-cortical neuron needs to be positively stimulated and afterwards inhibited for at least 100 msec. After these two events, burst firing is produced when a slight positive stimulation is given to the neuron. For a deeper study of these mechanisms see [Llinas and Jahnsen, 1982], [Llinas, 1994], [Steriade and Llinas, 1988]. The purpose of this dual behaviour is still a matter of controversy. Ropero [Ropero, 1997] proposed that the tonic mode serves for intrathalamic operations and that, when these operations are concluded, their result is relayed to the cortex via the burst firing mode.
[Figure 6: spike trains showing tonic firing, a period of inhibition longer than 100 msec, and burst firing.]
Figure 6. Some types of neurons, such as thalamo-cortical neurons, present a dual firing behaviour. In their tonic firing mode the frequency of their response is proportional to the stimulus (10–165 Hz). However, when they are stimulated and afterwards inhibited for at least 100 msec, their response changes to burst firing with much higher frequencies (150–320 Hz)
1.3 Network Properties
1.3.1 Synchronization among neurons
Some types of neurons, for example the granule cells of the olfactory bulb and the reticular cells of the thalamus, are able to synchronize their activity and afterwards oscillate together [McCormick and Pape, 1990], [Steriade et al., 1987]. One of the causes of this behaviour is that these neurons possess dendro-dendritic [Deschenes et al., 1985] electric contacts in which the potential is communicated directly from one neuron to the other without any kind of neurotransmitter in between. The situation is as if we had a set of ping-pong balls tied together by fine cords and we used two very big bats to play with them. The movement of the balls becomes more and more uniform and synchronized during the play. The kinetic energy given to the balls by each of the bats corresponds to the electric energy of the ions entering the neurons. One type of ion raises the inner potential of the neurons when it is below a certain threshold, and another type of ion reduces the potential when it is above an upper voltage threshold.
This interplay between the ions and the sharing of potential among dendro-dendritically connected neurons generates the synchronized oscillations. This behaviour was modelled and programmed in Matlab [Ropero, 2003] with the results shown in Figure 7.
1.3.2 Normalizing inhibition
Inhibitory neurons were once thought to perform only subtraction [Carandini and Heeger, 1994] over other neurons, and this property was used for biasing the neurons in conventional neural network models such as backpropagation or radial basis function networks. Biasing a neuron is equivalent to shifting its activation function to the right or to the left, in a way similar to the one explained in section 1.2.1. In real neurons this subtractive or biasing inhibition is mediated by GABA-B (GABA: gamma-aminobutyric acid). In many cases, however, inhibition is mediated by GABA-A instead of GABA-B, and the effect of GABA-A inhibition is divisive rather than subtractive. We postulate that GABA-A inhibition could scale or normalize the input patterns arriving at a certain layer of the brain. Many structures in the brain have a layered organization. The input to each layer goes to two types of neurons: (A) neurons that send an excitatory projection onto the following layer, and (B) GABA-A neurons that produce inhibition inside their own layer, thereby creating a divisive inhibitory field in the layer (see Figure 8).
Figure 7. The height of each intersection of lines over the surface represents the activation of one neuron in a 7 × 7 net of neurons. If each of the neurons in this net has an oscillatory activity and the potential of each of them is partially shared with the other neurons, their activities become synchronized. From top to bottom and from left to right, a computer simulation of the synchronization of a 7 × 7 net of neurons is shown
[Figure 8: three-layer diagram with six input neurons I1 to I6 at the bottom, six middle-layer neurons receiving excitation (+) inside a field of inhibitory interneurons, and one output neuron O at the top receiving the normalized inputs i1 to i6.]
Figure 8. Normalization of synaptic inputs due to an inhibitory field of GABA-A inhibitory interneurons. The six neurons of the lower layer of the figure form an excitatory input I = (I1, I2, I3, I4, I5, I6) impinging on a second layer of neurons (middle). This pattern produces an excitation (+) over the six neurons in the middle layer and over GABA-A inhibitory interneurons that are not shown. Once these inhibitory interneurons are activated, they create an inhibitory field that divides the activation of the middle-layer neurons by n(I) = n(I1) + n(I2) + n(I3) + n(I4) + n(I5) + n(I6). In this way the neuron at the top receives a normalized input i = (i1, i2, i3, i4, i5, i6) that is the result of dividing each of the components of pattern I by the constant n(I)
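The divisive normalization described in Figure 8 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the divisor is taken to be the plain sum of the input components scaled by a hypothetical factor `scale`, and the input values are invented for the example.

```python
def divisive_normalization(inputs, scale=1.0):
    """Divide every input component by the scaled sum of all components.

    The sum-based divisor and the scale factor are illustrative assumptions.
    """
    total = scale * sum(inputs)
    if total == 0:
        return list(inputs)  # nothing to normalize for an all-zero pattern
    return [x / total for x in inputs]


pattern = [2, 4, 6, 8, 10, 10]           # hypothetical input activities I1..I6
normalized = divisive_normalization(pattern)
doubled = divisive_normalization([2 * x for x in pattern])

print(normalized)
print(sum(normalized))
```

Doubling the intensity of the whole pattern leaves the normalized pattern unchanged, so the next layer only sees the relative distribution of the inputs.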
The activation of the excitatory and inhibitory neurons in each layer has almost the same absolute value, because the input pattern impinges on excitatory and inhibitory neurons at the same time. The divisive inhibitory field is therefore proportional to this activation and is able to produce a kind of normalization of the input patterns (see Figure 8 for more details).
2. UPDATING MC CULLOCH-PITTS MODEL
Up to this point we have introduced several properties of real neurons that are of remarkable interest for computational purposes. Using some of them, we try to update some of the characteristics of the McCulloch-Pitts paradigm of neural computation.
2.1 Up-to-date Synaptic Model
The classical model of synaptic weight alteration due to Hebb lacks many of the properties mentioned in the previous sections. Here we propose another model that not only mimics the way biological reinforcement and depression are produced but also exhibits the property of metaplasticity [Ropero and Simões, 1999]. In our model the synaptic weight between the presynaptic neuron A and the postsynaptic neuron B is calculated as:

$$w_{AB} = P(B/A) \qquad (2)$$

where B is a postsynaptic activation above a specific threshold and A a presynaptic action potential. As shown, the synaptic weight is calculated as a conditional probability. The above expression can also be written as:

$$w_{AB} = P(B/A) = \frac{n(A \cap B)}{n(A)} \qquad (3)$$

in which the operator “n( )”, number of times, quantifies how many times a certain event takes place, for example how many times event A, event B or the intersection of A and B occurs. Starting with different values of the numerator and denominator, i.e. different initial weights, and allowing the postsynaptic neuron to fire according to a non-linear squashing function (logistic), a 3-D version of Figure 2 is obtained in Figure 9. In this figure a continuous line drawn on the surface shows the evolution of the LTP threshold as a function of the initial weight. It can be noticed that a very simple statistical expression is able to account for a wide variety of properties such as reinforcement, depression and metaplasticity. Therefore, taking into account that conditional probabilities can be computed in synapses, the most obvious question that arises here is: are real synapses the tiniest pieces of the most fascinating statistical computer ever imagined?
[Figure 9: 3-D surface of the change in synaptic strength as a function of the initial weight and the normalized postsynaptic activity (voltage), for Weight = P(B/A), with a line marking the LTP threshold.]
Figure 9. The computer simulation above shows that metaplasticity takes place when the synaptic weight is calculated using the conditional probability P(B/A), B being a suprathreshold activation of the postsynaptic neuron and A the presynaptic action potential. A line joins the different LTP thresholds, each one corresponding to a different initial synaptic weight
2.2 Up-to-date Neuron Model
We propose a neuron model (see Figure 10) that uses the equation just presented for modelling the synaptic weights [Ropero, 1996]. An excitatory postsynaptic potential (EPSP) is obtained by multiplying the probability P(Ii) of an action potential in the presynaptic neuron by the corresponding weight P(O/Ii). Although in the presynaptic space there are no probabilities but action potentials at different frequencies, the product P(Ii)P(O/Ii) can approximate each of the EPSPs. Each EPSP is formed by the sum of the voltage humps produced by individual presynaptic action potentials, in a process known as temporal summation. When the humps are close together they ride over previous humps, creating a taller EPSP. When they are far apart, as happens when the frequency of presynaptic action potentials is low, they can hardly ride over each other and the resulting EPSP is low. Given that the maximal frequency of presynaptic action potentials is limited, the height of the resulting EPSP is also limited. This maximal height corresponds to a product P(Ii)P(O/Ii) of value 1. All the EPSPs travel from the dendrites to the soma, where they are summed. This sum is the so-called activation of the neuron, which is afterwards transformed into a frequency of action potentials by means of a logistic or sigmoidal function. To prevent the saturation of the weights, a normalization of the input pattern by means of divisive inhibition is commonplace in the brain.
Figure 10. Model of a neuron based on conditional probabilities for calculating the synaptic weights. Each of the three synapses shown receives a presynaptic probability P(Ii) and has a weight P(O/Ii), i = 1, 2, 3. In each synapse the probability of a presynaptic action potential is multiplied by the synaptic weight, and the result gives the postsynaptic activation in that synapse. The sum of the postsynaptic activations gives the activation of the neuron, which is calculated as P(O/I) = P(O/I1)P(I1) + P(O/I2)P(I2) + P(O/I3)P(I3)
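The neuron model of this section can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function names, the input values, the gain and threshold of the logistic function, and the choice of dividing each input by the sum of all inputs as a stand-in for divisive inhibition are all assumptions made for the example.

```python
import math

def divisive_normalization(p_inputs):
    """Divide every input by the total input (assumed form of divisive inhibition)."""
    total = sum(p_inputs)
    return [p / total for p in p_inputs] if total > 0 else list(p_inputs)

def neuron_activation(p_inputs, weights):
    """Sum of EPSPs, each approximated by the product P(Ii) * P(O/Ii)."""
    return sum(p * w for p, w in zip(p_inputs, weights))

def firing_rate(activation, gain=10.0, threshold=0.5):
    """Logistic function mapping activation to a normalized firing frequency.

    gain and threshold are illustrative values, not taken from the chapter.
    """
    return 1.0 / (1.0 + math.exp(-gain * (activation - threshold)))

# Hypothetical presynaptic probabilities P(Ii) and synaptic weights P(O/Ii).
p_inputs = [0.9, 0.3, 0.6]
weights = [0.8, 0.1, 0.5]

normalized = divisive_normalization(p_inputs)
activation = neuron_activation(normalized, weights)
rate = firing_rate(activation)
print(normalized, activation, rate)
```

Because the weights are conditional probabilities bounded by 1 and the normalized inputs sum to 1, the activation in this sketch cannot exceed 1.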
2.3 Up-to-date Network Model
If the same pattern is input to several neurons instead of only one, a competitive process can take place so that only one neuron, the one whose activation is maximal, wins the competition. When the winner fires, the remaining neurons are kept silent. Silencing the non-winning neurons is usually done by inhibitory feedback or lateral inhibition. To prevent a single neuron from winning for every pattern, the probabilistic synapses should be normalized over time (see Figure 11). This is one of the possible roles of biological synaptic normalization: giving every neuron the same opportunity to fire. But what biological mechanisms are involved in selecting the winning neuron? Section 2.3.1 introduced the synchronized oscillation of neurons, a mechanism found at least in the thalamus and the olfactory bulb. This synchronized oscillation can allow the neuron with maximal activation to be found: if a common oscillatory potential is added to the activations of a layer of neurons, the neuron whose total activation first reaches a certain firing threshold is also the one with the largest activation [Ropero, 2003].
[Figure 11 panels: three neurons a, b and c receive the same input frequencies t1 = 0.2, t2 = 0.4, t3 = 0.5. Neuron a has weights (0.6, 0.4, 0.2) and activation y1 = ∑j w1j·tj = 0.38. Neuron b has weights (0.2, 0.4, 0.6) and activation y2 = ∑j w2j·tj = 0.50. Neuron c has weights (0.4, 0.6, 0.2) and activation y3 = ∑j w3j·tj = 0.42.]
Figure 11. Synaptic normalization allows a competitive process among neurons. The neuron whose synaptic weight distribution wij is most similar to the input pattern of frequencies T = (t1, t2, t3) is also the one with maximal activation. This is the case of neuron b, whose weights [0.2, 0.4, 0.6] are most similar to the vector T = (0.2, 0.4, 0.5). Therefore the sum of the products of the input frequencies and its weights yields the maximal value. Notice that, due to the synaptic normalization, the number of ionic channels is the same in the three neurons. In summary, synaptic normalization is the property that allows the neuron whose weight distribution is most similar to the input pattern to exhibit the maximal activation
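The competition of Figure 11 can be reproduced with a short Python sketch that uses the weights and input frequencies of the figure. The function `first_to_fire` is a hypothetical illustration of the oscillation argument: it adds a common, linearly rising potential to every neuron until one of them reaches a threshold. The linear ramp, the threshold of 1.0 and the step of 0.01 are assumptions chosen for the example, not values from the chapter.

```python
def activation(weights, inputs):
    """Weighted sum y_i = sum_j w_ij * t_j."""
    return sum(w * t for w, t in zip(weights, inputs))

def first_to_fire(activations, threshold=1.0, step=0.01):
    """Raise a common potential step by step; return the neurons that cross first.

    The linear ramp stands in for the rising phase of a common oscillation.
    """
    k = 0
    while True:
        ramp = k * step
        fired = [name for name, y in activations.items() if y + ramp >= threshold]
        if fired:
            return fired
        k += 1

T = [0.2, 0.4, 0.5]                      # input frequencies of Figure 11
W = {"a": [0.6, 0.4, 0.2],               # each weight vector sums to 1.2
     "b": [0.2, 0.4, 0.6],
     "c": [0.4, 0.6, 0.2]}

y = {name: activation(w, T) for name, w in W.items()}
winner = max(y, key=y.get)               # winner-take-all selection
print(y, winner, first_to_fire(y))
```

Both selection rules pick neuron b: taking the maximum directly and waiting for the first threshold crossing are equivalent when the same potential is added to every neuron.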
3. JOINING THE BLOCKS: A NEURAL NETWORK MODEL OF THE THALAMUS
Probabilistic synapses, synchronized oscillations, weight normalization and normalizing inhibition are all properties that were used to implement a realistic computational model of the thalamus. The thalamus is a structure at the core of the brain that relays sensory information from the senses to the cortex. The function of this structure was unknown, and the model we propose helps to understand the role of the thalamus in the computation performed by the brain [Ropero, 1996], [Ropero, 1997], [Pelaez, 2000]. The thalamus is basically a two-layered brain structure. The first layer is formed by thalamo-cortical neurons that receive sensory patterns and, after approximately 100 ms, send the result of the inner thalamic computation to the cortex. The second layer is formed by reticular neurons that oscillate synchronously and perform a competitive process by which each neuron fires in the presence of specific characteristics of the input patterns. When several of these neurons fire, they produce several inhibitory masks that, when superposed, create a negative replica of the input pattern over the first layer (see Figure 12). If the input pattern is damaged or noisy, the negative replica recreates a perfect version of the input, without defects or noise. Pattern reconstruction and noise rejection are two of the tasks that we postulate the thalamus is able to perform. For these tasks, a process of learning must take place at the level of the thalamus. Our computer model of the thalamus, programmed in Matlab, has these two layers, each with 9 × 9 = 81 neurons. The two layers are completely interconnected with each other, which gives 2 × 81 × 81 = 13122 connections. The model learned 36 characters over several epochs and is able to recognize and complete damaged or noisy patterns (see Figure 12).
The learning capability of the model suggests that the real thalamus also has learning capabilities, a fact that has been completely ignored until now in thalamus research.
4. CONCLUSIONS
In this review we have presented several properties of synapses, neurons and networks that were not considered in previous neural network models but that have interesting computational potential. The McCulloch-Pitts neuron model was based on the restricted knowledge about neurons that existed in the 1940s. Nowadays, a more comprehensive knowledge of the properties of neurons can be used to update the McCulloch-Pitts model. In the case of synaptic plasticity, we presented several properties of synaptic weights, such as directionality, the existence of both potentiation and depression thresholds, metaplasticity and normalization. Regarding neurons, relevant properties were introduced, such as spike threshold adaptation and the dual behaviour in frequency of some types of neurons. Finally, concerning networks of neurons, we studied the synchronization of a set of neurons and the normalizing inhibition produced by a set of GABA-A neurons over the input pattern of another neuron.
Figure 12. A biologically realistic computer model of the thalamus constituted by two layers of 9 × 9 = 81 neurons each. An example of the pattern reconstruction capability of the thalamus model is shown. (a) After the model has been trained with 36 different characters (letters and numbers), a very noisy and damaged test pattern that vaguely resembles a B is input. (b) An “I”-shaped sustained feedback inhibition over the first layer is produced by a reticular neuron in the second layer. After firing, the reticular neuron remains refractory. This inhibition reduces the subsequent activation in the first layer. (c) Another neuron fires and immediately enters the refractory period, producing another sustained inhibition that is superposed on the previous one. Together, both inhibitions are shaped like an E. (d) Finally, another reticular neuron fires and the total inhibition completely reconstructs the letter B, showing the reconstruction capability of the thalamic model. The central figure of each screen gives the value of the activations of a net of reticular neurons
With all these elements in mind we proposed a new equation for synaptic reinforcement based on conditional probabilities. The paradigm of a neuron was also modified, taking into account that the neuron is always integrated in a network. For example, if the neuron were detached from the inhibitory field that normalizes its inputs, its active synaptic weights would increase without bound and the neuron would be saturated most of the time. It was also shown that the normalization of synaptic weights is an important condition for allowing a competitive process between neurons. An example of such competition, and of all the mentioned properties working together, is the model of the thalamus that we programmed in Matlab. It learned 36 characters and exhibits the property of completing damaged or noisy patterns.
We expect the reader to benefit from this paper's account of recently found neural properties when creating new artificial neural networks or trying to emulate the functioning of the brain.
REFERENCES
Abraham, W.C., and Bear, M.F. (1996) Metaplasticity: the plasticity of synaptic plasticity. Trends in Neuroscience 19:126–130.
Abraham, W.C., and Tate, W.P. (1997) Metaplasticity: a new vista across the field of synaptic plasticity. Progress in Neurobiology 52:303–323.
Artola, A., Brocher, S., and Singer, W. (1990) Different voltage-dependent threshold for inducing long-term depression and long-term potentiation in slices of rat visual cortex. Nature 347:69–72.
Bear, M.F., Connors, B.W., and Paradiso, M.A. (2001) Neuroscience: Exploring the Brain. Lippincott, Williams & Wilkins, USA.
Bienenstock, E.L., Cooper, L.N., and Munro, P.W. (1982) Theory for the development of neuron selectivity: orientation specificity and binocular interaction in visual cortex. The Journal of Neuroscience 2(1):32–48.
Carandini, M., and Heeger, D.J. (1994) Summation and division by neurons in primate visual cortex. Science 264(5163):1333–6.
Carpenter, G., and Grossberg, S. (1988) The ART of adaptive pattern recognition by a self-organizing neural network. Computer 21(3):77–88.
Deschenes, M., Madariaga-Domich, A., and Steriade, M. (1985) Dendrodendritic synapses in the cat reticularis thalami nucleus: a structural basis for thalamic spindle synchronization. Brain Research 334:165–168.
Desai, N.S., Rutherford, L.C., and Turrigiano, G.G. (1999) Plasticity in the intrinsic excitability of cortical pyramidal neurons. Nature Neuroscience 2:515–520.
Hopfield, J.J. (1982) Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences 79:2554–2558.
Kohonen, T. (1982) Self-organized formation of topologically correct feature maps. Biological Cybernetics 43:59–69.
Llinás, R., and Jahnsen, H. (1982) Electrophysiology of mammalian thalamic neurones in vitro. Nature 297:406–408.
Llinás, R., Ribary, U., Joliot, M., and Wang, X.J. (1994) Content and context in temporal thalamocortical binding. In G. Buzsaki et al. (Eds.), Temporal Coding in the Brain (pp. 151–72). Berlin: Springer-Verlag.
McClelland, J.L., Rumelhart, D.E., and The PDP Research Group (1986) Parallel distributed processing: Exploration in the microstructure of cognition. Cambridge, MA: MIT Press.
McClelland, J.L., and Rumelhart, D.E. (1988) Explorations in parallel distributed processing. Cambridge, MA: MIT Press.
McCormick, D.A., and Pape, H.-C. (1990) Properties of a hyperpolarization activated cation current and its role in rhythmic oscillation in thalamic relay neurons. Journal of Physiology (London) 431:291–318.
McCulloch, W., and Pitts, W. (1943) A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 1943.
Ropero Peláez, J. (1996) A formal representation of thalamus and cortex computation. Proceedings of the International Conference of Brain Processes, Theories and Models. Edited by Roberto Moreno-Díaz and José Mira-Mira. MIT Press.
Ropero Peláez, J. (1997) Plato's theory of ideas revisited. Neural Networks, 1997 Special Issue 10(7):1269–1288.
Ropero Peláez, J., and Godoy Simoes, M. (1999) A computational model of synaptic metaplasticity. Proceedings of the International Joint Conference of Neural Networks 1999. Washington DC.
Ropero Peláez, J. (2000) Towards a neural network based therapy for hallucinatory disorders. Neural Networks, 2000 Special Issue 13(2000):1047–1061.
Ropero Peláez, J. (2003) PhD Thesis in Neuroscience: Aprendizaje en un modelo computacional del tálamo. Faculty of Medicine, Autónoma University of Madrid.
Rosenblatt, F. (1956) The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review 65:386–408.
Steriade, M., Domich, L., Oakson, G., and Deschenes, M. (1987) The deafferented reticular thalamic nucleus generates spindle rhythmicity. The Journal of Neurophysiology 57:260–273.
Steriade, M., and Llinás, R.R. (1988) The functional state of the thalamus and the associated neuronal interplay. Physiological Review 68(3):649–739.
Tompa, P., and Friedrich, P. (1998) Synaptic metaplasticity and the local charge effect in postsynaptic densities. Trends in Neuroscience 21(3):97–101.
Turrigiano, G.G., Leslie, K.R., Desai, N.S., Rutherford, L.C., and Nelson, S.B. (1998) Activity-dependent scaling of quantal amplitude in neocortical neurons. Nature 391:892–896.
CHAPTER 7 SUPPORT VECTOR MACHINES
JAIME GÓMEZ SÁENZ DE TEJADA (1), JUAN SEIJAS MARTÍNEZ-ECHEVARRÍA (2)
(1) Escuela Politécnica Superior, Universidad Autónoma de Madrid
(2) Escuela Técnica Superior de Ingenieros de Telecomunicaciones, Universidad Politécnica de Madrid
Abstract:
Support Vector Machines are the most recent algorithm in the Machine Learning community. After a bit less than a decade of life, the method has displayed many advantages with respect to the best older methods: generalization capacity, ease of use and solution uniqueness. It has also shown some disadvantages: a limit on the amount of data it can handle and slow training. However, these disadvantages will be overcome in the near future, as computer power increases, leaving an all-purpose learning method that is both cheap to use and gives the best performance. This chapter provides an overview of the main SVM configuration, its mathematical applications and the easiest implementation.
Keywords:
Support Vector Machines, Machine Learning
D. Andina and D.T. Pham (eds.), Computational Intelligence, 147–191. © 2007 Springer.
INTRODUCTION
Machine Learning has become one of the main fields in artificial intelligence today. Whether in pattern recognition or in function estimation, statistical Machine Learning tries to find a numerical hypothesis that fits the given data correctly, that is, machines able to generalize the statistical distribution of a representative data set. Once we have generated the hypothesis, future unknown patterns following the same distribution should be correctly classified. From the principles of statistical mechanics, a handful of algorithms have been devised to solve the classification problem, such as decision trees, k-nearest neighbour, neural networks, Bayesian classifiers, radial basis function classifiers and, as a newcomer, support vector machines (from now on SVM).
The basic SVM is a supervised classification algorithm introduced by Vladimir Vapnik, motivated by VC (Vapnik-Chervonenkis) theory [Vapnik, 1995], from which the Structural Risk Minimization concept was derived. In the late 70's, Vapnik studied the numeric solution of convex quadratic problems applied to Machine Learning, and defined an immediate ancestor of SVM called 'Generalized Portrait'. In the early 90's, Vapnik joined Bell Laboratories, where his ideas evolved until the term 'support vector machines' was coined in 1995. Nevertheless, the basic mathematics behind SVM were developed much earlier. The idea of building a hyperplane in a space other than the input space in order to separate data in the input space, which is the heart of SVM, dates from 1964. The study of convex quadratic problems gave the Karush-Kuhn-Tucker optimality conditions in 1939, while the definition of valid kernel functions for the transformation described above was formulated by Mercer in 1909.
This chapter provides an introductory view of SVM, so that any computer scientist or engineer can develop their own SVM implementation and apply it to any real-world machine-learning problem. For that purpose, we will sacrifice some mathematical completeness for the sake of clarity. The chapter has four sections: first, the SVM is defined and analysed; second, the main mathematical uses of the SVM principle are developed; third, SVM and neural networks are compared; last, the best current implementation approach is shown.
Support Vector Machines are easy to understand, not too difficult to implement, and child's play to use. If you need a generic Machine Learning method, forget about neural networks or any other method you previously learnt: the SVM family globally outperforms them all.
1. SVM DEFINITION
1.1 Structural Risk
Classifiers with a large number of adjustable parameters (and therefore great capacity) will most probably overfit: they learn the training data set without errors but generalize poorly. Conversely, a classifier with insufficient capacity will not be able to generate a hypothesis complex enough to model the data. A mid-point must be found where the adjustable parameters are neither too many nor too few, for both the training and the test set. For that reason, it is essential to choose the kind of functions a learning machine can implement. For a given problem, the machine must have a low classification error and also a small capacity. Capacity is defined as the ability of a given machine to learn any training set without errors. For example, the 1-nearest neighbour classifier has infinite capacity, but is a poor classifier for unseen test data with complex distributions and noisy sets. A machine with great capacity will tend to overfit the data, which makes it useless because it does not learn. For extended information about these issues, see [Burges, 1998].
There are a handful of mathematical bounds that relate the learning ability of a machine to its performance. The underlying theory tries to find under which circumstances, and how fast, the performance measure converges as the number of training data increases. In the limit, with an infinite number of points, we could have a correct performance value rather than just an estimate. With respect to the SVM, we will use one bound in particular, which will take us to the Structural Risk Minimization (SRM) principle [Vapnik, 1995].
Suppose we have $l$ observations as input data in the training phase. Each observation consists of a pair of values $(x_i, y_i)$, where $x_i \in \mathbb{R}^n$, $i = 1, \dots, l$, is a vector and $y_i \in \{+1, -1\}$ is the fixed associated label, given by a consistent data source. We assume there is an unknown probability distribution $P(x, y)$ from which the data points have been drawn. Data are always assumed to be independently drawn and identically distributed.
Suppose we have a machine whose task is to learn the mapping $x_i \to y_i$. This generic machine is really defined by a set of possible mappings $x \to f(x, \alpha)$, where the functions $f(x, \alpha)$ are generic and are defined by the set of adjustable parameters $\alpha$. The machine is deterministic by definition: for a given input vector $x_i$ and a parameter set $\alpha$, we will always obtain the same output $f(x_i, \alpha)$. Choosing the parameter set $\alpha$ gives a trained machine. For example, a neural network with a fixed architecture and fixed weights (parameter set $\alpha$) would be a trained machine as defined in these paragraphs. Thus, the expected error in the test phase for a trained machine is:
(1) $R(\alpha) = \int \frac{1}{2} |y - f(x, \alpha)| \, dP(x, y)$
The value $R(\alpha)$ is called the expected risk. Nevertheless, this expression is difficult to use because the probability distribution $P(x, y)$ is unknown. Thus, a variation of the formula is developed to use the finite number of available observations. It is called the empirical risk, and is defined as the measured mean error rate on the training set:
(2) $R_{emp}(\alpha) = \frac{1}{2l} \sum_{i=1}^{l} |y_i - f(x_i, \alpha)|$
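As a minimal sketch of equation (2), the following Python code computes the empirical risk of a trained machine on a small training set. The threshold classifier `threshold_machine`, its parameter value and the four training points are hypothetical and serve only to illustrate the formula.

```python
def empirical_risk(labels, predictions):
    """Equation (2): mean of |y_i - f(x_i, alpha)| / 2 over the training set."""
    l = len(labels)
    return sum(abs(y - f) for y, f in zip(labels, predictions)) / (2 * l)

def threshold_machine(x, alpha):
    """Hypothetical trained machine f(x, alpha): +1 if x[0] >= alpha, else -1."""
    return 1 if x[0] >= alpha else -1

# Hypothetical one-dimensional training set with labels in {+1, -1}.
X = [(0.5,), (1.5,), (2.5,), (3.5,)]
Y = [-1, -1, 1, -1]

alpha = 2.0
predictions = [threshold_machine(x, alpha) for x in X]
risk = empirical_risk(Y, predictions)
print(predictions, risk)
```

Since labels and outputs take values in {+1, -1}, each misclassified point contributes |y - f| = 2, so the factor 1/2 turns the sum into an error rate: one error in four points gives a risk of 0.25.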
$R_{emp}(\alpha)$ is a fixed number for a given parameter set $\alpha$ and training set. It has been shown in [Vapnik, 1995] that the following bound holds:
(3) $R(\alpha) \le R_{emp}(\alpha) + g(h)$
where $g(h)$ is a real number that is directly related to the VC dimension $h$. Again, a learning machine is defined as a set of parameterised functions (called a family of functions) having a similar structure. The Vapnik-Chervonenkis (VC) dimension is a non-negative integer that measures the capacity previously defined. The VC dimension of a given learning machine is defined as the maximum number of points that can be correctly classified, whatever their labels, using functions belonging to the family. In other words: if the VC dimension is $h$, then there exists a set of $h$ points that can be classified with functions of the family regardless of the point labels. Note that, first, there cannot exist a set of $h + 1$ points satisfying this condition; second, only one set of $h$ points is needed for the definition to apply (it does not say "for all sets of h points").
Figure 1. Three wisely chosen points
Let's try an example. Suppose we are in $\mathbb{R}^2$ and the learning machine L1 is defined as the set of "one straight line" classifiers. In Figure 1 we choose three points. We see (you can try) that, for every combination of labels (8 possible combinations using 3 points with two labels), the points can be separated using one straight line. Each combination may need a different straight line, but that line is still a member of the family. Therefore the VC dimension of this learning machine is at least 3. If we try 4 points (any set of 4 points), we will not be able to satisfy all label combinations, so we can state that "one straight line" classifiers in $\mathbb{R}^2$ have VC dimension equal to 3.
Another example. Suppose we are in $\mathbb{R}^2$ and the learning machine L2 is defined as the set of "two-segment line" classifiers (continuous but not differentiable at the joint point). In Figure 2 we choose five points. Again, try all possible label combinations (now 32). Using a two-segment line you can separate all 32 cases, but it would not be possible to do so for 6 points, however well chosen. Therefore the VC dimension of this learning machine is 5.
Figure 2. Five wisely chosen points
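The straight-line example can be checked numerically. The sketch below is an illustration under stated assumptions, not a proof: it uses a simple perceptron as a separability test, the point coordinates are hypothetical (they are not the points of Figure 1), and a failure to converge within `max_epochs` is only evidence of non-separability. For the "exclusive-or" labelling of the four corners of a square, non-separability by a straight line is a well-known fact.

```python
from itertools import product

def perceptron_separates(points, labels, max_epochs=1000):
    """Return True if a perceptron finds a separating line within max_epochs."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for (x1, x2), y in zip(points, labels):
            if y * (w[0] * x1 + w[1] * x2 + b) <= 0:   # misclassified point
                w[0] += y * x1
                w[1] += y * x2
                b += y
                errors += 1
        if errors == 0:
            return True
    return False

def shattered(points):
    """True if every labelling of the points is separated by one straight line."""
    return all(perceptron_separates(points, labels)
               for labels in product([1, -1], repeat=len(points)))

three_points = [(0, 0), (1, 0), (0, 1)]            # not collinear
four_points = [(0, 0), (1, 0), (0, 1), (1, 1)]     # corners of a square

print(shattered(three_points))   # all 8 labellings are separable
print(shattered(four_points))    # the exclusive-or labelling is not
```

The first call supports the statement that the VC dimension of straight-line classifiers in the plane is at least 3. The second call shows only that this particular set of four points is not shattered; the claim that no set of four points can be shattered is the theoretical result quoted in the text.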
Figure 3. Training set and two valid classifiers, "straight-line" (dashed line) and "two-segment-line" (solid line)
When a problem can be solved by different learning machines, as in Figure 3, which one is better? The SRM principle tries to find the learning machine with the lowest VC dimension that correctly classifies all data points. The consequences are analysed in section 2.12. As regards the SVM definition, the SRM principle and the VC dimension concept require that the chosen classifier be the one with the largest margin (linear SVMs use the family of linear hyperplanes in input space), which is defined in the next section.
1.2 Linear SVM for Separable Data
The simplest case of an SVM is that of a linear machine trained with a separable data set (see Figure 4a).
Figure 4a. Linear separable training set
Suppose we have a training data set made of pairs $(x_i, y_i)$, $i = 1, \dots, l$, such that $x_i \in \mathbb{R}^d$ and $y_i \in \{+1, -1\}$. Suppose there exists a hyperplane in $\mathbb{R}^d$ which separates positive from negative examples (according to their $y_i$ value). Points that lie exactly on the hyperplane satisfy the condition:
(4) $w \cdot x + b = 0$
where $w$ is a vector perpendicular to the hyperplane (regardless of its norm), $|b| / \|w\|$ (the absolute value of $b$ divided by the norm of $w$) is the distance from the origin to the hyperplane, and the operator $\cdot$ is the dot product in the Euclidean space to which the data belong (we will use the scalar product between two d-dimensional vectors). Let $d_+$ ($d_-$) be the shortest distance between the plane and a positive (negative) example; the margin of the hyperplane is defined as $d_+ + d_-$. By maximizing the classifier margin, we decrease the risk bound defined in (3). This is the basis for the following SVM mathematical development. For the linear and separable case, the SVM algorithm calculates the separator hyperplane that maximizes the classifier margin. Thus, all training data must satisfy the following constraints:
(5) $w \cdot x_i + b \ge +1$ for $y_i = +1$
(6) $w \cdot x_i + b \le -1$ for $y_i = -1$
which can be formulated in one expression:
(7) $y_i (w \cdot x_i + b) - 1 \ge 0 \quad \forall i$
All points for which equality holds in (5) lie on the hyperplane $H_1: w \cdot x_i + b = 1$, which is parallel to the separator hyperplane and at distance $|1 - b| / \|w\|$ from the origin. In much the same way, the points for which equality holds in (6) lie on the hyperplane $H_2: w \cdot x_i + b = -1$, which is parallel to $H_1$ and to the separator hyperplane and at distance $|-1 - b| / \|w\|$ from the origin. Thus $d_+ = d_- = 1 / \|w\|$, and so the margin is $2 / \|w\|$. We must find the pair of planes $H_1$, $H_2$ that maximizes the margin, that is, that minimizes $\|w\|^2$ subject to the constraints defined in inequality (7). Note that, in the training phase, no data point lies between $H_1$ and $H_2$ or on the wrong side of its class plane (that is the reason for calling it the separable case).
The points that satisfy the equality in (7) (those placed on $H_1$ or $H_2$) and that, if removed from the training set, would give a different solution (by definition they would change $d_+$ or $d_-$) are called support vectors. The name comes from the fact that the learning machine is completely defined by these points and their weight on the hyperplane. All other training points, which are at a greater distance from the hyperplane than the support vectors, serve no purpose: if we had begun the training without them, the solution would have remained the same (see Figure 4b).
Figure 4b. Linear SVM classifier. Support vectors are encircled, the margin is shown with two dashed lines and the separator hyperplane is shown with a solid line
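The constraints in (7) and the margin 2/‖w‖ are easy to evaluate for a given hyperplane. The following Python sketch does so for the three points used later in the optimisation example of this chapter, with the hyperplane w = (-2, 0), b = 3 that the example reaches at its optimum. The helper names are hypothetical.

```python
import math

def dot(u, v):
    """Scalar product of two vectors of the same dimension."""
    return sum(a * b for a, b in zip(u, v))

def constraint_values(w, b, X, Y):
    """Left-hand side of (7) for every point: y_i (w . x_i + b) - 1."""
    return [y * (dot(w, x) + b) - 1 for x, y in zip(X, Y)]

def margin(w):
    """Distance between the planes H1 and H2: 2 / ||w||."""
    return 2.0 / math.sqrt(dot(w, w))

X = [(1, 1), (2, 1), (3, 1)]      # training points
Y = [1, -1, -1]                   # their labels
w, b = (-2.0, 0.0), 3.0           # separator hyperplane w . x + b = 0

values = constraint_values(w, b, X, Y)
on_margin = [i for i, v in enumerate(values) if abs(v) < 1e-12]
print(values, on_margin, margin(w))
```

All values are non-negative, so the data are separated. The first two points give exactly zero, which places them on H1 and H2, while the third point lies beyond its class plane. The margin of 1 equals the distance between the first two points along the first coordinate.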
The problem can be reformulated using Lagrange multipliers. This makes it easier to add constraints to the problem and lets the training data appear only in the form of dot products between vectors, which in turn will let us generalize the SVM algorithm to the non-linear case. The general rule for creating the Lagrange formulation is: for constraints of the type $c_i \ge 0$, the constraint equation is multiplied by a non-negative Lagrange multiplier and subtracted from the objective function. Thus, we introduce non-negative Lagrange multipliers $\alpha_i$, $i = 1, \dots, l$, one for each constraint in inequality (7), that is, one for each training point. The Lagrangian we obtain is:
(8) $L_P = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{l} \alpha_i y_i (w \cdot x_i + b) + \sum_{i=1}^{l} \alpha_i$
We want to minimize $L_P$ with respect to $w$ and $b$ (the variables that define the plane), and require that the partial derivatives of $L_P$ with respect to the $\alpha_i$ be 0. This is a convex quadratic optimisation problem, because the objective function is convex and the constraints also define a convex set [Burges, 1998]. This means we can solve the problem using the dual formulation [Fletcher, 1987]. This Wolfe dual formulation has the following property: the maximization of $L_D$ (in contrast with the primal formulation $L_P$) under the defined constraints occurs at the same values of $w$ and $b$ as the minimization of $L_P$ described above. All partial derivatives must be zero at the optimum. Calculating the partial derivatives of $L_P$ with respect to $w$ and $b$ and setting them to zero, we obtain the following conditions:
(9) $w = \sum_{i=1}^{l} \alpha_i y_i x_i$
(10) $\sum_{i=1}^{l} \alpha_i y_i = 0$
which, substituted into equation (8), give:
(11) $L_D = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j$
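The substitution that leads from (8) to (11) can be written out term by term. The following lines are a sketch of that algebra, using only equations (8), (9) and (10):

```latex
\begin{align*}
\tfrac{1}{2}\|w\|^2
  &= \tfrac{1}{2}\Big(\sum_{i=1}^{l}\alpha_i y_i x_i\Big)\cdot\Big(\sum_{j=1}^{l}\alpha_j y_j x_j\Big)
   = \tfrac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}\alpha_i\alpha_j y_i y_j\, x_i\cdot x_j
   && \text{by (9)} \\
\sum_{i=1}^{l}\alpha_i y_i\,(w\cdot x_i + b)
  &= \sum_{i=1}^{l}\sum_{j=1}^{l}\alpha_i\alpha_j y_i y_j\, x_i\cdot x_j
   + b\sum_{i=1}^{l}\alpha_i y_i
   = \sum_{i=1}^{l}\sum_{j=1}^{l}\alpha_i\alpha_j y_i y_j\, x_i\cdot x_j
   && \text{by (9) and (10)} \\
L_D
  &= \tfrac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}\alpha_i\alpha_j y_i y_j\, x_i\cdot x_j
   - \sum_{i=1}^{l}\sum_{j=1}^{l}\alpha_i\alpha_j y_i y_j\, x_i\cdot x_j
   + \sum_{i=1}^{l}\alpha_i \\
  &= \sum_{i=1}^{l}\alpha_i
   - \tfrac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}\alpha_i\alpha_j y_i y_j\, x_i\cdot x_j
\end{align*}
```

The term in $b$ vanishes because of (10), which is why $b$ does not appear in the dual objective and has to be recovered afterwards from the support vectors.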
Therefore, the problem is now written as "Maximize $L_D$ with respect to all $\alpha_i$, satisfying conditions (7) and (10)". There is a Lagrange multiplier for each training point, but only those with $\alpha_i > 0$ matter in calculating the separator hyperplane with equation (9). These are the support vectors, which were defined in previous paragraphs. The geometric interpretation of (11) is easier if the second term is rewritten using (9). Suppose we are in an intermediate optimisation state, and we want to calculate the second term for the point $i = 0$. The term is:
$\sum_{j=1}^{l} \alpha_0 \alpha_j y_0 y_j \, x_0 \cdot x_j = \alpha_0 y_0 \sum_{j=1}^{l} \alpha_j y_j \, x_0 \cdot x_j = \alpha_0 y_0 \, x_0 \cdot \sum_{j=1}^{l} \alpha_j y_j x_j = \alpha_0 y_0 \, x_0 \cdot w$
The scalar product of a point and a vector normal to the hyperplane gives the projection of the point onto that vector, that is, the relative distance between the point and the hyperplane. The concept underlying the formula is:
term = $\alpha_0$ × (correctness of classification) × (distance between the point and the currently defined separator plane)
At each step $i$, the relation between $x_i$ and the current-state $w$ is calculated. Therefore, we can deduce some hand-made optimisation rules:
A) If the classification is correct, the term is negative, so $\alpha_i$ should decrease, and thus reduce its weight (its importance) in the calculation of the current $w$, in case the optimum has not been reached.
B) If the distance is large with respect to other points of the same class, and the point is correctly classified, $\alpha_i$ should decrease, while the multiplier $\alpha_k$ of another same-class point $k$ closer to the margin should increase.
Note that when evaluating the correctness of a point during the training phase, the point itself is used. If a point is misclassified, the algorithm will increase its multiplier as much as needed, forcing the hyperplane definition until the condition for this point is satisfied. For the linearly separable case this strategy is valid, because sooner or later the point must be correctly classified. But for non-linear or non-separable cases, this strategy may give poor results. If there is some noise in the training data, the algorithm will try to force the hyperplane definition to classify points that are wrong. This will overfit the data, so the performance will be poorer. Therefore, the SVM training algorithm consists of the following basic steps:
1. Identify all training data points and their labels.
2. Optimize (maximize) the dual Lagrangian while maintaining the constraints defined in (7) and (10). For that purpose, there are many convex quadratic optimisation methods described in the mathematical literature [Fletcher, 1987]. The result of the optimisation phase is the set of all Lagrange multiplier values $\alpha_i$. Basic optimisation methods are severely limited by the resources (time and memory) they need in big problems (more than 10,000 patterns). Thus, at the beginning of SVM history, efficient optimisation algorithms were the basic research line. In section 5 the best SVM algorithm, SMO, will be shown.
3. Discard all points which are not support vectors after the training process (i.e. those with $\alpha_i = 0$), and calculate the values of $w$ and $b$ from the support vectors and formulas (9) and (7). We will then have a completely defined optimum separator hyperplane.
1.3 Karush-Kuhn-Tucker Conditions
The Karush-Kuhn-Tucker (KKT) conditions are necessary and sufficient conditions for a point to be a solution of the problem defined in step 2 of the previous algorithm. This solution identifies the optimum value of the objective function $L_P$ with respect to all available parameters (all $\alpha_i$). Many SVM algorithms use the KKT conditions to check whether the current state of the machine is the optimum and, if it is not, which points violate the optimality conditions the most. For the basic SVM definition given in this chapter, the optimality conditions are:
(9) $w = \sum_{i=1}^{l} \alpha_i y_i x_i$
(10) $\sum_{i=1}^{l} \alpha_i y_i = 0$
(7) $y_i (w \cdot x_i + b) - 1 \ge 0$
(7 bis) $\alpha_i \, [y_i (w \cdot x_i + b) - 1] = 0, \quad \alpha_i \ge 0$
Most of them have been introduced in previous sections of this chapter, but they are repeated here for better comprehension of the optimisation process. The new equation (7 bis) is easy to interpret. It concerns the points for which equality must hold in inequality (7). It can be stated in the following words: "For any training point, either equality holds in inequality (7), or its Lagrange multiplier is zero, i.e. $\alpha_i = 0$". If equality holds in (7) and $\alpha_i > 0$, then the point is on the margin hyperplane and is a support vector. It can also happen that both conditions hold, that is, equality holds in (7) and $\alpha_i = 0$. In that case, the point is on the margin hyperplane of its class but is not needed for the hyperplane definition, and therefore it is not a support vector.
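The four conditions can be turned into a simple optimality test. The sketch below is a hypothetical helper, not part of any SVM library: it receives a candidate set of multipliers together with w and b, and lists the conditions that fail. The tolerance `tol` and the wording of the messages are assumptions of the example. The data are the three points of the optimisation example that follows in this chapter.

```python
def dot(u, v):
    """Scalar product of two vectors of the same dimension."""
    return sum(a * b for a, b in zip(u, v))

def kkt_violations(alphas, w, b, X, Y, tol=1e-9):
    """Return a list describing every violated KKT condition (empty at the optimum)."""
    problems = []
    # Condition (9): w must equal the weighted sum of the training points.
    w_sum = [sum(a * y * x[k] for a, x, y in zip(alphas, X, Y))
             for k in range(len(w))]
    if any(abs(p - q) > tol for p, q in zip(w, w_sum)):
        problems.append("(9) w differs from sum of alpha_i y_i x_i")
    # Condition (10): the signed multipliers must cancel out.
    if abs(sum(a * y for a, y in zip(alphas, Y))) > tol:
        problems.append("(10) sum of alpha_i y_i is not zero")
    for i, (a, x, y) in enumerate(zip(alphas, X, Y)):
        slack = y * (dot(w, x) + b) - 1
        if a < -tol:
            problems.append(f"alpha_{i} is negative")
        if slack < -tol:
            problems.append(f"(7) violated by point {i}")
        if abs(a * slack) > tol:
            problems.append(f"(7 bis) violated by point {i}")
    return problems

X = [(1, 1), (2, 1), (3, 1)]
Y = [1, -1, -1]

optimal = kkt_violations([2, 2, 0], (-2, 0), 3, X, Y)
intermediate = kkt_violations([1, 1, 0], (-1, 0), 2, X, Y)
print(optimal)
print(intermediate)
```

The first call returns an empty list, so that state is optimal. In the second call the second point fails both (7) and (7 bis), which marks it as the point to work on in the next optimisation step.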
1.4 Optimisation Example
To show the optimisation process more clearly, we introduce an example. Suppose we have 3 points $x_1 = (1, 1)$, $x_2 = (2, 1)$, $x_3 = (3, 1) \in \mathbb{R}^2$ with labels $+1, -1, -1$ respectively (see Figure 5). Suppose the initialisation routine sets the Lagrange multipliers to $\alpha_1 = 2$, $\alpha_2 = 1$, $\alpha_3 = 1$ (satisfying condition (10)). We use formulas (9) and (11) to calculate the following:
$w_1 = 2(1, 1) - 1(2, 1) - 1(3, 1) = (-3, 0)$
$L_{D1} = 4 - \frac{1}{2}(-6 + 6 + 9) = -0.5$
Then we check whether this is a valid solution for our SVM. For that purpose, we use the KKT conditions, especially condition (7). Note that all three points would be support vectors, so they must give the same value of $b$ when substituted into condition (7). At this optimisation stage this is not true for $w_1$, because we obtain $b = 4$, $b = 5$ and $b = 8$. Thus we can say, without doubt, that this is not a solution. Now we must find another set of Lagrange multipliers that increases $L_D$. Point 3 is the farthest from the current pseudo-hyperplane (and is correctly classified), so it is a good candidate for decreasing its weight in the definition of $w$ (see section 2.2). Suppose that the new Lagrange multiplier values are $\alpha_1 = 1$, $\alpha_2 = 1$, $\alpha_3 = 0$ (condition (10) must always hold).
$w_2 = 1(1, 1) - 1(2, 1) - 0(3, 1) = (-1, 0)$
$L_{D2} = 2 - \frac{1}{2}(-1 + 2) = 1.5$
We made a good choice because $L_D$ has increased. Nevertheless, we still do not satisfy the KKT conditions. When we substitute into equation (7), we obtain $b = 2$ and $b = 1$ for the two points respectively (we have only two support vectors).
Figure 5. A linear separable set with margin and separator hyperplane
Now that we have two support vectors of different classes, their Lagrange multipliers must change in the same way for condition (10) to hold. We increase them, for instance, to $\alpha_1 = 2$, $\alpha_2 = 2$, $\alpha_3 = 0$.
$$w_3 = 2(1, 1) - 2(2, 1) - 0(3, 1) = (-2, 0)$$
$$L_{D3} = 4 - \tfrac{1}{2}(-4 + 8) = 2$$
Again $L_D$ has increased, so we have chosen wisely. Moreover, at this optimisation step the KKT conditions hold, with the same value $b = 3$ for all support vectors. We can assert without any doubt that the optimum has been reached. For instance, if we continued to increase the multipliers to $\alpha_1 = 3$, $\alpha_2 = 3$, $\alpha_3 = 0$, the result would not be valid. We would obtain:
$$w_4 = 3(1, 1) - 3(2, 1) - 0(3, 1) = (-3, 0)$$
$$L_{D4} = 6 - \tfrac{1}{2}(-9 + 18) = 1.5$$
The convexity required of the problem shows in the sequence $L_{D1} < L_{D2} < L_{D3} > L_{D4}$: the objective rises to a single maximum and then falls. Moreover, as the example is so small, its quadratic shape can be seen in the symmetry $L_{D2} = L_{D4}$ on either side of the optimum.
During the optimisation process, while the KKT conditions do not hold, the unique separator hyperplane does not exist. At each new step (new set of values of $\alpha$) there is only one hyperplane direction, but as many separator hyperplanes as support vectors in the training set (different values of $b$). These hyperplanes do not need to have a geometric meaning; they do not try to separate the data, even though they could. As we get closer to the optimum (increasing $L_D$), all support-vector-defined hyperplanes come closer to each other (smaller differences in the value of $b$). The limit is reached when $L_D$ attains its optimum value and all hyperplanes coincide with a single value of $b$: the separator hyperplane.
This concept differs greatly from the search process followed by other similar methods, such as the perceptron. The perceptron always defines a separator hyperplane that evolves at each training step, trying to classify all training data correctly. For that reason, it can reach a state in which all data points are correctly classified but the margin is not the optimum. That is called a local minimum, where the perceptron will be trapped and unable to continue. The SVM algorithm performs a quadratic optimisation in which no intermediate state can be considered a valid solution. There will be only one solution, it will be global, and it will be the best you can have.
Even though the soft-margin SVM is defined in the next sections, this is a good place to see what happens when the optimisation algorithm is applied to a non-separable data set. Suppose we have again the 3 points used before, $x_1 = (1, 1)$, $x_2 = (2, 1)$, $x_3 = (3, 1) \in \mathbb{R}^2$, but now with labels $+1, -1, +1$ (see figure 6). We have changed the label of the third point, so the training set is no longer separable by a linear machine. Nevertheless, this information is not given to the SVM algorithm.
Figure 6. A linear non-separable set
Suppose we initialise the values as $\alpha_1 = 1$, $\alpha_2 = 2$, $\alpha_3 = 1$ (condition (10) holds).
$$w_1 = 1(1, 1) - 2(2, 1) + 1(3, 1) = (0, 0)$$
$$L_{D1} = 4 - \tfrac{1}{2}(0 - 0 + 0) = 4$$
Of course, this cannot be a solution. We do not need to check the KKT conditions, because $w = (0, 0)$ does not define a hyperplane. At this stage we cannot guess which multipliers are best to change, so we do it randomly. Suppose we define a new state $\alpha_1 = 1.5$, $\alpha_2 = 2$, $\alpha_3 = 0.5$ (there are not many more alternatives).
$$w_2 = 1.5(1, 1) - 2(2, 1) + 0.5(3, 1) = (-1, 0)$$
$$L_{D2} = 4 - \tfrac{1}{2}(-1.5 + 4 - 1.5) = 3.5$$
We obtain $L_{D2} < L_{D1}$, so we can be sure this is not a solution and, even more, that this way will take us nowhere. We choose another possible set, $\alpha_1 = 2$, $\alpha_2 = 4$, $\alpha_3 = 2$.
$$w_3 = 2(1, 1) - 4(2, 1) + 2(3, 1) = (0, 0)$$
$$L_{D3} = 8 - \tfrac{1}{2}(0 - 0 + 0) = 8$$
As in the first case, this cannot be a solution. But $L_D$ has increased quite a lot, and we could think this is getting us closer to the solution. However, we can keep scaling the multipliers in this way indefinitely: whenever $w = (0, 0)$ we have $L_{Dn} = \alpha_1 + \alpha_2 + \alpha_3$, so the objective function increases without limit (note that along this direction the problem is not characterised by a quadratic function but by a linear one, so there cannot be an optimisation solution). Therefore, if the objective function increases without limit, we are applying a linearly separable machine to a linearly non-separable training set.
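The hand calculations of this section can be checked with a few lines of Python. The following is only a sketch: it assumes that formula (9) is the expansion $w = \sum_i \alpha_i y_i x_i$ and formula (11) the dual objective $L_D = \sum_i \alpha_i - \frac{1}{2}\|w\|^2$, and that equality in condition (7) gives $b = y_i - w \cdot x_i$ for each candidate support vector. The function name `dual_state` is ours, not part of any library.

```python
def dot(u, v):
    """Inner product of two equally sized sequences."""
    return sum(a * b for a, b in zip(u, v))


def dual_state(alphas, labels, points):
    """Return (w, L_D, b_values) for one set of Lagrange multipliers."""
    dim = len(points[0])
    # w = sum_i alpha_i * y_i * x_i  (assumed to be formula (9))
    w = [sum(a * y * x[k] for a, y, x in zip(alphas, labels, points))
         for k in range(dim)]
    # L_D = sum_i alpha_i - 1/2 * ||w||^2  (assumed to be formula (11))
    l_d = sum(alphas) - 0.5 * dot(w, w)
    # Equality in (7) for every point with alpha_i > 0 gives b = y_i - w . x_i
    b_values = [y - dot(w, x)
                for a, y, x in zip(alphas, labels, points) if a > 0]
    return w, l_d, b_values


points = [(1, 1), (2, 1), (3, 1)]

# Separable labelling: the b values agree only at the optimum.
separable = [+1, -1, -1]
for alphas in [(2, 1, 1), (1, 1, 0), (2, 2, 0), (3, 3, 0)]:
    print(alphas, dual_state(alphas, separable, points))

# Non-separable labelling: scaling the multipliers keeps w at zero
# while L_D grows without limit.
non_separable = [+1, -1, +1]
for alphas in [(1, 2, 1), (1.5, 2, 0.5), (2, 4, 2)]:
    print(alphas, dual_state(alphas, non_separable, points))
```

A state is a candidate solution only when all entries of `b_values` coincide; in the separable run this happens at multipliers (2, 2, 0), and again at (3, 3, 0) where the common value violates inequality (7) and $L_D$ has already decreased.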
1.5 Test Phase
As stated earlier, once an SVM has been trained we obtain the values $w$ and $b$. With these values we define a separating hyperplane, $w \cdot x + b = 0$, parallel to $H_1$ and $H_2$ and placed midway between them, at the same distance from both. To classify an unseen pattern $x$, we just need to know on which side of the separator hyperplane the point lies, i.e., the sign of $w \cdot x + b$. Note that in the test phase we may find data points placed between $H_1$ and $H_2$ which, had they been used during training, would have changed the solution found. This concept may be useful when developing SVM training algorithms, because it could identify support vectors a priori, before the whole training, saving computational power.
Up to now we have mentioned only the binary case, in which data can belong to only two classes. SVM classifiers can be easily extended to the multiple-class case: for $n$ classes, we just need to generate $n-1$ binary classifiers, each separating one class from the rest. Nevertheless, this multiple classifier is $O(n)$ times more costly in time than one binary classifier (memory resources are more difficult to estimate), in the training phase as well as in the test phase. As this extension brings no major new ideas, it will not be mentioned in the rest of this chapter.
1.6 Non-Separable Linear Case
Now that we know everything needed to create and use a simple SVM, we will upgrade its definition so that it can deal with any real-life problem. When the above-described algorithm for separable data is used on non-separable data (see figure 7), no solution will be found, as the value of $L_D$ will grow without limit (see section 2.5). For non-separable data to satisfy the initial constraints, we have to introduce the concept of soft margin. This means that the algorithm will allow some training points to violate those constraints, so that the rest of the training data can be correctly classified (regardless of the violating points). For that purpose we
Figure 7. A linear non-separable set, which needs a soft-margin classifier. The distribution is defined as class = 1 if $x_1 + x_2 > 7.5$; class = −1 otherwise. The distribution has some noise
introduce positive slack variables $\xi_i$ for each point, such that the following inequalities hold [Cortes and Vapnik, 1995]:
$$w \cdot x_i + b \ge +1 - \xi_i \quad \text{for } y_i = +1 \tag{12}$$
$$w \cdot x_i + b \le -1 + \xi_i \quad \text{for } y_i = -1 \tag{13}$$
The values $\xi_i$ are not fixed prior to training; they are calculated during the optimisation process. Because they are not fixed, we can be certain that all points will satisfy inequalities (12) and (13): just increase $\xi_i$ until the inequality holds. We have solved our troubles: now there will always be a solution. But the solution may not be close enough to the true distribution underlying the data. If that is so, then the solution is useless, and we have just changed the name of our worries. The introduction of these variables must therefore be accompanied by an additional cost term in the primal Lagrangian $L_P$, so that classification errors during training are minimised. For a training pattern to be misclassified, its associated $\xi_i$ must be greater than 1, so $\sum_{i=1}^{l} \xi_i$ is an upper bound on the number of training errors over the complete training set. Therefore, the objective function to be minimised changes from $\frac{1}{2}\|w\|^2$ to
$$\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} \xi_i$$
where $C$ is a tunable non-negative real value that corresponds to the global penalty given to training errors. This new objective function could have been different; we could have devised other methods for forcing the $\xi_i$ values to be as small as possible. This particular function was chosen for simplicity: the problem continues to be convex quadratic, and neither the $\xi_i$ nor the Lagrange multipliers associated with the new constraints appear in the dual formulation of the problem. Therefore, we have to maximise $L_D$:
$$L_D = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j \tag{14}$$
with constraints:
$$0 \le \alpha_i \le C \tag{15}$$
$$w = \sum_{i=1}^{l} \alpha_i y_i x_i \tag{16}$$
$$\sum_{i=1}^{l} \alpha_i y_i = 0 \tag{17}$$
The only difference between the previous algorithm and this one is that the $\alpha_i$ now have an upper bound $C$. The training algorithm will not allow any point to increase its weight indefinitely, and so a solution will eventually be found. The error term in the optimisation process comes from those points that have $\xi_i > 0$, either because they are incorrectly classified or because they lie inside the margin. For any point that satisfies $\xi_i > 0$, it can be stated that $\alpha_i = C$. Such a point is still a support vector, and it is treated as such in the calculation of $w$, but in the optimisation process its weight will grow no more.
The soft margin philosophy (as opposed to the hard margin defined in section 2.2) is not to forbid training errors, nor even to minimise them alone. The idea is to minimise the whole objective function, in which errors exert some pressure, as does the robustness of the hypothesis, identified with maximisation of the margin between the well-classified points on each side of the separating hyperplane (characterised by constraint (7)).
Consider, for instance, the case shown in figure 7. A hard margin classifier cannot be found, but many soft margin classifiers satisfy the constraints, and the only difference between them is the value of $C$. The first approach for newcomers is usually the hardest soft margin possible, one that looks like figure 8. It is a valid solution, but it has a very small margin. According to structural risk minimisation, if we increase the margin, test errors should decrease (better generalisation performance). On the other hand, training errors should be avoided (or at least limited), so a balance must be found between margin maximisation and error permissibility. A small quantity of noise may be accepted without degrading the generalisation performance, by creating a hypothesis based on common properties satisfied by the data (the internal, true data distribution). In the case of figure 9, it is easily seen that more training points become errors, but the classifier is much closer to the underlying distribution. The new parameter $C$ is the only value (so far) that must be provided in the SVM architecture. As has been said, $C$ serves as a balance between error permissibility and generalisation quality.
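The balance governed by $C$ can be made concrete by evaluating the soft-margin objective $\frac{1}{2}\|w\|^2 + C \sum_i \xi_i$ for two candidate hyperplanes. The sketch below is only illustrative: the one-dimensional data set and the two candidates are hypothetical values chosen by us, neither candidate is claimed to be the optimum, and the slack of each point is taken as the smallest value that satisfies (12) and (13), $\xi_i = \max(0, 1 - y_i (w \cdot x_i + b))$.

```python
def dot(u, v):
    """Inner product of two equally sized sequences."""
    return sum(a * b for a, b in zip(u, v))


def soft_margin_objective(w, b, points, labels, C):
    """Return (objective, slacks) for a candidate hyperplane (w, b)."""
    # Smallest slack satisfying (12)-(13) for each point
    slacks = [max(0.0, 1.0 - y * (dot(w, x) + b))
              for x, y in zip(points, labels)]
    objective = 0.5 * dot(w, w) + C * sum(slacks)
    return objective, slacks


# Hypothetical data: the last negative point is noise close to the positives.
points = [(0.0,), (1.0,), (3.0,), (4.0,), (2.5,)]
labels = [-1, -1, +1, +1, -1]

candidates = {
    "wide margin": ((1.0,), -2.0),     # borders at x = 1 and x = 3, one error
    "narrow margin": ((4.0,), -11.0),  # borders at x = 2.5 and x = 3, no error
}

for C in (1.0, 100.0):
    for name, (w, b) in candidates.items():
        objective, slacks = soft_margin_objective(w, b, points, labels, C)
        print(f"C={C:6.1f}  {name:13s}  objective={objective:7.1f}  "
              f"errors={sum(s > 1 for s in slacks)}")
```

With the small value of $C$ the wide-margin candidate has the lower objective even though it misclassifies the noisy point; with the large value the narrow-margin candidate wins, which is the behaviour described for figures 8 and 9.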
Figure 8. The figure 7 set, with a rather hard soft margin classifier
Figure 9. The figure 7 set, with a softer margin classifier
– If $C$ is small, then errors are cheap. The margin will grow, and so will the number of training patterns that violate the margin.
– If $C$ is big, then the value of $w$ has little relevance in the optimisation of the objective function compared with training errors. We are approaching the hard margin philosophy. Because the value of $w$ is closely related to margin maximisation, decreasing the relevance of $w$ will take us to a smaller margin and, maybe, to a worse generalisation ability.
To choose a good value of $C$, model complexity and expected data noise must be evaluated as a whole.
1.7 Non-Linear Case
In most real-life cases, data cannot be separated using a linear hyperplane in input space. Even the use of slack variables could lead to a poor classifier when the deviations from linearity are caused by the structure of the hypothesis and not by noisy data. The next step is to introduce non-linear separating surfaces, instead of hyperplanes, in the SVM algorithm (see figure 10). For that purpose, we map the input data into another Euclidean space $H$, whose dimension is higher than that of the input space. We use a mapping function $\Phi$ such that:
$$\Phi: \mathbb{R}^d \to H$$
In the dual formulation of the problem, input data vectors appear only as inner products $x_i \cdot x_j$ in the space they belong to. Now they will only appear as $\Phi(x_i) \cdot \Phi(x_j)$ in space $H$. Space $H$ will usually have a very high dimension; it could even have infinite dimension. Therefore, performing operations in this space could be too costly. But if we could find a kernel function $K$ such that $K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$, then we would not need to map data vectors explicitly into space $H$; we would not even
Figure 10. Non-linear distribution set
need to know what $\Phi$ is. Now we just have to define a valid kernel function $K$ and substitute $K(x_i, x_j)$ everywhere $x_i \cdot x_j$ appears in the algorithm.
When we use a space of much higher dimension, many new data features, linear and non-linear, arise. Each new dimension offers a new possible view of correlation, a new attribute with which we can separate the data, a new factor with which to create the hypothesis. It is the responsibility of the training process to discriminate those attributes that contain useful hyperplane-definition information from those that do not, by assigning them a bigger weight in the linear combination of all features.
When the user has some information about data correlation, an explicit mapping can be generated. Nevertheless, this is not usual, and it could lead to an inefficient implementation, depending on how credible the previous knowledge is. Using generic mapping functions (we will see them later) makes it possible to generate an enormous number of new features without worrying about the meaning of each one. In fact, these spaces usually have thousands, millions or even infinitely many dimensions. It is difficult to picture such a big geometrical space. It seems easier to identify it with a set of non-linear relations between input attributes, which can be assembled with linear relations in the optimisation process to create a surface (a hyperplane in feature space, an arbitrary curve in input space) capable of separating the input data of one class from the other.
If we replace $x_i \cdot x_j$ by $K(x_i, x_j)$ everywhere in the training phase formulas, the algorithm defined in section 2.2 will generate a linear SVM in a high-dimensional space (specified by the mapping function). Most importantly, it will do so with roughly the same time complexity as a simple linear SVM created in input space (without mapping). All further development stays the same, as we are still creating a linear separator, although in a different space.
In the linear case, the output of the training phase was the value of $w$ and $b$, with which the hyperplane was completely defined, so the test phase just had to determine on which side of the hyperplane the new pattern lay. Now we cannot calculate $w$ explicitly, because it is defined only in space $H$ and we do not know exactly how the mapping is made.
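The substitution of the inner product by a kernel can be sketched directly on the dual objective (14). In the code below the kernel is passed as a function argument; the names `dual_objective`, `linear_kernel` and `poly_kernel` are ours, and the three points and multipliers are reused from the optimisation example only as illustrative values. With the linear kernel the function reproduces the input-space value of $L_D$; with another kernel the same code evaluates the objective in the corresponding feature space without ever computing the mapping.

```python
def linear_kernel(u, v):
    """Plain inner product: the kernel of the linear machine."""
    return sum(a * b for a, b in zip(u, v))


def poly_kernel(u, v, p=2):
    """Non-homogeneous polynomial kernel (u . v + 1)^p."""
    return (linear_kernel(u, v) + 1) ** p


def dual_objective(alphas, labels, points, kernel):
    """L_D = sum_i a_i - 1/2 sum_i sum_j a_i a_j y_i y_j K(x_i, x_j)."""
    n = len(points)
    quadratic = sum(
        alphas[i] * alphas[j] * labels[i] * labels[j]
        * kernel(points[i], points[j])
        for i in range(n) for j in range(n)
    )
    return sum(alphas) - 0.5 * quadratic


points = [(1, 1), (2, 1), (3, 1)]
labels = [+1, -1, -1]
alphas = [2, 2, 0]

print(dual_objective(alphas, labels, points, linear_kernel))  # 2.0
print(dual_objective(alphas, labels, points, poly_kernel))    # -22.0
```

The multipliers that are optimal for the linear kernel are not optimal for the polynomial one; the second value only shows that the objective is evaluated with exactly the same code.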
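Through the support vector expansion, the value of $w$ can be written as:
$$w = \sum_{i=1}^{N} \alpha_i y_i \Phi(s_i) \tag{18}$$
so we can write the classification function as:
$$f(x) = \sum_{i=1}^{N} \alpha_i y_i \, \Phi(s_i) \cdot \Phi(x) + b = \sum_{i=1}^{N} \alpha_i y_i K(s_i, x) + b \tag{19}$$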
where $s_i$ are the $N$ support vectors, identified in the training phase as those patterns whose Lagrange multiplier is not zero. With this definition we again avoid calculating the mapping function $\Phi$.
Note that the soft margin concept still applies to a non-linear classifier. Actually, its implementation remains very simple: the Lagrange multipliers have an upper limit. In this case the soft margin applies to the linear classifier in the high-dimensional space. The clearest advantage is that we can still assert that there is a solution. The use of a non-linear surface as separator does not by itself guarantee that a solution will be found, even though it makes one more probable. Moreover, using the soft margin alternative makes the classifier more robust against noisy training patterns.
The time complexity of the training phase does not change, but the test phase is different. In the linear case, having calculated $w$ explicitly, the complexity is $O(1)$ using the inner product as the basic operation (which is $O(d)$ if multiply-add is the basic operation). In the non-linear case we need to perform $O(N)$ operations, where $N$ was previously defined as the number of support vectors. Because of this relation between the number of support vectors and complexity, algorithms have been devised that try to minimise, or even replace, support vectors during and after training, so that this phase may be competitive with other machine learning methods, such as neural networks.
1.8 Mapping Function Example
To better understand how new useful features are generated, we will show an example. Suppose we have a data set $\{x_i, c_i\}$ in $\mathbb{R}^2 \times \{+1, -1\}$ as shown in figure 11. It can be seen that this is not a linearly separable case, and that a soft margin linear separator is not enough. In this example, the training data has no noise. We define a mapping function $\Phi: \mathbb{R}^2 \to \mathbb{R}^3$ of the form:
$$\Phi: (x_1, x_2) \to (x_1, x_2, x_1 x_2)$$
We have therefore added a new feature to the input definition, which gives us information about a specific kind of relation between the two initial variables. Thus, we can calculate the kernel function:
$$K(x, x') = \Phi(x) \cdot \Phi(x') = (x_1, x_2, x_1 x_2) \cdot (x'_1, x'_2, x'_1 x'_2) = x_1 x'_1 + x_2 x'_2 + x_1 x_2 x'_1 x'_2$$
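The identity between the kernel and the inner product of the mapped vectors can be checked numerically. This is a minimal sketch using the mapping $(x_1, x_2) \to (x_1, x_2, x_1 x_2)$ of this example; the function names and the two test points are ours.

```python
def phi(x):
    """Explicit mapping from input space R^2 to feature space R^3."""
    x1, x2 = x
    return (x1, x2, x1 * x2)


def kernel(x, z):
    """Same quantity computed from input-space coordinates only."""
    return x[0] * z[0] + x[1] * z[1] + (x[0] * x[1]) * (z[0] * z[1])


def dot(u, v):
    """Inner product of two equally sized sequences."""
    return sum(a * b for a, b in zip(u, v))


x, z = (2, 3), (4, 5)
print(dot(phi(x), phi(z)))  # 143, via the explicit mapping
print(kernel(x, z))         # 143, without building phi(x) or phi(z)
```

For a three-dimensional feature space the saving is negligible, but the same pattern is what makes very high-dimensional feature spaces affordable.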
Figure 11. Non-linear distribution set. The distribution is defined as class = 1 if $x_1 x_2 < 14.5$; class = −1 otherwise
Figure 12. Feature space view for the main points from figure 11. The margin h1 − h2 is partially shown using solid lines
We have defined the mapping function and the new space implicitly, using the inner product in input space as the only valid operator. Figure 12 shows the most important points from the training data set, together with the separator hyperplane that the SVM algorithm would find and the points that become support vectors. The separator hyperplane is $z = 14.5$. Note that in the final hypothesis only one of the three available features is required to create the hyperplane (it is defined using just the third component). This is a very common case in non-linear SVM: just a few features form the linear combination that defines the separator hyperplane. To represent the curve in input space that describes the generated hyperplane, we need to use the inverse mapping:
$$\Phi^{-1}: (x_1, x_2, x_1 x_2) \to (x_1, x_2)$$
Figure 13. Non-linear classifier for the figure 11 set. Support vectors are encircled, margin is shown using dashed lines and the separator curve is shown with a solid line
As the new axis $z$ was defined as $z = x_1 x_2$ in the high-dimensional space, the points that lie on the hyperplane satisfy $x_1 x_2 = 14.5$, and so the curve in input space can be defined as $x_2 = 14.5/x_1$. Figure 13 shows the final result, with the hyperbola $x_2 = 14.5/x_1$ as the non-linear class separator. The support vectors in this figure are those that were identified during training and highlighted in figure 12. It should not be assumed that every point lying near the non-linear separator surface in input space becomes a support vector, although this is usually the tendency. The mapping function does not necessarily preserve any relation among the input data, but the concept behind the support vector is that of a “significant point”, and the points that carry most information are those that lie near points of the other class in input space. In real-world cases this mapping function will not be useful, unless clear and simple a priori information is given to the SVM engineer. Nevertheless, it is a valid mapping function and generates a valid kernel function. For this to happen, the function $K(x, y)$ must satisfy some constraints, known as the Mercer conditions.
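The classifier of this example can be written in a few lines once the hyperplane $z = 14.5$ is expressed as $w \cdot \Phi(x) + b = 0$ in feature space. In the sketch below, the scale chosen for $w$ and $b$ is an assumption (any positive multiple yields the same classes, and the margin-normalised values are not given in the text); the function names are ours.

```python
def phi(x):
    """Map an input point (x1, x2) to feature space (x1, x2, x1 * x2)."""
    x1, x2 = x
    return (x1, x2, x1 * x2)


# Hyperplane z = 14.5 written as w . phi(x) + b = 0.
# The scale of (W, B) is assumed; only the third feature is used.
W = (0.0, 0.0, -1.0)
B = 14.5


def classify(x):
    """Return +1 when x1 * x2 < 14.5 and -1 otherwise."""
    value = sum(wk * fk for wk, fk in zip(W, phi(x))) + B
    return 1 if value > 0 else -1


def separator_curve(x1):
    """Input-space image of the hyperplane: x2 = 14.5 / x1 (x1 non-zero)."""
    return 14.5 / x1


print(classify((2.0, 3.0)))  # x1 * x2 = 6, below 14.5
print(classify((5.0, 4.0)))  # x1 * x2 = 20, above 14.5
print(separator_curve(2.0))  # point of the separator curve at x1 = 2
```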
1.9 Mercer Conditions
Not every function is a valid kernel, that is, one that describes a Euclidean space with the properties required in previous sections. It is enough to satisfy the Mercer conditions [Vapnik, 1995], which can be written as: there exists a mapping $\Phi$ with $K(x, y) = \Phi(x) \cdot \Phi(y)$ if and only if, for all $g(x)$ such that $\int g(x)^2 \, dx$ is finite, the following inequality holds:
$$\int\!\!\int K(x, y) \, g(x) \, g(y) \, dx \, dy \ge 0$$
In most cases this is a very complicated condition to check, because it must hold ‘for all $g(x)$’. It has been demonstrated for
$$K(x, y) = \sum_{p=1}^{P} C_p \, (x \cdot y)^p$$
where each $C_p$ is a positive real number and $p$ is a positive integer.
1.10 Kernel Examples
The first (and only) basic kernels used to develop pattern recognition, as well as non-linear regression and principal component analysis, with SVM are (for any pair of vectors $x, y \in \mathbb{R}^d$):
$$K(x, y) = (x \cdot y + 1)^p \tag{20}$$
$$K(x, y) = \exp\left(-\|x - y\|^2 / 2\sigma^2\right) \tag{21}$$
$$K(x, y) = \tanh(\kappa \, x \cdot y - \delta) \tag{22}$$
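The three kernels and their use in the classification function (19) can be sketched as follows. The parameter names `p`, `sigma`, `kappa` and `delta` stand for the symbols of (20), (21) and (22); the default values, the function names, and the support vectors, multipliers and offset in the usage lines are illustrative assumptions (they are taken from the linear optimisation example, not from a kernel training run).

```python
import math


def dot(u, v):
    """Inner product of two equally sized sequences."""
    return sum(a * b for a, b in zip(u, v))


def polynomial_kernel(x, y, p=2):
    """Kernel (20): non-homogeneous polynomial of degree p."""
    return (dot(x, y) + 1) ** p


def gaussian_kernel(x, y, sigma=1.0):
    """Kernel (21): Gaussian radial basis function."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2 * sigma ** 2))


def sigmoid_kernel(x, y, kappa=1.0, delta=1.0):
    """Kernel (22): hyperbolic tangent of a shifted inner product."""
    return math.tanh(kappa * dot(x, y) - delta)


def decision(x, support_vectors, alphas, labels, b, kernel):
    """Classification function (19): sum_i alpha_i y_i K(s_i, x) + b."""
    return sum(a * y * kernel(s, x)
               for s, a, y in zip(support_vectors, alphas, labels)) + b


# Illustrative values: with the plain inner product as kernel,
# (19) reduces to w . x + b with w = (-2, 0) and b = 3.
support_vectors = [(1, 1), (2, 1)]
alphas = [2, 2]
labels = [+1, -1]
print(decision((3, 1), support_vectors, alphas, labels, 3, dot))  # -3
print(decision((3, 1), support_vectors, alphas, labels, 3, gaussian_kernel))
```

Evaluating `decision` costs one kernel evaluation per support vector, which is the $O(N)$ test-phase cost of the non-linear case.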
Kernel (20) is a non-homogeneous polynomial classifier of degree $p$ (another variation in use is the homogeneous polynomial kernel, without the term ‘+1’). It creates a space $H$ with as many dimensions (data features) as there are combinations of the input attributes up to degree $p$. All possible relations between input attributes up to degree $p$ appear in the new space. The margin maximisation algorithm will discriminate those that carry information from those that do not (which should be most of them), so the number of adjustable parameters required to obtain a good solution decreases.
Kernel (21) is a Gaussian radial basis function (RBF). The dimension of the new space is not fixed; it depends on the actual data distribution and could be infinite. The visual effect of this kernel is that nearby patterns form class clusters, as big as they can be. The clusters have the support vectors as centres (in feature space), and their radius is given by the value of $\sigma$ and by the support vector weight obtained during training.
Kernel (22) is similar to a two-layer sigmoidal neural network. Using the neural network kernel, the first layer is composed of $N$ sets of weights, each set consisting of $d$ weights; the second layer is composed of $N$ weights (the $\alpha_i$), so that an evaluation requires a weighted sum of sigmoids evaluated on dot products. The structure and weights (which define the related neural network architecture) are given automatically by the training process. Not all values of $\kappa$ and $\delta$ satisfy the Mercer conditions [Vapnik, 1995].
We say that (20), (21) and (22) are basic functions because new kernel functions can be formulated by combining them while still satisfying the Mercer conditions. A linear combination of two Mercer kernels with non-negative coefficients is a Mercer kernel. This can be easily demonstrated from the linearity of the integral with respect to addition. Other slight modifications of the basic functions can also be implemented, in search of a kernel function that embodies a priori information about the internal distribution.
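A finite-sample version of the Mercer inequality gives a quick, necessarily partial, check of a kernel: for any finite set of points and any weights $g_i$, a Mercer kernel must satisfy $\sum_i \sum_j g_i K(x_i, x_j) g_j \ge 0$. The sketch below applies this test; the points, the parameter values ($\sigma = 1$, $\kappa = 1$, $\delta = 1$) and all function names are assumptions of ours. Finding a negative value proves that the condition fails for those parameters, whereas finding none proves nothing.

```python
import math
import random


def dot(u, v):
    """Inner product of two equally sized sequences."""
    return sum(a * b for a, b in zip(u, v))


def gaussian_kernel(x, y, sigma=1.0):
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y))
                    / (2 * sigma ** 2))


def sigmoid_kernel(x, y, kappa=1.0, delta=1.0):
    return math.tanh(kappa * dot(x, y) - delta)


def gram_matrix(points, kernel):
    """Kernel matrix: entry (i, j) holds K(x_i, x_j)."""
    return [[kernel(p, q) for q in points] for p in points]


def quadratic_form(gram, g):
    """Discrete analogue of the Mercer integral for weights g."""
    n = len(g)
    return sum(g[i] * gram[i][j] * g[j] for i in range(n) for j in range(n))


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 2.0)]
random.seed(0)
trials = [[random.uniform(-1, 1) for _ in points] for _ in range(1000)]

gauss = gram_matrix(points, gaussian_kernel)
worst_gauss = min(quadratic_form(gauss, g) for g in trials)
print("Gaussian kernel, smallest value found:", worst_gauss)

# Weighting only the origin isolates K(0, 0) = tanh(-delta), which is negative.
sigm = gram_matrix(points, sigmoid_kernel)
origin_only = [1.0, 0.0, 0.0, 0.0]
print("Sigmoid kernel, value for origin_only:",
      quadratic_form(sigm, origin_only))
```

The sigmoid kernel with these parameters yields a negative value, a concrete instance of a parameter choice for which (22) is not a Mercer kernel.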
Nevertheless, it has been observed experimentally that, in many cases, the choice of kernel is not a determining factor in the performance of the machine. For a real-world problem whose internal distribution is not particularly suited to some kind of kernel, the sets of support vectors tend to be very similar, no matter which non-linear function is used. Of course, the weights are fairly different, as the evaluating function is different too. But the result, the separating surface, tends to have a very similar geometrical shape, especially where data density is high. As was said in previous sections, the reason could be that the patterns that are important because they lie near patterns of the other class remain important regardless of the mapping function, so they become support vectors.
Last, we define the kernel matrix as a symmetric square matrix of order $M$ (where $M$ is the number of training patterns), in which position $(i, j)$ holds the kernel function value $K(x_i, x_j)$.
1.11 Global Solutions and Uniqueness
As has been shown in previous sections, the result of SVM training is a global solution of the optimisation process, i.e., the parameter set (values of $w$, $b$ and $\alpha_i$) that gives a maximum of the objective function. This term is opposed to ‘local solution’, defined as a parameter set whose objective function value is optimal only when compared with its vicinity. In the SVM algorithm any local solution is also a global solution, because the problem is convex quadratic. Nevertheless, the global solution may not be unique. There could be more than one parameter set for which the objective function takes the same value, and that value could be the optimum. This is not inconsistent with the definition of global solution. Uniqueness of the solution is guaranteed only when the problem is strictly convex. The definition of SVM training ensures that the problem is convex, but whether it is strictly convex depends on the training data. Non-uniqueness occurs in two different ways:
• When the values of $w$ and $b$ are not unique. In this case all values of $w$ and $b$ between two solutions are also global solutions. This is easy to accept, as the problem is convex.
• When the values of $w$ and $b$ are unique, but the value of $w$ comes from different sets of $\alpha_i$ values. Reaching one solution or the other depends on the randomness of the training algorithm. Remember that there can be training data points that lie on the margin hyperplane but are not support vectors. Just as three points in a row define a single straight line, and discarding any one of the three gives the same line, it is easy to create a training set that would generate different hard margin support vector sets depending on the listing order, although the separator hyperplane would remain unchanged.
1.12 Generalization Performance Analysis
The Mercer conditions tell us whether a kernel function defines a new Euclidean space or not, but they do not define how the mapping function must be applied or
the morphology of the new space. In easy cases, the dimension of the feature space can be deduced. For instance, the homogeneous polynomial kernel of degree $p$ has $\binom{d+p-1}{p}$ new features or dimensions. For a polynomial kernel of degree 4 using 16 × 16 pixel images (256 initial features), the dimension of the new space is 183181376. In real-world cases we will never have training sets that big. A classification machine with a huge ‘features over data’ ratio would undoubtedly produce overfitting.
Let us use an easier example: a polynomial of degree 3 with 8 × 8 pixel data. The dimension of the new space is 45760. If you are using a simple multi-layer perceptron neural net, the ratio between the number of data points and the number of weights should not be greater than around 15%. Suppose you are generating a hidden layer with 45760 units (new features), 64 units in the input layer and one unit as output. The number of weights in the net is around 2974400 (almost 3 million). Therefore, the minimum training data set should have 19829333 patterns (almost 20 million). Now, that is an awfully big data set. Of course, not all 45760 new features are important. Many of them will have a null weight. But you cannot know at first which features will be needed and which will not. Some algorithms have been designed to shrink the neural net while training, but even then the difference between a useful feature and a disturbing feature is not easy to establish.
A separator hyperplane in feature space $H$ must have $\dim(H) + 1$ parameters. Any classification system needing so many parameters to create a discrimination function will be inefficient in resources and time. Nevertheless, SVM have good classification and generalisation performance, in spite of treating data in an enormous space, which could even be infinite. The reason has not been formally demonstrated, although the maximum margin requirement has much to do with it.
Within the SVM, the solution has at most $l + 1$ adjustable parameters, $l$ being the number of training patterns. After training, the solution has $N + 1$ parameters, $N$ being the number of support vectors, which is much smaller than the number of new features.
In section 2.1 we left open the question of which classifier is better out of two possible choices. The answer is “the one having the lowest VC dimension”, which is the same as saying “the simplest”. It was shown that the bound on the risk is related to the VC dimension: the lower the VC dimension, the lower the risk bound. However, this does not tell you which classifier will have the lowest actual risk. There is no way to know that beforehand.
This approach is not only mathematically motivated; some philosophy can also be brought to bear on it. The 14th-century English philosopher William of Ockham stated the principle known as Ockham’s razor: given some evidence and two hypotheses, one simple and one complex, both satisfying the evidence, the simpler hypothesis is the more likely to be true. It does not say which one is true, but if you had to bet and had no additional knowledge or evidence, you should go for the first hypothesis. That is what learning is all about, be it machine or human: choose
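The figures quoted in this passage follow from the binomial coefficient $\binom{d+p-1}{p}$ and simple arithmetic, and can be reproduced as below. The helper name `homogeneous_poly_dim` is ours; the 15% figure is the rule of thumb stated in the text, applied here as weights divided by 0.15.

```python
from math import comb


def homogeneous_poly_dim(d, p):
    """Number of features of the homogeneous polynomial kernel of degree p."""
    return comb(d + p - 1, p)


# 16 x 16 pixel images (d = 256), degree 4
print(homogeneous_poly_dim(256, 4))   # 183181376

# 8 x 8 pixel images (d = 64), degree 3
features = homogeneous_poly_dim(64, 3)
print(features)                       # 45760

# Multi-layer perceptron with 64 inputs, one hidden unit per feature, 1 output
weights = features * 64 + features * 1
print(weights)                        # 2974400

# Patterns needed if weights may be at most 15% of the data points
min_patterns = weights * 100 // 15
print(min_patterns)                   # 19829333
```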
the one hypothesis which seems most probable given the current evidence. Whenever you make a new assumption (using an unnecessarily complex hypothesis) you are most probably farther from the truth.
That answers the big question: why is the generalisation performance of support vector machines good even when they use a high-dimensional feature space? Because SVM performance is not related to the dimension of the space where data is separated, but to the VC dimension of the classifier. Therefore, an SVM classifier depends on the simplicity of the hypothesis behind the data, not on the number of available features. If a simple hypothesis can do the separating job, the SVM will use it, with no overfitting. There is no magic any more.
The SVM algorithm gives the simplest hypothesis, that is, the most probable one. But this does not mean there cannot be a better answer for a given problem. In spite of our hard SVM militancy, we do not deny that SVM have been slightly outperformed (mostly by specific neural networks) in some experimental benchmarks. The answer is simple: luck. The SVM gave the most probable answer after one general-purpose execution. But the true internal distribution may have been slightly more complex, even though this did not show in the training data. If you are trying a neural network architecture with a bit more complexity, which way will you go? You cannot say unless you have additional information. The successful architect engineer would most probably try all possible ways. That means trying hundreds of different architectures and finally using the one with the best error rates on the test set.
But that approach falls down in many places: first, the engineer must decide how much complexity the answer should have (not an easy task at all); second, the training set must deviate slightly from the internal distribution for the SVM to fall behind; third, if you generate many classifiers and use the test set to decide which one is better, then the test set is no longer a good validation set, because you are using it as a secondary training set (even though it is used as a validation set when publishing the results); last, the engineer spends a lot of time in the training phase. And in spite of all this extra work, in the cases where SVM are outperformed they are still very close to the highest results in this scientific ranking. This means that in the real world it is difficult to find the SVM outperformed.
Support Vector Machines are not easy to implement, but they are very easy to use. Nevertheless, their use has some limits. As the SVM is a statistical method, symbolic learning does not suit it too well. For instance, the parity problem with few data makes the SVM decide that all points are support vectors. This is a clear hint of bad generalisation performance, because it means: “one point has no relation with any other point”. In those cases an SVM is no better than a simple Nearest Neighbour classification algorithm.
Other Machine Learning paradigms, for instance C4.5, are able to work with input data that has attributes of unknown value (C4.5 uses the ‘?’ symbol). The algorithm identifies this value and treats the information accordingly. However, the SVM algorithm does not allow unknown values, which reduces its applicability to some data sets.
Within the previously defined scope, the SVM has a very light bias. It is a true general-purpose machine learning method. Although a priori information can be included inside the kernel function, the number of new features is so large that, regardless of the internal data distribution, there will always be a nearby hypothesis model using those new features. The training algorithm has an embedded balance between using too few features (too simple a hypothesis) and using too many (overfitting).
The basic achievement in using SVM is that you just choose a generic kernel function (we won’t say “any kernel will do”, but it is not too far from the truth) and the confidence degree $C$ (up until now mostly heuristics are used, but you will soon find it is quite easy). Then you push the button, and after some time you will have the best classification machine. No need for an experienced engineer or scientist. No complicated architectures. No tailoring. No second thoughts. Child’s play.
2. SVM MATHEMATICAL APPLICATIONS
The initial mathematical development for SVM has been applied to different approaches inside Machine Learning scope. All of them are based on the structural risk minimization principle, in the problem Lagrange formulation, and in the non-linear case generalization. For each approach you only need to define the requirements all points must satisfy, its effect on the objective function and the mathematical steps through the Lagrange formulations. 2.1
Pattern Recognition
The first approach to SVM was in the pattern recognition field. In fact, the search for a new statistical paradigm able to optimise the class separation problem was the boost to V. Vapnik in his quadratic programming research. For that reason, the SVM definition developed in the previous sections and their implementation shown in next sections, apply specifically to pattern recognition. Nevertheless, most concepts apply also to the other approaches defined in this section. 2.2
Regression
Historically, the second application of SVM was non-linear regression and function estimation, called SVRM (Support Vector Regression Machines) [Vapnik et al, 1997]. This field can be divided into two parts: first, ‘function approximation’ tries to find the curve that best fits training data acquired without noise (which makes it very similar to the usual interpolation methods); second, ‘function estimation’ (regression), where the data is noisy and its distribution is unknown, tries to estimate unseen data points, including extrapolation, with the simplest possible function. The SVRM algorithm treats both cases in a very similar way. For each case, the cost function can be slightly changed.
2.2.1 Definition
Suppose we have a training set with $M$ data pairs $(x_i, y_i)$, where $x_i \in \mathbb{R}^d$, $i = 1, \dots, M$ (up until now, just the same as the pattern recognition case), and where $y_i \in \mathbb{R}$ is no longer a label but a real number which represents the value of the function we want to estimate at $x_i$, i.e. $y_i = f_r(x_i) + n_i$, where $n_i$ is the noise associated with point $i$. We want to find a function $f(x)$ having a maximum deviation of $\varepsilon$ with respect to all training $y_i$. In the basic case, there can be no training points at a distance from the expected value bigger than $\varepsilon$, so the resulting curve must fit all points. This case can be used only when the data describe a linear function with a noise level $|n_i| < \varepsilon \;\; \forall i$. The estimating function has the form:

(23) $f(x) = w \cdot x + b$

where $w$ is the vector defining the function in input space, and $b$ is the free term (the bias). As in the pattern recognition case, the structural risk minimization principle demands the greatest possible simplicity of the approximating function. We will try to minimize $\|w\|^2$, which gives the flattest linear function among those satisfying the constraints (unlike the margin maximization definition in pattern recognition). Therefore, the optimisation problem is written as:

Minimize $\frac{1}{2}\|w\|^2$ subject to the constraints:

(24) $y_i - w \cdot x_i - b \le \varepsilon$
$w \cdot x_i + b - y_i \le \varepsilon$
Nevertheless, following the same reasoning as in section 2, this inflexible formulation is only valid when there is at least one solution satisfying conditions (24). Because this is usually an unrealistic case, with no noise in the data (it could be used for an interpolation approach), the soft margin idea must be introduced. We define positive slack variables $\xi_i$ that measure how far the expected value is from the true value for point $i$. Thus, we give the algorithm the ability to admit errors (points not satisfying the constraints), while keeping the ability to find a solution that represents the data distribution well enough. Likewise, a new cost function must be defined that balances the number of allowed errors against the simplicity (and usefulness) of the final estimating function. This cost function, $c(x, y, f)$, must fulfil some properties, discussed in the next section.
To continue the formulation development through this section we will use the $\varepsilon$-insensitive cost function [Vapnik et al, 1997], partly because it was the first one proposed, and partly because it is the simplest to interpret and optimise. This cost function is continuous but not differentiable, so the variables must be duplicated (the formulation gets longer but no more difficult). Now every $\alpha$ and $\xi$ turns into a pair $\alpha, \alpha^*$ and $\xi, \xi^*$, where the one without asterisk is associated with the $y_i \ge f(x_i)$ case, and the one with asterisk with the $y_i < f(x_i)$ case. Note that both cases cannot be true for any one point, so for every training point at least one of the duplicated variables will be zero. Thus, the objective function becomes:

(25) $L_P = \frac{1}{2}\|w\|^2 + C \, \frac{1}{M} \sum_{i=1}^{M} c(x_i, y_i, f)$

After the primal and dual formulation development (just like the pattern recognition case), the problem can be written as:

Maximise $L_D = -\frac{1}{2} \sum_{i=1}^{M} \sum_{j=1}^{M} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) \, x_i \cdot x_j - \varepsilon \sum_{i=1}^{M} (\alpha_i + \alpha_i^*) + \sum_{i=1}^{M} y_i (\alpha_i - \alpha_i^*)$

subject to:

$\sum_{i=1}^{M} (\alpha_i - \alpha_i^*) = 0$

(26) $\alpha_i, \alpha_i^* \in [0, C]$
where $C$ has the same meaning as in section 2: a parameter balancing error permissibility. This development remains a convex quadratic optimisation problem, which has to satisfy the Karush-Kuhn-Tucker conditions at optimality. Therefore, the implementation methods defined for pattern recognition are applicable, although with some differences caused by the cost function. In the case of the $\varepsilon$-insensitive cost function, the duplicated Lagrange multipliers must be treated specifically. This will happen with all non-differentiable cost functions. Again, support vectors are those training points whose Lagrange multipliers are not zero (in the case of duplication, this means one of the multipliers is non-zero). Moreover, those points having a non-zero slack variable are considered training errors, and have the corresponding multiplier set to the maximum, $\alpha_i^{(*)} = C$ (where the symbol $(*)$ means “either of the duplicated items, the applicable one”). Support vectors having a Lagrange multiplier not at bound, $0 < \alpha^{(*)} < C$, are placed on the margin (they are needed to define the margin) and have a zero slack variable. Basically, the concepts behind the support vectors, the weights and the geometrical meaning remain the same as in the pattern recognition case (see figure 14).
Figure 14. Linear regression machine. Support vectors are encircled, the margin/tube is shown with dashed lines and the estimated function is shown with a solid line
The value of the bias $b$ can be calculated from a non-bound support vector $s_i$, i.e. a non-error support vector. The equalities to be used are those obtained from inequalities (24):

(27) $b = y_i - w \cdot s_i - \varepsilon$ if $0 < \alpha_i < C$

(28) $b = y_i - w \cdot s_i + \varepsilon$ if $0 < \alpha_i^* < C$
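As a minimal sketch of how equations (27) and (28) might be applied, the following Python fragment computes $b$ from a single non-bound support vector. The function name, its `starred` flag and the toy numbers are assumptions made for illustration only; they do not come from the chapter.

```python
def bias_from_free_support_vector(w, s, y, eps, starred=False):
    """Bias b obtained from one non-bound support vector (s, y).

    starred=False applies equation (27): b = y - w.s - eps  (0 < alpha_i  < C)
    starred=True  applies equation (28): b = y - w.s + eps  (0 < alpha_i* < C)
    """
    dot = sum(wk * sk for wk, sk in zip(w, s))
    return y - dot + (eps if starred else -eps)


# Hypothetical 1-D example: slope w = 2 and tube half-width eps = 0.1.
w = [2.0]
# A point lying on the upper edge of the tube (y >= f(x), unstarred multiplier).
b_from_upper = bias_from_free_support_vector(w, [1.0], 3.1, eps=0.1)
# A point lying on the lower edge of the tube (y < f(x), starred multiplier).
b_from_lower = bias_from_free_support_vector(w, [2.0], 4.9, eps=0.1, starred=True)
print(b_from_upper, b_from_lower)
```

With these hypothetical values both calls return approximately 1, so either non-bound support vector recovers the same bias for the line $f(x) = 2x + 1$.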
In case all support vectors are errors (very unusual and, in any case, most probably a bad solution), the calculation of $b$ is much more complex, and can be done during the optimisation itself. Likewise, we can define a non-linear mapping from input space to a feature space, where the algorithm will try to find the flattest function approximating the data well enough. The ‘flat’ property can usually be seen in the corresponding non-linear curve in input space: its shape is the one with the smallest slope through the point set. The mapping concept and development are similar to those described in previous sections: a kernel function $K(x, y)$ performs all operations implicitly in feature space, usually a space of much higher dimension (see figures 15a and 15b). Therefore, the non-linear problem is defined as:

Maximize $L_D = -\frac{1}{2} \sum_{i=1}^{M} \sum_{j=1}^{M} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) \, K(x_i, x_j) - \varepsilon \sum_{i=1}^{M} (\alpha_i + \alpha_i^*) + \sum_{i=1}^{M} y_i (\alpha_i - \alpha_i^*)$

subject to:

$\sum_{i=1}^{M} (\alpha_i - \alpha_i^*) = 0$

(29) $\alpha_i, \alpha_i^* \in [0, C]$

Figure 15a. Non-linear regression machine. The dots follow the sinc(x) function, the dashed lines are the $\varepsilon$-tube, and the solid line is the SVRM function estimate. Note that the support vectors are those corresponding to the 3 tangent points (1 in the middle, at x = 0, and the other two at the limits)

Figure 15b. Non-linear regression machine. The dots follow the same sinc(x) function, and the other elements follow the notation of figure 15a. Note that as the $\varepsilon$-tube narrows, the function estimate gets more accurate. At the limit, if the noise allowance approaches 0, the function estimation error will also be 0 in this example

The support vector expansion of $w$ and the estimated function are written:

(30) $w = \sum_{i=1}^{N} (\alpha_i - \alpha_i^*) \, \Phi(s_i)$

(31) $f(x) = \sum_{i=1}^{N} (\alpha_i - \alpha_i^*) \, K(s_i, x) + b$
Here $s_i$ are the resulting $N$ support vectors, and $M$ is the size of the complete training set. It has been observed, through a number of experiments [Osuna and Girosi, 1998], that SVRM tend to use a relatively low number of support vectors compared to other similar machine learning processes. The reason could be the flexibility allowed while errors are below a threshold, which generates simpler surfaces and thus needs fewer support vectors to define them. Moreover, it has been proved that the algorithm works well when a non-linear kernel is applied in spite of having few training data. Other well-known methods will easily overfit the data, while the SVRM dynamically controls its generalization ability, generating a hypothesis simple enough to model the training data distribution better.
2.2.2 Cost Functions and $\nu$-SVRM
The cost function is one of the key elements in SVRM. As said in the previous section, real data is usually acquired with a certain amount of noise of unknown distribution. The cost function is in charge of accepting noise deviations and penalizing wide deviations, whether they are caused by noise or by a hypothesis that is currently too simple. The point is how to tell the difference between noise and hypothesis complexity. Nevertheless, this function must satisfy certain properties. For the problem to remain usefully solvable, the cost function must be convex, thus maintaining problem convexity and assuring that the solution exists, is unique and is global. Moreover, for the mathematical development to remain simple, it is required to be symmetric and to have at most two discontinuities in the first derivative, at $\pm\varepsilon$, with $\varepsilon \ge 0$. Therefore, even if we knew the noise distribution, it would be too complex to introduce that additional information into the algorithm. We would then have to find a convex cost function that adjusts to the noise distribution, and we would still be using an approximation. Not to mention the mathematical development for the new cost function, notably difficult for non-expert mathematicians. The conclusion is: just use a general-purpose cost function and let the SVRM automatic learning do the engineering job. The development described in the previous subsection refers to the $\varepsilon$-insensitive cost function, which is the most commonly used, and is defined as:

(32) $c(\xi) = \begin{cases} 0 & \text{if } |\xi| \le \varepsilon \\ |\xi| - \varepsilon & \text{if } |\xi| > \varepsilon \end{cases}$
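The cost function (32) is short enough to write out directly. The sketch below is only an illustration: the function name and the sample deviations are hypothetical, and $\varepsilon = 0.1$ is an arbitrary choice.

```python
def eps_insensitive_cost(xi, eps):
    """Equation (32): deviations inside the eps-tube cost nothing,
    larger ones are charged linearly for the part exceeding eps."""
    return max(0.0, abs(xi) - eps)


# Hypothetical deviations between target and estimate, with eps = 0.1.
deviations = [-0.30, -0.05, 0.0, 0.10, 0.25]
costs = [eps_insensitive_cost(xi, eps=0.1) for xi in deviations]
print(costs)
```

Only the first and last deviations fall outside the tube, so only they receive a non-zero cost; the function is symmetric and convex, as the text requires.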
This kind of function has an additional parameter, $\varepsilon$, which adjusts the maximum allowable deviation for any given point. A validation process is required to adjust this parameter, even though its value can be approximated from any additional knowledge about the noise or data distributions. To finish the SVRM section, we will summarize a variation of the $\varepsilon$-SVRM (the one using the $\varepsilon$-insensitive cost function), called $\nu$-SVRM [Schölkopf et al, 1998]. The difference lies not in the cost function itself (which remains the $\varepsilon$-insensitive one), but in the objective function. The $\varepsilon$-SVRM gave the objective function as:

(33) $L_P = \frac{1}{2}\|w\|^2 + C \, \frac{1}{M} \sum_{i=1}^{M} (\xi_i + \xi_i^*)$
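and now, in $\nu$-SVRM, the objective function is:

(34) $L_P = \frac{1}{2}\|w\|^2 + C \left( \nu\varepsilon + \frac{1}{M} \sum_{i=1}^{M} (\xi_i + \xi_i^*) \right)$

subject to the same constraints as in (24). The resulting dual problem becomes:

Maximize $L_D = \sum_{i=1}^{M} y_i (\alpha_i - \alpha_i^*) - \frac{1}{2} \sum_{i=1}^{M} \sum_{j=1}^{M} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) \, K(x_i, x_j)$

subject to:

$\sum_{i=1}^{M} (\alpha_i - \alpha_i^*) = 0$

$\sum_{i=1}^{M} (\alpha_i + \alpha_i^*) \le C\nu$

(35) $\alpha_i, \alpha_i^* \in \left[0, \frac{C}{M}\right]$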
and the estimating function keeps the same form as in (31). The values of $b$ and $\varepsilon$ can be calculated after training using constraints (24) for non-bound support vectors. If the value of $\varepsilon$ increases, then the first term of the cost in (34), $\nu\varepsilon$, will increase proportionally, while the second term will decrease, as some points will benefit from the softer constraints and will fall inside the bound (it decreases in proportion to the number of these new lucky points). For the objective function to attain the optimum, the value of $\varepsilon$ must increase until the fraction of error points (out of bounds) is less than or equal to the value of $\nu$. Therefore the new parameter $\nu$ is an upper limit on the fraction of training errors (which are related to the number of support vectors). Obviously it must satisfy $\nu \in [0, 1]$. It seems easier to pick a good $\nu$ value than a good $\varepsilon$ value. Moreover, $\nu$-SVRM is a superset of $\varepsilon$-SVRM: after training with the first method we can calculate the value of the $\varepsilon$ parameter, which can then be used in an $\varepsilon$-SVRM algorithm, giving exactly the same solution obtained in the first place.
2.3 Principal Component Analysis
Support Vector Machines (regression included) and non-linear Principal Component Analysis (PCA) were the first applications in Machine Learning developed under the idea of mapping to a high-dimensional space using Mercer kernels. They differ in the problem to solve even though they use similar means. SVM is a supervised algorithm, i.e. the system state changes depending on whether the output for a given pattern is equal to the expected correct value or not. On the other hand, kernel PCA is an unsupervised algorithm, i.e. there are no labels, and the output is the covariance analysis of the training data distribution [Schölkopf et al, 1998]. PCA is an efficient method to extract structure from the input data, and can be carried out by calculating the eigenvalues and eigenvectors of the system. Formally speaking, PCA is a change of basis in input space that diagonalizes the estimate of the covariance matrix of the normalized input data, which has the form:

(36) $C = \frac{1}{M} \sum_{i=1}^{M} x_i x_i^T$
where $M$ is the number of patterns $x_i$. The principal components are the new coordinates in the basis formed by the eigenvectors, i.e. the orthogonal projections of the data vectors onto the eigenvectors. Eigenvalues $\lambda$ and eigenvectors $V$ must be non-zero and satisfy $\lambda V = C V$. We introduce the usual non-linearity concept, with the mapping function $\Phi$ and its corresponding kernel. We assume there exist coefficients $\alpha_1, \dots, \alpha_M$ such that:

(37) $V = \sum_{i=1}^{M} \alpha_i \, \Phi(x_i)$

and the corresponding kernel matrix $K$ (as defined in previous sections). Then we arrive at the problem:

(38) $M \lambda \alpha = K \alpha$

where $\lambda$ are the eigenvalues and $\alpha = (\alpha_1, \dots, \alpha_M)$ the eigenvector coefficients. To extract the principal components of a given pattern $x$, its projections in feature space are calculated in the following form:

(39) $V^k \cdot \Phi(x) = \sum_{i=1}^{M} \alpha_i^k \, \Phi(x_i) \cdot \Phi(x) = \sum_{i=1}^{M} \alpha_i^k \, K(x_i, x)$
In this notation, $k$ is a superscript denoting the $k$-th eigenvector and its $k$-th set of coefficients. Note that the previous calculation yields $k$ non-zero eigenvalues and eigenvectors, each of them with a set of $M$ coefficients. To implement the kernel PCA algorithm, the following steps must be taken:
1. Calculate the kernel matrix, of size $M \times M$: $K_{ij} = k(x_i, x_j)$ for all $i, j \in \{1, \dots, M\}$. Here comes the first problem when using kernel PCA. The resources needed for any matrix calculation grow at least with the square of the matrix size, so with current algorithms and hardware no more than 5000 data points should be used. If you are provided with more data for training (which should never be seen as an unfortunate case), a representative subset must be created heuristically.
2. Diagonalize matrix $K$ to calculate the eigenvalues and eigenvectors following equation (38), using traditional methods, and normalize the vectors. After this calculation we obtain the coefficients $\alpha^k = (\alpha_1^k, \dots, \alpha_M^k)$, to be used in the projection phase.
3. To extract the non-linear principal components of a given pattern, calculate the projections of the point onto the eigenvectors using equation (39). The number of principal components (non-zero eigenvectors) to be used is the designer’s choice. But not all of them should be used: otherwise the process would be useless. Just choose the first $k$ principal components, those with a significant amount of information and very little noise.
After this simple process, you get a change of data space. From input space we change to a $k$-dimensional space ($k$ being a fraction of $M$), in which each dimension gives a useful feature taken from the non-linear correlations in the training data set. That is a conceptual difference between SVM training and kernel PCA training: in the first case new features are generated implicitly, and many are dropped after training; in the second case $k$ new features are generated explicitly, all of them with a lot of information, ordered from the most important downwards. The value $k$ has an upper bound of $M$, the number of training patterns. The new features are made explicit, so, obviously, $k$ must not be too high or the computation would be inefficient. So only the first components should be used, those having the greatest possible variance, i.e. the largest eigenvalues, i.e. the most discriminant information.
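The three steps above can be sketched in a few lines of Python with NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the Gaussian kernel, the `gamma` value, the function names and the random toy data are all hypothetical choices, and, as in the steps above, the data are assumed to be already normalized, so no centring in feature space is performed.

```python
import numpy as np


def rbf_kernel(X, Y, gamma=1.0):
    """Gaussian kernel matrix between the rows of X and the rows of Y."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


def kernel_pca_fit(X, n_components, gamma=1.0):
    """Steps 1 and 2: build K, diagonalize it, keep the leading coefficient sets."""
    K = rbf_kernel(X, X, gamma)                     # step 1: M x M kernel matrix
    mu, U = np.linalg.eigh(K)                       # step 2: eigenvalues of K, ascending
    order = np.argsort(mu)[::-1][:n_components]     # largest eigenvalues first
    mu, U = mu[order], U[:, order]
    alphas = U / np.sqrt(mu)                        # normalize so each V^k has unit length
    return alphas, mu / X.shape[0]                  # mu / M are the lambdas of equation (38)


def kernel_pca_project(X_train, alphas, X_new, gamma=1.0):
    """Step 3: equation (39), one column per retained principal component."""
    return rbf_kernel(X_new, X_train, gamma) @ alphas


# Hypothetical toy data: 50 random two-dimensional patterns, 3 components kept.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
alphas, lambdas = kernel_pca_fit(X, n_components=3, gamma=0.5)
Z = kernel_pca_project(X, alphas, X, gamma=0.5)
print(Z.shape, lambdas)
```

The array `Z` holds the explicit new features; it could be passed to any supervised learner. The components come out ordered by decreasing eigenvalue, which is the ordering the text recommends when choosing how many to keep.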
The usefulness of non-linear PCA in pattern recognition has been tested thoroughly, attaining classification performance as good as the best non-linear SVM and well above neural networks. The process is very simple: first calculate the projection coefficients and select the best ones; then transform all patterns (training, validation and test) explicitly into the new space; afterwards use these data to train a linear or non-linear classification machine (SVM, neural networks, decision trees, …, any one will do), which will be the true supervised classification process. When using kernel PCA for classification, a linear SVM is usually used for supervised training, giving enough flexibility to solve any non-linear problem. The described process is very much like a one-hidden-layer neural network in which the architecture and the first-layer weights are obtained by optimised means: the eigenvectors of the covariance matrix. Also, because the new features are calculated explicitly, a multiclass SVM can be trained easily: the first layer would be common (as in neural networks), and the second layer (a linear discriminant) can be calculated using the hyperplane vector $w$ (which can now be calculated because feature space is no longer implicit), giving O(1) complexity.
3. SVM VERSUS NEURAL NETWORKS
Neural Networks (NN) have led the Machine Learning field since the 1980s thanks to the simplicity of their development and interpretation, while having very competitive generalization ability. Nevertheless, 20 years later, design and development complexity has increased considerably when trying to solve secondary problems such as convergence speed, new error calculation concepts, new activation functions with additional constraints, local minimum avoidance, and so on. So many years of active research have turned NN from their initial simplicity to their current complexity, fit only for specialised engineers. Probably, as time goes by, SVM will follow a similar path, from current simplicity to some degree of complexity, needing a human expert to bring out all of its potential. Complexity by itself is not bad: higher method complexity usually leads to better performance or classification rates. But basic research on NN is currently scarce. For the most part, it is about new applications where NN give better results for specific architectures, so a qualitative jump is needed: SVM. This is quite a natural step in human research, and many examples can be shown. When a technology reaches its limit, a new approach must be introduced. At first, the performance of both methods may be similar, but the new one will eventually outperform the old one. We believe we are currently at the beginning of a technology jump, so it is a good time to change sides. Throughout this chapter, the relation between SVM and NN has been widely established. It can be stated that the topology of an SVM can be developed as a one-hidden-layer perceptron. It has been demonstrated in the NN literature that the family of one-hidden-layer perceptrons can act as a universal discriminator, i.e. it can approximate any function. For a sigmoidal activation function, the similarity between NN and SVM with kernel (22) is complete (see figure 16).
Figure 16. A neural network approach for the SVM implicit architecture. Note that the layers are completely connected (although not explicitly shown for figure clarity). Also, all weights are equal to 1 except the ones connecting the hidden layer and the output layer, which are equal to the corresponding support vector’s $\alpha_i$

After the SVM optimisation process, we obtain a network having $d$ units in the first layer (input space dimension), $N$ units in the hidden layer (number of support vectors), and one unit in the output layer (binary classifier), with weights connecting the last two layers. For other kernels, the similarity is somewhat weaker, although the resulting topology is still that of a $(d, N, 1)$ one-hidden-layer perceptron. Only the kernel function makes the difference. On the other hand, the SVM with RBF kernel is like an RBF classification network in which the clusters and their characteristics have been calculated using an automatic optimal algorithm. When using a similar kernel and activation function, an important difference can be observed. SVM tend to have a larger number of support vectors (hidden-layer units) when facing a complex or noisy training set. Neural networks can attain similar classification performance with far fewer internal units. This is essential to the speed of the test phase, because it depends directly on the number of elements in the hidden layer. The number of multiply-add operations done in the test phase of either method is $N(d + 1)$. The reason for such a difference is mainly that the support vectors defining the hidden layer are constrained to be training points. Neural networks do not have such a constraint, so they need fewer elements to model the same function (they have more degrees of freedom). This does not mean that the NN solution is better; it is quicker in the test phase, and its topology complexity is lower, but generalization performance is not affected. Moreover, note that the errors allowed in the training phase (including those points lying inside the margin) become support vectors. When optimising complex or noisy training sets with loose error penalization, the number of training errors can be very large. But this problem was also solved during the first steps of SVM research. In [Burges, 1996] the “reduced support vector set” method is described. Given a trained SVM, this method creates a smaller support vector set representing approximately the same information as the whole support vector set.
But in this case the former constraint is eliminated, because the new virtual support vectors need not be training points. The result is very much like the NN approach topology. This new expansion solves the classification speed problem, making the SVM competitive against other Machine Learning methods. Nevertheless, it is seldom used because of its considerable development difficulty. Even more similarity can be found between NN and SVM classifiers that use kernel PCA as the feature extraction step. Units in the hidden layer are calculated explicitly using the eigenvector projection instead of the kernel calculation. These units are not significant training points but true features, all of which capture the concepts underlying the internal data distribution. Thus, the classifier topology should be very similar to the one generated by an experienced NN architect, because these features have strong statistical meaning. The only flexibility a NN offers that the SVM cannot reach is the multiple-hidden-layer approach (using kernel PCA plus a non-linear SVM could give up to 2 hidden layers, but it is seldom used). In spite of the fact that a one-hidden-layer topology is a universal discriminator, having more hidden layers can make the training process much more efficient. Using that capability, there may be fewer units in the net, or convergence may be faster. But the training algorithms grow more complex, and the overfitting and local minimum problems will still be there. Therefore, the main differences between both methods are:
• Training one SVM requires much more computation resources than training one NN.
• Classification speed is usually slower in SVM.
• The SVM result is the optimum, while NN can get stuck in local minima. Therefore SVM usually outperforms NN in classification performance.
• SVM parameters are few and easy to use, while NN requires an experienced engineer to create and try the right architecture.
• SVM usually needs only one execution to give the best results, while NN usually requires many tries to give its best.
Outside the scientific community, money rules. Expert engineer time is much more expensive than computing resources, and the difference will keep growing. If Machine Learning algorithms are to be introduced massively in commercial products such as knowledge management or data mining, automatic methods must be used. In the real world, new data is always coming; new profiles arise while others are no longer valid. Neural network flexibility must be tailored by an expert to fit the current state. But, for a company, it may not be worth the cost of tailoring a Machine Learning system that will become obsolete within some months. It is unavoidable: craftsmen will eventually be replaced by machines.
4. SVM OPTIMISATION METHODS
4.1 Optimisation Methods Overview
SVM development tries to solve the problem described in (14): maximize $L_D$ with respect to the Lagrange multipliers, subject to constraints (15) and (16). When SVM appeared, the first approach to solving this problem was to use standard optimisation methods, such as gradient descent or quasi-Newton. These methods, long established in the mathematical literature, mainly apply complex operators to the Hessian matrix (the matrix of second partial derivatives). These one-step processes are intensive in both computation and memory. Memory requirements for the matrices are $O(M^2)$, while computational requirements are $O(M^3)$, where $M$ is the number of patterns in the training set. For instance, a 5000-point set will require 100 Mbytes of memory using single-precision floating-point numbers. Any process over such an enormous data structure will be very inefficient, beyond the ability of many machines. The main research line in the first years of life of SVM was the search for alternative optimisation methods, developed explicitly for the SVM mathematical formulation. Many new approaches were published before one of them pleased all researchers for its simplicity and efficiency. The main methods, in chronological order of appearance, are the following:
• The chunking method, developed by Vapnik. Points that are not support vectors do not affect the Hessian matrix calculation; therefore, if we take them out before the matrix calculation begins, the resulting Lagrange multipliers remain the same. At the same time, the matrix calculation itself is easier: its complexity is now $O(N^3)$, where $N$ is the number of support vectors, and $N \ll M$. The problem is that we cannot know beforehand which patterns will become support vectors and which ones will not. Therefore, the algorithm is described as:
  • Divide the training set randomly into subsets of the same (heuristic) size.
  • Initialise the “support vector set” to empty.
  • Repeat until there are no more subsets:
    • Join the “support vector set” with the next subset.
    • Optimise the joined subset (as if it were a complete SVM training set) using basic optimisation techniques (gradient descent, etc.).
    • Assign to the “support vector set” the patterns identified as support vectors in the last optimisation run.
  The idea behind this algorithm is to eliminate, at each step, those points that are not support vectors and are therefore unlikely to become support vectors in an optimisation of the complete set. However, in one step you could eliminate significant patterns that would become support vectors only in the presence of patterns not in the current set. Of course, if $N$ and $M$ are of about the same magnitude, then this algorithm is even worse than the basic methods.
• The decomposition method, described in [Osuna et al, 1997]. Following the chunking idea, we can optimise just a small subset of training points with respect to the complete training set. The similarities with the chunking method are: first, small subsets are trained at each step; second, basic optimisation algorithms are used at each step. There are also differences: first, each step calculates the optimisation state for a small subset with respect to the whole training set (therefore, no support vectors will be missed); second, the values of $N$ and $M$ are not so strictly constrained. The resulting value is an approximation with more guarantees of being close to the result of the basic methods (which give the exact result), but at the same time the method can handle many more data patterns. The size of the small subset is chosen heuristically, depending on the hypothesis complexity, the number of data points or the noise level.
• The SMO method, described in [Platt, 1999]. The decomposition method is taken to the limit. The smallest set you can optimise has two points, for condition (16) to hold. But the remarkable difference is that a two-point optimisation step can be calculated analytically, using a simple equation. Therefore, the optimisation method becomes very simple:
  • Until we are close enough to the correct solution:
    • Choose two points.
    • Optimise those two points with respect to the complete training set.
  Generating an SVM optimisation algorithm is no longer a mathematician’s field. Calculating the optimisation step is easy, and the two-point set is selected using a fixed set of heuristics, accepted by the scientific community as the best you could imagine (until a better one is published). The SMO algorithm is the quickest and the most scalable of them all, so there is no more controversy about which optimisation method should be used.
The greatest disadvantages of SVM with respect to other Machine Learning methods are the training speed and the limit on training set size. Using the decomposition algorithm, the biggest training set successfully tried had around 100,000 patterns. Time complexity is a bit harder to calculate, but it has been found empirically to be $O(N^2)$, while memory requirements decrease dramatically using SMO (no matrices needed). When facing a really big problem (more than 100,000 points), the SVM user should try divide-and-conquer algorithms, allowing the SMO algorithm to perform more time-efficient tasks.
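To make the memory argument concrete, the short sketch below reproduces the arithmetic behind the 100 Mbytes figure quoted earlier for a dense $M \times M$ matrix, and extends it to larger training sets. The function name and the extra set sizes are illustrative assumptions; 4 bytes per value corresponds to single precision, and megabytes are taken as $10^6$ bytes.

```python
def kernel_matrix_bytes(num_patterns, bytes_per_value=4):
    """Memory needed to hold a dense M x M matrix (4 bytes = single precision)."""
    return num_patterns ** 2 * bytes_per_value


# 5000 patterns is the example from the text; the other sizes are hypothetical.
for m in (5_000, 20_000, 100_000):
    megabytes = kernel_matrix_bytes(m) / 1e6
    print(f"M = {m}: {megabytes:.0f} MB")
```

The quadratic growth is what makes full-matrix methods impractical near the 100,000-pattern mark, and why a method that stores no matrices, such as SMO, scales further.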
4.2 SMO Algorithm
Having described the main features of the optimisation methods, we will now show the basics of the SMO algorithm in more detail. This section is divided into two parts: the takeStep function, where the two-point optimisation equation is implemented; and the choice of the point pair, where the heuristics are implemented and a usually ignored (but important for efficiency) parameter appears.
4.2.1 The takeStep Function
After reading Platt’s paper [Platt, 1999], where the first SMO pseudo-code appears, the non-mathematician reader will have that uncomfortable “this is too complex for me to understand” feeling. Because we do not want to lose your interest in SVM (after all, you have read all the way up to here), we will follow Platt’s pseudo-code for the takeStep function, adding some intermediate steps that will make it easier for engineers without a very strong mathematical background. We are given two points, A and B, each defined by an input vector $x$ and a label $y$, as $(x_A, y_A)$ and $(x_B, y_B)$. We distinguish two cases, one where $y_A = y_B$ and another where $y_A \ne y_B$.
• Case $y_A = y_B$. Because of constraint (16), $\alpha_A + \alpha_B = H$ before and after optimisation, where $H$ is a constant real value, calculated using the previous $\alpha_A$ and $\alpha_B$ values. The SVM tries to maximize $L_D$ with respect to all $\alpha_i$. In one SMO step we want to maximize $L_D$ with respect to $\alpha_A$ and $\alpha_B$. All other Lagrange multipliers are constants in this optimisation step. The main condition for this step is $\frac{\partial L_D}{\partial \alpha_A} = 0$, where

$L_D = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j$
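We can separate this expression into two additive parts:
$$\mathrm{Set}_0 = \sum_{i=1}^{l} \alpha_i$$
$$\mathrm{Set}_u = -\frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j \, (x_i \cdot x_j)$$
Suppose the set L is defined as the set of all training patterns except A and B. We can further separate $\mathrm{Set}_u$ into nine additive parts:
Set 1: i = A and j = A
Set 2: i = A and j = B
Set 3: i = A and j ≠ B and j ≠ A
Set 4: i = B and j = B
Set 5: i = B and j = A
Set 6: i = B and j ≠ B and j ≠ A
Set 7: i ≠ A and i ≠ B and j = A
Set 8: i ≠ A and i ≠ B and j = B
Set 9: i ≠ A and i ≠ B and j ≠ A and j ≠ B
Then substitute $\alpha_B$ by $H - \alpha_A$ everywhere it appears. The condition becomes
$$\frac{\partial L_D}{\partial \alpha_A} = \frac{\partial\,\mathrm{Set}_0}{\partial \alpha_A} + \sum_{k=1}^{9} \frac{\partial\,\mathrm{Set}_k}{\partial \alpha_A} = 0$$
The complete expression, writing $K_{mn}$ for the kernel value of points m and n, is:
$$\begin{aligned}
\frac{\partial L_D}{\partial \alpha_A} ={}& \frac{\partial}{\partial \alpha_A}\bigl(\alpha_1 + \dots + \alpha_A + (H - \alpha_A) + \dots + \alpha_n\bigr) \\
&- \frac{1}{2}\frac{\partial}{\partial \alpha_A}\bigl(\alpha_A^2 K_{AA}\bigr)
 - \frac{1}{2}\frac{\partial}{\partial \alpha_A}\bigl((H\alpha_A - \alpha_A^2) K_{AB}\bigr)
 - \frac{1}{2}\frac{\partial}{\partial \alpha_A}\sum_{j \in L} \alpha_j \alpha_A y_j y_A K_{jA} \\
&- \frac{1}{2}\frac{\partial}{\partial \alpha_A}\bigl((H - \alpha_A)^2 K_{BB}\bigr)
 - \frac{1}{2}\frac{\partial}{\partial \alpha_A}\bigl((H\alpha_A - \alpha_A^2) K_{AB}\bigr)
 - \frac{1}{2}\frac{\partial}{\partial \alpha_A}\sum_{j \in L} \alpha_j (H - \alpha_A) y_j y_B K_{jB} \\
&- \frac{1}{2}\frac{\partial}{\partial \alpha_A}\sum_{i \in L} \alpha_i \alpha_A y_i y_A K_{iA}
 - \frac{1}{2}\frac{\partial}{\partial \alpha_A}\sum_{i \in L} \alpha_i (H - \alpha_A) y_i y_B K_{iB}
 - \frac{1}{2}\frac{\partial}{\partial \alpha_A}\sum_{i \in L}\sum_{j \in L} \alpha_i \alpha_j y_i y_j K_{ij}
\end{aligned}$$
After partial differentiation (recall that $y_A y_B = y_A^2 = y_B^2 = 1$ in this case):
$$\begin{aligned}
\frac{\partial L_D}{\partial \alpha_A} ={}& 0 - \frac{1}{2}(2\alpha_A) K_{AA} - \frac{1}{2}(H - 2\alpha_A) K_{AB} - \frac{1}{2}\sum_{i \in L} \alpha_i y_i y_A K_{Ai} \\
&- \frac{1}{2}\bigl(-2(H - \alpha_A)\bigr) K_{BB} - \frac{1}{2}(H - 2\alpha_A) K_{AB} - \frac{1}{2}(-1)\sum_{i \in L} \alpha_i y_i y_B K_{Bi} \\
&- \frac{1}{2}\sum_{i \in L} \alpha_i y_i y_A K_{Ai} - \frac{1}{2}(-1)\sum_{i \in L} \alpha_i y_i y_B K_{Bi} + 0
\end{aligned}$$
Adding up the members with equal $K_{mn}$ term and changing the overall sign:
$$0 = y_A \sum_{i \in L} \alpha_i y_i K_{Ai} - y_B \sum_{i \in L} \alpha_i y_i K_{Bi} + \alpha_A K_{AA} + (H - 2\alpha_A) K_{AB} - (H - \alpha_A) K_{BB}$$
where the index i runs over the set L (all training points except A and B). Therefore,
$$\alpha_A = \frac{y_B \sum_{i \in L} \alpha_i y_i K_{Bi} - y_A \sum_{i \in L} \alpha_i y_i K_{Ai} + H\,(K_{BB} - K_{AB})}{K_{AA} - 2K_{AB} + K_{BB}} \tag{40}$$
and $\alpha_B$ is very easy to compute, because $\alpha_B = H - \alpha_A$.
• Case $y_A \neq y_B$. Because of constraint (16), $\alpha_B = H + \alpha_A$ before and after optimisation, where H is a constant real value calculated from the previous $\alpha_A$ and $\alpha_B$ values. The mathematical development is very similar to the first case; the differences are that $y_A y_B = -1$ and that the derivative of $\mathrm{Set}_0$ is now 2. The formula to calculate the step is:
$$\alpha_A = \frac{-y_B \sum_{i \in L} \alpha_i y_i K_{Bi} - y_A \sum_{i \in L} \alpha_i y_i K_{Ai} - H\,(K_{BB} - K_{AB}) + 2}{K_{AA} - 2K_{AB} + K_{BB}} \tag{41}$$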
and $\alpha_B = H + \alpha_A$.
Note that in the soft margin algorithm the α values must lie between 0 and C. This is not guaranteed by the formulas written here. Therefore, a last step is needed: clipping. The $\alpha_A$ value must be set inside the defined bounds prior to the $\alpha_B$ calculation. Afterwards, the $\alpha_B$ value must be checked to be inside the same bounds, and otherwise clipped. Note that this clipping procedure only needs two checks, and that the constraint involving H must always hold. Nevertheless, numerical stability must be ensured in the clipping procedure by using only the add and subtract arithmetic operations, avoiding inexact floating point multiply operations.
In both cases, you still have to calculate a couple of sums over the training points, so the complexity of one step is related to the number of points in the training set. Suppose you implement an internal cache where those terms are calculated a priori. Then the step complexity would be O(1). But this improvement is not free: now you have to update the cache value for all training points after each successful step, so the complexity becomes O(M) all over again. Actually, you can limit the cache update to the current non-bound support vectors, that is, the points that will most probably be optimised in the next steps. Then the complexity decreases to $O(N')$, $N'$ being the number of non-bound support vectors, and usually $N' < N \ll M$.
The cache used is the error term (usually defined as E in code implementations), being
$$E(x_A) = E_A = \mathrm{SVMCurrentLearnedFunction}(x_A) - y_A = w \cdot x_A + b - y_A = \sum_{i=1}^{N} \alpha_i y_i K_{Ai} + b - y_A$$
Therefore, the previously defined term becomes:
$$\sum_{i \in L} \alpha_i y_i K_{Ai} = \sum_{i=1}^{N} \alpha_i y_i K_{Ai} - \alpha_A y_A K_{AA} - \alpha_B y_B K_{AB} = E_A - b + y_A - \alpha_A y_A K_{AA} - \alpha_B y_B K_{AB}$$
Substituting this term in (40) and (41) and expanding H gives a common and very simple formula for both cases (same and different class):
$$\alpha_A^{new} = \alpha_A^{old} + \frac{y_A\,(E_B - E_A)}{K_{AA} - 2K_{AB} + K_{BB}} \tag{42}$$
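The update in (42) followed by clipping can be sketched in a few lines of Python. This is only an illustrative sketch, not Platt's original pseudo-code: the function name `take_step`, its argument list and the choice to return `None` for a degenerate denominator are assumptions made here. Instead of clipping the two multipliers one after the other, the sketch computes in advance the interval of values for the first multiplier that keeps both multipliers inside [0, C], which leads to the same result while using only additions and subtractions to recover the second multiplier.

```python
def take_step(alpha_a, alpha_b, y_a, y_b, e_a, e_b, k_aa, k_ab, k_bb, c):
    """One two-point update following Equation (42), then clipping.

    Hypothetical helper: returns the new (alpha_a, alpha_b), or None when
    the denominator is not positive (a case this sketch simply skips).
    """
    eta = k_aa - 2.0 * k_ab + k_bb
    if eta <= 0.0:
        return None
    if y_a == y_b:
        # alpha_a + alpha_b = h must hold before and after the step
        h = alpha_a + alpha_b
        low, high = max(0.0, h - c), min(c, h)
    else:
        # alpha_b - alpha_a = h must hold before and after the step
        h = alpha_b - alpha_a
        low, high = max(0.0, -h), min(c, c - h)
    # Unconstrained optimum from Equation (42)
    new_a = alpha_a + y_a * (e_b - e_a) / eta
    # Clip so that both multipliers stay inside [0, c]
    new_a = min(max(new_a, low), high)
    # Recover the second multiplier with add/subtract operations only
    new_b = h - new_a if y_a == y_b else h + new_a
    return new_a, new_b
```

For instance, with two one-dimensional points x_A = 1 (label +1) and x_B = -1 (label -1), a linear kernel and all multipliers initially zero, the errors are E_A = -1 and E_B = 1, and a single call moves both multipliers to 0.5 when C is large enough.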
This cache concept has accompanied the SMO algorithm from its very beginnings. It may seem that no great advantage is gained by using and updating the cache instead of calculating the terms at each optimisation step. The reason it is done this way is the heuristics.
4.2.2 The Heuristics
The greatest cost in the SMO algorithm is the error cache update. The optimisation step itself has a very low cost when this cache is used. Therefore, as was said in the previous section, the fewer cache values updated, the better. So a two-fold search pattern must be used: first, the best optimisation pair (the one giving the highest increase in $L_D$) must be tried at each step; second, the fewer pairs inspected for one step, the better.
For the first issue, we could use some of the hand-made optimisation rule examples given in section 2.2. In much the same way as those rules were figured out, the main heuristic defined in SMO is: choose the pair of points whose error cache difference is the greatest. Note that one of the effects of a two-point optimisation is that both points become “well classified”, i.e., their error value becomes 0. Looking at the $L_D$ formula analysis in section 2.2, it can be seen that classification correctness is one of the main signs of reaching the optimum. If you optimise the two points which are most poorly classified, then the increase in the $L_D$ value is likely to be the highest. This approach is only a heuristic. It may not be formally provable, but it seems very plausible.
For the second issue, we need a search pattern using the fewest error calculations possible. At an intermediate optimisation step, it has been shown that those points whose Lagrange multiplier is at bound ($\alpha_i = 0$ or $\alpha_i = C$) will probably remain that way throughout the rest of the training. Thus, updating the error cache of these points is not worthwhile. Most of the time the non-bound points will be used for the optimisation procedure (and therefore only these points will have their cache updated). In that order of usefulness, support vectors at bound ($\alpha_i = C$) and non-support-vectors ($\alpha_i = 0$) will most probably remain non-significant throughout the rest of the training. Therefore, the choice of both points follows the same search pattern:
• First, try successful steps using every non-bound point (fine-grain).
• If not successful, try on all the set (coarse-grain).
The training procedure can be seen as a two-loop approach, each loop selecting one of the points of the optimisation pair (called i1 and i2 in the original SMO pseudo-code).
    Loop until $L_D$ is high enough
        Choose next i1 following the two-set approach heuristic.
        Loop until an i2 choice is good enough or i1 cannot be optimised, choosing i2 in one of three ways:
            First try the point having the greatest error (cache) difference.
            If not successful, try all support vectors.
            If not successful, try all patterns.
            If not successful, i1 cannot be optimised at this step.
        If the i2 choice is successful, then takeStep(i1, i2).
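The first way of choosing i2 in the pseudo-code above (the point with the greatest error cache difference) can be sketched as follows. The function name `choose_second` and the representation of the error cache as a plain list indexed by point are assumptions made for this illustration; they do not come from the original SMO pseudo-code.

```python
def choose_second(i1, errors, non_bound):
    """Pick i2 among the non-bound points so that |E[i1] - E[i2]| is largest.

    errors    : list with the cached error E of every training point
    non_bound : indices of the points whose multiplier is strictly
                between 0 and C (the only ones with an up-to-date cache)
    Returns None when there is no candidate other than i1.
    """
    candidates = [i for i in non_bound if i != i1]
    if not candidates:
        return None
    return max(candidates, key=lambda i: abs(errors[i1] - errors[i]))
```

If the step attempted with this choice fails, the caller would fall back to the remaining options of the pseudo-code (all support vectors, then all patterns).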
Actually, the implementation is slightly different, as the takeStep procedure itself is used to determine whether the optimisation step is successful. But the idea behind this simple pseudo-code is quite a close approximation. Remember that points at bound do not have their error cache updated. Therefore, when choosing i2, if the first try fails, points with a non-updated cache are used, decreasing the efficiency of the current step. In much the same vein, there is a usually forgotten parameter in the SMO code that is very important for attaining greater efficiency. In the previous pseudo-code, there is no mention of when to switch the set used to choose i1. Current implementations use a very simple decision: when you reach an optimisation state where no first-set points can be optimised, switch to the bigger set. This condition is also used for the algorithm termination: when you are already using the bigger set and no points can be optimised, the current numerical approximation is good enough.
The effect of these sharp limits is lower efficiency. Suppose you are at an early optimisation step where at-bound points are still far from being in their correct state (a very common case). A set of non-bound points is ready for optimisation. The sharp-limit algorithm will optimise that small set until it is completely correct. That approach is not at all cheap; it is like generating a correct smaller SVM using some constant information. At each optimisation step, many other points could become non-optimal. Then another pair will be chosen, and the previous pair could become non-optimal again. Fine-grain optimisation processes move at a very slow pace. And after all that work is performed, one good coarse-grain optimisation step can make all the previous fine-grain computation useless. Of course, fine-grain steps are needed as well as coarse-grain steps. A balance that optimises the computational cost must be found. One coarse-grain step always costs more than a fine-grain step, but the objective function increase in the former can be much greater than in the latter. In early steps, coarse-grain optimising performs better, while in close-to-optimum steps, fine-grain optimising is more cost-efficient. However, this analysis is not defined as a sharp rule; it is more like a fuzzy rule, where some heuristics must be used. Note that you can always know approximately how far you are from the solution at any point in the optimisation procedure. At optimality, the difference $L_P - L_D$ is 0. In complex problems, the difference will hardly ever be exactly 0, as a numerical approximation is used and some degree of floating point error is allowed (using a precision parameter). But it gives a very good estimate of whether a fine-grain or a coarse-grain optimisation step is more likely to give better results. Note that the difference is non-negative, as $L_P$ is always greater than (or, at optimality, equal to) $L_D$ (see section 2.2).
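As a rough illustration of how the distance to the optimum could be monitored, the sketch below evaluates the difference between a primal and a dual objective for a linear kernel. It rests on assumptions that are not taken from this chapter: the primal value is taken to be the usual soft-margin objective (half the squared norm of w plus C times the sum of the slacks), the dual value is the $L_D$ expression of section 4.2.1, and the function name `duality_gap` is hypothetical. The definitions of section 2.2 may differ in detail.

```python
def duality_gap(alpha, y, x, b, c):
    """Primal minus dual objective for a linear-kernel soft-margin SVM.

    alpha, y : lists of multipliers and labels (+1 or -1)
    x        : list of input vectors (lists of floats)
    b, c     : bias of the current solution and soft-margin constant C
    """
    dim = len(x[0])
    # w = sum_i alpha_i y_i x_i
    w = [sum(a * yi * xi[k] for a, yi, xi in zip(alpha, y, x))
         for k in range(dim)]
    w_norm2 = sum(wk * wk for wk in w)
    # Slack of each point under the current (w, b)
    slack = 0.0
    for yi, xi in zip(y, x):
        f = sum(wk * xk for wk, xk in zip(w, xi)) + b
        slack += max(0.0, 1.0 - yi * f)
    primal = 0.5 * w_norm2 + c * slack
    # For a linear kernel the double sum in L_D equals ||w||^2
    dual = sum(alpha) - 0.5 * w_norm2
    return primal - dual
```

For the two-point toy problem x = [[1.0], [-1.0]], y = [1, -1], the gap vanishes at alpha = [0.5, 0.5] with b = 0 and is positive for smaller multipliers.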
The basic SMO has been enhanced in several ways, fulfilling the prediction made in section 4: SVM algorithms will get more complex. Keerthi et al. [Keerthi et al., 1999] proposed two modifications to the SMO heuristics that greatly improve performance, although they are a bit more difficult to follow. Also, C.J. Lin created and maintains LIBSVM (Library for Support Vector Machines), a freely available, open-source, unreadable, optimised implementation of SMO, considered worldwide to be the fastest SVM training library.
5. CONCLUSIONS
The SVM community is growing fast. Nowadays all commercial pattern recognition toolboxes include SVM algorithms as a standard Machine Learning method. In the early years, from 1995 to 1999, the implementation was too hard for the engineering community to develop and maintain. But since the publication of the SMO algorithm in 1999, the whole Machine Learning community has been making great strides towards accepting the SVM as a reference method. SVM theory has very strong foundations. Statistical learning theory has its roots deep in mathematical and logical knowledge, and SVM is just a new, mathematically developed extension of it. Although its implementation may seem difficult, the solid background makes the SVM basics very easy to understand, making the results clear and useful for analysis.
6. ACKNOWLEDGEMENTS
The authors would like to thank Ignacio Melgar for text review, and the R&D department at Sener Ingeniería y Sistemas for their support.
REFERENCES
C. J. C. Burges. Simplified support vector decision rules. In L. Saitta, editor, Proc. 13th International Conference on Machine Learning, pages 71–77, San Mateo, CA, 1996.
C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Knowledge Discovery and Data Mining, 2(2), pages 121–167, 1998.
C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273–297, 1995.
R. Fletcher. Practical Methods of Optimization. John Wiley and Sons, Inc., 2nd edition, 1987.
S. Keerthi, S. Shevade, C. Bhattacharyya and K. Murthy. Improvements to Platt’s SMO algorithm for SVM classifier design. Technical Report CD-99-14, Control Division, Dept. of Mechanical and Production Engineering, National University of Singapore, Singapore, August 1999.
E. Osuna, R. Freund and F. Girosi. An improved algorithm for support vector machines. In Proceedings of the 1997 IEEE Workshop on Neural Networks for Signal Processing 7, pages 276–285, Amelia Island, FL, 1997.
E. Osuna and F. Girosi. Reducing run-time complexity in SVMs. In Proceedings of the 14th International Conf. on Pattern Recognition, pages 271–284, Brisbane, Australia, 1998.
J. Platt. Fast training of support vector machines using sequential minimal optimisation. In B. Schölkopf, C. Burges and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pages 185–208. MIT Press, Cambridge, MA, 1999.
B. Schölkopf, P. Y. Simard, A. J. Smola, and V. N. Vapnik. Prior knowledge in support vector kernels. In M. I. Jordan, M. J. Kearns, and S. A. Solla, editors, Advances in Neural Information Processing Systems, volume 10, pages 640–646, Cambridge, MA, 1998. MIT Press.
V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995.
V. Vapnik, S. Golowich, and A. Smola. Support vector method for function approximation, regression estimation, and signal processing. In M. Mozer, M. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems 9, pages 281–287, Cambridge, MA, 1997. MIT Press.
CHAPTER 8
FRACTALS AS PRE-PROCESSING TOOL FOR COMPUTATIONAL INTELLIGENCE APPLICATION
ANA M. TARQUIS (1), VALERIANO MÉNDEZ (1), JUAN B. GRAU (1), JOSÉ M. ANTÓN (1), DIEGO ANDINA (1, 2)
(1) Dpto. de Matemática Aplicada, E.T.S. de Ingenieros Agrónomos, U.P.M., Av. Complutense s.n., Ciudad Universitaria, Madrid 28040, Spain.
(2) Dpto. de Señales, Sistemas y Radiocomunicaciones, E.T.S. Ingenieros de Telecomunicación, U.P.M., Av. Complutense s.n., Ciudad Universitaria, Madrid 28040, Spain.
Abstract:
Preprocessing is the process of adapting the input of a Computational Intelligence (CI) problem to the CI technique applied. Images are the inputs of many problems, and fractal processing of images to extract relevant geometric characteristics is a very important tool. This chapter is dedicated to fractal preprocessing. In Pedology, fractal models have been fitted to match the structure of soils, and techniques for the multifractal analysis of soil images have been developed, as described in a state-of-the-art panorama. A box-counting method and a gliding-box method are presented, both of which obtain sets of dimension parameters from images. They are evaluated in a case study on images of soil samples, in which the second method seems preferable. Finally, a comprehensive list of references is given.
Keywords:
pedology; soil structure; multifractal; soil images; box-counting method; gliding-box method; capacity dimension; information dimension; correlation dimension; multiscale heterogeneity; fractal models; porous media; partition function
D. Andina and D.T. Pham (eds.), Computational Intelligence, 193–213. © 2007 Springer.
INTRODUCTION
Methods of analysis based on fractal theories have been developed in Pedology to describe the structure of soil. Owing to a combination of geology, the action of water and air, and the organisms present, natural soils show a structure down to very small scales that is related to their physical, biological and agricultural properties. That structure tends to correspond to fractal paradigms, as it maintains a fairly similar appearance when the scale is reduced to small microscopic ranges. Fractal models with a reduced number of parameters have been developed to describe naturally complex soils, and different experimental methods were created to compare theory with reality or to evaluate parameters, including representative images of soils treated so as to maintain natural structure features while marking pores, flow patterns, etc.
The next section contains a condensed panorama of the corresponding state of the art, with references to authors and methods. The third section explains the most common methods of subdividing an image to calculate fractal dimensions. Later, two related methods of image description are presented with their formulae, a box-counting method and a gliding-box method; both assume a multifractal structure, and the image analysis results in parameters that include the capacity, information and correlation dimensions. Next, a case study of three images is presented: using both methods, a wide range of generalized dimensions is plotted for the three samples, and these results are described and discussed to assess the methods. Finally, conclusions and a list of references are provided.
1. STATE OF THE ART
Many parameters may be used in the attempt to describe a disordered morphology, but the spatial arrangement of its most prominent features is a challenging problem throughout a wide range of disciplines (Ripley, 1988; Griffith, 1988; Baveye and Boast, 1998). In the case of 2-dimensional images of soil sections, several works try to describe the spatial structure applying fractal techniques. Some of them were studying the spatial arrangement of pore and solid spaces on images of sections of resin-impregnated soil (Protz and VandenBygaart, 1998; VandenBygaart and Protz, 1999). Thin soil sections are analysed by transmitted light to obtain images from which pores, filled with a resin, and solid spaces can be separated using image analysis techniques (Morán et al., 1989; Vogel and Kretzschmar, 1996). In other soil science areas, dye tracers are frequently used to study flow patterns in structured soils, and with modern photographic techniques, they may provide excellent spatial resolution of the flow paths (Flury and Fluhler, 1994; 1995). The most common approach to describe dye patterns has been by descriptive statistics of the vertical variation in dye coverage or shape parameters (Flury et al., 1994) based on a black and white image, as the ones used for soil-pore structure. A main objective was to extract fractal dimensions which characterize multiscale and self-similar geometric structures within the images. In other words, we expect to see that the image viewed at different resolutions looks the same. At a given resolution we should see the matrix as a collection of subsets similar to each other and furthermore similar to the whole. If such a hierarchical organization exists, it can be characterized by a mass fractal dimension D. These structures are either one of the phases (black or white) or the interface between the two within a 2-dimensional image. 
That such dimensions can be extracted from soil images is evident, from soil pore structure (Brakensiek et al., 1992; Peyton et al., 1994; Crawford et al., 1993; 1995; Anderson et al., 1996; 1998; Pachepsky et al., 1996; Giménez et al., 1997; 1998; Oleschko et al., 1997; 1998a; 1998b; Hallet et al., 1998; Bartoli et al., 1991; 1998; 1999; Dathe et al., 2001; Bird et al., 2000; 2003) or flow paths in soils (Hatano and Booltink, 1992; Hatano et al, 1992; Booltink et al., 1993).
More recently, interest has turned to multifractal analysis of soil images. A multifractal, or more precisely a geometrical multifractal (Tel and Vicsek, 1987), is a non-uniform fractal which, unlike a uniform fractal, exhibits local density fluctuations. Its characterization requires not a single dimension but a sequence of generalized fractal dimensions (Pachepsky et al., 2000). A multifractal analysis to extract these dimensions from a soil image may have utility for more complex distributions, if there is marked variation in local density or porosity. Multifractal analysis (MFA) has been applied to images of rock pore systems (Muller and McCauley, 1992; Saucier, 1992; Muller et al., 1995; Muller, 1996; Saucier and Muller, 1999; Saucier et al., 2002), reviewed in the context of soils by Tarquis et al. (2003) and recently applied to soils by Posadas et al. (2003). In other areas, researchers have calculated fractal dimensions of dye patterns in horizontal sections of undisturbed soil cores and related them to the outflow of percolating dye solution. Baveye et al. (1998) calculated the information dimension $D_1$ and the correlation dimension $D_2$ of dye stain patterns in vertical sections in a field soil. Based on these calculations, models have approached the variability shown in the dye images using diffusion limited aggregation techniques (Persson et al., 2001). Several authors have shown that determining the exact value of the generalized dimension is not an easy task (Baveye et al., 1998; Baveye and Boast, 1998; Crawford et al., 1999; Ogawa et al., 1999), pointing to practical difficulties in extracting generalized dimensions (Vicsek, 1990; Buczkowski et al., 1998). Merits and limitations of the multifractal analysis have been discussed by Aharony (1990), Beghdadi et al. (1993), and Andraud et al. (1994). In the same way, Chhabra et al. (1989) pointed out the risks in the estimation of the Hausdorff dimension and cited different sources of errors in some physical cases.
The difficulties arising in practice are due to the fact that the relevant quantities used in the multifractal concept are estimated asymptotically, whereas in image analysis these estimations are much coarser and are limited by the finite resolution of the image (Ahammer et al., 2003) and by the measure built on it, such as the number of black points in a box (Buczkowski et al., 1998). Some authors have also pointed out the influence of the percentage of black pixels in a 2-dimensional image on the $D_q$ obtained (Dathe and Thullner, 2005; Bird et al., 2005; Tarquis et al., 2005; Dathe et al., 2005). On the other hand, MFA involves partitioning the space of study into boxes to construct samples with multiple scales. The number of samples at a given scale is restricted by the size of the partitioning space and the data resolution, which is usually another main factor influencing statistical estimation in MFA (Cheng and Agerberg, 1996). The purpose of this chapter is to assess how successfully a spectrum of generalized dimensions can be extracted from a soil image, trying to avoid the restrictions of the box-counting method, and to discern the existence or non-existence of a multifractal distribution of black or white within the image. In the following sections, we shall review the box-counting algorithm used to obtain the generalized fractal dimensions of multifractal analysis, as well as the gliding-box method. We will show some examples to highlight the shortcomings of these methods.
2. FRACTAL CALCULATIONS
The main purpose of this section is to introduce some of the fractal methods used in the context of black and white image analysis of soil structure. A complete treatment of fractal and multifractal theories can be found, among others, in Feder (1989) and Baveye and Boast (1998).
2.1 Box-counting Method
This methodology is classical in this field and has generated a large volume of work. If a fractal line in a 2-dimensional space is covered by boxes of side length δ, the number of such boxes, n(δ), needed to cover the line when δ → 0 is (Mandelbrot, 1982):
$$n(\delta) = c\,\delta^{-D_L} \tag{1}$$
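Equation (1) suggests a direct numerical recipe: count the occupied boxes for several box side lengths and take the slope of the log-log regression. The following Python sketch illustrates it on a binary image stored as a list of rows of 0/1 values. The function names, the image representation and the plain least-squares fit are assumptions made for this illustration, not the procedure used by the cited authors.

```python
import math

def box_count(image, size):
    """Number of boxes of side `size` containing at least one set pixel."""
    rows, cols = len(image), len(image[0])
    count = 0
    for r0 in range(0, rows, size):
        for c0 in range(0, cols, size):
            if any(image[r][c]
                   for r in range(r0, min(r0 + size, rows))
                   for c in range(c0, min(c0 + size, cols))):
                count += 1
    return count

def capacity_dimension(image, sizes):
    """Least-squares slope of log n(size) against log(1/size).

    Assumes the image has at least one set pixel, so every count is > 0.
    """
    xs = [math.log(1.0 / s) for s in sizes]
    ys = [math.log(box_count(image, s)) for s in sizes]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    den = sum((a - mx) ** 2 for a in xs)
    return num / den
```

As a sanity check, a completely filled square image gives a slope of 2 and a one-pixel-wide diagonal line gives a slope of 1 when the box sizes divide the image side exactly.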
The length of the line studied (e.g., the pore-solid interface), L(δ), can be defined at different scales δ and is equal to n(δ)·δ. At small δ values the method provides a good approximation to the length of the line because the resolution of the image is approached. At larger δ sizes the difference between n(δ)·δ and the “true” length increases. Thus, $D_L$, or capacity dimension, is estimated using small δ values (Gimenez et al., 1997b). The box-counting method is also used to obtain a fractal dimension of pore space by counting boxes that are occupied by at least one pixel belonging to the class “pore” (Gimenez et al., 1997b).
2.2 Dilation Method
Dathe et al. (2001) is the only published report of this method in soil science. The dilation method follows essentially the same procedure as the box-counting method, but instead of using boxes it uses other structuring elements to cover the object under study, e.g., circles (Dathe et al., 2001). The image is formed by pixels, which are either square or rectangular in shape. If circles are used, the measure of scale is their diameter δ (as is the side length of the box in the box-counting technique). If we want to have the same dilation in any direction, the orthogonal and diagonal increments should be biased by √2, which corresponds to the hypotenuse of a square of unit side length (Kaye, 1989). The length of the studied object is counted in numbers of circles, and then the slope of the regression line between the log of the object length and the log of the circle diameter is defined by the relation:
$$L(\delta) = c\,\delta^{1 - D_L} \tag{2}$$
Dathe et al. (2001) applied the box-counting and dilation methods to the same images and found no significant differences in the values of the fractal dimensions obtained with the two methods. They pointed out, however, that the fractal dimensions estimated with the two methods are conceptually different: the box-counting dimension is the Kolmogorov dimension, while the dimension obtained with the dilation method is the embedding dimension (Minkowski-Bouligand). For further details, see Takayasu (1990).
2.3 Random Walk
Fractal methods can also be used to describe the dynamic properties of fractal networks (Crawford et al., 1993; Anderson et al., 1996). Characterization of fractals involving space and time is achieved through the use of fractons (Kaye, 1989) or the spectral dimension (Orbach, 1986). For example, Crawford et al. (1999) related measurements of the spectral dimension d to diffusion through soil, associating d with the degree to which the network delays the diffusing particle in a given direction. The determination of d is based on random walks, where in each walk the number of steps taken, $n_s$, and the number of different pore pixels visited, $S_n$, are computed. At the beginning of the random walk a pore pixel is randomly chosen; then a random step is taken to another pore pixel from the eight pixels surrounding the present one (see Figure 1A). If the new pore pixel has not been visited by the random walk, both $S_n$ and $n_s$ are increased by one; otherwise only $n_s$ is increased. The random walk stops when a certain number of null steps (steps into a site that has been used previously during the walk) is reached or the random walk arrives at an edge of the image (for further details see Crawford et al., 1990). A graphical representation of these random walks is shown in Figure 2. The number of walks and the maximum number of null steps for each walk can vary (Anderson et al., 1996). Also, a four-connected random walk (Figure 1B) can be used instead of an eight-connected one (Anderson et al., 1996).
Figure 1. Possible steps taken in: a) an eight-connected random walk, and b) a four-connected random walk. The present position of the pore pixel is marked by an x, and the arrows indicate the possible next pore pixel. (From Tarquis et al. In: Scaling Methods in Soil Physics, Pachepsky, Radcliffe and Selim Eds., CRC Press, 2003. With permission)
For each random walk, d is calculated based on the relation:
$$n_s = c\,S_n^{d/2} \tag{3}$$
where c is a constant. The mean value of the d calculated for each walk is the spectral dimension.
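The walk described above can be sketched as follows. This is a minimal illustration under several assumptions made here: the image is a list of rows of 0/1 values with 1 marking pore pixels, the starting pixel is not counted in $S_n$, the walk also stops if the current pixel has no neighbouring pore pixel, and the names `random_walk` and `STEPS_8` are hypothetical. The sketch only returns the two counters; fitting d from them is left out.

```python
import random

# The eight neighbours of a pixel (eight-connected walk)
STEPS_8 = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
           (0, 1), (1, -1), (1, 0), (1, 1)]

def random_walk(pore, start, max_null_steps, rng):
    """Walk on pore pixels and return (n_s, S_n).

    n_s : number of steps taken
    S_n : number of distinct pore pixels reached (start not counted)
    """
    rows, cols = len(pore), len(pore[0])
    r, c = start
    visited = {start}
    n_s = s_n = null_steps = 0
    while null_steps < max_null_steps:
        moves = [(r + dr, c + dc) for dr, dc in STEPS_8
                 if 0 <= r + dr < rows and 0 <= c + dc < cols
                 and pore[r + dr][c + dc]]
        if not moves:
            break  # isolated pixel: nowhere to go
        r, c = rng.choice(moves)
        n_s += 1
        if (r, c) in visited:
            null_steps += 1  # null step: site already used
        else:
            visited.add((r, c))
            s_n += 1
        if r in (0, rows - 1) or c in (0, cols - 1):
            break  # the walk arrived at an edge of the image
    return n_s, s_n
```

Passing an explicit generator such as `random.Random(0)` as `rng` makes a run reproducible; a four-connected variant would only need a shorter list of steps.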
Figure 2. A simplified example of one random walk through the pore space (Anderson et al., 1996) (From Anderson et al. Soil.Sci.Soc. A. J., 60, 962,1996. With permission)
3. CALCULATION OF GENERALIZED FRACTAL DIMENSIONS
3.1 Box-counting Method
Generalized dimensions calculated using the box-counting technique basically account for the mass contained in each box. An image is divided into boxes of size r (r pixels in each dimension, making r · r pixels per box); the number of boxes is designated n(r), and for each box the fraction $\mu_i$ of pore space in that box is defined and calculated as
$$\mu_i = \frac{m_i}{M} = m_i \Big/ \sum_{i=1}^{n(r)} m_i \tag{4}$$
where $m_i$ is the number of pore class pixels in the ith box and M is the total number of pore class pixels in the image. In this case, the pore space area is the measure whereas the support is the pore space itself. The next step is to define the generating function χ(q, r) as:
$$\chi(q, r) = \sum_{i=1}^{n(r)} \chi_i(q, r), \qquad q \in \mathbb{R} \tag{5}$$
where
$$\chi_i(q, r) = \mu_i^{\,q} = \left( m_i \Big/ \sum_{i=1}^{n(r)} m_i \right)^{q} \tag{6}$$
$\mu_i$ is a weighted measure that represents the percentage of pore space in the ith box, and q is the weight or moment of the measure. When computing boxes of size r, the possible values of $m_i$ are from 0 to r · r. Therefore, let $N_j(r)$ be the number of boxes containing j pixels of pore space in that grid. Equations (5) and (6) will then be (Barnsley et al., 1988):
$$\chi(q, r) = \sum_{i=1}^{n(r)} \mu_i^{\,q} = \sum_{i=1}^{n(r)} \left( \frac{m_i}{M} \right)^{q} = \sum_{j=1}^{r \cdot r} N_j(r) \left( \frac{j}{M} \right)^{q} = \sum_{j=1}^{r \cdot r} N_j(r) \left( \frac{j}{\sum_{k=1}^{r \cdot r} k\,N_k(r)} \right)^{q} \tag{7}$$
Using the distribution function $N_j(r)$, calculations become simpler and computational errors are smaller. A log-log plot of a self-similar measure, χ(q, r), vs. r at various values of q gives
$$\chi(q, r) \sim r^{-\tau(q)} \tag{8}$$
where τ(q) is the qth mass exponent (Feder, 1989). We can express τ(q) as:
$$\tau(q) = -\lim_{r \to 0} \frac{\log \chi(q, r)}{\log r} \tag{9}$$
Then, the generalized dimension, $D_q$, can be introduced by the following scaling relationship (Feder, 1989):
$$D_q = \frac{1}{q - 1} \lim_{r \to 0} \frac{\log \chi(q, r)}{\log r} \tag{10}$$
And, therefore
$$\tau(q) = (1 - q)\,D_q \tag{11}$$
For the case q = 1, Equation (11) cannot be applied and the following equation should be used:
$$D_1 = \lim_{r \to 0} \frac{\sum_{i=1}^{n(r)} \chi_i(1, r) \log \chi_i(1, r)}{\log r} \tag{12}$$
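The chain from box masses to generalized dimensions can be sketched numerically as below. This is a hedged illustration, not the authors' implementation: the image is assumed to be a list of rows of 0/1 values with 1 marking pore pixels, empty boxes are left out of the sums (which is required for q ≤ 0), the limit r → 0 is replaced by a least-squares slope over a few box sizes, and all function names are made up for the example. For q ≠ 1 the slope of log χ(q, r)/(q − 1) against log r is taken; for q = 1 the entropy-like sum of Equation (12) is used.

```python
import math

def box_masses(image, r):
    """Pore-pixel counts m_i of the non-empty boxes of side r."""
    rows, cols = len(image), len(image[0])
    masses = []
    for r0 in range(0, rows, r):
        for c0 in range(0, cols, r):
            m = sum(1 for i in range(r0, min(r0 + r, rows))
                    for j in range(c0, min(c0 + r, cols)) if image[i][j])
            if m > 0:
                masses.append(m)
    return masses

def slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return num / sum((x - mx) ** 2 for x in xs)

def generalized_dimension(image, q, sizes):
    """Finite-size estimate of D_q from the box sizes in `sizes`."""
    xs, ys = [], []
    for r in sizes:
        masses = box_masses(image, r)
        total = sum(masses)
        mu = [m / total for m in masses]  # fractions of pore space per box
        xs.append(math.log(r))
        if q == 1:
            # numerator of the q = 1 formula: sum of mu * log(mu)
            ys.append(sum(p * math.log(p) for p in mu))
        else:
            chi = sum(p ** q for p in mu)  # generating function
            ys.append(math.log(chi) / (q - 1))
    return slope(xs, ys)
```

On a completely filled square image every $D_q$ comes out as 2, and on a one-pixel-wide diagonal line every $D_q$ comes out as 1, as expected for uniform (monofractal) measures.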
The generalized dimensions, $D_q$, for q = 0, 1 and 2 are known as the capacity, the information, and the correlation dimensions, respectively (Hentschel and Procaccia, 1983). The capacity dimension is the box-counting or fractal dimension. The information dimension is related to the entropy of the system, whereas the correlation dimension computes the correlation of measures contained in boxes of various sizes (Posadas et al., 2003). Given these definitions and the behaviour to expect in the case of a multifractal measure, it is again instructive to seek the lower and upper bounds for χ(q, r) in order to establish what scope exists for behaviour other than that associated with a multifractal measure (for further explanation see Bird et al., 2005). Following Bird et al. (2005), a brief explanation is given here to establish bounds for χ(q, r). Four separate ranges of values of the parameter q should be considered:
Case q > 1. The smallest value that χ(q, r) can take corresponds to a uniform distribution of pore phase over the image. The largest value that χ(q, r) can take corresponds to the case in which each grid block covering the pore phase is entirely filled by pore phase. The lower and upper bounds are then as follows:
$$\left( \frac{L}{r} \right)^{2(1-q)} < \chi(q, r) < f^{\,1-q} \left( \frac{L}{r} \right)^{2(1-q)}, \qquad q > 1 \tag{13}$$
where f is the fraction of the image occupied by black pixels and L is the length of the image.
Case q = 1. The smallest value of the entropy corresponds to the case in which each grid block covering the pore phase is entirely filled by pore phase. The largest value corresponds to a uniform distribution. Lower and upper bounds are then as follows:
$$2 \ln\!\left( \frac{L}{r} \right) + \ln f < -\sum_{i=1}^{n(r)} \mu_i \ln \mu_i < 2 \ln\!\left( \frac{L}{r} \right), \qquad q = 1 \tag{14}$$
Case 0 ≤ q < 1. The smallest value χ(q, r) can take corresponds to the case in which each grid block covering the pore phase is entirely filled by pore. The largest value corresponds to a uniform distribution of pore phase over the image. Lower and upper bounds are as follows:
$$f^{\,1-q} \left( \frac{L}{r} \right)^{2(1-q)} < \chi(q, r) < \left( \frac{L}{r} \right)^{2(1-q)}, \qquad 0 \le q < 1 \tag{15}$$
Case q < 0. The smallest value χ(q, r) can take corresponds to the case in which each grid block covering the pore phase is entirely filled by pore. The function is monotonically decreasing. Therefore, the value corresponding to r = 1 pixel can be selected as an upper bound:
$$f^{\,1-q} \left( \frac{L}{r} \right)^{2(1-q)} < \chi(q, r) < L^{2(1-q)} f^{\,1-q}, \qquad q < 0 \tag{16}$$
Having defined these bounds, we now seek to examine their significance in terms of extracting generalized dimensions from image data. For q > 1 and for 0 ≤ q < 1, the bounding functions, when plotted on the log-log plot used to extract the dimension, yield two parallel lines with a vertical separation of (1 − q) ln f. For q = 1, the bounding functions, when included in the plot of entropy against ln r, again yield two parallel lines of slope 2 with a separation of ln f. Thus, in these cases we reach the same impasse as with the fractal analysis: depending on f, and independently of the actual geometry considered, the data can be so constrained as to yield convincing straight-line fits with associated derived dimensions.
3.2 Gliding Box Method
The gliding-box method was originally used for lacunarity analysis (Allain and Cloitre, 1991). Later, it was modified by Cheng (1997a, 1997b) for estimating τ(q) as follows:
$$\langle \tau(q) \rangle + D = -\frac{\log \langle M(q, r) \rangle}{\log (r / r_{\min})} \tag{17}$$
where D is the dimension of the Euclidean space in which the image is embedded (in this case D = 2) and M represents the multiplier measured on each pixel as:
$$M(q, r) = \left[\frac{\mu(r_{\min})}{\mu(r)}\right]^{q} \tag{18}$$
For further details see Grau et al. (2006). The advantage of using Equation (17) in comparison with Equation (9) is that the estimation is independent of the box size r, which allows τ(q) to be estimated from only two successive box sizes. Equation (18) requires that the measure at rmin should not be null. Once this estimation is done, Equation (8) can be applied to estimate Dq. For the case of q = 1 the following relationship is applied, based on the work of Saucier and Muller (1999):
$$\hat{D}_1 = 2 D_2 - D_3 \tag{19}$$
4. IMAGES FOR THE CASE STUDY
Three soil samples were selected to represent a range of void pattern distributions in soils and a wide range of porosity values, from about 5% to 47%. Each sample was prepared for image analysis following the procedure described by Protz and VandenBygaart (1998). The data were obtained by imaging thin sections with a Kodak 460 RGB camera using transmitted and circularly polarized illumination. The images were cropped from 3060 × 2036 pixels to 3000 × 2000 pixels. EASI/PACE software was then used to classify the data and separate the void bitmap; each individual pixel measured 186 × 186 microns. The images of these soils are shown in Figure 3. To avoid any interference from edge effects in the calculations using the box-counting method, an area of 1024 × 1024 pixels in the upper left corner of the original images was selected.
5. RESULTS OF THE CASE STUDY AND DISCUSSION
5.1 Generating Function with the Box-counting Method
For the three binary images, χ(q, r) was calculated and a bi-log plot of χ(q, r) versus r was made to observe its behavior. All plots showed a clear pattern in the data. In Figure 2, for example, at negative q there were two distinct regions: one with a linear relationship between log r and log χ(q, r), and another where log χ(q, r) was almost constant with respect to log r. The box size at which the behavior changes is around 64 pixels for the three images. These two phases were not evident for positive q values (see Figure 4). The existence of a plateau phase of log χ(q, r) can be explained by the nature of the measure under consideration. At r values close to 1, the variation in the number of black pixels is based on a few pixels, the simplest case being r = 1, where the measure can only take the value 0 or 1. Thus, for small boxes of size r the proportions among their values are mainly constant. However, once the box size exceeds a certain value, a scaling pattern begins.
Figure 3. Soil binary images, pore phase in black pixels, of: (a) ADS, (b) BUSO and (c) EHV1. The images have 5.65%, 19.17% and 46.67% porosity, respectively. [Panels: Sample A, Sample B, Sample C.]
Figure 4. Bi-log plot of χ(q, r) versus box size r at different mass exponent q: A) ADS; B) BUSO; C) EVH1. [Axes: Log(r) versus Log χ(q, r); one curve per q from −10 to 10 in steps of 2.]
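The generating function discussed in this subsection can be sketched in a few lines. The Python fragment below is an illustration under stated assumptions, not the code used for the case study: the image is a list of rows with 1 for pore pixels and 0 otherwise, boxes do not overlap, boxes without pore pixels are skipped, and the names box_masses and partition_function are hypothetical. For real 1024 × 1024 images an array library would be the more natural choice.

```python
def box_masses(image, r):
    """Number of pore pixels (value 1) in each non-overlapping r x r box.

    Rows and columns that do not fill a complete box are ignored.
    """
    rows, cols = len(image), len(image[0])
    masses = []
    for top in range(0, rows - r + 1, r):
        for left in range(0, cols - r + 1, r):
            masses.append(sum(image[y][x]
                              for y in range(top, top + r)
                              for x in range(left, left + r)))
    return masses


def partition_function(image, r, q):
    """chi(q, r): sum of mu_i ** q over the boxes that contain pore pixels."""
    masses = [m for m in box_masses(image, r) if m > 0]
    total = sum(masses)
    return sum((m / total) ** q for m in masses)


# Toy 4 x 4 image with five pore pixels (illustrative data only).
image = [[1, 0, 0, 1],
         [0, 0, 1, 0],
         [0, 1, 0, 0],
         [0, 0, 0, 1]]
n_pore = sum(map(sum, image))
for q in (-2, 0, 2):
    # At r = 1 every occupied box holds exactly one pixel,
    # so chi(q, 1) reduces to n_pore ** (1 - q).
    print(q, partition_function(image, 1, q), n_pore ** (1 - q))
```

At r = 1 all occupied boxes carry the same measure, which is one way to see why the smallest box sizes add little information about the structure of the image.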
5.2 Generalized Dimensions Using the Box-counting Method
If all of the regression points are considered, the Dq values, mainly for q < 0, are quite different from those obtained if only the regression points in the linear region are used (Figure 5). Between these two criteria, any Dq can be obtained, although for q ≥ 0 the differences are not significant. Many authors have pointed out this fact since the first applications of multifractal analysis to experimental results (Tarquis et al., 2005). The changes in Dq, which are very noticeable in this case, make it impossible to compare or calculate the amplitude of the dimensions, D−10 − D+10, as has been done in several works. The differences among the Dq curves (Figure 5, filled circles) are mainly found in the negative part. In particular, comparing ADS (Figure 5A, filled circles) with the rest, it is evident that it does not show multifractal behavior. All the D0 values obtained are equal to 2 (the dimension of the plane). This overestimation is due to the fact that the studied range was selected to give an optimum fit for all q values. However, looking at the lower and upper bounds of the box-counting plots for q = 0 (Figure 6), it is quite clear that, regardless of the structure in the image, a linear fit with a high r² will be obtained. The standard errors (data not shown) of the Dq obtained in the linear region are minimal and the r² of the regression analysis is very high. However, this is not surprising if we realize that only three points are being used. In addition, the number of boxes of each size is very low: for an image of 1024 × 1024 pixels, which is considered a representative elementary area (VandenBygaart and Protz, 1999), there are 64 boxes of 128 × 128 pixels and 16 boxes of 256 × 256 pixels. This size restriction is avoided by using the gliding box method, whose results are discussed in the next section.
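The two points made above, the small number of boxes and the near-automatic straight-line fit for q = 0, can be illustrated with a short calculation. The sketch below is only an illustration and rests on several assumptions that are not taken from the chapter: the three box sizes (64, 128 and 256 pixels) are an assumed example, the gliding box is assumed to move one pixel at a time, and Dq is assumed to follow the usual convention Dq = τ(q)/(q − 1), with τ(q) the slope of log χ(q, r) against log r. The helper name ols_slope is hypothetical.

```python
import math


def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


L = 1024                # image side in pixels (from the text)
f = 0.4667              # porosity of the most porous image (Figure 3)
sizes = [64, 128, 256]  # assumed example of three box sizes
q = 0

# Number of boxes available at each size.
n_box_counting = [(L // r) ** 2 for r in sizes]  # non-overlapping boxes
n_gliding = [(L - r + 1) ** 2 for r in sizes]    # assumes a one-pixel step

# Logarithms of the bounds of chi(0, r): uniform image and fully filled boxes.
log_r = [math.log(r) for r in sizes]
log_uniform = [2 * (1 - q) * math.log(L / r) for r in sizes]
log_filled = [(1 - q) * math.log(f) + v for v in log_uniform]

# Assumed convention: D_q = tau(q) / (q - 1), tau(q) = slope of log chi vs log r.
d0_uniform = ols_slope(log_r, log_uniform) / (q - 1)
d0_filled = ols_slope(log_r, log_filled) / (q - 1)

print(n_box_counting, n_gliding)
print(d0_uniform, d0_filled)
```

Both bounding lines give a D0 of 2, so any data confined between them can only depart slightly from that value over these three sizes.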
5.3 Generalized Dimensions Using the Gliding Box Method
For the three binary images, ⟨M(q, r)⟩ was calculated and a bi-log plot of ⟨M(q, r)⟩ versus r/rmin was made. As expected, all plots showed a linear relationship, with a large number of points available for the linear regression whose slope is used to estimate Dq (Figure 7). In the case of EHV1 for q < −6 (Figure 7), the linear relationship is not as clear as in the rest of the images. Finally, the Dq values obtained by both methods are compared in Figure 8. In all of the graphs, D0 again takes a value of 2, which is imposed by the gliding box method as explained in Section 3.2. For ADS (Figure 8A) both curves are similar. The range of the Dq axis has been changed on purpose, to show that the visual effect of the plot could lead to an erroneous conclusion when it was evident in Figure 5 that Dq was almost constant.
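The gliding box estimate can be sketched as follows. This is a minimal Python illustration rather than the procedure of Cheng (1997a, 1997b) or Grau et al. (2006): it assumes odd box sides, concentric boxes of side rmin and r centred on each pixel, averaging only over positions where the small box contains pore pixels, and the convention Dq = τ(q)/(q − 1). The names box_mass, mean_multiplier and tau_gliding are hypothetical.

```python
import math


def box_mass(image, cy, cx, side):
    """Number of pore pixels in the box of odd side `side` centred on (cy, cx)."""
    half = side // 2
    return sum(image[y][x]
               for y in range(cy - half, cy + half + 1)
               for x in range(cx - half, cx + half + 1))


def mean_multiplier(image, r, r_min, q):
    """Average of (mass(r_min) / mass(r)) ** q over all valid box centres."""
    half = r // 2
    rows, cols = len(image), len(image[0])
    values = []
    for cy in range(half, rows - half):
        for cx in range(half, cols - half):
            m_min = box_mass(image, cy, cx, r_min)
            if m_min == 0:
                continue  # the measure at r_min must not be null
            values.append((m_min / box_mass(image, cy, cx, r)) ** q)
    if not values:
        raise ValueError("no box centre with pore pixels at r_min")
    return sum(values) / len(values)


def tau_gliding(image, r, r_min, q, D=2):
    """tau(q) from the gliding box relation, using two box sizes only."""
    m = mean_multiplier(image, r, r_min, q)
    return -math.log(m) / math.log(r / r_min) - D


# Completely filled 9 x 9 image (illustrative): every D_q should be close to 2.
image = [[1] * 9 for _ in range(9)]
for q in (-3, 0, 2, 3):
    tau = tau_gliding(image, r=3, r_min=1, q=q)
    print(q, tau, tau / (q - 1))

# Estimate for q = 1 from D_2 and D_3, as in the relation 2*D_2 - D_3.
d2 = tau_gliding(image, 3, 1, 2) / (2 - 1)
d3 = tau_gliding(image, 3, 1, 3) / (3 - 1)
print(2 * d2 - d3)
```

For q = 0 the multiplier is 1 for every box, so the relation returns τ(0) = −D whatever the image; under the assumed convention this corresponds to a D0 of 2 for any two-dimensional image.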
Figure 5. Generalized dimensions (Dq) from q = −10 to q = +10 for all points of the regression line (filled squares) and for the three selected points based on the bi-log plot of χ(q, r) (filled circles) of each image: A) ADS; B) BUSO and C) EVH1. [Axes: q versus Dq.]
The differences between both methods for BUSO and EVH1 (Figures 8B and 8C, respectively) are larger for negative q values, although for positive values Dq shows a stronger decay (Grau et al., 2006).
6. CONCLUSIONS
Over recent years, fractal and multifractal concepts have been increasingly applied to the analysis of porous materials, including soils, and to the development of fractal models of porous media. In terms of modeling, it is important to characterize the multiscale heterogeneity of soil structure in a useful way, but the blind application of these analyses does not achieve this.
Figure 6. Box counting plots for EHV1 soil images, q = 0, with upper and lower bounds: (a) solid phase; (b) pore phase. (From Bird et al., J. of Hydrol., 322, 211, 2006. With permission.) [Axes: log r versus log N.]
Figure 7. Bi-log plot of ⟨M(q, r)⟩ versus box size ratio r/rmin at different mass exponent q: A) ADS; B) BUSO; C) EVH1. [Axes: Log(r/rmin) versus Log ⟨M(q, r)⟩; one curve per q from −10 to 10 in steps of 2.]
Figure 8. Generalized dimensions (Dq) from q = −10 to q = +10 based on the gliding box method (empty squares) and on the box-counting method (filled circles) using the same range of box sizes: A) ADS; B) BUSO; C) EVH1. [Axes: q versus Dq.]
The results obtained by the box-counting and gliding-box methods for multifractal modeling of soil pore images show that the gliding-box method provides more consistent results, since it generates a larger number of large boxes than the box-counting method and avoids the restriction that the box-counting method imposes on the partition function.
7. ACKNOWLEDGEMENTS
We thank Dr. Richard Heck of Guelph University for the soil images. We are very indebted to Dr. N. Bird, Dr. Q. Cheng and Dr. D. Gimenez for helpful discussions. This work was supported by the Technical University of Madrid (UPM) and the Madrid Autonomous Community (CAM), Project No. M050020163.
REFERENCES
Aharony, A., 1990. Multifractals in physics – successes, dangers and challenges. Physica A 168, 479–489.
Ahammer, H., De Vaney, T.T.J. and Tritthart, H.A., 2003. How much resolution is enough? Influence of downscaling the pixel resolution of digital images on the generalised dimensions. Physica D 181(3–4), 147–156.
Allain, C. and Cloitre, M., 1991. Characterizing the lacunarity of random and deterministic fractal sets. Physical Review A 44, 3552–3558.
Anderson, A.N., McBratney, A.B. and FitzPatrick, E.A., 1996. Soil Mass, Surface, and Spectral Fractal Dimensions Estimated from Thin Section Photographs. Soil Sci. Soc. Am. J. 60, 962–969.
Anderson, A.N., McBratney, A.B. and Crawford, J.W., 1998. Applications of fractals to soil studies. Adv. Agron. 63, 1.
Barnsley, M.F., Devaney, R.L., Mandelbrot, B.B., Peitgen, H.O., Saupe, D. and Voss, R.F., 1988. The Science of Fractal Images. Edited by H.O. Peitgen and D. Saupe, Springer-Verlag, New York.
Bartoli, F., Philippy, R., Doirisse, S., Niquet, S. and Dubuit, M., 1991. Structure and self-similarity in silty and sandy soils; the fractal approach. J. Soil Sci. 42, 167–185.
Bartoli, F., Bird, N.R., Gomendy, V., Vivier, H. and Niquet, S., 1999. The relation between silty soil structures and their mercury porosimetry curve counterparts: fractals and percolation. Eur. J. Soil Sci. 50(9).
Bartoli, F., Dutartre, P., Gomendy, V., Niquet, S. and Vivier, H., 1998. Fractal and soil structures. In: Fractals in Soil Science, Baveye, Parlange and Stewart, Eds., CRC Press, Boca Raton, 203–232.
Baveye, P. and Boast, C.W., 1998. Fractal Geometry, Fragmentation Processes and the Physics of Scale-Invariance: An Introduction. In: Fractals in Soil Science, Baveye, Parlange and Stewart, Eds., CRC Press, Boca Raton, 1.
Baveye, P., Boast, C.W., Ogawa, S., Parlange, J.Y. and Steenhuis, T., 1998. Influence of image resolution and thresholding on the apparent mass fractal characteristics of preferential flow patterns in field soils. Water Resour. Res. 34, 2783–2796.
Bird, N., Díaz, M.C., Saa, A. and Tarquis, A.M., 2006. Fractal and Multifractal Analysis of Pore-Scale Images of Soil. J. Hydrol. 322, 211–219.
Bird, N.R.A., Perrier, E. and Rieu, M., 2000. The water retention function for a model of soil structure with pore and solid fractal distributions. Eur. J. Soil Sci. 51, 55–63.
Bird, N.R.A. and Perrier, E.M.A., 2003. The pore-solid fractal model of soil density scaling. Eur. J. Soil Sci. 54, 467–476.
Booltink, H.W.G., Hatano, R. and Bouma, J., 1993. Measurement and simulation of bypass flow in a structured clay soil; a physico-morphological approach. J. Hydrol. 148, 149–168.
Brakensiek, D.L., Rawls, W.J., Logsdon, S.D. and Edwards, W.M., 1992. Fractal description of macroporosity. Soil Sci. Soc. Am. J. 56, 1721–1723.
Buczkowski, S., Hildgen, P. and Cartilier, L., 1998. Measurements of fractal dimension by box-counting: a critical analysis of data scatter. Physica A 252, 23–34.
Cheng, Q. and Agterberg, F.P., 1996. Comparison between two types of multifractal modeling. Mathematical Geology 28(8), 1001–1015.
Cheng, Q., 1997a. Discrete multifractals. Mathematical Geology 29(2), 245–266.
Cheng, Q., 1997b. Multifractal modeling and lacunarity analysis. Mathematical Geology 29(7), 919–932.
Crawford, J.W., Baveye, P., Grindrod, P. and Rappoldt, C., 1999. Application of Fractals to Soil Properties, Landscape Patterns, and Solute Transport in Porous Media. In: Assessment of Non-Point Source Pollution in the Vadose Zone, Geophysical Monograph 108, Corwin, Loague and Ellsworth, Eds., American Geophysical Union, Washington, DC, 151.
Crawford, J.W., Ritz, K. and Young, I.M., 1993. Quantification of fungal morphology, gaseous transport and microbial dynamics in soil: an integrated framework utilising fractal geometry. Geoderma 56, 1578.
Crawford, J.W., Matsui, N. and Young, I.M., 1995. The relation between the moisture-release curve and the structure of soil. Eur. J. Soil Sci. 46, 369–375.
Dathe, A., Eins, S., Niemeyer, J. and Gerold, G., 2001. The surface fractal dimension of the soil-pore interface as measured by image analysis. Geoderma 103, 203.
Dathe, A., Tarquis, A.M. and Perrier, E., 2006. Multifractal analysis of the pore- and solid-phases in binary two-dimensional images of natural porous structures. Geoderma, doi:10.1016/j.geoderma.2006.03.024, in press.
Dathe, A. and Thullner, M., 2005. The relationship between fractal properties of solid matrix and pore space in porous media. Geoderma 129, 279–290.
Feder, J., 1989. Fractals. Plenum Press, New York, 283 pp.
Flury, M. and Fluhler, H., 1994. Brilliant blue FCF as a dye tracer for solute transport studies – A toxicological overview. J. Environ. Qual. 23, 1108–1112.
Flury, M. and Fluhler, H., 1995. Tracer characteristics of brilliant blue. Soil Sci. Soc. Am. J. 59, 22–27.
Flury, M., Fluhler, H., Jury, W.A. and Leuenberger, J., 1994. Susceptibility of soils to preferential flow of water: A field study. Water Resour. Res. 30, 1945–1954.
Giménez, D., Allmaras, R.R., Nater, E.A. and Huggins, D.R., 1997a. Fractal dimensions for volume and surface of interaggregate pores – scale effects. Geoderma 77, 19–38.
Giménez, D., Perfect, E., Rawls, W.J. and Pachepsky, Y., 1997b. Fractal models for predicting soil hydraulic properties: a review. Eng. Geol. 48, 161–183.
Gouyet, J.G., 1996. Physics and Fractal Structures. Masson, Paris.
Grau, J., Méndez, V., Tarquis, A.M., Díaz, M.C. and Saa, A., 2006. Comparison of gliding box and box-counting methods in soil image analysis. Geoderma, doi:10.1016/j.geoderma.2006.03.009, in press.
Griffith, D.A., 1988. Advanced Spatial Statistics. Kluwer Academic Publishers, Boston.
Hallett, P.D., Bird, N.R.A., Dexter, A.R. and Seville, P.K., 1998. Investigation into the fractal scaling of the structure and strength of soil aggregates. Eur. J. Soil Sci. 49, 203–211.
Hatano, R. and Booltink, H.W.G., 1992. Using Fractal Dimensions of Stained Flow Patterns in a Clay Soil to Predict Bypass Flow. J. Hydrol. 135, 121–131.
Hatano, R., Kawamura, N., Ikeda, J. and Sakuma, T., 1992. Evaluation of the effect of morphological features of flow paths on solute transport by using fractal dimensions of methylene blue staining patterns. Geoderma 53, 31.
Hentschel, H.G.R. and Procaccia, I., 1983. The infinite number of generalized dimensions of fractals and strange attractors. Physica D 8, 435.
Kaye, B.G., 1989. A Random Walk through Fractal Dimensions. VCH Verlagsgesellschaft, Weinheim, Germany, 297.
Mandelbrot, B.B., 1982. The Fractal Geometry of Nature. W.H. Freeman, San Francisco, CA.
McCauley, J.L., 1992. Models of permeability and conductivity of porous media. Physica A 187, 18–54.
Moran, C.J., McBratney, A.B. and Koppi, A.J., 1989. A rapid method for analysis of soil macropore structure. I. Specimen preparation and digital binary production. Soil Sci. Soc. Am. J. 53, 921–928.
Muller, J., 1996. Characterization of pore space in chalk by multifractal analysis. J. Hydrology 187, 215–222.
Muller, J., Huseby, O.K. and Saucier, A., 1995. Influence of Multifractal Scaling of Pore Geometry on Permeabilities of Sedimentary Rocks. Chaos, Solitons & Fractals 5, 1485–1492.
Muller, J. and McCauley, J.L., 1992. Implication of Fractal Geometry for Fluid Flow Properties of Sedimentary Rocks. Transp. Porous Media 8, 133–147.
Ogawa, S., Baveye, P., Boast, C.W., Parlange, J.Y. and Steenhuis, T., 1999. Surface fractal characteristics of preferential flow patterns in field soils: evaluation and effect of image processing. Geoderma 88, 109.
Oleschko, K., Fuentes, C., Brambila, F. and Alvarez, R., 1997. Linear fractal analysis of three Mexican soils in different management systems. Soil Technol. 10, 185.
Oleschko, K., 1998a. Delesse principle and statistical fractal sets: 1. Dimensional equivalents. Soil & Tillage Research 49, 255.
Oleschko, K., Brambila, F., Aceff, F. and Mora, L.P., 1998b. From fractal analysis along a line to fractals on the plane. Soil & Tillage Research 45, 389.
Orbach, R., 1986. Dynamics of fractal networks. Science (Washington, DC) 231, 814.
Pachepsky, Y.A., Yakovchenko, V., Rabenhorst, M.C., Pooley, C. and Sikora, L.J., 1996. Fractal parameters of pore surfaces as derived from micromorphological data: effect of long term management practices. Geoderma 74, 305.
Pachepsky, Y.A., Giménez, D., Crawford, J.W. and Rawls, W.J., 2000. Conventional and fractal geometry in soil science. In: Fractals in Soil Science, Pachepsky, Crawford and Rawls, Eds., Elsevier Science, Amsterdam, 7.
Persson, M., Yasuda, H., Albergel, J., Berndtsson, R., Zante, P., Nasri, S. and Öhrström, P., 2001. Modeling plot scale dye penetration by a diffusion limited aggregation (DLA) model. J. Hydrol. 250, 98–105.
Peyton, R.L., Gantzer, C.J., Anderson, S.H., Haeffner, B.A. and Pfeifer, P., 1994. Fractal dimension to describe soil macropore structure using X ray computed tomography. Water Resource Research 30, 691.
Posadas, A.N.D., Giménez, D., Quiroz, R. and Protz, R., 2003. Multifractal Characterization of Soil Pore Spatial Distributions. Soil Sci. Soc. Am. J. 67, 1361–1369.
Protz, R. and VandenBygaart, A.J., 1998. Towards systematic image analysis in the study of soil micromorphology. Science Soils 3 (available online at http://link.springer.de/link/service/journals/).
Ripley, B.D., 1988. Statistical Inference for Spatial Processes. Cambridge Univ. Press, Cambridge.
Saucier, A., 1992. Effective permeability of multifractal porous media. Physica A 183, 381.
Saucier, A. and Muller, J., 1993. Remarks on some properties of multifractals. Physica A 199, 350.
Saucier, A. and Muller, J., 1999. Textural analysis of disordered materials with multifractals. Physica A 267, 221.
Saucier, A., Richer, J. and Muller, J., 2002. Statistical mechanics and its applications. Physica A 311(1–2), 231–259.
Takayasu, H., 1990. Fractals in the Physical Sciences. Manchester University Press, Manchester.
Tarquis, A.M., Giménez, D., Saa, A., Díaz, M.C. and Gascó, J.M., 2003. Scaling and Multiscaling of Soil Pore Systems Determined by Image Analysis. In: Scaling Methods in Soil Physics, Pachepsky, Radcliffe and Selim, Eds., CRC Press, 434 pp.
Tarquis, A.M., McInnes, K.J., Keys, J., Saa, A., García, M.R. and Díaz, M.C., 2006. Multiscaling Analysis In A Structured Clay Soil Using 2D Images. J. Hydrol. 322, 236–246.
Tel, T. and Vicsek, T., 1987. Geometrical multifractality of growing structures. J. Physics A. General 20, L835–L840.
VandenBygaart, A.J. and Protz, R., 1999. The representative elementary area (REA) in studies of quantitative soil micromorphology. Geoderma 89, 333–346.
Vicsek, T., 1990. Mass multifractals. Physica A 168, 490–497.
Vogel, H.J. and Kretzschmar, A., 1996. Topological characterization of pore space in soil-sample preparation and digital image-processing. Geoderma 73, 23–38.