MATHEMATICAL APPROACHES TO NEURAL NETWORKS
North-Holland Mathematical Library Board of Advisory Editors: M. Artin, H. Bass, J. Eells, W. Feit, P.J. Freyd, F.W. Gehring, H. Halberstam, L.V. Hormander, J.H.B. Kemperman, H.A. Lauwerier, W.A.J. Luxemburg, F.P. Peterson, I.M. Singer and A.C. Zaanen
VOLUME 51
NORTH-HOLLAND AMSTERDAM LONDON NEW YORK TOKYO
Mathematical Approaches to Neural Networks
Edited by
J.G. TAYLOR
Centre for Neural Networks, Department of Mathematics, King's College London, London, U.K.
1993
NORTH-HOLLAND AMSTERDAM LONDON NEW YORK TOKYO
ELSEVIER SCIENCE PUBLISHERS B.V. Sara Burgerhartstraat 25 P.O. Box 211, 1000 AE Amsterdam, The Netherlands
Library of Congress Cataloging-in-Publication Data

Mathematical approaches to neural networks / edited by J.G. Taylor.
p. cm. -- (North-Holland mathematical library ; v. 51)
Includes bibliographical references.
ISBN 0-444-81692-5
1. Neural networks (Computer science)--Mathematics. I. Taylor, John Gerald, 1931- . II. Series.
QA76.87.M38 1993
006.3--dc20    93-34573
CIP
ISBN: 0 444 81692 5
© 1993 ELSEVIER SCIENCE PUBLISHERS B.V. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without the prior written permission of the publisher, Elsevier Science Publishers B.V., Copyright & Permissions Department, P.O. Box 521, 1000 AM Amsterdam, The Netherlands. Special regulations for readers in the U.S.A. - This publication has been registered with the Copyright Clearance Center Inc. (CCC), Salem, Massachusetts. Information can be obtained from the CCC about conditions under which photocopies of parts of this publication may be made in the U.S.A. All other copyright questions, including photocopying outside of the U.S.A., should be referred to the copyright owner, Elsevier Science Publishers B.V., unless otherwise specified. No responsibility is assumed by the publisher for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions or ideas contained in the material herein.
This book is printed on acid-free paper Printed in The Netherlands
Preface

The subject of Neural Networks is now seen to be coming of age, after its initial inception 50 years ago in the seminal work of McCulloch and Pitts. A distinguished gallery of workers (some of whom are included in this volume) have contributed to building the edifice which is now proving of value in a wide range of academic disciplines and in important applications to industrial and business tasks. The two strands of neural networks are thus, firstly, that appertaining to living systems, their explanation and modelling, and secondly that applied to dedicated tasks to which living systems may be ill adapted or which involve uncertain rules in noisy environments. The progress being made in both these approaches is considerable, but both still stand in need of a theoretical framework of explanation underpinning their usage and allowing the progress being made to be put on a firmer footing. The purpose of this book is to attempt to provide such a framework. Mathematics is rightly regarded as the queen of the sciences, and it is through mathematical approaches to neural networks that a suitable explanatory framework is expected to be found. Various approaches are available, and are contained in the contributions presented here. These span a broad range from single neuron details, through to numerical analysis, functional analysis and dynamical systems theory. Each of these avenues provides its own insights into the way neural networks can be understood, both for artificial ones through to simplified simulations. The breadth and vigour of the contributions underline the importance of the ever-deepening mathematical understanding of neural networks. I would like to take this opportunity to thank the contributors for their contributions and the publishers, especially Dr Sevenster, for his forbearance over a rather lengthy gestation period.
J G Taylor King's College, London 28.6.93
Table of Contents

Control Theory Approach
    P.J. Antsaklis ......................................................... 1

Computational Learning Theory for Artificial Neural Networks
    M. Anthony and N. Biggs ................................................ 25

Time-summating Network Approach
    P.C. Bressloff ......................................................... 63

The Numerical Analysis Approach
    S.W. Ellacott .......................................................... 103

Self-organising Neural Networks for Stable Control of Autonomous Behavior in a Changing World
    S. Grossberg ........................................................... 139

On-line Learning Processes in Artificial Neural Networks
    T.M. Heskes and B. Kappen .............................................. 199

Multilayer Functionals
    D.S. Modha and R. Hecht-Nielsen ........................................ 235

Neural Networks: The Spin Glass Approach
    D. Sherrington ......................................................... 261

Dynamics of Attractor Neural Networks
    T. Coolen and D. Sherrington ........................................... 293

Information Theory and Neural Networks
    J.G. Taylor and M.D. Plumbley .......................................... 307

Mathematical Analysis of a Competitive Network for Attention
    J.G. Taylor and F.N. Alavi ............................................. 341
Mathematical Approaches to Neural Networks J.G. Taylor (Editor) © 1993 Elsevier Science Publishers B.V. All rights reserved.
Control Theory Approach

Panos J. Antsaklis
Department of Electrical Engineering, University of Notre Dame, Notre Dame, Indiana 46556, USA
Abstract
The control of complex dynamical systems is a very challenging problem, especially when there are significant uncertainties in the plant model and the environment. Neural networks are being used quite successfully in the control of such systems, and in this chapter the main approaches are presented and their advantages and drawbacks are discussed. Traditional control methods are based on firm and rigorous mathematical foundations, developed over the last hundred years, and so it is very desirable to develop corresponding results when neural networks are used to control dynamical systems.
1. INTRODUCTION
Problems studied in Control Systems theory involve dynamical systems and require real-time operation of control algorithms. Typically, the system to be controlled, called the plant, is described by a set of differential or difference equations, possibly nonlinear; the equations for the decision mechanism, the controller, are then derived using one of the control design methods. The controller is implemented in hardware or software to generate the appropriate control signals; actuators and sensors are also necessary to translate the control commands into control actions and the values of measured variables into appropriate signals. Examples of control systems are the autopilots in airplanes, the pointing mechanisms of space telecommunication antennas, speed regulators of machines on the factory floor, controllers for emissions control and suspension systems in automobiles, and controllers for temperature and humidity regulators at home, to mention but a few.

The model of the plant to be controlled can be quite poor, either because of lack of knowledge of the process to be controlled, or by choice, to reduce the complexity of the control design. Feedback is typically used in control systems to deal with uncertainties in the plant and the environment and to achieve robustness in stability and performance. If the control goals are demanding, that is the control specifications are tight, while the uncertainties are large, then fixed robust controllers may not be adequate. Adaptive control may be used in this case, where the new plant parameters are identified on line and this information is used to change the coefficients of the controller. The area is based on firm mathematical foundations, although in practice engineering skill and intuition are used to make the theoretical methods applicable to real practical systems, as is the case in many engineering disciplines.
Intelligent Autonomous Control Systems. In recent years it has become quite apparent that in order to achieve high autonomy in control systems, that is to be able to control effectively under significant uncertainties, even for example when certain types of failures occur (such as faults in the control surfaces of an aircraft), one needs to implement methods beyond conventional control methods. Decision mechanisms such as planning and expert systems are needed, together with learning mechanisms and sophisticated FDI (Failure Diagnosis and Identification) methods. One therefore needs to adopt an interdisciplinary approach involving concepts and methods from areas such as Computer Science and Operations Research in addition to Control Systems, and this leads to the area of Intelligent Autonomous Control Systems; see Antsaklis and Passino (1992a) and the references therein. A hierarchical functional intelligent controller architecture, as in Fig. 1, appears to offer advantages; note that in the figure the references to pilot, vehicle and environment etc. come from the fact that such a functional architecture refers to a high autonomy controller for future space vehicles, as described in Antsaklis and Passino (1992b) and Antsaklis, Passino and Wang (1991). A three level architecture in intelligent controllers is quite typical: the lower level is called the Execution level and this is where the numerical algorithms are implemented in hardware or software, that is, this is where conventional control systems reside; these are systems characterized by continuous states. The top level is the Management level where symbolic systems reside, which are systems with discrete states. The middle level is the Coordination level where both continuous and discrete state systems may be found. See Antsaklis and Passino (1992b) and the references therein for details.
[Figure 1 blocks include: Pilot and Crew / Ground Station / On-Board Systems; Decision Making, Learning, and Algorithms; Adaptive Control and Algorithms in Hardware and Software; Vehicle and Environment.]
Figure 1. A hierarchical functional architecture for the intelligent control of high autonomy systems
Neural Networks in Control Systems. At all levels of the intelligent controller architecture there appears to be room for potential applications of neural networks. Note that most of the uses of neural networks in control to date have been in the Execution and Coordination levels - they have been used mostly as plant models and as fixed and adaptive controllers. Below, in the rest of the Introduction, a brief summary of the research activities in the area of neural networks in control is given. One should keep in mind that this is a rapidly developing field. Additional information, beyond the scope of this contribution, can be found in Miller, Sutton and Werbos (1990), in Antsaklis (1990), Antsaklis (1992) and in Warwick (1992), which are good starting sources; see also Antsaklis and Sartori (1992).

It is of course well known to the readers that neural networks consist of many interconnected simple processing elements called units, which have multiple inputs and a single output. The inputs are weighted and added together. This sum is then passed through a nonlinearity called the activation function, such as a sigmoidal function like f(x) = 1/(1 + e^(-x)) or f(x) = tanh(x), or a Gaussian-type function, such as f(x) = exp(-x^2), or even a hard limiter or threshold function, such as f(x) = sign(x) for x ≠ 0. The terms artificial neural networks or connectionist models are typically used to describe these processing units and to distinguish them from biological networks of neurons found in living organisms. The processing units or neurons are interconnected, and the strengths of the interconnections are denoted by parameters called weights. These weights are adjusted, depending on the task at hand, to improve performance. They can be either assigned values via some prescribed off-line algorithm, remaining fixed during operation, or adjusted via a learning process on-line. Neural networks are classified by their network structure topology, by the type of processing elements used, and by the kind of learning rules implemented.

Several types of neural networks appear to offer promise for use in control systems. These include the multilayer neural network trained with the backpropagation algorithm commonly attributed to Rumelhart et al. (1986), the recurrent neural networks such as the feedback network of Hopfield (1982), the cerebellar model articulation controller (CMAC) model of Albus (1975), the content-addressable memory of Kohonen (1980), and the Gaussian node network of Moody and Darken (1989). The choice of which neural network to use and which training procedure to invoke is an important decision and varies depending on the intended application. The type of neural network most commonly used in control systems is the feedforward multilayer neural network, where no information is fed back during operation. There is however feedback information available during training. Supervised learning methods, where the neural network is trained to learn input/output patterns presented to it, are typically used. Most often, versions of the backpropagation algorithm are used to adjust the neural network weights during training. This is generally a slow and very time consuming process, as the algorithm usually takes a long time to converge. However, other optimization methods such as conjugate directions and quasi-Newton have also been implemented; see Hertz et al. (1991), Aleksander and Morton (1990).
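As an illustration of the unit computation just described, the following minimal sketch (in Python with NumPy, an assumption of this example rather than anything used in the chapter) evaluates a single processing unit: a weighted sum of the inputs passed through one of the activation functions mentioned above.

```python
import numpy as np

# Common activation functions mentioned in the text.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))      # f(x) = 1/(1 + e^(-x)), range (0, 1)

def gaussian(x):
    return np.exp(-x ** 2)               # Gaussian-type activation

def hard_limiter(x):
    return np.sign(x)                    # threshold function, f(x) = sign(x)

def unit_output(u, w, b, activation=np.tanh):
    """Single processing unit: weighted sum of inputs plus bias, then a nonlinearity."""
    return activation(np.dot(w, u) + b)

# Example: a unit with three inputs.
u = np.array([0.5, -1.0, 2.0])   # inputs
w = np.array([0.1, 0.4, -0.3])   # weights
print(unit_output(u, w, b=0.2, activation=sigmoid))
```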
Most often the individual neuron activation functions are sigmoidal functions, but signum or radial basis (Gaussian) functions are also used. Note
that in this work the emphasis is on multilayer neural networks. The reader should keep in mind that there are additional systems and control results involving recurrent networks, especially in system parameter identification; one should also mention the work on associative memories, which are useful in the higher levels of intelligent control systems.

One property of multilayer neural networks central to most applications to control is that of function approximation. Such networks can generate input/output maps which can approximate any continuous function with any desired accuracy. One may have to use a large number of neurons, but any desired approximation of a continuous function can be accomplished with a multilayer network with only one hidden layer of neurons, or two layers of neurons and weights; if the function has discontinuities, a two hidden layer network may be necessary - see below, Section 2.2. To avoid large numbers of processing units and the corresponding prohibitively large training times, a smaller number of hidden layer neurons is often used and the generalization properties of the neural network are utilized. Note that the number of inputs and outputs in the neural network are determined by the nature of the data presented to the neural network and the type of output desired from the neural network, respectively.

To model the input/output behavior of a dynamical system, the neural network is trained using input/output data and the weights of the neural network are adjusted, most often using the backpropagation algorithm. The objective is to minimize the output error (sum of squares) between the neural network output and the output of the dynamical system (output data) for a specified set of input patterns. Because the typical application involves nonlinear systems, the neural network is trained for particular classes of inputs and initial conditions. The underlying assumption is that the nonlinear static map generated by the neural network can adequately represent the system's behavior in the ranges of interest for the particular application. There is of course the question of how accurately a neural network, which realizes a static map, can represent the input/output behavior of a dynamical system. For this to be possible one must provide to the neural network information about the history of the system - typically delayed inputs and outputs. How much history is needed depends on the desired accuracy. There is a tradeoff between accuracy and computational complexity of training, since the number of inputs used affects the number of weights in the neural network and subsequently the training time. One sometimes starts with as many delayed signals as the order of the system and then modifies the network accordingly; it also appears that using a two hidden layer network - instead of a one hidden layer one - has certain computational advantages. The number of neurons in the hidden layer(s) is typically chosen based on empirical criteria, and one may iterate over a number of networks to determine a neural network that has a reasonable number of neurons and accomplishes the desired degree of approximation.

When a multilayer neural network is trained as a controller, either an open or closed loop controller, most of the issues are similar to the above. The difference is that the desired output of the neural network, that is the appropriate control input to the plant generated by the controller, is not readily available but has to be derived from the known desired plant output.
For this, one may use the mathematical model of the plant if available, or some approximation based
on certain knowledge of the process to be controlled; or one may use a neural model of the dynamics of the plant, or even of the dynamics of the inverse of the plant, if such models have been derived. Neural networks may be combined to both identify and control the plant, thus implementing an adaptive controller. In the above, the desired outputs of the neural networks are either known or they can be derived or approximated. Then, supervised learning via the backpropagation algorithm can be used to train the neural networks. Typical control problems which can be solved in this way are problems where a desired output is known. Such is the case in designing a controller to track a desired trajectory; the error then to be minimized is the sum of the squares of the errors between the actual and desired points along the trajectory. There are control problems where no desired trajectory is known but the objective is to minimize, say, the control energy needed to reach some goal state(s). This is an example of a problem where minimization over time is required and the effect of present actions on future consequences must be used to solve it. Two promising approaches for this type of problem are either constructing a model of the process and then using some type of backpropagation through time procedure, or using an adaptive critic and utilizing methods of reinforcement learning. These are discussed below.

Neural networks can also be used to detect and identify system failures, and to help store information for decision making, thus providing for example the knowledge to decide when to switch to a different controller among a finite number of controllers. In general there are potential applications of neural networks at all levels of hierarchical intelligent controllers that provide a higher degree of autonomy to systems. Neural networks are useful at the lowest Execution level, where the conventional control algorithms are implemented via hardware and software, through the Coordination level, to the highest Organization level, where decisions are being made based on possibly uncertain and/or incomplete information. One may point out that at the Execution level, the conventional control level, neural network properties such as the ability for function approximation and the potential for parallel implementation appear to be most relevant. In contrast, at higher levels, abilities such as pattern classification and the ability to store information in, say, an associative memory appear to be of most interest.

When neural networks are used in the control of systems it is important that results and claims are based on firm analytical foundations. This is especially important when these control systems are to be used in areas where the cost of failure is very high, for example when human life is threatened, as in aircraft, nuclear plants etc. It is also true that without a good theoretical framework it is unlikely that the area will progress very far, as intuitive invention and tricks cannot be counted on to provide good solutions to controlling complex systems under a high degree of uncertainty. The analytical heritage of the control field was in fact pioneered by the use of a differential equation model by J.C. Maxwell to study certain stability problems in Watt's flyball governor in 1868, and this was a case where the theoretical study provided the necessary knowledge to go beyond what the era of Intuitive Invention in control could provide.
In a control system which contains neural networks it is in general hard to guarantee typical control systems properties such as stability. The main
reason is the mathematical difficulties associated with the study of nonlinear systems controlled by highly nonlinear neural network controllers - note that the control of linear systems is well understood, and neural networks are typically used to control highly nonlinear systems. In view of the mathematical difficulties encountered in the past in the adaptive control of linear systems controlled by linear controllers, it is hardly surprising that the analytical study of nonlinear adaptive control using neural networks is a difficult problem indeed. Some progress has been made in this area and certain important theoretical results have begun to emerge, but clearly the overall area is still at its early stages of development.

In Section 2, the different approaches used in the modeling of dynamical systems are discussed. The function approximation properties of multilayer neural networks are discussed at length, radial basis networks and the Cerebellar Model Articulation Controller (CMAC) are introduced, and the modeling of the inverse dynamics of the plant, used in certain control methods, is also discussed. In Section 3, the use of neural networks as controllers in problems which can be solved by supervised learning is discussed; such a control problem would be, for example, following a given trajectory while minimizing some output error. In Section 4, control problems which involve minimization over time are of interest; an example would be minimizing the control energy to reach a goal state - there is no known desired trajectory in this case. Methods such as backpropagation through time and the adaptive critic with reinforcement learning are briefly discussed. Section 5 discusses other uses of neural networks, in the failure detection and identification (FDI) area and in higher level control. Sections 6 and 7 contain the concluding remarks and the references respectively.

2. MODELING OF DYNAMICAL SYSTEMS
2.1 Modeling the dynamics of the plant
In this approach, the neural network is trained to model the plant's behavior, as in Fig. 2. The input to the neural network is the same input used by the plant. The desired output of the neural network is the plant's output.
Figure 2. Modeling the plant's dynamics.

The signal e = y - ŷ from the summation in Fig. 2 is the error between the plant's output and the actual output of the neural network. The goal in training the neural network is to minimize this error. The method to accomplish this varies with the type of neural network used and the type of training algorithm chosen. In the figure, the use of the error to aid in the training of the neural network is denoted by the arrow passing through the neural network at an angle. Once the neural network has been successfully trained, it is actually an analytical model of the plant that can be further used to design a controller or to test various control techniques via simulation of this neural plant emulator. This type of approach is discussed in Section 3. In Fig. 2, the type of plant used is not restricted. The plant could be a very well behaved single-input single-output system, or it could be a nonlinear multi-input multi-output system with coupled equations. The actual plant or a digital computer simulation of the plant could be used. The plant may also operate in continuous or discrete time, although for training the neural network, discrete samples of the plant's inputs and outputs are often used. If the plant is time-varying, the neural network clearly needs to be updated online, and so the typical plant considered is time invariant, or if it is time varying it changes quite slowly. The type of information supplied to the neural network about the plant may vary. For instance, the current input, previous inputs, and previous outputs can be used as inputs to the neural network. This is illustrated in Fig. 3 for a plant operating in discrete time. The boxes with the "Δ" symbol indicate the time delay. The bold lines stress the fact that signals with varying amounts of delay can be used. The plant's states, derivatives of the plant's variables, or other measures can be used as the neural network's inputs. This type of configuration is conducive to training a neural network when the information available about the plant is in the form of an input-output table.
Figure 3. Modeling the discrete time plant's dynamics using delayed signals.

Training a neural network in this manner, by using input-output pairs, can be viewed as a form of pattern recognition, where the neural network is being trained to realize some (possibly unknown) relation between two sets. If a multi-layer neural network is used to model the plant via the configuration depicted in Fig. 3, a dynamic system identification can be performed with a
static model. The past history information needed to be able to model a dynamic system via a static model is provided by delayed input and output signals. If the backpropagation algorithm is used in conjunction with a multilayer neural network, considerations need to be made concerning which among the current and past values of the inputs and outputs to utilize in training the neural network; this is especially important when the identification is to be on line. In Narendra and Parthasarathy (1990) it is shown that when a series-parallel identification model is used (and the corresponding delayed signals), then the usual backpropagation algorithm can be employed to train the network; when however a parallel identification model is used, then a recurrent network results and some type of backpropagation through time, see Section 4, should be used. A moving window of width p time steps could be employed in which only the most recent values are used. An important question to be addressed here concerns the number of delays of previous inputs and outputs to be used as inputs to the neural network; most often the number of delays is taken to be equal to the order of the plant, at least initially. If there is some a priori knowledge of the plant's operation, this should be incorporated into the training. This knowledge can be embedded in a linear or nonlinear model of the plant, or incorporated via some other means; see Sartori and Antsaklis (1992a). A possible way of utilizing this information via a plant model is illustrated in Fig. 4; this can be viewed as modeling the unmodelled dynamics of the plant with a neural network.
Figure 4. Using a priori knowledge of the plant.

Modeling the plant's behavior via a multilayer sigmoidal neural network has been studied by a number of researchers; see among others Narendra and Parthasarathy (1990), Parthasarathy (1991), Bhat et al. (1990), Qin, Su and McAvoy (1992), Hou and Antsaklis (1992). In general, the results show that neural networks can be very good models of dynamical systems behavior. This
is of course true for stable plants, for certain ranges of inputs and initial conditions and for time invariant or slowly varying systems.
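To make the delayed-signal configuration of Fig. 3 concrete, the sketch below builds series-parallel regressors [y(k-1), ..., y(k-n_y), u(k-1), ..., u(k-n_u)] from recorded input/output data of a simple simulated discrete-time plant and fits a one-hidden-layer network to them by gradient descent on the sum-of-squares output error. It is a minimal illustration in Python/NumPy under assumed choices (the toy plant, two delayed outputs and one delayed input, ten hidden units, a fixed step size), not the procedure of any particular reference.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Toy stable nonlinear discrete-time plant (an assumption for illustration) ---
def plant_step(y_prev, y_prev2, u_prev):
    return 0.5 * y_prev - 0.1 * y_prev2 + 0.3 * np.tanh(u_prev)

# Record input/output data for an excitation signal.
N = 500
u = rng.uniform(-1.0, 1.0, N)
y = np.zeros(N)
for k in range(2, N):
    y[k] = plant_step(y[k-1], y[k-2], u[k-1])

# --- Series-parallel regressors: delayed plant outputs and inputs as network inputs ---
n_y, n_u = 2, 1                       # number of delayed outputs / inputs (taken ~ plant order)
X, d = [], []
for k in range(max(n_y, n_u), N):
    X.append(np.r_[y[k-n_y:k][::-1], u[k-n_u:k][::-1]])   # [y(k-1), y(k-2), u(k-1)]
    d.append(y[k])                                         # desired output: plant output y(k)
X, d = np.array(X), np.array(d)

# --- One-hidden-layer sigmoidal network trained by gradient descent (backpropagation) ---
n_in, n_hid = X.shape[1], 10
W1 = 0.5 * rng.standard_normal((n_hid, n_in)); b1 = np.zeros(n_hid)
w2 = 0.5 * rng.standard_normal(n_hid);          b2 = 0.0

def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))

eta = 0.05
for epoch in range(200):
    h = sigmoid(X @ W1.T + b1)          # hidden layer activations
    yhat = h @ w2 + b2                  # linear output unit
    e = yhat - d                        # output error
    # Gradients of the mean of 0.5*e^2 with respect to the weights.
    g_w2 = h.T @ e / len(d); g_b2 = e.mean()
    delta = np.outer(e, w2) * h * (1 - h)
    g_W1 = delta.T @ X / len(d); g_b1 = delta.mean(axis=0)
    W1 -= eta * g_W1; b1 -= eta * g_b1; w2 -= eta * g_w2; b2 -= eta * g_b2

h = sigmoid(X @ W1.T + b1)
print("mean squared modeling error:", np.mean((h @ w2 + b2 - d) ** 2))
```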
2.2 Function approximation
Neural networks are useful as models of dynamical systems because of their ability to be universal function approximators. In particular, it turns out that feedforward neural nets can approximate arbitrarily well any continuous function; this in fact can be accomplished using a feedforward network with a single hidden layer of neurons and a linear output unit. More specifically, consider the nonlinear map g: R^m → R where

    g(u) = Σ_{i=1}^{n} w_{i1} S(w^i u)                                   (1)

with u ∈ R^m the input vector, w^i = [w^i_1, ..., w^i_m] the input layer weights and w_{i1} the output layer (first layer) weights of unit i. S(·) is the unit activation function. Such maps are generated by networks with a single hidden layer and a linear output unit. Consider now S(x) to be a sigmoid function, defined as a function for which lim_{x→+∞} S(x) = 1 and lim_{x→−∞} S(x) = 0. Hornik et al. (1989) have shown that when the activation sigmoid function S(x) is non-decreasing, the above net can approximate an arbitrary continuous function f(u) uniformly on compact sets; in fact it is shown that the set of functions g above is dense in the set of f, the continuous functions mapping elements of compact sets in R^m into the real line. Cybenko (1989) extended this result to continuous activation sigmoid functions S(x); Jones (1990) showed that the result is still valid when S(x) is bounded. Typical proofs of this important approximation result are based on the Stone-Weierstrass theorem and use approximations of the function via trigonometric functions, which in turn are approximated by sums of sigmoidal functions. Similar results have appeared in Funahashi (1989), among others. An important point is that the exact form of the activation function S(x) is not important in the proof of this result; however, it may affect the number of neurons needed for a desired accuracy of approximation of a given function f. These results show that typically one increases accuracy by adding more hidden units; one then stops when the desired accuracy has been achieved. It should be noted that for a finite number of hidden units, depending on the function, significant errors may occur; this reminds us of the Gibbs phenomenon in Fourier series.

How many hidden units does one then need in the hidden layer? This is a difficult question to answer and it has attracted much attention. Several authors have shown (with different degrees of ease and generality) that p-1 neurons in the hidden layer suffice to store p arbitrary patterns in the network; see Nilsson (1965), and Baum (1988), Sartori and Antsaklis (1991) for constructive proofs. The answer to the original question also depends of course on the kind of generalization achieved by the network; also note that certain sets of p patterns may be realizable by fewer than p-1 hidden neurons. The question of adequate approximation becomes more complicated in control applications where the functions which must be approximated by the neural network may be functions that generate the control signals, the range and the shape of which may not be known in advance. Because of this, the guidelines to select the appropriate number of hidden neurons are rather empirical at the moment.
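The following short sketch illustrates the flavour of the approximation result (1): a single hidden layer of sigmoidal units with a linear output unit is fitted to a continuous target function, and the achievable error typically shrinks as hidden units are added. For simplicity the hidden-layer weights are drawn at random and only the output weights are fitted by least squares (which the form (1) permits, since g is linear in the w_{i1}); this is one convenient way to realize such a map, assumed here for illustration rather than taken from the text.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Continuous target function on a compact set [-2, 2] (an arbitrary example).
u = np.linspace(-2.0, 2.0, 400)
f = np.sin(3.0 * u) + 0.5 * u

for n_hidden in (2, 5, 20, 80):
    # Random input-layer weights and biases; S(w_i * u + b_i) are the hidden unit outputs.
    w = rng.uniform(-4.0, 4.0, n_hidden)
    b = rng.uniform(-4.0, 4.0, n_hidden)
    H = sigmoid(np.outer(u, w) + b)              # columns: hidden unit responses
    # Output-layer weights w_{i1} solved by linear least squares (g is linear in them).
    w_out, *_ = np.linalg.lstsq(H, f, rcond=None)
    g = H @ w_out
    print(f"{n_hidden:3d} hidden units: max |f - g| = {np.max(np.abs(f - g)):.4f}")
```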
The above discussion dealt with continuous functions f, which can be approximated by a neural network with one hidden layer of neurons. When the function under consideration has discontinuities, then two hidden layers may have to be used; see Cybenko (1988), Chester (1990), Sontag (1990). In control considerations, Sontag has pointed out that one may need a two hidden layer neural network to stabilize certain plants. It is true of course that a two hidden layer network can approximate any continuous function as well. In addition, experimental evidence tends to show that using a two hidden layer network has advantages over a one hidden layer network, as it requires shorter training time and overall fewer weights. Because of this, a two hidden layer network is many times the network of choice. There are many other issues relevant to function approximation and control applications, such as issues of network generalization, input representation and preprocessing, optimal network architectures, methods to generate networks, and methods of pruning and weight decay. These topics are of great interest to the area of neural networks at large, and a number of them are currently attracting significant research efforts. We are not directly addressing these topics here; the interested reader should consult the vast literature on the subject.
2.3 Radial basis networks
To approximate desired functions, networks involving activation functions other than sigmoids can be used. Consider again a feedforward neural network with one hidden layer and a linear output unit. Assume that the hidden neurons have radial basis functions as activation functions, in which case the neural network implements the nonlinear map g: R^m → R where

    g(u) = Σ_{i=1}^{n} w_{i1} G(||u - c_i||)                              (2)

with u ∈ R^m the input vector and w_{i1} the output layer (first layer) weights of unit i. G(·) is a radially symmetric activation function, typically the Gaussian function

    G(||u - c_i||) = exp(-s_i ||u - c_i||^2)                              (3)

where s_i = 1/σ_i^2. The vectors c_i, i = 1, ..., n, are the centers of the Gaussian functions, and if for a particular value of the input u = c_i, then the ith unit gives an output of +1. The deviation σ_i controls the width of the Gaussian, and for large ||u - c_i||, more than 3σ_i, the output of the neuron is negligible; in this way, practically only inputs in the locality of the center of the Gaussian contribute to the neuron output. It is known from Approximation Theory, see for example Poggio and Girosi (1990), that radially symmetric functions, such as g above, can also approximate arbitrary continuous functions on compact sets. The issue now is how to select the weights w_{i1}, centers c_i and deviations s_i to achieve some desired approximation. For a large number of points, methods from pattern recognition can be used to determine c_i and s_i; in Moody and Darken (1989) k-means clustering algorithms are used for the centers and P nearest neighbor algorithms for the deviations. For a small number of points the centers and deviations can be empirically selected to satisfy the problem requirements, Parthasarathy (1991); an algorithm is given in Sartori and Antsaklis (1991b) to satisfy certain constraints between given points (see also Sartori and Antsaklis (1992) for similar algorithms but for networks involving sigmoids). After the centers and the deviations have been decided upon, the problem is linear with respect to the output weights w_{i1}, as is clear from the expression for g above. This can be used to advantage, as now linear system identification results may be used to derive provably stable adaptive control systems; this line of inquiry has been pursued successfully in Parthasarathy (1991) and elsewhere.
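Since (2) is linear in the output weights once the centers and widths are fixed, those weights can be obtained by ordinary least squares. The sketch below illustrates this for a one-dimensional input, with centers simply placed on a uniform grid and a common width (assumed choices for illustration; the text mentions k-means and nearest-neighbour rules as alternatives).

```python
import numpy as np

# Target function to be approximated on a compact set (an arbitrary example).
u = np.linspace(0.0, 1.0, 200)
f = np.sin(2 * np.pi * u) * np.exp(-u)

# Fixed radial basis centers on a grid and a common deviation sigma.
n = 12
centers = np.linspace(0.0, 1.0, n)
sigma = 0.1
s = 1.0 / sigma**2

# Design matrix of Gaussian responses G(||u - c_i||) = exp(-s ||u - c_i||^2).
Phi = np.exp(-s * (u[:, None] - centers[None, :]) ** 2)

# Output weights w_{i1}: a linear least-squares problem (the map (2) is linear in them).
w, *_ = np.linalg.lstsq(Phi, f, rcond=None)

g = Phi @ w
print("max approximation error:", np.max(np.abs(f - g)))
```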
2.4 CMAC
Albus (1975) introduced the Cerebellar Model Articulation Controller (CMAC), which is briefly described below. CMAC networks have been gaining popularity in control applications, especially in robotics, but also in signal, speech and image processing; see Miller, Glanz and Kraft (1990), Kraft and Campagna (1990). The main reasons for the appeal of the CMAC are the speed of training and the fact that it can be implemented in hardware quite easily using logic cell arrays; note also that CMAC accepts real numbers as input and also outputs real numbers, and that it exhibits local generalization properties. However, the size of the required memory, which could be in the thousands, and the collisions caused by hash coding may cause difficulties. The input to CMAC is quantized and mapped to a number (C) of consecutive association cells; C may be for example 32, 256 or larger. There is significant overlap between the association cells excited by different input levels; that is, each association cell, also called a state-space detector, is excited by a number of input levels. Overlap determines generalization. Each association cell is assigned an address which is then mapped via some hash coding to a location in a memory where the weights are stored. Collisions occur when two different association cells are mapped to the same location. The weights of all active memory locations are then summed to produce the output. That is,

    y^j = Σ_i w_i x_i^j                                                   (4)

where x_i^j is 1 for i active and 0 otherwise, w_i is the weight in the ith memory location, and the superscript j denotes the jth input pattern or level. In CMAC networks, the training algorithms determine the weights; all other parameters, C, overlap, size of memory etc., are design parameters fixed in advance. To determine the weights, supervised learning methods, typically the least mean square (LMS) training rule, are used. Each training input u^j produces a vector x^j of 1s and 0s which indicate the memory locations that have been excited; this is determined by the fixed interconnections. Then the weights are updated by the LMS rule
    Δw = η (d^j - w'x^j) x^j                                              (5)
where d^j is the desired output. For convergence results see Wong and Sideris (1992). CMAC networks are being used in control systems to implement adaptive controllers; the speed of convergence of the training algorithm allows the use of CMAC on line for adaptive control. They are also being used to model dynamical systems, to implement fuzzy controllers, etc. One should expect to see more applications of CMAC in the future.
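As a rough illustration of the mechanism just described, the sketch below implements a one-dimensional CMAC with C overlapping association cells per input level, hash coding into a fixed-size weight memory, and the LMS update (5). All of the specific choices (quantization step, C = 32, memory size, learning rate, the target function) are assumptions made for the example, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(2)

C = 32            # number of association cells excited by each input level
MEM = 2048        # size of the weight memory addressed through hash coding
RES = 0.005       # input quantization resolution
eta = 0.1         # LMS learning rate

weights = np.zeros(MEM)

def active_cells(u):
    """Quantize the input and return the memory locations of the C excited cells."""
    q = int(np.floor(u / RES))                   # quantized input level
    cells = range(q, q + C)                      # C consecutive association cells; the overlap
                                                 # between neighbouring levels gives CMAC its
                                                 # local generalization
    # Simple multiplicative hash from cell address to memory location
    # (collisions are possible in general).
    return [(c * 2654435761) % MEM for c in cells]

def cmac_output(u):
    return weights[active_cells(u)].sum()        # summed active weights, as in eq. (4)

def cmac_train(u, d):
    idx = active_cells(u)
    e = d - weights[idx].sum()                   # d^j - w'x^j
    weights[idx] += eta * e / C                  # LMS update, eq. (5), spread over the C cells

# Train the CMAC to reproduce an example nonlinear input/output relation on [0, 1).
target = lambda u: np.sin(2 * np.pi * u)
for _ in range(30000):
    u = rng.uniform(0.0, 1.0)
    cmac_train(u, target(u))

for u in (0.1, 0.25, 0.5, 0.9):
    print(f"u = {u:.2f}: CMAC {cmac_output(u):+.3f}, target {target(u):+.3f}")
```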
2.5 Models of inverse of the plant
Instead of training a neural network to identify the forward dynamics of the plant, as discussed in Section 2.1, a neural network can be trained to identify the inverse dynamics of the plant, as illustrated in Fig. 5. This type of model is useful in certain control approaches, as discussed in the next section. The neural network's input is the plant's output, and the desired neural network output is the plant's input.

Figure 5. Modeling the plant's inverse dynamics.

The error e = u - û is to be minimized and can be used to train the neural network. The type of neural network and the training algorithm used are not restricted. The plant can be continuous or discrete and can also be single-input single-output or multi-input multi-output. The desired output of the neural network is the current input to the plant. The type of information used by the neural network to model the inverse dynamics of the plant may vary. For instance, the neural network's inputs may contain the current and previous outputs and the previous inputs of the discrete time plant, as illustrated for the neural network plant emulator in Fig. 3. In addition, other signals, such as the plant's states, derivatives of the plant's variables, or other measures can also be used as inputs to the neural network. When modeling the inverse dynamics of the plant with a neural network, the assumption is being made, either implicitly or explicitly, that the neural network can approximate well the inverse of the plant. This of course means that the inverse exists and is unique; if it is not unique then care should be taken with the ranges of the inputs to the network. It also means that the inverse is stable.
3. CONVENTIONAL CONTROLLERS AND SUPERVISED LEARNING
A neural network can also be used as a conventional controller in both open loop and closed loop configurations. Because of its ability to adjust its parameters via training, it can in principle implement adaptive control laws. For on line implementation the speed of convergence of the training algorithm is of course of great importance. Its use as an open loop controller is first examined.

Open loop control. There are cases in control where the plant is stable and the uncertainty in the plant parameters and the external disturbances is negligible and can be ignored. Then open loop control, also called feedforward control, may be appropriate, and neural networks can then be used in that capacity. In this case one may try to approximate the inverse of the plant, if this is possible, aiming to achieve the so called ideal control situation. Note that in the training of the neural network to model the inverse dynamics of the plant in the previous section, the purpose often is to use the trained neural network exactly as a feedforward controller for the plant in an open loop configuration. The desired output of the plant is the input to the neural network controller. This method is quite popular among researchers attempting to apply neural networks to the control of robot arms. Instead of training the neural network to model the inverse dynamics of the plant, as in Fig. 5, the neural network can be trained directly as an open loop controller. This is shown in Fig. 6.

Figure 6. Error back-propagated through the plant for an open loop controller.

The error e = y_d - y is used to train the neural network. In this configuration, there does not exist a desired output for the neural network, and the error information must be "back-propagated" through the plant to account for this. The arrow passing through both the plant and the controller represents the back-propagated error. Note that this situation always arises in both open and closed loop control whenever supervised learning is used; the desired neural network output is not known and must be determined in order to use, say, the backpropagation algorithm. For the backpropagation algorithm the first derivative of the plant output with respect to the input to the plant is computed. This can be approximated by slightly changing the input to the plant and determining the plant's new output. In using this approach to train the neural network, the plant can be thought of as being the "output layer" of the multilayer neural network. Similarly, if the mathematical model of the plant is available then these derivatives may be calculated directly. Alternatively, a
neural network model of the plant can also be used. If a multi-layer neural network is trained to emulate the plant as described in Sect. 2.1, the error can be back-propagated through the neural network model quite easily. If of course the model of the inverse of the plant is available then one can propagate the error forward in the neural network model.

Closed loop control. The use of a neural network in a closed loop configuration as a simple unity or error feedback controller is illustrated in Fig. 7.
Figure 7. Neural network controller in a conventional closed loop configuration.

Note that the feedback is not restricted to output feedback; state feedback could also be used, and the desired output y_d can be zero as in a regulator loop. As previously discussed, signals other than e can also be used as inputs to the neural network. Note that the desired output of the neural network must be determined, from the known desired plant output, before any supervised learning algorithm such as the backpropagation algorithm can be used. Methods to accomplish this were discussed in the open loop control case above. Once trained, the neural network controller can be updated on-line to cope with unforeseen situations or with a slowly time-varying plant. The neural network can also be trained to be a part of an existing control structure; for instance, as part of an internal model control scheme or in conjunction with a PID controller. For this application, the neural network is trained to perform an operation or to augment the operation of an existing controller. The issues involved however are the same as in the case of the simple unity feedback controller shown in Fig. 7. As another application of a neural network controller, the neural network can be trained to mimic an existing control law. This is illustrated in Fig. 8.
Figure 8. Neural network trained to mimic existing controller.
This use for a neural network is plausible if the controller currently in use is, for example, too expensive or unreliable. In addition, after the neural network has replaced the current controller, it can be adjusted via training, thus taking into account variations in the plant and the environment. For training, the neural network is given the same inputs as the current controller, and the desired neural network output is the output of the current controller. This use of a neural network is equivalent to the design and use of an expert system to capture the reasoning process of an expert. Note that the approach described in Fig. 8 is quite versatile. If a controller is designed to use signals from the plant which are not available or too expensive to measure or compute, the controller cannot be directly implemented; however, if a neural network can be trained to emulate the already designed controller using different signals, the neural network can then be used as the actual controller for the plant. It is also possible to use a neural controller in parallel with an already existing controller to enhance its operation, as was mentioned above, when the control action has deteriorated because of aging or damage to the plant.

Model reference control. Besides training a neural network to mimic a control law, the neural network controller can also be trained to reduce the error at the output of the plant compared to a reference model, as illustrated in Fig. 9.
Figure 9. Model reference control. Error propagated through plant.

Here again the desired neural controller output must be determined in order to use the backpropagation algorithm, that is, the error must be backpropagated through the plant and then used to adjust the weights of the neural network controller. As discussed previously, the gradient of the error can be approximated by varying the plant's inputs and measuring the resulting outputs, or be calculated from the mathematical model of the plant if available. Alternatively, if the plant is first modelled with a multi-layer neural network as discussed in Sect. 2.1, this neural network can be used to replace the plant for the training of the neural network controller, and the back-propagation of the error through the plant emulator can then be accomplished; a similar method is used if a neural model of the inverse of the plant is available. It is clear that neural networks can be used in model reference control configurations other than the one shown in Fig. 9. The issues are the same and the difficulties and problems encountered are similar. As a side comment,
by using a sigmoidal nonlinearity at the output of the neural network, the saturation property of most actuators that follow the controller can be accommodated, since the range of the sigmoid function is from -1 to +1 (tanh(x)) and can be appropriately shifted and scaled. There are many references on successful neural controllers; see for example Antsaklis (1990 and 1992), Chen (1990), Hou and Antsaklis (1992), Kana and Guez (1989), Kraft and Campagna (1990), Miller, Sutton and Werbos (1990), Narendra and Parthasarathy (1990), Ungar, Powell and Kamens (1990), Ydstie (1990).
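The training schemes of Figs. 6, 7 and 9 all hinge on obtaining the derivative of the plant output with respect to the plant input, so that the output error can be passed back to the controller weights. The sketch below illustrates the simplest variant mentioned in the text, estimating that derivative by slightly perturbing the plant input, for a one-step tracking task; the toy plant, the small controller network and all step sizes are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy static nonlinear plant y = P(u) (an assumption for illustration).
def plant(u):
    return np.tanh(1.5 * u) + 0.1 * u

# Small one-hidden-layer controller: u = c(yd), mapping desired output to control.
n_hid = 8
W1 = 0.5 * rng.standard_normal(n_hid); b1 = np.zeros(n_hid)
w2 = 0.5 * rng.standard_normal(n_hid); b2 = 0.0

def controller(yd):
    h = np.tanh(W1 * yd + b1)
    return h @ w2 + b2, h

eta, delta = 0.05, 1e-4
for it in range(5000):
    yd = rng.uniform(-0.8, 0.8)                 # desired plant output
    u, h = controller(yd)
    y = plant(u)
    e = y - yd                                  # tracking error at the plant output
    # Derivative of the plant output w.r.t. its input, approximated by perturbing the input.
    dy_du = (plant(u + delta) - plant(u - delta)) / (2 * delta)
    # Error "back-propagated through the plant": gradient of 0.5*e^2 w.r.t. the control u.
    g_u = e * dy_du
    # Backpropagate g_u through the controller network.
    g_w2 = g_u * h;            g_b2 = g_u
    g_pre = g_u * w2 * (1 - h ** 2)             # through the tanh hidden units
    g_W1 = g_pre * yd;         g_b1 = g_pre
    W1 -= eta * g_W1; b1 -= eta * g_b1; w2 -= eta * g_w2; b2 -= eta * g_b2

for yd in (-0.5, 0.0, 0.5):
    u, _ = controller(yd)
    print(f"yd = {yd:+.2f}: plant output {plant(u):+.3f}")
```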
4. CONTROL OVER TIME AND REINFORCEMENT LEARNING
In all the control problems considered in the previous section a desired trajectory of the plant output was known. This meant that the desired outputs of the neural networks were either known or they could be derived or approximated, and supervised learning via the backpropagation algorithm could be used to train the neural networks. Typical control problems of this type are the regulation and tracking problems, where the plant output must follow a given trajectory. When the desired plant trajectories are not known, then supervised learning methods cannot be used to train the neural networks. This is the case for example when the control objective is to minimize the control energy needed to reach some goal state(s). This is an example of a problem where minimization over time is required and the effect of present actions on future consequences must be determined to solve it. Two approaches may be used to address this type of problem: either constructing a model of the process and then using some type of backpropagation through time (BTT) procedure, or using an adaptive critic and utilizing methods of reinforcement learning.

Backpropagation through time (BTT). The basic backpropagation algorithm is an efficient way to calculate the derivatives of the (output) error with respect to a large number of input variables in a feedforward network. BTT extends this method to recurrent networks; see Werbos (1990). BTT is in effect a computationally more efficient first-order calculus of variations. This method uses noise free models to calculate the derivatives of future utility with respect to present actions. BTT can be derived from the basic backpropagation algorithm by unfolding an arbitrary recurrent net into a multilayer feedforward net that grows by one layer at each time step; the algorithm takes its name from the fact that the computation of the error derivatives is based on information propagating from later times to earlier times in the recurrent network. This method is difficult to use in its general form; however, specific successful applications have been reported, see for example the robot arm applications of Kawato (1990) and the truck backer-upper of Nguyen and Widrow (1990).

Adaptive critic. This consists of two networks, the action network, which is the neural network controller of Fig. 10 below, and the critic network, which may or may not be a neural network. It approximates the dynamic programming solution to the problem and it performs well in noisy environments and with inexact models; see Barto (1990). The critic guides how the controller is
adapted; it generates a performance index J, and the controller is rewarded when the control u leads to larger J and punished when u leads to smaller J in the next time step. The neural controller needs the gradient of the critic's evaluation J with respect to the control signals. One could use a neural network model of the plant together with the critic for the derivatives. If however only values of performance J but not derivatives are available, as is the case when a model of the evaluation process is not available, the derivatives must be estimated. In this case reinforcement learning could be used to advantage. In reinforcement learning, contrary to supervised learning, there is no knowledge of the desired output and the learning system receives only a measure of performance; see Barto (1990) and Sutton, Barto, and Williams (1992). Note that when supervised learning can be used, it is much more efficient than reinforcement learning.
Figure 10. Adaptive critic used to adjust neural network controller.

To design the critic one must be able to determine the current performance, so as to reward or punish the controller, based on properties of future plant behavior. Optimal control requires an accurate plant model and large amounts of computation, or mathematical tractability of the model and of the objective function to obtain an analytical solution, and this is not usually possible. In the 70s the problem of subgoal performance measures was studied, while in the AI literature progress in the problem of credit-assignment for reinforcement learning has been reported; these are problems related to the design of the critic network, and along similar lines of inquiry progress has been made - see Barto (1990) for a comprehensive discussion. One could of course develop a model of the plant and then use conventional dynamic programming to solve these minimization over time control problems; or one could develop a space/time model and use BTT. These model based methods however are not necessarily effective with inaccurate models and noisy environments. Most often one needs some exploratory search in control space, search which one typically associates with reinforcement learning methods. So the adaptive critic and reinforcement learning methods show great promise; additional progress needs to be made before it is clearly demonstrated that this is the best way to address these optimization over time control problems.
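As a minimal flavour of how a critic can learn an evaluation from a scalar measure of performance only, the sketch below applies a tabular temporal-difference update to estimate the cost-to-go of a toy one-dimensional reach-the-goal task under a fixed policy; the task, discretization and learning constants are assumptions for illustration, and a full adaptive critic would additionally adjust the action network using this evaluation.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy task: discrete position 0..N, goal at N; each step costs the control energy used.
N = 20
gamma = 1.0          # no discounting; V(s) estimates total remaining cost from state s
alpha = 0.1          # learning rate of the critic
V = np.zeros(N + 1)  # tabular critic: estimated cost-to-go for each state

def step(s, u):
    """Apply control u (move right by about u cells, noisy); cost is quadratic in u."""
    move = int(round(u + rng.normal(0, 0.3)))
    s_next = int(np.clip(s + max(move, 0), 0, N))
    cost = 0.1 * u ** 2
    return s_next, cost

def policy(s):
    return 2          # fixed action (always push with u = 2); only the critic is learned here

for episode in range(2000):
    s = 0
    while s < N:
        u = policy(s)
        s_next, cost = step(s, u)
        target = cost + (gamma * V[s_next] if s_next < N else 0.0)
        V[s] += alpha * (target - V[s])        # temporal-difference update of the evaluation
        s = s_next

print("estimated cost-to-go from selected states:")
for s in (0, 5, 10, 15, 19):
    print(f"  state {s:2d}: V = {V[s]:.2f}")
```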
5. FDI AND HIGHER LEVEL DECISION MAKING
Neural networks discussed in this section are used in the control of the plant, but are not actually in the conventional control loop per se. As described in Antsaklis and Passino (1992b), see also the Introduction above, they perform some higher level decision making tasks in a manner which adds more autonomy to the system. This configuration is illustrated in Fig. 11. The neural network becomes a high level decision maker and is not directly involved in determining the input to the plant. Instead, the neural network supplies to the controller information to properly form control signals for the plant. This may require adapting the controller based on the information provided by the neural network, and in that sense there are similarities between this method and the adaptive critic discussed above when the critic is implemented as a neural network. In Fig. 11, the neural network's inputs are the desired output of the plant, and the actual input and output of the plant. As previously discussed, this can be extended to include other signals as well. This configuration also does not preclude the use of a reference model. The output of the neural network is a signal that is useful for the control of the plant.
Figure 11. Neural network used as a high level decision maker.

Two uses are discussed here: changing the control parameters and supplying failure information. In the first, the neural network is used to determine appropriate values for parameters in the controller. For example, for a PID controller, the neural network can be trained to determine values for the gains based on the operating conditions of the plant, thus providing parameter tuning. The neural network could also be used as a scheduler; see Sartori and Antsaklis (1991b). Given the current operating point of the plant, the neural network decides which control law to use. Depending on the choice of the neural network implementation, the neural network scheduler may give rise to quite smooth control law switching. As another option, the neural network can be used as an optimizer to find the minimum of a cost function, such as in a linear quadratic optimal controller. The output of the neural network is the value of the controller parameters that minimize that cost function. This would be an alternative to solving a Riccati equation at each time step.
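A small sketch of the first use, parameter tuning or scheduling: a table of controller gains designed at a few operating points (an assumed example) is used to train a network that then interpolates the gains between those points, which is one way such a scheduler could provide the smooth control-law switching mentioned above.

```python
import numpy as np

rng = np.random.default_rng(5)

# Assumed design table: PID gains (Kp, Ki, Kd) designed at a few plant operating points.
op_points = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
gains = np.array([[2.0, 0.5, 0.10],
                  [2.5, 0.6, 0.12],
                  [3.5, 0.9, 0.15],
                  [3.0, 0.7, 0.13],
                  [2.2, 0.5, 0.10]])

# One-hidden-layer network mapping operating point -> gains, fitted by least squares
# on random sigmoidal features (a simple stand-in for backpropagation training).
n_hid = 25
w_in = rng.uniform(-5.0, 5.0, n_hid)
b_in = rng.uniform(-5.0, 5.0, n_hid)

def features(x):
    return 1.0 / (1.0 + np.exp(-(np.outer(x, w_in) + b_in)))

W_out, *_ = np.linalg.lstsq(features(op_points), gains, rcond=None)

def scheduled_gains(x):
    """Return interpolated (Kp, Ki, Kd) for operating point x."""
    return features(np.atleast_1d(x)) @ W_out

for x in (0.1, 0.4, 0.6):
    print(f"operating point {x:.2f}: gains {np.round(scheduled_gains(x)[0], 3)}")
```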
Besides being used to determine parameters for the controller, the neural network can also be trained to supply failure information to the controller. Depending on the type of training data used, the type of information can vary from fault detection to fault identification to fault diagnosis. The controller can then use this knowledge to take appropriate actions. A neural network trained for fault detection and identification can even be used in conjunction with another neural network trained to choose the appropriate control parameters given specific failures of the system. There are several references which address FDI problems. See for example Naidu, Zafiriou and McAvoy (1990), Sartori and Antsaklis (1992b), Yao and Zafiriou (1990) and the references therein.

6. CONCLUSION
The ever-increasing demands of the complex control systems being built today and planned for the future dictate the use of novel and more powerful methods in control. The potential for neural networks in solving many of the problems involved is great, and this research area is evolving rapidly. The viewpoint is taken that conventional control theory should be augmented with neural networks in order to enhance the performance of the control systems. In the tradition of the Systems and Control area, such developments in neural networks for control should be based on firm theoretical foundations, and this effort is still at its early stages. Strong theoretical results guaranteeing control system properties such as stability are still to come, although promising progress in special cases has been reported recently. The potential of neural networks in control systems clearly needs to be further explored, and both theory and applications need to be further developed.
Acknowledgment
This work was completed while on sabbatical leave of absence at Imperial College in London, England and at the National Technical University of Athens, Greece, and I would like to thank these fine universities for their hospitality.

7. REFERENCES
Albus J.S. (1975), "A New Approach to Manipulator Control: The Cerebellar Model Articulation Controller (CMAC)," Transactions of the ASME, Journal of Dynamic Systems, Measurement, and Control, vol. 97, series G, no. 3, pp. 220-227, Sept. 1975.
Aleksander I., Morton H.B. (1990), An Introduction to Neural Computing, Chapman and Hall, UK, 1990.
Anderson C. (1990), "Learning to Control an Inverted Pendulum Using Neural Networks," IEEE Control Systems Magazine, vol. 10, no. 3, pp. 31-37, April 1990.
Antsaklis P.J. (1990), Editor, Special Issue on "Neural Networks in Control Systems," IEEE Control Systems Magazine, vol. 10, no. 3, pp. 3-87, April 1990.
Antsaklis P.J. (1992), Editor, Special Issue on "Neural Networks in Control Systems," IEEE Control Systems Magazine, vol. 12, no. 3, April 1992.
Antsaklis P.J., Passino K.M. (1992a), Editors, Introduction to Intelligent and Autonomous Control, Kluwer, 1992.
Antsaklis P.J., Passino K.M. (1992b), "Introduction to Intelligent Control Systems with High Degree of Autonomy," Introduction to Intelligent and Autonomous Control, P.J. Antsaklis and K.M. Passino, Eds., Chapter 1, Kluwer, 1992.
Antsaklis P.J., Passino K.M., Wang S.J. (1991), "An Introduction to Autonomous Control Systems," IEEE Control Systems Magazine, vol. 11, no. 4, pp. 5-13, June 1991.
Antsaklis P.J., Sartori M.A. (1992), "Neural Networks in Control Systems," Encyclopedia of Systems and Control, Supplementary Volume II, M.G. Singh Editor-in-Chief. To appear.
Barto A.G. (1990), "Connectionist Learning for Control: An Overview," Neural Networks for Control, Miller, Sutton and Werbos, Editors, MIT Press, 1990.
Baum E.B. (1988), "On the Capabilities of Multilayer Perceptrons," J. Complexity, vol. 4, pp. 193-215, 1988.
Bhat N.V., Minderman P., McAvoy, Wang N. (1990), "Modeling Chemical Process Systems via Neural Computation," IEEE Control Systems Magazine, vol. 10, no. 3, April 1990.
Chen F-C. (1990), "Back-Propagation Neural Networks for Nonlinear Self-Tuning Adaptive Control," IEEE Control Systems Magazine, vol. 10, no. 3, pp. 44-48, April 1990.
Chester D. (1990), "Why Two Hidden Layers are Better than One," Proc. Int. Joint Conf. on Neural Networks, IEEE Publications, pp. I.265-268, 1990.
Cybenko G. (1988), "Continuous Valued Neural Networks with Two Hidden Layers are Sufficient," Tech. Report, Dept. of Computer Science, Tufts University, 1988.
Cybenko G. (1989), "Approximations by Superpositions of a Sigmoidal Function," Math. of Control, Signals, and Systems, vol. 2, no. 4, pp. 303-314, 1989.
Funahashi K. (1989), "On the Approximate Realization of Continuous Mappings by Neural Networks," Neural Networks, vol. 2, pp. 183-192, 1989; also in Proc. Int. Joint Conf. on Neural Networks, IEEE Publications, pp. I.641-648, 1988.
Hecht-Nielsen R. (1990), Neurocomputing, Addison-Wesley, Reading, MA, 1990.
Hertz J., Krogh A., Palmer R.G. (1991), Introduction to the Theory of Neural Computation, Addison-Wesley.
Hopfield J.J. (1982), "Neural Networks and Physical Systems with Emergent Collective Computational Abilities," Proceedings of the National Academy of Sciences, U.S.A., vol. 79, pp. 2554-2558, April 1982.
Hornik K.M., Stinchcombe M., White H. (1989), "Multilayer Feedforward Networks are Universal Approximators," Neural Networks, vol. 2, pp. 359-366, 1989.
Hoskins J.C., Himmelblau D.M. (1988), "Artificial Neural Network Models of Knowledge Representation in Chemical Engineering," Computers and Chemical Engineering, vol. 12, pp. 881-890, 1988.
Hou Z., Antsaklis P.J. (1992), Analysis of Auto Powertrain Dynamics and Modeling using Neural Networks, MSc Thesis, Department of Electrical Engineering, University of Notre Dame, May 1992.
Jones L.K. (1990), "Constructive Approximations for Neural Networks by Sigmoidal Functions," IEEE Proceedings, vol. 78, no. 10, pp. 1586-1589, 1990.
Kana I., Guez A. (1989), "Neuromorphic Adaptive Control," Proc. of the 28th IEEE Conf. on Decision and Control, pp. 1739-1743, 1989.
Kawato M. (1990), "Computational Schemes and Neural Network Models for Formation and Control of Multijoint Arm Trajectory," Neural Networks for Control, Miller, Sutton and Werbos, Editors, MIT Press, 1990.
Kohonen T. (1980), Content-Addressable Memories, Springer-Verlag, New York, NY, 1980.
Kraft L.G., Campagna D.P. (1990), "A Comparison Between CMAC Neural Network Control and Traditional Adaptive Control Systems," IEEE Control Systems Magazine, vol. 10, no. 3, pp. 36-43, April 1990.
Leonard J.A., Kramer M.A. (1990), "Classifying Process Behavior with Neural Networks: Strategies for Improved Training and Generalization," Proceedings of the 1990 American Control Conference, pp. 2478-2483, 1990.
Lippmann R.P. (1987), "An Introduction to Computing with Neural Networks," IEEE ASSP Magazine, pp. 4-22, April 1987.
Miller T.W., Glanz F.H., Kraft L.G. (1990), "CMAC: An Associative Neural Network Alternative to Backpropagation," IEEE Proceedings, vol. 78, no. 10, pp. 1561-1567, 1990.
Miller T.W., Sutton R.S., Werbos P.J. (1990), Editors, Neural Networks for Control, MIT Press, Cambridge, MA, 1990.
Moody J., Darken D.J. (1989), "Fast Learning in Networks of Locally-Tuned Processing Units," Neural Computation, vol. 1, pp. 281-294, 1989.
Naidu S.R., Zafiriou E., McAvoy T.J. (1990), "Use of Neural Networks for Sensor Failure Detection in a Control System," IEEE Control Systems Magazine, vol. 10, pp. 49-55, April 1990.
Narendra K.S., Parthasarathy K. (1990), "Identification and Control of Dynamical Systems Using Neural Networks," IEEE Transactions on Neural Networks, vol. 1, no. 1, pp. 4-27, March 1990.
Nguyen H.D., Widrow B. (1990), "Neural Networks for Self-Learning Control Systems," IEEE Control Systems Magazine, vol. 10, no. 3, pp. 18-23, April 1990.
Nilsson N.J. (1965), Learning Machines, McGraw-Hill, 1965.
Parthasarathy K. (1991), Identification and Control Using Neural Networks, PhD dissertation, Dept. of Elec. Engr., Yale University, Dec. 1991.
Passino K.M., Sartori M.A., Antsaklis P.J. (1989), "Neural Computing for Numeric-to-Symbolic Conversion in Control Systems," IEEE Control Systems Magazine, pp. 44-52, April 1989.
Poggio T., Girosi F. (1990), "Networks for Approximation and Learning," IEEE Proceedings, vol. 78, no. 9, pp. 1481-1497, 1990.
Psaltis D., Sideris A., Yamamura A. (1987), "Neural Controllers," IEEE 1st Intl. Conf. on Neural Networks, vol. 4, pp. 551-558, 1987.
Qin S-Z, Su H-T, McAvoy T.J. (1992), "Comparison of Four Neural Net Learning Methods for Dynamic System Identification," IEEE Trans. on Neural Networks, vol. 3, no. 1, pp. 122-130, 1992.
Rumelhart D.E., Hinton G.E., Williams R.J. (1986), "Learning Internal Representations by Error Propagation," in Rumelhart D.E., McClelland J.L., eds., Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1: Foundations, pp. 318-362, MIT Press, 1986.
Sartori M.A. (1991), Feedforward Neural Networks and their Application in the Higher Level Control of Systems, Ph.D. Dissertation, Department of Electrical Engineering, University of Notre Dame, April 1991.
Sartori M.A., Antsaklis P.J. (1990), "Neural Network Training via Quadratic Optimization," Technical Report #90-05-01, Department of Electrical Engineering, University of Notre Dame, May 1990, Revised April 1991. Also in Proc. of ISCAS, San Diego, CA, May 10-13, 1992.
Sartori M.A., Antsaklis P.J. (1991a), "A Simple Method to Derive Bounds on the Size and to Train Multilayer Neural Networks," IEEE Trans. on Neural Networks, vol. 2, no. 4, pp. 467-471, July 1991.
Sartori M.A., Antsaklis P.J. (1991b), "A Gaussian Neural Network Implementation for Control Scheduling," Proceedings of the 1991 IEEE International Symposium on Intelligent Control, pp. 400-404, August 1991.
Sartori M.A., Antsaklis P.J. (1992a), "Implementations of Learning Control Systems Using Neural Networks," IEEE Control Systems Magazine, in Special Issue on "Neural Networks in Control Systems," vol. 12, no. 3, April 1992.
Sartori M.A., Antsaklis P.J. (1992b), "Failure Behavior Identification for a Space Antenna via Neural Networks," Proc. of American Control Conference, Chicago, IL, June 24-26, 1992.
Sartori M.A., Passino K.M., Antsaklis P.J. (1989), "Artificial Neural Networks in the Match Phase of Rule-Based Expert Systems," Proceedings of the Twenty-Seventh Annual Allerton Conference on Communication, Control, and Computing, University of Illinois at Urbana-Champaign, pp. 1037-1046, September 27-29, 1989.
Sartori M.A., Passino K.M., Antsaklis P.J. (1992), "Neural Networks in the Match Phase of Rule-Based Expert Systems," IEEE Transactions on Knowledge & Data Engineering, To appear.
Sontag E.D. (1990), "Feedback Stabilization using Two Hidden Layer Nets," Report SYCON-90-11, Rutgers Center for Systems and Control, Rutgers University, 1990.
Sontag E.D. (1991), "Feedforward Nets for Interpolation and Classification," Report SYCON-, Rutgers Center for Systems and Control, Rutgers University, 1991.
Sutton R.S., Barto A.G., Williams R.J. (1992), "Reinforcement Learning is Direct Adaptive Optimal Control," IEEE Control Systems Magazine, vol. 12, no. 3, April 1992.
Yao S.C., Zafiriou E. (1990), "Control System Sensor Failure Detection via Networks of Localized Receptive Fields," Proceedings of the 1990 American Control Conference, pp. 2472-2477, 1990.
Ydstie B.E. (1990), "Forecasting and Control Using Adaptive Connectionist Networks," Computers in Chemical Engineering, vol. 14, pp. 583-599, 1990.
Ungar L.H., Powell B.A., Kamens S. (1990), "Adaptive Networks for Fault Diagnosis and Process Control," Computers in Chemical Engineering, 1990.
Warwick K. (1992), Editor, Neural Networks for Control and Systems: Principles and Applications, Peter Peregrinus, UK, 1992.
Werbos P.J. (1990), "Backpropagation Through Time: What it Does and How to Do it," IEEE Proceedings, vol. 78, no. 10, pp. 1550-1560, 1990.
Wong Y., Sideris A. (1992), "Learning Convergence in the Cerebellar Model Articulation Controller," IEEE Trans. on Neural Networks, vol. 3, no. 1, pp. 115-121, Jan. 1992.
Computational Learning Theory for Artificial Neural Networks Martin Anthony and Norman Biggs
Department of Statistical and Mathematical Sciences, London School of Economics and Political Science, Houghton St., London WC2A 2AE, United Kingdom
1. LEARNING AND TRAINING: A FRAMEWORK
There are many types of activity which are commonly known as 'learning'. Here, we shall discuss a mathematical model of one such process, known as the 'probably approximately correct' (or PAC) model. We shall illustrate how key problems of learning in artificial neural networks can be studied within this framework, presenting theoretical analyses of two important issues: the size of training sample that should be used, and the running time of learning algorithms; in other words, sample complexity and computational complexity.
In a general framework for our discussion of learning, we have a ‘real world’ W containing a set of objects which we shall refer to as examples. We also have a ‘pre-processor’ P , which takes an example and converts it into a coded message, such as a string of bits or a real vector. This coded version of the example is then presented to M, a machine whose purpose is to recognise certain examples. The output of M is a single bit, either 1 (if the example is recognised as belonging to a certain set), or 0 (if not). For example, M may be a feedforward linear threshold network. The machine can be in one of many states. For example, M might be in a state in which it will recognise examples of the letter A, coded as a string of bits in which the 1’s correspond to the positions of black pixels on a grid; other states of M might enable it to recognise examples of other letters. The learning process we consider involves making changes to the state of M on the basis of examples presented to it, so that it achieves some desired classification. For instance, we might be able to change the state of M in such a way that the more positive and negative examples of the letter A that are presented, the better will M recognise future examples of that letter.
For our purposes, a 'concept' can be described by a set of examples. Let Σ be a set, which we shall call the alphabet for describing examples. In this article, Σ will be either the boolean alphabet {0,1}, or the real alphabet R. We denote the set of n-tuples of elements of Σ by Σ^n and the set of all non-empty finite strings of elements of Σ by Σ*. Let X be a subset of Σ*. We define a concept, over the alphabet Σ, to be a function c : X → {0,1}. In some cases X will be the whole of Σ*; while in other cases X will be taken as Σ^n for one specific value of n, which will be clear from the context. The set X will be referred to as the example space, and its members as examples. An example y ∈ X for which c(y) = 1 is known as a positive example, and an example for which c(y) = 0 is known as a negative example. The union of the set of positive examples and the set of negative examples is the domain of c. So, provided that the domain is known, c determines, and is determined by, its set of positive examples. Sometimes it is helpful to think of a concept as a set in that way. There are two sets of concepts inherent in our framework for learning. First, there is the set of concepts derived from the real world which it is proposed to recognise. This set might contain concepts like 'the letter A', 'the letter B', 'the letter C', and so on, each of which can be coded to determine a set of positive and negative examples. When a set of concepts is determined in this way, we shall use the term concept space for it. Secondly, there is the set of concepts which the machine M is capable of recognising. We shall suppose that M can assume various states, and that in a given state it will classify some inputs as positive examples (output 1), and the rest as negative examples (output 0). Thus a state of M determines a concept, which we may think of as a hypothesis. For this reason, the set of all concepts which M determines will be referred to as its hypothesis space. Generally speaking, we are given two sets of concepts, C (the concept space) and H (the hypothesis space), and the problem is to find, for each c ∈ C, some h ∈ H which is a good approximation to c. In realistic situations hypotheses are formed on the basis of certain information which does not amount to an explicit definition of c. In the framework we describe, we shall assume that this information is provided by a sequence of positive and negative examples of c. In practice there are constraints upon our resources, and we have to be content with a hypothesis h which 'probably' represents c 'approximately', in some sense to be defined. Let X ⊆ Σ* be the example space, where as always Σ is {0,1} or R. A sample of length m is just a sequence of m examples, that is, an m-tuple x = (x_1, x_2, ..., x_m) in X^m. A training sample s is an element of (X × {0,1})^m, that is,
s = ((x_1, b_1), (x_2, b_2), ..., (x_m, b_m)),

where the x_i are examples and the b_i are bits. We say that s is a training sample for the target concept t if b_i = t(x_i) for 1 ≤ i ≤ m.
A learning algorithm for (C, H) (or a (C, H)-learning algorithm) is a procedure which accepts training samples for functions in C and outputs corresponding hypotheses in H. Of course, in order to qualify as an algorithm the procedure must be effective in some sense; we shall discuss this point in more detail in due course. If we ignore the problem of effectiveness, a learning algorithm is simply a function L which assigns to any training sample s for a target concept t ∈ C a function h ∈ H. We write h = L(s). In what follows, we generally assume that C ⊆ H. If there are no contradictory labels in the sample, then s may be regarded as a function defined on the set {x_1, x_2, ..., x_m}, given by s(x_i) = b_i (1 ≤ i ≤ m). The hypothesis L(s) is a function defined on the whole example space X. A hypothesis h in H is said to be consistent with s, or to agree with s, if h(x_i) = b_i for each 1 ≤ i ≤ m. We do not, in general, make the assumption that L(s) is consistent with s, but when this condition does hold for all s we shall say that L itself is consistent. In that case the function L(s) is simply an extension of s. As indicated above, the machines needed in this area of Computational Learning Theory have the additional feature that they can take on a number of different states, and so a machine represents a set of functions, rather than a single function. A state of the machine is a representation of a function, and the set of all such functions comprises the hypothesis space defined by the machine. For example, in the case in which M is an artificial neural network, the states of the machine are the possible weight assignments. It is convenient to define a representation to be a surjection Ω → H, where Ω is a set and H is a hypothesis space. The set Ω may be the set of states of a machine. The surjection assigns to each state ω a corresponding function h_ω.
Example Consider the linear threshold machine, or boolean perceptron, with n boolean inputs y_1, y_2, ..., y_n ∈ {0,1} and a single active node. The arcs carrying the inputs have associated weights α_1, ..., α_n, and the weighted sum Σ_{i=1}^n α_i y_i of the inputs is applied to the active node. This node outputs 1 if the weighted sum is at least θ, and 0 otherwise. In this case the representation Ω → H is defined as follows. The set Ω is the real space R^{n+1}, that is, the set of (n+1)-tuples ω = (α_1, ..., α_n, θ). The function ω ↦ h_ω is given by h_ω(y_1 y_2 ... y_n) = 1 if and only if Σ_{i=1}^n α_i y_i ≥ θ.
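This representation transcribes directly into a few lines of Python; the sketch below assumes nothing beyond the definition just given, and the AND example at the end is an arbitrary choice of state for illustration.

```python
def boolean_perceptron(alpha, theta):
    """Return the hypothesis h_w computed by a linear threshold unit
    in state w = (alpha_1, ..., alpha_n, theta)."""
    def h(y):                       # y is a tuple of n bits
        return 1 if sum(a * yi for a, yi in zip(alpha, y)) >= theta else 0
    return h

h = boolean_perceptron(alpha=[1.0, 1.0], theta=2.0)   # this state computes AND of two bits
assert [h((0, 0)), h((0, 1)), h((1, 0)), h((1, 1))] == [0, 0, 0, 1]
```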
2. THE BASIC PAC MODEL
In general, not every extension of a training sample will be a valid generalisation, because the target concept is only partially defined by the sample. Furthermore, a training sample may be unrepresentative or misleading. This prompted Valiant (1984a, 1984b) to propose a probabilistic model of learning, which we now describe. We start by illustrating the basic ideas by means of a very simple example. For each real number θ the ray r_θ is the concept defined on R by r_θ(y) = 1 if and only if y ≥ θ. An algorithm for learning in the hypothesis space H = {r_θ | θ ∈ R} is based on the idea of taking the current hypothesis to be the 'smallest' ray containing all the positive examples in the training sample. A suitable default hypothesis when there are no positive examples in the sample will be the identically-0 function. For convenience,
we therefore consider this to be a ray, and call it the empty ray. It will be denoted r_∞, where ∞ is merely a symbol taken to be greater than any real number. Then, for a given training sample s = ((x_1, b_1), ..., (x_m, b_m)), the output hypothesis L(s) should be r_λ, where

λ = λ(s) = min {x_i | b_i = 1},
and λ = ∞ if the sample contains no positive examples. It is easy to see that if the training sample is for a target concept r_θ then L(s) will be a ray r_λ with λ = λ(s) ≥ θ. Because there are only finitely many examples in a training sample, and the example space is uncountable, we cannot expect that λ = θ. However, it seems that as the length of the training sample increases, so should the likelihood that there is small error resulting from using r_λ instead of r_θ. In practical terms, this property can be characterised as follows. Suppose we run the algorithm with a large training sample, and then decide to use the output hypothesis r_λ as a substitute for the (unknown) target concept r_θ. In other words, we are satisfied that the 'learner' has been adequately trained. If λ is not close to θ, this indicates that positive examples close to θ are relatively unlikely and did not occur in the training sample. Consequently, if we now classify some more examples which are presented according to the same distribution, then we shall make few mistakes as a result of using r_λ instead of r_θ. Consider a general learning framework in which a training sample s for a target concept t is generated by drawing the examples x_1, x_2, ..., x_m from X 'at random', according to some fixed, but unknown, probability distribution. A learning algorithm L produces a hypothesis L(s) which, it is hoped, is a good approximation to t. More fully, we require that, as the number m of examples in the training sample increases, so does the likelihood that the error which results from using L(s) in place of t is small. To formalise this, we suppose that we have a probability space (X, Σ, μ). Here, Σ is a σ-algebra of subsets of X and μ is a probability measure. In the cases under discussion here, the example space X is boolean or real. In the boolean case, we take Σ to be the set of all subsets of X and, in the real case, X ⊆ R^n, we take Σ to be the Borel algebra on X. (Those unfamiliar with the notions just discussed may find the details in Billingsley (1986), for example.) In both cases, we use the appropriate σ-algebra without explicit reference to the details. From now on, then, we shall simply speak of a 'probability distribution μ on X', by which we mean a function μ defined on the appropriate family Σ and satisfying the axioms given above. It must be emphasised that, in the applications we have in mind, we make no assumptions about μ, beyond the conditions stated in the definition. The situation we are modelling is that of a world of examples presented to the learner according to some fixed but unknown distribution. The 'teacher' is allowed to classify the examples as positive or negative, but cannot control the sequence in which the examples are presented.
We shall continue to assume that C ⊆ H, so that the target concept belongs to a hypothesis space H which is available to the learner. Given a target concept t ∈ H we define the error of any hypothesis h in H, with respect to t, to be the probability of the event h(x) ≠ t(x). That is,

er_μ(h, t) = μ{x ∈ X | h(x) ≠ t(x)}.
We refer to the set on the right-hand side as the error set, and we assume that it is an event, so that a probability can be assigned to it. This can be guaranteed when the hypothesis space is universally separable; see Blumer et al. (1989). In order to streamline the notation, we suppress the explicit reference to t when it is clear from the context, and we write er_μ(h) in place of er_μ(h, t). When a given set X is provided with the structure of a probability space, the product set X^m inherits a probability space structure from X. The corresponding probability distribution on X^m is the product distribution μ^m. Informally, for a given Y ⊆ X^m we shall interpret the value μ^m(Y) as 'the probability that a random sample of m examples drawn from X according to the distribution μ belongs to Y'. Let S(m, t) denote the set of training samples of length m for a given target concept t, where the examples are drawn from an example space X. Any sample x ∈ X^m determines, and is determined by, a training sample s ∈ S(m, t): if x = (x_1, x_2, ..., x_m), then s = ((x_1, t(x_1)), (x_2, t(x_2)), ..., (x_m, t(x_m))). In other words, there is a bijection φ : X^m → S(m, t) for which φ(x) = s. Thus, we can interpret the probability that s ∈ S(m, t) has some given property P in the following way. We define

μ^m{s ∈ S(m, t) | s has property P} = μ^m{x ∈ X^m | φ(x) has property P}.
It follows that, when the example space X is equipped with a probability distribution μ, we can give a precise interpretation to: (i) the error of the hypothesis produced when a learning algorithm L is given s; (ii) the probability that this error is less than ε. The first quantity is just er_μ(L(s)). The second is the probability, with respect to μ^m, that s has the property er_μ(L(s)) < ε. Putting all this together we can formulate the notion that, given a confidence parameter δ and an accuracy parameter ε, the probability that the error is less than ε is greater than 1 − δ. The result is one of the key definitions in this field. It was formulated first by Valiant (1984a) and, using this terminology, by Angluin (1988). We say that the algorithm L is a probably approximately correct learning algorithm for (C, H) if, given
• a real number δ (0 < δ < 1),
• a real number ε (0 < ε < 1),
then there is a positive integer m_0 = m_0(δ, ε) such that
• for any target concept t ∈ C, and
• for any probability distribution μ on X,
whenever m ≥ m_0,

μ^m{s ∈ S(m, t) | er_μ(L(s)) < ε} > 1 − δ.
The term 'probably approximately correct' is usually abbreviated to the acronym pac. The fact that m_0 depends upon δ and ε, but not on t and μ, reflects the fact that the learner may be able to specify the desired levels of confidence and accuracy, even though the target concept and the distribution of examples are unknown. The reason that it is at all possible to satisfy the condition for any μ is that it expresses a relationship between two quantities which involve μ: the error er_μ and the probability with respect to μ^m of a certain set. Pac learning is, in a sense, the best one can hope for within this probabilistic framework. Unrepresentative training samples, although unlikely, will on occasion be presented to the learning algorithm, and so one can only expect that it is probable that a useful training sample is presented. In addition, even for a representative training sample, an extension of the training sample will not generally coincide with the target concept, so one can only expect that the output hypothesis is approximately correct. For illustration, we now give a formal verification that the algorithm described above is a pac learning algorithm for rays.
Theorem 1 The learning algorithm L for rays given above is probably approximately correct.
Proof Suppose that δ, ε, r_θ, and μ are given. Let s be a training sample of length m for r_θ and let L(s) = r_λ. Clearly, the error set is the interval [θ, λ). For the given value of ε, and the given μ, define β_0 = β_0(ε, μ) = sup{β | μ[θ, β) < ε}. Then it follows that μ[θ, β_0) ≤ ε and μ[θ, β_0] ≥ ε. Thus if λ ≤ β_0 we have er_μ(L(s)) = μ[θ, λ) ≤ μ[θ, β_0) ≤ ε. The event that s has the property λ ≤ β_0 is just the event that at least one of the examples in s is in the interval [θ, β_0]. Since μ[θ, β_0] ≥ ε, the probability that a single example is not in this interval is at most 1 − ε. Therefore the probability that none of the m examples comprising s is in this interval is at most (1 − ε)^m. Taking the complementary event, it follows that the probability that λ ≤ β_0 is at least 1 − (1 − ε)^m. We noted above that the event λ ≤ β_0 implies the event er_μ(L(s)) ≤ ε, and so

μ^m{s ∈ S(m, r_θ) | er_μ(L(s)) ≤ ε} ≥ 1 − (1 − ε)^m.

Note that the right-hand side is independent of the target function r_θ (and μ). In order to make it greater than 1 − δ we can take

m_0 = ⌈(1/ε) ln(1/δ)⌉.
For then it follows that

(1 − ε)^m ≤ (1 − ε)^{m_0} < exp(−ε m_0) ≤ exp(ln δ) = δ.
This calculation shows that the algorithm is pac. (The inequality er_μ(L(s)) ≤ ε is obtained, whereas our definition of pac learning requires the inequality to be strict. However, it is easy to see that this makes no difference.) □ This provides an explicit formula for the length of sample (that is, the amount of training) sufficient to ensure prescribed levels of confidence and accuracy. Suppose we require δ = 0.001 and ε = 0.01. Then the value of m_0 is ⌈100 ln 1000⌉ = 691. So if we supply at least 691 labelled examples of any ray (these examples being chosen at random according to a fixed distribution μ) and take the output ray as a substitute for the target, then we can be 99.9% sure that at most 1% of future examples will be classified wrongly, provided they are drawn from the same source as the training sample.
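The ray learner and the sample-length formula above fit in a few lines of Python. This is only an illustrative sketch of the construction used in the proof; the Gaussian distribution and the target value of θ are arbitrary choices, not part of the argument.

```python
import math, random

def learn_ray(sample):
    """L(s): the smallest ray consistent with the positive examples; the empty ray otherwise."""
    positives = [x for x, b in sample if b == 1]
    return min(positives) if positives else math.inf   # math.inf plays the role of the empty ray

def sample_size(delta, eps):
    """m0(delta, eps) = ceil((1/eps) ln(1/delta)), as in the proof of Theorem 1."""
    return math.ceil((1 / eps) * math.log(1 / delta))

theta = 0.7                                            # unknown target ray r_theta
draw = lambda: random.gauss(0.0, 1.0)                  # an arbitrary fixed distribution mu
m = sample_size(delta=0.001, eps=0.01)                 # 691, as computed in the text
s = [(x, int(x >= theta)) for x in (draw() for _ in range(m))]
lam = learn_ray(s)                                     # with high probability, lam is close to theta
```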
It is convenient at this stage to define the notion of 'observed error'. For a given training sample s = ((x_1, b_1), (x_2, b_2), ..., (x_m, b_m)), we define the observed error of h on s to be

er_s(h) = (1/m) |{i : 1 ≤ i ≤ m, h(x_i) ≠ b_i}|.

When s is a training sample for t ∈ C, the observed error of h on s is the fraction of examples in the sample which h and t classify differently. Recall that a learning algorithm L for (C, H) is consistent if, given any training sample s for a target concept t ∈ H, the output hypothesis h = L(s) ∈ H agrees with t on the examples in s. In terms of observed error, a learning algorithm is consistent if for all t ∈ C and all m, given any s ∈ S(m, t), we have er_s(L(s)) = 0. As usual, assume that there is an unknown probability distribution μ on X. Suppose we fix, for the moment, a target concept t ∈ C. Given ε ∈ (0,1) the set of h ∈ H for which er_μ(h) ≥ ε may be described as the set of ε-bad hypotheses for t. A consistent algorithm for (C, H) produces an output hypothesis h such that er_s(h) = 0, and the pac property requires that such an output is unlikely to be ε-bad. This leads to the following definition.
We say that the hypothesis space H is potentially learnable if, given real numbers δ and ε (0 < δ, ε < 1), there is a positive integer m_0 = m_0(δ, ε) such that, whenever m ≥ m_0,

μ^m{s ∈ S(m, t) | for all h ∈ H, er_s(h) = 0 ⟹ er_μ(h) < ε} > 1 − δ,

for any probability distribution μ on X and any t ∈ H. The following result is clear.
Theorem 2 If H is potentially learnable, and L is a consistent learning algorithm for H , then L is pac. The definition of potential learnability is quite complex, but we have the following standard result, observed by Blumer et al. (1989), for example.
Theorem 3 Any finite hypothesis space is potentially learnable.

Proof If h has error at least ε then

μ{x ∈ X | h(x) = t(x)} = 1 − er_μ(h) ≤ 1 − ε.

Thus

μ^m{s ∈ S(m, t) | er_s(h) = 0} ≤ (1 − ε)^m.

This is the probability that any one ε-bad hypothesis is consistent with s. The probability that there is some ε-bad consistent hypothesis is therefore less than |H|(1 − ε)^m. This is less than δ if

m ≥ m_0 = ⌈(1/ε) ln(|H|/δ)⌉.

Taking the complementary event we have

μ^m{s ∈ S(m, t) | for all h ∈ H, er_s(h) = 0 ⟹ er_μ(h) < ε} > 1 − δ,

for m ≥ m_0, which is the required conclusion. □
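The bound in this proof is easy to evaluate. The sketch below simply computes ⌈(1/ε) ln(|H|/δ)⌉; the choice of |H| (all boolean functions of four variables) and of δ, ε is an arbitrary illustration.

```python
import math

def finite_space_sample_size(h_size, delta, eps):
    """Sufficient sample length from the proof of Theorem 3:
    |H| (1 - eps)^m < delta whenever m >= (1/eps) ln(|H|/delta)."""
    return math.ceil((1 / eps) * math.log(h_size / delta))

# All boolean functions of n = 4 variables: |H| = 2**(2**4) = 65536.
m0 = finite_space_sample_size(2 ** 16, delta=0.01, eps=0.1)   # about 157 examples suffice
```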
It is clear that this is a useful theorem. It covers all boolean cases, where the example space is {0,1}^n (or a subset thereof) with n fixed. In any such situation a consistent algorithm is automatically pac.

3. THE VAPNIK-CHERVONENKIS DIMENSION

Suppose that we have a hypothesis space H defined on an example space X. We have seen that if H is finite, then it is potentially learnable. The proof depends critically on the finiteness of H and cannot be extended to provide results for infinite H. However, there are many situations where the hypothesis space is infinite, and it is desirable to extend the theory to cover this case. A pertinent comment is that most hypothesis spaces which occur 'naturally' have a high degree of structure, and even if the space is infinite it may contain functions only of a special type. This is true, almost by definition, for any hypothesis space H which is constructed by means of a representation Ω → H. The key to extending results on potential learnability to infinite spaces is the observation that what matters is not the cardinality of H, but rather what may be described as its 'expressive power'. This notion can be formalised by means of the Vapnik-Chervonenkis dimension of H, a notion originally defined by Vapnik and Chervonenkis (1971), and introduced into learnability theory by Blumer et al. (1989).
In order to illustrate some of the ideas, we consider the real perceptron. This is a machine which operates in the same manner as the linear threshold machine, but with real-valued inputs. Thus, there are n inputs and a single active node. The arcs carrying the inputs have real-valued weights α_1, α_2, ..., α_n and there is a real threshold value θ at the active node. As with the linear threshold machine, the weighted sum of the inputs is applied to the active node and this node outputs 1 if and only if the weighted sum is at least the threshold value θ. More precisely, the real perceptron P_n on n inputs is defined by means of a representation Ω → H, where the set of states Ω is R^{n+1}. For a state ω = (α_1, α_2, ..., α_n, θ), the function h_ω ∈ H, from X = R^n to {0,1}, is given by h_ω(y) = 1 if and only if Σ_{i=1}^n α_i y_i ≥ θ. It should be noted that ω ↦ h_ω is not an injection: for any λ > 0 the state λω defines the same function as ω. Suppose that H is a hypothesis space defined on the example space X, and let x = (x_1, x_2, ..., x_m) be a sample of length m of examples from X. We define Π_H(x), the number of classifications of x by H, to be the number of distinct vectors of the form

(h(x_1), h(x_2), ..., h(x_m))

as h runs through all hypotheses of H. Although H may be infinite, H|_x, the hypothesis space obtained by restricting the hypotheses of H to domain E_x = {x_1, x_2, ..., x_m}, is finite and is of cardinality Π_H(x). Note that for any sample x of length m, Π_H(x) ≤ 2^m. An important quantity, and one which shall turn out to be crucial in applications to potential learnability, is the maximum possible number of classifications by H of a sample of a given length. We define the growth function Π_H by
Π_H(m) = max {Π_H(x) : x ∈ X^m}.

We have used the notation Π_H for both the number of classifications and the growth function, but this should cause no confusion. We noted above that the number of possible classifications by H of a sample of length m is at most 2^m, this being the number of binary vectors of length m. We say that a sample x of length m is shattered by H, or that H shatters x, if this maximum possible value is attained; that is, if H gives all possible classifications of x. Note that if the examples in x are not distinct then x cannot be shattered by any H. When the examples are distinct, x is shattered by H if and only if for any subset S of E_x there is some hypothesis h in H such that, for 1 ≤ i ≤ m, h(x_i) = 1 if and only if x_i ∈ S; S is then the subset of E_x comprising the positive examples of h. Based on the intuitive notion that a hypothesis space H has high expressive power if it can achieve all possible classifications of a large set of examples, we use as a measure of this power the Vapnik-Chervonenkis dimension, or VC dimension, of H, defined as follows. The VC dimension of H is the maximum length of a sample shattered by H; if there is no such maximum, we say that the VC dimension of H is infinite. Using the notation introduced in the previous section, we can say that the VC dimension of H, denoted VCdim(H), is given by

VCdim(H) = max {m : Π_H(m) = 2^m},
where we take the maximum to be infinite if the set is unbounded.
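For finite hypothesis classes over a finite domain, the number of classifications, shattering, and the VC dimension can all be computed by brute force, directly from the definitions above. The sketch below does exactly that; the domain and the family of rays used as a toy example are arbitrary illustrative choices.

```python
from itertools import combinations

def classifications(hypotheses, xs):
    """Pi_H(x): the set of distinct vectors (h(x_1), ..., h(x_m)) as h runs through H."""
    return {tuple(h(x) for x in xs) for h in hypotheses}

def shattered(hypotheses, xs):
    return len(classifications(hypotheses, xs)) == 2 ** len(xs)

def vc_dimension(hypotheses, domain):
    """Largest m for which some m-sample from the (finite) domain is shattered."""
    d = 0
    for m in range(1, len(domain) + 1):
        if any(shattered(hypotheses, xs) for xs in combinations(domain, m)):
            d = m
    return d

# Toy example: rays restricted to a small finite domain, h_t(x) = 1 iff x >= t.
domain = (0.0, 1.0, 2.0, 3.0)
rays = [lambda x, t=t: int(x >= t) for t in (-1.0, 0.5, 1.5, 2.5, 3.5)]
assert vc_dimension(rays, domain) == 1       # single points are shattered, no pair is
```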
A result which is often useful is that if H is a finite hypothesis space then H has VC dimension at most log |H|; this follows from the observation that if d is the VC dimension of H and x ∈ X^d is shattered by H, then |H| ≥ |H|_x| = 2^d. (Here, and throughout, log denotes logarithm to base 2.) Consider now the perceptron P_n with n inputs. The set of positive examples of the function h_ω computed by the perceptron in state ω = (α_1, α_2, ..., α_n, θ) is the closed half-space l_ω^+ consisting of y ∈ R^n such that Σ_{i=1}^n α_i y_i ≥ θ. This is bounded by the hyperplane l_ω with equation Σ_{i=1}^n α_i y_i = θ. Roughly speaking, l_ω divides R^n into the set of positive examples of h_ω and the set of negative examples of h_ω.
We shall use the following result, known as Radon's Theorem, in which, for S ⊆ R^n, conv(S) denotes the convex hull of S. Let n be any positive integer, and let E be any set of n + 2 points in R^n. Then there is a non-empty subset S of E such that

conv(S) ∩ conv(E \ S) ≠ ∅.
A proof is given, for example, by Grünbaum (1967).

Theorem 4 For any positive integer n, let P_n be the real perceptron with n inputs. Then VCdim(P_n) = n + 1.
Proof Let x = (x_1, x_2, ..., x_{n+2}) be any sample of length n + 2. As we have noted, if two of the examples are equal then x cannot be shattered. Suppose then that the set E_x of examples in x consists of n + 2 distinct points in R^n. By Radon's Theorem, there is a non-empty subset S of E_x such that conv(S) ∩ conv(E_x \ S) ≠ ∅. Suppose that there is a hypothesis h_ω in P_n such that S is the set of positive examples of h_ω in E_x. Then we have

S ⊆ l_ω^+,   E_x \ S ⊆ R^n \ l_ω^+.

Since open and closed half-spaces are convex subsets of R^n, we also have

conv(S) ⊆ l_ω^+,   conv(E_x \ S) ⊆ R^n \ l_ω^+.

Therefore

conv(S) ∩ conv(E_x \ S) ⊆ l_ω^+ ∩ (R^n \ l_ω^+) = ∅,

which is a contradiction. We deduce that no such h_ω exists and therefore that x is not shattered by P_n. Thus no sample of length n + 2 is shattered by P_n and the VC dimension of P_n is at most n + 1.
It remains to prove the reverse inequality. Let o denote the origin of R^n and, for 1 ≤ i ≤ n, let e_i be the point with a 1 in the ith coordinate and all other coordinates 0. Then P_n shatters the sample x = (o, e_1, e_2, ..., e_n) of length n + 1. To see this, suppose that S is a subset of E_x = {o, e_1, ..., e_n}. For i = 1, 2, ..., n, let α_i be 1 if e_i ∈ S and −1 otherwise, and let θ be −1/2 if o ∈ S, 1/2 otherwise. Then it is straightforward to verify that if ω is the state ω = (α_1, α_2, ..., α_n, θ) of P_n then the set of positive examples of h_ω in E_x is precisely S. Therefore x is shattered by P_n and, consequently, VCdim(P_n) ≥ n + 1. □
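The explicit construction in the second half of the proof can be checked mechanically for small n. The sketch below enumerates all subsets S of {o, e_1, ..., e_n} and verifies that the state described above classifies E_x exactly as required; only the definitions already given are assumed.

```python
from itertools import combinations

def perceptron(alpha, theta):
    return lambda y: int(sum(a * yi for a, yi in zip(alpha, y)) >= theta)

def shatters_origin_and_basis(n):
    """Check the construction in the proof of Theorem 4 for a given n."""
    points = [tuple(0 for _ in range(n))] + \
             [tuple(int(i == j) for j in range(n)) for i in range(n)]
    for k in range(len(points) + 1):
        for subset in combinations(points, k):
            S = set(subset)
            alpha = [1 if points[i + 1] in S else -1 for i in range(n)]   # weight for e_i
            theta = -0.5 if points[0] in S else 0.5                       # threshold for origin
            h = perceptron(alpha, theta)
            if any(h(p) != int(p in S) for p in points):
                return False
    return True

assert shatters_origin_and_basis(3)
```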
The growth function Π_H(m) of a hypothesis space of finite VC dimension is a measure of how many different classifications of an m-sample into positive and negative examples can be achieved by the hypotheses of H, while the VC dimension of H is the maximum value of m for which Π_H(m) = 2^m. Clearly these two quantities are related, because the VC dimension is defined in terms of the growth function. But there is another, less obvious, relationship: the growth function Π_H(m) can be bounded by a polynomial function of m, and the degree of the polynomial is the VC dimension d of H. Explicitly, we have the following theorem. The first inequality is due to Sauer (1972) and is usually known as Sauer's Lemma. The second inequality is elementary; a proof was given by Blumer et al. (1989).

Theorem 5 (Sauer's Lemma) Let d ≥ 0 and m ≥ 1 be given integers and let H be a hypothesis space with VCdim(H) = d ≥ 1. Then for m ≥ d,

Π_H(m) ≤ Σ_{i=0}^{d} C(m, i) ≤ (em/d)^d,

where e is the base of natural logarithms. □
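Both quantities in Sauer's Lemma are easy to evaluate numerically. The sketch below compares the binomial sum with the polynomial bound (em/d)^d for one arbitrary choice of m and d, and checks that both are far below 2^m.

```python
import math

def sauer_bound(m, d):
    """Sum_{i=0}^{d} C(m, i): upper bound on Pi_H(m) when VCdim(H) = d."""
    return sum(math.comb(m, i) for i in range(d + 1))

def sauer_poly_bound(m, d):
    """The weaker but simpler polynomial bound (e m / d)^d, valid for m >= d >= 1."""
    return (math.e * m / d) ** d

m, d = 100, 5
assert sauer_bound(m, d) <= sauer_poly_bound(m, d) < 2 ** m
```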
We have motivated our discussion of VC dimension by describing it as a measure of the expressive power of a hypothesis space. We shall see that it turns out to be a key parameter for quantifying the difficulty of pac learning. Our first result along these lines is that finite VC dimension is necessary for potential learnability.

Theorem 6 If a hypothesis space has infinite VC dimension then it is not potentially learnable.

Proof Suppose that H has infinite VC dimension, so that for any positive integer m there is a sample z of length 2m which is shattered by H. Let E = E_z be the set of examples in this sample and define a probability distribution μ on X by μ(x) = 1/2m if x ∈ E and μ(x) = 0 otherwise. In other words, μ is uniform on E and zero elsewhere. We observe that μ^m is uniform on E^m and zero elsewhere. Thus, with probability one, a randomly chosen sample x of length m is a sample of examples from E. Let s = (x, t(x)) ∈ S(m, t) be a training sample of length m for a target concept t ∈ H. With probability 1 (with respect to μ^m), we have x_i ∈ E for 1 ≤ i ≤ m. Since z is shattered by H, there is a hypothesis h ∈ H such that h(x_i) = t(x_i) for each
x_i (1 ≤ i ≤ m), and h(x) ≠ t(x) for all other x in E. It follows that h is consistent with s, whereas h has error at least 1/2 with respect to t. We have shown that for any positive integer m, and any target concept t, there is a probability distribution μ on X such that the set

{s | for all h ∈ H, er_s(h) = 0 ⟹ er_μ(h) < 1/2}

has probability zero. Thus, H is not potentially learnable. □
The converse of the preceding theorem is also true: finite VC dimension is sufficient for potential learnability. This result can be traced back to the statistical researches of Vapnik and Chervonenkis (1971) (see also Vapnik (1982) and Vapnik and Chervonenkis (1981)). The work of Blumer et al. (1989) showed that it is one of the key results in Computational Learning Theory. We now give some indication of its proof. Suppose that the hypothesis space H is defined on the example space X, and let t be any target concept in H, μ any probability distribution on X and ε any real number with 0 < ε < 1. The objects t, μ, ε are to be thought of as fixed, but arbitrary, in what follows. The probability of choosing a training sample for which there is a consistent, but ε-bad, hypothesis is

μ^m{s ∈ S(m, t) | there is h ∈ H such that er_s(h) = 0, er_μ(h) ≥ ε}.
Thus, in order to show that H is potentially learnable, it suffices to find an upper bound f(m, ε) for this probability which is independent of both t and μ and which tends to 0 as m tends to infinity. The following result, of the form just described, is due to Blumer et al. (1989), and generalises a result of Haussler and Welzl (1987). Better bounds have subsequently been obtained by Anthony, Biggs and Shawe-Taylor (1990) (see also Shawe-Taylor, Anthony and Biggs (1993)), but the result presented here suffices for the present discussion.
Theorem 7 Suppose that H is a hypothesis space defined on an example space X, and that t, μ, and ε are arbitrary, but fixed. Then

μ^m{s ∈ S(m, t) | there is h ∈ H such that er_s(h) = 0, er_μ(h) ≥ ε} < 2 Π_H(2m) 2^{−εm/2}

for all positive integers m ≥ 8/ε. □
The right-hand side is the bound f(m, ε) as postulated above. If H has finite VC dimension then, by Sauer's Lemma, Π_H(2m) is bounded by a polynomial function of m, and therefore f(m, ε) is eventually dominated by the negative exponential term. Thus the right-hand side, which is independent of t and μ, tends to 0 as m tends to infinity and, by the above discussion, this establishes potential learnability for spaces of finite VC dimension.
At this point it is helpful to introduce a new piece of terminology. Suppose that real numbers 0 < δ, ε < 1 are given, and let L be a learning algorithm for a concept space C and a hypothesis space H. We say that the sample complexity of L on T ⊆ C is the least value m_L(T, δ, ε) such that, for all target concepts t ∈ T and all probability distributions μ,

μ^m{s ∈ S(m, t) | er_μ(L(s)) < ε} > 1 − δ

whenever m ≥ m_L(T, δ, ε); in other words, a sample of length m_L(T, δ, ε) is sufficient to ensure that the output hypothesis L(s) is pac, with the given values of δ and ε. In practice we often omit T when this is clear and we usually deal with a convenient upper bound m_0 ≥ m_L, rather than m_L itself; thus m_0(δ, ε) will denote any value sufficient to ensure that the pac conclusion, as stated above, holds for all m ≥ m_0. The following result follows from Theorem 7.
Theorem 8 There is a constant K such that if the hypothesis space H has VC dimension d ≥ 1 and the concept space C is a subset of H, then any consistent learning algorithm L for (C, H) is pac, with sample complexity

m_0(δ, ε) = (K/ε) (d log(2/ε) + log(2/δ)),

for 0 < δ, ε < 1. □
Values of K are easily obtained; see Blumer et al. (1989), Anthony, Biggs and Shawe-Taylor (1990) and Anthony and Biggs (1992). This result provides an extension of the bound for finite spaces, mentioned at the beginning of the section, to spaces with finite VC dimension.
Example The real perceptron P_n has VC dimension n + 1. Suppose that for any training sample for a hypothesis of P_n, we can find a state ω of the perceptron such that h_ω is consistent with the sample. Then if we use a training sample of length

m_0(δ, ε) = ⌈(K/ε) ((n + 1) log(2/ε) + log(2/δ))⌉,

we are guaranteed a probably approximately correct output hypothesis, regardless of both the target hypothesis and the probability distribution on the examples. □
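Under a bound of the Theorem 8 form, a sufficient sample length for the real perceptron can be computed as below. The constant K = 8 and the values of n, δ, ε are purely illustrative choices, since the chapter leaves the constant unspecified.

```python
import math

def pac_sample_size(vc_dim, delta, eps, K=8.0):
    """Sample length of the form (K/eps)(d log(2/eps) + log(2/delta)), log to base 2.
    K is an illustrative value, not one taken from the chapter."""
    return math.ceil((K / eps) * (vc_dim * math.log2(2 / eps) + math.log2(2 / delta)))

# Real perceptron on n = 10 inputs: VC dimension n + 1 = 11.
m0 = pac_sample_size(vc_dim=11, delta=0.05, eps=0.1)
```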
Theorem 9 Let C be a concept spa.ce and H a hypothesis space, such that C has VC dimension at least 1 and consists of at least three distinct concepts. Suppose that L is
38 any pac learning algorithm for (C, H ) . Then the sample complexity of L satisfies
for all
E
5 118 and 6 5 1/100.
0
An immediate corollary of this result is that if a concept space C has infinite VC dimension then there is no pac learning algorithm for (C, H) for any hypothesis space
H. An interesting consequence of these results is that a set H of functions is pac learnable if and only if it is potentially learnable, since each condition holds precisely when H has finite VC dimension. The results show that if there is a pac learning algorithm for H , then any consistent learning algorithm for H is pac. 4. FEEDFORWARD ARTIFICIAL NEURAL NETWORKS
A perceptron contains only one ‘active unit’, and is consequently severely limited in its capabilities. The idea that more complex assemblies of units may have greater power is an old one, motivated by the fact that living brains seem to be constructed in this way, and it has led to the intensive study of ‘artificial neural networks’. We shall examine some such networks in the context of Computational Learning Theory. The basic structure is a pair of sets ( N ,A ) , where N is a finite set whose members are called nodes, and A is a subset of N x N whose members are called arcs. The structure ( N , A ) is a directed graph, or digraph, which we think of as a fixed architecture for a ‘machine’. For simplicity, we consider only digraphs which have no directed cycles: that is, there is no sequence of arcs beginning with ( T , s) and ending with ( q , r ) , for any node r . In the present context this is known as the feedforward condition. In order to present this set-up as a ‘machine’, we require some additional features. First we specify a subset J of the nodes, which we call input nodes, and a single node z $ J which we call the output node. The underlying idea is that all nodes receive and transmit signals; the input nodes receive their signals from the outside world and the output node transmits a signal to the outside world, while all other nodes receive and transmit along the relevant arcs of the digraph. Each arc ( T , s) has a weight, w ( r , s), which is a real number representing the strength of the connection between the nodes T and s. A positive weight corresponds to an ‘excitatory’ connection, a negative weight to an ‘inhibitory’ connection. Another feature is that all nodes except the input nodes are ‘active’, in that they transmit a signal which is a predetermined function of the signals they receive. For this reason, the nodes in N \ J are called computation nodes. To make this idea precise, we introduce an activation function fr for each computation node r . The activity of such a node is specified in two stages. First the signals arriving at T are aggregated by taking their weighted sum according to the connection strengths on the
arcs with terminal node r, and then the function f_r of this value is computed. Thus the action of the entire network may be described in terms of two functions p : N → R and q : N → R, representing the received and transmitted signals respectively. We assume that a vector of real-valued signals y = (y_j)_{j ∈ J} is applied externally to the input nodes, and p(j) = q(j) = y_j for each j in J. For each computation node r the received and transmitted signals are defined as follows:

p(r) = Σ_{j : (j,r) ∈ A} w(j, r) q(j),    q(r) = f_r(p(r)).

The output is the value q(z) transmitted by the node z.
For now, we assume that every activation function is a simple linear threshold function: f(u) = 1 if u ≥ θ and f(u) = 0 otherwise. We shall write θ_r to denote the value of the threshold for the node r.
When all the computation nodes are linear threshold nodes, a state ω of the machine is described by the real numbers w(r, s) and θ_r, for (r, s) ∈ A, r ∈ N \ J. The set of all states which satisfy some given rules (such as bounds on the values of the weights and thresholds) will be denoted by Ω. Now we are firmly within the framework developed earlier. The function computed by the machine in state ω will be denoted by h_ω, so that h_ω(y) = q(z). Note that this is a boolean value, because the output node has a linear threshold activation function. The set {h_ω | ω ∈ Ω} of functions computable by the machine is the hypothesis space H, and the assignment ω ↦ h_ω is a representation Ω → H.
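The received/transmitted-signal description translates directly into a short evaluation routine. The sketch below assumes the computation nodes are supplied in a feedforward order; the two-layer XOR network used as a test case is a standard construction chosen for illustration, not an example from the chapter.

```python
def evaluate_network(arcs, weights, thresholds, y):
    """Evaluate a feedforward linear threshold network in state w = (weights, thresholds).
    arcs: list of (source, target) pairs; weights: dict over arcs; thresholds: dict over
    computation nodes, listed in a feedforward order; y: dict of input-node signals."""
    q = dict(y)                                          # q(j) = y_j on the input nodes
    for r, theta in thresholds.items():
        p = sum(weights[(j, s)] * q[j] for (j, s) in arcs if s == r)   # received signal p(r)
        q[r] = 1 if p >= theta else 0                    # linear threshold activation
    return q

# A two-layer network computing XOR of two boolean inputs.
arcs = [("y1", "a"), ("y2", "a"), ("y1", "b"), ("y2", "b"), ("a", "z"), ("b", "z")]
weights = {("y1", "a"): 1, ("y2", "a"): 1, ("y1", "b"): -1, ("y2", "b"): -1,
           ("a", "z"): 1, ("b", "z"): 1}
thresholds = {"a": 1, "b": -1, "z": 2}                   # feedforward order: a, b, then z
out = [evaluate_network(arcs, weights, thresholds, {"y1": u, "y2": v})["z"]
       for u, v in [(0, 0), (0, 1), (1, 0), (1, 1)]]
assert out == [0, 1, 1, 0]
```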
5. EXPRESSIVE POWER OF THRESHOLD NETWORKS
We shall prove a result of Baum and Haussler (1989), which gives an upper bound on the VC dimension of a feedforward linear threshold network in terms of the number of nodes and arcs. Suppose that we have a feedforward network of linear threshold nodes, with underlying digraph (N, A) and set of states Ω. The feedforward condition allows us to label the computation nodes by the positive integers in their natural order, 1, 2, ..., z, in such a way that z is the output node and (i, j) ∈ A implies j > i. (This may be done by numbering first those computation nodes which are linked only to input nodes, then those which are linked only to input nodes and already-numbered computation nodes, and so on.) For each state ω ∈ Ω, corresponding to an assignment of weights and thresholds to all the arcs and computation nodes, we let ω^l denote the part of ω determined by the thresholds on computation nodes 1, 2, ..., l and the weights on arcs which terminate at those nodes. Then for 2 ≤ l ≤ z we have the decomposition ω^l = (ω^{l−1}, ζ_l), where ζ_l stands for the weights on arcs terminating at l and the threshold at l. In isolation, the output of a computation node l is a linear threshold function, determined by ζ_l, of the
outputs of all those nodes j for which (j, l) is an arc; some of these may be input nodes and some may be computation nodes with j < l. We denote the space of such functions by H_l and the growth function of this 'local hypothesis space' by Π_l. Suppose that x = (x_1, x_2, ..., x_m) is a sample of inputs to the network. (Each example x_i is a |J|-vector of real numbers, where J is the set of input nodes.) For any computation node l (1 ≤ l ≤ z), we shall say that states ω_1, ω_2 of the network are l-distinguishable by x if the following holds. There is an example in x such that, when this example is input, the output of at least one of the computation nodes 1, 2, ..., l is different when the state is ω_1 from its output when the state is ω_2. In other words, if one has access to the signals transmitted by nodes 1 to l only, then, using the sample x, one can differentiate between the two states. We shall denote by S_l(x) the number of different states which are mutually l-distinguishable by x.
Lemma 10 With the notation defined as above, we have

S_l(x) ≤ Π_1(m) Π_2(m) ··· Π_l(m).

Proof We prove the claim by induction on l. For l = 1 we have S_1(x) ≤ Π_1(x), because two states are 1-distinguishable if and only if they give different classifications of the training sample at node 1. Thus S_1(x) ≤ Π_1(m). Assume, inductively, that the claim holds for l = k − 1, where 2 ≤ k ≤ z. The decomposition ω^k = (ω^{k−1}, ζ_k) shows that if two states are k-distinguishable but not (k − 1)-distinguishable, then they must be distinguished by the action of the node k. For each of the S_{k−1}(x) (k − 1)-distinguishable states there are thus at most Π_k(m) k-distinguishable states. Hence S_k(x) ≤ S_{k−1}(x) Π_k(m). By the inductive assumption, the right-hand side is at most Π_1(m) Π_2(m) ··· Π_k(m). The result follows. □

If H is the hypothesis space of N then Π_H(x) is the number of states which are mutually distinguishable by x. Thus,

Π_H(m) ≤ Π_1(m) Π_2(m) ··· Π_z(m),
for any positive integer m. The next result follows from this observation and the previous result.

Corollary 11 Let (N, A) be a feedforward linear threshold network with z computation nodes, and let W = |N \ J| + |A| be the total number of variable weights and thresholds. Let H be the hypothesis space of the network. Then for m > W, we have

Π_H(m) ≤ (zem/W)^W.

Proof For each computation node i, let d(i) denote the number of arcs terminating at i, so that the local hypothesis space H_i has VC dimension d(i) + 1. Certainly, W ≥ d(i) + 1 for 1 ≤ i ≤ z and so, for each such i and for m > W, Π_i(m) ≤ (em/(d(i) + 1))^{d(i)+1}, by Sauer's Lemma. It follows that

Π_H(m) ≤ ∏_{i=1}^{z} (em/(d(i) + 1))^{d(i)+1}.

From this one can obtain the desired result. We omit the details here; these may be found in Baum and Haussler (1989) or Anthony and Biggs (1992). □

Theorem 12 The VC dimension of a feedforward linear threshold network with z computation nodes and a total of W variable weights and thresholds is at most 2W log(ez).
Proof Let H be the hypothesis space of the network. By the above result, we have, for m ≥ W, Π_H(m) ≤ (zem/W)^W, where W is the total number of weights and thresholds. Now, when m = 2W log(ez) we have 2^m = (ez)^{2W}, while (zem/W)^W = (2ez log(ez))^W, and 2 log(ez) < ez for any z ≥ 2 (the case z = 1 being immediate). Therefore, Π_H(m) < 2^m when m = 2W log(ez), and the VC dimension of H is at most 2W log(ez), as claimed. □

Notice that the upper bound on the VC dimension depends only on the 'size' of the network; that is, on the number of computation nodes and the number of arcs. That it is independent of the structure of the network - the underlying directed graph - suggests that it may not be a very tight bound. In their paper, Baum and Haussler (1989) showed that certain simple networks have VC dimension at least a constant multiple of the number of weights. More recently, Bartlett (1992) obtained similar results for wider classes of networks. However, in a result which shows that the upper bound is essentially the best that can be obtained, Maass (1992) has shown that there is a constant c such that for infinitely many values of W, some feedforward linear threshold network with W weights has VC dimension at least cW log W. (The networks for which Maass showed this to be true have 4 layers.) If, as in Baum and Haussler (1989), we substitute the bound of Corollary 11 directly into the result of Theorem 7 then we can derive a better upper bound on sample complexity than would result from substituting the VC dimension bound into Theorem 8. Indeed, the former method gives a bound involving a log(z/ε) term, while the latter yields a bound depending on log z log(1/ε). With this observation and the previous results, we have the following result on sufficient sample size.

Theorem 13 Let (N, A) be a feedforward linear threshold network having z computation nodes and W variable weights and thresholds. Then for all 0 < δ, ε < 1, there is m_0(δ, ε) such that if a training sample of length m ≥ m_0(δ, ε) is drawn at random according to some fixed probability distribution on the set of all inputs, then the following
holds. With probability at least 1 − δ, if the sample is 'loaded' onto the network, so that the function computed by the network is consistent with the sample, then the network correctly classifies a further randomly chosen input with probability at least 1 − ε. The sufficient sample size satisfies

m_0(δ, ε) ≤ (K_1/ε) (W log(z/ε) + log(1/δ))

for some (absolute) constant K_1. Furthermore, there is K_2 > 0 such that for infinitely many W, there is a network with W weights for which the sufficient sample size must satisfy

m_0(δ, ε) ≥ (K_2/ε) W log W.
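Before turning to computational complexity, here is a minimal sketch of the bookkeeping behind these results: counting the computation nodes z and the variable weights and thresholds W of an architecture, and evaluating the Theorem 12 bound 2W log(ez). The small example network is an arbitrary choice for illustration.

```python
import math

def network_size(nodes, input_nodes, arcs):
    """Count the number of computation nodes z and the number W of variable
    weights and thresholds, as used in Theorems 12 and 13."""
    z = len([v for v in nodes if v not in input_nodes])   # computation nodes
    W = z + len(arcs)                                     # one threshold per node, one weight per arc
    return z, W

def vc_upper_bound(z, W):
    """Theorem 12: VCdim(H) <= 2 W log2(e z)."""
    return 2 * W * math.log2(math.e * z)

# An arbitrary small architecture: 2 inputs, 3 computation nodes, 6 arcs.
nodes = ["y1", "y2", "a", "b", "z"]
arcs = [("y1", "a"), ("y2", "a"), ("y1", "b"), ("y2", "b"), ("a", "z"), ("b", "z")]
z, W = network_size(nodes, {"y1", "y2"}, arcs)            # z = 3, W = 9
print(vc_upper_bound(z, W))                               # about 54.5
```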
6. THE COMPUTATIONAL COMPLEXITY OF LEARNING
Thus far, a learning algorithm has been defined as a function mapping training samples into hypotheses. We shall now be more specific about the algorithmics. If pac learning by a learning algorithm is to be of practical value, it must, first, be possible to implement the learning algorithm on a computer; that is, it must be computable and therefore, in a real sense, an algorithm, not just a function. Further, it should be possible to implement the algorithm 'quickly'. The subject known as Complexity Theory deals with the relationship between the size of the input to an algorithm and the time required for the algorithm to produce its output for an input of that size. In particular, it is concerned with the question of when this relationship is such that the algorithm can be described as 'efficient'. Here, we shall describe the basic ideas in a very simplistic way. More details may be found in the books by Garey and Johnson (1979), Wilf (1986), and Cormen, Leiserson and Rivest (1990). The size of an input to an algorithm will be denoted by s. For example, if an algorithm has a binary encoding as input, the size of an input could be the number of bits it contains. Equally, if the input is a real vector, one could define the size to be the dimension of the vector. Let A be an algorithm which accepts inputs of varying size s. We say that the running time of A is O(f(s)) if there is some constant K such that, for any input of size s, the number of operations required to produce the output of A is at most K f(s). Note that this definition is 'device-independent' because the running time depends only on the number of operations carried out, and not on the actual speed with which such an operation can be performed. Furthermore, the running time is a worst-case measure; we consider the maximum possible number of operations taken over all inputs of a given size.
There are good reasons for saying that an algorithm with running time O(s^r), for some fixed integer r ≥ 1, is 'efficient'. Such an algorithm is said to be a polynomial time
algorithm, and problems which can be solved by a polynomial time algorithm are usually regarded as 'easy'. Thus, to show that a problem is easy, we should present a polynomial time algorithm for it. On the other hand, if we wish to show that a given problem is 'hard', it is enough to show that if this problem could be solved in polynomial time then so too could another problem which is believed to be hard. One standard problem which is believed to be hard is the graph k-colouring problem for k ≥ 3. Let G be a graph with vertex-set V and edge-set E, so that E is a subset of the set of 2-element subsets of V. A k-colouring of G is a function χ : V → {1, 2, ..., k} with the property that, whenever ij ∈ E, then χ(i) ≠ χ(j). The graph k-colouring problem may formally be stated as:

GRAPH k-COLOURING
Instance A graph G = (V, E).
Question Is there a k-colouring of G?

When we say that GRAPH k-COLOURING is 'believed to be hard', we mean that it belongs to a class of problems known as the NP-complete problems. This class of problems is very extensive, and contains many famous problems in Discrete Mathematics. Although it has not yet been proved, it is conjectured, and widely believed, that there is no polynomial time algorithm for any of the NP-complete problems. This is known as the 'P ≠ NP conjecture'. We shall apply these ideas in the following way. Suppose that Π is a problem in which we are interested, and Π0 is a problem which is known to be NP-complete. Suppose also that we can demonstrate that if there is a polynomial time algorithm for Π then there is one for Π0. In that case our problem Π is said to be NP-hard. If the P ≠ NP conjecture is true, then proving that a problem Π is NP-hard establishes that there is no polynomial time algorithm for Π. We now wish to quantify the behaviour of learning algorithms with respect to n, and it is convenient to make the following definitions. We say that a union of hypothesis spaces H = ∪H_n is graded by example size n when H_n denotes the space of hypotheses defined on examples of size n. For example, H_n may be the space P_n of the perceptron, defined on real vectors of length n. By a learning algorithm for H = ∪H_n, we mean a function L from the set of training samples for hypotheses in H to the space H, such that when s is a training sample for h ∈ H_n, it follows that L(s) ∈ H_n. That is, we insist that L preserves the grading. (Analogously, one may define, more generally, a learning algorithm for (C, H) when each of C and H are graded.) An example of a learning algorithm defined on the graded perceptron space P = ∪P_n is the perceptron learning algorithm of Rosenblatt (1959). (See also Minsky and Papert (1969).) Observe that this algorithm acts in essentially the same manner on each P_n; the 'rule' is the same for each n.
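Checking whether a given assignment of colours is a k-colouring is straightforward (it is deciding whether one exists that is hard). A small illustrative sketch:

```python
def is_k_colouring(edges, colouring, k):
    """Return True if `colouring` (a map vertex -> colour in {1,...,k}) uses only
    admissible colours and gives distinct colours to the endpoints of every edge."""
    return (all(1 <= c <= k for c in colouring.values())
            and all(colouring[i] != colouring[j] for (i, j) in edges))

# Example: a 3-colouring of the triangle exists, but no 2-colouring does.
triangle = [(1, 2), (2, 3), (1, 3)]
print(is_k_colouring(triangle, {1: 1, 2: 2, 3: 3}, 3))   # True
print(is_k_colouring(triangle, {1: 1, 2: 2, 3: 1}, 2))   # False: edge (1,3) monochromatic
```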
Consider a learning algorithm L for a hypothesis space H = ∪H_n, graded by example size. An input to L is a training sample, which consists of m examples of size n together with the m single-bit labels. The total size of the input is therefore m(n + 1), and it would be possible to use this single number as the measure of input size. However, there is some advantage in keeping track of m and n separately, and so we shall use the notation R_L(m, n) to denote the worst-case running time of L on a training sample of m examples of size n.
A learning algorithm L for H = ∪H_n is said to be a pac learning algorithm if L acts as a pac learning algorithm for each H_n. The sample complexity provides the link between the running time R_L(m, n) of a learning algorithm (that is, the number of operations required to produce its output on a sample of length m when the examples have size n) and its running time as a pac learning algorithm (that is, the number of operations required to produce an output which is probably approximately correct with given parameters). Since a sample of length m0(H_n, δ, ε) is sufficient for the pac property, the number of operations required is at most R_L(m0(H_n, δ, ε), n).
Until now, we have regarded the accuracy parameter ε as fixed but arbitrary. It is clear that decreasing this parameter makes the learning task more difficult, and therefore the running time of an efficient pac learning algorithm should be constrained in some appropriate way as 1/ε increases. We say that a learning algorithm L for H = ∪H_n is efficient with respect to accuracy and example size if its running time is polynomial in m and the sample complexity m_L(H_n, δ, ε) depends polynomially on n and 1/ε.
We are now ready to consider the implications for learning of the theory of NP-hard problems. Let H = ∪H_n be a hypothesis space of functions, graded by the example size n. The consistency problem for H may be stated as follows.
H-CONSISTENCY
Instance A training sample s of labelled examples of size n.
Question Is there a hypothesis in H_n consistent with s?

In practice, we wish to produce a consistent hypothesis, rather than simply know whether or not one exists. In other words, we have to solve a 'search' problem, rather than an 'existence' problem. But these problems are directly related. Suppose that we consider only those s with length bounded by some polynomial in n. Then, if we can find a consistent hypothesis in time polynomial in n, we can answer the existence question by the following procedure. Run the search algorithm for the time (polynomial in n) in which it is guaranteed to find a consistent hypothesis if there is one; then check the output hypothesis explicitly against the examples in s to determine whether or not it is consistent. This checking can be done in time polynomial in n also. Thus if we can show that a restricted form of the existence problem is NP-hard, this means that there is no polynomial time algorithm for the corresponding search problem (unless P = NP). If there is a consistent learning algorithm L for a graded hypothesis space H = ∪H_n such that VCdim(H_n) is polynomial in n and the algorithm runs in time polynomial in the sample length m, then the results presented earlier show that L pac learns H_n with running time polynomial in n and 1/ε, and so is efficient with respect to accuracy and example size. Roughly speaking, we may say that an efficient 'consistent-hypothesis-
finder' is an efficient 'pac learner'. It is natural to ask to what extent the converse is true. It turns out that efficient pac learning does imply efficient consistent-hypothesis-finding, provided we are prepared to accept a randomised algorithm. A full account of the meaning of this term may be found in the book of Cormen, Leiserson and Rivest (1990), but for our purposes the idea can be explained in a few paragraphs. We suppose that there is available some form of random number generator which, given any integer I ≥ 2, produces a stream of integers i in the range 1 ≤ i ≤ I, each particular value being equally likely. This could be done electronically, or by tossing an I-sided die. A randomised algorithm A is allowed to use these random numbers as part of its input. The computation carried out by the algorithm is determined by its input, so that it depends on the particular sequence produced by the random number generator. It follows that we can speak of the probability that A has a given outcome, by which is meant the proportion of sequences which produce that outcome. We say that a randomised algorithm A 'solves' a search problem Π if it behaves in the following way. The algorithm always halts and produces an output. If A has failed to find a solution to Π then the output is simply no. But, with probability at least 1/2 (in the sense explained above), A succeeds in finding a solution to Π and its output is this solution. The practical usefulness of a randomised algorithm stems from the fact that repeating the algorithm several times dramatically increases the likelihood of success. If the algorithm fails at the first attempt, which happens with probability at most 1/2, then we simply try again. The probability that it fails twice in succession is at most 1/4. Similarly, the probability that it fails in k attempts is at most (1/2)^k, which approaches zero very rapidly with increasing k. Thus in practice a randomised algorithm is almost as good as an ordinary one - provided of course that it has polynomial running time. We have the following theorem of Pitt and Valiant (1988) (see also Natarajan (1989) and Haussler et al. (1988)).
Theorem 14 Let H = ∪H_n be a hypothesis space and suppose that there is a pac learning algorithm for H which is efficient with respect to accuracy and example size. Then there is a randomised algorithm which solves the problem of finding a hypothesis in H_n consistent with a given training sample of a hypothesis in H_n, and which has running time polynomial in n and m (the length of the training sample).

Proof Suppose that s* is a training sample for a target hypothesis t ∈ H_n, and that s* contains m* distinct labelled examples. We shall show that it is possible to find a hypothesis consistent with s* by running the given pac learning algorithm L on a related training sample. Define a probability distribution p on the example space X by p(x) = 1/m* if x occurs in s* and p(x) = 0 otherwise. We can use a random number generator with output values i in the range 1 to m* to select an example from X according to this distribution: simply regard each random number as the label of one of the m* equiprobable examples. Thus the selection of a training sample of length m for t, according to the probability distribution p, can be simulated by generating a
sequence of m random numbers in the required range. Let L be a pac learning algorithm as postulated in the statement of the Theorem. Then, when δ, ε are given, we can find an integer m0(n, δ, ε) for which the probability (with respect to training samples s ∈ S(m0, t)) that the error of L(s) is less than ε is greater than 1 − δ. Suppose we specify the confidence and accuracy parameters to be δ = 1/2 and ε = 1/m*. Then if we run the given algorithm L on a training sample of length m0(n, 1/2, 1/m*), drawn randomly according to the distribution p, the pac property of L ensures that the probability that the error of the output is less than 1/m* is greater than 1 − 1/2 = 1/2. Since there are no examples with probability strictly between 0 and 1/m*, this implies that the probability that the output agrees exactly with the training sample is greater than 1/2. The procedure described in the previous paragraph is the basis for a randomised algorithm L* for finding a hypothesis which agrees with the given training sample s*. In summary, L* consists of the following steps.

• Evaluate m0 = m0(n, 1/2, 1/m*).
• Construct, as described, a sample s of length m0, according to p.
• Run the given pac learning algorithm L on s.
• Check L(s) explicitly to determine whether or not it agrees with s*.
• If L(s) does not agree with s*, output no. If it does, output L(s).
As we noted, the pac property of L ensures that L* succeeds with probability greater than 1/2. Finally, it is clear that, since the running time of L is polynomial in m and its sample complexity m0(n, 1/2, 1/m*) is polynomial in n and m* = 1/ε, the running time of L* is polynomial in n and m*. □
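The procedure L* can be summarised in a few lines of code. The following Python sketch is purely illustrative: pac_learner and sample_complexity are hypothetical stand-ins for the given algorithm L and its sample complexity m0(n, 1/2, 1/m*), neither of which is specified here, and the labelled examples are assumed to be hashable.

```python
import random

def find_consistent(sample, n, pac_learner, sample_complexity):
    """Randomised consistent-hypothesis finder L* from the proof of Theorem 14.
    `pac_learner(s)` and `sample_complexity(n, delta, epsilon)` stand in for the
    given pac learning algorithm L and its sample complexity."""
    distinct = list(dict.fromkeys(sample))        # the m* distinct labelled examples
    m_star = len(distinct)
    m0 = sample_complexity(n, 0.5, 1.0 / m_star)  # delta = 1/2, epsilon = 1/m*
    # Simulate drawing m0 examples according to the uniform distribution p
    # concentrated on the examples occurring in the given sample.
    resample = [random.choice(distinct) for _ in range(m0)]
    h = pac_learner(resample)
    # Check the returned hypothesis explicitly against the original sample.
    if all(h(x) == b for (x, b) in distinct):
        return h
    return None                                   # corresponds to the output 'no'
```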
7. THE HARDNESS RESULTS FOR NEURAL NETWORKS

The fact that computational complexity-theoretic hardness results hold for neural networks was first shown by Judd (1988). In this section we shall prove a simple hardness result along the lines of one due to Blum and Rivest (1988).
The machine has n input nodes and k + 1 computation nodes (k ≥ 1). The first k computation nodes are 'in parallel' and each of them is connected to all the input nodes. The last computation node is the output node; it is connected by arcs with fixed weight 1 to the other computation nodes, and it has fixed threshold k. The effect of this arrangement is that the output node acts as a multiple AND gate for the outputs of the other computation nodes. We shall refer to this machine (or its hypothesis space) as P_n^k.
A state ω of P_n^k is described by the thresholds θ_l (1 ≤ l ≤ k) of the first k computation nodes and the weights w(i, l) on the arcs (i, l) linking the input nodes to the computation nodes. We shall use the notation a^(l) for the n-vector of weights on the arcs terminating at l, so that a_i^(l) = w(i, l). The set Ω of such states provides a representation Ω → P_n^k
in the usual way. We shall prove that the consistency problem for P^k = ∪P_n^k is NP-hard (provided k ≥ 3), by reduction from GRAPH k-COLOURING. Let G be a graph with vertex-set V = {1, 2, ..., n} and edge-set E. We construct a training sample s(G) as follows. For each vertex i ∈ V we take as a negative example the vector v_i, which has 1 in the ith coordinate position, and 0's elsewhere. For each edge ij ∈ E we take as a positive example the vector v_i + v_j. We also take the zero vector o = 00...0 to be a positive example.
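Constructing s(G) from a graph is mechanical. The following sketch is a hypothetical helper (with vertices indexed from 0 rather than from 1) assembling the negative and positive examples of the reduction.

```python
def training_sample(n, edges):
    """Build the training sample s(G) used in the reduction: the zero vector and
    each edge vector v_i + v_j are positive examples, and each vertex vector v_i
    is a negative example.  Labels: 1 = positive, 0 = negative."""
    def unit(i):
        return tuple(1 if j == i else 0 for j in range(n))
    sample = [(tuple([0] * n), 1)]                               # zero vector
    sample += [(unit(i), 0) for i in range(n)]                   # vertices
    sample += [(tuple(a + b for a, b in zip(unit(i), unit(j))), 1)
               for (i, j) in edges]                              # edges
    return sample

# Example: the triangle on three vertices.
print(training_sample(3, [(0, 1), (1, 2), (0, 2)]))
```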
Theorem 15 There is a function in P_n^k which is consistent with s(G) if and only if the graph G is k-colourable.

Proof Suppose h ∈ P_n^k is consistent with the training sample. By the construction of the network, h is a conjunction h = h_1 ∧ h_2 ∧ ... ∧ h_k of linear threshold functions. (That is, h(x) = 1 if and only if h_i(x) = 1 for all i between 1 and k.) Specifically, there are weight-vectors a^(1), a^(2), ..., a^(k) and thresholds θ_1, θ_2, ..., θ_k such that

h_l(y) = 1 ⟺ (a^(l), y) ≥ θ_l    (1 ≤ l ≤ k).
Note that, since o is a positive example, we have 0 = (a^(l), o) ≥ θ_l for each l between 1 and k. For each vertex i, h(v_i) = 0, and so there is at least one function h_f (1 ≤ f ≤ k) for which h_f(v_i) = 0. Thus we may define χ : V → {1, 2, ..., k} by

χ(i) = min{f | h_f(v_i) = 0}.

It remains to prove that χ is a colouring of G. Suppose that χ(i) = χ(j) = f, so that h_f(v_i) = h_f(v_j) = 0. In other words,

(a^(f), v_i) < θ_f,  (a^(f), v_j) < θ_f.

Then, recalling that θ_f ≤ 0, we have

(a^(f), v_i + v_j) < θ_f + θ_f ≤ θ_f.

It follows that h_f(v_i + v_j) = 0 and h(v_i + v_j) = 0. Now if ij were an edge of G, then we should have h(v_i + v_j) = 1, because we assumed that h is consistent with the training sample. Thus ij is not an edge of G, and χ is a colouring, as claimed.
Conversely, suppose we are given a colouring χ : V → {1, 2, ..., k}. For 1 ≤ l ≤ k define the weight-vector a^(l) as follows: a_i^(l) = −1 if χ(i) = l and a_i^(l) = 1 otherwise. Define the threshold θ_l to be −1/2. Let h_1, h_2, ..., h_k be the corresponding linear threshold functions, and let h be their conjunction. We claim that h is consistent with s(G). Since 0 ≥ θ_l = −1/2 it follows that h_l(o) = 1 for each l, and so h(o) = 1. In order to evaluate h(v_i), note that if χ(i) = f then

(a^(f), v_i) = a_i^(f) = −1 < −1/2,

so h_f(v_i) = 0 and h(v_i) = 0, as required. Finally, for any colour l and edge ij we know that at least one of χ(i) and χ(j) is not l. Hence

(a^(l), v_i + v_j) = a_i^(l) + a_j^(l),

where either both of the terms on the right-hand side are 1, or one is 1 and the other is −1. In any case the sum exceeds the threshold −1/2, and h_l(v_i + v_j) = 1. Thus h(v_i + v_j) = 1. □
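The converse construction can also be checked directly. The sketch below (illustrative helper names, 0-based vertex indices) builds the conjunction of threshold functions from a colouring and verifies it on the examples of s(G) for a triangle.

```python
def hypothesis_from_colouring(n, colouring, k):
    """Conjunction h = h_1 ∧ ... ∧ h_k built from a k-colouring as in the converse
    direction of Theorem 15: the l-th weight vector has entry -1 on vertices
    coloured l and +1 elsewhere; every threshold is -1/2."""
    weights = [[-1 if colouring[i] == l else 1 for i in range(n)]
               for l in range(1, k + 1)]
    return lambda x: int(all(sum(w * xi for w, xi in zip(row, x)) >= -0.5
                             for row in weights))

# Check consistency with s(G) for a triangle coloured 1, 2, 3 (k = 3).
n, edges, colouring = 3, [(0, 1), (1, 2), (0, 2)], {0: 1, 1: 2, 2: 3}
h = hypothesis_from_colouring(n, colouring, 3)
unit = lambda i: tuple(1 if j == i else 0 for j in range(n))
assert h((0, 0, 0)) == 1                                   # zero vector: positive
assert all(h(unit(i)) == 0 for i in range(n))              # vertices: negative
assert all(h(tuple(a + b for a, b in zip(unit(i), unit(j)))) == 1
           for i, j in edges)                              # edges: positive
```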
The proof that the decision problem for consistency in P^k is NP-hard for k ≥ 3 follows directly from this result. If we are given an instance G of GRAPH k-COLOURING, we can construct the training sample s(G) in polynomial time. If the consistency problem could be solved by a polynomial time algorithm A, then we could answer GRAPH k-COLOURING in polynomial time by the following procedure: given G, construct s(G), and run A on this sample. The above result tells us that the answer given by A is the same as the answer to the original question. But GRAPH k-COLOURING is known to be NP-complete, and hence it follows that the P^k-CONSISTENCY problem is NP-hard if k ≥ 3. (In fact, the same is true if k = 2. This follows from work of Blum and Rivest (1988).) Thus, fixing k, we have a very simple family of feedforward linear threshold networks, each consisting of k + 1 computation nodes (one of which is 'hard-wired' and acts simply as an AND gate) for which the problem of 'loading' a training sample is computationally intractable. Theorem 14 enables us to move from this hardness result for the consistency problem to a hardness result for pac learning. The theorem tells us that if we could pac learn P_n^k with running time polynomial in 1/ε and n then we could find a consistent hypothesis, using a randomised algorithm with running time polynomial in m and n. In the language of Complexity Theory this would mean that the latter problem is in RP, the class of problems which can be solved in 'randomised polynomial time'. It is thought that RP does not contain any NP-hard problems - this is the 'RP ≠ NP' conjecture, which is considered to be as reasonable as the 'P ≠ NP' conjecture. Accepting this, it follows that there is no polynomial time pac learning algorithm for the graded space P^k = ∪P_n^k when k ≥ 2.
This may be regarded as a rather pessimistic note, but it should be emphasised that the 'non-learnability' result discussed above is a worst-case result and indicates that training feedforward linear threshold networks is hard in general. This does not mean that a particular learning problem cannot be solved in practice.

8. EXTENSIONS AND GENERALISATIONS
The basic pac model is useful, but it has clear limitations. A number of extensions to the basic model have been made in the last few years. In this section, we briefly describe some of these. It is not possible to give all the details here; the reader is referred to the references cited for more information.
8.1 Stochastic concepts
The results presented so far have nothing to say if there is some form of 'noise' present during the learning procedure. Further, the basic model applies only to the learning of functions: each example is either a positive example or a negative example of the given target concept, not both. But one can envisage situations in which the 'teacher' has difficulty classifying some examples, so that the labelled examples presented to the 'learner' are not labelled by a function, the same example being on occasion presented by the 'teacher' as a positive example and on other occasions (possibly within the same training sample) as a negative example. For example, in the context of machine vision, if the concept is a geometrical figure then points close to the boundary of the figure may be difficult for the teacher to classify, sometimes being classified as positive and sometimes as negative. Alternatively, the problem may not lie with the teacher, but with the 'concept' itself. This may be ill-formed and may not be a function at all. To deal with these situations, we have the notion of a stochastic concept, introduced by Blumer et al. (1989). A stochastic concept on X is simply a probability distribution P on X × {0, 1}. Informally, for finite or countable X, one interprets P((x, b)) to be the probability that x will be given classification b. This can be specialised to give the standard pac model, as follows. Suppose we have a probability distribution p on X and a target concept t; then (see Anthony and Shawe-Taylor (1990), for example) there is a probability distribution P on X × {0, 1} such that for all measurable subsets S of X,

P({(x, t(x)) | x ∈ S}) = p(S);  P({(x, b) | x ∈ S, b ≠ t(x)}) = 0.

In this case, we say that P corresponds to t and p. What can be said about 'learning' a stochastic concept by means of a hypothesis space H of {0, 1}-valued functions? The error of h ∈ H with respect to the target stochastic concept is the probability
of misclassification by h of a further randomly drawn training example. If P is truly stochastic (and not merely the stochastic representation of a function, as described above) it is unlikely that this error can be made arbitrarily small. As earlier, the observed error of h on a training sample s = ((x_1, b_1), (x_2, b_2), ..., (x_m, b_m)) is defined to be

er_s(h) = (1/m) |{i : h(x_i) ≠ b_i}|.
Clearly this may be non-zero for all h (particularly if the same example occurs twice in the sample, but with different labels). What should 'learning' mean in this context? What we should like is that there is some sample size m0, independent of the stochastic concept P, such that if a hypothesis has 'small' observed error with respect to a random sample of length at least m0 then, with high probability, it has 'small' error with respect to P. The following result follows from one of Vapnik (1982) and was first presented in the context of computational learning theory by Blumer et al. (1989). (The result presented here is a slight improvement due to Anthony and Shawe-Taylor (1990).)
Theorem 16 Let H be a hypothesis space of {0, 1}-valued functions defined on an input space X. Let P be any probability measure on S = X × {0, 1} (that is, P is a stochastic concept on X), let 0 < ε < 1 and let 0 < γ ≤ 1. Then the P^m-probability that, for s ∈ S^m, there is some hypothesis h from H such that er_P(h) > ε and er_s(h) ≤ (1 − γ)er_P(h) is at most 4Π_H(2m) exp(−γ²εm/4). Furthermore, there is a constant K > 0 such that if H has finite VC dimension d, then there is m0 = m0(δ, ε, γ)
such that if m ≥ m0 then, for s ∈ S^m,

er_s(h) ≤ (1 − γ)ε  ⟹  er_P(h) < ε

with probability at least 1 − δ.

The notion of a stochastic concept can be applied to a number of situations. As already indicated, it can be useful when the target 'concept' is not a function. It can also be useful when there is 'classification noise' (see Angluin and Laird (1988)), that is, where there is an underlying target function, but the randomly chosen examples have their labels 'flipped' occasionally. This corrupts the training data and results in a stochastic concept. Additionally, in the standard pac model, we have assumed that the concept space is a subset of the hypothesis space. Suppose this is not so and t ∈ C \ H. Then there can be no h ∈ H such that the error of h with respect to t is 0 for all probability distributions p on X. However, Theorem 16 is applicable. Suppose p is a distribution on X and take P to be the stochastic concept corresponding to t and p. Since the sample size given in Theorem 16 is independent of the stochastic concept P, we obtain a type of learnability result when H has finite VC dimension: there is a sample size m0(δ, ε) such that if a randomly drawn training sample of t of length m0 is presented, then with probability at least 1 − δ, if h ∈ H is correct on a fraction of at least 1 − ε/2 of the sample, then h has error at most ε. (Here, we have taken γ = 1/2 for simplicity.) In fact, somewhat more can be said about 'learning' in this case. Suppose we have a concept space C and a hypothesis space H defined on X, not necessarily such that C ⊆ H. Suppose that p is a probability distribution on X. For any target t ∈ C, let opt_H(t) = inf_{h∈H} er_p(h, t). Let us say that a learning algorithm L for (C, H) is a probably approximately optimal algorithm if for any 0 < δ, ε < 1, there is m0(δ, ε) such that for any t ∈ C and any p, if a training sample s for t of length at least m0 is randomly drawn then with probability at least 1 − δ, er_p(L(s)) < opt_H(t) + ε.
We say that a hypothesis space H has the uniform convergence of errors (UCE) property if the following holds. Given real numbers δ and ε (0 < δ, ε < 1), there is a positive integer m0(δ, ε) such that, for any probability distribution P on S = X × {0, 1},

P^m({s ∈ S^m | for all h ∈ H, |er_P(h) − er_s(h)| < ε}) > 1 − δ.
This means, roughly speaking, that one can guarantee with high confidence that the observed errors of functions in H are within ε of their actual errors. Results of Vapnik and Chervonenkis (1971) on the uniform convergence of relative frequencies to probabilities show that if H has finite VC dimension then H has the UCE property. It follows, upon taking P to be the stochastic concept corresponding to t and p, that there is a probably approximately optimal learning algorithm for (C, H): the algorithm which returns the hypothesis which minimises the observed error. Of course, this minimisation may be a computationally intractable problem, but the emphasis here is on whether such learning is theoretically possible.

8.2 Variations on the pac model
We have observed that pac learning may be computationally intractable, due to the difficulty of the associated consistency problem. Many researchers have looked at a number of ways of varying the model to make learning easier. We shall now briefly discuss two important variations: distribution-dependent models and models which allow queries. Perhaps the main attraction of the definition of pac learning is the 'distribution-free' criterion: the sample size is independent of the probability distribution. The proofs of the standard computational hardness results for pac learning and the lower bounds on sample complexity involve the use of very particular probability distributions, so the theory presented earlier is very dependent on this criterion. If we know in advance what the distribution on the example space is, or if we know that it is one of a particular set of distributions, then the full strength of the pac definition is not needed. Generally, suppose C is a concept space, H is a hypothesis space, and P is a class of probability distributions on X. We may say that a learning algorithm L for (C, H) is a (C, H, P) pac learning algorithm if there is a sample size m0(δ, ε) such that for any t ∈ C and any p ∈ P, with p^m-probability at least 1 − δ, if a training sample s is presented, er_p(L(s)) < ε. This is a weakening of the pac criterion, in that m0 need be uniform only over P and not over all distributions. The case P = {p}, in which P consists of just one distribution, has been studied by Benedek and Itai (1988, 1991), where a characterisation of learnability in terms of ε-covers is obtained. In general, finite VC dimension is not necessary for (C, H, P) pac learning. (The theory of previous sections shows that it is necessary when P consists of all distributions.) In addition, learning with respect to a particular distribution may be computationally feasible in situations where standard pac learning is NP-hard. For further discussion of distribution-dependent learning, we refer the reader to the papers of Benedek and Itai (1992), Ben-David, Benedek and Mansour (1989), Bertoni et al. (1992), Kharitonov (1993), Li and Vitanyi (1989), Linial, Mansour and Nisan (1989).
In the standard pac framework, the learning algorithm receives labelled examples and forms a hypothesis only on the basis of these. The learning algorithm has no control over the sequence of training examples. Clearly, it might be possible to make learning easier if we allow L to 'ask questions', such as: 'is x a positive example?', for a particular x chosen by the algorithm. This type of query is known as a membership query and, intuitively, one might expect that it makes it easier for the learning algorithm to converge to the target concept. The idea of learning with this and other types of query in addition to random labelled examples (and sometimes in place of random labelled examples) was mentioned by Valiant (1984a) and has been studied extensively in recent years. We refer the reader to the papers by Angluin (1988), Angluin, Frazier and Pitt (1990), Baum (1990, 1991), Maass and Turan (1990, 1992), and the survey by Angluin (1992) and the references therein.

8.3 The graph dimension
As far as applications to artificial neural networks are concerned, the most significant and important extension of the basic pac model is to the learning of general function spaces. The basic model concerns {0, 1}-valued functions only; that is, it is concerned only with classification problems. We have seen how it applies to feedforward linear threshold networks having one output node. But what can be said about learning and generalisation in feedforward linear threshold networks having more than one output node, or in artificial neural networks with sigmoid activation functions and a real-valued output? To deal with these problems and others, the pac model has been extended in a number of ways. Most of the relevant work has been on sample complexity rather than computational complexity, with a few notable exceptions, such as the paper of Kearns and Schapire (1990). Here, we concentrate on the problem of sufficient sample size. In what follows, certain technical measure-theoretic conditions have to be satisfied; we shall not discuss these, but refer the reader to the paper of Haussler (1992) or to the book by Pollard (1984). Suppose that C, H are sets of functions from an example space X into a set Y, with C ⊆ H, and suppose that t ∈ C. Suppose also that there is a probability distribution p on X. Generalising in the obvious way from our previous definitions, we may define the error of h ∈ H with respect to t to be

er_p(h) = p({x ∈ X | h(x) ≠ t(x)}).
That is, h is erroneous on example x if h(x) ≠ t(x). When Y = R, for example, this may seem a little coarse; we shall later discuss an alternative approach. With this measure of error, we may define learning as earlier. For h ∈ H, let Θ_h be the function from X × Y to {0, 1} defined by

Θ_h(x, y) = 1 ⟺ h(x) = y,

and let Θ_H = {Θ_h : h ∈ H}. Now, it can be shown that if the hypothesis space Θ_H is pac learnable (in the usual sense), then so too is H, by any consistent learning
algorithm; see Natarajan (1989) and Anthony and Shawe-Taylor (1990). Furthermore, the sample complexity of any consistent learning algorithm for H (when regarded as a 'pac' algorithm) can be bounded by an expression involving the VC dimension of Θ_H. (A suitable upper bound is any upper bound on the sample complexity of a consistent learning algorithm for Θ_H, such as the expression of Theorem 8 with d = VCdim(Θ_H).) This quantity VCdim(Θ_H) is known as the graph dimension of H (Natarajan (1989)) and is denoted Gdim(H). Clearly, when Y = {0, 1}, the graph dimension and the VC dimension coincide. We remark that the idea of stochastic concept can be extended to 'stochastic function'. Indeed, the distribution P discussed above is an example of a stochastic function. More generally, the analysis presented earlier for (standard) stochastic concepts extends to stochastic functions; see Anthony and Shawe-Taylor (1990) and Buescher and Kumar (1992), for example, and later in this section, where a more general framework is described. We see from this discussion that if Gdim(H) is finite then H is pac learnable by any consistent algorithm. It is natural to ask whether finite graph dimension is a necessary condition for learnability in this generalised model. However, Natarajan showed this not to be the case: there are pac learnable function spaces with infinite graph dimension (see Natarajan (1991)). Natarajan finds a weaker necessary condition for learnability, showing that a certain measure, now known as the Natarajan dimension, must be finite for H to be pac learnable. More recently, Ben-David, Cesa-Bianchi and Long (1992) have shown that when Y is finite, the finiteness of the graph dimension is a necessary and sufficient condition for H to be pac learnable. Furthermore, they show that the Natarajan dimension is finite if and only if the graph dimension is finite, so that Natarajan's necessary and sufficient conditions are matching. In fact, they obtain a 'meta-result', characterising those measures of dimension which themselves characterise learnability in the case of finite Y. The graph dimension has been applied to obtain bounds on the required sample size for learning in artificial neural networks. Natarajan (1989) obtained a result bounding the graph dimension of a linear threshold network (not necessarily a feedforward one) with any number of output nodes and with {0, 1}-valued inputs. For feedforward linear threshold networks with real inputs, Shawe-Taylor and Anthony (1991) (see also Anthony and Shawe-Taylor (1990)) generalised the result of Baum and Haussler (1989) presented in Theorem 12. Specifically, they showed that if a feedforward linear threshold network has any number k of output nodes then the graph dimension of the space of {0, 1}^k-valued functions it computes is at most 2W log(ez), where z, W are, as earlier, the number of computation nodes and the number of variable weights and thresholds. Thus, the same upper bound on sample size as presented in Theorem 13 holds for this more general case.
8.4 The pseudo-dimension
We have seen that the graph dimension can be used to measure the expressive power of a hypothesis space of functions, in somewhat the same way as the VC dimension is used for {0, 1}-valued hypothesis spaces. But there are other such measures, as discussed by Haussler (1992) and Ben-David, Cesa-Bianchi and Long (1992), for example. In passing, we have already mentioned the Natarajan dimension. We now introduce a very useful dimension, known as the pseudo-dimension. This was introduced by Pollard (1984) and is defined whenever the set of functions maps into Y ⊆ R. (More generally, it may be defined when Y is any totally ordered set, but this shall not concern us here.) Let H be a set of functions from X to R. For any x = (x_1, x_2, ..., x_m) ∈ X^m, and for h ∈ H, let h(x) = (h(x_1), h(x_2), ..., h(x_m)) and let H(x) = {h(x) : h ∈ H}. We say that x is pseudo-shattered by H if some translate r + H(x) of H(x) intersects all orthants of R^m. In other words, x is pseudo-shattered by H if there are r_1, r_2, ..., r_m ∈ R such that for any b ∈ {0, 1}^m, there is h_b ∈ H with h_b(x_i) ≥ r_i ⟺ b_i = 1. The largest d such that some sample of length d is pseudo-shattered is the pseudo-dimension of H and is denoted by Pdim(H). (When this maximum does not exist, the pseudo-dimension is taken to be infinite.) When Y = {0, 1}, the definition of pseudo-dimension reduces to the VC dimension. Furthermore, when H is a vector space of real functions, then the pseudo-dimension of H is precisely the vector-space dimension of H; see Haussler (1992).
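For a finite class of functions and a fixed finite sample, the pseudo-shattering condition can be checked by brute force. The sketch below is illustrative only: the helper name and the restriction of the search for translates to the values actually achieved are choices made here, not part of the text (the restriction loses nothing for a finite class, since any witnessing translate can be moved up to the nearest achieved values).

```python
from itertools import product

def pseudo_shattered(sample, functions):
    """Brute-force test of pseudo-shattering for a finite set of real-valued
    functions: search for a translate r = (r_1, ..., r_m) such that every sign
    pattern b in {0,1}^m is realised by some function lying above r exactly on
    the coordinates with b_i = 1."""
    values = [sorted({f(x) for f in functions}) for x in sample]
    for r in product(*values):
        patterns = {tuple(int(f(x) >= r_i) for x, r_i in zip(sample, r))
                    for f in functions}
        if len(patterns) == 2 ** len(sample):
            return True
    return False

# Two constant functions pseudo-shatter a single point but not two points.
consts = [lambda x, c=c: c for c in (0, 1)]
print(pseudo_shattered([0.0], consts))       # True
print(pseudo_shattered([0.0, 1.0], consts))  # False
```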
8.5 A framework for learning function spaces
When considering a space H of functions from X to R^k, it seems rather coarse to say that a hypothesis h is erroneous with respect to a target t on example x unless h(x) and t(x) are precisely equal. For example, with a neural network having k real-valued outputs, it is extremely demanding that each of the k outputs be exactly equal to that which the target function would compute. Up to now, this is the definition of error we have used. There are other ways of measuring error, if one is prepared to ask not 'is the output correct?' but 'is the output close?' in some sense. Haussler (1992) has developed a 'decision-theoretic' framework encompassing many ways of measuring error by means of loss functions. We shall describe this framework in a way which also subsumes the discussion on stochastic concepts. First, we need some definitions. A loss function is, for our purposes, a non-negative bounded function l : Y × Y → [0, M] (for some M). Informally, the loss l(y, y') is a measure of how 'bad' the output y is, when the desired output is y'. An example of a loss function is the discrete loss function, defined by l(y, y') = 1 unless y = y', in which case l(y, y') = 0. Another useful loss function is the L¹-loss, which is defined when Y ⊆ R^k. This is given by

l(y, y') = Σ_{i=1}^{k} |y_i − y'_i|.
In both of these examples, the loss function is actually a metric, but there is no need for this. For example, a loss function which is not a metric, and which has been usefully applied by Kearns and Schapire (1990), is the L²-loss or quadratic loss, defined on R^k by

l(y, y') = Σ_{i=1}^{k} (y_i − y'_i)².

There are many other useful loss functions, such as the L∞-loss, the logistic loss and the cross-entropy loss. In order to simplify our discussion here, we shall concentrate largely on the L¹-loss, which seems appropriate when considering artificial neural networks. The reader is referred to the influential paper of Haussler (1992) for far more detailed discussion of the general decision-theoretic approach and its applications. As in our discussion of stochastic concepts, we consider probability distributions P on X × Y. Suppose that l : Y × Y → [0, M] is a particular loss function. For h ∈ H, we define the error of h with respect to P (and l) to be

er_{P,l}(h) = E l(h(x), y),
the expected value of l(h(x), y). When P is the stochastic concept corresponding to a target function t and a probability distribution p on X, then this error is E_p l(h(x), t(x)), the average loss in using h to approximate t. Note that if l is the discrete loss then this is simply the p-probability that h(x) ≠ t(x), which is precisely the measure of error used in the standard pac learning definition. Suppose that a sample s = ((x_1, y_1), ..., (x_m, y_m)) of points from X × Y is given. The observed error (or empirical loss) of h on this sample is

er_{s,l}(h) = (1/m) Σ_{j=1}^{m} l(h(x_j), y_j).
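In code, these losses and the empirical loss above are one-liners. The following sketch assumes the unnormalised, coordinate-wise forms of the L¹ and quadratic losses given above; the function names are illustrative.

```python
def discrete_loss(y, y_prime):
    """Discrete loss: 0 if the outputs agree exactly, 1 otherwise."""
    return 0.0 if y == y_prime else 1.0

def l1_loss(y, y_prime):
    """L1-loss on R^k: sum of coordinate-wise absolute differences."""
    return sum(abs(a - b) for a, b in zip(y, y_prime))

def quadratic_loss(y, y_prime):
    """L2-loss (quadratic loss) on R^k."""
    return sum((a - b) ** 2 for a, b in zip(y, y_prime))

def observed_error(h, sample, loss):
    """Empirical loss er_{s,l}(h): average loss of h over the labelled sample s."""
    return sum(loss(h(x), y) for x, y in sample) / len(sample)

# Example: a hypothesis with two real outputs evaluated under the L1-loss.
sample = [((1.0,), (0.5, 0.5)), ((2.0,), (1.0, 1.0))]
h = lambda x: (0.4 * x[0], 0.6 * x[0])
print(observed_error(h, sample, l1_loss))
```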
The aim of learning in this context is to find, on the basis of a 'large enough' sample s, some L(s) ∈ H which has close to optimal error with respect to P; specifically, if δ, ε > 0 and if, as in our discussion of probably approximately optimal learning, opt_H(P) = inf{er_{P,l}(h) : h ∈ H}, then we should like to have

er_{P,l}(L(s)) < opt_H(P) + ε,

with probability at least 1 − δ. As before, 'large enough' means at least m0(δ, ε), where this is independent of P. As for the standard pac model and the stochastic pac model described earlier, this can be guaranteed provided we have a 'uniform convergence of errors' property. Extending the earlier definition, we say that a hypothesis space H of functions from X to Y has the uniform convergence of errors (UCE) property if for
0 < δ, ε < 1, there is a positive integer m0(δ, ε) such that, for any probability distribution P on X × Y,

P^m({s | for all h ∈ H, |er_{P,l}(h) − er_{s,l}(h)| < ε}) > 1 − δ.
If this is the case, then a learning algorithm which outputs a hypothesis minimising the observed error will be a probably approximately optimal learning algorithm; see Haussler (1992) for further discussion. We should note that minimisation of observed error is not necessarily the 'simplest' way in which to produce a near-optimal hypothesis. Buescher (1992) has obtained interesting results along these lines.

8.6 The capacity of a function space
An approach to ensuring that a space of functions has the UCE property, which is described in Haussler (1992) and which follows Dudley (1984), is to use the notion of the capacity of a function space. For simplicity, we shall focus here only on the cases in which Y is a bounded subset of some R^k, Y ⊆ [0, M]^k, and we shall use the L¹-loss function, which from now on will be denoted simply by l. Observe that the loss function maps into [0, M] in this case. We first need the notion of an ε-cover of a subset of a pseudo-metric space. A pseudo-metric d on a set A is a function from A × A to R such that

d(a, b) = d(b, a) ≥ 0,  d(a, a) = 0,  d(a, b) ≤ d(a, c) + d(c, b)

for all a, b, c ∈ A. An ε-cover for a subset W of A is a subset S of A such that for every w ∈ W, there is some s ∈ S such that d(w, s) ≤ ε. W is said to be totally bounded if it has an ε-cover for all ε > 0. When W is totally bounded, we denote by N(ε, W, d) the size of the smallest ε-cover for W. To apply this to learning theory, suppose that H maps X into [0, M]^k and that p is a probability distribution on X. Define the pseudo-metric d_p on H by d_p(f, g) = E_p l(f(x), g(x)). We shall define the ε-capacity of H to be

C_H(ε) = sup_p N(ε, H, d_p),
where the supremum is taken over all probability distributions p on X. If there is no finite ε-cover for some p, or if the supremum does not exist, we say that the ε-capacity is infinite. The definition just given is not quite the same as the definition given by Haussler; here, we take a slightly more direct approach because we are not aiming for the full generality of Haussler's analysis. Results of Haussler (1992) and Pollard (1984) provide the following uniform bound on the rate of convergence of observed errors to actual errors.

Theorem 17 With the notation of this section, if P is any probability distribution on S = X × Y, then

P^m({s | there is h ∈ H with |er_{P,l}(h) − er_{s,l}(h)| > ε}) < 4 C_H(ε/16) exp(−ε²m/64M²)

for all 0 < ε < 1. □
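The quantity N(ε, H, d_p) is easy to approximate for a finite class of functions under an empirical distribution. The sketch below (illustrative names; a greedy construction rather than a smallest cover) produces an ε-cover under the empirical L¹ pseudo-metric, giving a crude computable stand-in for the capacity on a given sample.

```python
def l1_distance(f, g, points):
    """Empirical L1 pseudo-metric between two real-valued functions: the average
    of |f(x) - g(x)| over a finite set of points (the empirical distribution)."""
    return sum(abs(f(x) - g(x)) for x in points) / len(points)

def greedy_epsilon_cover(functions, points, eps):
    """Greedy epsilon-cover of a finite class under the empirical L1 pseudo-metric:
    every function in the class ends up within eps of some member of the cover."""
    cover = []
    for f in functions:
        if all(l1_distance(f, g, points) > eps for g in cover):
            cover.append(f)
    return cover

# Example: a few affine functions a*x + b on [0, 1], sampled at five points.
fns = [lambda x, a=a, b=b: a * x + b
       for a in (0.0, 0.5, 1.0) for b in (0.0, 0.5, 1.0)]
pts = [0.0, 0.25, 0.5, 0.75, 1.0]
print(len(greedy_epsilon_cover(fns, pts, 0.3)))   # size of one epsilon-cover
```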
When k = 1 and H maps into [0, M], the capacity can be related to the pseudo-dimension of H. Haussler (1992) (see also Pollard (1984)) showed that if d = Pdim(H) then

C_H(ε) ≤ 2 ( (2eM/ε) ln(2eM/ε) )^d.
This, combined with the above result, shows that, in this case, H has the UCE property and that the sufficient sample size is of order

(M²/ε²) (d log(M/ε) + log(1/δ)).
Thus, if H is a space of real functions and Pdim(H) is finite, then the learning algorithm which outputs the hypothesis with minimum observed error is a probably approximately optimal learning algorithm with sample complexity m0(δ, ε). In a sense, then, for pac learning hypothesis spaces of real functions, the pseudo-dimension takes on a rôle analogous to that taken by the VC dimension for standard pac learning problems.
8.7 Applications to artificial neural networks
We now illustrate how these results have been applied to certain standard types of artificial neural network. We shall consider here the feedforward 'sigmoid' networks. In his paper, Haussler (1992) shows how the general framework and results can also be applied to radial basis function networks and networks composed of product units. Referring back to our definition of a feedforward network, we assumed at that point that each activation function was a linear threshold function. Suppose instead that each activation function f_r is a 'smooth' bounded monotone function. In particular, suppose that f_r takes values in a bounded interval [α, β] and that it is differentiable on R with bounded derivative, |f_r'(x)| ≤ B for all x. (We shall call such a function a sigmoid.) The standard example of such a function is

f(x) = 1 / (1 + e^{−(x−θ)}),

where θ is known as the threshold, and is adjustable. This type of sigmoid function, which we shall call a standard sigmoid, takes values in (0, 1) and has derivative bounded by 1/4. By proving some 'composition' results on the capacity of function spaces and by making use of the pseudo-dimension and its relationship to capacity for real-valued function spaces, Haussler obtained bounds on the capacity of feedforward artificial neural networks with general sigmoid activation functions. It is not possible to provide all the details here; we refer the reader to his paper. Before stating the next result, we need a further definition. The depth of a particular computation node is the number of arcs in the longest directed path from an input node to the node. The depth of the network is the largest depth of any computation node in the network. We have the following special case of a result of Haussler (1992).
Theorem 18 Suppose that (N, A) is a feedforward sigmoid network of depth d, with z computation nodes, n input nodes, any number of output nodes, and W adjustable weights and thresholds. Let Δ be the maximum in-degree of a computation node. Suppose that each activation function maps into the interval [α, β]. Let H be the set of functions computable by the network on inputs from [α, β]^n when the variable weights are constrained to be at most V in absolute value. Then for 0 < ε ≤ β − α,
C_H(ε) ≤ ( 2e(β − α)d(ΔVB)^{d−1} / ε )^{2W},
where B is a bound on the absolute values of the derivatives of the activation functions. Further, for fixed V, there is a constant K such that for any probability distribution P on X × R^k, the following holds: provided

m ≥ (K/ε²) ( W log(1/ε) + log(1/δ) ),
then with P^m-probability at least 1 − δ, a sample s from (X × R^k)^m satisfies |er_{s,l}(h) − er_{P,l}(h)| < ε for all h ∈ H. Moreover, there is K1 such that if
m ≥ (K1/ε) ( W log(1/ε) + log(1/δ) )
and er_{s,l}(h) = 0 then, with probability at least 1 − δ, er_{P,l}(h) < ε. □
This result shows that the space of functions computed by a certain type of sigmoid network has the UCE property. It provides an upper bound on the order of sample size which should be used in order to be confident that the observed error is close to the actual error. In particular, therefore, it follows that a learning algorithm which minimises observed error is probably approximately optimal, with sample complexity bounded as in the theorem. The presence of the bound B on the absolute values of the derivatives of the activation functions means that this theorem does not apply to linear threshold networks, where the activation functions are not differentiable. Nonetheless, the sample size bounds are similar to those obtained for linear threshold networks. Furthermore, in the theorem, there is assumed to be some uniform upper bound V on the maximum magnitude of the weights. Recently, Macintyre and Sontag (1993) have shown that if every activation function is the standard sigmoid function and if there is one output node, then such a bound is not necessary. They show that the set of functions computed by a standard
sigmoid network with unrestricted weights (and on unrestricted real inputs) has finite pseudo-dimension.
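For concreteness, here is a small sketch of the standard sigmoid and of a forward pass through a feedforward sigmoid network; the layered representation and the particular weight values are illustrative assumptions, not taken from the text.

```python
import math

def standard_sigmoid(x, theta=0.0):
    """Standard sigmoid with adjustable threshold theta: values in (0, 1) and
    derivative bounded by 1/4 (assuming the usual logistic form)."""
    return 1.0 / (1.0 + math.exp(-(x - theta)))

def forward(layers, x):
    """Forward pass through a feedforward sigmoid network.  Each layer is a list
    of (weights, theta) pairs, one pair per computation node in that layer."""
    for layer in layers:
        x = [standard_sigmoid(sum(w * xi for w, xi in zip(weights, x)), theta)
             for weights, theta in layer]
    return x

# Example: 2 inputs, a hidden layer of two sigmoid nodes, one output node.
net = [[([1.0, -2.0], 0.5), ([0.5, 0.5], 0.0)],
       [([2.0, -1.0], 0.0)]]
print(forward(net, [0.3, 0.7]))
```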
We remark that the results presented concerning sigmoid neural networks are upper-bound results. One cannot easily give lower bounds on the sample size as for standard pac learning. One reason for this is that, although pac learnable function spaces from X to finite Y can be characterised (as in Ben-David, Cesa-Bianchi and Long (1992)), no matching necessary and sufficient conditions are known for the more general problem of pac learning when Y is infinite. In other words, it is an open problem to determine a single parameter which quantifies precisely the learning capabilities of a general function space.
REFERENCES

Angluin (1988): D. Angluin, Queries and concept learning. Machine Learning, 2(4): 319-342.
Angluin (1992): D. Angluin, Computational learning theory: survey and selected bibliography, in Proceedings of the Twenty-Fourth Annual ACM Symposium on the Theory of Computing.
Angluin, Frazier and Pitt (1990): D. Angluin, M. Frazier and L. Pitt, Learning conjunctions of Horn clauses. In Proceedings of the Thirty-First IEEE Symposium on Foundations of Computer Science, IEEE Computer Society Press, Washington DC. (See also: Machine Learning, 9 (2-3), 1992: 147-164.)
Angluin and Laird (1988): D. Angluin and P. Laird, Learning from noisy examples, Machine Learning, 2: 343-370.
Anthony and Biggs (1992): M. Anthony and N. Biggs, Computational Learning Theory: an Introduction, Cambridge University Press.
Anthony, Biggs and Shawe-Taylor (1990): M. Anthony, N. Biggs and J. Shawe-Taylor, The learnability of formal concepts. In Proceedings of the Third Workshop on Computational Learning Theory. Morgan Kaufmann, San Mateo, CA.
Anthony and Shawe-Taylor (1990): M. Anthony and J. Shawe-Taylor, A result of Vapnik with applications, Technical report CSD-TR-628, Royal Holloway and Bedford New College, University of London. To appear, Discrete Applied Mathematics.
Bartlett (1992): P.L. Bartlett, Lower bounds on the Vapnik-Chervonenkis Dimension of multi-layer threshold networks. Technical report IML92/3, Intelligent Machines Laboratory, Department of Electrical Engineering and Computer Engineering, University of Queensland, Qld 4072, Australia, September 1992.
Baum (1990): E.B. Baum, Polynomial time algorithms for learning neural nets. In Proceedings of the Third Workshop on Computational Learning Theory. Morgan Kaufmann, San Mateo, CA.
Baum (1991): E.B. Baum, Neural net algorithms that learn in polynomial time from examples and queries, IEEE Transactions on Neural Networks, 2: 5-19.
Baum and Haussler (1989): E.B. Baum and D. Haussler, What size net gives valid generalization? Neural Computation, 1: 151-160.
Ben-David, Benedek and Mansour (1989): S. Ben-David, G. Benedek and Y. Mansour, A parameterization scheme for classifying models of learnability. In Proceedings of the Second Workshop on Computational Learning Theory. Morgan Kaufmann, San Mateo, CA.
Ben-David, Cesa-Bianchi and Long (1992): S. Ben-David, N. Cesa-Bianchi and P. Long, Characterizations of learnability for classes of {0, ..., n}-valued functions. In Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory, ACM Press, New York.
Benedek and Itai (1988): G. Benedek and A. Itai, Learnability by fixed distributions. In Proceedings of the 1988 Workshop on Computational Learning Theory. Morgan Kaufmann, San Mateo, CA.
Benedek and Itai (1991): G. Benedek and A. Itai, Learnability with respect to fixed distributions, Theoretical Computer Science 86 (2): 377-389.
Benedek and Itai (1992): G. Benedek and A. Itai, Dominating distributions and learnability. In Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory, ACM Press, New York.
Bertoni et al. (1992): A. Bertoni, P. Campadelli, A. Morpurgo, S. Panizza, Polynomial uniform convergence and polynomial sample learnability. In Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory, ACM Press, New York.
Billingsley (1986): P. Billingsley, Probability and Measure, Wiley, New York.
Blum and Rivest (1988): A. Blum and R.L. Rivest, Training a 3-node neural network is NP-complete. In Proceedings of the 1988 Workshop on Computational Learning Theory. Morgan Kaufmann, San Mateo, CA. (See also: Neural Networks, 5 (1), 1992: 117-127.)
Blumer et al. (1989): A. Blumer, A. Ehrenfeucht, D. Haussler and M. Warmuth, Learnability and the Vapnik-Chervonenkis Dimension. Journal of the ACM, 36(4): 929-965.
Buescher (1992): K.L. Buescher, Learning and smooth simultaneous estimation of errors based on empirical data (PhD thesis), Report UILU-ENG-92-2246, DC-144, Coordinated Science Laboratory, University of Illinois at Urbana-Champaign.
Buescher and Kumar (1992): K.L. Buescher and P.R. Kumar, Learning stochastic functions by smooth simultaneous estimation. In Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory, ACM Press, New York.
Cormen, Leiserson and Rivest (1990): T.H. Cormen, C.E. Leiserson, R.L. Rivest, Introduction to Algorithms. MIT Press, Cambridge, MA.
Dudley (1984): R.M. Dudley, A course on empirical processes. Lecture Notes in Mathematics, 1097: 2-142. Springer Verlag, New York.
Ehrenfeucht et al. (1989): A. Ehrenfeucht, D. Haussler, M. Kearns and L. Valiant, A general lower bound on the number of examples needed for learning. Information and Computation, 82 (3): 247-261.
Garey and Johnson (1979): M. Garey and D. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness. Freeman, San Francisco.
Grunbaum (1967): B. Grunbaum, Convex Polytopes. John Wiley, London.
Haussler (1992): D. Haussler, Decision theoretic generalizations of the pac model for neural net and other learning applications, Information and Computation, 100: 78-150.
Haussler et al. (1988): D. Haussler, M. Kearns, N. Littlestone and M. Warmuth, Equivalence of models for polynomial learnability. In Proceedings of the 1988 Workshop on Computational Learning Theory. Morgan Kaufmann, San Mateo, CA. (Also, Information and Computation, 95 (2), 1991: 129-161.)
Haussler and Welzl (1987): D. Haussler and E. Welzl, Epsilon-nets and simplex range queries. Discrete & Computational Geometry, 2: 127-151.
Judd (1988): J.S. Judd, Learning in neural networks. In Proceedings of the 1988 Workshop on Computational Learning Theory. Morgan Kaufmann, San Mateo, CA.
Kearns and Schapire (1990): M. Kearns and R. Schapire, Efficient distribution-free learning of probabilistic concepts. In Proceedings of the Thirty-First IEEE Symposium on Foundations of Computer Science, IEEE Computer Society Press, Washington DC.
Kharitonov (1993): M. Kharitonov, Cryptographic hardness of distribution specific learning. To appear, Proceedings of the Twenty-Fifth Annual ACM Symposium on the Theory of Computing, 1993.
Li and Vitanyi (1989): M. Li and P. Vitanyi, A theory of learning simple concepts under simple distributions and average case complexity for the universal distribution. In Proceedings of the Thirtieth IEEE Symposium on Foundations of Computer Science, IEEE Computer Society Press, Washington DC.
Linial, Mansour and Nisan (1989): N. Linial, Y. Mansour and N. Nisan, Constant depth circuits, Fourier transforms, and learnability. In Proceedings of the Thirtieth IEEE Symposium on Foundations of Computer Science, IEEE Computer Society Press, Washington DC.
Maass (1992): W. Maass, Bounds for the computational power and learning complexity of Analog Neural Nets. Manuscript, Institute for Theoretical Computer Science, Technische Universitaet Graz, Austria, 1992. To appear, Proceedings of the Twenty-Fifth Annual ACM Symposium on the Theory of Computing, 1993.
Maass and Turan (1990): W. Maass and G. Turan, On the complexity of learning from counterexamples and membership queries. In Proceedings of the Thirty-First IEEE Symposium on Foundations of Computer Science, IEEE Computer Society Press, Washington DC.
Maass and Turan (1992): W. Maass and G. Turan, Lower bound methods and separation results for on-line learning models, Machine Learning 9 (2-3): 107-145.
Macintyre and Sontag (1993): A. Macintyre and E.D. Sontag, Finiteness results for
sigmoidal "neural" networks (extended abstract). To appear, Proceedings of the Twenty-Fifth Annual ACM Symposium on the Theory of Computing.
Minsky and Papert (1969): M. Minsky and S. Papert, Perceptrons. MIT Press, Cambridge, MA. (Expanded edition 1988.)
Natarajan (1989): B.K. Natarajan, On learning sets and functions. Machine Learning, 4: 67-97.
Natarajan (1991): B.K. Natarajan, Machine Learning: A Theoretical Approach, Morgan Kaufmann.
Pitt and Valiant (1988): L. Pitt and L.G. Valiant, Computational limitations on learning from examples. Journal of the ACM, 35 (4): 965-984.
Pollard (1984): D. Pollard, Convergence of Stochastic Processes, Springer-Verlag, New York.
Rosenblatt (1959): F. Rosenblatt, Two theorems of statistical separability in the perceptron. In Mechanisation of Thought Processes: Proceedings of a Symposium Held at the National Physical Laboratory, November 1958. Vol. 1. HM Stationery Office, London.
Sauer (1972): N. Sauer, On the density of families of sets, Journal of Combinatorial Theory (A), 13: 145-147.
Shawe-Taylor and Anthony (1991): J. Shawe-Taylor and M. Anthony, Sample sizes for multiple output threshold networks, Network 2: 107-117.
Shawe-Taylor, Anthony and Biggs (1993): J. Shawe-Taylor, M. Anthony and N. Biggs, Bounding sample size with the Vapnik-Chervonenkis dimension. To appear, Discrete Applied Mathematics, Vol. 41.
Valiant (1984a): L.G. Valiant, A theory of the learnable. Communications of the ACM, 27 (11): 1134-1142.
Valiant (1984b): L.G. Valiant, Deductive learning. Philosophical Transactions of the Royal Society of London A, 312: 441-446.
Vapnik (1982): V.N. Vapnik, Estimation of Dependences Based on Empirical Data. Springer Verlag, New York.
Vapnik and Chervonenkis (1971): V.N. Vapnik and A.Ya. Chervonenkis, On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16 (2), 264-280.
Vapnik and Chervonenkis (1981): V.N. Vapnik and A.Ya. Chervonenkis, Necessary and sufficient conditions for the uniform convergence of means to their expectations. Theory of Probability and its Applications, 26 (3), 532-553.
Wilf (1986): H.S. Wilf, Algorithms and Complexity. Prentice-Hall, New Jersey.
Mathematical Approaches to Neural Networks
J.G. Taylor (Editor)
© 1993 Elsevier Science Publishers B.V. All rights reserved
Time-summating network approach

Paul C. Bressloff

GEC-Marconi Ltd., Hirst Research Centre, East Lane, Wembley, Middx. HA9 7PP, United Kingdom.
Abstract
A review of the dynamical and computational properties of time-summating neural networks is presented.

1 INTRODUCTION
The formal neuron used in most artificial neural networks is based on a very simple model of a real neuron due to McCulloch and Pitts [1]. In this model, the output of the neuron is binary-valued, indicating whether or not its activation state exceeds some threshold, and the activation state at any discrete time is equal to the linear sum of inputs to the neuron at the previous time-step. (We shall refer to such a neuron as a standard binary neuron.) The simplicity of networks of these neurons has allowed many analytical and numerical results to be obtained. In particular, statistical-mechanical techniques, which exploit the analogy between standard binary networks and magnetic spin systems, have been applied extensively to the study of the collective behaviour of large networks [2,3]. Moreover, Gardner [4] has developed statistical-mechanical methods to analyse the space of connection weights between neurons and thus determine quantities such as the optimal capacity for the classification and storage of random, static patterns. However, networks of standard binary neurons tend to be rather limited in terms of (i) the efficiency with which they can process temporal sequences of patterns, and (ii) the range of dynamical behaviour that they can exhibit. These limitations are related to the fact that there is no memory of inputs to the neuron beyond a single time-step. A simple way to incorporate such a memory is to take the activation state of each neuron to be a slowly decaying function of time - a time-summating binary neuron. In this paper, we consider the consequences of this simple modification of the McCulloch-Pitts model for the deterministic dynamics (section 2), stochastic dynamics (section 3) and temporal sequence
64
processing abilities (section 4) of neural networks. One of the interesting features of time-summating neurons is that they incorporate, albeit in simplified form, an important temporal aspect of the process by which real neurons integrate their inputs. In section 5. we describe a n extension of the time-summating model that takes into account spatial aspects of this process such as the geometry of the dendritic tree and soma. 2
DETERMINISTIC DYNAMICS
Consider a fully-connected network of N standard binary-threshold neurons [I,51 and denote the output of neuron i, i = 1,...,N, at the mth time step by ai(m) E {0,1].The binary-valued output indicates whether or not the neuron has fired at time m. The neurons are connected by weights wij t h a t determine the size of an input to neuron i arising from the firing of neuron j. In this simple model, the activation state of neuron i at time m is equal to the linear sum of all the inputs received at the previous time-step,
j+i
where Ii denotes some fixed external input. Each neuron fires whenever its activation state exceeds a threshold hi, ai(m) = e(Vi(m)- hi)
(2.2)
4
where e(x) = 1ifx 20 and e(x) = 0 if x < 0. Note that the external inputs may be absorbed into the thresholds hi. Equations (2.1) and (2.2) determine the dynamics on the discrete space of binary outputs (0,1IN. (Unless otherwise stated, we shall assume throughout that the neurons update their states in parallel). The number of possible states of the network is finite, equal t o gN. Therefore, i n the absence of noise, there is a unique transition from one state to the next and the long-term behaviour is cyclic. This follows from the fact that a finitestate system must return to a state previously visited after a finite number of time-steps ( S 2N). Hence, the dynamics is restricted to attracting cycles consisting of simple sequences of states, i.e., a given state only occurs once per cycle. Complex sequences, on the other hand, contain repeated states so that there is an ambiguity a s to which is the successor of each of these states (see Figure 1);such ambiguities cannot be resolved by a standard binary network. From a computational viewpoint, if each attracting cycle is interpreted as a stored temporal sequence of patterns, then there are severe
65
limitations on the range of sequences that can be stored. B
B
A
D ABCD
D
ABCAD
Figure 1. Example illustrating the difference between a simple sequence
ABCD ... and a complex sequence ABCAD... In the latter case there is an ambiguity concerning the successor of state A. To allow the occurrence of complex sequences, it is necessary to introduce some memory of previous inputs that extends beyond a single time-step. The simplest way to achieve this is to modify the network at the single neuron level by taking the activation state to be a slowly decaying function of time with decay rate ki < 1,say. Equation (2.1)then becomes
(2.3)
We shall refer to a formal neuron satisfying equations (2.2)and (2.3)as a time-summating, binary neuron. The decay term kiVi(m-1) may be viewed as a positive feedback along a delay line of weight ki (see Figure 2); this should be distinguished from models in which the output of the neuron is fed back rather than the value of the activation state. Note that the decay term incorporates a n important temporal feature of biological neurons [6]there is a persistence of cell activity over extended periods due to the leakyintegrator characteristics of the cell surface. If we interpret the activation state of our formal neuron a s a mean soma potential then, crudely speaking, we may relate the decay rate ki to the electrical properties of the and Ci are the leakage capacitance cell surface, ki = exp(-l/RiCi), where and resistance respectively. (See also section 5).Recent neurophysiological evidence suggests that the time constant of certain cortical neurons is of the
66
order of hundreds of milliseconds [7].Since a single time-step corresponds to a few milliseconds, i.e. a refractory period, the decay rate ki could be close to unity.
delay
aj(m - 1)
ij
summation
threshold
Figure 2. A time-summating, binary-threshold neuron. It follows from equation (2.3)that the activation state depends on the previous history of inputs. Assuming that Vi(0) = 0, we have
Such a n activity trace allows a network of time-summating neurons to resolve the ambiguities arising from complex sequences, provided that incoming activity is held over a long enough period [8,9]. Moreover, a timesummating network can be trained t o store such sequences using perceptron-like learning algorithms that are guaranteed t o converge to a solution set of weights if one exists [S]. As in the case of standard binary networks [41, statistical-mechanical techniques may be used to analyse the performance of a time-summating network in the thermodynamic limit 18111. One of the features that emerges from such a n analysis is the nontrivial contribution from intrinsic temporal correlations that are set up between the activation states of the neurons due to the presence of activity traces. (We shall consider temporal sequence processing in section 4). Another important difference between time-summating and standard networks is that the former can display complex dynamics, including frequency-locking and chaos, a t both the single neuron and network levels [12-161. There is a great deal of current interest in the behaviour of networks of oscillatory and chaotic elements. For example, recent neurophysiological experiments [171, [18] suggest t h a t the phasesynchronisation and desynchronisation of neuronal firing patterns could be
67
used to determine whether or not activated features have been stimulated by a single object. This process would avoid the combinatorial explosion associated with the use of “grandmother cells” (the binding problem [191). Time-summating networks provide a discrete-time framework for studying such phenomena. k
I
summation
threshold n U
-W
delay
Figure 3.A time-summating neuron with inhibitory feedback. To illustrate the above, consider a single time-summating neuron with fixed external input I and inhibitory feedback whose activation state evolves according to the Nagumo-Sato equation [201 V(m) = F(V(m - 1))= [kV(m - 1)- wa(m - 1) + I]
(2.5)
where a(m) = B(V(m)-h). The operation of the neuron is shown in Figure 3. We shall assume t h a t the feedback is mediated by a n inhibitory interneuron that fires whenever the excitatory neuron fires. (A more detailed model that takes into account the dynamics of the interneuron essentially displays the same behaviour. Note that the coupling of a n excitatory neuron with an inhibitory neuron using delay connections forms the basic oscillatory element of a continuous time model used to study stimulus-induced phase synchronisation in oscillator networks [21,221). The map F of equation (2.5) is piecewise linear with a single discontinuity a t V = 0, a s shown in Figure 4. Assuming that w > 0 (inhibitory feedback) and 0 c I c w, then all trajectories converge to the interval Z = W-,V+] where V- = I - w and V+ = I. (For values of I outside [O,wl, the dynamics is trivial).The dynamics on Z has been analysed in detail elsewhere [13,23,241.In particular, the map F is equivalent to a circle map with a discontinuity a t V = V+. Such a circle map is obtained by imposing the equivalence relation on C given by V(m>eV(m) + V, - V- E
68
51. The activation state may then be viewed as a phase variable. To describe the behaviour on Z it is useful t o introduce the average firing-rate (2.6) (assuming that the limit exists), where B(FW))is the output of the neuron
at time n given the initial state V. In terms of the equivalent circle map description, p(V) is a rotation number. Y
R
I-------
/
-----
I I
/
I
I
I
/
/
I
/Jf
I
Figure 4. Map F describing the dynamics of a time-summating neuron with inhibitory feedback; for w > 0 and 0 < I < w all trajectories converge t o the bounded interval Z = CV-, V+1. It can be shown that the average firing-rate is independent of the initial point V, pW) = p. and that the dynamics is either periodic o r quasiperiodic depending on whether i?i is a rational o r irrational number. Moreover, as a function of the external input I, i5 forms a “devil’s staircase” [23]. That is, p is a continuous, monotonic function of I which assumes rational values on non-empty intervals of I and is irrational on a Cantor set of I. If j5 is ra€ional, j5 = plq, then there is a periodic orbit of period q which is globally attracting. On the other hand, when 5.j is irrational there are no periodic points and the attractor is a Cantor set [24]. Note that in the limit k + 0 (standard binary neuron), the devil’s staircase structure disappears,
69
and the neuron simply becomes a bistable element alternating between its on and off states, i.e. = 1/2 independently of I.
Figure 5. The map Fp for p = 25.0 with two critical points at V = f V* and an unstable fixed point at V = Vo. Another interesting feature of the above model is that, with slight modifications, chaotic dynamics can occur leading, amongst other things, to a break up of the devil’s staircase structure [12,13].For example, suppose that we replace the step function in equation (2.5)by a sigmoid of gain y so that [12] V(m) = Fy(V(m- 1)) = kV(m - 1)- 1 + e-uV(m-l) +I
(2.7)
We shall briefly discuss the dynamics of Fy as a function of the external I < w. Then Fy has two critical input I. Assume that K s wy/2k -1 pointa at fV*,where fl*= 10d~ f as shown in Figure 5.There is also a fixed point, denoted V = Vo, which lies in the interval [-V*,V*]. For y >> 1(high gain) there exists a range of values of I for which the fixed point is unstable and all trajectories converge to the interval R = Fy(v*), on which the dynamics is either periodic o r chaotic. The chaotic dynamics arises from the fact that for y >> 1 the negative gradient
70
branch of the graph of Fr has an average slope of modulus greater than unity, which can lead to a positive Liapunov exponent h(V(0))where 1251
(2.8)
We note that the circle map equivalent to Fr is nonmonotonic, which is a well known scenario for chaotic dynamics as exemplified by the sine circle map x + F(x) = x + a + ksin(2xxY2x (mod 1) [261. Recently, a network of coupled circle maps has been used to model the synchronisation and desynchronisation of neuronal activity patterns [IS]. The basic idea is to associate with each neuron a phase variable 8; and a n activity si = 0,l. In terms of the time-summating model, si = 1 indicates that the value of the external input t o the ith neuron lies within the parameter regime for which the neuron operates as a n oscillator (active mode), with Vi interpreted as the phase variable Bi. (If I < 0 in equation (2.5) or (2.7)then Vi converges t o a stable fixed point corresponding to a passive mode of the neuron and s i = 0). The dynamics of the phases for all active neurons is taken to be [161 1
Bi (m + 1) = -[F(ei 1+E
(m))+ &F(cpj (m))]
(2.9)
where F is a circle map such as the sine map or the one equivalent to F of equation (2.71, and
(2.10)
Suppose that all neurons are coupled such that wij = w for all i, j, i f j. F o r large N, the stability of the strongly correlated state (ei(m) = 8$m) for all i j ) is determined completely by the properties of the underlying onedimensional map F [16]. To show this, define 60i(m) = 8i(m) - cp(m), where cp(m) is the average phase of the network, N-12;;8i(m). Linear stability analysis then gives (2.11) Using the definition of the Liapunov exponent h (cf. equation (2.8)), it
71
follows that the coherent state is stable provided E > eh - 1. In Ref. [lS], a Hebb-like learning rule is used to organise a network of neurons into strongly coupled groups in which neurons within a group have completely synchronised (chaotic) time series along the lines of the above coherent state, whereas different groups are uncorrelated; each group corresponds to a separate object. The learning rule takes the explicit form
where y determines the learning-rate, h is a "forgetting" term and si = 1if a neuron is activated by an input pattern (object). "he function Q restricts the weights within the interval [a,bl; O(x) = x for x E [a,bl and 0 otherwise. ARer training, if a number of patterns are presented simultaneously to the network, then each of these patterns may be distinguished by the synchronisation of the corresponding subpopulation of active neurons. In this approach, chaos performs two functions. First, it allows separate groups to become rapidly decorrelated in the presence of arbitrarily small differences in initial conditions. Second, it enables a large number of different groups t o be independently synchronised. 3
STOCHASTIC DYNAMICS
It is well known that the most significant source of intrinsic noise in biological neurons arises from random fluctuations in the number of packets of chemical neurotransmitters released a t a synapse on arrival of an action potential [27]. Such noise can be incorporated into a n artificial neural network by taking the connection weights to be discrete random variables independently updated a t every time-step according to fixed probability distributions [28,29,30]. That is, each connection weight has the form w(m) = Eu(m), where I E I is related to post-synaptic efficacy, (the efficiency with which transmitters are absorbed on t o the post-synaptic membrane), sign(&)determines whether the synapse is excitatory o r inhibitory, and u(m) corresponds to the number of packets released a t time m. Following Ref. [31], we shall take the release of chemicals to be governed by a Binomial process. Before discussing the stochastic dynamics of timesummating networks with synaptic noise, it is useful to consider the more familiar case of standard binary networks. 3.1 Standard binary networks. Incorporating synaptic noise into a standard binary neural network leads to the stochastic equations
12
where uij(m) = 0 if aj(m) = 0, whereas uij(m) is generated by a Binomial distribution when aj(m) = 1. Thus, for a given state a(m) = a , the conditional probability that uij(m) = q j is given by
where are constants satisfying 0 I h j I 1and L is the maximum number of packets that can be released a t any one time, (assumed t o be synapseindependent). Note that a random fluctuation qi(m) of the threshold hi has also been included in equation (3.1). We shall take the probability distribution function of qi(m) to be a sigmoid,
where j3-l is a “temperature”parameter. Let p(i I a) be the conditional probability that neuron i fires given that the state of the network a t the previous time-step is a. We may obtain p(ila) by averaging the right-hand side of equation (3.1) over the distributions (3.2)and (3.3).This leads to the result,
(3.4)
since w(V) = 1 - qr(-V) when qt is a sigmoid function. (In the absence of synaptic noise, the conditional probability reduces directly t o that of the Little model [32]). Introducing the probability Pm(a)that the state of the network a t time m is a , we may describe the dynamical evolution of the network in terms of the homogeneous Markov chain
(3.5)
where Qba is the time-independent transition probability of going from
state a to state b in one time-step, and satisfies
N Qba = ~ { b i p ( i l a ) + [ l - b i l [ l - p ( i l a ~ l }
(3.61
i=l
Since the Markov chain generated by equations (3.4)and (3.6)is irreducible when P-l > 0 and hj > 0, (there is a nonzero probability that every state may be reached from every other state in a finite number of time-steps), and assuming that N is finite, we may apply the Perron-Frobenius theorem [32]: If Q is the transition matrix of a finite irreducible Markov chain with period d then (i) the d complex roots of unity, XI, ha = w,...,hd = where w = eanUd, are eigenvalues of Q, and (ii)the remaining eigenvalues hd+l,...,hN satisfy I hj I < 1;(a Markov chain is said t o have period d if, for each state a, the probability of returning to a after m time-steps is zero unless m is an integer multiple of d). For non-zero temperatures, the Markov chain is aperiodic (d = 1) so that there is a nondegenerate eigenvalue of Q satisfying 1.1 = 1,whilst all others lie inside the unit circle. By expanding the solution of equation (3.5) in terms of the generalised eigenvectors of Q, it follows that there is a unique limiting distribution P,(a) such that
(3.7) independently of the initial distribution, where P, is the unique eigenvector of Q with eigenvalue unity. Equation (3.7)implies that timeaverages are independent of initial conditions and may be replaced by ensemble averages over the limiting distribution P,. That is, for any wellbehaved state variable X,
(3.8) Note that in practice time-averages are defined over a finite-time interval T = Tabs. These averages may be replaced by ensemble averages provided ~ the ,maximum relaxation time characterising the rate of that zobs >> T fluctuations of the system. Although techniques have been developed to analyse P, [34], the explicit form for P, tends to be rather complicated except for the special cases in which detailed balance holds. For then there exists some function f
14
such that &baff a) = Qabf(b) and equation (3.5) has the stationary solution P*(a) = f(a&f(a). Since, by the Perron-Frobenius theorem, the limiting distribution is unique, and hence equal t o P*, we obtain the Gibbs distribution
(3.9) a
where H(a) = -P-llogfIa) is an effective Hamiltonian. An example of a network for which detailed balance holds is the Little [321 model with symmetric weights, wij = wj;. In this particular case, fla) = coshp(Cjwij9 hi)[351. One of the consequences of equation (3.7)is that, as i t stands, the network cannot display any long-range order in time since any injection of new information produces fluctuations about the limiting distribution that are then dissipated. Therefore, to operate the network as an associative memory, it is necessary to use one of the following schemes; (a) Introduce an external input I = (11,...,IN) and take the network t o be a continuous mapper [36] in which the limiting distribution P, i s considered as a fimction of I. (See also section 3.2) (b) Take the zero noise limit P - l + 0, Xij + 1, so that equation (3.1) reduces to equations (2.1) and (2.2), with wij = eijL, and the many attracting cycles of the deterministic system emerge. For small but nonzero noise, these cycles will persist for extended lengths of time with the noise inducing transitions between cycles. (c) Take the thermodynamic limit N + leading t o a breaking of the ergodicity condition (3.7). This forms the basis of statistical-mechanical approaches to neural networks [2,31. In contrast to (b), which views noise in terms of its effects on the underlying deterministic system, the statisticalmechanical approach is concerned with emergent properties arising from the collective behaviour of large systems with noise. Such behaviour may be analysed using mean field theory. To discuss the large-N limit, we shall follow the statistical dynamical approach of Ref. [30]. First, on setting u$m) = ilij(m)aj(m),where fii,(m) is generated by a Binomial distribution B(L, Gj), we may rewrite equation (3.1) as
-
with w(m) denoting the set of random parameters (hij(m),h;(m)) and fJa) =
€KZj iiijeija. +qi -hi). We then introduce the notion of a macroscopic variable along the knes of Amari et all [37]: A finite collection of state variables is said to be a closed set of macroscopic variables if there exists a set of functions Qr, r = 1,...,R such that for arbitrary a,
lim var,CX,(f,(a))l
N+m
=0
(3.1lb)
where <...>p and var denote respectively the mean and variance with P respect to the distribution p of o.Equation (3.11)implies that (3.12) Equations (3.11) and (3.12) also hold if a is replaced by the dynamical variable a(m) satisfying (3.10), since the random parameters o(m) are updated independently a t each time-step. Hence, equations (3.10) (3.12) lead to the dynamical mean field equations
-
where q ( m ) = q(a(m)). Equation (3.13) determines the long-term behaviour of the network in the limit N + -. Suppose, for simplicity, that the set (Xr, r = 1,...,R) completely characterises the macroscopic dynamics of the system. Moreover, assume that there exists a number of stationary solutions to (3.13)that are stable fixed points, denoted da). Each such solution satisfies Xy’ = Qr(X(a)) and the eigenvalues hr of the Jacobian Ars = &Dr(X(a))/aXssatisfy the stability criterion I hr I c 1. Assuming that X(0) E A,, where A, is the basin of attraction for X(a),the time-average of X(m) is given by
(3.14)
Broken ergodicity is reflected by the existence of more than one fixed point, since there is then a dependence on initial conditions. Note that broken ergodicity can only occur, strictly speaking, in infinite systems; in a finite system the entire state space is accessible. Hence the limit M + must be taken after the limit N + in equations (3.14).
-
-
76
A simple example of the above is provided by a fully-connected LittleHopfield model [MI, 1381 with threshold noise. Introducing, for convenience, “spin”variables Si = 2% - 1, the network evolves according to the equations (3.15)
with qi generated according t o equation (3.3) and wij is of the “Hebbian” form
(3.16)
sp, sy
for R random, unbiased patterns i.e. = fl with equal probability. For finiteR, a finite set of macroscopic variables satisfying equation (3.11) may be defined in terms of the overlaps
(3.17)
The corresponding dynamical mean field equations are
(3.18) where <<...>> = np(&€,p- 1x2 + + 1)/2),i.e. for large N we may assume that strong self-averaging over the random patterns sp holds. By studying the stability of the fixed points of (3.181, the pattern storage properties of the network may be determined. (The results are identical to those obtained using equilibrium statistical mechanics [39], i.e. the minima of the free energy derived in [39] correspond exactly t o the stable fixed points of equation (3.18) and the Hessian of the free energy - A ,where Apv is the Jacobian of Qp). For example, consider equals solutions in w&h there is only a non-zero overlap with a single memory, Xp = X$1. This is a solution t o (3.18)provided that X = tanhpX, and X f 0 only if T P - l < 1. There are 2R degenerate solutions corresponding to the
$”
R memories EP , and their opposites -E,F. The Jacobian is given by Aclv = tiPvp(l -X2)such that the solutions are always stable for T < l.(See Ref. [391 for a more detailed analysis based on the statistical-mechanical approach). Unfortunately, the statistical dynamics of a fully-connected HopfieldLittle model with parallel dynamics becomes much more complicated when the number of patterns to be stored becomes infinite in the large-N limit, i.e. R = aN.For then one finds that long-time correlations build up leading to a rapidly increasing number of order parameters or macroscopic variables. "his renders exact treatments ineffective after a few time-steps [40]. (Alternatively, one can consider sparsely-connected networks [41,301 in which the number of parameters becomes tractable]). Also note that for more general choices of the weights wij, it is possible that the resulting dynamical mean field equations exhibit periodic and chaotic behaviour [42].
3.2 Time-summatingbinary networks. Introducing synaptic and threshold noise along the lines of section 3.1, the stochastic dynamics of a time-summating network is given by [15]
where a(m) denotes the set of integers uij(m), i, j = 1,...,N, i # j , corresponding to the number of packets released into synapse (ij) a t the mth time-step, and
(3.20)
For a given state V(m) = V,the probability that a(m) = a is
where ly is the sigmoid function of equation (3.3)and hj = 0 for convenience. Let Cl denote the index set {Q,...,L)x, where x is the number of connections in the network. The set F = ((Fa,@a) I Q E Q) defines a random Iterated Function System (IFS)[43]on the space of activation states M c xN.That is, F consists of a finite, indexed set of continuous mappings on a metric space together with a corresponding set of probabilities for choosing one such map per iteration. (It is s f i c i e n t to endow M with the Euclidean metric. Note, however, that the dynamics is independent of the particular metric chosen: the introduction of a metric structure allows certain
78
mathematical results to be proven, and is useful for characterising the geometrical aspects of a system's attractor). The dynamics described by equation (3.19)corresponds to a n orbit of the IFS F. I n other words, a particular trajectory of the dynamics is specified by a particular sequence of events (a(m), m = 0, 1,...I a(m) E a} together with the initial point V(0). An important feature of 3 is that it is a n hyperbolic IFS (using Barnsley's terminology [43]);the affine maps F, of equation (3.20) are contraction mappings on M, i.e. the contraction ratio & of Fa, defined by
(3.22) satisfies 1 , < 1for all a E a. This result holds since the decay factors in equation (3.20)satisfy ki < 1 and h , = k 3 maxi (ki). By the contraction mapping theorem 1431,there exists a unique fixed point p of Fa such that lim, -+ oo (F,Im(v) = p for all V E M. This may be seen immediately using equation (3.201,with Vai = (Ii + Cj i uijqj)/( 1 - yi). The fact that 9 is hyperbolic allows us to apply a number of known results concerning the limiting behaviour of random IFS's [43-451. To proceed, it is convenient to consider the evolution of probability distributions on Mthat is generated by performing a large number of trials and following the resulting ensemble of trajectories. The stochastic dynamics of this ensemble is then described in terms of the sequence of probability measures (b, m = O,l,...) on M, where
(3.23)
is the probability of a trajectory passing through the (Borel) subset A of %f a t time m with P(W = 1. (We cannot assume that the measures pm are Lebesgue and introduce smooth probability densities on M accordingly such that d b ( V ) = p(V)dV. For, as will be made clear below, there is the possibility of fractal-like structures emerging). The sequence of measures {pm) describes a linear Markov process. Introduce the time-independent transition probability %fBI V)that, given V(m) = V at time m, V(m + 1)belongs to the subset B. This is equal t o the probability of choosing a map F E {Fa, a E Q) such that F O E B. Thus
(3.24)
19
where XB is the indicator function defined by XB(V)= 1 if V E B and 0 otherwise. Given a n initial probability measure 10, 4 generates the sequence of measures (pm) according to
(3.25)
Such a sequence then determines the evolution of the output states of the network by projection. That is, the probability Pm(a)that the network has the output configuration a a t time m is given by
However, the sequence (Pm) induced by (pm) does not generally evolve according t o a Markov chain, which reflects the fact that the activation states are functions of all previous output states, see equation (2.4).An exception occurs in the limit k i --f 0, when the projection of equation (3.25) reduces to the Markov chain (3.5). Using the results of Refs. [43,441,i t can be shown [151 that, in the presence of synaptic (kij > 0) and threshold ( P - l > 0) noise, the limiting behaviour of the associated IFS F is characterised by a unique invariant measure +with lim pm = pg
(3.27)
m+-
independently of the initial distribution po. Moreover, pF satisfies the condition [44,45]that, for almost all trajectories, time averages are equal to space averages, M-1
lim
1
M+- M
f (V(m))= I f (V)dpg(V)
m=O
(3.28)
M
for all continuous functions f: M -+ 2 An equivalent result to (3.28)is that the frequency with which a n orbit visits a subset B is p&B), lim
M+-
#(V(m) E B:l 5 m < M) = pg (B) M
(3.29)
From equation (3.26),it follows that equations (3.7)and (3.8)hold with P oo
80
replaced by the projected distribution Pp
(3.30)
One of the interesting features of random IFS's is that the invariant measure (ifit exists) often has a rich fractal-like structure. (This is a major reason why IFS's have attracted interest lately within the context of image generation and data compression [46,47]). We shall illustrate this with the simple example of a single time-summating neuron with inhibitory feedback (Figure 3). Incorporating synaptic (and threshold) noise, the stochastic evolution of the excitatory neuron's membrane potential is given by V(m) = kV(m-1) -eu(m) + I, where u(m) = u with probability p(u) if a(m 1)= e(V(m-1) + q(m-1)) = 1and u(m) = 0 if a(m-1) = 0. For simplicity we shall assume that u is generated according to a Binomial distribution with L = 1, i.e. p(u) = kU ( 1 for some h, 0 < h < 1. Moreover, the probability distribution of the random threshold q(m) is taken t o be sigmoidal, equation (3.3). The dynamics corresponds t o an IFS G consisting of two maps Fo, F1 : W l , VO]+ [Vl, Vol, where VO,J are the fixed points of
0.0
Membrane potential V
Figure 6. The invariant measure of the random IFS consisting of the two maps Fo, F1 with Fo(V) = kV + 1 - k and F1(V) = kV. The associated probabilities are @o&= V2. (a) k = 0.52.
81
Fo,J, with associated probabilities @o, 0 1 such that Fo(V) = kV + I
@OW)= 1 - Qf(v)
FIN) = kV --E+I
= hv(v)
(3.31)
A reasonable approximation of the resulting invariant measure p G may be obtained by plotting a frequency histogram displaying how often an orbit Wm)) visits a particular subinterval of W1,VO].This is a consequence of equation (3.29). Without loss of generality, we consider the high + 1/2 for all V and set E = I = 1- k so temperature limit p + 0 in which that VO = 1,V1 = 0. The invariant measure pG in the case h = 1 (no synaptic noise) has been an object of interest for over 50 years [48] and many of its mathematical properties are still not very well understood. For k < 1/2 the support of pG is a Cantor set. On the other hand, for k 1 1/2 the support of pgis the whole unit interval and for many values of k the measure has a fractal-like structure. In Figure 6 we display the frequency histogram representation of F~ for h = 1 and (a) k = 0.52, (b) k = 0.6, and (c) k = 0.9. It is clear that p becomes progressively smoother as k + 1. (In the 9 presence of synaptic noise, 0 < h < 1, similar behaviour occurs, but the histograms are no longer symmetric about V = 112).
w)
0.0
Membrane potential V
Figure 6 continued. (b)k = 0.6
82
Returning t o equation (3.281,we see that, as in the case of standard binary networks, it is necessary to operate the network according to one of the schemes discussed a t the end of section 3.1. Consider the problem of training a stochastic time-summating network to operate as a continuous mapper (scheme (a) of section 3.1). In this mode of operation, one can formulate learning in terms of the following inverse problem: For fixed threshold noise, decay factors kj and external inputs Ii, there is a family of IFS's !F= ((Fa,Qa) I a E Q) parametrised by the set r = Kqj, Xi$ I i, j, =
l,..,N,i z j); find a set r such that the resulting invariant measure pF is "sufficientlyclose" to some desired measure labelled by the external input I. One of the potential applications of the IF'S formalism is that a number of techniques have been developed for solving the inverse problem, eg. the Collage theorem [43]. These exploit the self-similarity of fractal structures inherent in typical IFS's. It would be interesting to see whether o r not such techniques are practical within the neural network context. (See also the analysis of associative reinforcement learning in terms of IFS's [49]). ~
- ~
.
~
.
_
_
_
_
_
~
_
_
-
I
0.0
Membrane potential V
Figure 6 continued. (c) k = 0.9. We end this section by briefly discussing the behaviour of stochastic time-summating networks in the large-N limit (scheme (c) of section (3.1)). First, it is useful to reformulate the dynamics in a similar fashion to section
83
3.1. That is, set q$m) = i&j(m)aj(m),aj(m) = e(vj(m>+ ?lj(m))and write
Macroscopic variables may then be defined along the lines of (3.11) with f, and a replaced by F, and V. A simple example [14] is given by a n homogeneous inhibitory network with threshold noise in which q-iii,(m) + -WIN,for all i, j, with w fixed, and ki = k, Ii = I for all i. The {ong-term macroscopic behaviour of the network is governed by the single dynamical mean-field equation X(m + 1) = Fp(X(m)), where X(m) is the mean activation state N-l%Vi(m) and FP is the map in (2.7) with gain y = p. The existence of periodic and chaotic solutions to this equation implies that, in the large-N limit, the network exhibits macroscopic behaviour in which asymptotic stability, i.e. convergence to a unique invariant measure (equation (3.2711,no-longer holds. For in the thermodynamic limit, X(m) is equal to the ensemble average N-lJ ZiVidpmW), given the initial condition N-lJ ZiVidpOW) = X(0); if (3.27) held then X(m>would converge t o a fixed point corresponding to the ensemble average over the invariant measure. It remains to be seen whether or not complex dynamical behaviour at the macroscopic level in a time-summating neural network can be exploited to the same degree as the fixed point behaviour of Hopfield-Little networks in the context of the storage and retrieval of static patterns. One of the issues that would need to be tackled is the appropriate choice of learningrule. 4
TEMPORAL SEQUENCE PROCESSING
A major limitation of standard feedforward neural networks is that they are not suitable for processing temporal sequences, since there is no direct mechanism for correlating separate input patterns belonging t o the same sequence. A common approach to many temporal sequence processing tasks is to convert the temporal pattern into a spatial one by dividing the sequence into manageable segments using a moving window and to temporarily store each sequence segment in a buffer. The resulting spatial pattern may then be presented to the network in the usual way and learning algorithms such as back-error-propagation applied accordingly [50]. However, there are a number of drawbacks with the buffer method: (i) Each element of the buffer is connected to all the units in the subsequent layer so that the number of weights increases with the size of the buffer, which may lead to long training times due to the poor scaling of learning algorithms; (ii) the buffer must be sufficiently large to accommodate the
84
largest possible sequence, which must be known in advance; (iii)the buffer converts temporal shifts t o spatial ones so that, for example, the representation of temporal correlations is obscure; and (iv) the buffer is inefficient if the output response to each pattern of the input sequence is required rather than just the final output. The deficiencies of the buffer method suggest that a more flexible representation of time is needed. One simple approach is t o introduce into the network a layer of time-summating neurons [8,9] each of which builds up an activity trace consisting of a decaying s u m of all previous inputs to that neuron (cf. equation (2.4)), thus forming an internal representation of an input sequence. The inclusion of such a layer eliminates the need for a buffer and allows the network to operate directly in the time domain. Such networks have been applied to the classification of speech signals [51,521, motion detection 1533, and the storage and recall of complex sequences CS]. 4.1 Classification and storage of temporal sequences. Consider a single time-summating binary neuron with N input lines that is required t o learn p sequences of input-output mappings (IYm); m = 1,...,R) + (&m)), m = 1,...,R),p = 1,...,p, where Ip = (If ,...,1%)and oqm>= 0 or 1.We shall reformulate this problem in terms of a classification task to which the perceptron learning theorem 1541 may be applied. First, define a new set of inputs of the form 191 m-1
r=O
where k is the decay-rate of the time-summating neuron. The activation state a t time m is taken to be V(m) = kV(m-1) + Xi w;Iy(m) = Zj wjI!(m). (Our definition of V(m) differs slightly from that of equation (2.3)).The output at time rn is a(m) = 0(v(m)- h), where h is a fixed threshold. Divide the RM inputs ip(m), p = 1,...,p, m = 1,...,R, into two sets F+ and F- where iP(m) E F+ if ohm) = 1 and iP(m) E P otherwise. Learning then reduces to the problem of finding a set of weights (wj, j = 1,...,N) such that the sets F+ and F- are separated by a single hyperplane in the space of inputs ?(m) - linear separability. In other words, the weights must satisfy the RM conditions N
N
cw,iy(rn) > h + Gifip(m) E F+,
xw,iy(m) c h -6 if i"m)
j=l
j=l
E
F (4.2)
85
The perceptron convergence theorem [54]for the time-summating neuron may be stated as follows [9]: Suppose that the weights are updated according to the perceptron learning-rule. That is, at each iteration choose an input icl(m) from either F+ or F- and update the weights according t o the rule wj
wj + (&m) - €)(w.ip(rn) - h))iy(rn)
(4.3)
If there exists a set of weights that satisfy equation (4.2)for some 6 > 0, then the perceptron learning-rule (4.3)will arrive at a solution of (4.2)in a finite number of time steps - independent of N. *
1
\
-. 1
0
1
- a \
\
1
0
0
(a)
(b)
Figure 7. Example of (a) separable and (b) non-separable sets F+ and Fassociated with the sequences of input-output mappings defined in the text. Points in F+ and F+ are denoted, respectively, by 0 and 0. The above result implies that a time-summating binary neuron can learn the set of mappings {I&); t = 1,...,R) + (oqt)), t = 1,...,R), = 1,...,M provided that the associated classes F+ and F- are linearly separable. We shall illustrate this with a simple example for N = 2 [9]. Define the vectors A = (1,O) and B = (0,l)and consider the mappings A B + 1 0 and B A + 0 0. This is essentially an ordering problem since the pattern A produces the output 1 or 0 depending on whether it precedes o r proceeds the pattern B. (Thus it could not be solved by a standard perceptron). Using equation (4.1) we introduce the four vectors
86
(4.4)
It is clear that the sets F+ = (i'(1)) and F- = (i1(Z),?(l),i2(Z)) are linearly separable (Figure 7a) and, hence, that the neuron can learn the above mappings. On the other hand, the neuron is not able to learn the mappings A B -+ 1 1and B A -+ 0 0, since the associated sets @ cannot be linearly separated by a single line, see Figure 7b. This is analogous to the exclusiveOR problem for the perceptron 1541.
k2
(a)
(b)
Figure 8. The sets F+ and F- associated with the mapping C C C + 1 0 1, C = (1,l) for (a)a single time-summating binary neuron and (b)a two-layer network in which the time-summating input neurons have different decayrates, k # k Another example of a non-separable problem is shown in Figure 8a, which describes the sets F+ and P associated with the mapping C C C -+ 1 0 1, where C = (1,l).One way to handle this and the previous example is to use a feedforward network with an input layer of time-summating neurons. More specifically, consider a two-layer network consisting of N timesummating neurons, each with a linear output function, connected to a single standard binary neuron as output, (Figure 9). For a n input sequence (I(l),...,I(R)), the activation state of the jth input neuron is Vj(m) = qkiIj(m-r) and the activation state of the output neuron is V(m) = qwjVj(m). In the special case kj = k, the network is equivalent, in terms of its input-output transformations, to the time-summating neuron considered
87
above. However, the network representation immediately suggests a useful generalisation in which the decay-rate associated with each input line is site-dependent. (The perceptron convergence theorem still holds when k is replaced by kj in (4.1)).It is clear that the network can solve a wider range of tasks than a single time-summating binary neuron. A simple example of this is the problem of Figure 8; the sets Fk become linearly separable when the two input neurons have different decay-rates (Figure 8b).
Figure 9. A two-layer time-summating network One can also use a time-summating neural network to solve nonlinearly separable tasks by introducing one o r more hidden layers. In the example of Figure 7b this may be achieved with two hidden units that separate the classes @ by two lines. However, as with standard networks, it is necessary to replace the perceptron learning-rule with a n algorithm such as back-error-propagation (BEP) [501. In order to implement BEP, the threshold function of all binary neurons must be replaced by a monotonically increasing, smooth function such as a sigmoid. Consider, for example, a three-layer network with a n input layer of linear timesummating neurons, a hidden layer and a n output layer, both of which consist of standard neurons with sigmoidal output functions f, fix) = U(1 + e-1. For a n input sequence MU, ...,I(R)), the input-output transformation realised by the network at time m is
where wjp is the weight from the pth input unit to the jth hidden unit and wij is the weight from the jth hidden unit to the ith output unit.
88
Given a desired output sequence ( ~ ( 1 1..., , o(R)), the network can be trained using a form of BEP that minimises the error E = q,m(Oi(m) oi(m)I2. That is, the weights and decay-rates, denoted collectively by 6, are changed according to a gradient descent procedure, A( = - r$E/aC,, where is the learning-rate. The gradients aEBC are calculated iteratively by backpropagating errors from the output to the hidden layer [50]. For the weights we have,
where q(m) and Ej(m) are the errors at the output and hidden layers respectively,
Similarly, for the decay-rates,
(Note that the above implementation of BEP differs from that of Mozer [52], who considers neurons that have their output rather than their activation state fed back as input). So far we have only discussed the problem of learning sequences of input-output mappings. However, with small modifications, a two-layer, time-summating neural network can also be trained to store and retrieve temporal sequences [8,91. Suppose that the input layer consists of N linear time-summating neurons and the output layer of N standard binary neurons; all neurons in the input layer are connected t o all neurons in the output layer. Thus, we effectivelyhave N independent networks of the form shown in Figure 9. "he network stores a sequence N O ) , ...,UR)) by learning to output the pattern I(m+l) on presentation of the previous patterns I(O),...,Um) for m = 0,...,R-1. For each output neuron, this is achieved by applying a perceptron-like learning-rule of the form (4.3) t o the set of weights connected to that neuron. In other words, for each i = 1,...,N, (4.9)
89
where wij is the weight from the jth input neuron to the ith output neuron. A schematic diagram of the learning-phase is shown in Figure 10. Once the network has been successfully trained, the full sequence may be recalled by simply seeding the network with NO) and feeding the output of the network at each subsequent time-step back to the input layer via delay lines. During recall, all weights are held fixed. The presence of the time-summating neurons in the input layer, with the possible addition of one or more hidden layers, allows the disambiguation of complex sequences. (Recall the discussion of complex sequences in section 2). This may be shown [9] using geometrical arguments similar to those illustrated in Figures 7 and 8. I I I
I
time-summating input layer
output layer
I I
I
I
I
I I
I(m+l)
-1
I(m+l)
Figure 10.Schematic diagram of the learning-phase of a two-layer timesummating network for the storage of a temporal sequence {I(O), ...,I(R)). The network learns to match its actual output O(m) a t time m with the desired output, which is the next pattern of the sequence I(m+l). Dotted line indicates feedback during recall. We conclude that the inclusion of an input layer of time-summating neurons allows a network to solve complex tasks in the time domain by transforming each input sequence into a more useful representation for processing by subsequent layers. However, one class of problems that such networks cannot solve concerns cases in which the response t o a particular
90
pattern within a sequence depends on subsequent rather than prior patterns of that sequence. Nevertheless it is possible to extend the timesummating model so that such problems may be handled. (See section 5).
Temporal correlations In section 4.1, we showed how time-summating neurons can solve certain tasks in the time domain by storing previous inputs in the form of an activity trace. Another consequence of the activity trace is that temporal correlations are set up between the activation states of the neuron on presentation of an input sequence [11,91. To illustrate this point, consider a time-summating binary neuron (or its network equivalent) being presented with sequences of input patterns of the form {Y(l),...,Y(R))where Y(t) = I(t) + <(t),I(t)is fixed and lJt) is a random pattern vector satisfying
4.2
The mean and covariance of the correspondinglocal fields are given by
where
(4.12) s=l
The RxR temporal correlation matrix D is symmetric, has positive definite eigenvalues and unit determinant (see Ref. [10,11]).The inverse matrix D-1 has the simpler form
Hence, although there are no extrinsic correlations between patterns within an input sequence, intrinsic temporal correlations do arise between the sequence of activation states of the time-summating neuron. Moreover, the variance of V(r), determined by the diagonal term D, satisfies varCV(r)l=7% = Y- 1 - k2' 1-k2
(4.14)
91
where we have set x j w y = 1. Therefore, the variance is a monotonically increasing function of r that, for infinitely long sequences, approaches the asymptotic value y/(l- k2) as r + -. For values of the decay rate k 1, this limiting value is large and may lead to an accumulation of errors. For example, assume that N is large so that the distribution of activation states (V(l),...,V(R)) associated with the random sequences (Y(l),...,Y(R)} is described by a multivariate Gaussian with mean and covariance given by (4.11). Suppose that the underlying patterns I(r) should be mapped to the outputs d r ) =fl for r = 1,...,R. (We have switched to the “spin” output representation for convenience). Then the probability that the neuron produces an incorrect output at time r, on presentation of a noisy input sequence (Y(l),...,Y(R)), is
-
(4.15) X
Equation (4.14) implies that the scale factor UdDrr in equation (4.15) could lead to an accumulation of errors along a sequence when k 1, since H(x) is a monotonically decreasing hnction of x. One cannot avoid this problem by taking k << 1, since, in order to disambiguate long sequences, it is necessary t o have k as close t o unity as possible. That is, t o capture relationships between patterns of a sequence of length R, kR should not too small. A reasonable condition might be kR > 0.1, which implies that I logk I c loglO/R 0 for large R. As it stands, the neuron is not robust to noisy inputs when trained on noise-free pattern sequences using the learning-rule (4.3). For if the network has successfully learned the mapping (I(l),...,I(R)} + (~(11, ...,o(R)} then (4.3) only ensures that x(r) = o(r)w.i(r)> 0 ; we have set the threshold h = 0 for simplicity. In particular, x(r) may be sufficiently small such that, on presentation of a noisy sequence (Y(l),...,Y(R)), the probability of an error satisfies E(r) = H[x(r) / = H[Ol = 112. Moreover, this feature is reinforced for long sequences when k = 1, since the scale factor dDrr considerably reduces the effective size of x(r). (For a more detailed analysis of E(r), based upon the statistical-mechanicaltechniques of Gardner [4], see Ref. [ll]).Therefore, to ensure a certain level of robustness to noise we introduce a stability parameter K, K > 0 and require that
-
-
a]
92
Then dr) satisfies E(r) < H[K/J"ID,], and it is now possible to reduce the upper bound for the probability of error by increasing K. (However, as in the case of static inputs, the number of patterns that may be learned without error decreases as K increases (see below and Ref. [lo])). Equation (4.16)is similar in form to the fixed point equation for the storage of a static pattern in an attractor network, with K guaranteeing a finite size for the basins of attraction [4].Therefore, it is necessary to modify the perceptron-like learning algorithm (4.3)when K > 0. Suppose that the neuron is required to learn p mappings of the form (1411, ...,I&R)) (OW) ,...,N R ) ) , p = 1,...,p. Introduce a mask of the form
where Ilwll=
(c
wf)'''
and update the weights according t o the rule
j
w j + wj+ #Jap(r)f!(r)
(4.18)
For K = 0, equation (4.18)is equivalent to (4.3).The convergence of the algorithm may be established using a straightforward extension of Gardner's proof for fixed points [4]:Assume that a solution w* exists such that for each cc. = 1,...,p and t = 1,...,R,
where 6, K > 0. Let w ( ~ be ) the set of weights after n iterations of the learning algorithm and define X(n) by
(4.20) Using equations (4.17M4.19)and the Schwartz inequality, it may be shown that x(n) satisfies
(4.21)
It follows from (4.21)that, in order for the upper bound of X(") not to be violated, the algorithm must terminate in less than nc time steps where Hence, the learning-rule (4.18)is guaranteed to find ndlogn, = N/(26(~+6)).
93
a solution to (3.9)if at least one such solution exists. Finally, note that, a s in the case of standard perceptrons, statisticalmechanical techniques may be used to analyse the performance of a timesummating binary neuron [10,11]. One of the features that emerges from such analyses is the non-trivial contribution of intrinsic temporal correlations, a s characterised by the matrix D of equation (4.12). In the case of random sequences, the main effect of such correlations is to alter the effective size of the stability parameter K [lo]. 5
COMPARTMENTAL MODEL.
So far we have considered a simple extension of the McCulloch-Pitts model that involves the introduction of a slowly decaying activation state. If we interpret the activation state as a mean soma potential, then this decay term reflects the leaky-integrator nature of the process by which real neurons integrate their inputs. However, as it stands, the time-summating model neglects other aspects of this process such as the spatial structure of the cell and dendritic tree. A common approach t o spatial structure is to use a compartmental model of a neuron [55,56]. The compartmental model takes each neuron to be split up into a set of M equivalent circuits, where the membrane potential i n the ath compartment, denoted V,, obeys the leaky-integrator shunting equation
(5.1)
Here, C,, R, are the membrane leakage capacitance and resistance of the ath compartment, with junctional resistance Rap between the ath and Pth compartments; if two compartments are not connected then R aP = - * Moreover, A g k is the increase in synaptic conductance of the kth synapse induced by incoming action potentials from other neurons and s,k is the membrane reversal potential. The number of neurons impinging on the ath compartment is N,. (For compartments without synaptic inputs, the final term on the right-hand side of (5.1) is absent; if a compartment has a single synapse then the index k will be dropped). Since A g d is positive, the effect of each shunting term Ag&[s,k - V,] is for V, to tend towards s,k. Thus positive and negative s,k correspond respectively t o excitatory and inhibitory synapses. (Note that, for simplicity, we shall only be concerned with a single neuron being stimulated by inputs from a set of pre-synaptic neurons; we shall not discuss behaviour a t a network level). In Figure 11,
94
we give a simple example of a compartmental model consisting of a chain of isopotential regions, with the first compartment having a synaptic input. The other compartments idealize the cable properties of a single dendrite. From this simple picture one can then build up the complex tree structure of a neuron's dendrites.
Figure 11. Simple example of a compartmental model of a neuron consisting of a chain of isopotential regions and a single synaptic input, with Agl describing the conductance changes a t the synapse with membrane reversal potential S1. Since equation (5.1)is formally linear in the set may be written as a matrix equation of the form [571 dV -=A(t)V(t)+U(t), dt
Wa,a = 1,...,MI, it (5.2)
(5.3)
(5.4)
(5.5)
(5.6)
95
Formally, equation (5.2)may be solved (for V(0)= 0 ) as
(5.7) where T denotes the time-ordered product, which is required since M is a time-dependent, non-commuting matrix. T h a t is, T [A(t)A(t')] = A(t)A(t')€Nt- t') + A(t')A(t)B(t'-t). We are interested in reducing the above model to a generalised timesummating model of a neuron. Therefore, some prescription must be given for discretising the time. We shall follow the approach of Ref. 161 by restricting the firing-times of all neurons t o the discrete-times t = m, integer m; however, the evolution of the membrane potentials is still governed by the leaky-integrator equation (5.1). Neglecting details of the action potential pulse-shape, Ag& then consists of a sequence of conductance spikes of the form,
where a,k(m) = 1 if the kth pre-synaptic neuron impinging on the ath compartment fires at the mth time-step and a&m) = 0 otherwise; we are assuming that there is a synaptic delay equal to a single time-step. The size of each conductance spike, o d ,is determined by factors such as the postsynaptic effieacy. (Note that we shall consider below the integration of Ag&t) over intervals [n,m] for integer m and n. It is important to uniquely specify the behaviour of A g d a t the end points. In the following, we shall assume that there is a contribution from A g k a t the upper limit m but not at the lower limit n. This corrresponds to an &-prescriptionin which - m) is replaced by &t- m + €1). For synaptic inputs of the form (5.81, the dynamics of the membrane potentials V(t) can be solved completely in terms of their values at the discrete times t = m, integer m. This may be seen by integrating equation (5.2)over the interval [m,t] leading to the condition
at
V,(At+m)= z(exp[AtQ])$Vp(m),
O< A t c 1
(5.9)
P where Q is the time-independent part of A, equation (5.3). (The vector U, equation (5.6),and the time-dependent part of A vanish for non-integer
96
times; this follows from equation (5.8)).The matrix Q has real, negative, non-degenerate eigenvalues [55]. Hence, the right-hand side of (5.9)is a finite linear combination of decaing exponentials. It follows that the time evolution of V(t) is described by a saw function in which jumps occur a t integer times, whereas the dynamics at non-integer times is characterised by decaying exponentials. It remains t o determine the membrane potentials V(m) at the discretetimes m. Substituting equation (5.8)into (5.71, we obtain,
where wp.ap = c kopks e pk k , J denotes the subset of compartments that have synaptic inputs, and
(5.11) In order to evaluate the time-ordered product in equation (5.10), we split the interval [n, m] into MT equal partitions [ti, $+I], where M = m - n, to = n ,...,tT = n+l,...,tMT = m and take 6(t - 5) + 8ti,s/At, A t = 1/T, for all t e (ti-l,ti]. The time-ordered product is then given by
(5.12)
for nem, Hence, from equations (5.12) and (5.10) with m-n = r-1, we obtain
(5.13) which m a y be rewritten in the iterative form (for V(0)= 0 )
97
(5.14)
Equations (5.13)and (5.14)are the generalisations of (2.4)and (2.3)to the case of a compartmentalised time-summating neuron. They need t o be supplemented by a threshold condition determining when the neuron fires (cf. equation (2.2)). We shall assume that the neuron fires whenever its membrane potential at the axon hillock exceeds some threshold h, provided that the neuron is outside its absolute refractory period (taken t o be unity). Using similar arguments to Ref. [6], it then follows from equation (5.9)that the neuron can only fire at integer times t = m. Thus, if equation (5.14)is supplemented by the threshold condition a(m) = e(VM(m)- h), where M labels the compartment containing the axon hillock, we then have a complete, discrete-time description of the neurons behaviour. The response characteristics of the neuron satisfying equation (5.14), and the particular role played by the nonlinearities associated with G, are discussed elsewhere [58]. For the sake of simplicity, we shall consider here a simplification of (5.13)and (5.14)obtained by taking the limit Sak + -, Oak + 0, for fixed S,ko&, so that G = 0. (This corresponds t o dropping the term -zkAg&v, from the leaky-integrator equation (5.1)). In the case of a single compartment we then recover equations (2.3) and (2.4). A less trivial example is given by the compartmental model of Figure 11 in which the first compartment receives synaptic inputs and the Mth compartment corresponds to the axon hillock. The matrix Q is of the form (5.15) Substituting equation (5.15)and G = 0 into (5.13),the membrane potential at the axon hillock is given by m
r=2
m
a r=l
where ha are the eigenvalues of Q and ca1 are constant coefficients. In the special case zp = z and z p - 1 ~= I for all p, the exponential term in equation (5.16)reduces to
98
(5.17)
where Nq[l,M] is the number of possible paths that can be taken by a random walk consisting of q steps of unit length between reflecting barriers a t x = 1and x = M [331.
Figure 12.Multi-input compartmental model. The above picture may be extended t o the case of a multi-input compartmental model consisting of a soma together with N dendritic chains each of which has an exponential factor of the form (5.161,see Figure 12. The membrane potential at the soma is given by
where a;(m) is the output from neuron i a t the rnth time step, Qi is the decay matrix associated with the ith dendrite and the weights wi have been absorbed into the coefficients cai. We see that an important consequence of the compartmental model of a neuron is that it allows one to associate a range of site-dependent decay-rates with each input line. This could enhance the temporal sequence processing abilities of the neuron (cf. the discussion of Figure 8 in section 4). Finally, note that if the right-hand side of (5.17)is truncated beyond the first term in the summation over p, then equation (5.18)becomes
99
(5.19)
An interesting feature of (5.19) is that the maximum effect of a n input occurs with a delay that increases with the length of the input line, i.e. the
number of compartments Mj. This follows from
which implies that the maximum occurs a t r = [(Mj-1)~ + 11, where [XI denotes the integer closest to x. (On the other hand, the untruncated expression in (5.17) has its maximum when r = 1, since it is equal to a linear combination of decaying exponentials). It may be possible to exploit this feature in temporal sequence processing tasks where the response to a pattern within a sequence depends on future patterns. That is, if the maximum response to a pattern is sufficiently delayed then the effects of future patterns can be taken into account. (See section 4 and Ref. 1583). 6.
References.
W. S. McCulloch and W. Pitts, Bull. Math. Biosci. 5, 115 (1943). D. J. Amit, Modeling Brain Function, (Cambridge University Press, Cambridge, 1989). 3 L. Garrido (ed.), Statistical Mechanics of Neural Networks, (SpringerVerlag, Berlin 1990) 4 E. Gardner, J. Phys. A 21, 257 (1988). 5 E. Caianiello, J. Theor. Biol. 1,209 (1961). 6 P. C. Bressloff and J. G . Taylor, Neural Networks, 4, 789 (1991). 7 P. J. Harrison, J. J. B. Jack and D. M. Kullman, J. Physiol. 412,43 (1989). 8 M. Fteiss and J. G. Taylor, Neural Networks 4, 773 (1991); J. G. Taylor, Int.J. Neural Syst. 2 , 4 7 (1991). 9 P. C. Bressloff and J. G. Taylor, J. Phys. A 25,833 (1992). 10 P. C. Bressloff and J. G. Taylor, Perceptron-like Learning in Tirnesummating Neural Networks, (HRC preprint, 1992). Submitted to J. Phys. A. 11 P. C. Bressloff, Learning Temporal Sequences from Examples in a 1 2
100
12 13 14 15 16 17 18
19 20 21 22 23 24 25 26 27 28 29 30
31
32 33
34 35 36 37
38 39 40 41
Time-summating Network, (HRC preprint, 1992). Submitted to Phys. Rev. A. K. Aihara, T. Takabe and M. Toyoda, Phys. Lett. A 144,333 (1990). P. C. Bressloff and J. Stark, Phys. Lett. A, 150,187 (1990). P. C. Bressloff, Phys. Rev. A 44, (1991). P.C. Bressloff, Analysis of Synaptic Noise Using Iterated Function Systems, Phys. Rev. A (1992). To appear. M. Bauer and W. Martienssen, Network 2,345 (1991). C. M. Gray, P. Koenig, A. K. Engel and W. Singer, Nature 338,334 (1989). R. Eckhorn, R.Bauer, W. Jordan, M. Borsch, W. Kruse, M. Munk and H. J. Reitbock, Biol. Cybern. 60,121 (1988). Ch. von de Malsburg and W. Schneider, Biol. Cybern. 54,29 (1986) J. Nagumo and S. Sato, Kybernetic 10, 155 (1972). P. Koenig and T. B. Schillen, Neural Comput. 3,155 (1991) P.Wagner and H. G. Schuster, Biol. Cybern. 64,77 (1990). I. Tsuda, Phys. Lett. A85,4 (1981). J. P.Keener, Trans. h e r . Math. SOC. 261,589 (1980). H.G.Schuster,Deterministic Chaos, (Weinheim:VCH, 1987). R. S.Mackay and C. Tresser, Physica 19D, 206 (1986). B.Katz, The Release of Neural Transmitter Substance, (Liverpool University, Liverpool, 1969). J. G. Taylor, J. Theor. Biol. 36,513 (1972) G.L. Shaw and R. Vasudevan, Math. Biosci. 21,207 (1974). P.C. Bressloff, in New Developments in Neural Computing, edited by J. G. Taylor and C. L T. Mannion, (Adam Hilger, Bristol, 1989);P. C. Bressloff and J. G. Taylor, Phys. Rev. A 41, 1126 (1990). H.Korn and D. S. Faber, in Synaptic Functions, edited by W. M. Edelman, W. Gall and W. M. Cowan, (Wiley, New York, 1987), p. 57. W. A. Little, Math. Biosci., 19,101 (1974). G.R. Grimmett and D. R. Stirzaker, ProbabiEity and Random Processes, (Oxford University Press, Oxford, 1986). J. W. Clark, Phys. Rep. 158,91(1988). P. Peretto, Biol. Cybern. 60,51 (1984); F.J.Pineda, J. of Complexity, Sept. 1988. S. Amari, K. Yoshida and K. Kanatani, SIAM,J. Appl. Math. 33,95 (1977). J.J. Hopfield, Proc. Natl, Acad. Sci., 81,3088 (1984). D. J. Amit, H. Gutfreund and H. Sompolinsky, Phys. Rev. A 32,1007 (1985). E.Gardner, B. Derrida and P. Mottishaw, J. Physique 48,741 (1987). B.Derrida, E.gardner and A. Zippelius, Europhys. Lett. 4,167 (1987);
101
R. G e e and A. Zippelius, Phys. Rev. A 36,4421 (1987). 42 M. Y. Choi and B. A. Huberman, Phys. Rev. B 28,2547 (1983). 43 M. F. Barnsley, Fractals Everywhere, (Academic Press, New York, 1988); M. F. Barnsley and S. Demko, Proc. Roy. SOC.London Ser A 399, 243 (1985). 44 M. F. Norman, J. Math. Psychology 5,61(1968). 45 J. H. Elton,Ergod. Th. and Dynam. Syst. 7,481 (1987). 46 M. F. Barnsley and A. D. Sloan, Byte 13,215 (1988). 47 J. Stark, IEEE Trans.Neural Networks 2,156 (1991); ibid Neural Networks 4,679 (1991). 48 P. Erdos, h e r . J. Math. 61,974 (1939); ibid62, 180 (1940). 49 P. C. Bressloff and J. Stark in Fractals and Chaos, edited by A. J. Crilly, H. Jones and R. A. Earnshaw (Springer-Verlag, 1991) p.145. 50 D. E. Rumelhart and J. L. McClelland, Parallel Distributed Processing (MI" Press, Cambridge, MA, 1986). 51 R. L. Watrous and L. Shastri, in Proc. IEEE First Int. Conf on Neural Networks, San Diego, CA, June 1987, vol. 11, p. 619; M. I. Jordan, in
52 53
54 55 56
57 58
Proceedings of the Eigth Annual Conference of the Cognitive Science Society, Hillsdale, NJ (1987) p. 531. M. C. Mozer, Complex Systems 3,349 (1987). W. S. Stornetta, T. Hogg and B. A. Huberman, in Neural Information Processing Systems, ed. D. Z. Anderson (American institute of Physics, New York, 1987). M. Minsky and S. Papert, Perceptrons, (MIT Press, Cambridge, MA, 1969). W. Rall, in Neural Theory and Modeling, edited by R. F. Reiss, (Stanford University Press, Stanford, 1964), p. 73. C. Koch and I. Segev (eds.), Methods in Neuronal Modeling, (MIT Press, Cambridge, MA, 1989). D. H. Perkel, B. Mulloney and R. W. Budelli, Neuroscience 6,823 ( 1981). P. C. Bressloff and J. G. Taylor, Discrete Time Compartmental Model of a Neuron , in preparation.
This Page Intentionally Left Blank
Mathematical Approaches to Neural Networks J.G. Taylor (Editor) 0 1993 Elsevier Science Publishers B.V. All rights reserved.
103
The numerical analysis approach S.W. Ellacott Department of Mathematical Sciences, Brighton Polytechnic, Moulsecoomb, Brighton BN2 4GJ. United Kingdom.
Abstract Tools derived from numerical analysis can be used to explore algorithmic properties of neural networks. This paper discusses some issues of convergence, stability and preprocessing of data within this context.
1. T H E NUMERICAL ANALYSIS APPROACH
- HOW IS IT DISTINCTIVE?
Learning algorithms are (mostly) numerical algorithms. As such, they are also, of course, dynamical systems. Since they are normally concerned with convergent processes, they also have much in common with control problems and stability theory. These topics are addressed elsewhere. Nevertheless, neural systems fit within the context of traditional areas of numerical analysis concerned with iteration theory, approximation theory, numerical linear algebra and particularly, optimisation. Indeed it is common to think of neural models of learning as optimisation of a system over a set of parameters. This may involve minimising some least squares error. as when backpropagation is used to train a multi layer perceptron; it may involve some energy measure or similar Liapunov function to find the stable states of a system (as for a Hopfield net); or one may be optimising some more nebulous concept of “goodness”, as is the case with some unsupervised pattern classifiers. Dynamical systems theory tends to be concerned with global or generic properties of systems. What kind of attractors do they have? Do they exhibit chaotic behaviour, and if so with what characteristics? Numerical analysis on the other hand is concerned with more detailed issues. How quickly will the attractor be reached? And how safely? Does our implementation of the system involve any approximations or discretisation errors, and if so, what is their effect? The reader might think that such issues are less interesting than the generic properties. However, apart from the question of practical importance, study of detailed properties of algorithnis can often reveal unsuspected attributes which can throw considerable light on why they behave as they do, thereby opening the way to new models. Numerical analysis has an accumulated wealth of experience of the detailed behaviour of algorithms. At a simple level, existing algorithms such as the well known Conjugate Gradient, or Davidon-Fletcher-Powell methods (see e.g. punday, 19841) may be the most efficient way to train small or medium sized systems: various authors have experimented with these. Algorithms designed as “learning algorithms” may be variants of known methods. Thus workers in neural networks need to be familiar with basic numerical
104
methods to avoid re-inventing them! However there is a more fundamental way in which numerical analysis can contribute to neural computation. Obviously leaming systems do have special properties: they are not “just” optimisation or iteration processes. However, numerical analysis can bring a range of powerful methods and tools which can elucidate the behaviour of these systems. These can suggest new ways of thinking about them, and ways to improve them. The purpose of this chapter is to explore some of these ideas as they relate to some well known neural models. In the rest of this section we will explain the questions that the numerical analyst considers. In Section 2, we ask these questions of a well known and simple leaming algorithm, the Widrow-Hopf delta rule, in order to show how they can lead to results of practical importance. We shall also discuss generalisation to non-linear methods. The third section aims to relate these issues to principal component analysis, filters, preprocessing and preconditioning. Finally, in the Section 4 we briefly suggest ways in which numerical analysis can open up new avenues of research. The early part of Section 2 covers in expanded form work described in Ellacott, 19901. The rest of the material here has not appeared elsewhere. When presented with a problem to be solved numerically, the numerical analyst will want to ask three key questions about it. Firstly, is the problem to be solved well conditioned? Secondly, is a proposed algorithm for solving the problem numericaly stable? Thirdly is the algorithm eflcient? Before addressing neural systems as such, we need to explore the meaning of these concepts. The first and second questions are related and indeed are often confused, but it is well to be aware that there is a distinction.
1.1. Conditioning A problem is said to be well conditioned, or well posed, if the solution to the problem is not radically altered by a small change in the data which describes it. If not, it is said to be ill conditioned or ill posed. The problem of finding the ultimate state of a dynamical system will be ill posed if the intial state is in a chaotic region: a small change in the initial conditions can lead to a completely different final state. In order to illuminate this concept a little more, we briefly consider a trivial example of an ill conditioned problem, the system of equations x
i-
+
x+(l where y =
&
y = l
e)y = 1
and
+6
6 are small real numbers, say less than 10-6 in magnitude. The solution is
6/€,
x = 1
-a/€.
Obviously there are choices for 6 and€ in the specified range which will give any value of y we like, or, indeed, no value at all! The system of equations is nearly singular, and the answer is not well determined. Most mathematicians are familiar with the concept of an ill posed problem, but one thing which has become apparent from recent research in numerical analysis is that “ill
105
posedness" is not an absolute property. Often, the question of how well posed a problem is depends on how one states it, and, particularly, in what space one looks for a solution. An excellent example of this is provided by the problem of determining the Maclaurin series coefficients of an infinitely differentiable function f of one variable, i.e. the quantities
for n = 0,1,2 ... . (Here [n] denotes the nth derivative.) If we seek for a solution in the space C"(-2,2), the space of real infinitely differentiable functions on the real interval (-2.2). then the problem is ill posed. (Note that the coefficients exist for functions in this space, even if the series does not converge to f.) Indeed, even determination of the fEst derivative is ill posed. To see this, suppose that we add to our function f the (analytic) function sin(ox)/w, for some large value of W. This changes the value of f(x) for any x by at most l/w, but the value of f'(0) is increased by one. The effect of this ill posedness is the well known fact that by finite differencing, it is only possible to determine the value of the derivative to roughly half the available machine precision. On the other hand, the problem is perfectly well posed in the space of functions analytic on the circle z c 2 in the complex plane. We have
I I I.
Is1
If we add a function g to f, we find
Thus the change in the Maclaurin coefficient cannot be greater than the change in f itself. By evaluating the integral with the trapezium rule, it is perfectly possible to obtain accurate Maclaurin coefficients for values of n even in the hundreds. In Section 2.2 we shall see an example of a machine learning problem which is itself well posed, but which can become ill conditioned when expressed as an optimisation problem.
1.2. Numerical stability Whereas conditioning is a property of a problem to be solved, numerical stability is a property of the algorithm used to solve it. It must be admitted that it is not always easy to separate the two, but nevertheless the distinction is worth making. Numerical stability should not be confused with Liapunov stability: in dynamical systems parlance numerical stability is more closely related to structural stability, although even that is not exactly the same. Numerical stability is often given a precise definition in the context of a particular
106
class of algorithms, but a reasonable generic definition might be: an algorithm is numerically stable if the solution obtained mimics that of the original problem within a plausible class of errors or approximations. Numerical stability is a somewhat subtle concept and is often misunderstood or confused with conditioning. To illustrate it, we will give a simple example of a difficulty that also occurs in neural systems (see Section 2.2). Consider the differential equation d2x + 101 & dt
dt2
+ lOOx
= 0.
In phase plane form, with y=
& dt
and x =
(;)
,
this becomes i = Ax, where A = (-1;
-A)
and . denotes differentiation with res-
pect to t. The eigenvalues of A are -1 and -100, so for any initial conditions, the system tends to the structurally stable fixed point at x = y = 0. Obviously, the component of the solution corresponding to the eigenvalue -100 will rapidly disappear, and the solution goes to zero at a rate proportional to exp(-t). Now suppose that we take a simple Euler approximation to this system, namely xk+l=
Xk
+ hAXk = (I +
hA)x,,
where xk denotes the (approximate) value of x at t = kh. The eigenvalues of (I + hA) are 1 - h and 1 - 100h. The rate of convergence of the system should be dominated by the (1 - h) eigenvalue which corresponds to the desired slowly convergent solution of the differential equation. The (1 - 100h) eigenvalue corresponds to the very rapidly decaying solution of the differential equation. However, unless h is very small, this eigenvalue is larger than 1 in modulus, so it causes the discrete iteration to diverge. The point here is that, as a dynamical system, the original differential equation is perfectly well behaved. The only “nasty” thing about it is the significant difference in the time constants of its fundamental solutions. In numerical analysis, such equations are said to be srif. Stiff differential equations occur in various applications, notably in chemical reaction kinetics in which the ionic and inorganic components of a reaction system may have time constants differing by many orders of magnitude: ratios of 105 or more may occur. Since the “fast” solution rapidly decays, one would not expect it to cause any difficulty. However, when we make a discrete approximation to the differential equation, the stability becomes dominated by this component. Euler’s method is numerically unstable for stiff systems, unless a very small h is used, and special strfly stubk methods, such as Gear’s method, are used to solve such problems numerically. The relevance of this to neural systems is clear. The descent or other differential equations arising from continuous time neural models may be stiff, even if they can be shown to be stable. Corresponding discrete models (such as the Euler approximation to the
107
gradient descent equation) may have the same properties for very small learning rates, but quite different behaviour when reasonable values of the learning rate are used.
1.3. Efficiency Compared with numerical stability, efficiency is a relatively simple concept. In practice, it is not sufficient to know that a method will eventually produce a solution: we want to get to it as quickly as possible. Ideally, an algorithm should achieve the same computational complexity as the problem it purports to solve, while retaining numerical stability. However, in practice this issue is very difficult to address, and we have to content ourselves with pragmatic approaches to increasing speed. Nevertheless, is is important that these pragmatic solutions be rigorously analysed, in order to determine under what circumstances (if any!) they will actually help. Some approaches to this question are presented in Section 3. One way that efficiency can sometimes be improved is to change the problem. There may be no point in expending a large amount of computational effort in finding the exact optimum trained state for some neural system, if a solution which will achieve the required classification can be found with less work.
2. ANALYSIS OF LEARNING DYNAMICS In this section, we explore the ideas introduced in Section 1, as they relate to a well established and fairly simple learning rule. We consider the Widrow Hopf delta rule, and its extension to nonlinear systems often called the generalised delta rule, or backpropagation. This is not to suggest that this approach is considered to be the only onworth considering or even necessarily the best. However it seems more profitable to take one method and look at it in detail, rather than give a superficial coverage of a broad field. Moreover the delta rule actually has some quite interesting mathematical properties: more so indeed than might be apparent from its simple nature. Similar approaches can be applied to the various modifications of the delta rule which abound in the literature. It is the opinion of this author that such variations should be compared on the basis of rigorous analysis of the type presented here, rather than just be a few possibly fortuitous examples. The tools presented here should be applicable for authors desiring to prove superiority of a particular approach. Previously, this rule has usually been analysed as a stochastic least mean square algorithm, under the assumption that the learning rate tends to zero as the algorithm proceeds. However, this is not normally done in practice: more commonly a constant learning rate is used. Our analysis will aim to address three questions:
i)
How does the linear algorithm behave with a fixed learning rate?
ii) Given a fixed set of training patterns, is it better to update the weight matrix after each pattern, or after the whole set of patterns (in which case the method becomes a version of steepest descent)?
108
iii) How far do the answers to the first two questions apply to related non-linear methods such as backpropagation?
2.1. The delta rule
So let us begin by considering the simplest of all neural models, the simple perceptron. Denote the training vectors (generically) by x and desired output vectors by y. W is the weight matrix. We would like to find W such that y = Wx for all pairs (x,y) of patterns and corresponding outputs. We assume initially that W is updated after each training pattern. The change in W when the pattern x is presented is given mumelhart and McClelland,l986, p.3221 by
where q is the learning rate, and (WX)~denotes the jth element of Wx. Actually, we can simplify matters here by observing that there is no coupling between the rows of W in this formula: the new jth row of W depends only on the old jth row. This enables us to drop the subscript j, denoting yj just by y, and the jth row of W by the vector wT. Thus we get 6Wi
= q0' - WTX)Xi, so
6w
=
110'- wTx)x.
Thus given a current iterate weight vector wk, wk+f
= wk+8wk = wk
+
- wkTX)x
= (I - qxxT)wk + qyx.
(2.1)
The final equation is obtained by transposing the (scalar) quantity in brackets. Note the bold subscript k here, denoting the kth iterate, not the kth element. Observe also that the second equation makes clear what the delta rule actually does: it adds a suitable multiple of the current pattern x to the current weight vector. We now prove some results about this iteration: the first lemma is a special case of a well known results (see e.g. [Oja, 1983, ~ 1 8 1but ) the proof is given here for completeness.
Lemma 2.1 Let B = (I - VT). Then B has only two distinct eigenvalues: 1 - qllx112 corresponding to the eigenvector x and 1 corresponding to the subspace of vectors orthogonal to x. denotes the usual Euclidean norm.) (Here
11
109
Proof (I -qxxl)x
= x -qxxTx
= x(l -Tl11xlp> as required.
On the other hand if u is orthogonal to x, (I-qxxT)u
= u-VXXTU
= u.
That there can be no more eigenvalues follows from the fact that x together with the set of all vectors orthogonal to x span the input vector space.
I Recall that the matrix norm of any mxn matrix A corresponding to the Euclidean norm in IR* is defined by
This expression defines a norm on the mn dimensional space of matrices with the property that for any n-vector v, l l A ~ 1 1d~ llA11211v112.
Lemma 2.2
e(B)
Provided 0 S q d 2/11x112, we have 11B1I2 = Q@) = 1, where is the spectral radius of B. i.e. the magnitude of its eigenvalue of largest modulus. Proof The 2 norm and spectral radius of a symmetric matrix are the same: see Psaacson and Keller, 1966, p10, equation (ll)] noting that the eigenvalues of A2 are the squares of those of A. That the spectral radius is 1 follows from Lemma 1.
I Now suppose we actually have t pattern vectors xl,...xt. We will assume temporarily that these span the space of input vectors, i.e. if the x’s are n-vectors, then the set of pattern vectors contains n linearly independent ones. (This restriction will be removed later.) For each pattern vector xp, we will have a different matrix B, say B, = (I - qxpxpT). Let A = BtBt-1...B1.
Lemma 2.3 If 0 < ‘1 < 2/11xp112 holds for each training pattern xp. and if the xp span, then 1 1 ~ 1 1 2< 1.
110
Proof By definition, there exists v such that 11A112 = llAv112 and
11~112=
1.
Thus 11AIl2 = (IBtB,, ... Blvll2 c 1IBtB,, ...B,I1211B1v112(from the definition of the norm). We identify two cases:
Case 1) If vTxl # 0, 11B1v112< 1, since the component of v in the direction of x is reduced (see Lemma 2.1: if this is not clear write v in terms of x and the perpendicular component, and apply Bl to it.) on the other hand IIBtBt-i - - ~ 2 1 1 2c
ll~Jlzll~t-i11~..11~2llz = 1.
Case 2) If vTxl = 0,then Blv = v (Lemma 2.1). Hence 11A112 = IIBtB,l may carry on removing B’s until Case 1 applies.
... B2v1I2 and we I
Remark
In theory, at least, one could compute the which is optimal in the sense of minimising 11A112 However this is unlikely to be worthwhile unless we can get an efficient algorithm. A common way to apply the delta rule is to apply patterns xl,xz,...xt in order, and then to
xl. The presentation of one complete set of patterns is called an epoch. Assuming this is the strategy employed, iteration (2.1) yields start again cyclically with
wL+~=
Awk+qh
(2.2a)
where A is as defined above and
h
= yI(BtBt-1 ... B 2 ) ~ 1+ ... yt-lBtxt-1 + Ytxt
.
(2.2b)
Here, of course, yp denotes the target y value for the pth pattern, not the pth element of a and l the x’s, but nor on the current w. vector. Note that the B’s and hence h depend on ? Since 6W in the delta rule is proportional to the error in the outputs, we get a fixed point of (2.1) only if all these errors can be made zero, which obviously is not true in general. Hence the iteration (2.1) does not in fact converge in the usual sense. On the other hand, we have shown (Lemma 2.2) that provided the xp span the space of input vectors, then for < 1. Hence the mapping F(w) = Aw + q h satisfies sufficiently small q.
llA112
i.e. it is contractive with contraction parameter 11A112. Mapping Theorem that the iteration (2.2a) does have Mapping Theorem may be found in most textbooks of systems: see for instance [Vidyasagar, 1978, ~ 7 3 1 .The fixed point is unique. Now if there exists a w that makes
It follows from the Contraction a fixed point. The Contraction functional analysis or dynamical theorem also guarantees that the all the errors zero, then it is easy
111
to verify that this w is a fixed point of (2.1) and hence also of (2.2a). Otherwise, (2.1) has no fixed points, and the fixed point of (2.2a) depends on 7: we denote it by ~ ( 7 )In. the limit, as the iteration (2.1) runs through the patterns, it will generate a limit cycle of vectors wk returning to w(q) after the cycle o f t patterns has been completed.
2.1.1. Dependence of the limit cycle on '1 Since w(q) is a fixed point of (2.2a) we have (writing h = h(q) and A = A(q) to emphasise the dependence) w(7) = Nrl)W(rl)
+qm)
(2.3)
*
Now what can we conclude about w(q)? Let us denote by (unique since the xp span) that minimises
W*
the weight vector w
(2.4a) Denote by X the matrix whose columns are xI,x2,...xt, and let
L
xxr
=
t
=
c
XpXpT.
Pl
Then
W*
satisfies the normal equations t
Lw* =
ypxp = h(0).
(2.4b)
F='
The second equality follows from (2.2b), observing that all the B matrices tend to the identity as q+O. On the other hand from (2.3) we get H(V)w(q) = h(q) where H(q) = (1 - N m . Assuming L-1 exists, define the condition number x(L) by x(L) = 11L-111z&112. Since L is symmetric and positive definite, it is easy to see that x ( L ) is equal to the ratio of the largest and smallest eigenvalues of L (compare [Isaacson and Keller, 1966, p10, equation (1 l)]). A standard result on the solutions of linear equations [Isaacson and Keller, 1966, p37] gives, provided llL - H(q)1I2 < IdlL-1112,
But
112
and considering powers of q in this product we obtain
= I-qL+O(q2)
Thus H(q) = L + O(q). Also an examination of the products in (2.2b) reveals h(q) = h(0) + O(q). This gives the first part of the following theorem.
Theorem 2.4 Suppose that the pattern vectors xp span, and that W* is (as above) the weight vector which minimises the least square error of the outputs over all patterns. If w(q) is a weight vector obtained by applying the delta rule with fixed q until the limit cycle is achieved, then as q+O,
b) If &(q)is the root mean square error corresponding to w(q) (see (2.4a)). and E' is the corresponding error for w*,then ~ ( q-)E* = O(q2) .
Proof a) follows from the remarks immediately preceding the theorem. The condition that the xp span is necessary and sufficient for L to be non singular: see (2.7) below. Note however, that it does not really matter if we look only at the end of the epoch of patterns: the result will apply to any w in the limit cycle. b) is simply the observation that for a least squares approximation problem, the vector of errors for the best vector w' is orthogonal to the space of possible w's, so an O(q) error in W* yields an O(q2) increase in the root mean square error. We will omit the details of this argument. I Remark In actual fact, Theorem 2.4 is not quite as satisfactory as it may appear, since the bound (2.5) depends on llL-1112. Although the spanning of the patterns xp is sufficient to guarantee the existence of L-1, in practice the norm can very large, as we shall see later. These results are illustrated by the following small numerical experiment. This used four input patterns each with a single output. These were
113
The first test used patterns 1 to 3 only. Since the input patterns are independent 3-vectors. the least squares error can in this case be made zero, and it is easily verified that in fact this occurs with
w =
(8).
The spectral radius and 11A\12for various values of T are shown in the Table 2.1. Table 2.1
0.1 0.2 0.3 0.4 0.5 0.6 0.7
0.967 0.928 0.881 0.823 0.750 0.654 0.525
0.967 0.929 0.888 0.848 0.866 0.939 1.035
With 7 = 0.7 we would expect that once the algorithm had settled down, the error would be almost halved after each epoch (complete cycle of patterns). and this indeed proved to be the case. The algorithm converged to the expected weight vector, and the mot sum of squares of the output errors was 5.33E-3 after 10 epochs, and 9.34E-9 after 30 epochs. We then repeated the tests with all 4 patterns. In this case, the output errors cannot all be made zero, and we expect a limit cycle. The corresponding values of spectral radius and two norm are given in Table 2.2. For small q, A is, of course, nearly symmetric, and the spectral radius and norm are almost the same. However, this property degenerates quite rapidly with increasing 1. and for ? >l0.2734088(7dp), the largest eigenvalue of A occurs as a complex conjugate pair. It would of course be unwise to make general deductions from this very small and simple exam le. but it is interesting to note that for this data, at least, e(A) is significantly smaller than gAl12 when ‘1is greater than about 0.3.
114
Table 2.2
0.1 0.2 0.3 0.4 0.5 0.6 0.7
0.935 0.868 0.816 0.819 0.866 0.929 1.006
0.933 0.852 0.725 0.640 0.612 0.677 0.786
The delta rule algorithm itself was run for this data with several values of 7. With ‘1 = 0.35 (the approximate minimiser of the two norm) the final value of the root sum of squares error was 0.53502 and convergence was obtained to 3dp after only 15 epochs. With
’1 = 0.1 the final error o b i n e d was 0.50157, but convergence to 3dp took 55 epochs. The exact least squares miiiimum is 0.5, and we observe that the error for q = 0.1 is comct to nearly 3dp, as suggested by Theorem 2.4b). The matrix L=XXT in this case is
4 2 2
.
(z:)
It has eigenvalues 6.372(3dp), 1 and 0.628(3dp). Therefore x ( L ) = 10.15(2dp). An interesting, and as yet unexplained, observation from the experiments is that when the limit cycle is reached, the error in the output for each individual pattern after the correction for that pattern has been applied was the same for all patterns. 2.1.2. When the patterns do not span One case we have not yet considered is when the xp fail to span. At first sight this may Seem unlikely, but in some applications, for instance in vision, it is quite possible to have more data per pattern than the number of distinct patterns. In this case we have more free weights than desired outputs, so we would normally expect to be able to get zero error. However this may not always be possible: it depends on whether the vector of required outputs is in the column span of the matrix whose rows are the input patterns. Since the row and column rank of a matrix are equal. we can guarantee zero error for any desired outputs if: i) The number of weights 3 the number of patterns, and ii) the patterns are linearly independent.
115
(For a discussion of interpolation properties of semilinear feedforward networks, see wartin, J.M.. 19901.) Now if the patterns do not span the input space, Lemma 2.3 fails to apply. However this is only a minor complication. Any weight vector w can be written as w = W + We where E span(xl ..... xt) and wC is in the orthogonal complement of this space. It follows from Lemma 2.1 that the matrix A simply leaves the part invariant. Thus, a simple extension of the arguments of Lemma 2.3 and the remarks following it shows that the mapping F(w) = /\w + tlh is contractive over the set of vectors which share a common vector Wr. Thus we will get convergence to a limit cycle. Note, however, that the particular
+
limit cycle obtained depends in this case on the wC part of the initial weight vector. A more serious problem is that L will be rank deficient. In this case there is no unique best w*,and (2.5) fails. This problem can in principle be tackled using the the singular value decomposition tools introduced in Section 3 but we omit the discussion in this paper. There is another reason for looking closely at the properties of L, as we shall now see.
2.2. The “epoch method”
Since we are assuming that we have a fixed and finite set of patterns xp. p = 1....t, an alternative strategy is not to update the weight vector until the whole epoch of patterns has been presented. This idea is initially attractive since it can be shown that this actually generates the steepest descent direction for the least squares error. We will call this the “epoch method‘’ to distinguish it from the usual delta rule. This leads to the iteration
n = (I - qXXT) = (I - qL). (2.6) is, of course, the equivalent of (2.2a), not (2.1), since it corresponds to a complete epoch of patterns. There is no question of limit cycling, and, indeed a fixed point will be a true least squares minimum. Unfortunately, however, there is a catch! To see what this is, where
we need to examine the eigenvalues of fl. Clearly L = X x ’ is symmetric and positive semi definite Thus it has real non-negative eigenvalues. In fact, provided the xp span, it is (as is well known) strictly positive definite. To see this we note that for any vector v, t
VTXPV =
c VT(XpXp’,V P=l t
=
c (VTX,)’
p= 1
.
(2.7)
116
Since the patterns span, at least one of quantities in brackets must be non-zero. The eigenvalues of R are 1 - q(the corresponding eigenvalues of XS), and for a strictly positive definite matrix all the eigenvalues must be strictly positive. Thus we have for 7 sufficiently small e(L)= < 1. Hence the iteration (2.6) will converge, provided the patterns span and is sufficiently small.But how small does q have to be? (Recall that for the usual 6 rule we need only the condition of Lemma 2.3) To answer this question we need more precise estimates for the spectrum of L and the norm of R. From these we will be able to see why the epoch algorithm does not always work well in practice. Suppose L = X S has eigenvalues [A? j = 1 ... n ) , with
ll~ll~
The eigenvalues of R are (l-qkl) C (1-qh2) C .... 6 (1-qh,), and e(n) = max ( l - ~ & Il-qk,, (Observe that fl is positive definite for small q, but ceases to be so when q becomes large.) Now
I
1,
I}.
Thus from (2.7)
Hence (2.9)
Note that this upper bound, and hence the corresponding value of l l required, is computable. On the other hand, we can get a lower bound by substituting a particular v into the expression on the right hand side of (2.8).For instance, we have for any k, k=l, ... t,
We next consider some special cases. Case 1: The xp collapse towards a single vector. This situation can arise in practice
117
when the neural net is q u i r e s to separate two classes of vectors which lie close together in the sense that the angle between their means is small. For definiteness, suppose for some f i x 4 v, 11v11* = 1,
xp = v + &e, where e,, is the pth standard basis vector. Then considering (2.8) and (2.9) we see that lim 1, = t. E 4
Also, lim A,, = 0 : this follows simply from the fact the rank of XXr collapses to 1. €+O Case 2: The xp cluster around two vectors u and v which are mutually orthonormal. If these represent two classes which are to be separated, we are in the ideal situation for machine learning. However, even in this case the behaviour of the epoch method is less than ideal. If the clusters are of equal size, we have from (2.8)
limA1)t/2 &+O
andagain
limI,=O:
€+O
since the rank of L = X S collapses to 2. Case 3 The example considered at the end of 2.1.1. Here n = 3, and as was described above, 1, = 6.372, h, = 0.628. From these values, the eigenvalues of R = (I- VL) are easily calculated, as given in Table 2.3.
Table 2.3
tl 0.1 0.2 0.3 0.4 0.5 0.6 0.7
0.937 0.874 0.912 1.550 2.186 2.823 3.411
A comparison of this table with Table 2.2 clearly illustrates the rapid growth of Q(R) compared with e(A), as tl increases from zero. In these cases we will need a relatively small value of 7 to get the spectral radius less than 1. In practice, the epoch method does tend to be less stable than the ordinary delta rule. An inkresting open question is whether we always have @A) c Q(n).Note also that
118
I, will be very small, and the corresponding eigenvalue of R close to 1: however, since small eigenvalues correspond to vectors nearly orthogonal to the span of the patterns, the exact effect of this is not immediately clear. This issue is addressed further in Section 3.1: see equation (3.3). Another way of looking at these results is to consider the iterations as approximations to the gradient descent differential equation. Provided the xp span, this differential equation (for the linear perceptron) obviously has a single, asymptotically stable fixed point which is globally attractive. However, at least in the three cases above, the differential equation is s t i r . (Compare Section 1.2). This stiffness is severe in the frst two cases, and marginal in Case 3. The epoch iteration (2.6) is simply Euler’s approximation to the descent differential equation. Euler’s method is not stiffly stable, and so only mimics the behaviour of the The l. iteration (2.2) provides a kind of stiffly stable differential equation for very small ? approximation to the differential equation, albeit having the disadvantage of an oscillatory solution.
2.3. Generalisation to non-linear systems As is well known, the usefulness of linear neural systems is limited, since many pattern recognition problems are not linearly separable. We need to generalise to non linear systems such as the backpropagation algorithm for the multi-layer perceptron (or semilinear feedforward) network. Clearly we can only expect this type of analysis to provide a local result: as discussed in Section 1, global behaviour is likely to be more amenable to dynamical systems or control theory approaches. Nevertheless, a local analysis can be useful in discussing the asymptotic behaviour near a local minimum. The obvious approach to this generalisation is to attempt the “next simplest” case, i.e. the backpropagation algorithm. However, this method looks complicated when written down explicitly: in fact much more complicated than it actually is! A more abstract line of attack turns out to be both simpler and more general. We will define a general non-linear delta rule, of which backpropagation is a special case. Suppose the input patterns x to our network are in Rn, and we have a vector w of parameters in R M describing the particular instance of our network i.e. the vector of synaptic weights. For a single layer perceptron with m outputs, the “vector” w is the the mxn weight matrix, and thus M = mn. For a multilayer perceptron, w is the Cartesian product of the weight matrices in each layer. For a general system with m outputs, the network computes a function G:RMxRURm, say v = G(w,x)
11 11.
where v E Rm. We equip RM, IRm and Rn with suitable norms Since these spaces are finite dimensional, it does not really matter which norms, but for definiteness say the For pattern xp, denote the corresponding output by vp. i.e. Euclidean norm
11 112.
vP = G ( w , x ~ ) . We assume that G is Frechet differentiable with respect to w, and denote by D = D(w,x) the mxM matrix representation of the derivative with respect to the standard basis. Readers unfamiliar with Frechet derivatives may prefer to think of this as the gradient
119
vector: for rn = 1 it is precisely the row vector representing the gradient when G is differentiated with respect to the elements of w. Thus, for a small change 6w and fixed x, we have (by the definition of the derivative) G(w+Gw,x) = G(w,x) + D(w,x)6w + o(l16wll) .
(2.10)
On the other hand for given w, corresponding to a particular pattern x,. we have a desired output yp and thus an error E~ given by
eP2 = cV,-v,~TcVp-vp~ = qpTq,,
say.
(2.11)
The total error is obtained by summing the EPZ's over the t available patterns, thus t
&2
=
CEp2. pi'
An ordinary descent algorithm will seek to minimise €2. However the class or methods we are considering generate, not a descent direction for &*, but rather successive steepest descent directions for E;. Now for a change 6qp in qp we have from from (2.11)
6ep' = (4,
+ %,)T(q,
= 2SqpTq,
+ 8qp) - q,Tq,
+ 6qpT6qp.
Since y, is fixed,
69, = - 6v,
=
- D(w,xp)'w + o(ll6wll)
by (2.10). Thus
6e: = - 2( D(w,xp)Gw)T( yp - G(w,xp) 1 + ~(IlSwll) =
- 26wT( D(W.X,) IT( yp - G(w,xp) ) + ~(IlSwll).
Hence, ignoring the o(l16wll) term, and for a fixed size of small change 6w, the largest decrease in &: is obtained by setting
6~ = T( D(w,xp) P(yp - G(w,xp) ) . This is the generalised delta rule, Compare this with the single output linear perceptron, for which the second term in this expression is scalar with G(w,xp) = wTxp,
and the derivative is the gradient vector (considered as a row vector) obtained by differentiating this with respect to w, i.e. xpT. Thus we indeed have a generalisation of (2.1).
120
Hence, given a kth weight vector wk, we have wk+1 = wk+ 6wk
= wk + q( D(Wk&p) IT( Yp - G(Wk+p)
.
(2.12)
The backpropagation rule mumelhart and McClelland, 1986, pp322-3281 is , of course, a special case of this. To proceed further, we need to make evident the connection between (2.12) and (2.1). However, there is a problem in that, guided by the linear case considered above, we actually expect a limit cycle rather than convergence to a minimum. Nevertheless it is ' , of (2.11): necessary to fix attention to some neighbourhood of a local minimum, say w clearly we cannot expect any global contractivity result as in general (2.12) may have many local minima, as is well known in the backpropagation case. Now from (2.10) and (2.12) we obtain (assuming continuity and uniform boundedness of D in a neighbowhood of w*), Wk+l
= wk + q( D(wk,xp) IT( yp - G(w*.xp) - D(W*.Xp)(Wk-W*)) + o(~lwk-w*l~) = (1 - qD(WkpXp)Q(W*,Xp))Wk + q( D(Wk.Xp) )T( yP - G(w*,Xp)+ D(W*,xp)W*) + o(~lwk-w*~~)
(2.13)
The connection between (2.13) and (2.1) is now clear. Observe that the update matrix (I - tlD(wk.xp)%(w*,xp))is not exactly symmetric in this case, although it will be nearly so if wk is close to w'. More precisely. let us assume that D(w,x) is Lipschitz continuous at w., uniformly over the space of pattern vectors x. Then we have wk+1 = (1 - qD(w*Jp)~(w*Jp))Wk + q( D(w*,Xp) IT( yp - G(W*Jp)+ D(W*,Xp)W*) + o(llwk-w*ll)
(2.14)
Suppose we apply the patterns xl,,..,xtcyclically, as for the linear case. If we can prove that the linearised part (i.e. what we would get if we applied (2.14) without 0 term) of the mapping Wk:+wk+t is contractive, it will follow by continuity that there is a neighbourhood of W' within which the whole mapping is contractive. This is because, by hypothesis, we have only a finite number of patterns. To establish contractivity of the linear part, we may proceed as follows. First observe that D(w',xp)~(w*,xp)is positive semi definite. Thus for 7 sufficiently , x 1.~ )We ~ ~ may ~ decompose the space of weight vectors small, 111 - ~ D ( w * , ~ ~ ) Q ( w * C into the span of the eigenvectors corresponding to zero and non-zero eigenvalues respectively. These spaces are orthogonal complements of each other, as the matrix is symmetric. On the former space, the iteration matrix does nothing. On the latter space it is contractive provided
q < l@( D(wf,xp)7D(w*,xp)1 .
(2.15)
We may then proceed in a similar manner to Lemma 2.3, provided the contractive subspaces for each pattern between them span the whole weight space. If this condition fails then a difficulty arises, since the linearised product mapping will have norm 1, so the
121
non-linear map could actually be expansive on some subspace. For brevity, we will not pursue this detail here. In the c89e of a single output network. D(w;x) is simply a row vector and any acceleration strategy for the linear algorithm based on Lemmas 2.1 and 2.3 should be fairly easy to generalise to the non-linear case. Even for the multi-output case, (2.15) suggests D(wk,xP)W(wk.xp) ) to control the choice of learning rate q. The matrix will only using have the same order as the number of outputs, which in many applications is quite small. It is regrettable that (2.14) has to be based on nearness to a local minimum w ' , but it is difficult to see how to avoid this. The fact that an algorithm generates a descent direction for some Liapunov function is not sufficient to force contractivity. The latter property in Lemma 2.3 arises from the convexity of the quadratic error surface. Nearness to a local minimum in (2.14) enforces this property locally. Nevertheless, the fact that the generalised delta rule only uses information local to a particular pattern xp, rather than the whole pattern matrix X. would still seem to be a very desirable property to have in view of the results of Sections 2.1 and 2.2. There is little to say about the special case of backpropagation, other than that for a multilayer perceptron. the derivative D(WJK)is relatively easy to calculate: this is what makes backpropagation of the error possible.
e(
3. FILTERING, PRECONDITIONING AND THE SINGULAR VALUE DECOMPOSITION Equation (2.14) shows that the non-linear delta rule can be locally linearised and then behaves like the linear case.For simplicity,.therefore. we shall largely restrict attention in this section to linear networks.
3.1. Singular values and principal components As is apparent from the Section 2, (see e.g. (2.7), (2.13) ), matrices of the form Yv are of importance in the context of non-linear least squares. We also pointed out after (2.7) that an analysis of the case when the matrix X S is rank deficient, or nearly so, is important in studying these problems. Not surprisingly therefore, this problem has received considerable attention in the literature of both numerical analysis and multivariate statistics. (See e.g. the chapters by Wilkinson and Dennis in [Jacobs, 19771, pages 3-53 and 269-312 respectively. Also chapter 6 of [Ben-Israel and Greville, 19741.) The exposition given here is based on that of Wilkinson. who deals, however, with complex matrices. We will consider only real matrices. The key construction is the singular value decomposition (SVD).
Theorem 3.1 (Singular Value Decomposition) Let Y be any mxn real matrix. Then a) Y may be factorised in the form
Y = PSQT
where P and Q are orthogonal matrices (mxm and nxn respectively), and S is an mxn matrix which is diagonal in the sense that sij = 0 if i # j. The diagonal elements (li = sii are non negative, and may be assumed ordered so that 'J1 2 U, ... 2 Clmin(m,n) 2 0. These a's are called the singular values of Y. (Some authors, including [Ben-Israel and Greville, 19741, define the singular values to be the non-zero a's.) The columns of Q are eigenvectors of V Y , the columns of P are eigenvectors of Y F ,and the non-zero singular values are the positive square roots of the non-zero eigenvalues of STS or equivalently of SST.
In fact, b) is used to prove a). We consider first the case n 2 m. The matrix V Y is nxn, symmetric and positve semi definite. Then with Q as defined in b), and assuming that the eigenvalues of Y are (J? arranged in non-increasing order, we have Q T W Q = diag(ai?) .
(3.2)
If Y is of rank r, then the last n - r of the Oi are zero. Let qi denote the ith column of Q, and pi = Yqi/ai
i = 1. ... r.
It follows from (3.2) that the pi form an orthonormal set. If' r c n. extend this set to an p " . for i = l,..,n, we orthonormal basis for IRn, by adding additional vectors ~ ~ + ~ , . . . , Then have ith column of YQ = Yqi = D i p i . Thus if P is the orthogonal matrix formed from the columns pi. and S is as defined in a),
Y Q = SP
or
Y
=
PSQT.
This completes the proof of a) for n 2 m. The final part of b) namely that the pi are. eigenvectors of Y, follows froni the observation that Y Y r = PSQTQST
=
PSSTPT.
Transposing (3.1) gives YT = QSVT, from which the case n < m may be deduced. c) is simply the observation [Isaacson and H.B.Keller, 1966, plO] that llYll3 = @(m), where Q denotes the spectral radius.
123
Note that the condition n C m is essential. Otherwise, v could be in the kernel of Y, even
if an * 0.
It is important to emphasise that Theorem 3.lb) should not be used to compute the singular value decomposition (3.1) numerically, as it is numerically unstable. A popular stable algorithm is due to [Golub and Reinsch, 19701. As in (2.4). we denote by W* a weight vector which minimises the total least squares error for a single output linear perceptron. We observed that a fixed point of the "epoch method'' (2.6) will satisfy the normal equations (2.4b). In the discussion of (2.6). we pointed out that in certain plausible leaming situations, the matrix X whose columns are the input patterns, may be rank deficient or nearly so, with the result that the iteration matrix R in (2.6) may have an eigenvalue of 1 or very close to it. The remark was made that this might not matter in practice: we can use the SVD to investigate this further. Replacing Y by X in Theorem 3.1, we write X = PSQTwhere P and Q are orthogonal and S is diagonal (but not necessarily square). To be consistent with notation used later, we will call the singular values of X (the diagonal elements of S) v~...v,If we here denote by y the column vector of outputs (y1,...yJT corresponding to the input patterns xl,....xt, then the total least squares error E (2.4a) may be re-written
since. Q is orthogonal. Now set z = p w , and u = QTy. Then if X has rank r, i.e. r non zero singular values, then
It is now obvious that the least squares problem is solved by setting q' = ui/Vi, i = 1,...,r, choosing the other q' arbitrarily (but sensibly zero), and then setting W* = Pz*. The minimum emor is then given by the second sum on the right hand side of (3.3). We note that if r # M,where M is the number of input patterns, then w' is not unique, but if the undetermined zi are set to zero as suggested, then we get as W* the solution with minimal two norm. In matrix form, let S# be the matrix defined by sii# = 1/Vi, i = 1,...r, and sij# = 0 otherwise. Then Z* = S%I or W* = PS#QTy . The matrix (P)# = PS#QT is called the Moore Penrose psuedoinverse of XT: its properties are discussed at length in pen-Israel and Greville, 19741. However, (3.3) makes apparent a fundamental problem with the least squares approach when X is nearly, but not exactly, rank deficient. As indicated in Section 2 on the
124
discussion of the epoch method, this is likely to occur even in problems that are “good” from the learning point of view. Very small, but non-zero singular values have a large effect both on W* itself and on the error as given by (3.3). although they correspond to “noise” in the pattern matrix X: i.e. they do not give us useful information about separating the input patterns. Indeed the small singular values correspond to similarities in the patterns, not dSfSerences. The decomposition of a data matrix by the SVD in this way, in order to determine the important differences, is called by statisticians principal component analysis. and sometimes in the medical field factor analysis. Now let us take another look at the iteration (2.6). namely
where fl = (I - qXXT). In terms of the notation developed here, we have
- qkx y . wk+1 = (1 - ~ P s s ~ T ) w
or with the obvious extension of notation,
zk+1 = (I - qssT)zk - qsu .
(3.4)
At this point the notation becomes a little messy: let us denote by (zk)i the ith element of zk. These elements are decoupled by the SVD, more specifically (3.4) when written elementwise gives
( z ~ + ~ ) (1 ~ =- vi2)(zk)i - qviui. for i = l,..,r and
( z ~ + ~ ) (zk)i ~ = for i = r+l, ...,M As expected, the iteration has no effect on components corresponding to zero singular values. Moreover the good news is that with a suitable choice of 7, the iteration will converge rapidly for those components that “matter”, i.e. those connected with large singular values. The bad news is that this will not be apparent from an examination of the least squares error (3.3). as this will be dominated by the slowly convergent terms. This is unfortunate, as many published software packages for backpropagation, for example that accompanying [Rumelhart and McClelland, 19871, use the least mean square error for termination. Various authors, e.g. [Sontag and Sussman, 19911, have suggested that the use of this criterion is not ideal for solving the pattern classification problem. An interesting question for further study might be to see if the delta rule classifies better if it is not iterated to convergence. but only for the principal components as defined in Section 3.2. If so. what stopping condition could be used?
125
3.2. Data Compression, Perceptrons, and Matrix Approximations Another important property of :he SVD is its use in solving matrix approximation problems. It is possible to use perceptrons for data compression: one of the few occasions on which cascading linear networks is useful. This approach has been discussed in a connectionist context by @3aldi and Horn&, 19891 using the language of principal component analysis. In fact their result is equivalent to a standard result of approximation theory, although the paper is of c o m e of value in pointing out the connection. As before, let us consider a perceptron with input vectors xl, ...,xt, and assume that these vectors are in RM. Instead of a single output, we now consider t output vectors yl, ....,yt in IR*. The weights thus form an nxM matrix W. The intention here is that n < M. For instance, the input vectors might be bit-mapped images, which we wish to store in compressed form. To decompress the y vectors, we feed them (without thresholding) into a further perceptron with Mxn weight matrix V. producing (unthresholded) output vectors ol. ....,oc The idea is that each oi should be a good approximation to the corresponding xk Of course, rank(VW) 6 n < M. There is no point in trying to make WV approximate I if there is no commonality in the input patterns the compression task is hopeless. So. let us again form the matrix X whose columns are the xi’s. Our aim is now to choose W and V to minimise IlX - VWXlls, where denotes the Schur norm. (The Schur norm of a matrix is simply the square root of the sum of the squares of all its elements.) Matrix approximation problems of this type are discussed in some detail in Chapter 6 of [Ben-Israel and. Greville, 19741: we will give the solution here just for this particular problem. We first observe that for any matrix A and compatible orthogonal matrix 0. llOAlls = since, indeed, multiplication of each column of A by 0 preserves the sum of squares. Similarly for postmultiplication. Now, as above, let X = PSQT be the SVD of X, and suppose rank(X = r. The crucial stage is to find a matrix H satisfying r a n k 0 C n and which minimises l/X - PHPTXII,. Once we have H it is not difficult to factorise PHPT to get W and V. But
11 Ils
llAlls
IlX - PHPTXIIs
=
110 - PHPT)Xlls = 110 - PHPT)PSQTJls
IIP - PH)Slls
= =
i vi2( i=1
(1
= Il(1
- hii)2 +
-
H)SllS
c hj? ) , j+i
where, as above, we denote the singular values of X by vi. Obviously, at the minimum H is diagonal. But we require rank(H) < n. Thus the minimising H is obtained by setting hi = 1, i = 1,...,min(r,n) and the other diagonal elements to zero. If r G n, there is no loss of information in the compression, and the patterns xp are reconstructed exactly. If r > n, then the total error over all patterns is given by r CVi’. i=n+l
126
It remains to perform the factorisation VW = PHPT. While the choice of H is unique, this is not so for the factorisation. However, since PHPT is symmetric, it makes sense to set
w. mT.
V = In fact we have H f l = H, whence PHPT = P m T = PH(PH)T . PH has (at most) n non zero columns: we may take these as V and make W = Vr = the f i s t n rows of Thus the rows of W are those eigenvectors of X f l corresponding to the largest singular values: the principal components. The effect of W is to project the input patterns xp onto the span of these vectors. Specifically, if Y is the matrix whose columns are the compressed patterns yI. i = l,..,t, and G is the matrix formed from the f i s t n rows of fl then
Y = WX = GPTX = GSQT. The importance of the matrices P, Q and S arising from the SVD is clearly illustrated here. Of course, calculation of the SVD is not a natural connectionist way to calculate V and W: as maldi and Hornik, 19891 point out, they can be computed using backpropagation. The importance of the SVD is not restricted to discrete time semilinear feedforward networks: see for example Keen, 19911 where it is shown to be the crucial factor in determining the behaviour of a continuous time representation of a neural feature detector. We remark in passing that the SVD can also be used to solve matrix approximation indeed the construction of the Moore Penrose pseudoinverse described problems in above can be regarded in this light.
11 112:
3.3. Filters, Preprocessing and Preconditioning Many authors have commented on the advisability of performing some preprocessing of the input patterns before feeding them to the network. Often the preprocessing suggested is linear. At f i s t sight this seems to be a pointless exercise, for if the raw input data vector is x, with dimension 1, say; the preprocessing operation is represented by the nxl matrix T, W is the input matrix of the net and we denote by the vector h the input to the next layer of the net. then
h =WTx.
(3.5)
Obviously, the theoretical representational power of the network is the same as one with unprocessed input and input matrix WT. However, this does not mean that these preprocessing operations are useless. We can identify at least the following three uses of preprocessing. i) To reduce work by reducing dimension and possibly using fast algorithms (e.g. the FFT or wavelet transform). (So we do not want to increase the contraction parameter in the delta rule iteration.)
ii) To improve the search geometry by removing principal components of the data and corresponding singular values that are irrelevant to the classification problem. iii) To improve the stability of the iteration by removing near zero singular values
127
(which correspond to noise) and clustering the other singular values near to 1: in the language of numerical analysis to precondition the iteration. We will not address all these three points here directly. Instead we will derive some theoretical principles with the aid of which the issues may be attacked.
3.3.1. Filters and stability of learning The first point to consider is the effect of the filter on the stability of the learning process. For simplicity, we consider only the linear case here. We hope, of course, that a suitable choice of filter will make the learning properties bettm, but the results here show that whatever choice we make, the dynamics will not be made much worse unless the filter has very bad singular values. In particular, we show that if the filter is an orthogonal projection, then the gradient descent mapping with filtering will be at least as contractive as the unfiltered case. Considering first the "epoch method" (2.6).we see from (2.6) that the crucial issue is the relationship between the unfiltered update matrix R = (I - qlXm and its filtered equivalent (I - 72TXX"P) = say. Note that these operators may be defined on spaces of different dimension: indeed for a sensible filtering process we would expect the filter T to involve a significant dimension reduction. Nole also that we have subscripted the learning rates '1 since they might be different. A natural question is to try to relate the norms of these two operators, and hence the rate of convergence of the corresponding iterations. As in Section 2, we suppose L = X p has eigenvalues (Ap j = l...n], with
(Note here we assume the x's span so
A,, #
0. In terns of the singular values vi of X,
v? = li)
The eigenvalues of R are (l-qlll) 6 (1-q112) .... < (l-q,ln),and
with a similar result for the filtered iteration matrix R'. Hence we need to relate the eigenvalues of X f l with those of T X m = L', say. Let L' have eigenvalues pl b p2 )...a pl A I and T have singular values ul 2 c2)....a 0, > 0. Note that we are assuming T has full rank n.
Proposition 3.2 With the notation above, pl 4 'J,2hl and p., 2 un2A.,,.
Proof The frst inequality is straightforward. Since the L and L'are symmetric
128
The second inequality is only slightly more difficult. Let u, be the normalised eigenvector of L'corresponding to p,. Then
But IImunl12 2 An%,,
by a double application of Theorem 3.ld). I -
This result means that Iln'l12cannot be much larger than llnllzif T has singular values close to 1. Many of the most useful filters are projections (although many others, e.g. edge detectors, are not). Projections can be defined in arbitrary vector spaces, and orthogonal projections in general inner product spaces, but for simplicity we will here consider only R n with its usual inner product.
Definition 3.3 a) A linear mapping PIRQIRn is said to be a projection if P(Pv) = Pv for all v E IRn.
b) A projection is said to be orthogonal if (v-Pv)'%
= 0 for all v E Rn.
I Given a subspace S of Rn, the mapping that takes each vector v to its best least squares approximation from S is an orthogonal projection onto S: in fact it is the only such orthogonal projection with respect to the standard inner product. The orthogonality condition in Definition 3.3b) is simply the normal equations for the least squares approximation problem. We list some properties of projections in the next lemma.
Lemma 3.4 For any projection (excluding the trivial one Pv = 0 for all v). a) All the eigenvalues of P are 0 or 1. b) For any norm
11 11 on IRn, we have for the corresponding operator norm, llPil 2
1,
c) If P is an orthogonal projection, llP112 = 1 and indeed all the non-zero singular values of the matrix representation of P with respect to an orthonormal basis are 1.
Proof a) An immediate consequence of Definition 3.3a) is that any eigenvalue A of P must satisfy A2 = A. The only solutions of this equation are 0 and 1. b) Recall llPll 2 llPwll for all w E IRn satisfying llwll = 1. Choose w = Pv/IIPvII for any v # 0.
129
c) Clearly it is sufficient to prove this for a particular orthonormal basis, since changes between such bases are represented by orthogonal matrices which will not change the singular values (compare Theorem 3.1). Let S be the image of P, and construct an orthonormal basis for S , say [sl. ...,s,.), where r is the dimension of S and r C n. Extend this basis to an orthonormal basis for IRn by adding extra vectors (sr+l, ...,%). With respect to this basis we have Psj = sj. j = 1,...,r so the first j columns of the matrix =presentation of P are the first r columns of the n dimensional identity matrix: indeed this is true for any projection onto S, not just the orthogonal projection. However, for an orthogonal projection and for j > r, we have
Since Psj E S and sj is orthogonal to all elements of S by construction, the first term on the right hand side is zero. Thus Psj = 0. Hence the last n - r columns of the matrix representation of P are zero vectors.
a
In practical applications, we do not retain the whole of Wn when using orthogonal projections. In fact the more usual approach is to start by selecting an orthonormal basis and deciding which r of the n basis vectors to “keep”. We may combine Proposition 3.2 and Lemma 3.44 as follows.
Corollary 3.5 Let Isl,...,s,,) be an orthonormal basis for R*. Suppose we express each pattern vector x in terms of Isl,...,s,,) and then discard the component corresponding to [s,.+,, ...,s,,). (Hence x is represented in compressed form by an r-vector. s say, and the problem dimension has been reduced from n to r.) If T is the matrix representing the filter which takes x to s, then
Proof Clearly T represents the non-zero part of an orthogonal projection. Thus all its singular values are 1. The result now follows from Proposition 3.2.
a
This result means that we can apply orthogonal projections to reduce the dimension of our problem without a deleterious effect on the contractivity of the steepest descent operator. We now give some concrete examples. Examples i) The discrete Fourier transform. The basis functions s, whose kth element is defined Thus we may smooth by (s,), = e W - l ) ( k - l U n , where i* = -1, are orthogonal in 0. and compress our patterns xp by expressing them in terms of this basis by means
130
of the Fast Fourier Transform algorithm, and then deleting terms involving high frequencies. (Complex numbers can, of course, be avoided by using sines and cosines.) ii) Other filters based on orthogonal expansions can in principle also be used: the wavelet basis is of course particularly attractive. In general, the basis functions will not be orthogonal with respect to the discrete inner product: since they are normally defied with respect to the inner product of functions defined on an interval. Nonetheless, it is reasonable to suppose that when the projection operators and inner product are discretised, we will land up with an filter matrix T with singular values close to 1, although a further study of this question would certainly be useful.
iii). A less obvious example concerns pixel averaging in image processing. A common way of reducing the size of a grey scale image is to take pixels in two-by-two blocks and average the grey levels for the four pixels, thus reducing the linear resolution by two and the number of pixels by four. This is not a mapping from Rn to itself, but we can make it one by replicating each averaged pixel four times, or equivalently, in each block of four pixels we replace each pixel by the average of the four grey levels in the block. Thus if a block of four pixels initially has grey levels g,.g2,g3,&, after averaging each pixel will have grey level (g1+g2+g3+&)/4. This is obviously a projection: less obviously it is an orthogonal projection. For in each block of pixels we have =
gi
-
(gl+g2+g3+g4)/4, i = 1.2.3.4.
Hence “(v
-
Pv)Tv”
= g1+gz+gs+g,
- 4(g1+gz+g3+&)/4
= 0.
Thus if correctly implemented, pixel averaging should not reduce the rate of convergence of the delta rule. This appears to be contrary to the results given in wand, Evans and Ellacott. 19911 in which it was reported that for a backpropagation net, the rate of convergence was degraded. This must be due either to the effect of the non-linearities, or, more likely, to the way in which the iteration was implemented. The authors intend to reconsider the experimental results in the light of this theoretical development. It is unfortunate that we have not as yet obtained similar results for the iteration (2.1). The very stability of (2. l), together with the fact that it contains information only about one pattern, makes it much more difficult to obtain corresponding bounds. However, (2.1) and (2.6) are. asymptotically the same for small q, it is to be expected that “good” filters for (2.6) will also be “good” for (2.1). There is no reason in principle why the results of this section cannot be extended to non-linear networks via (2.14), although they might be more difficult to apply in practice. The crucial issue is not the effect of the filter on a pattern x, but rather on the Frechet derivative matrix D(w,x). (2.13) shows that principal components of D(w,x) correspond to important directions in the topology of the search space.
131
3.3.2. Data compression and the choice of a filter How do we go about choosing a suitable preprocessing strategy? Usually we search for some insight as to what features of a particular problem are important. This insight may be based on biological analogy, such as attempts to mimic the processing carried out in the human visual cortex; on signal processing grounds (e.g. Fourier, Gaussian or wavelet filters) or on the basis of some mathematical model. However the key issues sre the effect on the learning geometry and the learning dynamics. What algebraic properties should the filter have? One such property has been addressed in the previous section: it should have non-zero singular values close to one unless an alternative choice can be shown to cluster the singular values of the iteration matrix. Another issue to consider is the reduction in work resulting from compressing the input data. Recall the situation described at (3.5). Our raw input data vector is x, with dimension 1, say; the preprocessing operation is represented by the nxl matrix T. W is the input matrix of the net and we denote by the vector h the input to the next layer of the net, then h
= WTx.
(3.5)
In a typical vision application. we may find that the dimension 1 of the input pattern x is much greater than the dimension of h which is the number of nodes in the next layer of the network. This means that WT has very small rank compared with the number of columns. We would like'to choose T so that there exists W such that
for any V (of appropriate size) and input pattern x, but for which the number of rows of T is not much greater than the rank of W. i.e. the dimension of h. Such a choice would mean that the number of free parameters in the learning system (the number of elements of W compared with V) has been drastically reduced, while the achievable least squares error and the representational power of the network is unchanged. Since (3.6) is to hold for each pattern x, we actually want VX = WTX
(3.7)
where, as previously, X is the matrix whose columns are the patterns. Obviously, we cannot, in general, satisfy (3.7) exactly. However, an approximate solution to this problem is once again provided by the singular value decomposition. Suppose we express X as PSQT where P and Q are orthogonal and S is the diagonal matrix of singular values of X, say V1 2 v2 )....a v1 >, 0. Then
Suppose we wish T to be an rxl matrix, with r < 1 so that rank T S r. It is natural to choose T = GPT where G is the rxl matrix with gi, = 1 if i = j and 0 otherwise. (T thus represents a truncated Karhunen Loeve expansion.) We obtain
132
Clearly the best possible choice of W here is the f i s t k columns of VP and with this choice
With this choice of T. the maximum error in replacing V by TX will be negligible provided r is sufficiently large that vW1is negligible compared with vl. Moreover, T is an orthogonal projection so corollary 3.5 applies. Tht Karhunen Locve expansion is thus the natural choice of linear filter for rank reduction. (Non linear filters such as edge detectors. radial basis networks, or adaptive filters such as that of [Lenz and Osterberg, 19901 could be better, of course). But the expansion does not “solve” the problem of linear filtering. As has already been pointed out, retaining just the principal components of X may not retain the information that is relevant to a particular classification problem: the components corresponding to smaller singular values may in fact be those that we require! We can only be confident in removing singular values at the level of random noise or rounding error. Even ignoring this problem, there are other objections to routine use of the Karhunen Loeve expansion: i) Although optimal for rank reduction, the Karhunen Loeve expansion is not optimal as a pnxonditioner. It does remove very small singular values, but leaves the others unchanged, while we wish to cluster the eigenvalues of X P . ii) Detennining the Karhunen Loeve expansion is difficult and expensive computationally. Moreover in a practical learning situation, we may not have the matrix X W y available anyway. Even assuming that we do have all the leaming patterns x available at once, the very large matrix X will be difficult to construct and store.
iii) The expansion q u i r e s a priori knowledge of the actual data: it does not give us a filter that could be hardwired into the leaming as part of (say) a robot vision system. The first point implies that even if we do compute the expansion, we may need to combine it with a preconditioner: we will consider preconditioning below. Now the second point. The standard algorithms for singular value decomposition are not easily implementable on neural networks. As we have seen. the principal component space of X can be computed by backpropagation, but in the linear case this is a computation as expensive as the problem we are trying to solve. In the non-linear case which is the one of most practical interest, computation of the expansion might be worthwhile but is still likely to prove challenging and costly. One possible way out of this difficulty has been suggested by wickerhauser, 19901. He recommends a preliminary compression in terms of a wavelet basis which is chosen optimally in the sense of an entropy-like measure called the theoretical dimension. While Wickerhauser uses this as an intermediate stage to compute the Karhunen Loeve expansion, in the neural net context it might be more sensible just to accept the degree of compmssion provided by the wavelets. A more simple minded approach is of course, is to compute the filter T on the basis of a representative subset of the data rather than all of it. But we then have the problem of how to pick such a subset.
133
The third point is the most fundamental. We do not really want a filter tied to a particular set of data: we want a fixed strategy for preprocessing that will cover a large class of different, albeit related, learning problems such as that used by mammals for vision. Obviously to have any hope of success, we must assume that there is some common structure in the data. What type of structure should we be looking for? As far as data compression is concerned, the information that is most likely to be available and usable is some condition that will guarantee the rapid convergence of some orthogonal expansion or similar approximation process. In practical applications, the pattern vectors x are likely to be spatial or temporal samples of some nondismte data. In vision applications, grey scale images come from A to D conversion of raster scan TV images. In speech recognition, the incoming analogue waveform is either sampled in time or passed through some frequency analysis process. In many cases, therefore, it should be possible to employ some measure of smoothness: differentiability or modulus of continuity in the space or time domain; or conditions on the rate of convergence to zero at fw of the Fourier transform. Given such conditions on the data, rate of convergence results abound in the approximation theory textbooks for the classical basis functions (polynomials, splines, Fourier series) and are beginning to appear for the newer bases such as wavelets or radial basis functions. ([Wickerhauser, 19901, Powell, 19901). (Note that this linear use of radial basis functions is different from the radial basis neural nets found in the literature, which use the basis in non-linear fashion to aid the classification problem.) Of course, the use of orthogonal expansions to preprocess neural net data is commonplace on an ad hoc basis, but a rigorous analysis of the data compression and preconditioning effects would Seem to be overdue.
3.3.3. The optimal preconditioner As in the case of data compression, an ideal choice of filter to act as a preconditioner would not require knowledge of the particular data set under consideration. But while the
requirement - rapid convergence - for a data compressor is obvious, there does not seem to be any clear handle on the preconditioning problem. Therefore we only consider preconditioners based on a known data matrix X. First observe that the theoretically optimal preconditioner for the iteration (2.6) is both easily described and completely useless! Suppose, as above, X has singular value decomposition PSQT. We set T to be the Moore Penrose inverse of X (see the remarks and definition after equation (3.3)). i.e.
T
= X* = QS#€’T.
Then TX
= QS#P‘PSQT
= QS#SQT.
Thus
L‘
= T X W = QS#SSTSflQT,
and S#SSTSm is a diagonal matrix with diagonal elements either 0 or 1. Thus all the eigenvalues of L‘ are either 0 or 1, and indeed, if the x’s span so that X x ’ has no zero
134
eigenvalues, then all the eigenvalues of L' are 1. With 7 = 1, the iteration (2.6) will converge in a single iteration. This is not surprising, since once we know X#, the least squares solution for w may be given explicitly. For the non-linear case we would need to compute the compute local pseudoinverses for the vectors D(w,x) (compare (2.13) and (2.14)). This amounts to local solution of the tangent linear least squares problem at each iterate, and if we are going to go to such lengths, we would be better off using a standard non-linear least squares solver. Moreover, in practice, as we have seen, X f l is likely to have small eigenvalues. so a stable computation of X# is likely to be difficult. A modification of the approach which might be slightly more practicable is just to remove the large eigenvalues of XX? based on computation of the dominant singular values, and corresponding singular vectors, of X. We will present an algorithm for removing the principal components one at a time. It should be emphasised that this algorithm has not been tested even for the linear case, and some care would be needed to make it work for non-linear networks. Moreover, whether an approach based on removal of individual singular values is going to be very useful is debatable: it may help if the data matrix X is dominated by a few principal components with large singular values but otherwise it it likely to be too inefficient. (Methods for simultaneous computation of more of the spectrum, e.g. Amoldi iteration, do exist. However they are of course more complicated.) In spite of these reservations, the algorithm here is presented as an indication that a trivially parallelizable method is at least in principle possible, and to indicate the tools with the help of which a practicable method might be constructed. The f i s t stage is to compute the largest eigenvalue and corresponding eigenvector of XX?. This may be carried out by the power method [Isaacson and Keller, 1966 p.1471 at the same time as the ordinary delta rule iteration: we start with an arbitrary vector u, and simply perform the iteration
Since
xxr
c. xpxp'. I
=
F=l
Thus the iteration can be performed by running through the patterns one at a time, just as for the delta rule itself. The sequence uk will converge to a normalised eigenvector p1 of X x ' corresponding to the largest eigenvalue I , of Xx'. A, itself is conveniently estimated from the Ruyleigh quotienr ukTxfl,k: see [Isaacson and Keller, 1969, p.142.1. Note that since X f l is symmetric and positive definite, repeated eigenvalues will not cause problems. Having determined pl and A,, we set
(3.8)
135
We have X = PDPT where, of course, p1 is the F i t column of P and h , the first element of the diagond matrix D. Since P is an orthogonal matrix,
and, of course, n
Xx’ = PDPT =
&PIPIT, i= 1
Hence, with T as in (3.8), we have
since the pi’s are orthonormal. By a similar calculation
Thus T X m has the same eigenvectors as Xx’, and the same eigenvalues but with A1 replaced by 1. As indicated by (3.5). each pattern xp should then be multiplied by T, and, since we are now iterating with different data, the current weight estimate w should be multiplied by It is easy to check that T-’
= I
+ (hi’/* -
T-l.
1)pIpIT.
Since )cl is a “large” eigenvalue, this process is well conditioned. Observe that calculation of Tx and T-1w each can be achieved with evaluation of only a single inner product. The process can of course be repeated to convert further eigenvalues to 1. Basically the same idea can be used for the iteration (2.2). However there is a problem in that the matrix A is not exactly symmetric, although it is nearly so for small 7. This could be overcome by computing the right as well as left eigenvectors of (A - I)m, but unfortunately this would require presenting the patterns in reverse order: somewhat inconvenient for a neural system. Another possibility is it perform two cycles of the patterns, with the patterns in reverse order on the second cycle. The composite iteration matrix ATA will then be symmetric. However, since we have seen that (3.8) processes
136
principal components of the data, it might be better just to use the preconditioner (3.8) instead. Although space and the requirements of simplicity do not permit a full discussion here, them is no reason in principle why this algorithm should not be applied to the non linear case. However, a considexation of (2.14) indicates that it would not be appropriute just to process the input data. preconditioning based on the entire derivative D(w,x) is required. For the general projected descent algorithm (2.4), this could be very complicated. However the special semi-linear structure of the multi-layer perceptron and similar architectures suggests that we think of the preconditioning as defining a “two-ply” network. (3.5) amounts to a factorisation of the input layer of an MLP which we could think of as being made up of two sub-layers-with units between whose activation function is simply the identity. Similarly we could factorise every layer into two sub-layers or “plies”. In the composite network, the top ply of each layer would be trained by nom-ial backpropagation, whereas the lower ply, trained by an algorithm such as that outlined above, could be thought of as a “slow learning” feature detector whose purpose is to “tune” the network to the general properties of the data being learnt. Note that its properties depend only on the input patterns and the network state, not explicitly on the ouput values. But, further consideration of this idea must be deferred to another occasion. 4. FUTURE DIRECTIONS
We have considered here only the simplest architectmes, in order to demonstrate the kind of results that can be proved and the tools that might be used to prove them. Still wanting are detailed analyses of the backpropagation algorithm and other learning algorithms, and the effect of various filters on non-linear leaming. Progress in this direction should yield much improved algorithms and filters, together with a better understanding of the dynamics of learning. Moreover in this paper we have largely restricted discussion to feedforward networks, although Section 2.3 is much more general. When discussing recursive networks, there are two dynamical processes to consider: the learning dynamics and the intrinsic dynamics of the network itself. It is not sufficient in practice to prove (e.g. using a Liapunov function) that a network is stable, if convergence is very slow. The tools presented in this paper can also be used to analyse and improve the asymptotic behaviour of the network itself, particularly for discrete time realisations. Thus the reader is hopefully convinced of the usefulness of the numerical analysis approach when discussing neural networks. The author does not wish to imply that this is the only mathematical technique of importance. There is a real need to weld the various techniques of analysis into a single coherent theory.
REFERENCES Baldi, P and Homik, K, 1989 “Neural networks and principal component analysis: learning from examples without local minima”, Neural Networks, v01.2, no. 1. Ben-Israel, A and Creville. T N E, 1974 “Generalised inverses, theory and applications”, Wiley.
137
Bunday, B D, 1984: “Basic optimisation methods”. Edward Amold, England. Ellacot, S W, 1990: “An analysis of the delta rule”. Proceedings of the International Neural Net Conference, Paris,pp 956-959. Kluwer Academic Publishers. Golub, G H, and Reinsch, C, 197OSingular value decomposition and least squares solutions”, Numerische Mathematik vol. 14, pp 403420. Hand, C, Evans, M and Ellacott, S W, 1991: “A neural network feature detector using a mullti-resolution pyramid” in “Neural networks for images, speech and natural language, eds. B Linggard, C Nightingale, in press. Isaacson, E and Keller, H B, 1%: “Analysis of numerical methods”, Wiley. Jacobs, D (4) 1977: . “The state of the art in numerical analysis”, Academic Press. Keen, T K, 1991: ”Dynamics of learning in linear feature discovery networks”, Network, V O ~ .2, pp 85-105. Lenz. R and Osterberg. M, 1990: “Learning filter systems”, Proceedings of the International Neural Net Conference, Paris,pp 989-992. Kluwer Academic Publishers . Martin, J M, 1990: “On the interpolation properties of feedforward layered neural networks”. Report NWC TP 7094, Naval Weapons Center, China Lake, CA 935556001, USA. Oja, E, 1983: “Subspace methods of pattern recognition”. Research Studies Press, Letchworth, England. Powell, M J D, 1990: “The theory of radial basis function approximation in 1990”, Report DAMTP 1990/NAll, Dept. of Applied Maths. and Theoretical Physics, Silver Street, Cambridge, CB3 9EW, England. Rumelhart. D E and McClelland, J L , 1986 “Parallel and distributed processing: explorations in the microstructure of cognition”, vols.1 and 2, MIT. Rumelhart, D E and McClelland, J L 1987: “Parallel and distributed processing: explorations in the microstructure of cognition”, vo1.3, MIT. Sontag, E D and Sussman, H J, 1991: “Backpropagation separates where perceptrons do”, Rutgers Center for Systems and Control, Dept. of Math., Rutgers University, New Brunswick, NJ 08903, USA. Vidyasagar, M, 1978: “Nonlinear systems analysis”, Prentice Hall. Wickerhauser, M V. 1990: “A fast approximate Karhunen Loeve expansion”, preprint, Dept. of Math., Yale University, New Haven, Connecticut 06520.
This Page Intentionally Left Blank
Mathematical Approaches to Neural Networks J.G. Taylor (Editor) 0 1993 Elsevier Science Publishers B . V . All rights reserved.
139
SELF-ORGANIZING NEURAL NETWORKS FOR STABLE CONTROL OF AUTONOMOUS BEHAVIOR IN A CHANGING WORLD S. Grossberg# Department of Cognitive and Neural Systems, Boston University, Boston, MA, USA 1. INTRODUCTION: NONLINEAR MATHEMATICS FOR DESCRIBING AUTONOMOUS BEHAVIOR IN A NONSTATIONARY WORLD
The study of neural networks is challenging in part because the field embraces multiple goals. Neural networks to explain mind and brain are not evaluated by the same criteria as artificial neural networks for technology. Both are ultimately evaluated by their success in handling data, but data about behaving animals and humans may bear little resemblance to data that evaluates benchmark performance in technology. Although most artificial neural networks have been inspired by ideas gleaned from mind and brain models, technological applications can sometimes be carried out in an off-line setting with carefully selected data and complete external supervision. The living brain is, in contrast, designed to operate autonomously under real-time conditions in nonstationary environments that may contain unexpected events. Whatever supervision is available derives from the structure of the environment itself. These facts about mind and brain subserve much of the excitement and the intellectual challenge of neural networks, particularly because many important applications need to be run autonomously in nonstationary environments that may contain unexpected events. What sorts of intuitive concepts are appropriate for analysing autonomous behavior that is capable of rapid adaptation t o a changing world? What sorts of mathematics can express and analyse these concepts? I have been fortunate to be one of the pioneers who has participated in the discovery and development of core concepts and models for the neural control of real-time autonomous behavior. A personal perspective on these developments will be taken in this chapter. Such a perspective has much to recommend it a t this time. So many scientific communities and intellectual traditions have recently converged on the neural network field that a consistent historical viewpoint can simplify understanding. When I began my scientific work as an undergraduate student in 1957, the modern field of neural networks did not exist. My main desire was to better understand how we t This research was supported in part by the Air Force Office of Scientific Research (AFOSR F49620-92-5-0225), DARPA (AFOSR 90-0083 and ONR N00014-92-J-4015), and the Office of Naval Research (ONR N00014-91-J-4100). The authors wish to thank Cynthia Bradford and Diana J. Meyers for their valuable assistance in the preparation of the manuscript.
140
humans manage to cope so well in a changing world. This required study of psychological data to become familiar with the visible characteristics of our behavioral endowment. It required study of neurobiological data to better understand how the brain is organized. New intuitive concepts and mathematical models were needed whereby to analyse these data and to link behavior to brain. New mathematical methods were sought to analyse how very large numbers of neural components interact over multiple spatial and temporal scales via nonlinear feedback interactions in real time. These methods needed to show how neural interactions may give rise to behaviors in the form of emergent properties. Essentially no one at that time was trained to individually work towards all of these goals. Many experimentalists were superb at doing one type of psychological or neurobiological data, but rarely read broadly about other types of data. Few read across experimental disciplines. Even fewer knew any mathematics or models. The people who were starting to develop Artificial Intelligence favored symbolic mathematical methods. They typically disparaged the nonlinear differential equations that are needed to describe adaptive behavior in real time. Even the small number of people who used differential equations to describe brain or behavior often restricted their work to linear systems and avoided the use of nonlinear ones. It is hard to capture today the sense of overwhelming discouragement and ridicule that various of these people heaped on the discoveries of neural network pioneers. Insult was added to injury when their intellectual descendants eagerly claimed priority for these discoveries when they became fashionable years later. Their ability to do so was predicated on a disciplinary isolation of the psychological, neurobiological, mathematical, and computational communities that persisted for years after a small number of pioneers began their work to achieve an interdisciplinary synthesis. Some of the historical factors that influenced the development of neural network research are summarized in Carpenter and Grossberg (1991) and Grossberg (1982a, 1987, 1988). The present discussion summarizes several contributions to understanding how neural models function autonomously in a stable fashion despite unexpected changes in their environments. The content of these models consists of a small set of equations that describe processes such as activation of short term memory (STM) traces, associative learning by adaptive weights or long term memory (LTM) traces, and slow habituative gating or medium term memory (MTM) by chemical modulators and transmitters; a larger set of modules that organize processes such as cooperation, competition, opponent processing, adaptive categorization, pattern learning, and trajectory formation; and a still larger set of neural systems or architectures for achieving general-purpose solutions of modal problems such as vision, speech, recognition learning, associative recall, reinforcement learning, adaptive timing, temporal planning, and adaptive sensory-motor control. Each successive level of model organization synthesizes several units from the previous level. 2. THE ADDITIVE AND SHUNTING MODELS Two of the core neural network models that I introduced and mathematically analysed in their modern form are often called the additive model and the shunting model. These models were originally derived in 1957-1958 when I was an undergraduate at Dartmouth College. 
They describe how STM and LTM traces interact during network processes of activation, associative learning, and recall (Figure 1). It took ten years from their initial discovery and analysis to get them published in the intellectual climate of the 1960’s
141
‘i
i‘
e.. IJ
Figure 1. STM traces (or activities or potentials) z, at cells (or cell populations) D, emit signals along the directed pathways (or axons) e;j which are gated by LTM memory traces (or adaptive weights) z,, before they can perturb their target cells vj. (Reprinted with permission from Grossberg, 1982c.) (Grossberg, 1967, 1968a, 1968b). A monograph (Grossberg, 1964) that summarizes some of these results was earlier distributed to one hundred laboratories of leading researchers from the Rockefeller Institute where I was then a graduate student. Additive STM Equation
Equation (1) for the STM trace z, includes a term for passive decay ( - A i z , ) , positive fj(z,)Bj&)), negative feedback gj(z,)C&)), and input (Zi). feedback Each feedback term includes a state-dependent nonlinear signal (fj(zj),gj(zj)), a conIf the positive and nection, or path, strength (Bj,,Cj,), and an LTM trace (@),$)). negative feedback terms are lumped together and the connection strengths are lumped with the LTM traces, then the additive model may be written in the simpler form
(cy=l
(-xi”=,
Early applications of the additive model included computational analyses in vision, associative pattern learning, pattern recognition, classical and instrumental conditioning, and the learning of temporal order in applications to language and sensory-motor control (Grossberg, 1969a, 1969b, 1969c, 1970a, 1970b, 1971a, 1972a, 1972b, 1974; Grossberg and Pepe, 1971). The additive model has continued to be a cornerstone of neural network research to the present day; see, for example, Amari and Arbib (1982) and Grossberg (1982a). Some physicists unfamiliar with the classical status of the additive model in neural network theory erroneously called it the Hopfield model after they became acquainted with Hopfield’s first application of the additive model in Hopfield (1984), twenty-five years after its discovery; see Section 20. The classical McCulloch-Pitts (1943) model has also erroneously been called the Hopfield model by the physicists who became acquainted with the McCulloch-Pitts model in Hopfield (1982). These historical errors can ultimately be traced to the fact that many physicists and engineers who started studying neural networks in the 1980’s generally did not know the field’s scholarly literature. These errors are
142
gradually being corrected as new neural network practitioners learn the history of their craft. A related network equation was found to more adequately model the shunting dynamics of individual neurons (Hodgkin, 1964; Kandel and Schwartz, 1981; Katz, 1966; Plonsey and Fleming, 1969). In such a shunting equation, each STM trace is restricted to a bounded interval [-D;, B;] and automatic gain control, instantiated by multiplicative shunting terms, interacts with balanced positive and negative feedback signals and inputs to maintain the sensitivity of each STM trace within its interval. S h u n t i n g STM E q u a t i o n
Variations of the shunting equation (3) were also studied (Ellias and Grossberg, 1975) in which the reaction rate of inhibitory STM traces y, was explicitly represented, as in the system
and
Several LTM equations have been useful in applications. Two particularly useful variations have been: Passive Decay LTM E q u a t i o n
and Gated Decay LTM E q u a t i o n d
p ; j
= hj(Zj)[-K;jZ;j
+ L;jf;(zl)].
(7)
In both equations, a nonlinear learning term f i ( z i ) h j ( z j )often , called a Hebbian term after Hebb (1949), is balanced by a memory decay term. In (6), memory decays passively at a constant rate -K,,. In (7), memory decay is gated on and off by one of the nonlinear signals. When the gate opens, z;j tracks f;(z,) by steepest descent. A key property of
143
both equations is that the size of an LTM trace z;, can either increase or decrease due to learning. Neurophysiological support for an LTM equation of the form (7) was reported two decades after it was first introduced (Levy, 1985; Levy, Brassel, and Moore, 1983; Levy and Desmond, 1985; Rauschecker and Singer, 1979; Singer, 1983). Extensive mathematical analyses of these STM and LTM equations in a number of specialized circuits led gradually to the identification of a general class of networks for which one could prove invariant properties of associative spatiotemporal pattern learning and recognition (Grossberg, 1969a, 1971b, 1972c, 1982). These mathematical analyses helped to identify those features of the models that led to useful emergent properties. They sharpened intuition by showing the implications of each idea when it was realized within a complex system of interacting components. Some of these results are summarized below. 3. UNITIZED NODES, SHORT TERM MEMORY, AND AUTOMATIC
ACTIVATION The neural network framework and the additive laws were derived in several ways (Grossberg, 1969a, 1969b, 1969f, 1974). My first derivation in 1957-1958 was based on classical list learning data (Grossberg, 1961, 1964) from the serial verbal learning and paired associate paradigms (Dixon and Horton, 1968; Jung, 1968; McGeogh and Irion, 1952; Osgood, 1953; Underwood, 1966). List learning data force one to confront the fact that new verbal units are continually being synthesized as a result of practice, and need not be the obvious units which the experimentalist is directly manipulating (Young, 1968). All essentially stationary concepts, such as the concept of information itself (Khinchin, 1967) hereby became theoretically useless. By putting the self-organization of individual behavior in center stage, I realized that the phenomenal simplicity of familiar behavioral units, and the synthesis of these units into new representations which themselves achieve phenomenal simplicity through experience, should be made a fundamental property of the theory. To express the phenomenal simplicity of familiar behavioral units, I represented them by indecomposable internal representations, or unitized nodes, u,, i = 1,2,. . . ,n. This hypothesis gained support from the (now classical) paper of Miller (1956) on the Magic Number Seven, which appeared at around the time I was doing this derivation. In this work, Miller described how composites of familiar units can be “chunked”, or unitized, into new units via the learning process. Miller used the concept of information to analyse his results. This concept cannot, however, be used to explain how chunking occurs. A neural explanation of the Magic Number Seven is described in Grossberg (1978a, 1986); see also Cohen and Grossberg (1986). Data concerning the manner in which humans learn serial lists of verbal items led to the first derivation of the additive model. These data were particularly helpful because the different error distributions and learning rates at each list position suggested how each list item dynamically senses and learns from a different spatiotemporal context. It was, for example, known that practicing a list of items such as AB could also lead to learning of BA, a phenomenon called backward learning. A list such as ABC can obviously also be learned, however, showing that the context around item B enables forward learning of BC to supercede backward learning of BA.
144
To simplify the discussion of such interactive phenomena, I will consider only associative interactions within a given level in a coding hierarchy, rather than the problem of how coding hierarchies develop and interact between several levels. All of these conclusions have been generalized to a hierarchical setting (Grossberg, 1974, 1978a, 1980a). 4. BACKWARD LEARNING AND SEFUAL BOWING
Backward learning effects and, more generally, error gradients between nonadjacent, or remote, list items (Jung, 1968; McGeogh and Irion, 1952; Murdock, 1974; Osgood, 1953; Underwood, 1966) suggested that pairs of nodes vi and vj can interact via distinct directed pathways e,j and ej; over which adaptive signals can travel. An analysis of how a node v, could know where to send its signals revealed that no local information exists at the node itself whereby such a decision could be made. By the principle of sufficient reason, the node must therefore send signals towards all possible nodes v j with which it is connected by directed paths e i j . Some other variable must exist that discriminates which combination of signals can reach their target nodes based on past experience. These auxiliary variables turned out to be the long term memory traces. The concept that each node sends out signals to all possible nodes subsequently appeared in models of spreading activation (Collins and Loftus, 1975; Klatsky, 1980) to explain semantic recognition and reaction time data. The form that the signaling and learning laws should take was suggested by data about serial verbal learning. During serial learning, a subject is presented with one list item at a time and asked to predict the next item before it occurs. After a rest period, the list is presented again. This procedure continues until a fixed learning criterion is reached. A main paradox about serial learning concerns the form of the bowed serial position curve which relates cumulative errors to list positions (Figure 2a). This curve is paradoxical for the following reason. If all that happened during serial learning was a build-up of interference at each list position due to the occurrence of prior list items, then the error curve should be monotone increasing (Figure 2b). Because the error curve is bowed, and the degree of bowing depends on the length of the rest period, or intertrial interval, between successive list presentations, the nonoccurrence of list items after the last item occurs somehow improves learning across several prior list items. Internal events thus continue to occur during the intertrial interval. The nonoccurrence of future items can hereby reorganize the learning of a previously occurring list. The bowed serial position curve showed me that a real-time dynamical theory was needed to understand how these internal events continue to occur even after external inputs cease. It also showed that these internal events can somehow operate “backwards in time” relative to the external ordering of observable list items. These backward effects suggested that directed network interactions exist whereby a node v, could influence a node v j , and conversely. Many investigators attributed properties like bowing to one or another kind of rehearsal (Klatsky, 1980; Rundus, 1971). Just saying that rehearsal causes bowing does not explain it, because it does not explain why the middle of the list is less rehearsed. Indeed the middle of the list has more time to be rehearsed than does the end of the list before the next learning trial occurs. In the classical literature, the middle of the list was also said to experience maximal proactive interference (from prior items) and retroactive interference (from future items), but this just labels what we have to explain
145
CUMULATIVE ERRORS
LIST POSITION
(a)
Figure 2. (a) The cumulative error curve in serial verbal learning is a skewed bowed curve. Items between the middle and end of the list are hardest to learn. Items at the beginning of the list are easiest to learn. (b) If position-dependent difficulty of learning were a!l due to interference from previously presented items, the error curve would be monotone increasing. (Reprinted with permission from Grossberg, 1982b.) (Osgood, 1953; Underwood, 1966). The severity of such difficulties led the serial learning expert Young (1968) to write: “If an investigator is interested in studying verbal learning processes .. . he would do well to choose some method other than serial learning” (p.146). Another leading verbal learning expert Underwood (1966) wrote: “The person who originates a theory that works out to almost everyone’s satisfaction will be in line for an award in psychology equivalent to the Nobel prize” (p. 491). It is indicative of the isolated role of real-time modelling in psychology at that time that a theory capable of clarifying the main data effects was available but could not yet get published. Similar chunking and backward effects also occur in a wide variety of problems in speech, language, and adaptive sensory-motor control, so avoiding serial learning will not make the problem go away. Indeed these phenomena may all generally be analysed using the same types of mechanisms. 5. THE NEED FOR A REAL-TIME NETWORK THEORY The massive backward effect that causes the bowed serial curve forced the use of a real-time theory that can parameterize the temporal unfolding of both the occurrences and the nonoccurrences of events. The existence.of facilitative effects due to nonoccurring items also showed that traces of prior list occurrences must endure beyond the last item’s presentation time, so they can be influenced by the future nonoccurrences of items. This fact led to the concept of activations, or short term memory (STM) traces, z,(t) at the nodes v,, i = 1,2,. . .,n, which are turned on by inputs I,(t), but which decay at a rate slower than the input presentation rate. As a result, in response to serial inputs, patterns of STM activity are set up across the network’s nodes. The combination of serial inputs, distributed internodal signals, and spontaneous STM changes at each node changes the STM pattern as the experiment proceeds. A major task of neural network theory was thus to learn how to think in terms of distributed pattern transformations, rather than just in terms of distributed feature detectors or other local entities. When I first realized this, it was quite a radical notion. Now it is so taken for granted that most people do not
146
Figure 3. Suppose that items f l , r 2 , f 3 , 7 - 4 , .. . are presented serially to nodes t 1 1 , t 1 2 , ~ 3 , 214,. .., respectively. Let the activity of node 21i at time t be described by the height of the histogram beneath vi at time t. If each node is initially excited by an equal amount and its excitation decays at a fixed rate, then at every time (each row) the pattern of STM activity across nodes is described by a recency gradient. (Reprinted with permission from Grossberg, 1982b.) realize that is was once an exciting discovery.
6. EVENT TEMPORAL ORDER VS. LEARNED TEMPORAL ORDER The general philosophical interest of the bowed error curve can be appreciated by asking: What is the first time a learning subject can possibly know that item r, is the last list item in a newly presented list rlr2 ...r,, given that a new item is presented every w time units until r, occurs? The answer obviously is: not until at least w time units after vn has been presented. Only after this time passes and no item T , + ~ is presented can r, be correctly recIassified from the list’s “middle” to the list’s “end”. The nonoccurrence of future items reclassifies r , as the “end” of the list. Parameter w is under experimental control and is not a property of the list ordering per se. Spatiotemporal network interactions thus parse a list in a way that is fundamentally different from the parsing rules that are natural to apply to a list of symbols in a computer. Indeed, increasing the event presentation rate, or intratrial interval, w during serial learning can flatten the entire bowed error curve and minimize the effects of the intertrial interval between successive list presentations (Jung, 1968; Osgood, 1953). To illustrate further the difference between computer models and a real-time network approach, suppose that after a node v; is excited by an input I;, its STM trace gets smaller through time due to either internodal competition or to passive trace decay. Then in response to a serially presented list, the last item to occur always has the largest STM trace-in other words, at every time a recency gradient obtains in STM (Figure 3). Given this natural assumption-which, however, is not always true (Bradski, Carpenter, and Grossberg, 1992; Grossberg, 1978a, 1978b)-how do the generalization gradients of
147
Figure 4. At each node v,, the LTM pattern z, = (zjl,zj2,.. ., z . ) that evolves through 1". time is different. In a list of length n = L whose intertrial interval 1s sufficiently long, the LTM pattern at the list beginning ( j 1) is a primacy gradient. At the list end ( j L ) , a recency gradient evolves. Near the list middle ( j 2 $), a two-sided gradient is learned. These gradients are reflected in the distribution of anticipatory and perseverative errors in response to item probes at different list positions. (Reprinted with permission from Grossberg, 1982b.) errors at each list position get learned (Figure 4)? In particular, how does a gradient of anticipatory, or forward, errors occur at the beginning of the list, a gradient of perseverative, or backward, errors occur at the end of the list and a two-sided gradient of anticipatory and perseverative errors occur near the middle of the list (Osgood, 1953)? Otherwise expressed, how does a temporal succession of STM recency gradients generate an LTM primacy gradient at the list beginning but an LTM recency gradient at the list end? I call this STM-LTM order reversal. This property immediately rules out any linear theory, as well as any theory which restricts itself to nearest neighbor associative links. 7. MULTIPLICATIVE SAMPLING BY SLOWLY DECAYING LTM TRACES OF RAPIDLY EVOLVING STM PATTERNS The STM and LTM properties depicted in Figures 3 and 4 can be reconciled by positing the existence of STM traces and LTM traces that evolve according to different time scales and rules. Indeed, this reconciliation was one of the strongest arguments that I knew for these rules until neurobiological data started to support them during the 1980's. Suppose that the STM trace of each active node v, can send out a sampling signal Sj along each directed path e j k towards the node v k , k # j . Suppose that each path e,k contains LTM trace z,k at its terminal point, where z j k can compute, using only local operations, the product of signal Sj and STM trace xk. Also suppose that the LTM trace decays slowly, if at all, during a single learning trial. The simplest law for zjk that satisfies these constraints is d -2's = -czjk + d S j X k , (8)
dt 3 j # k; cf., equation ( 6 ) . To see how this rule generates an LTM primacy gradient at the list beginning, we need to study the LTM pattern (212,z13,.. .,zln)and to show that z12 > 213 > ,. . > zln.To see how the same rule generates an LTM recency gradient at the list end, we need to study the LTM pattern (zn1,zn2,.. ., . z , , ~ . - Iand ) to show that znl < zn2 < . . . < z,-1. The two-sided gradient at the list middle can then be understood as a combination of these effects.
148
By (8), node 01 sends out a sampling signal S1 shortly after item q is presented. After rapidly reaching peak size, signal S1 gradually decays as future list items r2,7-3,.. . are presented. Thus S l is largest when trace 5 2 is maximal, S1 is smaller when both traces 1 2 and 5 3 are active, S1 is smaller still when traces x2, 5 3 , and 2 4 are active, and so on. Consequently, the product 5’1x2 in row 2 of Figure 3 exceeds the product S 1 ~ 3in row 3 of Figure 3, which in turn exceeds the product S1q in row 4 of Figure 3, and so on. Due to the slow decay of each LTM trace z l t on each learning trial, 212 adds up to the products S1zz in successive rows of column 1, 213 adds up to the products S1q in successive rows of column 2, and so on. An LTM primacy gradient zl2 > 213 > . . . > zln is hereby generated. This gradient is due to the way signal S1 multiplicatively samples the successive STM recency gradients and the LTM traces zlk sum up the sampled STM gradients. By contrast, the signal S, of a node vn at the end of the list samples a different set of STM gradients. This is because vn starts to sample (viz., S n > 0) only after all past nodes q , v 2 , . . .,v,,-~ have already been activated on that trial. Consequently, the LTM traces ( z , , ~zn2, , . .., ~ ~ , ~ -of1 node ) vn encode a recency gradient q < 5 2 < 5 3 < . . . < x , - ~ at eoch time. When all the recency gradients are added up through time, the total effect is a recency gradient in vn’s LTM pattern. In summary, nodes at the beginning, middle, and end of the list encode different LTM gradients because they multiplicatively sample and store STM patterns at different times. Similar LTM gradients obtain if the sequences of nodes which are active at any time selectively excite higher-order nodes, or chunks, which in turn sample the field of excited nodes via feedback signals (Grossberg, 1974, 1978a). 8. MULTIPLICATIVE LTM GATING OF STM-ACTIVATED SIGNALS Having shown how STM patterns may be read into LTM patterns, we now need to describe how a retrieval probe rm can read urn’s LTM pattern back into STM on recall trials, whereupon some of the STM traces can be transformed into observable behavior. In particular, how can LTM be read into STM without distorting the learned LTM gradients? The simplest rule generates an STM pattern which is proportional to the LTM pattern that is being read out, and allows distinct probes to each read their LTM patterns into STM in an independent fashion. To achieve faithful read-out of the LTM pattern (zm1,zm2,.. .,zmn) by a probe rm that turns on signal S, let the product Smzm,determine the growth rate of z;. Then LTM trace zmi gates the signal Smalong em; before the gated signal reaches v,. The independent action of several probes implies that the gated signals Srnzmiare added, so that the total effect of all gated signals on vi is Ck=lSmzm,. The simplest equation for the STM trace xi that abides by this rule is the additive equation n d ; E i q = -uz, + b SmZm, I,, (9)
c
+
m=l
where -a is the STM decay rate, Sm is the mth sampling signal, zm, is the LTM trace of pathway em,, and I, is the ith experimental input; cf, equation (2). The reaction of equations (8) and (9) to serial inputs I, is much more complex than is their response to an isolated retrieval probe rm. Due to the fact that STM traces may decay slower than the input presentation rate, several sampling signals S, can be simultaneously active, albeit in different phases of their growth and decay. In fact, this in-
149
teraction leads to properties that mimmick list learning data, but first a technical problem needs to be overcome. 9. BEHAVIORAL CHOICE AND COMPETITIVE INTERACTIONS Once one accepts that patterns of STM traces are evolving through time, one also needs a mechanism for choosing those activated nodes which will influence observable behavior. Lateral inhibitory feedback signals were derived as a choice mechanism (Grossberg, 1968, 1969b, 1970a). The simplest extension of (9) which includes competitive interactions is n n d Bx, = -ax, S$b;,z,, SEb,, + I , (10)
+C
C
m=l
m=l
where S&b$ (S;bZi) is the excitatory (inhibitory) signal emitted from node vm along the excitatory (inhibitory) pathway e;, (e;,); cf., equation (1). Correspondingly equation (8) is generalized to d - 2 . k = -CZjk -k d j k S 7 x k . (11) dt 3
c%=l
The asymmetry between terms S$b;,zm, and ck=l SGb;, in (10) suggested a modification of (10) and a definition of inhibitory LTM traces analogous to the excitatory LTM traces (8), where such inhibitory traces exist (Grossberg, 1969d). Because lateral inhibition can change the sign of each x; from positive to negative in (lo), and thus change the sign of each z j k from positive to negative in (8), some refinements of (10) and (8) were needed to prevent absurdities like the following: S$ < 0 and z,< 0 implies z, > 0; and S$ < 0 and zmi < 0 implies z,> 0. Signal thresholds accomplished this in the simplest way. Letting [[I+ = m a ( [ , 0), define the threshold-linear signals.
sj’= [xj(t- TT)- r;]+ and
S-3 = [ z j ( t- T,-) - r
(12)
~+,
in (10) and ( l l ) , and modify (10) to read
Sigmoid, or S-shaped signals, were also soon mathematically shown to support useful computational properties (Grossberg, 1973). These additive equations and their variants have been used by many subsequent modellers. 10. THE SKEWED BOW: SYMMETRY-BREAKING BETWEEN FUTURE AND PAST
One of the most important contributions of neural network models has been to show how behavioral properties can arise as emergent properties due to network interactions. The bowed error curve is perhaps the first behaviorally important emergent property that was derived from a red-time neural network. It results from forward and backward interactions among all the STM and LTM variables across the network.
150
To explain the bowed error curve, we need to compare the LTM patterns z, = . .,z,”) that evolve at all list nodes 21%. In particular, we need to explain why the bowed curve is skewed; that is, why the list position where learning takes longest occurs nearer to the end of the list than to its beginning (Figure 2a). This skewing effect contradicts learning theories that assume forward and backward effects are equally strong, or symmetric (Asch and Ebenholtz, 1962; Murdock, 1974). This symmetry-breaking between the future and the past, by favoring forward over backward associations, makes possible the emergence of a global “arrow in time,” or the ultimate learning of long event sequences in their correct order, much as we learn the alphabet ABC ... Z despite the existence of backward learning. A skewed bowed error curve does emerge in the network, and predicts that the degree of skewing will decrease, and the relative learning rate at the beginning and end of the list will reverse, as the network’s arousal level increases or its signal thresholds r,’ decrease to abnormal levels (Grossberg and Pepe, 1971). The arousal and threshold predictions have not yet been directly tested to the best of my knowledge. Abnormally high arousal or low thresholds generate a formal network syndrome characterized by contextual collapse, reduced attention span, and fuzzy response categories that resemble aspects of simple schizophrenia (Grossberg and Pepe, 1970; Maher, 1977). To understand intuitively what is involved in this explanation of bowing, note that by , . .,~ “ - 1 ,that ~ is activated by list item equation (14), each correct LTM trace 212, ~ 2 3z34,. r1 may grow at a comparable rate, albeit w time units later than the previous correct LTM trace. However, the LTM patterns 21, z2,. ,.,zn differ at every list position, as in Figure 4. Thus when a retrieval probe T-,reads its LTM pattern z, into STM, the entire pattern must influence overt behavior to explain why bowing occurs. The relative size of the correct LTM trace z,,,+1 conpared to all other LTM traces in z, will influence its success in eliciting r,+1 after competitive STM interactions occur. A larger z3,,+]relative to the sum of all other z,k, k j , j 1, should yield better performance of r3.+] given r ] , other things being equal. To measure the distinctiveness of a trace zJkrelative to all traces in z J , I therefore defined the relative LTM traces (z11,z,2,.
+ + zjt
= Z,L(
C zjm)-’.
(15)
m#3
Equation (15) provides a convenient measure of the effect of LTM on STM after competition acts. By (15), the ordering within the LTM gradients of Figure 4 is preserved by the then 2 1 2 > 2 1 3 > . . . > 21, berelative LTM traces; for example, if z12 > 213 > . .. > zlnr cause all the Zlk% have the same denominator. Thus all conclusions about LTM gradients are valid for relative LTM gradients, which are also sometimes called stimulus sampling probabilities. In terms of the relative LTM traces, the issue of bowing can be mathematically formulated as follows. Define the bowing function B;(t)= Z,,,+l(t). Function B,(t)measures how distinctive the ith correct association is at time t. After a list of n items is presented with an intratrial interval w and a sufficiently long intertrial interval W elapses, does the function B,((n- l ) w W ) decrease and then increase as i increases from 1 to n? Does the minimum of the function occur in the latter half of the list? The answer to both of these questions is “yes.”
+
151
To understand why this happens, it is necessary to understand how the bow depends upon the ability of a node v, to sample incorrect future associations, such as r,r,+2,r,ri+3,.. . in addition to incorrect past associations, such as riri-1, riri-2,. . .. As soon as S, becomes positive, vi can sample the entire past field of STM traces at ~ 1 , 7 1 2 , .. ., v , - ~ . However, if the sampling threshold is chosen high enough, S, might shut off before r,+2 occurs. Thus the sampling duration has different effects on the sampling of past than of future incorrect associations. For example, if the sampling thresholds of all v; are chosen so high that S; shuts off before ri+2 is presented, then the function B,(m) decreases as i increases from 1 to n. In other words, the monotonic error curve of Figure 2b obtains because no node v, can encode incorrect future associations. Even if the thresholds are chosen so that incorrect future associations can be formed, the function B,((i+ 1)w) which measures the distinctiveness of z,,,+~just before r,+2 occurs is again a decreasing function of i. The bowing effect thus depends on threshold choices which permit sampling durations that are at least 2w in length. The shape of the bow also depends on the duration of the intertrial interval, because before the intertrial interval occurs, all nodes build up increasing amounts of associative interference as more list items are presented. The first effect of the nonoccurrence of items after r,, is presented is the growth through time of B,-l(t) as t increases beyond the time nw when item r,+l would have occurred in a larger list. The last correct association is hereby facilitated by the absence of interfering future items during the intertrial interval. This facilitation effect is a nonlinear property of the network. Bowing is also a nonlinear phenomenon in the theory, because it depends on a comparison of ratios of integrals of sums of products as they evolve through time. Mathematical theorems about the bowed error curve and other list learning properties were described in Grossberg (1969c) and Grossberg and Pepe (1971), and reviewed in Grossberg (1982a, 1982b). These results illustrated how STM and LTM processes interact as unitized events occur sequentially in time. Other mathematical studies analysed increasingly general constraints under which distributed STM patterns could be encoded in LTM without bias by arbitrary numbers of simultaneously active sampling nodes acting in parallel. Some of these results are summarized in the next section. 11. ABSOLUTELY STABLE PARALLEL PATTERN LEARNING
Many features of system (10) and (12)-(14) are special; for example, the exponential decay of STM and LTM and the signal threshold rule. Because associative processing is ubiquitous throughout phylogeny and within. functionally distinct subsystems of each individual, a more general mathematical framework was needed. This framework needed to distinguish universally occurring associative principles that guarantee essential learning properties from evolutionary variations that adapt these principles to realize specialized skills. I approached this problem from 1967 to 1972 in a series of articles wherein I gradually realized that the mathematical properties used to globally analyze specific learning examples were much more general than the examples themselves. This work culminated in my universal theorems on associative learning (Grossberg, 1969d, 1971a, 1972a). The theorems say that if certain associative laws were invented at a prescribed time during evolution, then they could achieve unbiased associative pattern learning in essentially any
152
later evolutionary specialization. To the question: Was it necessary to re-invent a new learning rule to match every perceptual or cognitive refinement, the theorems said “no”. They enabled arbitrary spatial patterns to be learned by arbitrarily many, simultaneously active sampling channels that are activated by arbitrary continuous data preprocessing in an essentially arbitrary anatomy. Arbitrary space-time patterns can also be learned given modest constraints on the temporal regularity of stimulus sampling. The universal theorems thus describe a type of parallel processing whereby unbiased associative pattern learning occurs despite mutual crosstalk between nonlinear feedback signals. These results obtain only if the network’s main computations, such as spatial averaging, temporal averaging, preprocessing, gating, and cross-correlation are computed in a canonical ordering. This canonical ordering constitutes a general purpose design for unbiased parallel pattern learning, as well as a criterion for whether particular networks are acceptable models for this task. The universality of the design mathematically takes the form of a classification of oscillatory and limiting possibilities that is invariant under evolutionary specializations. The theorems can also be interpreted in another way that is appropriate in discussions of self-organizing systems. The theorems are absolute stability or global content addressable memory theorems. They show that evolutionary invariants of associative learning obtain no matter how system parameters are changed within this class of systems. Absolutely stable learning is an important property in a self-organizing system because parameters may change in ways that cannot be predicted in advance, notably when unexpected environments act on the system. Absolute stability guarantees that the onset of self-organization does not subvert the very learning properties that make stable self-organization possible. The systems that I considered constitute the generalized additive model
where i and j parameterize arbitrarily large, not necessarily disjoint, sets of sampled and sampling cells, respectively. As in my equations for list learning, A, is an STM decay rate, Bki is a nonnegative performance signal, I i ( t ) is an input function, Cji is an LTM decay rate, and Dj, is a nonnegative learning signal. Unlike the list learning equations, A;, Bk;, C,i, and 0,;may be continuous functionals of the entire history of the system. Equations (16) and (17) are thus very general, and include many of the specialized associative learning models in the literature. For example, although (16) does not seem to include inhibitory interactions, such interactions may be lumped (say) into the STM decay functional A;. The choice n
A , = a, - (b; - C;Z,)G;(Z,)
+ k = l ffk(Zk)dk;
(18)
describes the case wherein system nodes compete via shunting, or membrane equation, interactions (Cole, 1968; Grossberg, 1973; Kandel and Schwartz, 1981; Plonsey and Fleming,
153
1969). The performance, LTM decay, and learning functionals may include slow threshold changes, nonspecific Now Print signals, signal velocity changes, presynaptic modulation, arbitrary continuous rules of dendritic preprocessing and axonal signaling, as well as many other possibilities (Grossberg, 1972a, 1974). Of special importance are the variety of LTM decay choices that satisfy the theorems. For example, a gated LTM law like
d dt
- z . , = [x,(t - .rj) - r j ( y t ) ] + ( - d j z j i 3
+e p , )
(19)
achieves an interference theory of forgetting, rather than exponential forgetting, since S z j , = 0 except when vj is sampling (Adams, 1967); cf., equation (7). Equation (19) also allows the vigor of sampling to depend on changes in the threshold rj(yt) that are sensitive to the prior history yt = ( x i , z,, : i E I , j E J ) * of the system before time t , as in the model of Bienenstock, Cooper, and Munro (1982). In this generality, too many possibilities exist to as yet prove absolute stability theorems. Indeed, if the performance signals Bj, from a fixed sampling node v, to all the sampled nodes v,, a E I , were arbitrary nonnegative and continuous functionals, then the irregularities in each Bji could override any regularities in zj; within the gated performance signal Bjizj; from v, to v,. One further constraint was used to impose some spatiotemporal regularity on the sampling process, as indicated in the next section. 12. LOCAL SYMMETRY, ACTION POTENTIALS, AND UNBIASED LEARNING Absolute stability obtains even if different functionals B,, C,,and D, are assigned to each node vj,j E J , just so long as the same functional is assigned to all pathways e,;, i E I . Where this is not globally true, one can often partition the network into maximal subsets where it is true, and then prove unbiased pattern learning in each subset. This restriction is called the property of local symmetry axes since each sampling cell vj can act as a source of coherent history-dependent waves of STM and LTM processing. Local symmetry axes still permit (say) each Bj to obey different history-dependent preprocessing, threshold, time lag, and path strength laws among arbitrarily many mutually interacting nodes v,. When local symmetry axes are imposed on the generalized additive model in (16) and (17), the resulting class of systems takes the form
A change of variables shows, moreover, that constant interaction coefficients b,, between pairs v, and v, of nodes can depend on i E I without destroying unbiased pattern learning in the systems d -xi = -AX, Bjbjizj, Zi(t) (22) dt i
+C
+
154
By contrast, the systems (22) and
are not capable of unbiased parallel pattern learning (Grossberg, 1972a). A dimensional analysis showed that (22) and (23) hold if action potentials transmit the network’s intercellular signals, whereas (22) and (24) hold if electrotonic propagation is used. The cellular property of an action potential was hereby formally linked to the network property of unbiased parallel pattern learning. 13. THE UNIT OF LTM IS A SPATIAL PATTERN These global theorems proved that “the unit of LTM is a spatial pattern”. This result was surprising to me, even though I had discovered the additive model. The result illustrates how rigorous mathematics can force insights that go beyond unaided intuition. In the present instance, it suggested a new definition of spatial pattern and showed how the network learns “temporally coherent spatial patterns” that may be hidden in its distributed STM activations through time. This theme of temporal coherence, first mathematically discovered in 1966, has shown itself in many forms since, particularly in recent studies of attention, resonance, and synchronous oscillations (Crick and Koch, 1990; Eckhorn, Bauer, Jordan, Brosch, Kruse, Munk, and Reitbock, 1988; Eckhorn and Schanze, 1991; Gray and Singer, 1989; Gray, Konig, Engel, and Singer, 1989; Grossberg, 1976c; Grossberg and Somers, 1991, 1992). To illustrate the global theorems that have been proved, I consider first the simplest case, wherein only one sampling node vg exists (Figure 5a). Then the network is called an outstar because it can be drawn with the sampling node at the center of outward-facing adaptive pathways (Figure 5b) such that the LTM trace zo, in the ith pathway samples the STM trace x i of the ith sampled cell, i E I . An outstar is thus a neural network of the form d = -AX, Bzo, I,(t) (25)
+
+
where A, B , C, and D are continuous functionals such that B and E are nonnegative. Despite the fact that the functionals A, B , C, and D can fluctuate in complex systemdependent ways, and the inputs I , ( t ) can also fluctuate wildly through time, an outstar can learn an arbitrary spatial pattern
where 0, 2 0 and &1Bg = 1, with a minimum of oscillations in its pattern variables Xi = xi(,&^ xg)-l and 2, = ~ i ( C g ~ , z g These ) - ~ . pattern variables learn the temporally coherent weights 0, in a spatial pattern and factor the input activation I ( t ) that energizes the process into the learning rate. The 2,’s are the relative LTM traces (15) that played such a central role in the explanation of serial bowing. The limiting and oscillatory behaviors of the pattern variables have a classification that is independent of particular
155
ucs
I bI
Figure 5. (a) The minimal anatomy capable of associative learning. For example, during classical conditioning, a conditioned stimulus (CS) excites a single node, or cell population, ?JO which thereupon sends sampling signals to a set of nodes u1, v2,. . .,u,. An input pattern representing an unconditioned stimulus (UCS) excites the nodes ~ 1 , 2 1 2 , .. .,u,, which thereupon elicit output signals that contribute to the unconditioned response (UCR). The sampling signals from uo activate the LTM traces zoi i = 1,2,. . .,n. The activated LTM traces can learn the activity pattern across ~1,212,. . ;,un that represents the UCS. (b) When the sampling structure in (a) is redrawn to emphasize its symmetry, the result is an outstar, whose sampling source is uo and whose sampled border is the set of nodes {q, u2,. . .,un}. (Reprinted with permission from Grossberg, 1982b.)
choices of A , B , C, D , and I . These properties are thus evolutionary invariants of outstar learning. The following theorem summarizes, albeit not in the most general known form, some properties of outstar learning. One of the constraints in this theorem is called a local flow condition. This constraint says that a performance signal B can be large only if its associated learning signal D is large. When local flow holds, pathways which have lost their plasticity can be grouped into the total input pattern that is registered in STM for encoding in LTM by other pathways. If the threshold of the performance signal B is no smaller than the threshold of the learning signal D , then local flow is assured. Such a threshold inequality occurs automatically if the LTM trace z j , is physically interpolated between the axonal signal and the postsynaptic target cell ui. That is why the condition is called a local flow condition.
156
Such a geometric interpretation of the location of the LTM trace gives unexpected support to the hypothesis that LTM traces are localized in the synaptic knobs or postsynaptic membranes of cells undergoing associative learning. Here again a network property gives new functional meaning to a cellular property. Theorem 1 (Outstar Pattern Learning) Suppose that (I) the functionals are chosen to keep system trajectories bounded; (11) a local flow condition holds:
LmD ( t ) d t
= m;
(111) the UCS is practiced sufficiently often, and there exist positive constants K1 and
ZC, such that for all T 2 0, f ( T ,T where
+ t ) 2 Zcl
if
V
f(U, V )= J,
Z ( 0 exp
t 2 Icz
(29)
[lV
A(dd44.
(30)
Then, given arbitrary continuous and nonnegative initial data in t 5 0 such that C j Zj(0) > 0, (A) practice makes perfect: The LTM ratios Z,(t) are monotonically attracted to the UCS weights 8; if [zi(O) - Xi(O)l[Xi(O)- Oil 2 0,
(31)
or may oscillate at most once due to prior learning if (31) does not hold, no matter how wildly A , B, C, D , and Z oscillate; (B) the UCS is registered in STM and partial learning occurs: The limits &; = limtWooX ; ( t ) and Pi = limt+rn Z,(t) exist with
Qi = B,,
for all i.
(32)
(C) If, moreover, the CS is practiced sufficiently often, then perfect learning occurs: if
LmD(t)dt
= 00,
then
Pi = B,,
for all i
(33)
Remarkably, similar global theorems hold for systems (20)-(21) wherein arbitrarily many sampling cells can be simultaneously active and mutually signal each other by complex feedback rules (Geman, 1981; Grossberg, 1969d, 1971a, 1972a, 1980b). This is because all systems of the form (20)-(21) can factorize information about how STM and LTM pattern variables learn pattern Bi from information about how fast energy l , ( t ) is being pumped into the system to drive the learning process. The pattern variables 2, therefore oscillate at most once even if wild fluctuations in input and feedback signal
157
energies occur through time. In the best theorems now available, only one hypothesis is not known to be necessary and sufficient (Grossberg, 1972a, 1982a). When many sampling cells v j , can send sampling signals to each sampled cell v,, the outstar property that each relative LTM trace Zj, = zj,(CsEr z j k ) - l oscillates at most once fails to hold. This is so because the Zji of all active nodes v j track X i = x i ( & x t ) - l , while X i tracks Bi and the Zj, of all active nodes vj. The oscillations of the functions Y; = max{Zj; : j E J } and y, = min{Zji : j E J } can, however, be classified much as the oscillations of each Z, can be classified in the outstar case. Since each Zji depends on all Zjk for variable k , each Y , and yi depends on all zjk for variable j and k. Since also each X i depends on all x k for variable k, the learning at each v; is influenced by all z k and z i t . No single cell analysis can provide an adequate insight into the dynamics of this associative learning process. The main computational properties emerge through interactions on the network level. Because the oscillations of all X i , Y;, and y, relative to 0, can be classified, the following generalization of the outstar learning theorem holds. Theorem 2 (Unbiased Parallel P a t t e r n Learning) Suppose that (I) the functionals are chosen to keep system trajectories bounded; (11) every sampling cell obeys a local flow condition: m
for every j, L"Bjdt=cn
onlyif
Djdt
=00;
(34)
(111) the UCS is presented sufficiently often: There exist positive constants K1 and K 2 such that (29) holds. Then given arbitrary nonnegative and continuous initial data in t 5 0 such that Cixi(0) > 0 and all Ciz j i ( 0 ) > 0, (A) the UCS is registered in STM and partial learning occurs: The limits Q,= limt,, X , ( t ) and Pji = limt+m Zji(t) exist with Qi = Bi,
for all i.
(35)
(B) If the j t h CS is practiced sufficiently often, then it learns the UCS pattern perfectly:
4
m
if
Djdt = 00
then
Pj, = B,, for all a.
(36)
Because LTM traces z j i gate the performance signals Bj which are activated by a retrieval probe T;, the theorem enables any and all nodes v j which sampled the pattern 0, during learning trials to read it out accurately on recall trials. The theorem does not deny that oscillations in overall network activity can occur during learning and recall, but shows that these oscillations merely influence the rates and intensities of learning and recall. In particular, phase transitions in memory can occur, and the nature of the phases can depend on a complex interaction between network rates and geometry (Grossberg, 1969g, 1982a).
158
Neither Theorem 1 nor Theorem 2 assumes that the CS and UCS are presented at correlated times. This is because the UCS condition keeps the baseline STM activity of sampled cells from ever decaying below the positive value K1 in (29). For purposes of space-time pattern learning, this UCS uniformity condition is too strong. In Grossberg (1972a), I used a weaker condition which guarantees that CS-UCS presentations are well enough correlated to guarantee perfect pattern learning of a given spatial pattern by certain cells v,, even if other spatial patterns are presented at irregular times when they are sampled by distinct cells vb. 14. PATTERN CALCULUS: RETINA, COMMAND CELL, REWARD, ATTENTION, MOTOR SYNERGY
Three simple but fundamental facts emerge from the mathematical analysis of pattern learning: the unit of LTM is a spatial pattern 6’ = (0, : i E I); suitably designed neural networks can factorize invariant pattern 0 from fluctuating energy; the size of a node’s sampling signal can render it adaptively sensitive or blind to a pattern 6’. These concepts helped me to think in terms of pattern transformations, rather than in terms of feature detectors, computer programs, linear systems, or other types of analysis. When I confronted data about other behavioral problems with these pattern processing properties, the conceptual pressure that was generated drove me into a wide-ranging series of specialized investigations. What is the minimal network that can discriminate 6’ from background input fluctuations? It looks like a retina, and the 6’’s became reflectances (Grossberg, 1970a, 1972b, 1976b, 1983). What is the minimal network that can encode and/or perform a spacetime pattern or ordered series of spatial patterns? Called an avalanche, it looks like an invertebrate command cell (Grossberg, 1969e, 1970b). How can one synchronize CS-UCS sampling if the time intervals between CS and UCS presentations are unsynchronized? This analysis led to psychophysiological mechanisms of reward, punishment, and attention (Grossberg, 1971b, 1972c, 1972d, 1975). What are the associative invariants of motor learning? Spatial patterns become motor synergies wherein fixed relative contraction rates across muscles occur, and temporally synchronized performance signals read-out the synergy as a unit (Grossberg, 1970a, 1974). 15. SHUNTING COMPETITIVE NETWORKS OR ADDITIVE NETWORKS?
These specialized investigations repeatedly led to consideration of competitive systems. For example, the same competitive normalization property that arose during modeling of receptor-bipolar-horizontal cell interactions in retina (Grossberg, 1970a, 197213) also arose in studies of the decision rules needed to release the right amount of incentive motivation in response to interacting drives and conditioned reinforcer inputs within midbrain reinforcement centers (Grossberg, 1972c, 1972d). Because these problems were approached from a behavioral perspective, I knew what interactive properties the competition had to have. I typically found that shunting competition had all the properties that I needed, whereas additive competition often did not. Additive networks approximate shunting networks when their activities are far from cell saturation levels ( B , and D, in equation (3)). When this is not the case, the automatic gain control properties of shunting networks play a major role, as the next section shows.
159
16. THE NOISE-SATURATION DILEMMA: PATTERN PROCESSING BY COMPETITIVE NETWORKS One basic property that was shared by all these systems concerned the manner in which cellular tissues process input patterns whose amplitudes may fluctuate over a much wider range than the cellular activations themselves. This theme is invisible to theories based on binary codes, feature detectors, or additive models. All cellular systems need to prevent sensitivity loss in their responses to both low and high input intensities. Mass action, or shunting, competition enables cells to elegantly solve this problem using automatic gain control by lateral inhibitory signals (Grossberg, 1970a, 1970b, 1973, 1980a). Additive competition fails in this task because it does not, by definition, possess an automatic gain control property. Suppose that the STM traces or activations q,z2,. . .,znat a network level fluctuate within fixed finite limits at their respective network nodes, as in (3). Setting a bounded operating range for each z, enables fixed decision criteria, such as output thresholds, to be defined. On the other hand, if a large number of intermittent input sources converge on the nodes through time, then a serious design problem arises, due to the fact that the total input converging on each node can vary wildly through time. I have called this problem the noise-saturation dilemma : If the z; are sensitive to large inputs, then why do not small inputs get lost in internal system noise? If the z, are sensitive to small inputs, then why do they not all saturate at their maximum values in response to large inputs? Shunting cooperative-competitive networks possess automatic gain control properties capable of generating an infinite dynamic range within which input patterns can be effectively processed, thereby solving the noise-saturation dilemma. The simplest feedforward network will be described to illustrate how its solves the sensitivity problem raised by the noise-saturation dilemma. Let a spatial pattern I, = 0,I of inputs be processed by the cells v,, i = 1,2,. . .,n. Each 0, is the constant relative size, or reflectance, of its input Z, and I is the variable total input size. In other words, I = C;=l I k , so that CFZlBk = 1. How can each cell v, maintain its sensitivity to 0, when I is parametrically increased? How is saturation each cell vi must have information about all the avoided? To compute 0, = I,(C;=, inputs Ik,k = 1,2,. . .,n. Moreover, since Bi = Zi(I, CkZiIk)-', increasing Ii increases 0, whereas increasing any Ik,k # i, decreases 0,. When this observation is translated into an anatomy for delivering feedforward inputs to the cells vi, it suggests that 1, excites v, and that all Z k , k # i, inhibit v,. This rule represents the simplest feedforward on-center off-surround anatomy. How does the on-center off-surround anatomy activate and inhibit the cells v, via mass action? Let each v, possess B excitable sites of which s i ( t ) are excited and B- z,(t) are unexcited at each time t. Then at vi, I, excites B - z, unexcited sites by mass action, and the total inhibitory input Ck., I k inhibits z, excited sites by mass action. Moreover, excitation 5, can spontaneously decay at a fixed rate A, so that the cell can return to an equilibrium point (arbitrarily set equal to 0) after all inputs cease. These rules say that
+
Equation (37) is perhaps the simplest example that illustrates the utility of shunting networks (3). If a fixed spatial pattern 1, = B,I is presented and the background input I
160
is held constant for awhile, each x i approaches an equilibrium value. This value is easily found by setting dxildt = 0 in (37). It is xi = ei-
BI A+Z‘
Equation (38) represents another example of the factorization of pattern (0,) and energy ( B I ( A + Z)-1). As a result, the relative activity Xi = x , ( c L I xk)-l equals 0i no matter how large I is chosen; there is no saturation. This is due to automatic gain control by the inhibitory inputs. In other words, &+i I k multiplies x i in (37). The total gain in (37) is found by writing Zd X ,= - ( A I ) z , + BI,. (39)
+ The gain is the coefficient of xi, namely - ( A + I ) , since if xi(0) = 0,
Both the steady state and the gain of x; depend on the input strength. This is characteristic of shunting networks but not of additive networks. The simple law (38) combines two types of information: information about pattern 0;, or “reflectances”, and information about background activity, or “luminance”. In visual psychophysics, the tendency towards reflectance processing helps to explain brightness constancy (Grossberg and TodoroviC, 1988). Another property of (38) is that the total activity n
is independent of the number of active cells. This normalization rule is a conservation law which says, for example, that a network that receives a fixed total luminance, making one part of the field brighter tends to make another part ,of the field darker. This property helps to explain brightness contrast, as well as brightness assimilation and the CraikO’Brien-Cornsweet effect (Grossberg and TodoroviC, 1988). Equation (38) can be written in another form that expresses a different physical intuition. If we plot the intensity of an on-center input in logarithmic coordinates K i , then Ki = Zn(I,) and Z, = exp(K,). Also write the total off-surround input as J , = Ck+I k . Then (38) can be written in logarithmic coordinates as
2,(Ka,J;) =
BeKi A+eKi + J;’
How does the response xi at v, change if we parametrically change the off-surround input J,? The answer is that xi’s entire response curve to K i is shifted. Its range of maximal sensitivity scales the off-surround intensity, but its dynamic range is not compressed. Such a shift occurs, for example, in the Weber-Fechner law (Cornsweet, 1970), in bipolar cells of the Necturus retina (Werblin, 1971) and in a modified form in the psychoacoustic data of Iverson and Pave1 (1981). The shift property says that
161
for all K , 2 0, where the amount of shift S caused by changing the total off-surround input from Jj') to J,!") is predicted to be
s = In(-
A + J(') A Jj2)
+
1-
(44)
Equation (37) is a special case of a law that occurs in vivo; namely, the membrane equation on which cellular neurophysiology is based. The membrane equation describes the voltage V ( t )of a cell by the law
c aVx = (V+ - V)g+ + ( V - - V)g- + (VP - V)gP.
(45)
In (45), C is a capacitance; V + , V - , and Vp are constant excitatory, inhibitory, and passive saturation points, respectively; and g + , g - , and g p are excitatory, inhibitory, and passive conductances, respectively. We will scale V + and V - so that V + > V - . Then in vivo V + 2 V ( t )2 V - and V + > VP 2 V - . Often V + represents the saturation point of a Na+ channel and V - represents the saturation point of a K+ channel. To see why (37) is a special case of (45), suppose that (45) holds at each cell v,. Then at v,, V = z,.Set C = 1 (rescale time), V + = B, V - = Vp = 0, g+ = I;,9- = CkZiI,., and g p = A. There is also symmetry-breaking in (45) because V + - VP is usually much larger than VP - V - . This symmetry-breaking operation, which is usually mentioned in the experimental literature without comment, achieves an important noise-suppression property when it is coupled to an on-center off-surround anatomy. For example, in the network
both depolarized potentials (0 < xi 5 B ) and hyperpolarized potentials (-C 5 2; < 0) can occur. The equilibrium activity in response to spatial pattern I , = B,I is
Parameter C(B+C)-' is an adaptation level which 8, must exceed in order to depolarize zi and thereby generate an output signal from vi. In order to inhibit uniform input patterns that do not carry discriminative featurd information, we would want Bi = for all i to imply that all z, = 0. This occurs if B = (n - 1)C, so that B B C and thus
v+- v p > v p - v-.
The reflectance processing and Weber law properties, the total activity normalization property, and the adaptation level property of (46) set the stage for the design and classification of more complex feedforward and feedback on-center off-surround shunting networks during the early 1970's.
17. SHORT TERM MEMORY STORAGE AND CAM Feedback networks are capable of storing memories in STM for far longer than a passive decay rate, such as A in (37), would allow, yet can also be rapidly reset. The
162
simplest feedback competitive network capable of solving the noise-saturation dilemma is defined by the equations
i = 1 , 2 , . . .,n. Suppose that the inputs 1, and J, acting before t = 0 establish an arbitrary initial activity pattern (z1(0), zz(O), . . .,zn(0))before being shut off at t = 0. How does the choice of the feedback signal function f(w) control the transformation and storage of this pattern as t -+ co? The answer is determined by the choice of function g(w) = w-lf(w), which measures how much f(w) deviates from linearity at prescribed activity levels w. The network’s responses to these choices may be summarized using the functions X ; = S;(C;!~ zk)-l and z = C!,, xk. The relative activity Xi of the ith node computes how the network transforms the input pattern through time. The functions Xi play the role for feedback networks that the reflectances 8; in (38) play for feedforward networks; also recall Theorems 1 and 2. The total activity z measures how well the network normalizes the total network activity and whether the pattern is stored (z(co) = limt-,ooz(t) > 0) or not ( ~ ( 0 0 )= 0). Variable z plays the role of the total input I in (38). In Grossberg (1973) the following types of results were proved about these systems: Linear signals lead to perfect pattern storage and noise amplification. Slower-than-linear signals lead to pattern uniformization and noise amplification. Faster-than-linear signals lead to winner-take-all choices, noise suppression, and total activity quantization in a network that behaves like an emergent finite state machine. Sigmoid signals lead to partial contrast enhancement, tunable filtering, noise suppression, and normalization. See Grossberg (1981, 1988) for reviews. All of these networks function as a type of global content addressable memory, or CAM, since all trajectories converge to equilibrium points through time. The equilibrium point to which the network converges in response to ari input pattern plays the role of a stored memory. Both linear and sigmoid signals can be chosen to create networks with infinitely many, indeed nondenumerably many, equilibria. Faster-than-linear signals give rise to only finitely many equilibria as part of their winner-take-all property. In summary, several factors work together to generate desirable pattern transformation and STM storage properties. The dynamics of mass action, the geometry of competition, and the statistics of competitive feedback signals work together to define a unified network module whose several parts are designed in a coordinated fashion through development. 18. EVERY COMPETITIVE SYSTEM INDUCES A DECISION SCHEME As solutions to specialized problems involving competition accumulated, networks capable of normalization, sensitivity changes via automatic gain control, attentional biases, developmental biases, pattern matching, shift properties, contrast enhancement, edge and curvature detection, tunable filtering, multistable choice behavior, normative drifts, traveling waves, synchronous oscillations, hysteresis, and resonance began to be classified within the framework of additive or shunting feedforward or feedback competitive networks. As in the case of associative learning, the abundance of special cases made it
163
seem more and more imperative to find a mathematical framework within which these results could be unified and generalized. I also began to realize that many of the pattern transformations and STM storage properties of specialized examples were instances of an absolute stability property of a general class of networks. This unifying mathematical theme can be summarized intuitively as follows: every competitive system induces a decision scheme that can be used to prove global limit and oscillation theorems, notably absolute stability theorems (Grossberg, 1978c, 1978d, 1980~).This decision scheme interpretation provides a geometrical way to think about a Liapunov functional that is naturally associated with every competitive system. A competitive dynamical system is, for present purposes, defined by a system of differential equations such that d Zxi=fi(xl,x2....,Zn) (49) where
afi.0,
axj -
i#j,
and the f,are chosen to generate bounded trajectories. By (50), increasing the activity xj of a given population can only decrease the growth rates of other populations, i # j , or may not influence them at all. No constraint is placed upon the sign of $!&. Typically, cooperative behavior occurs within a population and competitive behavior occurs between populations, as in the on-center off-surround networks (48). The method makes mathematically precise the intuitive idea that a competitive system can be understood by keeping track of who is winning the competition. The decision scheme makes this intuition precise. To define it, write (49) in the form
ix,
d
-xi dt = a,(zi)M,(z),
x = ( q , x Z , . . .,xn),
(51)
which factors out the amplification function .,(xi) 2 0. Then define
and
M + ( x ) = m a x { M , ( x ) : i = 1,2, ...,n }
(52)
M - ( x ) = min{M,(x) : i = 1,2,. . .,n}.
(53)
These variables track the largest and smallest rates of change, and are used to keep track of who is winning. Using these functions, it is easy to see that there exists a property of ignition: Once a trajectory enters the positive ignition region
R+ = {x : M + ( x ) 2 0}
(54)
R- = {x : M - ( z ) 5 O},
(55)
or the negative ignition region
it can never leave it. If x ( t ) never enters the set
R* = R+ n R-,
(56)
164
then each variable z,(t) converges monotonically to a limit. The interesting behavior in a competitive system occurs in R*. In particular, if ~ ( tnever ) enters R+, each z,(t) decreases to a limit; then the competition never gets started. The set
s+= {z : M + ( Z ) = 0)
(57)
acts like a competition threshold, which is called the positive ignition hypersurface. We therefore consider a trajectory after it has entered R*. For simplicity, redefine the time scale so that the trajectory is in R* at time t = 0. The Liapunov functional for any competitive system is then defined as
L ( z t )=
/1
M+(z(u))dv.
0
The Liapunov property is a direct consequence of positive ignition:
This functional provides the “energyn that forces trajectories through a series of competitive decisions, which are also called jumps. Jumps keep track of the state which is undergoing the maximal rate of change at any time (“who’s winning”). If M + ( x ( t ) )= M , ( z ( t ) ) for times S 5 t < T but M + ( z ( t ) )= M j ( z ( t ) )for times T 5 t < U , then we say that the system jumps from node vi to node v, at time t = T . A jump from u, to u, can only occur on the jump set Jij = {ZE R* : M + ( z ) = M;(z) = Mi(”)}. (60) The Liapunov functional L ( z t ) moves the system through these decision hypersurfaces through time. The geometry of S+, S-, and the jump sets J,,, together with the energy defined by L(zt),can be used to globally analyse the dynamics of the system. In particular, due to the positive ignition property (59), the limit
1
m
lim L ( z t ) =
t-oo
0
M+(z(u))du
always exists, and is possibly infinite. 19. LIMITS A N D OSCILLATIONS: CONSENSUS A N D CONTRADICTION
The following results illustrate the use of these concepts (Grossberg, 1 9 7 8 ~ ) : Theorem 3: Given any initial data z(O), suppose that
Jm
M + ( z ( v ) ) d u< 00.
Then the limit ~ ( c o = ) limt-oo z ( t )exists. Corollary 1: If in response to initial data z(O), all jumps cease after some time T < CO, then z(00) exists. Speaking intuitively, this result means that after all local decisions, or jumps, have been made in response to an initial state z(O), then the system can settle down to a
165
global decision, or equilibrium point z(00). In particular, if z(0) leads to only finitely many jumps because there exists a jump tree, or partial ordering of decisions, then z(00) exists. This fact led to the analysis of circumstances under which no jump cycle, or repetitive series of jumps, occurs in response to s(O), and hence that jump trees exist. These results included examples of nonlinear Volterra-Loth equations with asymmetric interaction equations all of whose trajectories approach equilibrium points (Grossberg, 1978~).Thus symmetric coefficients were shown not to be necessary for global approach to equilibrium, or a global CAM property, to obtain. Further information may be derived from (62). Since M + ( z ( t ) )2 0 for all t 2 0, it also follows that limt+m M + ( z ( t ) )= 0. This tells us to look for the equilibrium points z(00) on the positive ignition hypersurface S+ in (57): Corollary 2: If M+(s(t))dt< 00, then z(00) E S+. Thus the positive ignition surface is the place where the competition both ignites and its memories are stored if no jump cycle exists. Using this result, an analysis was made of conditions under which no jump cycle exists in response to any initial vector z(O), and hence all trajectories approach an equilibrium state. The same method was also used to prove that a competitive system can generate sustained oscillations if it contains globally inconsistent decisions. These results provide examples where asymmetric coefficients do lead to oscillations. Here, in response to initial data s(O),
J,"
M + ( z ( v ) ) d v= 00)
(63)
thus infinitely many jumps occur, hence a jump cycle occurs, and the trajectory undergoes undamped oscillations. This method was used to provide a global analysis of the oscillations taking place in a variety of competitive systems, including the Volterra-Lotka systems that model the voting paradox (Grossberg, 1978c, 1980c; May and Leonard, 1975). Using this method, a large new class of nonlinear competitive networks was identified all of whose trajectories converge to one of possibly infinitely many equilibrium points (Grossberg, 1978d). These are the adaptation level systems
d dt
-z,
= a,(z)[b,(s,)- c(z)]
(64)
which were identified through an analysis of many specialized networks. In system (63), each state-dependent amplification function a;(%) and self-signal function b,(zi) can be chosen with great generality without destroying the system's ability to reach equilibrium because there exists a state-dependent adaptation level ~ ( z against ) which each bi(z;)is compared. Such an adaptation level C(Z) defines a strong type of long-range symmetry within the system. Equation (64) is a feedback analog of the feedforward adaptation level equation (47). The examples which motivated the analysis of (64) were additive networks
166
and shunting networks
k
in which the symmetric coefficients i?k,, Ck,, and Ek, took on different values when k = i and when k # i. Examples in which the symmetric coefficients varied with 1 k - i I in a graded fashion were also studied through computer simulations (Ellias and Grossberg, 1975; Levine and Grossberg, 1976). An adequate global mathematical convergence proof was announced in Grossberg (1982b) and elaborated in Cohen and Grossberg (1983). A special case of my theorem concerning these adaption level systems is the following. T h e o r e m 4 (Absolute Stability of Adaptation Level Systems) Suppose that (I) Smoothness: ) continuously differentiable; The functions a i ( z ) , b,(xi),and ~ ( xare (11) Positivity: u , ( x ) > 0 if xi > 0, xj 2 0, j # i; a;(.) = O
if
xi
xj 2 0,
=0,
j # i;
for sufficiently small X > 0, there exists a continuous function ";(xi) such that i ~ ; ( x ,2) a,(x) if
and
J" 0
2E
[ O , A]"
-+
= 03;
a (w)
(111) Boundedness: for each z = 1,2,. ..,n, limsupbi(xi) < c(0,O ,..., m,O ,...,0) 2/-00
where 00 is in the ith entry of (O,O, (IV) Competition: *>O,
dXi
,m, 0,. . . , O ) ;
%..
Z E R T , i = 1 , 2 ,...,n;
(V) Decision Hills: The graph of each bi(x,) possesses at most finitely many maxima in every compact interval. Then the pattern transformation is stored in STM because all trajectories converge to equilibrium points; that is, given any x ( 0 ) > 0, the limit x(00) = limt-OOz ( t )exists. This theorem intuitively means that the decision schemes of adaptation level systems are globally consistent and give rise to a global CAM.
167
In the proof of Theorem 4, it was shown that each z ; ( t )gets trapped within a sequence of decision boundaries that get laid down through time at the abscissa values of the highest peaks in the graphs of the functions b, in (64). The size and location of these peaks reflect the statistical rules, which can be chosen extremely complex, that give rise to the output signals from the totality of cooperating subpopulations within each node v,. In particular, a b, with multiple peaks can be generated when a population’s positive feedback signal function is a multiple-sigmoid function which adds up output signals from multiple randomly defined subpopulations within D,. After all the decision boundaries get laid down, each z, is trapped within a single valley of its b, graph. This valley acts, in some respects, like a classical potential. After all the z, get trapped in such valleys, the function B[z(t)] = max{b,(z(t)) :i = 1,2,. . . ,n} (73) is a Liapunov function. This Liapunov property was used to complete the proof of the theorem. Adaptation level systems exclude distance-dependent interactions. To overcome this gap, Michael Cohen and I (Cohen and Grossberg, 1983; see also Grossberg, 1982b) studied the absolute stability of the symmetric networks
where F,, = Fj,. The adaptation level model (64) is in some ways more general and in some ways less general than model (74). Cohen and I began our study of (74) with the hope that we could use the symmetric coefficients in (74) to prove that no jump cycles exist, and thus that all trajectories approach equilibrium as a consequence of Theorem 3. Such a proof would be part of a more general theory and, by using geometrical concepts such as jump set and ignition surface, it would clarify how to perturb off the symmetric coefficients without generating oscillations. As it turned out, the global Liapunov method that 1developed in the 1970’s sensitized us to think in that direction. We soon discovered a general class of symmetric models and a global Liapunov function for every model in the cla,ss. In each of these models, the Liapunov function was used to prove that all trajectories approach equilibrium points. This CAM model, which is now often called the Cohen-Grossberg model, was designed to include additive networks ( 6 5 ) and shunting networks (66) with symmetric coefficients. 20. COHEN-GROSSBERG CAM MODEL AND THEOREM The Cohen-Grossberg model includes any dynamical system that can be written in the form n d = ~ ; ( z ; ) [ b ;-( ~ ;c); j d j ( z j ) ] . (75)
C
j=1
Each such model admits the global Liapunov function
168
if the coefficient matrix C =((c;j (1 and the functions a,, bi, and d , obey mild technical conditions, including Symmetry: c..=c.. (77) $3 11’
Positivity:
ai(xi) L 0 Monotonicity: 4(Xj)
2 0.
(79)
Integrating V along trajectories implies that
If (78) and (79) hold, then $V 5 0 along trajectories. Once this basic property of a Liapunov function is in place, it is a technical matter to rigorously prove that every trajectory approaches one of a possibly large number of equilibrium points. For expository vividness, the functions in the Cohen-Grossberg model (75) are called the amplification function a , , the self-signal function bi, and the other-signal functions d j . Specialized models are characterized by particular choices of these functions. A. Additive Model Cohen and Grossberg (1983, p.819) noted that “the simpler additive neural networks . . . are also included in our analysis”. The additive equation (2) can be written using the coefficients of the standard electrical circuit interpretation (Plonsey and Fleming, 1969)
as
n -_ + c f j ( X j ) Z j i + Ii. j=l
C . 3= 1 xi ’ dt Ri
(81)
Substitution into (75) shows that 1 Ci
u;(z,)= -
1
b,(z;) = -2, Ri
(constant!)
+ 1,
c . .= -T.. 11
13
(linear!)
(82)
(83)
(84)
Thus in the additive case, the amplification function (82) is a positive constant, hence satisfies (78), and the self-signal term (83) is linear. Substitution of (82)-(83) into (76) leads directly to the equation
169
This Liapunov function for the additive model was later published by Hopfield (1984). In Hopfield's treatment, & is written as an inverse j ; ' ( K ) . Cohen and Grossberg (1983) showed, however, that although f,(s,)must be nondecreasing, as in (79), it need not have an inverse in order for (86) to be valid. B. Shunting Model All additive models lead to constant amplification functions a;(.;) and linear selffeedback functions bi(zi).The need for the more general model (75) becomes apparent when the shunting STM equation (3) is analysed. Consider, for example, a class of shunting models. n
Bd x =~-A;x, + (Bi - ~ i ) [ l+i f i ( ~ i )-] (x,+ Ci)[J,+ C D , j g j ( ~ j ) ] .
(87)
j=1
In (87), each 5;can fluctuate within the finite interval [-Ci, B,] in response to the constant inputs I; and J;, the state-dependent positive feedback signal f;(z;), and the negative feedback signals D ; j g j ( z j ) . It is assumed that Dij
= Dj, 2 0
(88)
and that gi(Xj)2
(89)
0.
In order to write (87) in Cohen-Grossberg form, it is convenient to introduce the variables y, =Xi
+ c,.
(90)
In applications, C, is typically nonnegative. Since X, can vary within the interval
[-C,, B;],y , can vary within the interval [0, B, + C,] of nonnegative numbers. In terms of these variables, (87) can be written in the form
where ai(yj) =y ,
1 b i ( Y i ) = -[AiCi - ( A , Xi
(nonconstant!),
+ J ; ) z ~+ ( B ,+ C, - z,)(I,+
fi(rj
- C;))] (nonlinear!),
C'.3.-- D 13, .. and d j ( y i ) = g j ( y j - C,) (noninvertible!).
Unlike the additive model, the amplification function a , ( y i ) in (92) is not a constant. In addition, the self-signal function b ; ( y i ) in (93) is not necessarily linear, notably because the feedback signal f i ( q - C,) is often nonlinear in applications of the shunting model; in particular it is often a sigmoid or multiple sigmoid signal function.
170
I. 2.
Norrnalixo Total Activity Contrast Enhone.
3. S T M
LTM IN PLASTIC
SYNAPTIC
STRENGTHS
1.Compul.
Tirn.-Av.rag. Pr.rynaphc Signal and
01
Postsynaptic STMi k. Product Gat. Signals
53
2.Mullipli~otir.ly
v,
I.Narrnalizm Total Activity
Iilt)
Input
Panern
Figure 6. The basic computational rules of self-organizing feature maps were established by 1976. (Reprinted with permission from Grossberg, 1976b.)
Property (78) follows from the fact that a,(y,) = y, 2 0. Property (79) follows from the assumption that the negative feedback signal function gj is monotone nondecreasing. Cohen and Grossberg (1983) proved that gj need not be invertible. A signal threshold may exist below which g . - 0 and above which g, may grow in a nonlinear way. The inclusion 3 -. of nonlinear signals with thresholds better enables the model to deal with fluctuations due to subthreshold noise. These results show that adaptation level and distance-dependent competitive networks represent stable neural designs for competitive decision-making and CAM. The fact that adaptation level systems have been analyzed using Liapunov functionals whereas distancedependent, and more generally, symmetric networks have been analyzed using Liapunov functions shows that the global convergence theory of competitive systems is still incomplete. Global limit theorems for cooperative systems were also subsequently discovered (Hirsch, 1982, 1985, 1989), as were theorems showing when closely related cooperativecompetitive systems could oscillate (Cohen, 1988, 1990). Major progress has also been made on explicitly constructing dynamical systems with prescribed sets of equilibrium points, and only these equilibrium points (Cohen, 1992). This is an exciting area for intensive mathematical investigation. Additive and shunting networks have also found their way into many applications. Shunting networks have been particularly useful in understanding biological and machine vision, from the earliest retinal detection stages through higher cortical filtering and grouping processes (Gaudiano, 1992a, 1992b: Grossberg and Mingolla, 1985a, 1985b; Nabet and Pinter, 1991), as well as perceptual and motor oscillations (Cohen, Grossberg, and Pribe, 1993; Gaudiano and Grossberg, 1991; Grossberg and Somers, 1991, 1992; Somers and Kopell, 1993).
171
21. COMPETITIVE LEARNING AND SELF-ORGANIZING FEATURE MAPS
Once mathematical results were available that clarified the global dynamics of associative learning and competition, the stage was set to combine these mechanisms in models of cortical development, recognition learning, and categorization. One major source of interest in such models came from neurobiological experiments on geniculocortical and retinotectal development (Gottlieb, 1976; Hubel and Wiesel, 1977; Hunt and Jacobson, 1974). My own work on this problem was stimulated by such neural data, and by psychological data concerning perception, cognition, and motor control. Major constraints on theory construction also derived from my previous results on associative learning. During outstar learning, for example, no learning of a sampled input pattern θ_i in (27) occurs, i = 1, 2, ..., n, when the learning signal D(t) = 0 in equation (26). This property was called stimulus sampling. It showed that activation of an outstar source cell enables it to selectively learn spatial patterns at prescribed times. This observation led to the construction of more complex sampling cells and networks, called avalanches, that are capable of learning arbitrary space-time patterns, not merely spatial patterns, and to a comparison of avalanche networks with properties of command cells in invertebrates (Grossberg, 1969e, 1970b, 1974).

Activation of outstars and avalanches needs to be selective, so as not to release, or recall, learned responses in inappropriate contexts. Networks were needed that could selectively filter input patterns so as to activate outstars and avalanches only under appropriate stimulus conditions. This work led to the introduction of instar networks in Grossberg (1970a, 1972b), to the description of the first self-organizing feature map in Malsburg (1973), and to the development of the main equations and mathematical properties of the modern theory of competitive learning, self-organizing feature maps, and learned vector quantization in Grossberg (1976a, 1976b, 1976c, 1978a). Willshaw and Malsburg (1976) and Malsburg and Willshaw (1977, 1981) also made a seminal contribution at this time to the modelling of cortical development using self-organizing feature maps.

In addition, the first self-organizing multilevel networks were constructed in 1976 for the learning of multidimensional maps from R^n to R^m, for any n, m >= 1 (Grossberg, 1976a, 1976b, 1976c). The first two levels F1 and F2 constitute a self-organizing feature map such that input patterns to F1 are categorized at F2. Levels F2 and F3 are built out of outstars so that categorizing nodes at F2 can learn output patterns at F3. Hecht-Nielsen (1987) later called such networks counterpropagation networks and claimed that they were a new model. The name instar-outstar map has been used for these maps since the 1970's. Recent popularizers of back propagation have also claimed that multilevel neural networks for adaptive mapping were not available until their work using back propagation in the last half of the 1980's. Actually, back propagation was introduced by Werbos (1974), and self-organizing mapping networks that were proven to be stable in sparse environments were available in 1976. An account of the historical development of self-organizing feature maps is provided in Carpenter and Grossberg (1991). The main processing levels and properties of self-organizing feature maps are summarized in Figure 6, which is reprinted from Grossberg (1976b).
In such a model, an input pattern is normalized and registered as a pattern of activity, or STM, across the feature detectors of level F1. Each F1 output signal is multiplied, or gated, by the adaptive weight, or LTM trace, in its respective pathway, and all these LTM-gated inputs are added up
at their target F2 nodes, as in equations (1)-(3). Lateral inhibitory, or competitive, interactions within F2 contrast-enhance this input pattern; see Section 17. Whereas many F2 nodes may receive inputs from F1, lateral inhibition allows a much smaller set of F2 nodes to store their activation in STM. Only the F2 nodes that win the competition and store their activity in STM can influence the learning process. STM activity opens a learning gate at the LTM traces that abut the winning nodes, as in equation (7). These LTM traces can then approach, or track, the input signals in their pathways by a process of steepest descent. This learning law has thus often been called gated steepest descent, or instar learning. As noted in Section 2, it was introduced into neural network models in the 1960's (e.g., Grossberg, 1969d). Because such an LTM trace can either increase or decrease to track the signals in its pathway, it is not a Hebbian associative law (Hebb, 1949). It has been used to model neurophysiological data about hippocampal LTP (Levy, 1985; Levy and Desmond, 1985) and adaptive tuning of cortical feature detectors during the visual critical period (Rauschecker and Singer, 1979; Singer, 1983), lending support to the 1976 prediction that both systems would employ such a learning law (Grossberg, 1976b, 1978a).

Hecht-Nielsen (1987) has called the instar learning law Kohonen learning after Kohonen's use of the law in his applications of self-organizing feature maps in the 1980's, as in Kohonen (1984). The historical development of this law, including its use in self-organizing feature maps in the 1970's, does not support this attribution. Indeed, after self-organizing feature map models were introduced and computationally characterized in Grossberg (1976b, 1978a), Malsburg (1973), and Willshaw and Malsburg (1976), these models were subsequently applied and specialized by many authors (Amari and Takeuchi, 1978; Bienenstock, Cooper and Munro, 1982; Commons, Grossberg, and Staddon, 1991; Grossberg, 1982a, 1987; Grossberg and Kuperstein, 1986; Kohonen, 1984; Linsker, 1986; Rumelhart and Zipser, 1985).

They exhibit many useful properties, especially if not too many input patterns, or clusters of input patterns, perturb level F1 relative to the number of categorizing nodes in level F2. It was proved that under these sparse environmental conditions, category learning is stable, with LTM traces that track the statistics of the environment, are self-normalizing, and oscillate a minimum number of times (Grossberg, 1976b, 1978a). Also, the category decision rule, as in a Bayesian classifier, tends to minimize error. It was also proved, however, that under arbitrary environmental conditions, learning becomes unstable. Such a model could forget your parents' faces. Although a gradual switching off of plasticity can partially overcome this problem, such a mechanism cannot work in a recognition learning system whose plasticity is maintained throughout adulthood. This memory instability is due to basic properties of associative learning and lateral inhibition.
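As a concrete illustration of the processing stages just summarized, the following sketch implements a bare-bones self-organizing feature map step: normalization at F1, a winner-take-all choice at F2, and the gated steepest descent (instar) update of the winning LTM vector. All dimensions, the learning rate, and the random inputs are illustrative assumptions, not values from the text.

```python
import numpy as np

# Schematic sketch of competitive learning with instar (gated steepest descent) updates:
# only the F2 node that wins the competition opens its learning gate, and its LTM traces
# then track the normalized F1 signal pattern.
rng = np.random.default_rng(1)
n_features, n_categories = 16, 6
W = rng.uniform(0.0, 1.0, (n_categories, n_features))  # bottom-up LTM traces

def normalize(pattern):
    total = pattern.sum()
    return pattern / total if total > 0 else pattern

def instar_step(pattern, W, rate=0.1):
    x = normalize(pattern)          # F1 activity (normalized STM pattern)
    T = W @ x                       # LTM-gated inputs summed at the F2 nodes
    J = int(np.argmax(T))           # lateral inhibition reduced to winner-take-all
    W[J] += rate * (x - W[J])       # gated steepest descent toward the signal
    return J

for _ in range(2000):
    instar_step(rng.uniform(0.0, 1.0, n_features), W)
```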
An analysis of this instability, together with data about categorization, conditioning, and attention, led to the introduction of Adaptive Resonance Theory, or ART, models that stabilize the memory of self-organizing feature maps in response to an arbitrary stream of input patterns (Grossberg, 1976c). A central prediction of ART, from its inception, has been that adult learning mechanisms share properties with the adaptive mechanisms that control developmental plasticity, in particular that "adult attention is a continuation on a developmental continuum of the mechanisms needed to solve the stability-plasticity dilemma in infants" (Grossberg, 1982b, p. 335). Recent experimental results concerning the neural control of learning have provided increasing support for this
hypothesis (Kandel and O'Dell, 1992).

22. ADAPTIVE RESONANCE THEORY
In an ART model, as shown in Figure 7a, an input vector I registers itself as a pattern X of activity across level F1. The F1 output vector S is then transmitted through the multiple converging and diverging adaptive filter pathways emanating from F1. This transmission event multiplies the vector S by a matrix of adaptive weights, or LTM traces, to generate a net input vector T to level F2. The internal competitive dynamics of F2 contrast-enhance vector T. Whereas many F2 nodes may receive inputs from F1, competition or lateral inhibition between F2 nodes allows only a much smaller set of F2 nodes to store their activation in STM. A compressed activity vector Y is thereby generated across F2. In the ART 1 and ART 2 models (Carpenter and Grossberg, 1987a, 1987b), the competition is tuned so that the F2 node that receives the maximal F1 → F2 input is selected. Only one component of Y is nonzero after this choice takes place. Activation of such a winner-take-all node defines the category, or symbol, of the input pattern I. Such a category represents all the inputs I that maximally activate the corresponding node.

So far, these are the rules of a self-organizing feature map. In a self-organizing feature map, only the F2 nodes that win the competition and store their activity in STM can immediately influence the learning process. In an ART model (Carpenter and Grossberg, 1987a, 1992), learning does not occur as soon as some winning F2 activities are stored in STM. Instead, activation of F2 nodes may be interpreted as "making a hypothesis" about an input I. When Y is activated, it rapidly generates an output vector U that is sent top-down through the second adaptive filter. After multiplication by the adaptive weight matrix of the top-down filter, a net vector V inputs to F1 (Figure 7b). Vector V plays the role of a learned top-down expectation. Activation of V by Y may be interpreted as "testing the hypothesis" Y, or "reading out the category prototype" V. An ART network is designed to match the "expected prototype" V of the category against the active input pattern, or exemplar, I. Nodes that are activated by I are suppressed if they do not correspond to large LTM traces in the prototype pattern V. Thus F1 features that are not "expected" by V are suppressed. Expressed in a different way, the matching process may change the F1 activity pattern X by suppressing activation of all the feature detectors in I that are not "confirmed" by hypothesis Y. The resultant pattern X* encodes the cluster of features in I that the network deems relevant to the hypothesis Y based upon its past experience. Pattern X* encodes the pattern of features to which the network "pays attention."

If the expectation V is close enough to the input I, then a state of resonance develops as the attentional focus takes hold. The pattern X* of attended features reactivates hypothesis Y which, in turn, reactivates X*. The network locks into a resonant state through the mutual positive feedback that dynamically links X* with Y. In ART, the resonant state, rather than bottom-up activation, drives the learning process. The resonant state persists long enough, at a high enough activity level, to activate the slower learning process; hence the term adaptive resonance theory. ART systems learn prototypes, rather than exemplars, because the attended feature vector X*, rather than the input I itself, is learned. These prototypes may, however, also be used to encode individual exemplars, as described below.
23. MEMORY STABILITY AND 2/3 RULE MATCHING

This attentive matching process is realized by combining three different types of inputs at level F1 (Figure 7): bottom-up inputs, top-down expectations, and attentional gain control signals. The attentional gain control channel sends the same signal to all F1 nodes; it is a "nonspecific", or modulatory, channel. Attentive matching obeys a 2/3 Rule (Carpenter and Grossberg, 1987a): an F1 node can be fully activated only if two of the three input sources that converge upon it send positive signals at a given time. The 2/3 Rule allows an ART system to react to bottom-up inputs, since an input directly activates its target F1 features and indirectly activates them via the nonspecific gain control channel to satisfy the 2/3 Rule (Figure 7a). After the input instates itself at F1, leading to selection of a hypothesis Y and a top-down prototype V, the 2/3 Rule ensures that only those F1 nodes that are confirmed by the top-down prototype can be attended at F1 after an F2 category is selected.

The 2/3 Rule enables an ART network to realize a self-stabilizing learning process. Carpenter and Grossberg (1987a) proved that ART learning and memory are stable in arbitrary environments, but become unstable when 2/3 Rule matching is eliminated. Thus a type of matching that guarantees stable learning also enables the network to pay attention.

24. PHONEMIC RESTORATION AND PRIMING

2/3 Rule matching in the brain is illustrated by experiments on phonemic restoration (Repp, 1991; Samuel, 1981a, 1981b; Warren, 1984; Warren and Sherman, 1974). Suppose that a noise spectrum replaces a letter sound in a word heard in an otherwise unambiguous context. Then subjects hear the correct letter sound, not the noise, to the extent that the noise spectrum includes the letter formants. If silence replaces the noise, then only silence is heard. Top-down expectations thus amplify expected input features while suppressing unexpected features, but do not create activations not already in the input.

2/3 Rule matching also shows how an ART system can be primed. This property has been used to explain paradoxical reaction time and error data from priming experiments during lexical decision and letter gap detection tasks (Grossberg and Stone, 1986; Schvaneveldt and MacDonald, 1981). Although priming is often thought of as a residual effect of previous bottom-up activation, a combination of bottom-up activation and top-down 2/3 Rule matching was needed to explain the complete data pattern. This analysis combined bottom-up priming with a type of top-down priming; namely, the top-down activation that prepares a network for an expected event that may or may not occur. The 2/3 Rule clarifies why top-down priming, by itself, is subliminal (and in the brain unconscious), even though it can facilitate supraliminal processing of a subsequent expected event.

25. SEARCH, GENERALIZATION, AND NEUROBIOLOGICAL CORRELATES

The criterion of an acceptable 2/3 Rule match is defined by a parameter ρ called vigilance (Carpenter and Grossberg, 1987a, 1992). The vigilance parameter is computed in the orienting subsystem A. Vigilance weighs how similar an input exemplar must be to a top-down prototype in order for resonance to occur. Resonance occurs if ρ|I| - |X*| <= 0. This inequality says that the F1 attentional focus X* inhibits A more than the input I excites it. If A remains quiet, then an F1 ↔ F2 resonance can develop.
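The resonance test just stated can be written down directly. The sketch below is an ART 1 style illustration in which the attended pattern X* is the feature-by-feature minimum of I and the prototype V (2/3 Rule matching for binary patterns); the particular vectors and vigilance value are invented for illustration.

```python
import numpy as np

# Small sketch of 2/3 Rule matching and the vigilance test: X* keeps only those input
# features confirmed by the top-down prototype V, and the orienting subsystem A stays
# quiet (resonance) only if rho*|I| - |X*| <= 0.
def match_or_reset(I, V, rho):
    X_star = np.minimum(I, V)                       # features in I confirmed by V
    resonates = rho * I.sum() - X_star.sum() <= 0   # A is quiet iff this holds
    return X_star, resonates

I = np.array([1, 1, 1, 0, 1, 0])                    # bottom-up input
V = np.array([1, 1, 0, 0, 1, 1])                    # top-down expectation (prototype)
X_star, ok = match_or_reset(I, V, rho=0.8)          # here |I| = 4, |X*| = 3, so reset
print(X_star, "resonance" if ok else "reset and search")
```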
Figure 7. ART search for an F2 recognition code: (a) The input pattern I generates the specific STM activity pattern X at F1 as it nonspecifically activates the orienting subsystem A. X is represented by the hatched pattern across F1. Pattern X both inhibits A and generates the output pattern S. Pattern S is transformed by the LTM traces into the input pattern T, which activates the STM pattern Y across F2. (b) Pattern Y generates the top-down output pattern U which is transformed into the prototype pattern V. If V mismatches I at F1, then a new STM activity pattern X* is generated at F1. X* is represented by the hatched pattern. Inactive nodes corresponding to X are unhatched. The reduction in total STM activity which occurs when X is transformed into X* causes a decrease in the total inhibition from F1 to A. (c) If the vigilance criterion fails to be met, A releases a nonspecific arousal wave to F2, which resets the STM pattern Y at F2. (d) After Y is inhibited, its top-down prototype signal is eliminated, and X can be reinstated at F1. Enduring traces of the prior reset lead X to activate a different STM pattern Y* at F2. If the top-down prototype due to Y* also mismatches I at F1, then the search for an appropriate F2 code continues until a more appropriate F2 representation is selected. Then an attentive resonance develops and learning of the attended data is initiated. (Reprinted with permission from Carpenter, Grossberg, and Rosen, 1991.)
[Figure 8 table: ART 1 (binary) and Fuzzy ART (analog) compared; for both the category choice function and the match criterion, the intersection operator of ART 1 is replaced by the MIN operator of Fuzzy ART.]
Figure 8. Comparison of ART 1 and Fuzzy ART. (Reprinted with permission from Carpenter, Grossberg, and Rosen, 1991.)

Vigilance calibrates how much novelty the system can tolerate before activating A and searching for a different category. If the top-down expectation and the bottom-up input are too different to satisfy the resonance criterion, then hypothesis testing, or memory search, is triggered. Memory search leads to selection of a better category at level F2 with which to represent the input features at level F1. During search, the orienting subsystem interacts with the attentional subsystem, as in Figures 7c and 7d, to rapidly reset mismatched categories and to select other F2 representations with which to learn about novel events, without risking unselective forgetting of previous knowledge. Search may select a familiar category if its prototype is similar enough to the input to satisfy the vigilance criterion. The prototype may then be refined by 2/3 Rule attentional focussing. If the input is too different from any previously learned prototype, then an uncommitted population of F2 cells is selected and learning of a new category is initiated.

Because vigilance can vary across learning trials, recognition categories capable of encoding widely differing degrees of generalization or abstraction can be learned by a single ART system. Low vigilance leads to broad generalization and abstract prototypes. High vigilance leads to narrow generalization and to prototypes that represent fewer input exemplars, even a single exemplar. Thus a single ART system may be used, say, to recognize abstract categories of faces and dogs, as well as individual faces and dogs. A single system can learn both, as the need arises, by increasing vigilance just enough to activate A if a previous categorization leads to a predictive error (Carpenter and Grossberg, 1992; Carpenter, Grossberg, and Reynolds, 1991; Carpenter, Grossberg, Markuzon, Reynolds, and Rosen, 1992). ART systems hereby provide a new answer to whether the brain learns
prototypes or exemplars. Various authors have realized that neither one nor the other alternative is satisfactory, and that a hybrid system is needed (Smith, 1990). ART systems can perform this hybrid function in a manner that is sensitive to environmental demands.

These properties of ART systems have been used to explain and predict a variety of cognitive and brain data that have, as yet, received no other theoretical explanation (Carpenter and Grossberg, 1991; Grossberg, 1987a, 1987b). For example, a formal lesion of the orienting subsystem creates a memory disturbance that remarkably mimics properties of medial temporal amnesia (Carpenter and Grossberg, 1987c, 1993; Grossberg and Merrill, 1992). These and related data correspondences to orienting properties (Grossberg and Merrill, 1992) have led to a neurobiological interpretation of the orienting subsystem in terms of the hippocampal formation of the brain. In applications to visual object recognition, the interactions within the F1 and F2 levels of the attentional subsystem are interpreted in terms of data concerning the prestriate visual cortex and the inferotemporal cortex (Desimone, 1992), with the attentional gain control pathway interpreted in terms of the pulvinar region of the brain. The ability of ART systems to form categories of variable generalization is linked to the ability of inferotemporal cortex to form both particular (exemplar) and general (prototype) visual representations.

26. A CONNECTION BETWEEN ART SYSTEMS AND FUZZY LOGIC

Fuzzy ART is a generalization of ART 1 that incorporates operations from fuzzy logic (Carpenter, Grossberg, and Rosen, 1991). Although ART 1 can learn to classify only binary input patterns, Fuzzy ART can learn to classify both analog and binary input patterns. Moreover, Fuzzy ART reduces to ART 1 in response to binary input patterns. As shown in Figure 8, the generalization to learning both analog and binary input patterns is achieved by replacing appearances of the intersection operator (∩) in ART 1 by the MIN operator (∧) of fuzzy set theory. The MIN operator reduces to the intersection operator in the binary case. Of particular interest is the fact that, as parameter α approaches 0, the function T_j, which controls category choice through the bottom-up filter, reduces to the operation of fuzzy subsethood (Kosko, 1986). T_j then measures the degree to which the adaptive weight vector w_j is a fuzzy subset of the input vector I.

In Fuzzy ART, input vectors are normalized at a preprocessing stage (Figure 9). This normalization procedure, called complement coding, leads to a symmetric theory in which the MIN operator (∧) and the MAX operator (∨) of fuzzy set theory (Zadeh, 1965) play complementary roles. The categories formed by Fuzzy ART are then hyper-rectangles. Figure 10 illustrates how MIN and MAX define these rectangles in the 2-dimensional case. The MIN and MAX values define the acceptable range of feature variation in each dimension. Complement coding uses on-cells (with activity a in Figure 9) and off-cells (with activity a^c in Figure 9) to represent the input pattern, and preserves individual feature amplitudes while normalizing the total on-cell/off-cell vector. The on-cell portion of a prototype encodes features that are critically present in category exemplars, while the off-cell portion encodes features that are critically absent. Each category is then defined by an interval of expected values for each input feature.
For instance, Fuzzy ART would encode the feature of "hair on head" by a wide interval ([A, 1]) for the category "man", whereas the feature "hat on head" would be encoded by a wide interval ([0, B]). On the other hand, the category "dog" would be encoded by two narrow intervals, [C, 1] for hair and [0, D] for hat, corresponding to narrower ranges of expectations for these two features.
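The interval, or "box", interpretation can be made concrete in a few lines of code. The sketch below assumes fast learning, so that a committed weight vector is simply the fuzzy AND of the complement-coded exemplars it has coded; the two-feature exemplars are invented for illustration.

```python
import numpy as np

# Illustrative sketch of how complement coding turns a Fuzzy ART weight vector into an
# interval ("box") of expected feature values: with fast learning, the on-cell part of a
# committed weight stores the feature minima, and its off-cell part stores the complements
# of the feature maxima, seen over the exemplars the category has coded.
def complement_code(a):
    return np.concatenate([a, 1.0 - a])        # I = (a, a^c), so |I| = M

exemplars = [np.array([0.30, 0.70]), np.array([0.45, 0.60])]
w = np.ones(4)                                 # uncommitted category weight, as in (96)
for a in exemplars:
    w = np.minimum(complement_code(a), w)      # fast learning: w <- I AND w

M = 2
box_min = w[:M]                                # smallest value seen for each feature
box_max = 1.0 - w[M:]                          # largest value seen for each feature
print("category box:", list(zip(box_min, box_max)))   # intervals [0.30, 0.45] and [0.60, 0.70]
```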
[Figure 9 schematic: the complement coded input I = (a, a^c) = (a_1, ..., a_M, 1 - a_1, ..., 1 - a_M), with |I| = M.]
Figure 9. Complement coding uses on-cell and off-cell pairs to normalize input vectors. (Reprinted with permission from Carpenter, Grossberg, and Rosen, 1991.)

Learning in Fuzzy ART is stable because all adaptive weights can only decrease in time. Decreasing weights correspond to increasing sizes of category "boxes". This theorem is proved in Carpenter, Grossberg, and Rosen (1991). Smaller vigilance values lead to larger category boxes. Learning stops when the input space is covered by boxes. The use of complement coding works with the property of increasing box size to prevent a proliferation of categories. With fast learning, constant vigilance, and a finite input set of arbitrary size and composition, it has been proved that learning stabilizes after just one presentation of each input pattern. A fast-commit slow-recode option combines fast learning with a forgetting rule that buffers system memory against noise. Using this option, rare events can be rapidly learned, yet previously learned memories are not rapidly erased in response to statistically unreliable input fluctuations. The equations that define the Fuzzy ART algorithm are listed in Section 29.

27. FUZZY ARTMAP AND FUSION ARTMAP: SUPERVISED INCREMENTAL LEARNING, CATEGORIZATION, AND PREDICTION

Individual ART modules typically learn in an unsupervised mode. ART systems capable of supervised learning, categorization, and prediction have also recently been introduced (Asfour, Carpenter, Grossberg, and Lesher, 1993; Carpenter and Grossberg, 1992; Carpenter, Grossberg, and Reynolds, 1991; Carpenter, Grossberg, Markuzon, Reynolds, and Rosen, 1992; Carpenter, Grossberg, and Iizuka, 1992). Unlike many supervised learning networks, such as back propagation, these ART systems are capable of functioning in either an unsupervised or supervised mode, depending on whether environmental feedback is available. When supervised learning of Fuzzy ART controls category formation, a predictive error can force the creation of new categories that could not otherwise be learned due to monotone increase in category size through time in the unsupervised case. Supervision permits the creation of complex categorical structures without a loss of stability. The main additional ingredients whereby Fuzzy ART modules are combined into supervised ART architectures are now summarized.
∧ = Fuzzy AND (conjunction), ∨ = Fuzzy OR (disjunction). For x = (x_1, x_2) and y = (y_1, y_2), (x ∧ y)_i = min(x_i, y_i) and (x ∨ y)_i = max(x_i, y_i), i = 1, 2.
Figure 10. Fuzzy AND and OR operations generate category hyper-rectangles. (Reprinted with permission from Carpenter, Grossberg, and Rosen, 1991.)
The simplest supervised ART systems are generically called ARTMAP. An ARTMAP that is built up from Fuzzy ART modules is called a Fuzzy ARTMAP system. Each Fuzzy ARTMAP system includes a pair of Fuzzy ART modules (ART_a and ART_b), as in Figure 11. During supervised learning, ART_a receives a stream {a^(p)} of input patterns and ART_b receives a stream {b^(p)} of input patterns, where b^(p) is the correct prediction given a^(p). These modules are linked by an associative learning network and an internal controller that ensures autonomous system operation in real time. The controller is designed to create the minimal number of ART_a recognition categories, or "hidden units," needed to meet accuracy criteria. As noted above, this is accomplished by realizing a Minimax Learning Rule that conjointly minimizes predictive error and maximizes predictive generalization. This scheme automatically links predictive success to category size on a trial-by-trial basis using only local operations. It works by increasing the vigilance parameter ρ_a of ART_a by the minimal amount needed to correct a predictive error at ART_b (Figure 12).

Parameter ρ_a calibrates the minimum confidence that ART_a must have in a recognition category, or hypothesis, that is activated by an input a^(p) in order for ART_a to accept that category, rather than search for a better one through an automatically controlled process of hypothesis testing. As in ART 1, lower values of ρ_a enable larger categories to form. These lower ρ_a values lead to broader generalization and higher code compression. A predictive failure at ART_b increases the minimal confidence ρ_a by the least amount needed to trigger hypothesis testing at ART_a, using a mechanism called match tracking (Carpenter, Grossberg, and Reynolds, 1991). Match tracking sacrifices the minimum amount of generalization necessary to correct the predictive error. Speaking intuitively,
Figure 11. Fuzzy ARTMAP architecture. The ART_a complement coding preprocessor transforms the M_a-vector a into the 2M_a-vector A = (a, a^c) at the ART_a field F_0^a. A is the input vector to the ART_a field F_1^a. Similarly, the input to F_1^b is the 2M_b-vector (b, b^c). When a prediction by ART_a is disconfirmed at ART_b, inhibition of map field activation induces the match tracking process. Match tracking raises the ART_a vigilance ρ_a to just above the F_1^a to F_0^a match ratio |x^a|/|A|. This triggers an ART_a search which leads to activation of either an ART_a category that correctly predicts b or to a previously uncommitted ART_a category node. (Reprinted with permission from Carpenter, Grossberg, Markuzon, Reynolds, and Rosen, 1992.)
match tracking operationalizes the idea that the system must have accepted hypotheses with too little confidence to satisfy the demands of a particular environment. Match tracking increases the criterion confidence just enough to trigger hypothesis testing. Hypothesis testing leads to the selection of a new ART_a category, which focuses attention on a new cluster of a^(p) input features that is better able to predict b^(p). Due to the combination of match tracking and fast learning, a single ARTMAP system can learn a different prediction for a rare event than for a cloud of similar frequent events in which it is embedded.

A generalization of Fuzzy ARTMAP, called Fusion ARTMAP, has also recently been introduced to handle multidimensional data fusion, classification, and prediction problems (Asfour, Carpenter, Grossberg, and Lesher, 1993). In Fusion ARTMAP, multiple data channels process different sorts of input vectors in their own ART modules before all
the ART modules cooperate to form a global classification and prediction. A predictive error simultaneously raises the vigilance parameters of all the component ART modules. The module with the poorest match of input to prototype is driven first to reset and search. As a result, the channels whose data are classified with the least confidence are searched before more confident classifications are reset. Channels which provide good data matches may thus not need to create new categories just because other channels exhibit poor matches. Using this parallel match tracking scheme, the network selectively improves learning where it is poor, while sparing the learning that is good. Such an automatic credit assignment has been shown in benchmark studies to generate more parsimonious classifications of multidimensional data than are learned by a one-channel Fuzzy ARTMAP. Two benchmark studies using Fuzzy ARTMAP are summarized below to show that even a one-channel network has powerful classification capabilities.

Figure 12. Match tracking: (a) A prediction is made by ART_a when the baseline vigilance ρ_a is less than the analog match value. (b) A predictive error at ART_b increases the baseline vigilance value of ART_a until it just exceeds the analog match value, and thereby triggers hypothesis testing that searches for a more predictive bundle of features to which to attend. (Reprinted with permission from Carpenter and Grossberg, 1992.)

28. TWO BENCHMARK STUDIES: LETTER AND WRITTEN DIGIT RECOGNITION

As summarized in Table 1, Fuzzy ARTMAP has been benchmarked against a variety of machine learning, neural network, and genetic algorithms with considerable success.
ARTMAP BENCHMARK STUDIES

1. Medical database - mortality following coronary bypass grafting (CABG) surgery: FUZZY ARTMAP significantly outperforms LOGISTIC REGRESSION, ADDITIVE MODEL, BAYESIAN ASSIGNMENT, CLUSTER ANALYSIS, CLASSIFICATION AND REGRESSION TREES, EXPERT PANEL-DERIVED SICKNESS SCORES, and PRINCIPAL COMPONENT ANALYSIS.
2. Mushroom database: DECISION TREES (90-95% correct); ARTMAP (100% correct), with a training set an order of magnitude smaller.
3. Letter recognition database: GENETIC ALGORITHM (82% correct); FUZZY ARTMAP (96% correct).
4. Circle-in-the-Square task: BACK PROPAGATION (90% correct); FUZZY ARTMAP (99.5% correct).
5. Two-Spiral task: BACK PROPAGATION (10,000-20,000 training epochs); FUZZY ARTMAP (1-5 training epochs).

Table 1
An illustrative study used a benchmark machine learning task that Frey and Slate (1991) developed and described as a "difficult categorization problem" (p. 161). The task requires a system to identify an input exemplar as one of 26 capital letters A-Z. The database was derived from 20,000 unique black-and-white pixel images. The difficulty of the task is due to the wide variety of letter types represented: the twenty "fonts represent five different stroke styles (simplex, duplex, triplex, complex, and Gothic) and six different letter styles (block, script, italic, English, Italian, and German)" (p. 162). In addition, each image was randomly distorted, leaving many of the characters misshapen. Sixteen numerical feature attributes were then obtained from each character image, and each attribute value was scaled to a range of 0 to 15. The resulting Letter Image Recognition file is archived in the UCI Repository of Machine Learning Databases and Domain Theories, maintained by David Aha and Patrick Murphy (
[email protected]). Frey and Slate used this database to test performance of a family of classifiers based on Holland’s genetic algorithms (Holland, 1980). The training set consisted of 16,000 exemplars, with the remaining 4,000 exemplars used for testing. Genetic algorithm classifiers having different input representations, weight update and rule creation schemes, and system parameters were systematically compared. Training was carried out for 5 epochs, plus a sixth “verification” pass during which no new rules were created but a large number
of unsatisfactory rules were discarded. In Frey and Slate's comparative study, these systems had correct prediction rates that ranged from 24.5% to 80.8% on the 4,000-item test set. The best performance (80.8%) was obtained using an integer input representation, a reward sharing weight update, an exemplar method of rule creation, and a parameter setting that allowed an unused or erroneous rule to stay in the system for a long time before being discarded. After training, the optimal case, which had the 80.8% performance rate, ended with 1,302 rules and 8 attributes per rule, plus over 35,000 more rules that were discarded during verification. (For purposes of comparison, a rule is somewhat analogous to an ART_a category in ARTMAP, and the number of attributes per rule is analogous to the size of ART_a category weight vectors.) Building on the results of their comparative study, Frey and Slate investigated two types of alternative algorithms: an accuracy-utility bidding system, which had slightly improved performance (81.6%) in the best case; and an exemplar/hybrid rule creation scheme that further improved performance, to a maximum of 82.7%, but that required the creation of over 100,000 rules prior to the verification step.

Fuzzy ARTMAP had an error rate on the letter recognition task that was consistently less than one third that of the three best Frey-Slate genetic algorithm classifiers described above. In particular, after 1 to 5 epochs, individual Fuzzy ARTMAP systems had a robust prediction rate of 90% to 94% on the 4,000-item test set. A voting strategy consistently improved this performance. This voting strategy is based on the observation that ARTMAP fast learning typically leads to different adaptive weights and recognition categories for different orderings of a given training set, even when overall predictive accuracy of all simulations is similar. The different category structures cause the set of test items where errors occur to vary from one simulation to the next. The voting strategy uses an ARTMAP system that is trained several times on input sets with different orderings. The final prediction for a given test set item is the one made by the largest number of simulations. Since the set of items making erroneous predictions varies from one simulation to the next, voting cancels many of the errors. Such a voting strategy can also be used to assign confidence estimates to competing predictions given small, noisy, or incomplete training sets. Voting consistently eliminated 25%-43% of the errors, giving a robust prediction rate of 92%-96%. Moreover, Fuzzy ARTMAP simulations each created fewer than 1,070 ART_a categories, compared to the 1,040-1,302 final rules of the three genetic classifiers with the best performance rates.

Most Fuzzy ARTMAP learning occurred on the first epoch, with test set performance on systems trained for one epoch typically over 97% that of systems exposed to inputs for five epochs. Rapid learning was also found in a benchmark study of written digit recognition, where the correct prediction rate on the test set after one epoch reached over 99% of its best performance (Carpenter, Grossberg, and Iizuka, 1992). In this study, Fuzzy ARTMAP was tested along with back propagation and a self-organizing feature map. Voting yielded Fuzzy ARTMAP average performance rates on the test set of 97.4% after an average number of 4.6 training epochs. Back propagation achieved its best average performance rates of 96% after 100 training epochs.
Self-organizing feature maps achieved a best level of 96.5%, again after many training epochs. In summary, on a variety of benchmarks (see also Table 1, Carpenter, Grossberg, and Reynolds, 1991, and Carpenter et al., 1992), Fuzzy ARTMAP has demonstrated either much faster learning, better performance, or both, than alternative machine learning, genetic, or neural network algorithms. Perhaps more importantly, Fuzzy ARTMAP can be used in an important class of applications where many other adaptive pattern recognition algorithms cannot perform well (see Table 2). These are the applications where very large nonstationary databases need to be rapidly organized into stable variable-compression categories under real-time autonomous learning conditions.

ARTMAP can autonomously learn about:
(A) RARE EVENTS - Need FAST learning
(B) LARGE NONSTATIONARY DATABASES - Need STABLE learning
(C) MORPHOLOGICALLY VARIABLE EVENTS - Need MULTIPLE SCALES of generalization (fine/coarse)
(D) ONE-TO-MANY AND MANY-TO-ONE RELATIONSHIPS - Need categorization, naming, and expert knowledge
To realize these properties ARTMAP systems:
(E) PAY ATTENTION - Ignore masses of irrelevant data
(F) TEST HYPOTHESES - Discover predictive constraints hidden in data streams
(G) CHOOSE BEST ANSWERS - Quickly select globally optimal solution at any stage of learning
(H) CALIBRATE CONFIDENCE - Measure on-line how well a hypothesis matches the data
(I) DISCOVER RULES - Identify transparent IF-THEN relations at each learning stage
(J) SCALE - Preserve all desirable properties in arbitrarily large problems

Table 2

29. SUMMARY OF THE FUZZY ART ALGORITHM

ART field activity vectors: Each ART system includes a field F_0 of nodes that represent a current input vector, and a field F_1 that receives both bottom-up input from F_0 and top-down input from a field F_2 that represents the active code, or category. The F_0 activity vector is denoted I = (I_1, ..., I_M), with each component I_i in the interval [0,1], i = 1, ..., M. The F_1 activity vector is denoted x = (x_1, ..., x_M) and the F_2 activity vector is denoted y = (y_1, ..., y_N). The number of nodes in each field is arbitrary.

Weight vector: Associated with each F_2 category node j (j = 1, ..., N) is a vector
w_j ≡ (w_j1, ..., w_jM) of adaptive weights, or LTM traces. Initially
$$w_{j1}(0) = \ldots = w_{jM}(0) = 1; \tag{96}$$
then each category is said to be uncommitted. After a category is selected for coding it becomes committed. As shown below, each LTM trace w_ji is monotone nonincreasing through time and hence converges to a limit. The Fuzzy ART weight vector w_j subsumes both the bottom-up and top-down weight vectors of ART 1.

Parameters: Fuzzy ART dynamics are determined by a choice parameter α > 0; a learning rate parameter β ∈ [0,1]; and a vigilance parameter ρ ∈ [0,1].

Category choice: For each input I and F_2 node j, the choice function T_j is defined by
$$T_j(I) = \frac{|I \wedge w_j|}{\alpha + |w_j|}, \tag{97}$$
where the fuzzy AND operator ∧ is defined by
$$(p \wedge q)_i \equiv \min(p_i, q_i) \tag{98}$$
and where the norm |·| is defined by
$$|p| \equiv \sum_{i=1}^{M} |p_i| \tag{99}$$
for any M-dimensional vectors p and q. For notational simplicity, T_j(I) in (97) is often written as T_j when the input I is fixed. The system is said to make a category choice when at most one F_2 node can become active at a given time. The category choice is indexed by J, where
$$T_J = \max\{T_j : j = 1 \ldots N\}. \tag{100}$$
If more than one T_j is maximal, the category j with the smallest index is chosen. In particular, nodes become committed in order j = 1, 2, 3, .... When the Jth category is chosen, y_J = 1; and y_j = 0 for j ≠ J. In a choice system, the F_1 activity vector x obeys the equation
$$x = \begin{cases} I & \text{if } F_2 \text{ is inactive} \\ I \wedge w_J & \text{if the } J\text{th } F_2 \text{ node is chosen.} \end{cases} \tag{101}$$
Resonance or reset: Resonance occurs if the match function |I ∧ w_J|/|I| of the chosen category meets the vigilance criterion:
$$\frac{|I \wedge w_J|}{|I|} \ge \rho; \tag{102}$$
that is, by (101), when the Jth category is chosen, resonance occurs if
$$|x| = |I \wedge w_J| \ge \rho\,|I|. \tag{103}$$
Learning then ensues, as defined below. Mismatch reset occurs if
$$\frac{|I \wedge w_J|}{|I|} < \rho; \tag{104}$$
that is, if
$$|x| = |I \wedge w_J| < \rho\,|I|. \tag{105}$$
Then the value of the choice function T_J is set to 0 for the duration of the input presentation to prevent the persistent selection of the same category during search. A new index J is then chosen, by (100). The search process continues until the chosen J satisfies (102).

Learning: Once search ends, the weight vector w_J is updated according to the equation
$$w_J^{(\mathrm{new})} = \beta\,\big(I \wedge w_J^{(\mathrm{old})}\big) + (1 - \beta)\,w_J^{(\mathrm{old})}. \tag{106}$$
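Equations (97)-(106) can be collected into a single routine for one input presentation. The following sketch is one possible rendering, not the published implementation; the default parameter values are illustrative, and the search loop exploits the fact that the choice values T_j do not change during a presentation.

```python
import numpy as np

# Sketch of one Fuzzy ART input presentation: compute the choice functions T_j (97),
# pick the maximal (lowest-index) category J (100), apply the vigilance test (103)/(105),
# continue the search on mismatch, and update the winning weight vector by (106).
def fuzzy_art_present(I, W, rho=0.75, alpha=0.001, beta=1.0):
    """I: input vector in [0,1] (possibly complement coded); W: N x M weight matrix."""
    I_and_W = np.minimum(I, W)                          # fuzzy AND, row by row (98)
    T = I_and_W.sum(axis=1) / (alpha + W.sum(axis=1))   # choice functions (97)
    order = np.argsort(-T, kind="stable")               # ties broken by smallest index
    for J in order:                                      # search loop
        match = I_and_W[J].sum()                         # |I AND w_J| = |x|, by (101)
        if match >= rho * I.sum():                       # resonance criterion (103)
            W[J] = beta * I_and_W[J] + (1.0 - beta) * W[J]   # learning (106)
            return int(J)
        # otherwise T_J is effectively zeroed for this presentation: try the next node
    raise RuntimeError("no category satisfied the vigilance criterion")
```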
Fast learning corresponds to setting β = 1. The learning law used in the EACH system of Salzberg (1990) is equivalent to equation (106) in the fast-learn limit with the complement coding option described below.

Fast-commit slow-recode option: For efficient coding of noisy input sets, it is useful to set β = 1 when J is an uncommitted node, and then to take β < 1 after the category is committed. Then w_J^(new) = I the first time category J becomes active. Moore (1989) introduced the learning law (106), with fast commitment and slow recoding, to investigate a variety of generalized ART 1 models. Some of these models are similar to Fuzzy ART, but none includes the complement coding option. Moore described a category proliferation problem that can occur in some analog ART systems when a large number of inputs erode the norm of weight vectors. Complement coding solves this problem.

Input normalization/complement coding option: Proliferation of categories is avoided in Fuzzy ART if inputs are normalized. Complement coding is a normalization rule that preserves amplitude information. Complement coding represents both the on-response and the off-response to an input vector a (Figure 9). To define this operation in its simplest form, let a itself represent the on-response. The complement of a, denoted by a^c, represents the off-response, where
$$a_i^c \equiv 1 - a_i. \tag{107}$$
The complement coded input I to the field F_1 is the 2M-dimensional vector
$$I = (a, a^c) = (a_1, \ldots, a_M, a_1^c, \ldots, a_M^c). \tag{108}$$
Note that
$$|I| = |(a, a^c)| = \sum_{i=1}^{M} a_i + \Big(M - \sum_{i=1}^{M} a_i\Big) = M, \tag{109}$$
so inputs preprocessed into complement coding form are automatically normalized. Where complement coding is used, the initial condition (96) is replaced by
$$w_{j1}(0) = \ldots = w_{j,2M}(0) = 1. \tag{110}$$
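As a usage illustration of the complement coding option (107)-(110), the following driver assumes the fuzzy_art_present sketch given after equation (106): it complement-codes a small, invented data set and grows the weight matrix by one uncommitted node whenever a new category is committed.

```python
import numpy as np

# Usage sketch: train Fuzzy ART on complement-coded inputs, starting from a single
# uncommitted (all-ones) node as in (110). The data set and vigilance are illustrative,
# and fuzzy_art_present refers to the sketch above, not to a published routine.
rng = np.random.default_rng(2)
M = 4
data = rng.uniform(0.0, 1.0, (50, M))
W = np.ones((1, 2 * M))                               # one uncommitted node (110)

for a in data:
    I = np.concatenate([a, 1.0 - a])                  # complement coding (108), |I| = M
    J = fuzzy_art_present(I, W, rho=0.6, beta=1.0)    # fast learning
    if J == len(W) - 1:                               # the uncommitted node was chosen,
        W = np.vstack([W, np.ones(2 * M)])            # so add a fresh uncommitted node

print("number of committed categories:", len(W) - 1)
```

Raising rho in this driver produces more, narrower category boxes; lowering it produces fewer, broader ones, in line with the vigilance discussion of Section 25.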
30. FUZZY ARTMAP ALGORITHM

The Fuzzy ARTMAP system incorporates two Fuzzy ART modules ART_a and ART_b that are linked together via an inter-ART module F^ab called a map field. The map field is used to form predictive associations between categories and to realize the match tracking rule whereby the vigilance parameter of ART_a increases in response to a predictive mismatch at ART_b. The interactions mediated by the map field F^ab may be operationally characterized as follows.

ART_a and ART_b: Inputs to ART_a and ART_b are in the complement code form: for ART_a, I = A = (a, a^c); for ART_b, I = B = (b, b^c) (Figure 11). Variables in ART_a or ART_b are designated by subscripts or superscripts "a" or "b". For ART_a, let x^a ≡ (x_1^a, ..., x_{2Ma}^a) denote the F_1^a output vector; let y^a ≡ (y_1^a, ..., y_{Na}^a) denote the F_2^a output vector; and let w_j^a ≡ (w_{j1}^a, w_{j2}^a, ..., w_{j,2Ma}^a) denote the jth ART_a weight vector. For ART_b, let x^b ≡ (x_1^b, ..., x_{2Mb}^b) denote the F_1^b output vector; let y^b ≡ (y_1^b, ..., y_{Nb}^b) denote the F_2^b output vector; and let w_k^b ≡ (w_{k1}^b, w_{k2}^b, ..., w_{k,2Mb}^b) denote the kth ART_b weight vector. For the map field, let x^ab ≡ (x_1^ab, ..., x_{Nb}^ab) denote the F^ab output vector, and let w_j^ab ≡ (w_{j1}^ab, ..., w_{j,Nb}^ab) denote the weight vector from the jth F_2^a node to F^ab. Vectors x^a, y^a, x^b, y^b, and x^ab are set to 0 between input presentations.

Map field activation: The map field F^ab is activated whenever one of the ART_a or ART_b categories is active. If node J of F_2^a is chosen, then its weights w_J^ab activate F^ab. If node K in F_2^b is active, then the node K in F^ab is activated by 1-to-1 pathways between F_2^b and F^ab. If both ART_a and ART_b are active, then F^ab becomes active only if ART_a predicts the same category as ART_b via the weights w_J^ab. The F^ab output vector x^ab obeys
$$x^{ab} = \begin{cases} y^b \wedge w_J^{ab} & \text{if the } J\text{th } F_2^a \text{ node is active and } F_2^b \text{ is active} \\ w_J^{ab} & \text{if the } J\text{th } F_2^a \text{ node is active and } F_2^b \text{ is inactive} \\ y^b & \text{if } F_2^a \text{ is inactive and } F_2^b \text{ is active} \\ 0 & \text{if } F_2^a \text{ is inactive and } F_2^b \text{ is inactive.} \end{cases} \tag{111}$$
By (111), x^ab = 0 if the prediction w_J^ab is disconfirmed by y^b. Such a mismatch event triggers an ART_a search for a better category, as follows.

Match tracking: At the start of each input presentation the ART_a vigilance parameter ρ_a equals a baseline vigilance ρ̄_a. The map field vigilance parameter is ρ_ab. If
$$|x^{ab}| < \rho_{ab}\,|y^b|, \tag{112}$$
then ρ_a is increased until it is slightly larger than |A ∧ w_J^a| |A|^{-1}, where A is the input to F_1^a, in complement coding form. Then
$$|x^a| = |A \wedge w_J^a| < \rho_a\,|A|, \tag{113}$$
where J is the index of the active F_2^a node, as in (105). When this occurs, ART_a search leads either to activation of another F_2^a node J with
$$|x^a| = |A \wedge w_J^a| \ge \rho_a\,|A| \tag{114}$$
and
$$|x^{ab}| = |y^b \wedge w_J^{ab}| \ge \rho_{ab}\,|y^b|,$$
or, if no such node exists, to the shut-down of F_2^a for the remainder of the input presentation.

Map field learning: Learning rules determine how the map field weights w_jk^ab change through time, as follows. Weights w_jk^ab in F_2^a → F^ab paths initially satisfy
$$w_{jk}^{ab}(0) = 1.$$
During resonance with the ART_a category J active, w_J^ab approaches the map field vector x^ab. With fast learning, once J learns to predict the ART_b category K, that association is permanent; i.e., w_JK^ab = 1 for all time.
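The map field rules (111)-(112) and match tracking can be sketched in a few lines. The routine below assumes winner-take-all ART_a and ART_b codes and fast map field learning; it returns either None (prediction confirmed) or a raised value of ρ_a with which the caller would re-run the ART_a search. The helper's name, signature, and parameter values are illustrative assumptions, not part of the published algorithm.

```python
import numpy as np

# Schematic map field step: given the chosen ART_a category J (with complement-coded
# input A) and the supervised ART_b category K, form the map field activity (111),
# test it against the map field vigilance (112), and on a mismatch raise rho_a just
# above the current match ratio |A AND w_J^a| / |A| (match tracking).
def map_field_step(A, J, K, Wa, Wab, rho_ab=1.0, eps=1e-6):
    """Wa: ART_a weight matrix; Wab: map field weights (rows = ART_a nodes, cols = ART_b nodes)."""
    y_b = np.zeros(Wab.shape[1])
    y_b[K] = 1.0                                   # winner-take-all ART_b output
    x_ab = np.minimum(y_b, Wab[J])                 # map field activity, top case of (111)
    if x_ab.sum() >= rho_ab * y_b.sum():           # prediction confirmed (112 not violated)
        Wab[J] = x_ab                              # fast map field learning: w_JK^ab -> 1
        return None                                # rho_a is left at its baseline
    # prediction disconfirmed: match tracking raises rho_a just above the match ratio
    match_ratio = np.minimum(A, Wa[J]).sum() / A.sum()
    return match_ratio + eps                       # caller re-runs the ART_a search with this rho_a
```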
REFERENCES
Adams, J.A. (1967). Human memory. New York: McGraw-Hill.
Amari, S.-I. and Arbib, M. (Eds.) (1982). Competition and cooperation in neural networks. New York, NY: Springer-Verlag.
Amari, S.-I. and Takeuchi, A. (1978). Mathematical theory on formation of category detecting nerve cells. Biological Cybernetics, 29, 127-136.
Asch, S.E. and Ebenholtz, S.M. (1962). The principle of associative symmetry. Proceedings of the American Philosophical Society, 106, 135-163.
Asfour, Y.R., Carpenter, G.A., Grossberg, S., and Lesher, G. (1993). Fusion ARTMAP: A neural network architecture for multi-channel data fusion and classification. Technical Report CAS/CNS TR-93-004. Boston, MA: Boston University. Submitted for publication.
Bienenstock, E.L., Cooper, L.N., and Munro, P.W. (1982). Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. Journal of Neuroscience, 2, 32-48.
Bradski, G., Carpenter, G.A., and Grossberg, S. (1992). Working memory networks for learning multiple groupings of temporal order with application to 3-D visual object recognition. Neural Computation, 4, 270-286.
Carpenter, G.A. and Grossberg, S. (1987a). A massively parallel architecture for a self-organizing neural pattern recognition machine. Computer Vision, Graphics, and Image Processing, 37, 54-115.
Carpenter, G.A. and Grossberg, S. (1987b). ART 2: Stable self-organization of pattern recognition codes for analog input patterns. Applied Optics, 26, 4919-4930.
Carpenter, G.A. and Grossberg, S. (1987c). Neural dynamics of category learning and recognition: Attention, memory consolidation, and amnesia. In S. Grossberg (Ed.), The adaptive brain, I: Cognition, learning, reinforcement, and rhythm. Amsterdam: Elsevier/North-Holland, pp. 238-286.
Carpenter, G.A. and Grossberg, S. (Eds.) (1991). Pattern recognition by self-organizing neural networks. Cambridge, MA: MIT Press.
Carpenter, G.A. and Grossberg, S. (1992). Fuzzy ARTMAP: Supervised learning, recognition, and prediction by a self-organizing neural network. IEEE Communications Magazine, 30, 38-49.
Carpenter, G.A. and Grossberg, S. (1993). Normal and amnesic learning, recognition, and memory by a neural model of cortico-hippocampal interactions. Technical Report CAS/CNS TR-92-021. Boston, MA: Boston University. Trends in Neurosciences, in press.
Carpenter, G.A., Grossberg, S., Markuzon, N., Reynolds, J.H., and Rosen, D.B. (1992). Fuzzy ARTMAP: A neural network architecture for incremental supervised learning of analog multidimensional maps. IEEE Transactions on Neural Networks, 3, 698-713.
Carpenter, G.A., Grossberg, S., and Reynolds, J.H. (1991). ARTMAP: Supervised real-time learning and classification of nonstationary data by a self-organizing neural network. Neural Networks, 4, 565-588.
Carpenter, G.A., Grossberg, S., and Rosen, D.B. (1991). Fuzzy ART: Fast stable learning and categorization of analog patterns by an adaptive resonance system. Neural Networks, 4, 759-771.
Carpenter, G.A., Grossberg, S., and Iizuka, K. (1992). Comparative performance measures of Fuzzy ARTMAP, learned vector quantization, and back propagation for handwritten character recognition. Proceedings of the international joint conference on neural networks, I, 794-799. Piscataway, NJ: IEEE Service Center.
Cohen, M.A. (1988). Sustained oscillations in a symmetric cooperative-competitive neural network: Disproof of a conjecture about a content addressable memory. Neural Networks, 1, 217-221.
Cohen, M.A. (1990). The stability of sustained oscillations in symmetric cooperative-competitive networks. Neural Networks, 3, 609-612.
Cohen, M.A. (1992). The construction of arbitrary stable dynamics in nonlinear neural networks. Neural Networks, 5, 83-103.
Cohen, M.A. and Grossberg, S. (1983). Absolute stability of global pattern formation and parallel memory storage by competitive neural networks. IEEE Transactions on Systems, Man, and Cybernetics, SMC-13, 815-826.
Cohen, M.A. and Grossberg, S. (1986). Neural dynamics of speech and language coding: Developmental programs, perceptual grouping, and competition for short term memory. Human Neurobiology, 5, 1-22.
Cohen, M.A., Grossberg, S., and Pribe, C. (1993). A neural pattern generator that exhibits frequency-dependent bi-manual coordination effects and quadruped gait transitions. Technical Report CAS/CNS TR-93-004. Boston, MA: Boston University. Submitted for publication.
Cole, K.S. (1968). Membranes, ions, and impulses. Berkeley, CA: University of California Press.
Collins, A.M. and Loftus, E.F. (1975). A spreading-activation theory of semantic memory. Psychological Review, 82, 407-428.
Commons, M.L., Grossberg, S., and Staddon, J.E.R. (Eds.) (1991). Neural network models of conditioning and action. Hillsdale, NJ: Lawrence Erlbaum Associates.
Cornsweet, T.N. (1970). Visual perception. New York, NY: Academic Press.
Crick, F. and Koch, C. (1990). Some reflections on visual awareness. Cold Spring Harbor symposium on quantitative biology, LV, The brain. Plainview, NY: Cold Spring Harbor Laboratory Press, 953-962.
Desimone, R. (1992). Neural circuits for visual attention in the primate brain. In G.A. Carpenter and S. Grossberg (Eds.), Neural networks for vision and image processing. Cambridge, MA: MIT Press, pp. 343-364.
Dixon, T.R. and Horton, D.L. (1968). Verbal behavior and general behavior theory. Englewood Cliffs, NJ: Prentice-Hall.
Eckhorn, R., Bauer, R., Jordan, W., Brosch, M., Kruse, W., Munk, M., and Reitboeck, H.J. (1988). Coherent oscillations: A mechanism of feature linking in the visual cortex? Biological Cybernetics, 60, 121-130.
Eckhorn, R. and Schanze, T. (1991). Possible neural mechanisms of feature linking in the visual system: Stimulus-locked and stimulus-induced synchronizations. In A. Babloyantz (Ed.), Self-organization, emerging properties, and learning. New York, NY: Plenum Press, pp. 63-80.
Ellias, S. and Grossberg, S. (1975). Pattern formation, contrast control, and oscillations in the short term memory of shunting on-center off-surround networks. Biological Cybernetics, 20, 69-98.
Frey, P.W. and Slate, D.J. (1991). Letter recognition using Holland-style adaptive classifiers. Machine Learning, 6, 161-182.
Gaudiano, P. (1992a). A unified neural model of spatio-temporal processing in X and Y retinal ganglion cells. Biological Cybernetics, 67, 11-21.
Gaudiano, P. (1992b). Toward a unified theory of spatio-temporal processing in the retina. In G. Carpenter and S. Grossberg (Eds.), Neural networks for vision and image processing. Cambridge, MA: MIT Press, pp. 195-220.
Gaudiano, P. and Grossberg, S. (1991). Vector associative maps: Unsupervised real-time error-based learning and control of movement trajectories. Neural Networks, 4, 147-183.
Geman, S. (1981). The law of large numbers in neural modelling. In S. Grossberg (Ed.), Mathematical psychology and psychophysiology. Providence, RI: American Mathematical Society, pp. 91-106.
Gottlieb, G. (Ed.) (1976). Neural and behavioral specificity (Vol. 3). New York, NY: Academic Press.
Gray, C.M., Konig, P., Engel, A.K., and Singer, W. (1989). Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature, 338, 334-337.
Gray, C.M. and Singer, W. (1989). Stimulus-specific neuronal oscillations in orientation columns of cat visual cortex. Proceedings of the National Academy of Sciences, 86, 1698-1702.
Grossberg, S. (1961). Senior Fellowship thesis, Dartmouth College.
Grossberg, S. (1964). The theory of embedding fields with applications to psychology and neurophysiology. New York: Rockefeller Institute for Medical Research.
Grossberg, S. (1967). Nonlinear difference-differential equations in prediction and learning theory. Proceedings of the National Academy of Sciences, 58, 1329-1334.
Grossberg, S. (1968a). Some physiological and biochemical consequences of psychological postulates. Proceedings of the National Academy of Sciences, 60, 758-765.
Grossberg, S. (1968b). Some nonlinear networks capable of learning a spatial pattern of arbitrary complexity. Proceedings of the National Academy of Sciences, 59, 368-372.
Grossberg, S. (1969a). Embedding fields: A theory of learning with physiological implications. Journal of Mathematical Psychology, 6, 209-239.
Grossberg, S. (1969b). On learning, information, lateral inhibition, and transmitters. Mathematical Biosciences, 4, 255-310.
Grossberg, S. (1969c). On the serial learning of lists. Mathematical Biosciences, 4, 201-253.
Grossberg, S. (1969d). On learning and energy-entropy dependence in recurrent and nonrecurrent signed networks. Journal of Statistical Physics, 1, 319-350.
Grossberg, S. (1969e). Some networks that can learn, remember, and reproduce any number of complicated space-time patterns, I. Journal of Mathematics and Mechanics, 19, 53-91.
Grossberg, S. (1969f). On the production and release of chemical transmitters and related topics in cellular control. Journal of Theoretical Biology, 22, 325-364.
Grossberg, S. (1969g). On variational systems of some nonlinear difference-differential equations. Journal of Differential Equations, 6, 544-577.
Grossberg, S. (1970a). Neural pattern discrimination. Journal of Theoretical Biology, 27, 291-337.
Grossberg, S. (1970b). Some networks that can learn, remember, and reproduce any number of complicated space-time patterns, II. Studies in Applied Mathematics, 49, 135-166.
Grossberg, S. (1971a). Pavlovian pattern learning by nonlinear neural networks. Proceedings of the National Academy of Sciences, 68, 828-831.
Grossberg, S. (1971b). On the dynamics of operant conditioning. Journal of Theoretical Biology, 33, 225-255.
Grossberg, S. (1972a). Pattern learning by functional-differential neural networks with arbitrary path weights. In K. Schmitt (Ed.), Delay and functional-differential equations and their applications. New York: Academic Press. Reprinted in S. Grossberg (1982), Studies of mind and brain, pp. 157-193. Boston, MA: Reidel Press.
Grossberg, S. (1972b). Neural expectation: Cerebellar and retinal analogs of cells fired by learnable or unlearned pattern classes. Kybernetik, 10, 49-57.
Grossberg, S. (1972c). A neural theory of punishment and avoidance, I: Qualitative theory. Mathematical Biosciences, 15, 39-67.
Grossberg, S. (1972d). A neural theory of punishment and avoidance, II: Quantitative theory. Mathematical Biosciences, 15, 253-285.
Grossberg, S. (1973). Contour enhancement, short term memory, and constancies in reverberating neural networks. Studies in Applied Mathematics, 52, 217-257. Reprinted in S. Grossberg (1982), Studies of mind and brain, pp. 332-378. Boston, MA: Reidel Press.
Grossberg, S. (1974). Classical and instrumental learning by neural networks. In R. Rosen and F. Snell (Eds.), Progress in theoretical biology. New York: Academic Press. Reprinted in S. Grossberg (1982), Studies of mind and brain, pp. 65-156. Boston, MA: Reidel Press.
Grossberg, S. (1975). A neural model of attention, reinforcement, and discrimination learning. International Review of Neurobiology, 18, 263-327. Reprinted in S. Grossberg (1982), Studies of mind and brain, pp. 229-295. Boston, MA: Reidel Press.
Grossberg, S. (1976a). On the development of feature detectors in the visual cortex with applications to learning and reaction-diffusion systems. Biological Cybernetics, 21, 145-159.
Grossberg, S. (1976b). Adaptive pattern classification and universal recoding, I: Parallel development and coding of neural feature detectors. Biological Cybernetics, 23, 121-134.
Grossberg, S. (1976c). Adaptive pattern classification and universal recoding, II: Feedback, expectation, olfaction, and illusions. Biological Cybernetics, 23, 187-202.
Grossberg, S. (1976d). On the development of feature detectors in the visual cortex with applications to learning and reaction-diffusion systems. Biological Cybernetics, 21, 145-159.
Grossberg, S. (1978a). A theory of human memory: Self-organization and performance of sensory-motor codes, maps, and plans. In R. Rosen and F. Snell (Eds.), Progress in theoretical biology, Vol. 5. New York: Academic Press. Reprinted in S. Grossberg (1982), Studies of mind and brain, pp. 498-639. Boston, MA: Reidel Press.
Grossberg, S. (1978b). Behavioral contrast in short term memory: Serial binary memory models or parallel continuous memory models? Journal of Mathematical Psychology, 3, 199-219.
Grossberg, S. (1978c). Decisions, patterns, and oscillations in nonlinear competitive systems with applications to Volterra-Lotka systems. Journal of Theoretical Biology, 73, 101-130.
Grossberg, S. (1978d). Competition, decision, and consensus. Journal of Mathematical Analysis and Applications, 66, 470-493.
Grossberg, S. (1980a). How does a brain build a cognitive code? Psychological Review, 87, 1-51.
Grossberg, S. (1980b). Intracellular mechanisms of adaptation and self-regulation in self-organizing networks: The role of chemical transducers. Bulletin of Mathematical Biology, 42, 365-396.
Grossberg, S. (1980c). Biological competition: Decision rules, pattern formation, and oscillations. Proceedings of the National Academy of Sciences, 77, 2338-2342.
Grossberg, S. (Ed.) (1981). Adaptive resonance in development, perception, and cognition. In S. Grossberg (Ed.), Mathematical psychology and psychophysiology. Providence, RI: American Mathematical Society.
Grossberg, S. (1982a). Studies of mind and brain: Neural principles of learning, perception, development, cognition, and motor control. Boston, MA: Reidel Press.
Grossberg, S. (1982b). Associative and competitive principles of learning and development: The temporal unfolding and stability of STM and LTM patterns. In S.-I. Amari and M. Arbib (Eds.), Competition and cooperation in neural networks. New York: Springer-Verlag.
Grossberg, S. (1982c). A psychophysiological theory of reinforcement, drive, motivation, and attention. Journal of Theoretical Neurobiology, 1, 286-369.
Grossberg, S. (1983). The quantized geometry of visual space: The coherent computation of depth, form, and lightness. Behavioral and Brain Sciences, 6, 625-657.
Grossberg, S. (1984). Some psychophysiological and pharmacological correlates of a developmental, cognitive, and motivational theory. In J. Cohen, R. Karrer, and P. Tueting (Eds.), Brain and information: Event related potentials, 425, 58-151, Annals of the New York Academy of Sciences. Reprinted in S. Grossberg (Ed.), The adaptive brain, Volume I, 1987, Amsterdam: Elsevier/North-Holland.
194
Grossberg, S. (1986). The adaptive self-organization of serial order in behavior: Speech, language, and motor control. In E.C. Schwab and H.C. Nusbaum (Eds.), Pattern recognition by humans and machines, Volume 1: Speech perception, pp. 187-294, New York, NY: Academic Press. Reprinted in S. Grossberg (Ed.), The adaptive brain, Volume 11, 1987, Amsterdam: Elsevier/North-Holland. Grossberg, S. (1988). Nonlinear neural networks: Principles, mechanisms, and architectures. Neural Networks, 1, 17-61. Grossberg, S. and Kuperstein, M. (1986). Neural dynamics of adaptive sensorymotor control. Amsterdam: Elsevier/North-Holland; expanded edition, 1989, Elmsford, NY: Pergamon Press. Grossberg, S. and Merrill, J.W.L. (1992). A neural network model of adaptively timed reinforcement learning and hippocampal dynamics. Cognitive Brain Research, 1, 3-38. Grossberg, S. and Mingolla, E. (1985a). Neural dynamics of form perception: Boundary completion, illusory figures, and neon color spreading. Psychological Review, 92, 173211. Grossberg, S. and Mingolla, E. (1985b). Neural dynamics of perceptual grouping: Textures, boundaries, and emergent segmentations. Perception and Psychophysics, 1985, 38, 141-171. Grossberg, S. and Pepe, J. (1970). Schizophrenia: Possible dependence of associational span, bowing, and primacy versus recency on spiking threshold. Behavioral Science, 15, 359-362. Grossberg, S. and Pepe, J. (1971). Spiking threshold and overarousal effects in serial learning. Journal of Statistical Physics, 3, 95-125. Grossberg, S. and Somers, D. (1991). Synchronized oscillations during cooperative feature linking in a cortical model of visual perception. Neural Networks, 4, 453-466. Grossberg, S. and Somers, D. (1992). Synchronized oscillations for binding spatially distributed feature codes into coherent spatial patterns. In G.A. Carpenter and S. Grossberg, (Eds.), Neural networks for vision and image processing. Cambridge, MA: MIT Press, 385406. Grossberg, S. and Stone, G.O. (1986). Neural dynamics of word recognition and recall: Attentional priming, learning, and resonance. Psychological Review, 93, 46-74. Grossberg, S. and TodoroviC, D. (1988). Neural dynamics of 1-D and 2-D brightness perception: A unified model of classical and recent phenomena. Perception and Psychophysics, 43, 241-277. Hebb, D.O. (1949). The organization of behavior. New York, NY: Wiley Press. Hecht-Nielsen, R. (1987). Counterpropagation networks. Applied Optics, 26,4979-4984. Hirsch, M.W. (1982). Systems of differential equations which are competitive or cooperative, I: Limit sets. SIAM Journal of Mathematical Analysis, 13, 167-179. Hirsch, M.W. (1985). Systems of differential equations which are competitive or cooperative, 11: Convergence almost everywhere. SIAM Journal of Mathematical Analysis, 16, 423-439. Hirsch, M.W. (1989). Convergent activation dynamics in continuous time networks. Neural Networks, 2 , 331-350.
195
Hodgkin, A.L. (1964). The conduction of the nervous system. Liverpool, UK: Liverpool University. Holland, J.H. (1980). Adaptive algorithms for discovering and using general patterns in growing knowledge bases. International Journal of Policy Analysis and Information Systems, 4, 217-240. Hopfield, J.J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79, 25542558. Hopfield, J.J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, 81, 3058-3092. Hubel, D.H. and Wiesel, T.N. (1977). Functional architecture of macaque monkey visual cortex. Proceedings of the Royal Society of London (B), 198, 1-59. Hunt, R.K. and Jacobson, M. (1974). Specification of positional information in retinal ganglion cells of Xenopus laevis: Intraocular control of the time of specification. Proceedings of the National Academy of Sciences, 71, 3616-3620. Iverson, G.J. and Pavel, M. (1981). Invariant properties of masking phenomena in psychoacoustics and their theoretical consequences. In S. Grossberg (Ed.), Mathematical psychology and psychophysiology. Providence, RI: American Mathematical Society, pp. 17-24. Jung, J. (1968). Verbal learning. New York: Holt, Rinehart, and Winston. Kandel, E.R. and O’Dell, T.J. (1992). Are adult learning mechanisms also used for development? Science, 258, 243-245. Kandel, E.R. and Schwartz, J.H. (1981). Principles of neural science. New York, NY: Elsevier/North-Holland. Katz, B. (1966). Nerve, muscle, and synapse. New York, NY: McGraw-Hill. Khinchin, A.I. (1967). Mathematical foundations of information theory. New York, NY: Dover Press. Klatsky, R.L. (1980), Human memory: Structures and processes. San Francisco, CA: W.H. Freeman. Kohonen, T. (1984). Self-organization and associative memory, New York, NY: Springer-Verlag. Kosko, B. (1986). Fuzzy entropy and conditioning. Information Sciences, 40, 165-174. Levine, D. and Grossberg, S. (1976). On visual illusions in neural networks: Line neutralization, tilt aftereffect, and angle expansion. Journal of Theoretical Biology, 61, 477-504. Levy, W.B. (1985). Associative changes at the synapse: LTP in the hippocampus. In W.B. Levy, J. Anderson and S. Lehmkuhle, (Eds.), Synaptic modification, neuron selectivity, and nervous system organization. Hillsdale, NJ: Lawrence Erlbaum Associates, pp. 5-33. Levy, W.B., Brassel, S.E., and Moore, S.D. (1983). Partial quantification of the associative synaptic learning rule of the dentate gyrus. Neuroscience, 8, 799-808. Levy, W.B. and Desmond, N.L. (1985). The rules of elemental synaptic plasticity. In W.B. Levy, J . Anderson and S. Lehmkuhle, (Eds.), Synaptic modification, neuron
196
selectivity, and nervous system organization. Hillsdale, NJ: Lawrence Erlbaum Associates, pp. 105-121. Linsker, R. (1986). From basic network principles to neural architecture. Proceedings of the National Academy of Science, 83,7508-7512, 8390-8394, 8779-8783. Maher, B.A. (1977). Contributions to the psychopathology of schizophrenia. New York, NY: Academic Press. Malsburg, C. von der (1973). Self-organization of orientation sensitive cells in the striate cortex. Kybernetik, 14, 85-100. Malsburg, C. von der and Willshaw, D.J. (1981). Differential equations for the development of topological nerve fibre projections. In S. Grossberg (Ed.), Mathematical psychology and psychophysiology. Providence, RI: American Mathematical Society, pp. 39-48. May, R.M. and Leonard, W.J. (1975). Nonlinear aspects of competition between three species. SlAM Journal on Applied Mathematics, 29,243-253. McCulloch, W.S. and Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin of the Mathematical Biophysics, 5, 115-133. McGeogh, J.A. and Irion, A.L. (1952). The psychology of human learning, Second edition. New York: Longmans and Green. Miller, G.A. (1956). The magic number seven plus or minus two. Psychological Review, 63, 81. Moore, B. (1989). ART 1 and pattern clustering. In D. Touretzky, G. Hinton, and T. Sejnowski (Eds.), Proceedings of the 1988 connectionist models summer school. San Mateo, CA: Morgan Kaufmann, pp. 174-185. Murdock, B.B. (1974). Human memory: Theory and data. Potomac, MD: Erlbaum Press. Nabet, B. and Pinter, R.B. (1991). Sensory neural networks: Lateral inhibition. Boca Raton, FL: CRC Press. Norman, D.A. (1969). Memory and attention: An introduction to human information processing. New York, NY: Wiley and Sons. Osgood, C.E. (1953). Method and theory in experimental psychology. New York, NY: Oxford Press. Plonsey, R. and Fleming, D.G. (1969). Bioelectric phenomena. New York, NY: McGraw-Hill. Rauschecker, J.P. and Singer, W. (1979). Changes in the circuitry of the kitten’s visual cortex are gated by postsynaptic activity. Nature, 280, 58-60. Repp, B.H. (1991). Perceptual restoration of a “missing” speech sound: Auditory induction or illusion? Haskins Laboratories Status Report on Speech Research, SR-107/108, 147-170. Rumelhart, D.E. and Zipser, D. (1985). Feature discovery by competitive learning. Cognitive Science, 9,75-112. Rundus, D. (1971). Analysis of rehearsal processes in free recall. Journal of Experimental Psychology, 89,63-77.
197
Salzberg, S.L. (1990). Learning w i t h nested generalized exemplars. Boston, MA: Kluwer Academic Publishers. Samuel, A.G. (1981a). Phonemic restoration: Insights from a new methodology. Journal of Experimental Psychology: General, 110,474-494. Samuel, A.G. (1981b). The rule of bottom-up confirmation in the phonemic restoration illusion. Journal of Experimental Psychology: Human Perception and Performance, 7, 1124-1 131. Schvaneveldt, R.W. and MacDonald, J.E. (1981). Semantic context and the encoding of words: Evidence for two modes of stimulus analysis. Journal of Experimental Psychology: Human Perception and Performance, 7, 673-687. Singer, W., Neuronal activity as a shaping factor in the self-organization of neuron assemblies. In E. Basar, H. Flohr, H. Haken, and A.J. Mandell (Eds.) (1983). Synergetics of t h e brain. New York, NY: Springer-Verlag, pp. 89-101. Smith, E.E. (1990). In D.O. Osherson and E.E. Smith (Eds.), A n invitation to cognitive science. Cambridge, MA: MIT Press. Somers, D. and Kopell, N. (1993). Rapid synchronization through fast threshold modulation. Biological Cybernetics, in press. Underwood, B.J. (1966). Experimental psychology, Second edition. New York: Appleton-Century-Crofts. Warren, R.M. (1984). Perceptual restoration of obliterated sounds. Psychological Bulletin, 96,371-383. Warren, R.M. and Sherman, G.L. (1974). Phonemic restorations based on subsequent context. Perception and Psychophysics, 16,150-156. Werblin, F.S. (1971). Adaptation in a vertebrate retina: Intracellular recordings in Necturus. Journal of Neurophysiology, 34, 228-241. Werbos, P. (1974). Beyond regression: New tools for prediction and analysis in the behavioral sciences. Ph.D. Thesis, Cambridge, MA: Harvard University. Willshaw, D.J. and Malsburg, C. von der (1976). How patterned neural connections can be set up by self-organization. Proceedings of the Royal Society of London (B), 194, 431445. Young, R.K. (1968). Serial learning. In T.R. Dixon and D.L. Horton (Eds.), Verbal behavior a n d general behavior theory. Englewood Cliffs, NJ: Prentice-Hall. Zadeh, L. (1965). Fuzzy sets. Information Control, 8, 338-353.
This Page Intentionally Left Blank
Mathematical Approaches to Neural Networks J.G. Taylor (Editor) Q
1993 Elsevier Science Publishers B.V. All rights reserved.
199
On-line learning processes in artificial neural networks Tom M. Heskes and Bert Kappen Department of Medical Physics and Biophysics, University of Nijmegen, Geert Grooteplein 21, 6525 EZ Nijmegen, T h e Netherlands.
Abstract We study on-line learning processes in artificial neural networks from a general point of view. On-line learning means that a learning step takes place a t each presentation of a randomly drawn training pattern. I t can be viewed as a stochastic process governed by a continuous-time master equation. On-line learning is necessary if not all training patterns are available all the time. This occurs in many applications when the training patterns are drawn from a time-dependent environmental distribution. Studying learning in a changing environment, we encounter a conflict between the adaptability and the confidence of the network’s representation. Minimization of a criterion incorporating both effects yields an algorithm for on-line adaptation of the learning parameter. The inherent noise of on-line learning makes i t possible t o escape from undesired local minima of t h e error potential on which t h e learning rule performs (stochastic) gradient descent. We try t o quantify these often made claims by considering the transition times between various minima. We apply our results on the transitions from ”twists” in two-dimensional self-organizing maps to perfectly ordered configurations. Finally, we discuss t h e capabilities of on-line learning for global optimization.
1 1.1
Introduction Why a theory for on-line learning?
In neural network models, learning plays an essential role. Learning is the mechanism by which a network adapts itself t o its environment. T h e result of this adaptation process, in both natural as well as in artificial systems, is that the network obtains a representation of its environment. This representation is encoded in its plasticities, such as synapses and thresholds. T h e function of a neural network can be described in terms of its input-output relation, which in turn is fully determined by the architecture of the network and by t h e learning rule. Examples of such functions may be classification (as in multi-layered perceptrons), feature extraction (as in networks that perform a principle component analysis), recognition, transformation for motor tasks, or memory. The representation that the network has learned of the environment enables the network t o perform its function in a way that is ”optimally” suited for the environment on which it is taught. Despite the apparent differences in their functionalities, most learning rules in the current uetwork literature share the following properties.
200 1. Neural networks learn from examples. An example may be a picture that must be memorized or a combination of input and desired output of the network that must be learned. The total set of examples or stimuli is called the training set or the environment of the neural network. 2. The learning rule contains a global scale factor, the ”learning parameter”. It sets the typical magnitude of the weight changes at each learning step. In this chapter, we set up and work out a theoretical framework based on these two properties. It covers both supervised learning (learning with ”teacher”, e.g., backpropagation [55],for a review see [33, 651) and unsupervised learning (learning without ”teacher”, e.g., Kohonen learning [37], for a review see (61). The approach taken in this chapter is therefore quite general. I t includes and extends results from studies on specific learning rules (see e.g. [3,53, 9,481).
1.2
Outline of this chapter
In artificial neural networks, on-line learning is modeled by randomly drawing examples from the environment. This introduces stochasticity in the learning process. The learning process becomes a discrete-time Markov process’, which can be transformed into a continuous-time master equation. The study of learning processes becomes essentially a study of a particular class of master equations. In section 2 we point out the correct way t o approximate this master equation by a Fokker-Plank equation in the limit of small learning parameters. We discuss the consequences of this approach in the case of just one fixed point of the (average) learning dynamics. Section 3 is more like an intermezzo. Here we discuss two other approaches. The Langevin approach, which leads t o an equilibrium Gibbs distribution, has become very popular in neural network literature. However, on-line learning, as we define it, cannot be formulated in terms of a Langevin equation, does not lead t o a Gibbs distribution, and is therefore more difficult to study. We will also discuss the more ”mathematical” approach which describes on-line learning using techniques from stochastic approximation theory. The mathematical approach has led to many important and rigorously proven theorems, some of which will be mentioned in section 3. On-line learning, if compared with batch-mode learning where a learning step takes place on account of the whole training set, is necessary if not all training patterns are available all the time. This not only the case for biological learning systems, but also in many practical applications, especially in applications such as financial modeling, economic forecasting, robot control, etcetera, when the training patterns are drawn from a time-dependent environmental distribution. This notion leads to the study of on-line learning in a changing environment in section 4. Using the same techniques as in section 2, we encounter a conflict between the adaptability and the confidence or accuracy of the network’s representation. Minimization of a suitable criterion, the so-called ”misadjustment”, leads t o an optimal learning parameter for learning in a changing environment. The derivation of the optimal learning parameter in section 4 is nice, but of little practical use. To calculate this learning parameter, one needs detailed information about the neural network and its environment, information that is usually not available. In section 5 we try to solve this problem by considering the statistics of the weights. This yields an autonomous algorithm for learning-parameter adjustment. ‘The underlying assumption is that subsequent stimuli are uncotrelated. This is the case for almost all artificial neural network learning rules. However, for biological learning processes and for some applications subsequent stimuli may be correlated. Then the results of our analysis do not apply.
20 1 Another argument in favor of on-line learning, is the possibility t o escape from undesired local minima of the energy function or error potential on which the learning rule performs (stochastic) gradient descent. In section 6 we try to quantify these often made claims by considering the transition times between various minima of the error potential. Starting from two hypotheses, based on experimental observations and theoretical arguments, we show that these transition times scale exponentially with some constant, the so-called ”reference learning parameter”, divided by the learning parameter. Well-known examples of undesired fixed points of the average learning dynamics are topological defects in self-organizing maps. Using the theory of section 6, we calculate in section 7.1 the reference learning parameters for the transitions from ”twists” in two-dimensional maps t o perfectly ordered configurations. We compare the theoretically obtained results with results obtained from straightforward simulations of the learning rule. Finally, we discuss in section 8 to what extent on-line learning might be used as a global optimization method. We derive cooling schedules that guarantee convergence to a global minimum. In these cooling schedules, the reference learning parameters discussed in section 6 play an important role. We compare the optimization capabilities of on-line backpropagation and ”Langevin-type” learning for a specific example with profound local minima.
Learning processes and their average behavior
2 2.1
From random walk to master equation
Let the adaptive elements of a neural network, such as synapses and thresholds, be given by a weight vectorZ w = (q,. . . , w ~ E) lRN. ~ At distinct iteration times w is changed due to the presentation of a training pattern i = ( z l , .. .,x , ) ~E IR”,which is drawn at random according to a probability distribution p(.’). The new weight vector w’ = w A w depends on the old weight vector and on the training pattern:
+
Aw = v f ( w , Z ) .
(1)
The function f is called the learning rule, 1) the learning parameter. Because of the random pattern presentation, the learning process is a stochastic process. We have to talk in terms of probabilities, averages, and fluctuations. The most obvious probability to start with is the probability p,(w) to be in state w after i iterations. This probability obeys a random walk equation pi(+
=
J dNw ~ ( w ’ l wpi-l(w)j )
(2)
with T(w’1w) the transition probability to ”walk” in one learning step from state w to state w’: T(w’1w) =
/
d l p(i)P
( W ’
-w
- qf(w,5)).
(3)
The random walk equation (2) gives a description in discrete time steps. Bedeaux, Lakatos-Lindenberg, and Shuler [7] showed, that a continuous-time description can be obtained through the assignment of random values At to the time interval between two succeeding iteration steps. If these At are drawn from a probability density
5-
@(At)= -exp
[--:I>
’We use the notation AT t o deliole the transpose of the matrix or vector A
202 the probability #(i,t), tnat after time t there have been exactly i transitions, follows a Poisson process. The probability P(w, t ) , that a network is in state w at time t , reads 05
C d(i7 t)pi(w).
~ ( wt ), =
i=O
This probability function can be differentiated with respect t o time, yielding the master equation 8P(W', t ) ___ = / d N w [W(w'lw)P(w,t) - W(wlw')P(w',t)], at
(4)
with the transition probability per unit time 1 Pl'(w'lw) = -T(w'lw).
(5)
T
Through T we have introduced a physical time scale. Here we have presented a nice mathematical trick to transform a discrete time random walk equation into a continuous time master equation. It is valid for all values of T and 7. For the rest of this chapter we will choose T = 1, i.e., the average time between two learning steps is our unit of time. For notational convenience we introduce the averages over the ensemble of learning networks Z(t)
/
(@(w))~ '%~ ~ ,d N v P ( w , t ) cP(w), and over the set of training patterns R (@(Z))n
sf/ d " r p ( Z ) @ ( Z ) ,
for arbitrary function Q ( w )and '@(Z). The dynamics of equation (4) cannot be solved in general. We will point out the incorrect (section 2.2) and the correct (section 2.3) way t o approximate this master equation for small learning parameters 7 . To simplify the notation, we will only consider the one-dimensional case. In our discussion of the asymptotic dynamics (section 2.4), we will generalize t o N dimensions.
2.2
The Fokker-Planck approximation of the Kramers-Moyal expansion
A totally equivalent description of the master equation is given by its full Kramers-Moyal exDansion
with the so-called jump moments an(w)
erJ dw' (w
-
w')"T(w\u') =
l)n
def
( f " ( w , r ) ) n = 7"6*(w),
(7)
where all iLn are of order 1, i.e., independent of 7. By terminating this series at the second term, one obtains the Fokker-Planck equation
203 In one dimension, the equilibrium distribution of the Fokker-Planck equation can be written in closed form:
with N a normalization constant. Because of the convenience and the simplicity of the result, the Fokker-Planck approach is very popular, also in neural network literature on on-line learning processes [23, 44, 50, 531. However, it is incorrect! Roughly speaking, this approximation is possible if and only if the average step size (Aw) and the variance of the step size ((Aw - (Aw))’) are proportional to the same small parameter 1141. Learning rules of the type (1) have (Aw) = O(9) but ((Aw - (Aw))’) = O ( v 2 )and thus do not satisfy this so-called ”scaling assumption”. To convince ourselves, we substitute the equilibrium distribution (9) into the Kramers-Moyal expansion (6) and notice that the third, fourth, . . ., 00 terms are all of the same order as the first and second order terms: formally there is no reason to break off the Kramers-Moyal series after any number of terms.
2.3
A small fluctuations expansion
Intuitively, a stochastic process can often be viewed as an average, deterministic trajectory, with stochastic fluctuations around this trajectory. Using Van Kampen’s system size expansion [63] (see also [14]),it is possible to obtain the precise conditions under which this intuitive picture is valid. We will refer to this as the small fluctuations expansion. I t consists of the following steps.
1. Following Van Kampen, we make the ”small fluctuations Ansatz”, i.e., we choose a new variable ( such that ‘w = 9(t) -I-f i t (10) with d ( t ) a function to be determined. Equation (10) says that the time-dependent stochastic variable w is given by a deterministic part 4(t) plus a term of order Jii containing the (small) fluctuations. A posteriori, this Ansatz should be verified. The function I I ( ( , t ) is the probability P( w,t ) in terms of the variable [:
2. Using simple chain rules for differentiation, we transform the Kramers-Moyal expansion (G) for P ( w , t ) into a differential equation for II(6, t ) :
3. We choose the function + ( t ) such that the lowest order terms on the left- and righthandside cancel, i.e.,
This is called the deterministic equation.
204 4. We make a Taylor expansion of 6,#($(t)+f i t ) in powers of f i .After some rearrangements
we obtain
5. In the limit 'I -+ 0 only the term m = 2 survives on the righthandside. This is called the linear noise approximation. The remaining differential equation for II((, t ) is the FokkerPlanck equation
where the prime denotes differentiation with respect to the argument. 6. From equation (12) we calculate the dynamics of the average fluctuations (():(,) size of the fluctuations ( ( 2 ) z ( , ) :
and the
is of order 1. From equation (13) we conclude that the final result is consistent with the Ansatz, provided that both evolution equations converge, i.e., that
7. We started with the Ansatz that
4(4(t)) < 0 . So, there are regions of weight space where the small fluctuations expansion is valid (u: and where it is invalid (u; 2 0).
(14)
< 0)
Let us summarize what we have done so far. We have formulated the learning rule (1) in ternis of a discrete time Markov process (2). Introducing Poisson distributed time steps we have hailsformed this discrete random walk equation into a continuous time master equation (4). Making a small fluctuations Ansatz for small learning parameters 9, we have derived equation (11) for the deterministic behavior and equation (12) for the probability distribution of the flucluations around this deterministic behavior At the same time we have derived the condition (14) which musl be satisfied for this description to be valid in the limit of small learning paiameters 17 Now that we have made a rigorous expansion of the master equation, we can refine our lmld statement that the Fokker-Planck approximation i s incorrect If we substitute the small flu
205 2.4
Asymptotic results in N dirhensions
The first two jump moments defined in equation (7) play an important role and are therefore given special names: the drift vector, which is just the average learning rule
f(w)
s (f(w,q)n ,
and the diffusion matrix
D % (f(w,F)fT(W,z))5(t), containing - the fluctuations in the learning rule. Furthermore, we define the Hessian matrix H(w) with components Hij(W) = --afi(w) dWj
If and only if the Hessian matrix is symmetric3, an energy function or error potential E(w) can be defined such that the learning rule performs (stochastic) gradient descent on this error:
f(w) = -VE(w),
(17)
where V stands for differentiation with respect t o the weight vector w. The Hessian matrix gives the curvature of the error potential in the different directions. The condition (14) says that the "small fluctuations expansion" is valid in regions of weight space with positive definite Hessian H(w). These regions will be called attraction regions. The deterministic equation (11) reads in N dimensions
The attractive fixed point solutions of this "average learning dynamics" will be denoted by w*. If there exists an error potential, then these fixed points are simply the (local) minima. At a fixed point w* we have no drift, i.e., f(w*) = 0 and a positive definite Hessian H(w*). The typical local relaxation time towards these fixed points is
with X,,(w*) the smallest eigenvalue of the Hessian H(w*). To study the asymptotic convergence, we can make an expansion around the minimum w*. In [28, 631 it is shown that, after linearization of d ( t ) around the fixed point w*, the evolution equations (11) and (13) are equivalent with 1 dm(t) -- -Hm(t) 'I dt
1d C z ( t ) -- - H C z ( t ) - C * ( t ) H T + 'ID, 'I
dt
where the Hessian and the diffusion matrix are both evaluated at the fixed point w* and with definitions for the bias m(t)and covariance matrix C 2 ( t )
31n the literature, the matrix H ( w ) is only called Hessian if it is indeed symmetric.
206 I t can be shown t h a t this linearization is allowed for 7 small enough [28]. ;From the linear Fokker-Planck equation (12) and t h e asymptotic evolution equations (20) we conclude t h a t the asymptotic probability distribution for small learning parameters 7 is a simple Gaussian, with its average a t the fixed point w’ and a covariance matrix C2 obeying
H E 2 t C2H = 7D.
(22)
So, there are persistent fluctuations of order 7 that will only disappear in the limit 7 + 0. These theoretical predictions are in good agreement with simulations (see [28] and simulations in the following sections).
3 3.1
Intermezzo: other approaches The Langevin approach
In this section we will point out the difference between the ”intrinsic” noise due t o the random presentation of training patterns and the ”artificial” noise in studies on t h e generalization capabilities of neural networks (see e.g. [57, 641). In the latter case, the noise is added to the deterministic equation (18), i.e., the weights evolve according t o the Langevin equation dw(t) -
- -VE(w(t))
dt where ( ( 1 ) is white noise obeying (ti(t)(j(t’)) =
6;j
+ @<(t), 6 ( t - 1’).
T h e Langevin equation (23) is equivalent t o the Fokker-Planck equation [63]
aP(w,t) = -V{f(w)P(w,t)} at
+ TVZP(w,t).
The equilibrium distribution is [compare with equation (9)]
with 2 a normalization constant. The existence of this Gibbs distribution raises the idea to put learning in the framework of statistical mechanics [45, 57, 641. In these studies, the Langevin equation (23) is more an ”excuse” to arrive a t the Gibbs distribution (24) than an attempt t o study the dynamics of learning processes in artificial neural networks. T h e equilibrium distribution of the master equation for on-line learning processes is not a simple Gibbs distribution (see also [23]), which makes the analysis of on-line learning processes much more difficult. Because of the equilibrium Gibbs distribution (24), the Langevin equation (23) has also been proposed as a method for global optimization [2, 15, 17, 211. The discrete-time version
w(t+At)-w(t) = f(w)At t
<
@<m,
(25)
with Gaussian white noise of variance 1 , can be simulated easily. T h e smaller At, the closer the correspondence with t h e continuous Langevin equation. We will call this ”Langevin-type learning” and we will come back on it in section 8.3. Note that equation (25) does indeed satisfy the ”scaling assumption” mentioned in section 2.2: both the average step size and the variance of the step size are proportional t o At. This scaling property explains why equation (25) can iiidced be approximated by a globally valid Fokker-Planck equation, and the learning rule (1) 1101.
207
3.2
Mathematical approach
Besides t h e "physical" approach which starts from the master equation, there is t h e "mathematical" approach which treats on-line learning in the context of stochastic approximation theory. T h e starting point in this approach is the so-called interpolated process. With w, t h e network the learning parameter after n iterations, t h e interpolated process w(t) is defined state and
v,
hv -J
w(t) =
tn
-
t
-wn-1+ '7n
t - t,_l w,,
for t,-l
5 t < 1, ,
'7n
with t o 2 ' 0 and t, = 71 +. ..+ qn. This approach has led to many important, rigorously proven theorems. For example, if q,, tends to zero a t a suitable rate, i.e., such t h a t def
then the interpolated process of w, eventually follows the solution trajectory of the ordinary differential equation (18) with probability 1 [43, 471. Furthermore, if these and some additional technical requirements are satisfied, the learning process will always converge t o one of the fixed points w* of this differential equation. In neural network literature this method has been applied t o the analysis of feature extraction algorithms (34, 561. Similar techniques have been used to study the convergence of general learning algorithms for small constant learning parameters (401. In the context of global optimization (see section 8) the work of Kushner [41, 42) is worth mentioning. In particular he shows that convergence t o the global optimum occurs almost surely, provided t h a t in the limit n -+ 03 the learning parameter decreases proportional t o l / l o g n 142, 65).
4 4.1
A conflict in a changing environment Motivation and mathematical description
Equation (22) states t h a t we must drop the learning parameter to zero in order t o prevent asymptotic fluctuations in the network state. This has been the usual strategy in the training of artificial neural networks. But this is certainly not the kind of behavior one would expect from a true adaptive system that a neural network, based on real biological systems, should be. A true adaptive system can always adapt itself to changes in the environment. Biological neural systems are famous for their ability t o correct for the lengthening of limbs during growth, or their ability t o recover (at least partially) after severe damage or surgery. This kind of adaptability is also desirable for artificial neural networks, e.g., for networks for the control of robots that suffer from wear and tear, or for neural networks for the modeling of economic processes. In this section we will therefore discuss the performance of neural networks learning in a changing environment (281. Mathematically speaking, a changing environment corresponds t o a time-dependent input probability p(.',i.). T h e probability density of network states w still follows a continuous-time master equation, but now with time-dependent transition probability Tt(w'lw):
208 where Q ( t )stands for the set of training patterns, the "environment", at time t. The fixed points w*(t) of the deterministic equation
may depend on time. In equation (27) it is important t o note the difference between the variables s and t . We define the "misadjustment" & as the average squared Euclidian distance with respect to this fixed point w'(t) [ll]:
The bias m(t), defined in equation (21) but now with time-dependent w'(t) instad of fixed w', is a measure of how well the ensemble of learning networks follows the environmental change on the average. It gives the typical delay between what the average network state is, (w):(~),and what it should be, w*(t). The covariance matrix C 2 ( t )gives the width of the distribution and thus a measure of "confidence". T is a time window to be discussed later. 4.2
An example: Grossberg learning in a changing environment
Let us first discuss a simple model that can be solved without any approximations. We consider the Grossberg learning rule [19] AW = V ( Z - w ) , in a time-dependent environment. The input distribution is moving along the axis with a constant velocity v, i.e., p ( z , t ) = i j ( z - vt). We choose
p(.) =
1 21
-q i +
e(i - z ) ,
with @ ( I=) 1 for z > 0 and B(z) = 0 for I < 0. So, z is drawn with equal probability from the is constant. The aim of this interval [vt - 1, vt 11. The input standard deviation = 1/& learning rule is to make w coincide with the mean value of the probability distribution p(z,t), i.e., the fixed point w*(t) of the deterministic equation (27) obeys
+
x
w * ( t ) = ( " ) n ( t ) = vt .
So, V' = v , the rate of change of the fixed point solution is equal to the rate of change of the environment. Straightforward calculations show that the evolution of the bias m ( t ) and the variance Cz(t) is governed by
This set of differential equations decays exponentially t o the stationary solution
m = y 17'
72x2
~ 2 = - .
+ v2
7(2 - 9 )
209
P
w - w*(t)
Figure 1: Probability distribution for time-dependent Grossberg learning. Learning parameter 7 = 0.05, standard deviation x = 1.0. The input probability p(z,t) is drawn for reference (solid box). Zero velocity (solid line). Small velocity: v = 0.01 (dashed line). Large velocity: v = 0.1 (dash-dotted line). Note that this behavior is really different from the behavior in a fixed environment. In a fixed environment ( v = 0) the asymptotic bias is negligible if compared with the variance4. However, in a changing environment ( v > 0), the bias is inversely proportional to the learning parameter 7,and can become really important if this learning parameter is chosen too small. In figure 1 we have shown the (simulated) probability density P ( w - w*(l)) for three different values of the speed v . For zero velocity the bias is zero and the distribution is sharply peaked. For a relatively small velocity, the influence on the width of the distribution is negligible, but the effect on the bias is clearly visible. For a relatively large speed, the variance is also affected and can get pretty large. A good measure for the learning performance is the misadjustment defined in equation (28). In the limit T + 00, we can neglect the exponential transients t o the stationary state (29). We obtain 73x2 2v2 & = T2(2 - 17) . This misadjustment is sketched in figure 2, together with simulation results. For small learning parameters the bias dominates the misadjustment and we have
+
On the other hand, for larger learning parameters the variance yields the most important contribution: & %'' -7 for ( v / x ) 2 / 3<< 7 << 2 . 2
Somewhere in between these two limiting cases, the misadjustment has a minimum at the optimal 'For the linear learning rule discussed i n this example it is even zero. I n general, the nonlinearity of the learning rule leads to a bias of O(q) whereas the standard deviation is of O(J?).
210
0.21
"
0
\ 0.02
0.04
0.06
0.08
0.1
fl
Figure 2: Misadjustment as a function of the learning parameter for Grossberg learning in a changing environment. Squared bias (computed, dashed line; simulated, +), variance (computed, dash-dotted line; simulated, x) and error (computed, solid line; simulated, *). Simulations were done with 5000 neural networks. Standard deviation input: x = 1.0. learning parameter voptimdwhich is for this particular example the solution of the cubic equation
Reasonable performance of the learning systems can only be expected if 'u << x, i.e., if the displacement of the input probability distribution per learning step is much smaller than its width. In this limit, we obtain
This optimal learning parameter gives the best compromise between fast adaptability, which asks for a large learning parameter, and high confidence, which requires a small (but not too small!) learning parameter. A similar "accuracy conflict" is noted by Wiener in his work on linear prediction theory [67]. 4.3
N o n l i n e a r l e a r n i n g rules i n nonstationary environments
The Grossberg learning rule is linear and therefore exactly solvable. Of course, most practical learning rules in neural networks are nonlinear and high dimensional. For nonlinear highdimensional learning rules the basic idea is still the same: there is a conflict between fast adaptability ( a small bias) and high confidence ( a small variance). In order t o calculate a learning parameter that yields a good compromise between these two competing goals, we have to make approximations, similar to the ones made in section 2. So, we have to require that the learning parameter is so small that it is allowed to make the usual small fluctuations expansion. To linearize around t o fixed point, we must now also require that the rate of change v is much smaller than the typical weight change qf. Provided these requirements are fulfilled, the evolution of the bias m(t) and the covariance Z2(t) is governed by [28]
e'w'
21 1
with notation H ( t ) kfH(w*(t)), and so on. Let us furthermore assume that the changes in the "speed" v, the diffusion D, and the curvature H are so slow that they can be considered constant on the local relaxation time qocd[see equation (19)]. Then the bias and covariance matrix tend t o stationary values. The stationary bias is inversely proportional t o the learning parameter q and proportional to the speed v , whereas the variance is proportional to the learning parameter and more or less independent of the speed. So, for nonlinear learning rules we also obtain a misadjustment of the form [28]
with a and p constants that depend on the diffusion D and the curvature H at the fixed point. and smaller than Here the time window T must he larger than the local relaxation time qOcd the time in which at least one of the quantities v, D , or If, changes substantially. Again, the For slow changes v and learning parameters optimal learning parameter is proportional to near the optimal learning parameter all conditions for the validity of the evolution equations (30) are fulfilled [28]. 4.4
An example: Oja learning in a changing environment
A simple example of a nonlinear learning rule is the Oja learning rule [49]
aw
= qw=x[x-(wTx)w]
This learning rule searches for the principal component of the input distribution, i.e., the eigenvector of the covariance matrix C = xx with the largest eigenvalue. The network structure def( '>a ' and input space is pictured in figure 3. We take a network with one output neuron, two input neurons and two weights. The inputs are drawn with equal probability from a two-dimensional box with sides 211 and 212:
The covariance matrix of this input distribution is diagonal:
with A, 'kf1%/3 for CI = 1,2. If we choose Il > 12, then the two fixed point solutions of the differential equation (27) are w*(t) = f ( l , O ) * . So, the fixed point solution is normalized, but is still free to lie along the positive or negative axis. To model learning in a changing environment, the box is rotated around an axis perpendicular to the box, going through the origin, with angular velocity w . The principal component of this time-dependent input distribution obeys
212
Figure 3: Oja learning. A unit is taught with two-dimensional examples from a rectangle which is rotating around the origin. The principal component of the covariance matrix lies parallel to the longest side of the rectangle. For small angular velocities w and small learning parameters 9, we can apply the approximations discussed above t o calculate the squared bias and the variance. We obtain
The sum of these terms yields the misadjustment C. Within this approximation, the minimum of the misadjustment is found for the optimal learning parameter
The "theoretical" misadjustment is compared with results from simulations in figure 4 . Especially in the vicinity of the optimal learning parameter, the approximations seem to work quite well.
5 5.1
Learning-parameter adjustment E s t i m a t i n g the misadjustment
The method described above t o calculate the optimal learning parameter looks simple and elegant and may work fine for the small examples discussed there, but is in practice useless since it requires detailed information about the environment (the diffusion and the curvature at the fixed point) that is usually not available. In this section we will point out how this information can be estimated from the statistics of the network weights and can be used to yield an autonomous algorithm for learning-parameter adaptation (291. Suppose we have estimates for the bias and the variance, Mestimate and C~stimalsr respectively, while learning with learning parameter 9. We know that (in a gradually changing environment) the bias is inversely proportional to the learning parameter, whereas the variance is proporour estimate for the tional to the learning parameter. So, with a new learning parameter 9n7new, misadiustment & is
213 0.4 &
0.3 0.2
0.1
0 0
0.02
0.04
0.06
0.1
0.08 9
Figure 4: Misadjustment as a function of the learning parameter for Oja learning in a changing environment. Squared bias (computed, dashed line; simulated, +), variance (computed, dashdotted line; simulated, x) and error (computed, solid line; simulated, *). Simulations were done with 5000 neural networks. Eigenvalues of the covariance matrix of the input distribution, A1 = 2.0 and A2 = 1.0. Angular velocity, w = 2r/1000. Minimization of this misadjustment with respect to the new learning parameter qnewyields
How do we obtain these estimates for the bias and the variance? First, we set the lefthandside of the evolution equations (30) equal t o zero, i.e., we assume that the bias and the variance are more or less stationary. Then, t o calculate the bias we must have an idea of the curvature H . To estimate it, we can use the asymptotic solution of equation (30) that relates the covariance matrix (the fluctuations in the network state) to the diffusion (the fluctuations in the learning rule). Since we can calculate both the diffusion and the covariance, we might try t o solve the remaining matrix equation t o compute the curvature. This seems t o solve the problem but leads us directly to another one: solving an N x N-matrix equation, where N is the number of weights, is computationally very expensive. Kalman-filtering, when applied t o learning i n neural networks (591, and other second-order methods for learning [5] have similar problems. Here, i t seems even worse since we are only interested in updating one global learning parameter. Therefore, we will not consider all weights, but only a simple (global) function of the weights, e.g., N
w dl' C a ; w , , i=l
with a a random vector that is kept fixed after it is chosen. During the learning process, we keep and (14"). From these averages, we can estimate a new learning tra.ck of (AW), (AlY'), (W), parameter. The last problem concerns the averaging. In theory, the average must be over an ensemble of learning networks. Yet, it seems very unprofitable to learn with say 100 networks if one is just interested in the performance of one of them. Some authors do suggest t o train a.n ensemble of networks for reasons of cross-validation [24], but although it would certainly improve
214
the accuracy of the algorithm, it seems too much effort for simple learning-parameter adaptation. Instead, we estimate the averages by replacing the ensemble averages by time averages over a period T for the network that is trained. The time period T must be large enough to obtain accurate averages, but cannot be much larger than the typical time scale on which the diffusion, the curvature, or the "speed" changes significantly (see the discussion in section 4.3). The final algorithm for learning-parameter adjustment consists of the following steps [29]. 1. Gather statistics from learning with learning parameter 7 during time T , yielding ( W ) T , ( W 2 ) T (, A W T ,and ((Awl')),.
2. Estimate the variance from
where the last term is a correction for the average change of W , and the bias from
which can be obtained directly from the stationary solution of the evolution equations (30) for a one-dimensional system. 3. Calculate the new learning parameter T~~~ from equation (32). 5.2
Updating the learning parameter of a perceptron
As an example, we apply the adjustment algorithm t o a perceptron [54] with two input units, one output unit, two weights (wl and w2), and a threshold ( w o ) . The output of the network reads
with the input vector d %'(Z~,ZZ)~and zo = -1. The learning rule is the so-called delta rule or Widrow-Hoff learning rule [66] Aw; = 17 [Ydesired - Y(W7 d)][I - Y 2 ( W ~ z ) ] . Backpropagation [55] is the generalization of this learning rule for neural networks with hidden units. The desired output Ydesired depends on the class from which a particular input vector is drawn. There are two classes of inputs: "diamonds" corresponding to positive desired outputs l/d&.ed = 0.9 and "crosses" corresponding to negative desired outputs Ydesired = -0.9. w e draw the input vectors d from Gaussian distributions with standard deviation o around the center points F* %'*(Jzsind,JZcosd)T:
In the optimal situation, the weights and the threshold yield a decision boundary going through the origin and perpendicular to the line joining the two center points. In other words, the fixed point solution w* of the differential equation (27) corresponds t o a decision boundary that is described by the line s j n @ t i 2 cosd = 0 .
215
We can model learning in a changing environment by choosing a time-dependent angle q5(t),i.e., by rotating the center points. Figures 5(a)-(c) show snapshots of the perceptron learning in a fixed, a suddenly changing, and a continuously changing environment, respectively. All simulations start with random weights, input standard deviation o = 1, angle d(0) = 7r/4, a constant time window T = 500, and an initial learning parameter 7 = 0.1. After this initialization, the algorithm described in section 5.1 takes care of the recalibration of the learning parameter. In a fixed environment [figure 5(a)], i.e., with a time-independent input probability density p(ydesired,Z), the weights of the network rapidly converge towards their optimal values. So, after a short while the bias is small and the decision boundary wiggles around the best possible separatrix. Then the algorithm decreases the learning parameter to reduce the remaining fluctuations. Theoretical considerations show that in a fixed environment the algorithm tends t o decrease the learning parameter as [29] q(t) cc
1
;
for large t ,
which, according to the conditions (26) in section 3.2, is the fastest possible decay that can still guarantee convergence to the fixed point w*. The second simulation [figure 5(b)] shows the response of the algorithm t o a sudden change in the environment. The first 5000 learning steps are the same as in figure 5(a). But now the center points are suddenly displaced from q5 = s/4 to 4 = -7r/4. This means that at time t = 5000 the decision boundary is completely wrong. The algorithm measures a larger bias, i.e., notices the ”misadjustment” to the new environmental conditions, and raises the learning parameter. Psychologists might call this ”arousal detection” (see e.g. [ZO]). It can he shown that, for this particular adjustment algorithm, the quickness of the response strongly depends on the learning parameter at the time of the change [29]. The lower the learning parameter, the slower the response. Therefore, it seems better t o keep the learning parameter always above some lower hound, say qmin = 0.001, instead of letting it decrease to zero. Figure 5(c) depicts the consequences of the algorithm in a gradually changing environment, the situation from which the algorithm was derived. In this simulation, we rotate the center points with a constant angular velocity w = 2 ~ / 1 0 0 0 .Simple theory, assuming perfect ”noiseless” measurements, tells us that the learning parameter should decrease exponentially towards a constant ”optimal” learning parameter (291. In practice, the fluctuations are too large and the theory cannot be taken very seriously. Nevertheless, the pictures show that the overall performance is quite acceptable.
5.3
Learning of a learning rule
The algorithm described in section 5.1 and tested in section 5.2 is an example of the ”learning of a learning rule” [3]. It shows how one can use the statistics of the weight variables t o estimate a new learning parameter. This new learning parameter is found through minimization of the ”expected misadjustment” [see equation (31)]. The underlying theory is valid for any learning rule of the form Aw = vf(w,C), which makes the algorithm widely applicable. Although originally designed for learning in changing environment, it also works fine in a fixed environment and in case of a sudden environmental change. The qualitative features of the algorithm (turning down the learning parameter if there
216
time: 96W
Figure 5: Learning-parameter adjustment for a perceptron. The last 150 training patterns are shown. Graphs on the right give the learning parameter 7 , the squared bias MezStimate, and the variance ELtimaterall estimated from the statistics of the network weights. (a) A fixed environment: d ( t ) = b(0) = n/4. (b) A sudden change in the environment: d ( t ) changes abruptly from 7r/4 to -7r/4. (c) A continuously changing environment: d ( t ) = */4 27r1/1000.
+
211 is no new information, "arousal detection" in case of a sudden change) seem very natural from a biological and psychological point of view. It is difficult to compare our algorithm with the many heuristic learning-rate adaptation algorithms that have been proposed for specific learning rules in a fixed environment (see e.g. [35] for a specific example or [26, 51 for reviews on learning-rate adaptation for backpropagation). Usually, these algorithms are based on knowledge of the whole error landscape and cannot cope with pattern-by-pattern presentation, let alone with a changing environment. Furthermore, most of these heuristic methods lack a theoretical basis, which does not necessarily affect the performance on the reported examples, but makes i t very difficult to judge their "generalization capability", i.e., their performance on other (types of) problems. The "learning of the learning rule" of Amari [3] is related to our proposal. Amari argues that the weight vector is far from optimal when two successive weight changes are (likely to be) in almost the same direction, whereas the weight vector is nearly optimal when two successive weight changes are (likely to be) in opposite directions. In our notation, this idea would yield an update of the learning parameter of the form (the original idea is slightly more complicated)
with A W ( t ) %' W ( t )- W ( t - 1) and y a small parameter. The "learning of the learning rule" leads to the same kind of behavior as depicted in figures 5(a)-(c): "the rate of convergence automatically increases or the degree of accuracy automatically increases according to whether the weight vector is far from the optimal or nearly optimal" [3]. Amari's algorithm is originally designed with reference to a linear perceptron operating in a fixed environment, but might also work properly for a larger class of learning rules in a changing environment. The more recent "search then converge" learning rate schedules of Darken et al. [ l l ] are asymptotically of the form for large t . v(t) x These schedules are designed for general learning rules operating in a fixed environment and guarantee convergence to a fixed point w*. The parameter c must be chosen carefully, since convergence is much slower for c 5 C* than for c > c', with c' a usually unknown problemdependent key parameter. To judge whether the parameter c is chosen properly, they propose to keep track of the "drift" F (again rewritten in our notation, their notation is slightly different and more elaborate)
where the drift F(t) is computed from the statistics of the weight changes Δw over the last T learning steps before time t. They argue that the drift "F(t) blows up like a power of t when c is too small, but hovers about a constant value otherwise" [11]. This provides a signal for ensuring that c is large enough. Although not directly applicable to learning in a changing environment, it is another example of the idea of using the statistics of the weights for adaptation of the learning parameter. This general idea definitely deserves further attention and has great potential for practical applications.
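To make the flavour of such statistics-based adaptation concrete, the following minimal sketch adapts the learning parameter of a simple LMS rule from the inner product of successive weight changes, in the spirit of Amari's idea; the LMS rule, the specific update of η, and all constants are illustrative assumptions rather than the algorithm of section 5.1 or Amari's exact prescription.

```python
import numpy as np

# Sketch: LMS learning with a learning parameter adapted from the statistics
# of successive weight changes (illustrative, not the authors' algorithm).
rng = np.random.default_rng(0)

def lms_with_adaptive_eta(steps=2000, dim=2, eta=0.1, gamma=0.01,
                          eta_min=1e-4, eta_max=1.0):
    w_target = np.array([1.0, -0.5])          # hypothetical optimal weights
    w = np.zeros(dim)
    prev_dw = np.zeros(dim)
    for _ in range(steps):
        x = rng.normal(size=dim)                          # input pattern
        y = w_target @ x + 0.1 * rng.normal()             # noisy desired output
        dw = eta * (y - w @ x) * x                        # LMS weight change
        w += dw
        # Far from the optimum, successive changes point the same way and eta
        # grows; near the optimum they alternate in sign and eta shrinks.
        eta = float(np.clip(eta + gamma * (prev_dw @ dw), eta_min, eta_max))
        prev_dw = dw
    return w, eta

w, eta = lms_with_adaptive_eta()
print("final weights:", w, "final learning parameter:", eta)
```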
6 Transition times between local minima
6.1 Context and state of the art

In the preceding sections, we have only discussed learning in the vicinity of one fixed point solution of the average learning dynamics. Learning rules with only one fixed point form a very limited class. Nowadays popular learning rules, such as backpropagation [55] and Kohonen learning [37], can have many fixed points. Some of these fixed points appear to be better than others. A well-defined measure of how good a particular network state w is, is the error potential E(w). Often, one starts by defining an error potential, such as the (average) squared distance between the network's output and the desired output for backpropagation, and derives a learning rule from this error by calculating the gradient with respect to the network state w as in equation (17). With batch-mode learning, the network gets stuck in a minimum; in which minimum depends only on the initial network state. Many authors (see e.g. [8, 13, 24, 44]) share the feeling that random pattern presentation, i.e., on-line instead of batch-mode learning, introduces noise that helps to escape from "bad" local minima and favors lower lying minima. In this section, we will try to point out a theory that refines and quantifies these statements. We will restrict ourselves to learning rules for which equation (17) holds. Generalization to learning rules that cannot be derived from a global error potential is straightforward, except that there is no obvious, unbiased global measure of how good a network state is.
The results of section 2 give a purely local description of the stochastic process, i.e., the analysis yields unimodal distributions. This is a direct consequence of the "small fluctuations Ansatz" (10). For an error potential with multiple minima, we obtain an approximate description around each minimum, but not a global description of a multimodal distribution. Standard theory on stochastic processes [12, 14, 63] cannot provide us with a general expansion method for unstable systems, i.e., stochastic systems with multiple fixed points. As we noted in section 2.2, the Fokker-Planck approximation, although often applied, does not offer an alternative since its validity is also restricted to the so-called attraction regions with positive curvature. Leen and Orr [44], for example, report simulations in which the Fokker-Planck approach breaks down even for extremely low learning parameters. Our approach [32] is based on two hypotheses which are supported by experimental and theoretical arguments. These hypotheses enable us to calculate asymptotic expressions for the transition times between different minima.
6.2 The hypotheses
Again, we start with the master equation (4) in a fixed environment. In section 2 we showed that in the attraction regions, where the Hessian H(w) is positive definite, Van Kampen's system size expansion can be applied for small learning parameters η. Each attraction region contains exactly one minimum of the error E(w). We say that minimum α lies inside attraction region A_α. T_{αβ} stands for the transition region connecting attraction regions α and β. In the transition regions the Hessian has one negative eigenvalue. We can expand the probability density P(w,t):

P(w,t) = \sum_{\alpha} P_{\alpha}(w,t) + \sum_{\alpha\beta} P_{\alpha\beta}(w,t) ,
where P_α(w,t) is equal to P(w,t) inside attraction region A_α and zero outside, with similar definitions for P_{αβ}(w,t) in the transition regions⁵. For proper normalization, we define the occupation numbers

n_\alpha(t) \equiv \int_{A_\alpha} d^N w \, P(w,t) ,

i.e., the occupation number n_α(t) is the probability mass in attraction region A_α. From the master equation (4), we would now like to extract the evolution of these occupation numbers n_α(t).
⁵We neglect the probability mass outside the attraction and transition regions since it is negligible compared with the probability mass inside these regions and has no effect on our calculation of transition times anyway.
Figure 6: Histogram found by simulation of 10000 one-dimensional neural networks learning on an error potential with a local and a global minimum. (a) t = 1: initial distribution. (b) t = 10³: two peaks. (c) t = 10⁶: stationary distribution.

Figure 6 shows the histogram of 10000 independently learning one-dimensional networks at three different times (see [32] for details). We use this simple example to give an idea of the evolution of the master equation in the presence of multiple minima and to point at a few characteristic properties of unstable stochastic systems (see [63]). The learning networks perform stochastic gradient descent on a one-dimensional error potential with a local minimum at w = −1 and a global minimum at w ≈ 1. The weights are initialized with equal probability between −1 and 1 (figure 6(a): t = 1). On a time scale of order 1/η, the local relaxation time of equation (19), P(w,t) evolves to a distribution with peaks at the two minima (figure 6(b): t = 10³). The probability mass in the transition region is much smaller than the probability mass in the attraction regions: transitions between the minima are very rare. The global relaxation time to the equilibrium distribution (figure 6(c): t = 10⁶) is much larger than the local relaxation time.
Our first hypothesis is well-known in the theory of unstable stochastic processes [63]. It says that the rare transitions may affect the probability mass, but not the shape of the distribution in the attraction regions. In other words, we assume that after the local relaxation time, we are allowed to "decouple time and space" in the attraction regions:
P_\alpha(w,t) = n_\alpha(t)\, p_\alpha(w) .

This assumption seems to be valid when the attraction regions are well separated and when the transitions between them are rare. Substitution of this assumption into the master equation yields an evolution equation for the occupation numbers n_α(t).
The first term in this equation corresponds to probability mass leaving attraction region A_α, the second term to probability mass entering A_α. Let us concentrate on the first term alone and neglect the second term. This corresponds to a simulation in which all networks that leave the attraction region A_α are taken out. The term between brackets is the probability per unit time to go from attraction region A_α to transition region T_{αβ}. The inverse of this term is called the transition time τ(A_α → T_{αβ}) from attraction region A_α to transition region T_{αβ} [equation (34)].
Below we will sketch how to calculate this transition time for small learning parameters η. We will show that it is of the form

\tau(A_\alpha \to T_{\alpha\beta}) \sim \exp\left[\frac{\tilde{\eta}_{\alpha\beta}}{\eta}\right] \quad \text{for small } \eta ,
with η̃_{αβ}, the so-called reference learning parameter, a constant independent of the learning parameter η. If the learning parameter is chosen much smaller than the reference learning parameter, the probability to go from the attraction to the transition region within a finite number of learning steps is negligible. Furthermore, the reference learning parameters play an important role in the derivation of cooling schedules that guarantee convergence to the global minimum (see section 8).
So, we can compute how the transition time τ(A_α → T_{αβ}) from the attraction region to the transition region scales as a function of the learning parameter η. But we are more interested in the transition time τ(A_α → A_β) from attraction region A_α to attraction region A_β, i.e., the average time it takes to get over transition region T_{αβ}. What happens in this transition region? In the transition regions the small fluctuations expansion of section 2.3 is not valid. If we still try to apply it, we notice that (in this approximation scheme) the fluctuations tend to explode [see equation (13)]. On the other hand, in the attraction regions the (asymptotic) fluctuations are proportional to the learning parameter. The idea is now that, for small learning parameters η, the transition time from attraction region A_α to A_β is dominated by the transition time from A_α to transition region T_{αβ}. More specifically, our second hypothesis states that
\lim_{\eta \to 0} -\eta \ln \tau(A_\alpha \to A_\beta) \approx \lim_{\eta \to 0} -\eta \ln \tau(A_\alpha \to T_{\alpha\beta}) = \tilde{\eta}_{\alpha\beta} ,

i.e., that the reference learning parameter for the total transition from one attraction region to another can be estimated by calculating the reference learning parameter for the transition from the attraction region to the transition region.
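To make the picture of section 6.2 concrete, the following sketch repeats a Figure-6-style experiment: an ensemble of one-dimensional networks performs noisy gradient descent on a tilted double-well error potential, and the occupation numbers of the two attraction regions are then estimated from the final weights. The potential, the noise model, and all constants are illustrative assumptions, not those of the original simulations.

```python
import numpy as np

# Double-well error potential E(w) = (w**2 - 1)**2 / 4 - 0.1 * w:
# local minimum near w = -1, global minimum near w = +1.
def grad_E(w):
    return w * (w**2 - 1.0) - 0.1

rng = np.random.default_rng(1)

def simulate(n_nets=10000, eta=0.1, steps=5000):
    w = rng.uniform(-1.0, 1.0, size=n_nets)      # initial weights
    for _ in range(steps):
        noise = rng.normal(size=n_nets)           # stands in for pattern noise
        w -= eta * (grad_E(w) + noise)             # on-line (stochastic) step
    return w

w = simulate()
# Occupation numbers: probability mass in each attraction region, with w = 0
# used as a rough stand-in for the boundary/transition region.
print("n_local  =", np.mean(w < 0.0))
print("n_global =", np.mean(w > 0.0))
```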
6.3 Calculation of the reference learning parameter
In this section we will sketch how to calculate the reference learning parameter η̃_{αβ} for the transition from attraction region A_α to transition region T_{αβ}. We recall from section 2.4 that the local probability distribution p_α(w) can be approximated by a Gaussian with its average at the minimum w*_α and variance Σ²_α = ηK_α, with K_α obeying

H_\alpha K_\alpha + K_\alpha H_\alpha = D_\alpha ,     (35)
where the Hessian H_α ≡ H(w*_α) and the diffusion matrix D_α ≡ D(w*_α) are both evaluated at the minimum w*_α. In equation (34), we have to integrate over all w and w′ such that

w \in A_\alpha \quad \text{and} \quad w' = w + \eta f(w, x) \in T_{\alpha\beta} .
So⁶, both w and w′ are within order η of the boundary B_{βα} between attraction region A_α and transition region T_{αβ}. Now it is easy to prove [32] that, for small learning parameters η, the integral in (34) converges to an integral over the boundary B_{βα} times some term of order η. This latter term disappears if we take the logarithm, multiply with η, and take the limit η → 0. Finally, in the limit η → 0, the only remaining term is an integral over the boundary B_{βα}.
The integral can be approximated using the method of steepest descent. The largest contribution is found where the term between brackets is maximal on the boundary B_{βα}. So, the largest contribution comes from the "easiest" path from the local minimum w*_α to the transition region T_{αβ}. The matrix K_α⁻¹ defines the local "metric". The final result is equation (36). Roughly speaking, the reference learning parameter is proportional to the height of the error barrier and inversely proportional to the local fluctuations. The result is similar to the classical Arrhenius factor for unstable stochastic (chemical) processes [63]. In the next section we will apply this formula to calculate the reference learning parameter for the transition from a twist ("butterfly") to a perfectly ordered configuration in a self-organizing map.
7 Unfolding twists in a self-organizing map

7.1 Twists are local minima of an error potential

The Kohonen learning rule [37, 38] tries to capture important features of self-organizing processes. It not only has applications in robotics, data segmentation, and classification tasks, but may also help to understand the formation of sensory maps in the brain. In these maps, the external information is represented in a topology-preserving manner, i.e., neighboring units code similar input signals. Properties of the Kohonen learning procedure have been studied in great detail [10, 52]. Most of these studies focussed on the convergence properties of the learning rule, i.e., asymptotic properties of the learning network in a perfectly ordered configuration. In this context, Ritter and Schulten [51, 53] were the first to use the master equation for a description of on-line learning processes.
⁶For simplicity, we will only consider the case in which the learning rule is bounded, i.e., for which there exists an M < ∞ such that |f(w, x)| < M, for all w and all x ∈ Ω.
It is well-known that not only perfectly ordered configurations, but also topological defects, like kinks in one-dimensional maps or twists in two-dimensional maps, can be fixed point solutions of the learning dynamics [16]. With a slight change, the Kohonen learning rule can be written as the gradient of a global error potential [30]. Then the topological defects correspond to local minima of this error potential, whereas global minima are perfectly ordered configurations. The unfolding of a twist in a two-dimensional map is now simply a transition from a local minimum to a global minimum. Using the theory developed in section 6, we will calculate the reference learning parameters for these transitions and compare them with straightforward simulations of the learning rule.
As an example, we consider a network of 4 units. Each unit has a two-dimensional weight vector, so the total eight-dimensional network state vector is written w = (w_1, ..., w_4)ᵀ = (w_{11}, w_{12}, w_{21}, ..., w_{42})ᵀ. Each learning iteration consists of the following steps.
1. An input x = (x_1, x_2)ᵀ is drawn with equal probability from a square.
2. The "winning unit" is the unit with the smallest local error

e_i(w, x) = \frac{1}{2} \sum_{j} h_{ij}\, \|w_j - x\|^2 .
Here h is called the lateral-interaction matrix. The closer two units i and j are in the "hardware" network configuration, the stronger the lateral interaction h_{ij}. We choose h to be a fixed function of the so-called lateral-interaction strength σ, normalized by a factor 1/(1+σ)², with 0 ≤ σ < 1; σ = 0 means no lateral interaction. Which unit "wins" depends on the network state w and on the particular input vector x. We will denote the winning unit by κ(w, x) or just κ.
3. The weights are updated with

\Delta w_i = \eta\, h_{i\kappa}\, (x - w_i) .     (37)

So, in principle all weights are moved towards the input vector. To what extent depends on the lateral interaction between the particular unit and the winning unit.
Equation (37) is exactly the Kohonen learning rule. The difference is step 2: the determination of the winning unit. In Kohonen's procedure the winner is the unit with the smallest Euclidean distance to the input vector. We propose to determine the winning unit on account of the local error e_i(w, x), the same error that is differentiated to yield the learning rule (37). Then, and only then, it can be shown [27, 30] that this learning procedure performs (stochastic) gradient descent on the global error potential⁷.
⁷The gradient of E(w) consists of two parts: the differentiation of the local error and the differentiation of the "winner-take-all mechanism". This latter term, which is the most difficult one, exactly cancels if and only if the "winner" is determined on account of the local errors e_i(w, x) [30].
Figure 7: Configurations in a two-dimensional map. (a) Rectangle. (b) Twist.

For σ = 0 the local error e_i(w, x) is just the Euclidean distance between the weight w_i and the input x, which makes both learning procedures totally equivalent. Careful analysis shows that, for 0 < σ < σ* = 0.240, the error potential has 4! = 24 different possible minima: 8 global minima and 16 local minima. To visualize these network states, we draw lines between the positions of the (two-dimensional) weight vectors of neighboring units, i.e., between 1-2, 2-3, 3-4, and 4-1. As can be seen in figure 7(a), the global minima correspond to perfectly ordered configurations. They are called "rectangles". The "twist" or "butterfly" in figure 7(b) is an example of a topological defect: a local minimum. For σ = 0, i.e., no interaction, all minima are equally deep. At σ = σ* the local minima, representing twists, disappear and only global minima, representing rectangles, remain.
7.2 Theory versus simulations
We will calculate the reference learning parameter η̃ for the transition from the local to the global minimum, i.e., from a twist to a rectangle, for different values of σ. This reference learning parameter tells us how the average time needed to unfold a twist scales as a function of the learning parameter η. We go through the following steps.
1. Choose the lateral-interaction strength σ.
2. Determine the position of the local minimum w*, i.e., the exact network weights of the twist in figure 7(b).
3. Calculate the Hessian H and the diffusion matrix D at this minimum from equations (16) and (15), respectively.
4. Solve equation (35) to find the covariance matrix K and its inverse K⁻¹.
5. On the boundary between the attraction and the transition region, the determinant of the Hessian of the error potential E(w) is exactly zero. Find the point w on this boundary with the smallest distance (w − w*)ᵀ K⁻¹ (w − w*).
6. Compute the reference learning parameter η̃(σ) from equation (36).
The fifth step, optimization under the awkward constraint that the determinant of the Hessian matrix must be zero, is by far the most difficult one. For larger problems, i.e., a higher-dimensional weight space, this may become too difficult.
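The sketch below illustrates steps 3-6 of this recipe for a toy two-dimensional error potential rather than the self-organizing map itself: it solves the Lyapunov-type equation (35) with SciPy and then performs the constrained minimization of step 5. The potential, diffusion matrix, and starting point are made-up placeholders, and the conversion of the resulting distance into η̃ via equation (36) is not reproduced here.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov
from scipy.optimize import minimize

# Toy potential E(w) = (w0**2 - 1)**2 / 4 + w1**2 / 2 and its Hessian.
def E_hess(w):
    return np.array([[3.0 * w[0]**2 - 1.0, 0.0],
                     [0.0, 1.0]])

w_star = np.array([-1.0, 0.0])           # step 2: a minimum of the toy potential
H = E_hess(w_star)                        # step 3: Hessian at the minimum
D = np.array([[0.5, 0.1],                 # step 3: made-up diffusion matrix
              [0.1, 0.3]])

K = solve_continuous_lyapunov(H, D)       # step 4: solve H K + K H = D, eq. (35)
K_inv = np.linalg.inv(K)

# Step 5: minimize (w - w*)^T K^{-1} (w - w*) on the boundary det(Hessian) = 0.
objective = lambda w: (w - w_star) @ K_inv @ (w - w_star)
boundary = {"type": "eq", "fun": lambda w: np.linalg.det(E_hess(w))}
res = minimize(objective, w_star + np.array([0.2, 0.0]),
               method="SLSQP", constraints=[boundary])
print("closest boundary point:", res.x, " distance:", res.fun)
# Step 6 (not reproduced): the reference learning parameter follows from this
# minimal distance via equation (36).
```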
Figure 8: The reference learning parameter η̃ as a function of the lateral-interaction strength σ. Solid lines show the theoretically obtained results. Simulation results are indicated by an "*".

The solid line in figure 8 gives the "theoretically" obtained reference learning parameter as a function of the strength σ. Straightforward simulations of the learning procedure are used for comparison. For each choice of the interaction strength σ, we train 500 independently operating networks for 4 different learning parameters. For each learning parameter, we determine the transition time τ(η). The reference learning parameter η̃ follows from the best possible fit of the form ln τ(η) = η̃ η⁻¹ + d ln η⁻¹ + c. The reference learning parameters η̃(σ) obtained in this way are indicated by an asterisk in figure 8. The theoretically obtained reference learning parameters are somewhat smaller than the ones obtained from straightforward simulations. This might be due to the neglect of the transition region.
In [27] we try to calculate the transition times for the transition from a "kink", a topological defect in a one-dimensional map, to a "line", a perfectly ordered configuration. Again, this is a transition from a local minimum, the kink, to a global minimum, the line. For small σ, when this transition becomes very improbable (for σ = 0 the dynamics of the learning rule is such that a kink cannot be removed), the reference learning parameters predicted by theory no longer agree with the results obtained from simulations. A possible explanation is the violation of the first assumption explained in section 6.2: in the limit σ → 0 the transition region, which normally acts as a buffer between the two attraction regions, vanishes, and the assumption that transitions only affect the masses and not the shapes of the probability distributions in the attraction regions is no longer valid. Further study is necessary to solve this problem.
In all this, we must not forget that, if we really want to calculate the reference learning parameter, detailed knowledge about the environment and the network structure is needed. The same notion came up in section 4, where we tried to calculate the optimal learning parameter for learning in a changing environment. To a certain extent we could solve this problem in section 5 by considering the statistics of the weights. Here it is much more difficult, since we need to extract global information from the network dynamics. A solution might be a pre-learning phase, similar to the ones proposed for simulated annealing processes (see e.g. [1]).
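For completeness, a minimal sketch of the fit quoted above: given transition times measured at a few learning parameters, the reference learning parameter follows from a linear least-squares fit of ln τ(η) = η̃ η⁻¹ + d ln η⁻¹ + c. The numbers below are made up for illustration.

```python
import numpy as np

etas = np.array([0.02, 0.04, 0.06, 0.08])        # learning parameters used
taus = np.array([2.1e5, 7.3e3, 1.5e3, 6.0e2])     # hypothetical measured times

# Design matrix for ln tau = eta_ref * (1/eta) + d * ln(1/eta) + c.
A = np.column_stack([1.0 / etas, np.log(1.0 / etas), np.ones_like(etas)])
(eta_ref, d, c), *_ = np.linalg.lstsq(A, np.log(taus), rcond=None)
print("estimated reference learning parameter:", round(float(eta_ref), 4))
```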
8 On-line learning and global optimization

8.1 The analogy with simulated annealing

On-line learning is a stochastic process. The "intrinsic noise" due to the random pattern presentation enables transitions between different minima. The larger the learning parameter, the greater this noise, so the easier the transitions. We might compare this with simulated annealing [4, 36] or Langevin equations [17, 21] (see also section 3.1). In the simulated annealing approach a candidate w′ is picked at random according to some "generating probability function". The error E(w′) of the candidate w′ is compared with the error E(w) of the current state w. Downhill steps are always accepted; uphill steps are accepted with a probability proportional to

\exp\left[-\frac{E(w') - E(w)}{T}\right] .
The noise parameter T is called the temperature. Using this dynamics, it can be shown that after sufficient time, the probability distribution P(w,t) resembles a Gibbs distribution. With a proper cooling schedule, i.e., a smart choice for the temperature as a function of time, convergence to the global optimum can be guaranteed for these processes.
Can we draw an analogy to on-line learning in neural networks, even though simulated annealing is really different from the learning procedure (1)? Or, more specifically, how should we choose our noise parameter, the learning parameter, to get the fastest possible convergence to the global minimum? Starting from the transition times derived in section 6, we will try to answer these questions.
8.2 Derivation of a cooling schedule

For simplicity, we will first consider a two-level system with one global minimum E₁ ≡ E(w*₁) and one local minimum E₂ ≡ E(w*₂). The average error potential Ē(t) is defined

\bar{E}(t) = n_1(t)\, E_1 + n_2(t)\, E_2 ,     (38)
where we use n₁(t) + n₂(t) = 1, i.e., we neglect the probability mass in the transition region. This is correct for times t much larger than the local relaxation time of order 1/η (see the discussion in section 6.2). Then the probability distribution P(w,t) is strongly peaked in the vicinity of the minima of the error potential. The variance of these local probability distributions is of order η, and thus the average error potential of the networks in the vicinity of a particular minimum w*,

\langle E(w)\rangle \approx \frac{1}{2}\left\langle (w - w^*)^T H (w - w^*)\right\rangle \approx \frac{1}{2}\,\mathrm{Tr}\left[H\,\Sigma^2(\infty)\right] \approx \frac{\eta}{4}\,\mathrm{Tr}\, D ,
is also of order η. For the moment, we will neglect this term. It will only play a significant role when either n₁(t) or n₂(t) becomes of order η. The occupation number n₂(t) obeys the differential equation

\frac{dn_2(t)}{dt} = -\frac{n_2(t)}{\tau_{12}} + \frac{n_1(t)}{\tau_{21}} ,     (39)

with transition time τ₁₂ for the transition from attraction region A₂ to A₁ of the form (see section 6)

\tau_{12} \sim \exp\left[\frac{\tilde{\eta}_{12}}{\eta}\right] \quad \text{for small } \eta ,     (40)
and similarly for τ₂₁. From (38) and (39), we can derive a differential equation for the average error Ē(t):

\frac{d\bar{E}(t)}{dt} = -\left[\frac{\bar{E}(t) - E_1}{\tau_{12}} - \frac{E_2 - \bar{E}(t)}{\tau_{21}}\right] .
We would now like to choose the learning parameter η as a function of time t such that the average error potential Ē(t) decreases as fast as possible, i.e., to choose η(t) such that the term between brackets is as large as possible [58]. This defines a relationship between Ē(t) and η(t), which can be used to transform the differential equation for Ē(t) into a differential equation for the time trajectory of the optimal η(t). Using the form (40), we obtain [for small learning parameters η(t)] a differential equation for η(t).
Now, suppose that the transition from the local to the global minimum is "easier" than vice versa, i.e., has a shorter transition time and thus a smaller reference learning parameter⁸. Then we can neglect the second term between brackets compared with the first term. For large t, the lowest order solution of the remaining differential equation yields [31]

\eta(t) = \frac{\tilde{\eta}_{12}}{\ln t} .     (41)
This constitutes our final optimal cooling schedule. It only depends on the reference learning parameter η̃₁₂ for the transition from the local to the global minimum. In a sense, the derived cooling schedule is indeed optimal. A "faster" cooling schedule, e.g. η(t) = η̃₁₂/(5 ln t), cannot guarantee that a network starting at the local minimum will indeed reach the global minimum. We could say that the transition from the local to the global minimum is "closed". The optimal cooling schedule keeps this transition just "open". A "slower" cooling schedule, e.g. η(t) = 5η̃₁₂/ln t, also gives an open transition, but convergence will take longer than with the optimal cooling schedule. By looking at the transition times we can easily check whether a particular transition is open or closed. If the transition time grows at most linearly with time t the transition is open; if it grows faster than linearly with time t the transition is closed. For the optimal cooling schedule (41) the transition time τ₁₂ from the local to the global minimum grows linearly with time t.
Generalization to more minima is tedious. Nevertheless, the final cooling schedule is of the same form [31]:

\eta(t) = \frac{\eta^*}{\ln t} \quad \text{for large } t .
The optimal η* depends on the reference learning parameters between the various minima. It is bounded by [31]

\tilde{\eta}_{\min} \;\leq\; \eta^* \;\leq\; \tilde{\eta}_{\min} + (M-1)\,(\tilde{\eta}_{\max} - \tilde{\eta}_{\min}) ,

with η̃_min and η̃_max the smallest and the largest finite reference learning parameter, respectively.
⁸As we will argue in section 8.3, a transition from a higher minimum to a lower minimum is in almost all cases easier than vice versa. If the reverse is true, then the local minimum is the "most attractive" minimum and, by replacing η̃₁₂ with η̃₂₁ in what follows, we can only guarantee convergence to this minimum.
Figure 9: (a) Network structure. (b) XOR problem.
This kind of "exponentially slow" cooling schedule is common ground in the theory of stochastic processes for global optimization [17, 36]⁹. In cooling schedules for simulated annealing the optimal η* is called "the critical depth" [8]. It is the depth (suitably defined) of the deepest local minimum which is not a global minimum state [22]. In this context, the approach taken in [62] is most similar to ours: the critical depth is computed from the structure of a Markov chain, i.e., from transition probabilities between different states. Neither we, nor other authors, claim that it is easy to calculate the optimal parameter η* for practical optimization problems. We only try to give an intuitive feeling for the factors that determine this parameter.
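The open/closed criterion can be checked directly from the scaling τ(η) ~ exp(η̃/η): along a schedule η(t) = η*/ln t the transition time grows like t^(η̃/η*), so it stays at most linear in t exactly when η* ≥ η̃. The short sketch below, with an arbitrary illustrative value of η̃, makes this explicit.

```python
import numpy as np

eta_ref = 0.2                                   # illustrative reference value

t = np.logspace(2, 6, 5)
for eta_star in (0.1, 0.2, 0.4):                # closed, marginal, open
    eta_t = eta_star / np.log(t)                 # cooling schedule eta(t)
    tau_t = np.exp(eta_ref / eta_t)              # transition time along the schedule
    print(f"eta* = {eta_star}:  tau/t =", np.round(tau_t / t, 3))
```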
8.3 Global optimization and on-line backpropagation

In this last section we will discuss an example of on-line backpropagation with profound local minima. The structure of the network is depicted in figure 9(a). There are 6 synapses and 3 thresholds, so N = 9 adaptive elements. These elements are combined in the weight vector w = (w₁₀, w₁₁, w₁₂, w₂₀, w₂₁, w₂₂, w₃₀, w₃₁, w₃₂)ᵀ. The network has 2 variable inputs: x₁ and x₂. Thresholds are incorporated by defining x₀ = y₀ = −1. The outputs y₁ and y₂ of the hidden units, and the output y₃ of the network, are given by the usual sigmoidal transfer functions of their weighted inputs. The goal of the backpropagation learning rule is to minimize the quadratic error potential [55]

E_0(w) = \frac{1}{2} \sum_{\mu=1}^{p} \left[ y_3(x_1^\mu, x_2^\mu; w) - x_3^\mu \right]^2 ,     (42)
⁹There is a method called fast simulated annealing [60, 61] based on a cooling schedule that decreases with 1/t. The difference is the use of a Cauchy distribution (with an infinite variance!) instead of a Gaussian distribution (with a finite variance, which is more similar to on-line learning processes) for the generation of new states.
where the sum is over all p training patterns, indicated by three-dimensional vectors x^μ = (x₁^μ, x₂^μ, x₃^μ)ᵀ. The components x₁^μ and x₂^μ give the input values of the network for pattern μ, the component x₃^μ the desired output value. We will use desired output values of ±0.8 instead of ±1 to prevent divergence of the weights. Rather than minimizing the error (42), it is often convenient to minimize an error of the form
E(w) = E_0(w) + \lambda E_1(w) ,     (43)

with E₁(w) an extra term, the so-called bias [25] (not to be confused with the bias m of sections 2, 4, and 5). We will use the bias (44) with α = 0.1 and λ = 0.01. Incorporation of this bias has a few advantages, among which are prevention of local minima with infinite weights and reduction of training times [39].
Following [18], we choose the set of p = 5 training patterns sketched in figure 9(b). Circles indicate negative output, crosses positive output. This is just the usual XOR truth table with one additional pattern at the origin. Because of this additional pattern, the error potential (43) has not only global minima, but also profound local minima¹⁰. The thick lines in figure 9(b) show the separation lines of the hidden units that lead to the optimal solution. At the global minima all five training patterns are correctly classified. The thin lines give the separation lines corresponding to the local minima. At the local minima only four patterns are correctly classified. For symmetry reasons there are 8 local and 8 global minima.
We will compare the optimization capabilities of the following two learning procedures.
AW = -7 V [Eo(w,Z”)t XEi(w)] . This, of course, is an on-line learning process of the type discussed in this chapter 2. Artificial noise is added t o the gradieut of the total error potential, averaged over all training patterns:
AW = -V [Eo(w)t XEi(w)] At t
m€m ,
with ( noise of variance 1. This is called ”Langevin-type learning”; it is a discietized version of the Langevin equation (see section 3.1). We will choose At = 1. For both learning procedures we take an ensemble of 1000 independently operating neural networks, all starting a t a local minimum. We train this ensemble for a few different values of 7 and T . From the dynamics of the occupation numbers at the local and global minima, we ) r ( T ) . Besides this, we collect t h e average error potential measure the transition times ~ ( 7and at the stationary situation, so, for very long learning times. These are denoted E ( 7 )and E ( T ) . The average error E can be viewed as a measure of the asymptotic performance of the learning procedure, the transition time T as the typical time t o reach it. As can be seen from figure 10, where the asymptotic performance E is plotted as a function of the transition time T , on-line
Figure 10: Asymptotic performance Ē versus transition time τ for on-line learning (+) and Langevin-type learning (×). The lines serve to guide the eye.
As can be seen from figure 10, where the asymptotic performance Ē is plotted as a function of the transition time τ, on-line learning is highly preferable to Langevin-type learning: the same transition time yields a much better asymptotic performance for on-line learning than for Langevin-type learning.
The inhomogeneous intrinsic noise due to the random pattern presentation explains the better performance of on-line learning processes. For Langevin-type learning, the noise is homogeneous, i.e., the same at each minimum, whereas for on-line learning the noise is related to the diffusion D, the fluctuations in the learning rule, which is a function of the weights. Usually we will have that the higher the error potential, the more there is to learn, the larger the fluctuations in the learning rule, the higher the noise level, and the easier it is to escape. Roughly speaking, the reference learning parameter for a transition from minimum α to β is proportional to the height of the barrier between α and β and inversely proportional to the local fluctuations at α. In the backpropagation example of this section, the reference learning parameter for the transition from the global to the local minimum is much larger than the reference learning parameter for the reverse transition, whereas the "reference temperature" for both transitions is of the same order of magnitude. This explains the form of figure 10. Generalization of these arguments suggests that the inhomogeneous noise coming from the random presentation of patterns in on-line learning processes helps to find the global minimum.
The comparison made above is just a simplistic and specific example, but it gives a nice idea of the usefulness of on-line learning compared with other optimization techniques.
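For reference, the two update rules compared above can be written down in a few lines. The sketch below uses a trivial one-dimensional quadratic per-pattern error instead of the XOR network, so it only illustrates the form of the updates and the different origin of the noise (random pattern presentation versus injected Gaussian noise), not the escape behaviour of figure 10.

```python
import numpy as np

rng = np.random.default_rng(0)
patterns = np.array([-1.0, 0.0, 1.0, 1.0])      # made-up training data
eta, dt, T = 0.05, 1.0, 0.05                     # learning parameter, time step, temperature

def online_step(w):
    x = patterns[rng.integers(len(patterns))]    # intrinsic noise: pattern sampling
    return w - eta * (w - x)                     # gradient of E0(w, x) = (w - x)**2 / 2

def langevin_step(w):
    grad = w - patterns.mean()                   # gradient of the pattern-averaged error
    return w - grad * dt + np.sqrt(2.0 * T * dt) * rng.normal()   # injected noise

w_on = w_lan = 0.5
for _ in range(1000):
    w_on, w_lan = online_step(w_on), langevin_step(w_lan)
print("on-line:", round(w_on, 3), " Langevin:", round(w_lan, 3))
```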
Acknowledgment
This work was partly supported by the Dutch Foundation for Neural Networks.
¹⁰At the local minima of the original XOR problem, carefully analysed in [46], at least one of the weights is either infinite or zero. After incorporation of the bias (44), we did not encounter any of these "stupid" minima.

References
[1] E. Aarts and P. van Laarhoven. A pedestrian review on the theory and applications of the
simulated annealing algorithm. In J. van Hemmen and I. Morgenstern, editors, Heidelberg colloquium on glassy dynamics, pages 288-307, Berlin, 1987. Springer-Verlag.
[2] F. Aluffi-Pentini, V. Parisi, and F. Zirilli. Global optimization and stochastic differential equations. Journal of Optimization Theory and Applications, 47:1-16, 1985.
[3] S. Amari. A theory of adaptive pattern classifiers. IEEE Transactions on Electronic Computers, 16:299-307, 1967.
[4] S. Amato, B. Apolloni, G. Caporali, U. Madesani, and A. Zanaboni. Simulated annealing approach in backpropagation. Neurocomputing, 3:207-220, 1991.
[5] R. Battiti. First- and second-order methods for learning: between steepest descent and Newton's method. Neural Computation, 4:141-166, 1992.
[6] S. Becker. Unsupervised learning procedures for neural networks. International Journal of Neural Systems, 2:17-33, 1991.
[7] D. Bedeaux, K. Lakatos-Lindenberg, and K. Shuler. On the relation between master equations and random walks and their solutions. Journal of Mathematical Physics, 12:2116-2123, 1971.
[8] O. Catoni. Rough large deviation estimates for simulated annealing: application to exponential schedules. The Annals of Probability, 20:1109-1146, 1992.
[9] D. Clark and K. Ravishankar. A convergence theorem for Grossberg learning. Neural Networks, 3:87-92, 1990.
[10] M. Cottrell and J. Fort. A stochastic model of retinotopy: a self-organizing process. Biological Cybernetics, 53:405-411, 1986.
[11] C. Darken, J. Chang, and J. Moody. Learning rate schedules for faster stochastic gradient search. In Neural Networks for Signal Processing 2, New York, 1992. IEEE.
[12] J. Doob. Stochastic processes. Wiley, New York, 1953.
[13] W. Finnoff. Diffusion approximations for the constant learning rate backpropagation algorithm and resistance to local minima. Preprint Siemens, FRG, 1992.
[14] C. Gardiner. Handbook of stochastic methods. Springer, Berlin, second edition, 1985.
[15] S. Geman and C. Hwang. Diffusions for global optimization. SIAM Journal on Control and Optimization, 24:1031-1043, 1986.
[16] T. Geszti. Physical models of neural networks. World Scientific, Singapore, 1990.
[17] B. Gidas. The Langevin equation as a global minimization algorithm. In E. Bienenstock, F. Fogelman Soulié, and G. Weisbuch, editors, Disordered systems and biological organization, pages 321-326, Berlin, 1986. Springer-Verlag.
[18] M. Gori and A. Tesi. On the problem of local minima in backpropagation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14:76-86, 1992.
[19] S. Grossberg. On learning and energy-entropy dependence in recurrent and nonrecurrent signed networks. Journal of Statistical Physics, 48:105-132, 1969.
[20] S. Grossberg. The adaptive brain I. North-Holland, Amsterdam, 1986.
[21] T. Guillerm and N. Cotter. A diffusion process for global optimization in neural networks. In International Joint Conference on Neural Networks, volume 1, pages 335-340, New York, 1991. IEEE.
[22] B. Hajek. Cooling schedules for optimal annealing. Mathematics of Operations Research, 13:311-329, 1988.
[23] L. Hansen, R. Pathria, and P. Salamon. Stochastic dynamics of supervised learning. Journal of Physics A, 26:63-71, 1993.
[24] L. Hansen and P. Salamon. Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12:993-1001, 1990.
[25] S. Hanson and L. Pratt. A comparison of different biases for minimal network construction with back-propagation. In D. Touretzky, editor, Advances in Neural Information Processing Systems 1, pages 177-185. Morgan Kaufmann, 1989.
[26] J. Hertz, A. Krogh, and R. Palmer. Introduction to the theory of neural computation. Addison-Wesley, Redwood City, 1991.
[27] T. Heskes. Transition times in self-organizing maps. Submitted to Biological Cybernetics, 1992.
[28] T. Heskes and B. Kappen. Learning processes in neural networks. Physical Review A, 44:2718-2726, 1991.
[29] T. Heskes and B. Kappen. Learning-parameter adjustment in neural networks. Physical Review A, 45:8885-8893, 1992.
[30] T. Heskes and B. Kappen. Error potentials for self-organization. In 1993 IEEE International Conference on Neural Networks, San Francisco, 1993.
[31] T. Heskes, E. Slijpen, and B. Kappen. Cooling schedules for learning in neural networks. Submitted to Physical Review E, 1992.
[32] T. Heskes, E. Slijpen, and B. Kappen. Learning in neural networks with local minima. Physical Review A, 46:5221-5231, 1992.
[33] G. Hinton. Connectionist learning procedures. Artificial Intelligence, 40:185-234, 1989.
[34] K. Hornik and C. Kuan. Convergence analysis of local feature extraction algorithms. Neural Networks, 5:229-240, 1992.
[35] R. Jacobs. Increased rates of convergence through learning rate adaptation. Neural Networks, 1:295-307, 1988.
[36] S. Kirkpatrick, C. Gelatt, and M. Vecchi. Optimization by simulated annealing. Science, 220:671-680, 1983.
[37] T. Kohonen. Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43:59-69, 1982.
[38] T. Kohonen. Self-organization and associative memory. Springer, New York, 1988.
[39] A. Kramer and A. Sangiovanni-Vincentelli. Efficient parallel learning algorithms for neural networks. In D. Touretzky, editor, Advances in Neural Information Processing Systems 1, pages 40-48. Morgan Kaufmann, 1989.
[40] C. Kuan and K. Hornik. Convergence of learning algorithms with constant learning rates. IEEE Transactions on Neural Networks, 2:484-489, 1991.
[41] H. Kushner. Robustness and approximation of escape times and large deviations estimates for systems with small noise effects. SIAM Journal of Applied Mathematics, 44:160-182, 1984.
[42] H. Kushner. Asymptotic global behavior for stochastic approximation and diffusions with slowly decreasing noise effects: global minimization via Monte Carlo. SIAM Journal of Applied Mathematics, 47:169-185, 1987.
[43] H. Kushner and D. Clark. Stochastic approximation methods for constrained and unconstrained systems. Springer, New York, 1978.
[44] T. Leen and G. Orr. Weight-space probability densities and convergence times for stochastic learning. In International Joint Conference on Neural Networks. IEEE, 1992.
[45] E. Levin, N. Tishby, and S. Solla. A statistical approach to learning and generalization in layered neural networks. Proceedings of the IEEE, 78:1568-1574, 1990.
[46] P. Lisboa and S. Perantonis. Complete solution of the local minima in the XOR problem. Network, 2:119-124, 1991.
[47] L. Ljung. Analysis of recursive stochastic algorithms. IEEE Transactions on Automatic Control, AC-22:551-575, 1977.
[48] Z. Luo. On the convergence of the LMS algorithm with adaptive learning rate for linear feedforward networks. Neural Computation, 3:226-245, 1991.
[49] E. Oja. A simplified neuron model as a principal component analyzer. Journal of Mathematical Biology, 15:267-273, 1982.
[50] G. Radons, H. Schuster, and D. Werner. Fokker-Planck description of learning in backpropagation networks. In International Neural Network Conference 90, Paris, pages 993-996, Dordrecht, 1990. Kluwer Academic.
[51] H. Ritter, K. Obermayer, K. Schulten, and J. Rubner. Self-organizing maps and adaptive filters. In E. Domany, J. van Hemmen, and K. Schulten, editors, Models of neural networks, pages 281-306, Berlin, 1991. Springer.
[52] H. Ritter and K. Schulten. On the stationary state of Kohonen's self-organizing sensory mapping. Biological Cybernetics, 54:99-106, 1986.
[53] H. Ritter and K. Schulten. Convergence properties of Kohonen's topology conserving maps: fluctuations, stability, and dimension selection. Biological Cybernetics, 60:59-71, 1988.
[54] F. Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65:386-408, 1958.
[55] D. Rumelhart, G. Hinton, and R. Williams. Learning representations by back-propagating errors. Nature, 323:533-536, 1986.
[56] T. Sanger. Optimal unsupervised learning in a single-layer linear feedforward neural network. Neural Networks, 2:459-473, 1989.
[57] H. Seung, H. Sompolinsky, and N. Tishby. Statistical mechanics of learning from examples. Physical Review A, 45:6056-6091, 1992.
[58] S. Shinomoto and Y. Kabashima. Finite time scaling of energy in simulated annealing. Journal of Physics A, 24:L141-L144, 1991.
[59] S. Singhal and L. Wu. Training multilayered perceptrons with the extended Kalman algorithm. In D. Touretzky, editor, Advances in Neural Information Processing Systems 1, pages 133-140, San Mateo, 1989. Morgan Kaufmann.
[60] M. Styblinski and T. Tang. Experiments in nonconvex optimization: stochastic approximation with function smoothing and simulated annealing. Neural Networks, 3:467-483, 1990.
[61] H. Szu. Fast simulated annealing. In J. Denker, editor, Neural Networks for Computing, pages 420-425, New York, 1986. American Institute of Physics.
[62] J. Tsitsiklis. Markov chains with rare transitions and simulated annealing. Mathematics of Operations Research, 14:70-71, 1989.
[63] N. van Kampen. Stochastic processes in physics and chemistry. North-Holland, Amsterdam, 1981.
[64] T. Watkin, A. Rau, and M. Biehl. The statistical mechanics of learning a rule. To be published in Reviews of Modern Physics, 1992.
[65] H. White. Learning in artificial neural networks: a statistical perspective. Neural Computation, 1:425-464, 1989.
[66] B. Widrow and M. Hoff. Adaptive switching circuits. 1960 IRE WESCON Convention Record, Part 4, pages 96-104, 1960.
[67] N. Wiener. I am a mathematician. Doubleday, New York, 1956.
Mathematical Approaches to Neural Networks
J.G. Taylor (Editor)
© 1993 Elsevier Science Publishers B.V. All rights reserved.
Multilayer Functionals

D. S. Modhaᵃ and R. Hecht-Nielsenᵇ

ᵃDepartment of Electrical & Computer Engineering and Institute for Neural Computation, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0407, USA

ᵇHNC, Inc., and Department of Electrical & Computer Engineering and Institute for Neural Computation, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0407, USA
Abstract Inspired by the architecture of multilayer feedforward networks on Euclidean spaces, we introduce multilayer functionals: a new parametric family of real valued mappings on arbitrary normed linear spaces. We establish that multilayer functionals on a normed linear space can uniformly approximate on compact sets any continuous functional on the normed linear space. Multilayer functionals are then employed to define multilayer operators: a new parametric family of operators on certain spaces of real valued functions (or sequences) on the real line. We establish that multilayer operators can provide an input-output representation, in a certain sense, for any time-invariant system characterized by a continuous functional. We show that the systems representable by multilayer operators are necessarily stable and continuous, but can be continuous-time or discrete-time, finite or infinite memory, causal or noncausal, and linear or nonlinear. We exemplify the developed theory for multilayer operators by providing explicit representations for two practical classes of systems.
1 Introduction
Multilayer feedforward networks [30] are universal function approximators [6,11,12,37]. Multilayer feedforward networks also bypass the "curse of dimensionality" in approximating [4,14] and estimating [5] functions with bounded spectral norms. To a partial extent, these results have provided an explanation for the numerous successes achieved in applications of multilayer feedforward networks. In addition, the results have also provided the previously unavailable theoretical guidelines for conceiving novel applications of multilayer feedforward networks in function approximation and statistical inference [36,37]. Multilayer feedforward networks have also been used in a number of applications, but not solely because they are universal function approximators. Narendra and Parthasarathy
[18], for example, have proposed using multilayer feedforward networks with feedback as general input-output models for nonlinear discrete-time systems. Hudson et al. [13] have proposed using multilayer feedforward networks with time-delayed inputs as nonlinear filters. On the other hand, Pineda [23], and Williams and Zipser [40], have generalized multilayer feedforward networks to recurrent networks, and have proposed using recurrent networks as general state space models for nonlinear discrete-time systems.
In all of these applications, the networks can be thought of as representing a time-invariant continuous-time (or discrete-time) operator on some set of real valued functions (or sequences) on the real line. To justify such applications, in our opinion, a theory with the following essential ingredients is needed.
1. Define a class of operators analogous to multilayer feedforward networks on some precisely characterized sets of real valued functions (or sequences) on the real line.
2. Precisely determine the approximation capabilities of the proposed class of operators.
3. Provide a simple realization theory for implementing the proposed class of operators.
In this paper, we develop such a theory for a class of time-invariant operators. A time-invariant operator mapping an input space of real valued functions (or sequences) on the real line into an output space of real valued functions (or sequences) on the real line can be completely characterized by some real valued mapping, namely a functional, on a certain restriction (to be defined in Section 4) of the input space [3,22]. Thus, to understand time-invariant operators on any given input space, it is sufficient to study functionals on the restriction of the given input space.
In Section 2, motivated by multilayer feedforward networks on Euclidean spaces [30], we define a class of real valued mappings on arbitrary normed linear spaces. This class is termed multilayer functionals, and can be obtained from multilayer feedforward networks by replacing the first hidden layer of affine sums with a layer of affine functionals on the normed linear space. In Section 3, we establish that multilayer functionals (with a squashing function [11] as the hidden unit activation function) on a normed linear space are uniformly dense on compact sets in the set of continuous functionals on the normed linear space. We also define Volterra functionals on arbitrary normed linear spaces, and establish that they possess precisely the same approximation capabilities as multilayer functionals. However, we anticipate that results showing that multilayer functionals have a superior representation efficiency over Volterra functionals for some restricted class of functionals will soon follow.
In Section 4, we define a class of time-invariant operators on some input spaces of real valued functions (or sequences) on the real line; the defined class of operators is characterized by continuous multilayer functionals on the restrictions of the input spaces. This class of operators is termed multilayer operators. We establish that multilayer operators can provide an input-output representation, in a certain sense to be made precise by Theorem 4.1, for any time-invariant system characterized by a continuous functional. We show that the systems representable by multilayer operators are necessarily stable (bounded input bounded output [41]) and continuous, but can be continuous-time or discrete-time,
finite or infinite memory, causal or noncausal, and linear or nonlinear. Moreover, multilayer operators themselves are also stable and continuous, and consequently guarantee stable implementations of the representable systems. Thus, multilayer operators provide a framework to define and implement a large class of systems.
In a closely related setting, Sandberg [31] studies an exponential family of functionals on compact topological spaces, and utilizes the exponential family of functionals to establish the approximation capabilities of radial basis functionals, and to obtain a result equivalent to our Lemma A.1. Sandberg also employs his functionals to represent a class of systems. However, in this paper, we provide a unified treatment for a larger class of systems.
Multilayer operators are input-output models and cannot explain state space representations such as recurrent networks [23,40] and input-output models with feedback [18]. Multilayer operators, however, do explain and generalize the time delay neural networks in Back and Tsoi [1], in Hecht-Nielsen [10], in Hudson et al. [13], and in Wan [35]. We observe that multilayer operators can be thought of as being composed of a number of linear systems, and thus we can exploit the rich realization theory already available for linear systems [21] in implementing multilayer operators. In Section 5, we illustrate the above observation and the abstract theory developed in Section 4 for multilayer operators via two practical examples: a class of continuous-time, infinite memory, noncausal systems, and a class of discrete-time, finite memory, causal systems. For the class of continuous-time systems studied, we provide concrete realizations and a sampling criterion. For the class of discrete-time systems studied, we provide a fast implementation analogous to the Fast Fourier Transform [19]. In Section 6, we provide conclusions.
To keep the exposition simple, in this paper, we restrict the discussion to single input single output systems. The generalization to vector inputs and outputs is straightforward. All but the most obvious proofs are collected in an appendix.
2 Multilayer Functionals

We introduce some notations and definitions, which will enable us to define multilayer functionals and to precisely discuss their role in functional approximation.
A functional is any real (or complex) valued function on a linear space. In this paper, however, we will restrict attention to real valued functionals. A functional F on a linear space X is linear if for any u, v ∈ X, and any α, β ∈ ℝ, F(αu + βv) = αF(u) + βF(v). A functional F on a normed linear space X is continuous iff lim_{n→∞} F(uₙ) = F(u) for every sequence {uₙ} in X converging to u ∈ X. A linear functional on a normed linear space is bounded iff it is continuous [7]. For a normed linear space X, the set of all continuous (and hence bounded) linear functionals on X is called the dual of X, and is denoted by X*. We define an affine functional on X to mean a functional of the form
A(u) = b + L(u) ,     (1)

where u ∈ X, b ∈ ℝ, and L ∈ X*. Continuity of L trivially implies that of A. Let AF(X, ℝ) denote the set of all affine functionals on X. Affine functionals on X will be used in the sequel as the building blocks for constructing arbitrary continuous functionals on X.
First, we characterize affine functionals on ℝʳ and use them to define multilayer feedforward networks. ℝʳ is its own dual [7]. Therefore, all the continuous linear functionals L on ℝʳ can be expressed in the form

L(u) = \sum_{i=1}^{r} w_i u_i ,     (2)

where w = (w₁, ..., w_r) ∈ ℝʳ and u = (u₁, ..., u_r) ∈ ℝʳ. From Equations 1 and 2 it follows that all the affine functionals on ℝʳ have the form

A(u) = b + \sum_{i=1}^{r} w_i u_i ,     (3)

where b ∈ ℝ and w, u are as in Equation 2. The ADALINE (Adaptive Linear Element) of Widrow and Hoff [38] is simply an affine functional on ℝʳ. The input to ADALINE is an r-dimensional vector in ℝʳ, and its (unknown) nonbias weights can be interpreted as an r-dimensional vector in (ℝʳ)* ≅ ℝʳ, where ≅ is to be read as "is isometrically isomorphic to". In addition, the bias weight b of the ADALINE is a scalar in ℝ. We now recall that multilayer feedforward networks [30] are arbitrary (that is, not necessarily linear or affine) functionals on ℝʳ represented as weighted superpositions of affine functionals on ℝʳ composed with a sigmoidal nonlinearity. These are formally defined next.
Definition 2.1 For any function ψ : ℝ → ℝ, multilayer feedforward networks with n units in the hidden layer on ℝʳ are defined by

\mathrm{MF}_\psi^n(\mathbb{R}^r, \mathbb{R}) = \left\{ F : \mathbb{R}^r \to \mathbb{R} \;\middle|\; F(u) = \sum_{i=1}^{n} \beta_i\, \psi(A_i(u)),\; u \in \mathbb{R}^r,\; \beta_i \in \mathbb{R},\; \text{and } A_i \in \mathrm{AF}(\mathbb{R}^r, \mathbb{R}) \text{ for } i = 1, \dots, n \right\}.     (4)

For any continuous hidden unit activation function ψ, MF_ψ^n(ℝʳ, ℝ) ⊂ C(ℝʳ, ℝ). Definition 2.1 motivates the central issue addressed in this section: Multilayer feedforward networks are defined on ℝʳ. Can we define analogues of multilayer feedforward networks on arbitrary normed linear spaces?
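Definition 2.1 translates directly into a few lines of code. The sketch below evaluates a multilayer feedforward network on ℝʳ with the familiar sigmoid as the squashing function; the weights are arbitrary illustrative values.

```python
import numpy as np

def psi(a):
    return 1.0 / (1.0 + np.exp(-a))              # the familiar sigmoid squasher

def multilayer_network(u, beta, W, b):
    # F(u) = sum_i beta_i * psi(A_i(u)),  with affine functionals A_i(u) = b_i + <w_i, u>
    return beta @ psi(W @ u + b)

rng = np.random.default_rng(0)
r, n = 3, 5                                      # input dimension, hidden units
beta, W, b = rng.normal(size=n), rng.normal(size=(n, r)), rng.normal(size=n)
u = np.array([0.2, -1.0, 0.5])
print(multilayer_network(u, beta, W, b))
```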
Before considering abstract normed linear spaces, we illustrate the working for C(I, ℝ), the set of all real valued, continuous functions with domain I ≡ [0,1]. Compactness of I trivially implies that elements of C(I, ℝ) are bounded. We associate with each f ∈ C(I, ℝ) its uniform norm ‖f‖ ≡ sup_{t∈I} |f(t)|. It is well known that the metric ‖f − g‖, f, g ∈ C(I, ℝ), makes C(I, ℝ) into a complete metric space, that is, a Banach space [27]. The metric also allows us to talk about the compactness of the subsets of C(I, ℝ). K ⊂ C(I, ℝ) is compact if every sequence in K contains a convergent subsequence [27].
The dual of C(I, ℝ) is the space of Lebesgue-Stieltjes integrals [7]. Thus, all continuous linear functionals L on C(I, ℝ) can be expressed in the form [34]

L(u) = \int_{I} u(t)\, dw(t) ,     (6)

where u ∈ C(I, ℝ) and w ∈ NBV(I). In other words, (C(I, ℝ))* ≅ NBV(I). NBV(I) denotes the set of functions of bounded variation on I which vanish at 0, and are right continuous [7]. To every w ∈ NBV(I) there corresponds a positive finite measure μ_w, such that w(t) = μ_w([0, t]) for all t ∈ I [7, Theorem 3.29]. Thus, Equation 6 can also be written as

L(u) = \int_{I} u(t)\, d\mu_w(t) .     (7)

It is also well known that for every w ∈ NBV(I), the set of points of discontinuity of w is countable, and that the Radon-Nikodym derivative w′ of μ_w with respect to the (ordinary) Lebesgue measure (m) exists almost everywhere (that is, except on a set of m-measure zero) and is absolutely integrable on I with respect to the Lebesgue measure [7]. Thus, by the Radon-Nikodym theorem we can write

d\mu_w(t) = w'(t)\, dm(t) + d\lambda(t) ,     (8)

where the measure λ lives on a set of m-measure zero. λ assigns point masses to the points of discontinuity of w, and the mass assigned by λ to any point of discontinuity of w is precisely the amount of jump of w at that point. Thus, we can now write Equation 7 as

L(u) = \int_{I} u(t)\, w'(t)\, dm(t) + \sum_{n} u(t_n)\, \lambda(\{t_n\}) ,     (9)

where {tₙ}ₙ, tₙ ∈ I, denotes the points of discontinuity of w. It now follows from Equations 1 and 9 that all the affine functionals on C(I, ℝ) have the form

A(u) = b + \int_{I} u(t)\, w'(t)\, dm(t) + \sum_{n} u(t_n)\, \lambda(\{t_n\}) ,     (10)
where b ∈ ℝ and u, w, w′, {tₙ}ₙ are as above. Armed with the most general form of an affine functional on C(I, ℝ) and inspired by Definition 2.1, we now define a class of functionals on C(I, ℝ).

Definition 2.2 For any function ψ : ℝ → ℝ, multilayer functionals with n units in the hidden layer on C(I, ℝ) are defined by

\mathrm{MF}_\psi^n(C(I,\mathbb{R}), \mathbb{R}) = \left\{ F : C(I,\mathbb{R}) \to \mathbb{R} \;\middle|\; F(u) = \sum_{i=1}^{n} \beta_i\, \psi(A_i(u)),\; u \in C(I,\mathbb{R}),\; \beta_i \in \mathbb{R},\; \text{and } A_i \in \mathrm{AF}(C(I,\mathbb{R}), \mathbb{R}) \text{ for } i = 1, \dots, n \right\}.     (11)

If the hidden unit activation function ψ is selected to be continuous, then multilayer functionals MF_ψ^n(C(I, ℝ), ℝ) are also continuous. The class of functionals on C(I, ℝ) in Definition 2.2 is very closely related to multilayer feedforward networks in Definition 2.1. In both cases, the inputs u are in a normed linear space X, while the (unknown) nonbias weights are in the dual space of X, namely X*. The two classes of functionals share the same general form except for the choice of the domain X: multilayer feedforward networks are defined over ℝʳ, while MF_ψ^n(C(I, ℝ), ℝ) are defined over C(I, ℝ). The abstract structure shared by the above definitions is captured in the following definition.
Definition 2.3 Let X denote a normed linear space, and let AF(X, ℝ) denote the set of all affine functionals on X. For any function ψ : ℝ → ℝ, multilayer functionals with n units in the hidden layer on X are defined by

\mathrm{MF}_\psi^n(X, \mathbb{R}) = \left\{ F : X \to \mathbb{R} \;\middle|\; F(u) = \sum_{i=1}^{n} \beta_i\, \psi(A_i(u)),\; u \in X,\; \beta_i \in \mathbb{R},\; \text{and } A_i \in \mathrm{AF}(X, \mathbb{R}) \text{ for } i = 1, \dots, n \right\}.     (12)

Note that, if ψ is continuous then MF_ψ^n(X, ℝ) ⊂ C(X, ℝ). The above definition, however, does not spare us the burden of choosing a normed linear space X, and characterizing its dual space X*. Fortunately, the dual spaces of commonly used normed linear spaces are known [7,28]. MF_ψ^n(X, ℝ) denotes the set of all multilayer functionals on X with n units in the hidden layer. In practice, however, we are interested in the set of multilayer functionals with an arbitrary number of units in the hidden layer. Let MF_ψ(X, ℝ) denote the set of multilayer functionals with an arbitrary number of hidden units; then

\mathrm{MF}_\psi(X, \mathbb{R}) = \bigcup_{n=1}^{\infty} \mathrm{MF}_\psi^n(X, \mathbb{R}).     (13)

Thus, MF_ψ(X, ℝ) denotes a parametric family of functionals on X parametrized by elements of AF(X, ℝ). In the next section, we examine the approximation properties of MF_ψ(X, ℝ).
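As a numerical illustration of Definition 2.2 (and hence of Definition 2.3 with X = C(I, ℝ)), the sketch below evaluates a small multilayer functional whose affine functionals have the form of Equation 10, with the integral approximated on a grid; the kernels w′, the point masses, and all weights are arbitrary illustrative choices.

```python
import numpy as np

def psi(a):
    return 1.0 / (1.0 + np.exp(-a))              # sigmoid squashing function

t = np.linspace(0.0, 1.0, 201)                   # grid on I = [0, 1]

def affine_functional(u, b, w_prime, t_points, masses):
    # A(u) = b + int_0^1 u(t) w'(t) dt + sum_n masses_n * u(t_n)   (cf. Equation 10)
    integral = np.sum(u(t) * w_prime(t)) * (t[1] - t[0])
    return b + integral + masses @ u(t_points)

def multilayer_functional(u, betas, affines):
    # F(u) = sum_i beta_i * psi(A_i(u))   (cf. Equation 11)
    return sum(beta * psi(A(u)) for beta, A in zip(betas, affines))

A1 = lambda u: affine_functional(u, 0.1, np.cos, np.array([0.5]), np.array([2.0]))
A2 = lambda u: affine_functional(u, -0.3, lambda s: s,
                                 np.array([0.25, 0.75]), np.array([1.0, -1.0]))
F = lambda u: multilayer_functional(u, [1.5, -0.7], [A1, A2])

print(F(np.sin), F(lambda s: s**2))              # evaluate F on two inputs in C(I, R)
```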
3 Approximation Properties of Multilayer Functionals

It is well known that for any continuous, bounded, nonconstant hidden unit activation function ψ, multilayer functionals MF_ψ(ℝʳ, ℝ) are uniformly dense on compact sets in C(ℝʳ, ℝ) [12]. This fact and the availability of Definition 2.3 motivate the central question addressed in this section: For an arbitrary normed linear space X, are multilayer functionals MF_ψ(X, ℝ) uniformly dense on compact sets in C(X, ℝ)?
3.A
Universal Approximation
Let X be a normed linear space over ℝ. The norm on X defines a topology on X called the norm topology. The topology on X allows us to talk about continuous functionals on X. Let C(X, ℝ) denote the set of all real valued continuous functionals on X. We equip C(X, ℝ) with the topology of uniform convergence on compact sets. The topology on C(X, ℝ) is generated by associating with each element F of C(X, ℝ) a family of seminorms ‖·‖_K such that

‖F‖_K = sup_{u ∈ K} |F(u)|,    (14)
where K is some compact subset of X. A set S(X, ℝ) of real valued functionals on X is said to be uniformly dense on compact sets in C(X, ℝ) if for every compact K ⊂ X, every ε > 0, and every F ∈ C(X, ℝ), there exists a G ∈ S(X, ℝ) such that ‖F − G‖_K < ε. We need some more terminology. A function ψ : ℝ → [0, 1] is a squashing function if it is nondecreasing, lim_{λ→∞} ψ(λ) = 1, and lim_{λ→−∞} ψ(λ) = 0. For example, the familiar sigmoid function (ψ(λ) = (1 + exp(−λ))⁻¹), the cosine squasher of Gallant and White [8], and the Cantor–Lebesgue function [7] are all squashing functions. We now state the central result of this section. (The proof can be found in the Appendix.)
Theorem 3.1 For a normed linear space X and a squashing function ψ, multilayer functionals MF_ψ(X, ℝ) are uniformly dense on compact sets in C(X, ℝ).

In the sense made precise by the above theorem, multilayer functionals are universal functional approximators. Theorem 3.1 requires only that X be a normed linear space. This is a relatively mild condition, and hence the theorem applies to a large number of practical spaces. We now consider some examples of normed linear spaces and comment on their dual spaces. (A small numerical sketch of the first, finite-dimensional example appears after the list.)

1. Finite-dimensional Euclidean space (ℝ^r)
Let X = ℝ^r. The norm on X is the usual norm ‖u‖_X = (Σ_{i=1}^r u_i²)^{1/2}, where u = (u_1, …, u_r) ∈ ℝ^r. Theorem 3.1, in this case, reduces precisely to the well known result of Hornik, Stinchcombe, and White [11].
2. The space of compactly supported continuous functions
Let V be a topological space, and let C(V, ℝ) denote the set of all continuous functions on V. The support of f ∈ C(V, ℝ) is defined as the smallest closed set outside of which f vanishes. We now define

C_c(V, ℝ) = { f ∈ C(V, ℝ) | the support of f is compact }.    (15)

Then, X = C_c(V, ℝ) is a normed linear space with the uniform norm given by

‖f‖_∞ = sup_{v ∈ V} |f(v)|,    (16)

where f ∈ C_c(V, ℝ). If V = I = [0, 1], then V is compact and C_c(I, ℝ) = C(I, ℝ). This is precisely the space considered in Section 2, where it was established that the continuous linear functionals on C(I, ℝ) are elements of the space of Lebesgue–Stieltjes integrals.
3. L^p spaces
Let V be a set equipped with a σ-algebra M and a measure μ. Then (V, M, μ) is called a measure space. If f is a real valued measurable function on V, then for 1 ≤ p < ∞ we define

‖f‖_p = ( ∫_V |f|^p dμ )^{1/p},    (17)

and if p = ∞

‖f‖_∞ = inf{ a | μ({ t : |f(t)| > a }) = 0 }.    (18)

Now, we define

L^p(V, M, μ) = { f : V → ℝ | f is measurable and ‖f‖_p < ∞ }.    (19)

These L^p(V, M, μ) are normed linear spaces. Let V ⊂ ℝ, let M denote the Borel σ-algebra on V, and let μ denote the Lebesgue measure m. Then L^p(V, M, μ) is denoted simply by L^p(V). If μ is a σ-finite measure and 1 ≤ p < ∞, then (L^p)* ≅ L^q, where p⁻¹ + q⁻¹ = 1.
4. ℓ^p spaces
In example 3, let V ⊂ Z, the set of all integers, let M = P(V ∩ Z), the power set of V ∩ Z, and select μ to be the counting measure. Then L^p(V ∩ Z, P(V ∩ Z), μ) is denoted by ℓ^p(V).
5. Sobolev Spaces
In example 3, let V ⊂ ℝ, let M denote the Borel σ-algebra on V, and let μ denote the Lebesgue measure m. Let k ∈ N, the set of all natural numbers; then the space of all functions f ∈ L²(V) whose distributional derivatives ∂^α f are also contained in L²(V) for |α| ≤ k is known as a Sobolev space, and is denoted by H^k(V). H^k(V) is a Hilbert space with the following inner product

(f, g) = Σ_{|α| ≤ k} ∫_V ∂^α f ∂^α g dm,    (20)

where f, g ∈ H^k(V). H^k(V) are normed linear spaces with the norm of any element f defined as (f, f)^{1/2}.

H^k(V) is a Hilbert space and is therefore its own dual. C(I, ℝ), L^p(V), ℓ^p(V ∩ Z), and H^k(V), where I = [0, 1], V ⊂ ℝ, and 1 ≤ p ≤ ∞, have more than exemplar value; they will denote various possible admissible sets of input signal functions (or sequences) to a general system in Section 4. Together they encompass almost all functions of engineering interest.
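As promised above, here is a tiny numerical sketch (ours, with random placeholder weights) of example 1: when X = ℝ^r an affine functional is A_i(u) = w_i·u + b_i, and Equation 12 collapses to an ordinary single-hidden-layer feedforward network of the Hornik–Stinchcombe–White type.

```python
import numpy as np

rng = np.random.default_rng(0)
r, n_hidden = 4, 8                      # input dimension and hidden units (assumed)
W = rng.normal(size=(n_hidden, r))      # affine parts A_i(u) = W[i] @ u + b[i]
b = rng.normal(size=n_hidden)
beta = rng.normal(size=n_hidden)        # output weights beta_i

def sigmoid(x):
    """The familiar squashing function psi(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def multilayer_functional_Rr(u):
    """F(u) = sum_i beta_i * psi(W_i . u + b_i)."""
    return float(beta @ sigmoid(W @ u + b))

print(multilayer_functional_Rr(np.array([0.5, -1.0, 2.0, 0.0])))
```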
3.B
A Conjecture
Theorem 3.1 assumes that the hidden unit activation function ψ is a squashing function. But in the case of multilayer feedforward networks it is known that an arbitrary nonlinearity is sufficient for universal function approximation [12]. We conjecture that such a result also holds for multilayer functionals.
Conjecture 3.1 For a normed linear space X and any continuous, bounded, nonconstant function ψ, multilayer functionals MF_ψ(X, ℝ) are uniformly dense on compact sets in C(X, ℝ).
It may be possible to establish Conjecture 3.1 by an extension of the results of Cybenko [6] and Hornik [12] from ℝ^r to arbitrary normed linear spaces.
3.C
Relation to Volterra Functionals
Volterra functionals have a rich history of applications in nonlinear systems theory. Originally, Volterra [34] defined Volterra functionals on the space of compactly supported continuous functions on the real line (denoted C_c(ℝ, ℝ)). His definition was inspired by the structure of polynomial functions on the real line. By a use of the Fréchet–Weierstrass theorem he also established that Volterra functionals are uniformly dense on compact sets in C(C_c(ℝ, ℝ), ℝ) (compare Theorem 3.1). Wiener [39], seeking alternatives to linear filtering and the Gaussianity assumption, introduced orthogonalized Volterra functionals, which are called Wiener functionals. Some other notable contributions to Volterra functionals can be found in Barrett [3], Gallman and Narendra [9], Koh and Powers [15], Morháč [17], Palm and Poggio [22], Porter [24], Prenter [25], and Root [26]. Interested readers may also refer to the books by Banks [2], Rugh [29], and Schetzen [32]. Let X be a normed linear space; then we define X^i = ∏_{j=1}^i X to mean the i-dimensional Cartesian product of X [7]. X^i may be identified with the set of ordered i-tuples of elements of X. X^i is a normed linear space with the norm of any element (u_1, …, u_i) defined as

‖(u_1, …, u_i)‖_{X^i} = Σ_{j=1}^i ‖u_j‖_X,    (21)

where ‖·‖_{X^i} denotes the norm on X^i, and ‖·‖_X denotes the norm on X. For u ∈ X, we define u^i to mean the ordered i-tuple (u, …, u) ∈ X^i. Let (X^i)* denote the dual of X^i. The elements of (X^i)* are continuous linear functionals on X^i and are denoted by L^i. We now define Volterra functionals on an arbitrary normed linear space. To our knowledge, such a general definition has not previously appeared in the literature.
Definition 3.1 Let X denote an arbitrary normed linear space; then Volterra functionals of integer order n on X are defined by

VF^n(X, ℝ) = { F : X → ℝ | F(u) = β + Σ_{i=1}^n L^i(u^i), u ∈ X, β ∈ ℝ, u^i = (u, …, u) ∈ X^i, and L^i ∈ (X^i)* for i = 1, …, n }.    (22)
Clearly, VF^n(X, ℝ) ⊂ C(X, ℝ); that is, Volterra functionals are continuous. The set of all Volterra functionals on X is given by

VF(X, ℝ) = ∪_{n=1}^∞ VF^n(X, ℝ).    (23)
Some properties of Volterra functionals are obvious.
Proposition 3.1 For a normed linear space X, Volterra functionals VF(X, ℝ) are uniformly dense on compact sets in C(X, ℝ).

This follows since Volterra functionals separate points, contain constants, and form a closed subalgebra of C(K, ℝ) for any compact subset K of X. Thus, we can apply the Stone–Weierstrass theorem to arbitrary compact subsets K of X in the spirit of Lemma A.1.

Corollary 3.1 For a normed linear space X and a squashing function ψ, multilayer functionals MF_ψ(X, ℝ) are uniformly dense on compact sets in Volterra functionals VF(X, ℝ).

The result follows trivially as a corollary to Theorem 3.1. Thus, any Volterra functional can be approximated to an arbitrary degree of accuracy by some multilayer functional. This implies that multilayer functionals can directly replace Volterra functionals in the existing applications [15, 29, 32]. Recently, Jones [14] and Barron [4] have established that multilayer feedforward networks bypass the curse of dimensionality in representing functions on ℝ^r with bounded spectral norms. Notably, polynomials do not possess this advantageous property [4]. This leads us to conjecture that multilayer functionals (which are analogous to multilayer feedforward networks) on an arbitrary normed linear space X provide a more compact representation of some subclass of functionals on X than do Volterra functionals (which are analogous to polynomials) on X. A precise characterization of this conjecture, and the proof thereof, is left as an open problem. We are now equipped to introduce multilayer operators and establish their role in system representation.
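To make the comparison above concrete, the following sketch (ours; kernels and weights are arbitrary placeholders, not from the chapter) evaluates a second-order Volterra functional and a multilayer functional on the same sampled input, so that X is effectively ℝ^r.

```python
import numpy as np

rng = np.random.default_rng(1)
r = 16
u = rng.normal(size=r)                    # a sampled input "signal"

# Second-order Volterra functional: F(u) = beta + L1(u) + L2(u, u)
beta0 = 0.3
k1 = rng.normal(size=r)                   # first-order kernel
k2 = rng.normal(size=(r, r))              # second-order kernel
volterra_value = beta0 + k1 @ u + u @ k2 @ u

# Multilayer functional with n hidden units: F(u) = sum_i beta_i tanh(w_i.u + b_i)
n = 6
W, b, beta = rng.normal(size=(n, r)), rng.normal(size=n), rng.normal(size=n)
multilayer_value = beta @ np.tanh(W @ u + b)

print(volterra_value, multilayer_value)
```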
4
Multilayer Operators in System Representation
In this section, we introduce some notation, definitions and assumptions necessary to introduce the notion of a general input-output representation for time-invariant systems. Thus, to incorporate the notion of time, we will be concerned here only with operators mapping real valued functions of time into real valued functions of time.
4.A
System and Time-Invariance
In order to develop a single theory for both continuous-time as well as discrete-time systems, we define 𝒯 as the set of all possible time instances. For continuous-time systems we set 𝒯 = ℝ, while for discrete-time systems we set 𝒯 = Z, the set of all integers. Let J ⊂ 𝒯 denote an interval of time over which the behavior of a class of systems is of interest. Let U and Y denote some sets of real valued functions over J. In the sequel, U will serve to denote the set of all admissible inputs to the class of systems of interest, and Y will serve to denote the set of possible outputs of the class of systems of interest. For all t ∈ J, let J_t be some subset of 𝒯. We define J_t − t to mean the subset of 𝒯 obtained by translating each element of J_t by −t. Let u ∈ U be an input signal function; then we define the restriction u_t of u to J_t as

u_t(s) = u(t + s),    (24)

where t ∈ J and s ∈ J_t − t. Clearly, the restriction u_t is defined over the interval J_t − t. We define the restriction U_t of U to J_t as the set of restrictions u_t of all u ∈ U to J_t. Clearly, U_t is some set of real valued functions over J_t − t. For all t ∈ J, let F_t denote a functional mapping U_t to the real line. A system is simply an operator mapping U to Y. In other words, a system S acts on an input function u in U to emit an output function y = S(u) in Y. Without any loss of generality, the output of any system S at every t ∈ J can be given by

y(t) = F_t(u_t),    (25)

where the interval J ⊂ 𝒯 and the set {J_t, U_t, F_t}_{t∈J} are individually specified for each system S [3, 22, 41]. {J, J_t, U_t, F_t}_{t∈J} is called an input-output representation for the system S. In this paper, we are interested only in time-invariant systems. The concept of time-invariance is introduced in the following definition.
Definition 4.1 A system S : U → Y described by {J, J_t, U_t, F_t}_{t∈J} is defined to be time-invariant if
1. For all t ∈ J, the sets J_t − t are all identical and are denoted by J_0.
2. U is such that for all t ∈ J, the restrictions U_t are all identical and are denoted by U_0.
3. For all t ∈ J, the functionals F_t : U_t → ℝ are all identical and are denoted by F_0.

A time-invariant system S can be completely described by {J, J_0, U_0, F_0}. J_0 ⊂ 𝒯 has the interpretation of characterizing the memory of the system S. If J_0 has a finite length, then the system is said to have a finite memory. On the other hand, if J_0 has an infinite length, then the system is said to have an infinite memory. If J_0 is a subset of 𝒯 ∩ (−∞, 0], then the system is said to be causal, and if J_0 is not a subset of 𝒯 ∩ (−∞, 0], then the system is said to be noncausal. In the subsequent treatment, however, we place no restrictions on J_0, and hence it can be chosen at will. In this sense, the developed theory deals with finite or infinite memory systems, as well as with causal or noncausal systems.
U_0 describes the nature of the inputs allowed over the memory J_0 of the system. Once U_0 is specified, Definition 4.1 automatically provides every restriction U_t for all t ∈ J, and hence also U. Thus, for a time-invariant system, U_0 serves as the fundamental input space. U_0 can be picked at will, provided it is a normed linear space of real valued functions over J_0 ⊂ 𝒯. Please refer to Section 3 for examples of some practically important normed linear spaces. U_0 is a normed linear space, and thus has the norm topology. The norm topology on U_0, and consequently on U_t for all t ∈ J, automatically induces a topology on U generated by the following open sets

{ v ∈ U | ‖v_t − u_t‖_{U_0} < n⁻¹ for all t ∈ J },    (26)

where n ∈ N, u ∈ U, and ‖·‖_{U_0} denotes the norm on U_0. The topology on U allows us to talk about convergence of elements of U, and hence about the continuity of systems defined on U. The functional F_0 in Definition 4.1 is termed the characteristic functional of the system S. We now state some relationships between F_0 and S. (The proof can be found in the Appendix.)
Proposition 4.1 Let S : U → Y be a time-invariant system (as in Definition 4.1) described by {J, J_0, U_0, F_0}. Then
1. S is linear iff F_0 is linear, and is nonlinear iff F_0 is nonlinear.
2. If F_0 is continuous, then Y ⊂ B(J, ℝ), where B(J, ℝ) denotes the set of all bounded real valued functions on J.
3. If F_0 is continuous, and Y has the relative topology derived from the topology of uniform convergence on B(J, ℝ), then S is continuous.

We now make a pragmatic assumption (dictated only by the availability of tools rather than any fundamental property of systems) on the characteristic functional F_0.
Definition 4.2 A time-invariant system S characterized by {J, J_0, U_0, F_0} (as in Definition 4.1) is defined to be characterized by a continuous functional if F_0 : U_0 → ℝ is continuous.
For any system described by Definitions 4.1 and 4.2, it follows from Proposition 4.1 that for all u ∈ U its output y must be in B(J, ℝ). Thus, for all bounded inputs¹ the system emits a (uniformly) bounded output and hence is BIBO stable [41]. In this sense, the theory developed can deal only with stable systems. Moreover, for any system described by Definitions 4.1 and 4.2, it also follows from Proposition 4.1 that the system is continuous. To complete the list of all the advertised qualities of the representable systems in the theory, we only need to show that the theory can deal with linear or nonlinear systems.

¹u is said to be a bounded input if its restrictions u_t for all t ∈ J are such that ‖u_t‖_{U_0} < ∞, where ‖·‖_{U_0} denotes the norm on U_0.
4.B Linear or Nonlinear Systems
First, we define multilayer operators and Volterra operators.

Definition 4.3 Let {J, J_0, U_0} be specified. Then
1. The set MO_ψ(U, Y) of multilayer operators is defined to be the class of systems of the form {J, J_0, U_0, N}, where N ∈ MF_ψ(U_0, ℝ) and ψ is a continuous squashing function.
2. The set VO(U, Y) of Volterra operators is defined to be the class of systems of the form {J, J_0, U_0, V}, where V ∈ VF(U_0, ℝ).

Clearly, multilayer operators MO_ψ(U, Y) and Volterra operators VO(U, Y) meet the requirements set forth in Definitions 4.1 and 4.2, and are thus stable and continuous. Let TC(U, Y) denote the class of all time-invariant systems characterized by continuous functionals. Then, trivially, TC(U, Y) ≅ C(U_0, ℝ). Also, let LTC(U, Y) ⊂ TC(U, Y) denote the class of systems characterized by a linear functional. Then, trivially, LTC(U, Y) ≅ (U_0)*. Thus, every element of TC(U, Y) can be completely characterized by some continuous functional on U_0. But Theorem 3.1 asserts that any continuous functional on U_0 can be uniformly approximated on compact sets by some multilayer functional on U_0. This observation, finally, makes precise the role of multilayer functionals and multilayer operators in system representation. This is now formally stated. (The proof can be found in the Appendix.)

Theorem 4.1 Let {J, J_0, U_0} be specified. Then
1. (Linear Systems) If S ∈ LTC(U, Y), then there exists an L ∈ (U_0)* such that for all t ∈ J and for all u ∈ U, the output y of the system is given by

y(t) = L(u_t).    (27)

2. (Linear/Nonlinear Systems) If S ∈ TC(U, Y), then for every continuous squashing function ψ, for every ε > 0, and for every U such that its restrictions U_t for all t ∈ J are subsets of a compact set K ⊂ U_0, there exists a multilayer operator O ∈ MO_ψ(U, Y) such that

‖S(u) − O(u)‖ < ε,    (28)

where ‖·‖ denotes the uniform norm on Y ⊂ B(J, ℝ).

Theorem 4.1 assures us that, if we assume that U is such that all its restrictions {U_t}_{t∈J} are subsets of a compact set in U_0, then all time-invariant systems characterized by continuous functionals (continuous-time or discrete-time, finite memory or infinite memory, causal or noncausal, and linear or nonlinear) on U can be uniformly approximated by multilayer operators. We hope that these theoretical guarantees will convince the reader that the quest for representing arbitrary systems by multilayer operators is sound. Needless to say, Volterra operators also enjoy the same representation properties as multilayer operators. However, unlike multilayer operators, Volterra operators cannot be expected to escape the curse of dimensionality for some restricted classes of operators, and to our knowledge results in such a general setting have not previously been derived for Volterra operators.
4.C
Concept of Initial State
Let S be a time-invariant system described by {J, J_0, U_0, F_0}. Let J = 𝒯 ∩ [t_0, t_1], where −∞ < t_0 < t_1 ≤ ∞, and let J_0 = 𝒯 ∩ [−c, d], where 0 ≤ c, d ≤ ∞. Then, at any time t ∈ J, the output of the system is given by

y(t) = F_0(u_t),    (29)

where u_t ∈ U_t. But the input function u is known only over J. In particular, the restriction u_{t_0} of u to J_{t_0} is not known; hence the restrictions u_t for t ∈ [t_0, t_0 + c] are not known, and consequently y(t) is not defined for t ∈ [t_0, t_0 + c]. u_{t_0} is called the initial state of the system, and should either be assumed to have some value or should be estimated from the data. If one assumes that u_{t_0} ≡ 0, then the system is termed relaxed.
5
Practical Matters
In this section, we illustrate the abstract theory developed in Section 4 for a class of continuous-time systems and for a class of discrete-time systems. We also propose an alternative definition for multilayer functionals which may be more practical.
5.A
A Class of Continuous-time, Infinite memory, Anticausal Systems
Let us consider a class of continuous-time, infinite memory, noncausal systems. Set 𝒯 = ℝ. Let J = 𝒯, J_0 = 𝒯, and U_0 = L²(J_0) = L²(ℝ). L²(ℝ) is a Hilbert space, and hence its own dual. Therefore, (U_0)* ≅ L²(ℝ).
Linear Systems
Let S ∈ LTC(U, Y). Then, by Theorem 4.1, there exists a linear functional L, described by some function l in (U_0)*, such that at any time t ∈ J the output function y of the system is given by

y(t) = L(u_t) = ∫_{J_0} l(s) u_t(s) dm(s),    (30)

where u ∈ U. Let us define h ∈ L²(ℝ) as h(s) = l(−s), s ∈ ℝ, and rewrite Equation 30 as a convolution

y(t) = ∫_ℝ h(s) u(t − s) dm(s).    (31)
Function h carries the familiar meaning of the impulse response² of a linear, time-invariant system. Equation 30, or equivalently Equation 31, completely describes the action of the system S at any time t ∈ J.

Linear/Nonlinear Systems
Let S ∈ TC(U, Y), let ψ be a continuous squashing function, and let the input space U be such that its restrictions U_t for all t ∈ J are subsets of a compact set K ⊂ U_0. Then, by Theorem 4.1, there exists a multilayer functional N ∈ MF_ψ(U_0, ℝ) such that for any time t ∈ J the output y of the system S can be approximately given by

y(t) ≈ Σ_{i=1}^n β_i ψ( ∫_ℝ h_i(s) u(t − s) dm(s) + b_i ),    (32)

where β_i, b_i ∈ ℝ, u ∈ U, and the L_i are linear functionals on U_0, each described by a function l_i in L²(ℝ), with h_i(s) = l_i(−s), s ∈ ℝ, for i = 1, …, n.

Thus, multilayer operators can also be alternatively thought of as a nonlinear generalization of convolution, or, more generally, as being composed of n linear systems.
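The following rough numerical sketch (ours, not from the chapter) illustrates Equation 32: a bank of n linear convolutions followed by a squashing nonlinearity and a weighted sum. The impulse responses, biases and weights are arbitrary placeholders, and the integral is approximated on a uniform grid.

```python
import numpy as np

def multilayer_operator(u, h_bank, betas, biases, dt, psi=np.tanh):
    """y ~= sum_i beta_i * psi( (h_i * u)(t) + b_i ), sampled on a grid."""
    y = np.zeros_like(u)
    for h_i, beta_i, b_i in zip(h_bank, betas, biases):
        z_i = np.convolve(u, h_i, mode="same") * dt   # z_i(t) = int h_i(s) u(t-s) ds
        y += beta_i * psi(z_i + b_i)
    return y

# Example with two hidden "linear systems".
dt = 0.01
t = np.arange(0.0, 4.0, dt)
u = np.sin(2 * np.pi * t)                            # input signal
h_bank = [np.exp(-t[:200]), np.exp(-2.0 * t[:200])]  # truncated impulse responses
y = multilayer_operator(u, h_bank, betas=[1.0, -0.5], biases=[0.0, 0.2], dt=dt)
print(y[:5])
```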
5.B
A Sampling Criterion
A continuous-time system operates on a continuous-time input to emit a continuous-time output. In the era of digital signal processing [21], however, even if the inputs and the outputs are continuous-time, we may require that the processing be carried out using discrete-time samples. In this subsection, we study the conditions under which such a requirement can be met for systems described by Equation 32. To permit digital implementation, we must first represent the continuous-time input by a discrete input sequence, and the continuous-time system by a discrete-time system. The discrete-time system will then operate on the discrete input sequence to generate a discrete output sequence, which may be useful directly in its discrete form or may be required to be reconverted to a continuous-time signal. With these issues in mind, we study the following questions [16, 20, 33]:
1. How can a continuous-time input signal be represented by a discrete sequence? Under what conditions does the discrete representation preserve all the information carried by the continuous-time input signal?

²In general, for a given fundamental input space U_0 there may exist linear functionals L ∈ (U_0)* such that their action cannot be described by any function l. In other words, there may not exist any function l such that L(u_t) = ∫_{J_0} l(s) u_t(s) dm(s). Clearly, then, there does not exist any function h, and the notion of the impulse response breaks down. The tools developed in Section 4, however, are still valid. For example, if U_0 = C(I, ℝ) then the most general form of the linear functionals on U_0 is given by Equation 9. Clearly, if L is the point evaluation functional such that L(u_t) = u(t), then there exists no function l such that u(t) = L(u_t) = ∫_{J_0} l(s) u_t(s) dm(s).
2. How can a continuous-time system be represented by a discrete-time system?
3. How is the output sequence generated by digital processing related to the continuous-time output signal generated by the continuous-time system?
4. Can the continuous-time output be reconstructed using the output sequence generated by digital processing?

Answers to these questions for linear systems are well known [16, 20, 33]. We now construct similar answers for multilayer operators described by Equation 32, by simply observing that a multilayer operator is composed of n linear systems, and can be discretized by discretizing each of the component linear systems. For W > 0, let BL_W denote the set of functions f in L²(ℝ) such that the Fourier transform³ of f vanishes outside [−2πW, 2πW]. BL_W is said to be the set of bandlimited functions with bandwidth W. Let us rewrite Equation 32 as

y(t) = Σ_{i=1}^n β_i ψ(z_i(t) + b_i),    (33)

where t ∈ ℝ, z_i(t) = ∫_ℝ h_i(s) u(t − s) dm(s), h_i, u ∈ L²(ℝ), and y ∈ B(ℝ, ℝ). If the inputs to the multilayer operator are assumed to be in L²(ℝ) ∩ BL_W, then each input function u can be represented, without any loss of information, by a discrete sequence {u} obtained from u by periodic sampling such that

u_m = u(mT),    (34)

where m ∈ Z and T < 1/(2W) [16, 20]. This answers question 1 above. To answer question 2 above, we assume (as in the case of linear systems [20]) that for i = 1, …, n the functions h_i are bandlimited, that is, are in L²(ℝ) ∩ BL_W. Then each function h_i can be represented, without any loss of information, by a sequence {h_i} such that

h_{i,m} = h_i(mT),    (35)

where m ∈ Z and T < 1/(2W). For i = 1, …, n, z_i is a convolution of two bandlimited functions h_i and u in L²(ℝ) ∩ BL_W, and consequently is also in L²(ℝ) ∩ BL_W. Therefore, z_i can be represented, without any loss of information, by its periodic samples {z_i} such that
z_{i,m} = z_i(mT),    (36)

where m ∈ Z and T < 1/(2W). Now, for all m ∈ Z, define {ẑ_i} such that

ẑ_{i,m} = T Σ_{k∈Z} h_{i,k} u_{m−k}.    (37)
³We define the Fourier transform pair as in Shannon [33], viz., f(t) = (1/2π) ∫_ℝ f̂(x) exp(ixt) dm(x) and f̂(x) = ∫_ℝ f(t) exp(−ixt) dm(t).
Then, by the results of Oppenheim and Johnson [20], we know that

ẑ_{i,m} = z_{i,m} = z_i(mT),    (38)

for all m ∈ Z. Thus, the sequence {ẑ_i} precisely represents the values of the function z_i at the sampling points. Moreover, the function z_i can be reconstructed from its samples {ẑ_{i,m}} by the Shannon interpolation formula as [16, 33]

z_i(t) = Σ_{m∈Z} ẑ_{i,m} · sin(π(t − mT)/T) / (π(t − mT)/T).    (39)
Let us now define the sequence {ŷ} as

ŷ_m = Σ_{i=1}^n β_i ψ(ẑ_{i,m} + b_i),    (40)

where m ∈ Z and the {ẑ_i} are as above. Let {y} denote the sequence obtained by periodically sampling y such that

y_m = y(mT),    (41)

where m ∈ Z. Then, from Equations 33 and 38 it follows that

ŷ_m = y_m = y(mT).    (42)

In other words, the discrete output sequence {ŷ} generated by the discrete-time system operating on the discrete input sequence {u} is precisely the discrete output sequence generated by periodically sampling the output y of the continuous-time system in Equation 32. This settles questions 2 and 3 above. Question 4, however, is answered negatively. Unlike the linear case, the output y of the multilayer operator described by Equation 33 is not a bandlimited function, and hence cannot be reconstructed from its periodic samples {ŷ} by the Shannon interpolation formula.
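As an informal illustration (ours), the sketch below carries out the discretization just described: sample u and h_i with period T, form ẑ by a discrete convolution scaled by T (Equation 37), and pass the result through the squashing nonlinearity (Equation 40). The signals and parameters are placeholders; bandlimitedness is assumed but not checked here.

```python
import numpy as np

def sampled_multilayer_operator(u_samples, h_samples_bank, betas, biases, T,
                                psi=np.tanh):
    """y-hat_m = sum_i beta_i * psi( T * sum_k h_{i,k} u_{m-k} + b_i )."""
    y_hat = np.zeros_like(u_samples)
    for h_i, beta_i, b_i in zip(h_samples_bank, betas, biases):
        z_hat_i = T * np.convolve(u_samples, h_i, mode="same")
        y_hat += beta_i * psi(z_hat_i + b_i)
    return y_hat

T = 0.05                                   # assumed to satisfy T < 1/(2W)
m = np.arange(200)
u_samples = np.cos(2 * np.pi * 0.7 * m * T)
h_bank = [np.sinc(np.arange(-20, 21) * 0.5), np.sinc(np.arange(-20, 21) * 0.25)]
print(sampled_multilayer_operator(u_samples, h_bank, [1.0, 0.3], [0.0, -0.1], T)[:5])
```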
5.C
A Class of Discrete-time, Finite memory, Causal Systems
Let us consider a class of discrete-time, finite memory, causal systems. Set 𝒯 = Z, the set of all integers. Let J = 𝒯 ∩ [t_0, t_1], J_0 = 𝒯 ∩ [−a, 0], and U_0 = ℓ²(J_0). Without any loss of generality, we assume that a is a nonnegative integer. ℓ²(J_0) is a Hilbert space, and hence its own dual. Therefore, (U_0)* ≅ ℓ²(J_0). Let us define −J_0 to mean 𝒯 ∩ [0, a].
Linear Systems
If S ∈ LTC(U, Y), then by Theorem 4.1 there exists a sequence {l} ∈ ℓ²(J_0) such that at any time t ∈ J the output sequence {y} of the system is given by

y_t = Σ_{s=0}^a h_s u_{t−s},    (43)

where {u} ∈ U and {h} ∈ ℓ²(−J_0) is such that h_s = l_{−s}, 0 ≤ s ≤ a.

Linear/Nonlinear Systems
Let S ∈ TC(U, Y), let ψ be a continuous squashing function, and let the input space U be such that its restrictions U_t for all t ∈ J are subsets of a compact set K ⊂ U_0. Then, by Theorem 4.1, there exists a multilayer functional N ∈ MF_ψ(U_0, ℝ) such that for any time t ∈ J the output sequence {y} of the system S can be approximately given by

y_t ≈ Σ_{i=1}^n β_i ψ( Σ_{s=0}^a h_{i,s} u_{t−s} + b_i ),    (45)

where β_i, b_i ∈ ℝ, {u} ∈ U, {l_i} ∈ ℓ²(J_0) and {h_i} ∈ ℓ²(−J_0), with h_{i,s} = l_{i,−s}, 0 ≤ s ≤ a, for i = 1, …, n. Systems described by Equation 45 are the time delay neural networks considered in Back and Tsoi [1], in Hecht-Nielsen [10], in Hudson et al. [13], and in Wan [35]. Furthermore, if we set a = ∞, then the systems described by Equation 45 generalize the infinite memory networks considered in Back and Tsoi [1].
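A minimal sketch (ours) of Equation 45 follows: a time delay neural network with memory length a, in which each hidden unit applies an FIR filter of length a + 1 to the input sequence, adds a bias, and squashes the result. The parameters are random placeholders, not trained values.

```python
import numpy as np

def time_delay_network(u, H, betas, biases, psi=np.tanh):
    """y_t ~= sum_i beta_i * psi( sum_{s=0}^{a} H[i, s] * u_{t-s} + b_i )."""
    n, a_plus_1 = H.shape
    y = np.zeros(len(u))
    for t in range(len(u)):
        # window u_t, u_{t-1}, ..., u_{t-a}, zero-padded before t = 0
        window = np.array([u[t - s] if t - s >= 0 else 0.0 for s in range(a_plus_1)])
        y[t] = betas @ psi(H @ window + biases)
    return y

rng = np.random.default_rng(2)
n, a = 3, 4                                # hidden units and memory length
H = rng.normal(size=(n, a + 1))            # FIR coefficients h_{i,s}
u = rng.normal(size=50)                    # input sequence
print(time_delay_network(u, H, rng.normal(size=n), rng.normal(size=n))[:5])
```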
A Fast Implementation
To present a fast implementation of Equation 45, we simply observe that, for each i,

Σ_{j=0}^a h_{i,j} u_{t−j}    (46)

is a discrete-time convolution and can be efficiently implemented by a Fast Fourier Transform [19] in O(a log a) operations. Clearly, Equation 45 can then be implemented in O(na log a) operations. We propose that such an implementation be called a Multilayer Fast Fourier Transform. In contrast, fast implementations of discrete-time Volterra operators may require as many as O(aⁿ log n) operations [15, 17].
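The following sketch (ours) illustrates the Multilayer Fast Fourier Transform idea: each hidden unit's FIR convolution in Equation 46 is carried out with an FFT-based convolution routine, so the whole of Equation 45 costs roughly O(na log a) per block of samples. The library call and parameter values are our choices for illustration.

```python
import numpy as np
from scipy.signal import fftconvolve

def multilayer_fft(u, H, betas, biases, psi=np.tanh):
    """FFT-accelerated evaluation of y_t ~= sum_i beta_i psi((h_i * u)_t + b_i)."""
    y = np.zeros(len(u))
    for h_i, beta_i, b_i in zip(H, betas, biases):
        z_i = fftconvolve(u, h_i, mode="full")[: len(u)]   # causal FIR filtering
        y += beta_i * psi(z_i + b_i)
    return y

rng = np.random.default_rng(3)
u = rng.normal(size=1024)
H = rng.normal(size=(4, 65))          # four hidden units, memory length a = 64
print(multilayer_fft(u, H, rng.normal(size=4), rng.normal(size=4))[:5])
```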
5.D An Extension to Multilayer Functionals
We define the class of extended multilayer functionals with n units in the hidden layer as follows.

Definition 5.1 Let X denote a normed linear space, and let AF(X, ℝ) denote the set of all affine functionals on X. For any function ψ : ℝ → ℝ, extended multilayer functionals with n units in the hidden layer on X are defined by

EMF_ψ^n(X, ℝ) = { F : X → ℝ | F(u) = β_0 + L(u) + Σ_{i=1}^n β_i ψ(A_i(u)), u ∈ X, β_i ∈ ℝ, L ∈ X*, and A_i ∈ AF(X, ℝ) for i = 1, …, n }.    (47)

Let EMF_ψ(X, ℝ) = ∪_{n=1}^∞ EMF_ψ^n(X, ℝ) denote the class of all extended multilayer functionals. Clearly, EMF_ψ(X, ℝ) are also universal approximators in the sense of Theorem 3.1. However, adding the affine term makes it trivial to approximate the linear part of any unknown functional. Barron [4] has recently established that for multilayer feedforward networks adding the affine term widens the class of functions for which the curse of dimensionality can be bypassed. We have no proof, but we conjecture that extended multilayer functionals also possess similar advantageous properties.
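A tiny sketch (ours) of Definition 5.1 on X = ℝ^r: an ordinary multilayer functional plus a direct affine term β_0 + L(u), which handles the linear part of the target exactly. All weights are illustrative placeholders.

```python
import numpy as np

rng = np.random.default_rng(4)
r, n = 5, 7
W, b = rng.normal(size=(n, r)), rng.normal(size=n)    # hidden affine functionals
beta = rng.normal(size=n)                             # hidden output weights
beta0, w_lin = 0.2, rng.normal(size=r)                # extended (affine) term

def extended_multilayer_functional(u, psi=np.tanh):
    """F(u) = beta_0 + w_lin . u + sum_i beta_i * psi(W_i . u + b_i)."""
    return beta0 + w_lin @ u + beta @ psi(W @ u + b)

print(extended_multilayer_functional(rng.normal(size=r)))
```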
6
Conclusions
In this paper, inspired by the success of multilayer feedforward networks in approximating and estimating functions on ℝ^r, we sought generalizations of multilayer feedforward networks to abstract spaces, and applications of such generalizations to time-invariant system representation. To this end, we introduced multilayer functionals. We established that multilayer functionals on a normed linear space are uniformly dense on compact sets in the set of continuous functionals on the normed linear space. We then employed multilayer functionals to define multilayer operators, and showed that multilayer operators can represent, in the sense made precise by Theorem 4.1, a large class of systems. We demonstrated the abstract theory developed for multilayer operators by providing concrete representations for two practical classes of systems. We hope that these results will provide the necessary impetus for applying multilayer operators to nonlinear signal processing, adaptive system identification, and time-series analysis.
ACKNOWLEDGEMENTS We are grateful to Tim Cacciatore, Joe Costa, David DeMers, Kenneth Kreutz-Delgado, Andrei Vityaev, and Daniel Wulbert for a number of valuable discussions.
Appendix
Algebra of Functionals
A family ℱ ⊂ C(X, ℝ) of functionals on a normed linear space X forms a closed subalgebra of C(X, ℝ) if it is closed under addition, multiplication, and scalar multiplication. These operations are defined as follows:
1. (addition) (F_1 + F_2)u = F_1 u + F_2 u,
2. (multiplication) (F_1 · F_2)u = (F_1 u) · (F_2 u),
3. (scalar multiplication) (α · F)u = α · Fu,
where F, F_1, F_2 ∈ ℱ, u ∈ X, and α ∈ ℝ. The family ℱ is said to separate points if for every u, v ∈ X with u ≠ v there exists F ∈ ℱ such that F(u) ≠ F(v). ℱ is said to contain constants if for each u ∈ X there exists F ∈ ℱ such that F(u) ≠ 0.

Stone–Weierstrass Theorem [7] Let K be a compact Hausdorff space. If ℱ is a closed subalgebra of C(K, ℝ) which separates points and contains constants, then ℱ = C(K, ℝ).

Lemma A.1
For a normed linear space X, MF_cos(X, ℝ) are uniformly dense on compact sets in C(X, ℝ).

Proof
X is a normed linear space, and has the norm topology. Thus, X is a topological vector space, and consequently X is Hausdorff [28, Theorem 1.12]. Let K be an arbitrary compact subset of X. Then K is both compact and Hausdorff, and we can apply the Stone–Weierstrass theorem to K. We need to check that MF_cos(K, ℝ) is a closed subalgebra of C(K, ℝ), and that it separates points and contains constants. Obviously, MF_cos(K, ℝ) is closed under scalar multiplication and addition. To see that it is also closed under multiplication, it suffices to observe that (cos a)·(cos b) = ½[cos(a + b) + cos(a − b)], where a, b ∈ ℝ. Pick b = 0 and L ∈ X* to be identically zero. Define an affine functional A by A(u) = b + L(u) ≡ 0. Then cos(A(u)) ≡ 1 ≠ 0, and MF_cos(K, ℝ) contains constants. Let u, v ∈ K with u ≠ v. Then, by Theorem 5.7 in Folland [7], there exists an F ∈ X* such that F(u − v) = ‖u − v‖_X > 0, where ‖·‖_X denotes the norm on X. Define L(w) = 2F(w)/(‖u‖_X + ‖v‖_X). Then L(u − v) = 2‖u − v‖_X/(‖u‖_X + ‖v‖_X) > 0. Also, by the triangle inequality, L(u − v) ≤ 2. Now, define an affine functional A by A(w) = 0 + L(w) = L(w); then cos ∘ A separates points. We now apply the Stone–Weierstrass Theorem to conclude that MF_cos(K, ℝ) = C(K, ℝ). But K is an arbitrary compact subset of X, and hence the proof.
Remark 1 Lemma A.1 also holds with the cosine replaced by certain other activation functions.

Lemma A.2 Given an affine functional A on X, a compact subset K of X, a squashing function ψ, and an ε₁ > 0, there exists an N ∈ MF_ψ(X, ℝ) such that

‖φ ∘ A − N‖_K < ε₁,

where

φ(λ) = 0 if λ ≤ −π/2;  φ(λ) = (cos(λ + 3π/2) + 1)/2 if −π/2 ≤ λ ≤ π/2;  φ(λ) = 1 if λ ≥ π/2,

is the cosine squasher of Gallant and White [8].

Proof
Without loss of generality, let ε₁ < 1. We want to construct an N(u) = Σ_{j=1}^{Q−1} β_j ψ(A_j(u)) ∈ MF_ψ(X, ℝ) such that ‖φ ∘ A − N‖_K < ε₁. We need to find Q, the β_j, and the A_j for j = 1, …, Q − 1.

Let ε = (2/3)ε₁. Pick Q such that 1/Q < ε/2. For j = 1, …, Q − 1 set β_j = 1/Q. Pick M > 0 such that ψ(u) ≤ ε/(2Q) for u ≤ −M, and ψ(u) ≥ 1 − ε/(2Q) for u ≥ M. Since ψ is a squashing function, such an M can be found. For j = 1, …, Q − 1 set r_j = λ such that φ(λ) = j/Q, and set r_Q = λ such that φ(λ) = 1 − 1/(2Q). Let the affine functional A be of the form A(u) = b_A + L_A(u), where u ∈ X, b_A ∈ ℝ, and L_A ∈ X*. Let us define

(φ ∘ A)_{(c,d]} = { u ∈ X | φ(c) < (φ ∘ A)(u) ≤ φ(d) }  and  (φ ∘ A)_{(c,d)} = { u ∈ X | φ(c) < (φ ∘ A)(u) < φ(d) }.

Now we can partition the space X into Q + 1 disjoint sets such that

X = (φ ∘ A)_{(−∞,r₁]} ∪ (φ ∘ A)_{(r₁,r₂]} ∪ ⋯ ∪ (φ ∘ A)_{(r_{Q−1},r_Q]} ∪ (φ ∘ A)_{(r_Q,+∞)}.

On each of the sets (φ ∘ A)_{(r_j,r_{j+1}]}, for j = 1, …, Q − 1, we will approximate the action of φ ∘ A by ψ ∘ A_{r_j,r_{j+1}}. We now look for such affine functionals A_{r_j,r_{j+1}}. For all u ∈ (φ ∘ A)_{(r_j,r_{j+1}]}, φ(r_j) < (φ ∘ A)(u) ≤ φ(r_{j+1}). But (φ ∘ A)(u) = φ(b_A + L_A(u)), and φ is nondecreasing. This implies that r_j < b_A + L_A(u) ≤ r_{j+1}. We wish to find b_{r_j,r_{j+1}} ∈ ℝ and L_{r_j,r_{j+1}} ∈ X* such that −M < b_{r_j,r_{j+1}} + L_{r_j,r_{j+1}}(u) ≤ M. Some arithmetic reveals suitable choices. Now, define A_{r_j,r_{j+1}}(u) = b_{r_j,r_{j+1}} + L_{r_j,r_{j+1}}(u). Then N(u) = Σ_{j=1}^{Q−1} β_j ψ(A_{r_j,r_{j+1}}(u)) is the desired approximation. After some rather lengthy arithmetic, it can be verified that

‖φ ∘ A − N‖_K < ε₁.
Remark 2 Lemma A.2 is simply a generalization of Lemma A.2 in Hornik, Stinchcombe, and White [11]. They develop the lemma for the simple case when X = ℝ^r. The essence of the lemma is that the cosine squasher φ can be approximated to an arbitrary degree of accuracy by a superposition of a finite number of scaled and affinely shifted copies of any squashing function ψ.
Lemma A.3 Given an affine functional A_i on X, a compact subset K of X, a squashing function ψ, and an ε₂ > 0, there exists an N_i ∈ MF_ψ(X, ℝ) such that

‖cos ∘ A_i − N_i‖_K < ε₂.

Proof
A_i is continuous and K is compact. This implies that there exists an M > 0 such that −2πM ≤ A_i(u) ≤ 2π(M + 1) for all u ∈ K [27, Theorem 4.15]. By a result of Gallant and White [8], on the interval [−2πM, 2π(M + 1)] the cosine function can be represented by a superposition of a finite number of scaled and affinely shifted copies of the cosine squasher φ. (For the definition of φ, see Lemma A.2.) Explicitly,

cos(u) = Σ_{m=−M}^{M} 2[φ(−u + π/2 − 2mπ) + φ(u − 3π/2 + 2mπ)] − 2(2M + 1) + 1.

Thus, we can write

cos(A_i(u)) = Σ_{m=−M}^{M} 2[φ(−A_i(u) + π/2 − 2mπ) + φ(A_i(u) − 3π/2 + 2mπ)] − 2(2M + 1) + 1.

Now, we will use Lemma A.2 2(2M + 1) times with ε₁ = ε₂/(4(2M + 1)) to approximate each φ term in the above representation of cos(A_i(u)) by an element of MF_ψ(X, ℝ). For m = −M, …, M, let N_{i,m,1}(u), N_{i,m,2}(u) ∈ MF_ψ(X, ℝ) denote approximations to φ(−A_i(u) + π/2 − 2mπ) and φ(A_i(u) − 3π/2 + 2mπ), respectively. The approximations are obtained by applying Lemma A.2 such that

‖φ(−A_i(·) + π/2 − 2mπ) − N_{i,m,1}‖_K < ε₁  and  ‖φ(A_i(·) − 3π/2 + 2mπ) − N_{i,m,2}‖_K < ε₁.

Also, for some α such that ψ(α) = 1/2, define N_{i,M+1}(u) = 2(1 − 2(2M + 1))ψ(α). Now define N_i(u) = Σ_{m=−M}^{M} 2[N_{i,m,1}(u) + N_{i,m,2}(u)] + N_{i,M+1}(u). Then we have the required result ‖cos ∘ A_i − N_i‖_K < ε₂.
Proof of Theorem 3.1
Given an arbitrary compact K ⊂ X, an F ∈ C(X, ℝ), and ε/2 > 0, Lemma A.1 tells us that there exists an N(u) ∈ MF_cos(X, ℝ) such that ‖F − N‖_K < ε/2. We now need a multilayer functional G(u) ∈ MF_ψ(X, ℝ) such that ‖N − G‖_K < ε/2. Then by the triangle inequality we can conclude that ‖F − G‖_K < ε. Let N(u) = Σ_{j=1}^q β_j cos(A_j(u)). Let β = sup_j |β_j|. Apply Lemma A.3 to each term cos(A_j(u)) with ε₂ = ε/(2qβ) to obtain an N_j(u) such that ‖cos ∘ A_j − N_j‖_K < ε₂. Define G(u) = Σ_{j=1}^q β_j N_j(u). Then we have the required result ‖N − G‖_K < ε/2.
Proof of Proposition 4.1
1. Obvious.
2. For any u ∈ U we have u_t ∈ U_t for all t ∈ J. But U_t is assumed to be a normed linear space. Therefore ‖u_t‖ < ∞ for all t ∈ J. Then, by continuity of F_0, we have F_0(u_t) = y(t) < ∞ for all t ∈ J.
3. Under the topology defined on U by Equation 26, for u, v ∈ U we say u → v if u_t → v_t for all t ∈ J. Let uⁿ ∈ U → u ∈ U; then continuity of F_0 implies that F_0(uⁿ_t) → F_0(u_t) for all t ∈ J. Thus S(uⁿ) → S(u) pointwise for every t ∈ J. But Y ⊂ B(J, ℝ) is assumed to have the relative topology derived from the topology of uniform convergence on B(J, ℝ). Thus, when uⁿ → u in the topology on U, S(uⁿ) → S(u) in the topology on Y. Thus, by definition, S is continuous.
Proof of Theorem 4.1
1. Obvious.
2. Since S ∈ TC(U, Y), by definition there exists an F_0 ∈ C(U_0, ℝ) characterizing S. Moreover, from Theorem 3.1, for any compact subset K of U_0, every ε > 0, and every continuous squashing function ψ, there exists an N ∈ MF_ψ(U_0, ℝ) such that ‖F_0 − N‖_K < ε. But the set of inputs U is assumed to be such that its restrictions U_t for all t ∈ J are subsets of K. Therefore, for every t ∈ J and for every u ∈ U we have |F_0(u_t) − N(u_t)| < ε, and consequently sup_{t∈J} |F_0(u_t) − N(u_t)| < ε. But sup_{t∈J} |F_0(u_t) − N(u_t)| = ‖S(u) − O(u)‖, where O denotes the multilayer operator described by {J, J_0, U_0, N} and ‖·‖ denotes the uniform norm on Y ⊂ B(J, ℝ). Thus the result.
References
[1] A. D. Back and A. C. Tsoi, "FIR and IIR synapses, a new neural network architecture for time series modelling," Neural Comput., vol. 3, no. 3, pp. 352-362, 1991.
[2] S. P. Banks, Mathematical Theories of Nonlinear Systems, New York: Prentice-Hall, 1988.
[3] J. F. Barrett, "The use of functionals in the analysis of nonlinear physical systems," Journal of Electronics and Control, vol. 15, pp. 567-615, 1963.
[4] A. R. Barron, "Universal approximation bounds for superpositions of a sigmoidal function," University of Illinois at Urbana-Champaign, Department of Statistics, tech. rep. 58, 1991.
[5] A. R. Barron, "Approximation and estimation bounds for artificial neural networks," University of Illinois at Urbana-Champaign, Department of Statistics, tech. rep. 59, 1991.
[6] G. Cybenko, "Approximation by superpositions of a sigmoidal function," Mathematics of Control, Signals, and Systems, vol. 2, pp. 303-314, 1989.
[7] G. B. Folland, Real Analysis, New York: John Wiley & Sons, 1984.
[8] A. R. Gallant and H. White, "There exists a neural network that does not make avoidable mistakes," in IEEE Second International Conference on Neural Networks, San Diego, CA, New York: IEEE Press, vol. 1, pp. 657-664, 1988.
[9] P. G. Gallman and K. S. Narendra, "Representations of nonlinear systems via the Stone-Weierstrass theorem," Automatica, vol. 12, pp. 619-622, 1976.
[10] R. Hecht-Nielsen, Neurocomputing, Reading, MA: Addison-Wesley, 1991.
[11] K. Hornik, M. Stinchcombe, and H. White, "Multilayer feedforward networks are universal approximators," Neural Networks, vol. 2, pp. 359-366, 1989.
[12] K. Hornik, "Approximation capabilities of multilayer feedforward networks," Neural Networks, vol. 4, pp. 251-257, 1991.
[13] J. L. Hudson, M. Kube, R. A. Adomaitis, I. G. Kevrekidis, A. S. Lapedes, and R. M. Farber, "Nonlinear signal processing and system identification: applications to time series from electrochemical reactions," Chemical Engineering Science, vol. 45, no. 8, pp. 2075-2081, 1990.
[14] L. K. Jones, "A simple lemma on greedy approximation in Hilbert space and convergence rates for projection pursuit regression and neural network training," Ann. Statist., vol. 20, no. 1, pp. 608-613, 1992.
[15] T. Koh and E. J. Powers, "Second-order Volterra filtering and its application to nonlinear system identification," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-33, no. 6, pp. 1445-1455, Dec. 1985.
[16] R. J. Marks II, Introduction to Shannon Sampling and Interpolation Theory, New York: Springer-Verlag, 1991.
[17] M. Morháč, "A fast algorithm of nonlinear Volterra filtering," IEEE Transactions on Signal Processing, vol. 39, no. 10, pp. 2353-2356, Oct. 1991.
[18] K. S. Narendra and K. Parthasarathy, "Identification and control of dynamical systems," IEEE Transactions on Neural Networks, vol. 1, no. 1, pp. 4-27, Mar. 1990.
[19] H. J. Nussbaumer, Fast Fourier Transforms and Convolution Algorithms, Berlin: Springer-Verlag, 1981.
[20] A. V. Oppenheim and D. H. Johnson, "Discrete representation of signals," Proceedings of the IEEE, vol. 60, no. 6, pp. 681-691, June 1972.
[21] A. V. Oppenheim and R. W. Schafer, Digital Signal Processing, Englewood Cliffs, NJ: Prentice-Hall, 1975.
[22] G. Palm and T. Poggio, "The Volterra representation and the Wiener expansion: validity and pitfalls," SIAM Journal of Applied Mathematics, vol. 33, no. 2, pp. 195-216, Sep. 1977.
[23] F. J. Pineda, "Recurrent backpropagation and the dynamical approach to adaptive neural computation," Neural Comput., vol. 1, pp. 161-172, 1989.
[24] W. A. Porter, "An overview of polynomic system theory," Proceedings of the IEEE, vol. 64, no. 1, pp. 18-23, Jan. 1976.
[25] P. M. Prenter, "A Weierstrass theorem for real, separable Hilbert spaces," Journal of Approximation Theory, vol. 3, pp. 341-351, 1970.
[26] W. L. Root, "On the modeling of systems for identification. Part I: ε-representations of classes of systems," SIAM Journal of Control, vol. 13, no. 4, pp. 927-944, 1975.
[27] W. Rudin, Principles of Mathematical Analysis, New York: McGraw Hill, 1964.
[28] W. Rudin, Functional Analysis, New York: McGraw Hill, 1991.
[29] W. J. Rugh, Nonlinear System Theory: The Volterra/Wiener Approach, Baltimore: The Johns Hopkins University Press, 1981.
[30] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representations by error propagation," in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, D. E. Rumelhart and J. L. McClelland, Eds., vol. 1, pp. 318-362, Cambridge, MA: MIT Press, 1986.
[31] I. W. Sandberg, "Approximations for nonlinear functionals," IEEE Transactions on Circuits and Systems-I: Fundamental Theory and Applications, vol. 39, no. 1, pp. 65-67, Jan. 1992.
[32] M. Schetzen, The Volterra and Wiener Theories of Nonlinear Systems, New York: Wiley, 1980.
[33] C. E. Shannon, "Communication in the presence of noise," Proceedings of the Institute of Radio Engineers, vol. 37, no. 1, pp. 10-21, 1949.
[34] V. Volterra, Theory of Functionals and of Integral and Integro-Differential Equations, New York: Dover Publications, 1959.
[35] E. A. Wan, "Temporal backpropagation for FIR neural networks," Proc. IEEE Int. Joint Conf. Neural Networks, vol. 1, pp. 575-580, 1990.
[36] H. White, "Parametric statistical estimation with artificial neural networks," in Mathematical Perspectives on Neural Networks, P. Smolensky, M. C. Mozer, and D. E. Rumelhart, Eds., Hillsdale, NJ: L. Erlbaum Associates, 1992.
[37] H. White, Artificial Neural Networks: Approximation and Learning Theory, Cambridge, MA: Blackwell Publishers, 1992.
[38] B. Widrow and S. D. Stearns, Adaptive Signal Processing, Englewood Cliffs, NJ: Prentice-Hall, 1985.
[39] N. Wiener, Selected Papers of Norbert Wiener, Cambridge, MA: MIT Press, 1964.
[40] R. J. Williams and D. Zipser, "A learning algorithm for continually running fully recurrent neural networks," Neural Comput., vol. 1, pp. 270-280, 1989.
[41] J. C. Willems, The Analysis of Feedback Systems, Cambridge, MA: MIT Press, 1971.
Neural networks: the spin glass approach
David Sherrington
Department of Physics, University of Oxford, Theoretical Physics, 1 Keble Road, Oxford, OX1 3NP

Abstract
A brief overview is given of the conceptual basis for and the mathematical formulation of the fruitful transfer of techniques developed for the theory of spin glasses to the analysis of the performance, potential and training of neural networks.

1. INTRODUCTION
Spin glasses are disordered magnetic systems. Their relevance to neural networks lies not in any physical similarity, but rather in conceptual analogy and in the transfer of mathematical techniques developed for their analysis to the quantitative study of several aspects of neural networks. This chapter is concerned with the basis and application of this transfer. A brief introduction to spin glasses in their conventional manifestation is appropriate to set the scene - for a fuller consideration the reader is referred to more specialist reviews (Mézard et al. 1987, Fischer and Hertz 1991, Binder and Young 1986, Sherrington 1990, 1992). At a microscopic level spin glasses consist of many elementary atomic magnets (spins), fixed in location but free to orient, interacting strongly but randomly with one another through pairwise forces. Individually these forces try to orient their spin pairs either parallel or antiparallel, but collectively they lead to conflicts, or frustration, with regard to the global orientations. The consequence is a system with many non-equivalent metastable global states and consequentially many interesting physical properties. Most of the latter will not concern us here, but the many-state structure has relevance for analogues in neural memory and the mathematical techniques devised to analyze spin glasses have direct applicability. Neural networks also involve the cooperation of many relatively simple units, the neurons, under the influence of conflicting interactions, and they possess many different global asymptotic behaviours in their dynamics. In this case the conflicts arise from a mixture of excitatory and inhibitory synapses, respectively increasing and decreasing the tendency of a post-synaptic neuron to fire if the pre-synaptic neuron fires. The recognition of a conceptual relationship between spin glasses and recurrent neural networks, together with a mathematical mapping between idealizations of each (Hopfield 1982), provided the first hint of what has turned out to be a fruitful transplantation. In fact, there are now two main respects in which spin glass analysis has been of value in considering neural networks for storing and interpreting static data. The first concerns the macroscopic asymptotic behaviour of a neural network of given architecture
and synaptic efficacies. The second concerns the choice of efficacies in order to optimize various performance measures. Both will be discussed in this chapter. We shall discuss networks suggested as idealizations of neurobiological structures and also those devised for applied decision making. We shall not, however, dwell on the extent to which these idealizations are faithful, or otherwise, to nature. Although neural networks can also be employed to store and analyse dynamical information, and techniques of non-equilibrium statistical mechanics are being applied to their analysis, we shall restrict discussion in this chapter to static information, albeit stored in dynamic networks. An accompanying chapter (Coolen and Sherrington 1992) gives a brief introduction to dynamics. 2. TYPES OF NEURAL NETWORK
There are two principal types of neural network architecture which have been the subject of active study. The first is that of layered feedforward networks in which many input neurons drive various numbers of hidden units eventually to one or few output neurons, with signals progressing only forward from layer to layer, never backwards or sideways within a layer. This is the preferred architecture of many artificial neural networks for application as expert systems, with the interest lying in training and operating the networks for the deduction of appropriate few-state conclusions from the simultaneous input of many, possibly corrupted, pieces of data. The second type is of recurrent networks where there is no simple feedforward-only or even layered operation, but rather the neurons drive one another collectively and repetitively without particular directionality. In these networks the interest is in the global behaviour of all the neurons and the associative retrieval of memorized states from initialisations in noisy representations thereof. These networks are often referred to as attractor neural networks’. They are idealizations of parts of the brain, such as cerebral cortex. Both of the above can be considered as made up from simple ‘units’ in which a single neuron receives input from several other neurons which collectively determine its output. That output may then, depending upon the architecture considered, provide part of the inputs to other neurons in other units. Many specific forms of idealized neuron are possible, but here we shall concentrate on those in which the neuron state (activity) can be characterized by a single real scalar. Similarly, many types of rule can be envisaged relating the output state of a neuron to those of the neurons which input directly to it. We shall concentrate, however, on those in which the efferent (post-synaptic) behaviour is determined from the states of the afferent (pre-synaptic) neurons via an ‘effective field’ hi =
h_i = Σ_{j≠i} J_{ij} σ_j − W_i,    (1)
¹They are often abbreviated as ANN, but we shall avoid this notation since it is also common for artificial neural networks.
where σ_j measures the firing state of neuron j, J_{ij} is the synaptic weight from j to i, and W_i is a threshold. For example, a deterministic perceptron obeys the output-input relation

σ_i → σ_i′ = f(h_i),    (2)

where σ_i′ is the output state of the neuron. More generally one has a stochastic rule, where f(h_i) is modified in some random fashion at each step. Specializing further to binary-state (McCulloch-Pitts) neurons, taken to have σ_i = ±1 denoting firing/non-firing, the standard deterministic perceptron rule is

σ_i′ = sgn(h_i).    (3)
Typical stochastic extensions modify (3) to a random update rule, such as the Glauber rule,

σ_i → σ_i′ with probability ½[1 + tanh(βh_i σ_i′)],    (4)

or the Gaussian rule

σ_i → σ_i′ = sgn(h_i + Tz),    (5)
where z is a Gaussian-distributed random variable of unit variance and T = β⁻¹ is a measure of the degree of stochasticity, with T = 0 (β = ∞) corresponding to determinism. In a network of such units, updates can be effectuated either synchronously (in parallel) or randomly asynchronously. More generally, a system of binary neurons satisfies local rules of the form

σ_i → σ_i′ = R_i F_i(R_{i_1} σ_{i_1}, …, R_{i_c} σ_{i_c}),    (6)

where the σ_{i_1}, …, σ_{i_c} = ±1 are the states of the neurons feeding neuron i, the R_j and R_i are independent tunable stochastic operators randomly changing the signs of their operands, and F_i is a Boolean function of its arguments (Aleksander 1988, Wong and Sherrington 1988, 1989). The linearly-separable synaptic form of (2)-(5) is just a small subset of possible Boolean forms.
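As an informal illustration (ours, not part of the chapter), the following Python sketch implements the two stochastic single-neuron update rules of Equations 4 and 5 for a binary neuron driven by an effective field; the network size, couplings and temperature are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def effective_field(J_row, sigma, W_i):
    """h_i = sum_{j != i} J_ij sigma_j - W_i (the diagonal of J is assumed zero)."""
    return J_row @ sigma - W_i

def glauber_update(h_i, T):
    """Equation 4: return +1 with probability [1 + tanh(h_i / T)] / 2, else -1."""
    p_plus = 0.5 * (1.0 + np.tanh(h_i / T))
    return 1 if rng.random() < p_plus else -1

def gaussian_update(h_i, T):
    """Equation 5: sigma_i' = sgn(h_i + T z), z a unit-variance Gaussian."""
    return int(np.sign(h_i + T * rng.normal()))

# Example: one asynchronous update of neuron i = 0 in a small random network.
N = 10
J = rng.normal(size=(N, N)) / np.sqrt(N)
np.fill_diagonal(J, 0.0)
sigma = rng.choice([-1, 1], size=N)
h0 = effective_field(J[0], sigma, W_i=0.0)
print(glauber_update(h0, T=0.5), gaussian_update(h0, T=0.5))
```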
3. ANALOGY BETWEEN MAGNETISM AND NEURAL NETWORKS

In order to prepare for later transfer of mathematical techniques from the theory of spin glasses to the analysis of neural networks, in this section we give a brief outline of the relevant physical and conceptual aspects of disordered magnets which provide the stimulus for that transfer.

3.1 Magnets
A common simple model magnet idealizes the atomic magnetic moments to have only two states, spin up and spin down, indicated by a binary (Ising) variable σ_i = ±1, where i labels the location and σ the state of the spin. A global microstate is a set {σ_i}; i = 1, …, N, where N is the number of spins. The energy of such a state is typically given by

E({σ_i}) = −½ Σ′_{ij} J_{ij} σ_i σ_j − Σ_i b_i σ_i,    (7)
where the J_{ij} (known as exchange interactions) correspond to contributions from pairwise forces and the b_i to local magnetic fields. The prime indicates exclusion of i = j. The set of 2^N microstates is referred to as 'phase space'. The standard dynamics of such a system in a thermal environment at temperature T is a random sequential updating of the spin states according to rule (4) with W_i → −b_i. Thus, with this identification, there is a mathematical mapping between the statistical thermodynamics of the spin system and the dynamics of a corresponding recurrent neural network. The converse is not necessarily true since the above spin model has J_{ij} = J_{ji}, whereas no such restriction need apply to a general neural network. However, for developmental purposes, we shall assume this symmetry initially, lifting the restriction later in our discussion. Magnetic systems of this kind have been much studied. Let us first concentrate on their asymptotic behaviour. This leads to a thermodynamic state in which the system randomly passes through the microstates with a Gibbs probabilistic distribution

P({σ_i}) ∝ exp(−E({σ_i})/T),    (8)

and the system can be viewed as equivalent to an ensemble of systems with this distribution². At high temperatures T, effectively all the microstates of any energy are equally likely to be accessed in a finite time and there are no serious barriers to a change of microstate. At low enough temperatures, however, there is a spontaneous breaking of the phase space symmetry on finite timescales and only a sub-set of microstates is effectively available in a physical measurement on a very large (N → ∞) system. The onset of such a spontaneous separation of phase space is known as a 'phase transition'. A common example is the onset of ferromagnetism in a system with positive exchange interactions {J_{ij}} and b = 0. Beneath a critical temperature T_c, despite the probabilistic symmetry between a microstate and its mirror image with every spin reversed, as given by (8), there is an effective barrier between the sub-sets of microstates with overall spin up and those with overall spin down which cannot be breached in a physical time, and hence the thermal dynamics is effectively confined to one or other of these subsets. The origin of this effect lies in the fact that for T < T_c the most probable microstates have non-zero values of the average magnetization m = N⁻¹ Σ_i σ_i, strongly peaked around two values ±m(T), while the probability of a state of different |m| is exponentially smaller. To go from m = m(T) to m = −m(T) would require the system to pass through intermediate states, such as m = 0, of probability which is vanishingly small as N → ∞. For T = 0 the dynamics (3) leads to a minimum of E(σ), which would have m = ±1 for the ferromagnet, with no means of further change.
²For a further, more complete, discussion of equivalences between temporal and ensemble averages see the subject of 'ergodicity' in texts on statistical mechanics.
A useful picture within which to envisage the effect of spontaneous symmetry-breaking is of an effective energy landscape which incorporates the randomizing tendencies of temperature as well as the ordering tendencies of the energy terms of (7). This is known as a free energy landscape and it evolves with temperature. At high temperature it is such that all energetically equivalent states are equally accessible, but at low temperature it splits into disconnected regions separated by insurmountable ridges. If the system under consideration is the ferromagnet with only positive and zero Jij and without magnetic fields, b = 0, the phase space is thus split into two inversion symmetry related parts. If, however, the Jij are of random sign, but frozen (or quenched), then the resultant low temperature state can have many non-equivalent disconnected regions, or basins, in its free-energy structure; this is the case for spin glasses. Thus, if one starts the dynamical system in a microstate contained within one of the disconnected sub-sets, then in a physical time its evolution will be restricted to that subspace. The system will iterate towards a distribution as given by (8) but restricted to microstates within the sub-space.
3.2 Neural networks
Thus one arrives at a potential scenario for a recurrent neural network capable of associatively retrieving any of several patterns {ξ^μ}; μ = 1, …, p. This is to choose a system in which the J_{ij} are such that, beneath an appropriate temperature (stochasticity) T, there are p disconnected basins, each having a macroscopic overlap³ with just one of the patterns, and such that if the system is started in a microstate which is a noisy version of a pattern it will iterate towards a distribution with a macroscopic overlap with that pattern and perhaps, for T → 0, to the pattern itself. To store many non-equivalent patterns clearly requires many non-equivalent basins and therefore requires competition among the synaptic weights/exchange interactions {J_{ij}}⁴. The mathematical machinery devised to study ordering in random magnets is thus a natural choice to consider for adaptation for the analysis of retrieval in the corresponding neural networks. An introduction to this adaptation is the subject of the next section. However, before passing to that analysis a further analogy and stimulus for mathematical transfer will be mentioned. This second area for transfer concerns the choice of {J_{ij}} to achieve a desired network performance. Provided that performance can be quantified, the problem of choosing the optimal {J_{ij}} is equivalent to one of minimizing some effective energy function in the space of all {J_{ij}}. The performance requirements, such as which patterns are to be stored and with what quality, impose 'costs' on the J_{ij} combinations, much as the exchange interactions do on the spins in (7), and there are normally conflicts in matching local (few-J_{ij}) with global (all-J_{ij}) optimization. Thus, the global optimization problem is conceptually isomorphic with that of finding the ground state of a spin glass, and again a conceptual and mathematical transfer has proved valuable.

³A precise definition of overlap is given later in eqn (9). With the normalization used there an overlap is macroscopic if it is of order 1.
⁴Note that this concept applies even if there is no Lyapunov or energy function. The expression 'basin' refers to a restricted microscopic phase space of the {σ}, even in a purely dynamical context.
266
4. STATISTICAL PHYSICS OF RETRIEVAL In this section we consider the use of techniques of statistical physics, particularly as developed for the study of spin glasses, for the analysis of the retrieval properties of simple recurrent neural networks. Let us consider such a network of N binary-state neurons, characterized by state variables n; = f l , i = 1,...N, interacting via stochastic synaptic operations (as discussed in section 2) and storing, or attempting to store, p patterns {(f} = {fl};p = 1,...p. Interest is in the global state of the network. Its closeness to a pattern can be measured in terms of the corresponding (normalized) overlap
I
or in terms of the (complementary) fractional Hamming distance
which measures the average number of differing bits. To act as a retrieving memory the phase space of the system must separate so as to include effectively non-communicating sub-spaces, each with macroscopic O ( N o )overlap with a single pattern. 4.1 The Hopfield model
A particularly interesting example for analysis was proposed by Hopfield. It employs symmetric synapses Jij = Jj; and randomly asynchronously updating dynamics, leading to the asymptotic activity distribution (over all microstates)
where E ( a ) has the form of eqn. (7). This permits the applications of the machinery of equilibrium statistical mechanics to study retrieval behaviour. In particular, one studies the resultant thermodynamic phase structure with particular concern for the behaviour of the m w . 4.2 Statistical Mechanics
Basic statistical mechanics for the investigation of equilibrium thermodynamics proceeds by introducing the partition function
Several other quantities of thermodynamic interest, such as the average thermal energy and the entropy, follow immediately; for example
( E ) = 2-l
{W
a aP
E(u)exp ( - P E ( u ) ) = --enZ.
267
Others can be obtained by the inclusion of small generating fields; for example, for any observable O(u),
In particular, the average overlap with pattern p follows from
where
Spontaneous symmetry breaking is usually monitored implicitly, often signalled by divergencies of appropriate response functions or fluctuation correlations in the highsymmetry phase. However, it can be made explicit by the inclusion of infinitesimal symmetry-breaking fields; for example k! = &” will pick out the v t h sub-space if the phase space is disconnected, even for h --t Of, but will be inconsequential for k --t O+ if phase space is connected. 4.3 H e b b i a n Synapses
Hopfield proposed the simple synaptic form .7;j
= N-’
cg y (
1 - bij),
r
inspired by the observations of Hebb; we shall refer to this choice as Hebbian. Let us turn to the analysis and implications of this choice, with all the {Wi} taken t o be zero and for random uncorrelated patterns {tP}. For a system storing just a single pattern, the problem transforms immediately, under u; --t u;&, to a pure ferromagnetic Ising model with J;j = N - ’ . The solution is well known and m satisfies the self-consistency equation m = tanh
(pm),
(19)
with the physical solution rn = 0 for T > 1 (P < 1) and a symmetry-breaking phase transition to two separated solutions f l m ( , with m # 0, for T < 1. For general p one may express exp(-PE(u)) for the Hopfield-Hebb model in a separable form,
268 =
/cfi
d f i p ( P N / 2 ~ ) ) )e ~ p [ ~ ( - N P ( f i ” ) ~-/pfhpc 2 u;
p=l
p=l
L
where we have employed the useful identity for exponentials of complete squares exp
( ~ a 2= )
(27r-f
j dz exp (-z2/2 + ( 2 ~ ) t a r ) ,
(22)
u;
2=
4 (
dfip(PN/27r))) exp (-NPf({fh”})) D
where
f({mP})
(23) D
+ i”)
= c ( f i p ) ’ / 2 - (NP)-l c t n [ 2 cosh ,f3 c ( m P
(24)
4.4 Intensive numbers of patterns
If p is finite (does not scale with N ) , f({+‘}) is intensive in the thermodynamic limit + m) and the integral is e x t r e m d y dominated. The minima of f give the selfconsistent solutions which correspond t o the stable dynamics, although only the lowest minima are relevant t o a thermodynamic calculation. The self-consistency equations are
(N
Furthermore, it is straightforward to show from (16) and (23) that m p = f i p at these extremal solutions. = 0 yields the result: The subsequent analysis of the equations (25) in the limit {k} 1. T=O (i) All embedded patterns { ( p } are solutions; ie. there are p solutions (all of equal probability) mr = 1;one p = 0;rest (ii) There are also mixed solutions in which more than one m p is non-zero in the limit
N
4 00.
Solutions of type (i) correspond t o memory retrieval, while hybrids of type (ii) are normally not wanted. 2. 0.46 > T > 0 There are solutions (i) m p = m # 0; one p = 0;rest (ii) mixed states. In the case of type (i) solutions, we may again speak of retrieval, but now it is imperfect.
269
3. 1 > T > 0.46 Only type (i) solutions remain, each equally stable and with extensive barriers between them. 4.
T>1
Only the paramagnetic solution (all m” = 0) remains. Thus we see that retrieval noise can serve a useful purpose in eliminating or reducing spurious hybrid solutions in favour of unique retrieval. 4.5 Extensive numbers of patterns
The analysis of the last section shows no dependence of the critical temperatures on p. This is correct for p independent of N (and N + co). However, even simple signal-to-noise arguments demonstrate that interference between patterns will destroy retrieval, even at T = 0, for p large enough and scaling appropriately with N. Geometrical, informationtheoretical and statistical-mechanical arguments (to be discussed later) in fact show that the maximum pattern storage allowing retrieval scales as p = aN, where a is an N independent storage capacity. Thus we need to be able to analyse retrieval for p of order N, which requires a different method than that used in (23) - (25). One is available from the theory of spin glasses. This is the so called replica theory (Edwards and Anderson 1975, Sherrington and Kirkpatrick 1975, Kirkpatrick and Sherrington 1978, MCzard et al. 1987). As noted earlier, physical quantities of interest are obtained from ln2. This will depend on the specific set of { J ; j } , which will itself depend on the patterns {t”} to be stored. Statistically, however, one is interested not in a particular set of { J i j } or {t:} but in relevant averages over generic sets, for example over all sets of p patterns drawn randomly from the 2N possible pattern choices. Furthermore, the pattern averages of most interest are self-averaging5,strongly peaked around their most probable values. Thus, we may ignore fluctuations of en2 over nominally equivalent sets of pattern choices and hence ~ , ( ){(I means an average over the specific pattern choices. consider ( l n Z ) { ~where Although in principle one might envisage the calculation first of ln2 for a particular pattern choice and then its average, in practice this would be essentially impossible for large p without some other approximation‘. Rather, one would like to average formally over the patterns {<”} first and then perform the {a}summation. Unfortunately, however, averaging the logarithm of a sum of exponentials is not easy, whereas averaging a sum of exponentials themselves is relatively straightforward. The replica ‘trick’ provides a procedure to transform the difficult average to the easier one, albeit at some cost in complexity and subtlety. 51u a system with quenched disorder a quantity is said to be self-averaging if it overwhelmingly takes the same value (or form) for all specific choices of the disorder. In fact, non self-averaging of certain (usually somewhat subtle) quantities is one of the conceptually important discoveries of the statistical physics of systems with quenched disorder; see e.g. MCeard et. al. 1987. eBut see, for example, the cavity-field approach of MCzard et. al. 1987, or its precursor due to Thouless et. al. 1977.
270
The replica procedure is based on the identity 1 en2 = lim -(Z" - l), n-0 n
together with the identification of 2" as the partition function of an n-replicated system, n
where the a";a = 1,...n are (dummy) replica spins/neuron states7. Given an algorithm for J;j in terms of the {Q}, an average over 2" can be performed to yield
now independent of the where F({a"}) is some effective 'energy' function of the {a"}, but involving interaction/cross-terms between a" with different particular patterns values of the replica label a;i.e. with interaction between replicas. For a system in which there is no short-range structure (for example, is such that the probability that any two i, j have a particular d u e of synaptic connection is independent of the choice of i , j ) F scales as N and, in principle, this problem is exactly soluble, effectively through an exact mean field theory. In practice, however, the solution may be highly non-trivial. If there is short-range structure one may need to resort to approximation or simulation.
{t}
4.6 Replica analysis of a spin glass
Before discussing the neural network problem, let us, for orientational purposes, digress to a discussion of the replica procedure for a spin glass problem where 2 is given by (12) with E(a)as given by (7), with the b; = 0, and the J;j independent random quenched parameters drawn from a Gaussian distribution
(Sherrington and Kirkpatrick 1975). Averaging over the { J ; j } one obtains
7We follow convention and use a and 0 as replica labels. They occur only as superscript labels and summations, which it is hoped will be sufficient to distinguish them trom a , P used for the storage ratio and inverse temperature respectively (again conventional symbols); the latter never appear as subscripts nor as elements to be summed over.
27 1
where (aP) refers t o pairs of diflerent indices and we have deliberately indicated the occurrence of complete squares so that eqn. ( 2 2 ) can be further employed t o write
where g has the N-independent form
g ( { m a } , {q"'}) = ( m a ) ' / 2
+ (qap)'/2 - Ten
exp { P J o ~ u a m a
toP)
0
+(PJ)"n
+ 2 c aaUPqap)}
(33)
(4
where the {u"} summation is over n replicas, each with u" = f l , but the ua carry no site labels. Because g is N-independent, the integral in ( 3 2 ) is overwhelmingly dominated by the values of { m a } ,{ q a P } for which g is minimized. At these extremal values ma,qap are identifiable as ma = (uq)F qap
(34)
= (uq<)F
where the (
)F
(35) refers to a thermodynamic average against the effective energy F({u"}),
However, it is still difficult to evaluate ( 3 2 ) for arbitrary n, needed t o perform analytic continuation to n + 0, without some further ansatz. Since the a,p replicas are artificial, one such as ansatz suggests itself - the replica-symmetric (RS) ansatz
m a = m ; alla
(37)
which effectively treats all replicas as equivalent. With this ansatz g is readily evaluated, in ( 3 3 ) being transformed into a single summation through the double summation C(a.p) another application of eqn ( 2 2 ) t o exp((PJ)'q(E, aa)'). In the limit n + 0, g ( m , q ) then has its extrema at the solutions of the self-consistent equations m=
Q
=
1
dhP(h)tanh p h
(39)
dhP(h)(tanhph)'
(40)
212
where
P ( h ) = (2nJ2q) exp ( - ( h - J0m)’/2J2q).
(41)
These solutions can be interpreted as
where ( u ; )is~ the thermally averaged magnetization of the original system at site i with a particular set of exchange interactions { J ; j } ,
As T, J , / J are varied, one fmds three sorts of solutions (Sherrington and Kirkpatrick 1975); (i) paramagnetic; m = 0 , q = 0, for T > max ( 1 ,J , / J ) (ii) ferromagnetic, m # 0 , q # 0, for T < J o / J and J o / J greater than the T-dependent value of 0 ( 1 ) , and (iii) spin glass, m = 0 , q # 0 for T < 1 and J,/J less than a T-dependent value of O(1). Within the spin glass problem the greatest interest is the third of these, interpreted as frozen order without periodicity, but for neural networks the interest is in an analogue of the second, ferromagnetism. 4.7. Replica analysis of the Hopfield model
Let us now turn to the neural network problem. In place of the order parameter m one now has all the overlap parameters m”. However, since we are principally interested in retrieving symmetry-breaking solutions, we can concentrate on extrema with only one, or a few, rn” macroscopic ( O ( N o ) )and the rest microscopic ( 5 O(N-f)). This enables one to obtain self-consistent equations for the overlaps with the nominated (potentially macroscopically overlapped or condensed) patterns
where the 1 , ...a label the nominated patterns and ( )T denotes the thermal (symmetrybroken) average at fixed {(}, coupled with a spin-glass like order parameter
and a mean-square average of the overlaps with the un-nominated patterns (itself expressible in terms of 9). Retrieval corresponds to a solution with just one m” non-zero. For the case of the Hopfield-Hebb model the analysis follows readily from an extension of (21). Averaging over random patterns yields ‘in fact, within the RS ansats the physical extremum is found from mazimizing the substituted g with respect to q; this is because the number of (ap) combinations n(n - 1)/2 becomes negative in the limit n
+ 0.
273
(Zn) = exp (-np/3/2) {ma}
1fi
fi{dmp(/3N/2n)! exp [ - N ~ ( / 3 ~ ( r n a ) ' / 2
p=la=l
a
-1
+N-'
en cosh i
(BEmpu:))]}.
(47)
P
To proceed further we separate out the condensed and non-condensed patterns and carry out a sequence of manipulations to obtain an extremally dominated form analagous to eqn (32). Details are deferred to Appendix A, but yield a form
(Z"){,> = (@N/2x)"/'
1 fi p,a=l
where
drnw
1n
dqapdrape-Np*
(4)
is intensive. (48) is thus extremally dominated. At the extremum
Within a replica-symmetric ansatz m p = mp, qap = q, rap = r , self-consistency equations follow relatively straightforwardly. For the retrieval situation in which only one m P is macroscopic (and denoted by m below) they are dz
1 15
m= q=
exp ( - z 2 / 2 ) tanh [p(z&
exp (-.'/a)
+ m)]
tanh2p(z&G+ m)]
(52) (53)
where T
= q ( l - p(1 - q ) ) - 2
(54)
Retrieval corresponds to a solution m # 0. There are two types of non-retrieval solution, (i) m = 0 , q = 0, called paramagnetic, in which the system samples all of phase space, (ii) m = 0,q # 0, the spin glass solution, in which the accessible phase space is restricted but not correlated with a pattern. Fig. 1 shows the phase diagram (Amit et. al. 1985); retrieval is only possible provided the combination of fast (stochastic) noise T and slow (pattern interference) noise a is not too great. There are also (spurious) solutions with more than one m p # 0, but these are not displayed in the figure. In the above analysis, replica symmetry was assumed. This can be checked for stability against small fluctuations by expanding the effective free energy functional
214
F({m"}, { q " P } ) to second order in E" = m" - m, quo = q"P - q and studying the resultant normal mode spectra (de Almeida and Thouless 1978). In fact, it turns out to be unstable in the spin glass region and in a small part of the retrieval region of ( T , a ) space near the maximum a for retrieval. A methodology for going beyond this ansatz has been developed (Parisi 1979) but is both subtle and complicated and is beyond our present scope. However, it might be noted as (i) corresponding to a further hierarchical disconnectedness of phase space (Parisi 1983), and (ii) giving rise to only relatively small changes in the critical retrieval capacity. For the example of section 4.6 replica-symmetry breaking changes the critical boundary between spin-glass and ferromagnet to J,/J = 1. A similar procedure may be used, at least in principle, to analyze retrieval in other networks with J;j = Jj;. Transitions between non-retrieval (m = 0) and retrieval m # 0 may be either continuous or discontinuous; for the fully connected Hopfield-Hebb model the transition is discontinuous but for its dilutely, but still symmetrically, connected counterpart the transition is continuouss (Kanter 1989, Watkin and Sherrington 1991).
10
+
05 -
I
0
0
0 05
0 10
a
f
015
a c = 0.138
Figure 1. Phase diagram of the Hopfield model (after Amit et. al. 1985). T, indicates the limit of retrieval solutions, between T, and Tgthere are spin-glass like non-retrieval solutions, above Tgonly paramagnetic non-retrieval.
4.8 Dilute asymmetric connectivity W e might note that a second type of network provides for relatively straightforward analysis of retrieval, including not only that of the asymptotic retrieval overlap (the m
obtained in the last section) but also the size of the basin from which retrieval is possible (i.e. the minimum initial overlap permitting asymptotic retrieval). This is the dilute 'The dilute case also has greater RS-breaking effects (Watkin and Sherrington 1991)
275
asymmetric network (Derrida et. al 1987) in which synapses are only present with a probability C / N and C is sufficiently small compared with N that self-correlation via synaptic loops is inconsequential. C <( l n N suffices for this to apply, albeit that both C and N might be individually large. Note that Jij and Jji are considered separately and thus effectively are never both present in this limit. For this problem retrieval can be considered via iterative maps for the instantaneous overlaps”, such as
m’(t
+ 1) = f”({m”(t)})
(55)
for synchronous dynamics or
dm’ dt
-= f”({m”(t))) - mP(t)
for asynchronous dynamics. For uncorrelated patterns, if the system is started with a finite overlap with just one pattern, the maps simplify; for the synchronous case to
where m is the overlap with the relevant pattern. The asymptotic overlaps are given by the stable fixed points of the map, while the unstable fixed points give the basin boundaries. Fig. 2a illustrates iterative retrieval. A non-zero stable fixed point therefore implies retrieval, albeit imperfect if the corresponding m* is not unity. In principle, there can be several stable fixed points separated by unstable fixed points, corresponding to several possible retrieval basins for the same pattern, having different qualities of retrieval accessible from different ranges of quality of initialization. Varying the retrieval temperature T or the capacity a causes f ( m ) to vary and hence affects retrieval. There is a phase transition when a non-zero stable fixed point and an associated unstable fixed point are eliminated under such a variation. This is illustrated in Fig. 2b; in this example the transition is first order (discontinuous reduction of m*),but it is also possible to have a second order (continuous) transition if the relevant unstable fixed point is at mg = 0 throughout the variation of T or a as m* 0. Given any algorithm relating { J i j } to the (0, together with a retrieval dynamics, f ( m ( t ) )follows directly and hence so do the retrieval and basin boundary overlaps. In fact, f ( m ( t ) ) can be usefully expressed in terms of a simple measure, the local stability field” distribution --f
&(A - A:);
p ( A ) = (Np)-’ i
A: =
r
c
J$)i.
Jij(T/(c
j#i
a#;
For the case of the dynamics of (46) f ( m )= / d A p ( A ) erf [mA/(2(1- m2
+T2))i]
(59)
l0This simplification is a consequence of the fact that, with no self-correlations, the only relevant global state measure is the probability that any neuron is in a state corresponding to a pattern. “A: is called a stability field since pattern p is stably iterated at neuron i if A: > 0.
216
(Wong and Sherrington 1990a). Thus (50) and (52) can be used to determine the retrieval of any such network, given p ( A ) . In particular, this provides a convenient measure for assessing different algorithms for { J i j } . Of course, for a given relationship between { J ; j } and { t } , p ( h ) follows directly; for the Hebb rule (18), p ( A ) is a Gaussian of mean a-f and standard deviation unity.
1
1
7
f (m), m
/ I
f (m). m
0 mB
mo
m'
1
0
m
1
Figure 2 Schematic illustrations of (a) iterative retrieval; 0, m* are stable fixed points, asymptotically reached respectively from initial states in 0 _< m < m ~mg , < m _< 1, (b) variation of f ( m )with capacity or retrieval temperature, showing the occurrence of a phase transition between retrieval and non-retrieval.
5. STATISTICAL MECHANICS OF LEARNING In the last section we considered the problem of assessing the retrieval capability of a system of given architecture, local update rule and algorithm for {Jjj}. Another important issue is the converse; how to choose/train the { J i j } , and possibly the architecture, in order to achieve the best performance as assessed by some measure. Various such performance measures are possible; for example, in a recurrent network one might ask for the best overlap improvement in one sweep, or the best asymptotic retrieval, or the largest size of attractor basin, or the largest storage capacity, or the best resistance to damage; in a feedforward network trying to learn a r d e from examples one might ask for the best performance on the examples presented, or the best ability to generalize. Statistical mechanics, again largely as originally developed for spin glasses, has played an important role in assessing what is achievable in such optimization and also provides a possible mechanism for achieving such optima (although there may be other algorithms which are quicker to attain the goals which have been shown to be accessible). Thus in this section we discuss the statistical physics of optimization, as applied to neural networks.
211
5.1 Statistical physics of optimization Consider a problem specifiable as the minimization of a function Eia)({b}) where the {a} are quenched parameters and the {b} are the variables to be adjusted, and furthermore, the number of possible values of {b} is very large. In general such a problem is hard. One cannot try all combinations of {b} since there are too many. Nor can one generally find a successful iterative improvement scheme in which one chooses an initial value of {b} and gradually adjusts the value so as to accept only moves reducing E. Rather, if the set {a} imposes conflicts, the system is likely to have a ‘landscape’ structure for E as a function of {b} which has many valleys ringed by ridges, so that a downhill start from most starting points is likely to lead one to a secondary higher-E (local) minimum and not a true (global) minimum or even a close approximation to it. To deal with such problems computationally the technique of simulated annealing was invented (Kirkpatrick et. al. 1983). In this technique one simulates the probabalistic energy-increase (hill-climbing) procedure used by a metallurgist to anneal out the defects which typically result from rapid quenches (downhill only). Specifically, one treats E as a microscopic ‘energy’, invents a complementary ‘temperature’, the annealing temperature TA,and simulates a stochastic thermal dynamics in {b} which iterates to a distribution of the Gibbs form
Then one reduces TA gradually to zero. The actual dynamics has some freedom - for example for discrete variables Monte Car10 simulations with a heat bath algorithm (Glauber 1963), such as (4), or with a Metropolis algorithm (Metropolis et. al. 1953), both lead to (60). For continuous variables Langevin = -vbE(b) ~ ( t )where , ~ ( tis) white noise of strength TA, would dynamics with also be appropriate. Computational simulated annealing is used to determine specific {J}to store specific pattern sets with specific performance measures (sometimes without the limit TA -+ 0 in order to further simulate noisy data). It is also of interest, however, to consider the generic results on what is achievable and its consequences, averaged over all equivalently chosen pattern sets. An additional relevance lies in the fact that there exist algorithms which can be proven to achieve certain performance measures if they are achiewabk (and the analysis indicates if this is the case). The analytic equivalent of simulated annealing defines a generalized partition function
+
where we use C to denote an appropriately constrained sum or integral, from which the average Lenergy’at temperature TAfollows from
218
and the minimum E from the zero ‘temperature’ limit, Em,,,= lim ( E ) T ~ . TA-0
As noted earlier, we are often interested in typicallaverage behaviour, as characterized by averaging the result over a random choice of {a} from some distribution. Hence we require to study (!nZn)(,,},which naturally suggests the use of replicas again. In fact, the replica procedure has been used to study several hard combinatorial optimization problems, such as various graph partitioning (Fu and Anderson 1986, Kanter and Sompolinsky 1987, Wong and Sherrington 1987) and travelling salesman (M6zard and Parisi 1986) problems. Here, however, we shall concentrate on neural network applications. 5.2. Cost functions ddpendent on stability fields
One important class of training problems for pattern-recognition neural networks is that in which the objective can be defined as minimization of a cost function dependent on patterns and synapses only through the stability fields; that is, in which the ‘energy’ to be minimized can be expressed in the form
E&({JH = -
cC 9 ( A 3 P
(64)
i
The reason for the minus sign is that we are often concerned with maximizing performance functions, here the g(A). Before discussing general procedure, some examples of g(A) might be in order. The original application of this technique to neural networks concerned the maximum capacity for stable storage of patterns in a network satisfying the local rule =
sgn
(CJijuj) i#i
(66)
(Gardner 1988). Stability is determined by the A:; if A: > 0, the input of the correct bits of pattern p to site i yields the correct bit as output. Thus a pattern p is stable under the network dynamics if
A: > 0;all i.
(67)
A possible performance measure is therefore given by (64) with g ( A ) = -@(-A)
where @(x)
= 1;z > 0
0 ; x < 0.
(68)
279
g(At) is thus non-zero (and negative) when pattern p is not stably stored at site i. Choosing the { J ; j } such that the minimum E is zero ensures stability. The maximum capacity for stable storage is the limiting value for which stable storage is possible. An extension is to maximal stability (Gardner and Derrida, 1988). In this case the performance measure employed is g(A) = -O(K
-
A)
(70)
and the search is for the maximum value of n for which Em;,, can be held to zero for any capacity a, or, equivalently, the maximum capacity for which Em;,,= 0 for any n. All patterns are then stored with stability fields greater than IC. In fact, for synapses restricted only by the spherical constraint"
C J;"j= N , j#i
with J;j and Jj; independent, the stability field n and the storage ratio a = p / N at criticality can be shown to be related by
For n = 0, the conventional problem of minimal stability, this reduces to the usual a, = 2 (Cover 1965). Yet another example is to consider a system trained to give the greatest increase in overlap with a pattern in one step of the dynamics, when started in a state with overlap mt. In this case, for the update rule (5) the appropriate performance function, averaged over all specific starting states of overlap mt, is (Wong and Sherrington 1990a)
where
This performance function is also that which would result from a modification of eqn (68) in which A: is replaced by
(r
is the result of randomly distorting ,$ with a (training) noise dt = (1 - mt)/2 where and E A is averaged over all implementations of the randomization (with fixed m t ) . This is referred to as 'training with noise' and is based on the physical concept of the use of such "Note that this is a different normalization than that used in eqn (18).
280
noise to spread out information in a network, perhaps with an aim towards producing better generalization, association or stability. 5.3 Methodology
Let us now turn to the methodology. For specific pattern sets we could proceed by computational simulated annealing, as discussed in the first part of section 5.1. Analytically, we require ( l n Z A { ( } ) { , , , where
from which the average minimum cost is given by
( P n Z A ) ( o is obtained via the replica procedure, (26), averaging over the {tp} to yield a replica-coupled effective pure system which is then analyzed and studied in the limit n + 0. The detailed calculations are somewhat complicated and are deferred to Appendix B. However we note here that the general procedure is analagous to those of sections (4.5) - (4.7) but with the {J} as the annealed variables, the as the quenched ones and the retrieval temperature replaced by the annealing temperature. For systems with many neurons the relevant integrals are again extremally dominated, permitting steepest descent analysis. New order parameters are again introduced, including an analogue of the spin glass order parameter q"@;here
{t}
where ( )eg is an average against the effective system resulting from averaging over the patterns; cf eqn (35). Again a mathematical simplification is consequential upon a replica-symmetric ansatz. The net result of such an analysis is that the local field distribution p ( A ) in the optimal configuration is given", within RS theory, for synapses obeying the spherical rule (64) by (Wong and Sherrington 1990)
where Dt = dt exp(-t2/2)
/&
(80)
laNote that when the expression for the partition function is extremally dominated, any other thermal measure is similarly dominated and is often straightforward to evaluate; this is the case here with
( p ( A ) ) { € l= ( N p ) - ' ( x x 6 ( A - A:)){€),
as demonstrated in Appendix B.
28 1 and X ( t ) is the value of X giving the largest value of [g(X) implicitly by a-l =
1Dt(X(t)
-q
- (A
- t ) ’ / 2 ~ where ] 7 is given
2 .
The same expressions apply to single perceptrons storing random input-output associations, where the index i can be dropped and AP = q” CjJj(,”/(CJ:): where {(,”}; j = 1,...N are the inputs and 7’’ the output of pattern p, and for dilute networks where N is replaced by the connectivity C. Immediately, one gets the one-step update of any network optimized as above. Thus, for the dynamics of (5) m’ = / d A p ( A ) erf [mA/(2(1 - m2
+T2))i].
(82)
For a dilute asymmetric network this applies to each update step, as in (57). Wong and Sherrington (1990a,b) have used the above method to investigate how the p(A) and the resultant retrieval behaviour depend on training noise, via (66). They have demonstrated that infinitesimal training noise yields the same p ( A ) as the maximum stability rule, while the limit of very strong training noise yields that of the Hebb rule. The former gives perfect retrieval for T = 0 and a < 2 but has only narrow basins of attraction for a > 0.42, while the Hebb rule has only imperfect retrieval, and that only for a < 0.637, but has wide basins of attraction. Varying mt gives a method of tuning performance between these limits. Similarly, for general T one can determine the optimal mt for the best retrieval overlap or basin size and the largest capacity for retrieval with any mt (Wong and Sherrington 1990b); for example for maximum capacity it is better to use small training noise for low T,high training noise for higher T. Just as in the replica analysis of retrieval, the assumption of replica symmetry for qap of (71) needs to be checked and a more subtle ansatz employed when it is unstable against q 9”p;9pp small. In fact, it should also be tested even when small fluctuations q p p small fluctuations are stable (since large ones may not be). Such effects, however, seem to be absent or small for many cases of continuous { J ; j } , while for discrete {Jij} they are more important’*. --f
+
5.4 Learning a rule
So far, our discussion of optimal learning has concentrated on recurrent networks and on training perceptron units for association of given patterns. Another important area of practical employment of neural networks is as expert systems, trained to try to give correct few-option decisions on the basis of many observed pieces of input data. More precisely, one tries to train a network to reproduce the results of some usually-unknown rule relating many-variable input to few-variable output, on the basis of training with a few examples of input-output sets arising from the operation of the rule (possibly with error in this training data). “For discrete { J , l } there is first order replica-symmetry breaking (Krauth and MCnard 1989) and small fluctuation analysis is insufficient.
282
To assess the potential of an artifical network of some structure to reproduce the output of a rule on the basis of examples, one needs to consider the training of the network with examples of input-output sets generated by known rules, but without the student network receiving any further information, except perhaps the probability that the teacher rule makes an error (if it is allowed to do so). Thus let us consider first a deterministic teacher rule 9 = V({€I),
(83)
relating N elements of input data deterministic student network
((:,ti
..(:)
to a single output B”, being learned by a
9 = B({€)).
(84)
B is known whereas V is not. Training consists of modifying B on the basis of examples drawn from the operation of V. Problems of interest are to train B to give (i) the best possible performance on the example set, (ii) the best possible performance on any random sample drawn from the operation of V, irrespective of whether it is a member of the training set or not. The first of these refers to the ability of the student to learn what he is taught, the second t o his ability to generalise from that training. Note that the relative structures of teacher and student can be either such that the rule is learnable, or not (for example, a perceptron is incapable of learning a parity rule (Minsky and Papert 1969)). The performance on the training set p = 1, ...p can be assessed by a training error
where e(z,y) is zero if z = y, positive otherwise. We shall sometimes work with the fractional training error Et
(86)
= Et/p.
The corresponding average generalisation error is
A common choice for
e
is quadratic in the difference (z - y). With the scaling
e(z,y) = (2 - YI2/4
(88)
one has for binary outputs, 9 = 6 1 , e(z,y) = @(-ZY),
so that if E l ( { ( } ) is a perceptron,
(89)
283 then e’ =
@(-A’)
where now.
A” = qJ”1J j ( r / ( CJ:)i, i
i
making Et analagous t o E A of eqn (64) with ‘performance function’ (68). This we refer t o as minimal stability learning. Similarly t o section 4, one can extend the error definition to e” = @(n
- A”)
(93)
and, for learnable rules, look for the solution with the maximum K for zero training error. This is maximal stability learning. Minimizing Et can proceed as discussed above, either simulationally or analytically. Note, however, that for the analytic study of average performance the ( 7 , t ) combinations are now related by the rule V, rather than being completely independent. eg follows from the resultant distribution p ( A ) . For the case in which the teacher is also a perceptron, the rule is learnable and therefore the student can achieve zero training error. The resultant generalization error is however, not necessarily zero. For continuous weights Jj the generalization error with the above two training error formulations scales as 1/a for large a,where p = a N , with the maximal stability form (Opper et. al. 1990) yielding a smaller multiplicative factor than the minimal stability form (Gyorgy and Tishby 1990). Note, however, that maximal stability training does not guarantee the best generalization; that has been obtained by Watkin (1992), on the basis of Bayesian theory, as the ‘centre of gravity’ of the possible J space permitted by minimal stability. For a perceptron student with binary weights { J j = fl}, learning from a similarly constrained teacher, there is a transition from imperfect t o perfect generalization a t a critical number of presented patterns p = a , N . This is because, in order for the system t o have no training error, beyond critical capacity it must have exactly the same weights as the teacher. Just as in the case of recurrent networks for association, it can be of interest t o consider rule-learning networks trained with randomly corrupted data or with unreliable (noisy) teachers or students. Another possibility is to train a t finite temperature; that is t o keep the annealing temperature finite rather than allowing it to tend to zero. Analysis for PA small is straightforward and shows that for a student perceptron learning to reproduce a teacher perceptron the generalization error scales as eg 1/(/3~a), so that increasing a leads to qualitatively similar performance as a zero-temperature optimized network with p/N =& = (Sompolinsky et. al. 1990). There are many other rules which can be analyzed for possible reproduction by a singlelayer perceptron, some learnable, some not, and attention is now also turning towards the analysis of multilayer perceptrons, but for further details the reader is referred elsewhere (Seung et. al. 1992, Watkin et. al. 1992, Barkai et. al. 1992, Engel et. al. 1992).
-
284 6. CONCLUSION
In this chapter we have tried to introduce the conceptual and mathematical basis for the transfer of techniques developed for spin glasses to the quantitative analysis of neural networks. The emphasis has been on the underlying theme and the principles behind the analysis, rather than the presentation of all the intricacies and the applications. For further details the reader is referred to texts such as those of Amit (1989), Miiller and Reinhardt (1990), Hertz et. al. (1991), the review of Watkin et. al. (1992) and to the specialist research literature. We have restricted discussion to networks whose local dynamics is determined by pairwise synaptic forces and in applications have employed binary neurons and zero thresholds, but all of these restrictions can be lifted in principle and mostly in practice. For example, with binary neurons the synaptic update rule includes only a small subset of all Boolean rules and it is possible to extend the analysis of retrieval in a dilute network and the optimization of rules for pattern association to the general set of Boolean rules (Wong and Sherrington 1989a, 198913). Binary neurons can be replaced by either continuous-valued or multi-state discrete ones. Thresholds can be considered as arising from extra neurons in a fixed activity state. We have discussed only a rather limited set of the types of problems which can be envisaged for neural networks. In particular we have discussed only the storage and retrieval of static data and only one-step or asymptotic retrieval (but see also the accompanying chapter on dynamics: Coolen and Sherrington 1992). Statistical mechanical techniques have been applied t o networks with temporally structured attractors and to the issue of competition between such attractors and ones associated with static associations. Indeed, the study of more sophisticated aspects of dynamics is an active and growing one. Also, we have discussed only supervised learning whilst it is clear that unsupervised learning is also of major biological and engineering relevance - again, there has been and continues to be statistical mechanical transfer to this area also. We have not discussed the problem of optimizing architecture, except insofar as this is implicit in the inclusion of the possibility of Jij = 0. Nor have we discussed training algorithms other than that of simulated annealing, but we note again that there exist other algorithms for certain problems which are known to work zf a solution ezists, while the analytic theory can show if one does. Similarly, we have not discussed the rate of convergence of any algorithms, either to optimum or specified sub-optimal performance. However, overall, it is hoped that it will be apparent that the statistical physics developed for spin glasses has already brought to the subject of neural networks both new conceptual viewpoints and new techniques, particularly oriented towards the quantitative study of typical rather than worst cases and allowing for the consideration of imprecise information and assessment of the resilience of solutions. 
There is much further potential for the application of the statistical physics of disordered systems to neural networks and possibly also for the converse, where we note in conclusion that the corresponding investigations of spin glasses, started almost two decades ago, led to a major reconsideration of both tenets and techniques of statistical physics, and neural networks could provide an interesting sequel to the fascinating developments which unfolded in that study.
285
Appendix A Here we consider in more detail the derivation of eqns (48)-(54), starting from eqn (47). For the non-condensed patterns, p > s, only small r n p contribute and the corresponding en cosh can be expanded to second order to approximate
The resultant Gaussian form in r n p is inconvenient for direct integration since it would yield an awkward function of u's. However, C usuf may be effectively decoupled by the introduction of a spin-glass like order parameter qa' via the identities
In eqn(47) the m w ; p > s integrations now yield the u-independent result where (2n/NP) t ( p - " ) ( det A)- i(p-a),
A"' = (1 - P)S,p - Pf',
(A-5)
while the a; contributions now enter in the form
which is separable in i. Further anticipating the result that the relevant r' scales as p = a N , (A.6) has the consequence that (Z"){,l is extremally dominated, as in (48). Re-scaling
yields eqn (48) with
286
(-4.8) (4)
P=1
{OP}
where the {u”} summations are now single-site. Minimizing @ with respect to {ma}, {qaa}, {T”@} yields the dominant behaviour, with (49)-(51) providing an interpretation of the extremal values, as follows from an analagous evaluation of the right hand sides of those equations, which are again extremally dominated. Explicit evaluation and the limit n --t 0 are facilitated by the replica symmetric ansatz m p = mP7 -
TaB
(-4.9)
q,
(A.lO)
= T.
(A.ll)
With these assumptions the exponential in the last term of (A.8) may be written as (A.12)
where we have re-expressed the exp(C Cn cosh) to obtain a form with a linear u dependence in an exponential argument. The second term of (A.12) can similarly be transformed t o a linear form by the use of (22), thereby permitting the separation and straightforward execution of the sums of {u“} in (A.7). Also, with the use of (A.lO) the evaluation of CnA is straightforward and in the limit n + 0 yields (A.13)
Thus {mP}, q and
9({ma},q,T)
T
are given as the extrema of
= a/2
+ C(m’)2/2 + aPT(1-
q)/2
P
+ ( a / 2 P ) ( M l - 4 1 - 4)) - P d ( 1 - P(1 - 4)))
-p-’
/ d~e”’/~(Cn[2cosh
@(z&
+ 2 m”~P)])p=*l,r=~... (A.14) p=l
Specializing to the case of retrieval, with only one m’ macroscopic (i.e. s = l ) , there result the self-consistency equations (52)-(54).
287
Appendix B In this appendix we demonstrate how to perform the analytic minimization of a cost function of the general form E$>({JH= -
cd A P ) P
where e
A’ = (:
C Jj[$, j=1
with respect to
Jj
which satisfy spherical constraints
The {(} are random quenched fl. The solution of this problem also solves that of eqn (57) since the J;j of (57) are uncorrelated in i and thus the problem separates in the i label; note that J;j and Jj; are optimized independently. The method we employ is the analytic simulated annealing discussed in section 5.1, with the minimum with respect to the {J}then averaged over the choice of {(}. Thus we require (.!nZ{t)){t)where
{t}
In order to evaluate the average we separate out the explicit ( dependence via delta functions S ( P - ( P cjJ,(j”/Ct)and express all the delta functions in exponential integral represent ation,
exp(iq5”(XP - ( ” c J j ( T / C ! ) ) .
(B.5)
i
Replica theory requires ( Z c > ) { oand therefore the introduction of a dummy replica index on J , E , X and 4; we use a = 1, ..A. For the case in which all the ( are independently distributed and equally likely to be fl,the ( average involves
For large C (and q5.J exponentiation) by
I O(1)) the cosine can be approximated
(after expansion and re-
288
where
9 is conveniently re-expressed as
where we have used the fact that only C Jj" = C contributes to 2. In analogy with the procedure of Appendix A, we eliminate the term in Cj J;Jf in favour of a spin glass like order parameter/variable qa@introduced via the identities 1=
1
dqaP6(qaP - C-'
= (C/27r)
c JTJ;)
(B.ll)
j
1
dzaPdqaPexp(izaP(CqaP-
c J;J;)).
(B.12)
j
The j = 1,...C and p = 1,...p contribute only multiplicatively in relevant terms and (Z"){,i can be written as
where =)/)n d J a e x p ( - ~ a " ( ( J a ) 2 exp G J ( { E " } , { Z " ~ OL
a
-
1)
+ C zaPJaJP)
(B.14)
a
and there are now no j or p labels. Since GJ and G, are intensive, as is p / C = a for the situation of interest, (B.13) is dominated by the maximum of its integrand.
289
In the replica symmetric ansatz E“ = &,zap = z and quo = q. In the limit n elimination of E and x at the saddle point &&/a& = a@/& = a @ / a q= 0 yields 1 2 +a/ D t Pn(27r(1-
(PnZ){o = C ext,(-Pn[2~(1- q)]
3
0,
+ (2(1 - q))-’
(I))-;
/W exppAg(X) - (A - &)‘/2(1
- q ) ] ) (B.16)
(B.17)
where D t = dt exp(-t2/2)/&.
In the low temperature limit, ,Ba + 00,q + 1 and p ~ ( -l q ) -i 7,independent of TA to leading order. The integration over X can then be simplified by steepest descent, so that
1
dX exp(kJag(X) - (A - t)’/2(1 - q ) )
+
exp(PA(g(i(t)) - ( i ( t ) - t)2/27>>
(B.18)
where i ( t ) is the value of X which maximizes (g(X) - (A - t)’/27); i.e. the inverse function of t(X) given by
- 7g’(X)l+x.
t(X) =
(B.19)
Extremizing the expression in (A.15) with respect to q, or equivalently to 7,gives the (implicit) determining equation for 7
/ Dt(i(t)
(B.20)
- t ) 2 = a.
The average minimum cost follows from (B.21) This can be obtained straightforwardly from (B.16). Similarly, any measure f(h”))~,)(C} may be obtained by means of the generating functional procedure of eqn (14). Alternatively, they follow from the local field distribution p ( X ) defined by
((x;=l
(B.22) which is given by p(X) =
J DtqX - X(t)).
(B.23)
A convenient derivation of (B.23) proceeds as follows. The thermal average for fixed {(} is given by P
(P-’ C &(A - A ” ) ) T A ”=l
=
C(P-’C &(A (4
fi
- A”))eXP(PA
C g(A”))/Z. ”
(B.24)
290
Multiplying numerator and denominator by 2”-’ and taking the limit n (p-’
c
&(A - A’)),
c1
= lim
c
(p-’
n-10 {JQ};a=l, ...n
c P
which permits straightforward averaging over domination as above, so that
exP(c[PAg(Aa) a
&(A - A”’))eXp(PA
---t
g(A””))
0 gives
(B.25)
P”
{ I } .There then results the same extremal
+ i A n Y - (4’”)2/21 - a
(B.26)
whence the replica-symmetric ansatz and limit PA---t 00 yield the result (B.23). In the corresponding problems associated with learning a rule, to in (B.2) is replaced by the teacher output 9, while certain aspects of noise may be incorporated in the form of g(hP). Minimization of the corresponding training error may then be effectuated analagously to the above treatment of random patterns, but with 7 related to the { I j } ; j = 1,... by the teacher rule (76).
REFERENCES Aleksander I.; 1988, in “Neural Computing Architectures” ed. I. Aleksander (North Oxford Academic), 133 Amit D.J.; 1989, “Modelling Brain Function” (Cambridge University Press) Amit D.J., Gutfreund H. and Sompolinsky H.; 1985, Ann. Phys. 173, 30 de Almeida J.R.L. and Thouless D.J.; 1978, J. Phys. A l l , 983 Binder K. and Young A.P.; 1986, Rev. Mod. Phys. 58, 801 Coolen A.C.C. and Sherrington D.; 1992 “Dynamics of Attractor Neural Networks”, this volume Derrida B., Gardner E. and Zippelius A.; 1987, Europhys. Lett. 4, 167 Edwards S.F. and Anderson P.W.; 1975, J. Phys F5, 965 Fischer K.H. and Hertz J.A.; 1991 “Spin Glasses” (Cambridge University Press) Gardner E.; 1988, J. Phys A21, 257 Gardner E. and Derrida B.; 1988, J. Phys A21, 271 Glauber R.; 1963, J. Math. Phys. 4, 294 Gyorgy G. and Tishby N.; 1990, in “Neural Networks and Spin Glasses”, eds. W.K. Theumann and R. Koberle (World Scientific) Hertz J.A., Krogh A. and Palmer R.G.; 1991, “Introduction to the Theory of Neural Computation” ( Addison-Wesley) Hopfield J.J.; 1982, Proc. Natl. Acad. Sci. USA 79,2554 Kanter I. and Sompolinsky H.; 1987, Phys. Rev. Lett. 58,164 Kirkpatrick S., Gelatt C.D. and Vecchi M.P., 1983, Science 220, 671 Kirkpatrick S. and Sherrington D.; 1978, Phys. Rev. B17,4384 Krauth W. and MCzard M.; J. Physique (France) 5 0 , 3057 (1989)
291
Metropolis N., Rosenbluth A.W., Rosenbluth M.N., Teller A.H. and Teller E.; 1953, J. Chem. Phys. 21, 1087 Mizard M. and Parisi G.; 1986, J. Physique (Paris) 47, 1285 Mdaard M., Parisi G. and Virasoro M.A.; 1987, “Spin Glass Theory and Beyond” (World Scientific) Minsky M.L. and Papert, S.A.; 1969 “Perceptrons” (MIT Univ. Press) Muller B. and Reinhardt J.; 1990, “Neural Networks: an Introduction” (Springer-Verlag) Opper M., Kinael W., Kleina J. and Nehl R.; 1990, J. Phys A23, L581 Parisi G.; 1979, Phys. Rev. Lett. 43, 1754 Parisi G.; 1983, Phys. Rev. Lett. 5 0 , 1946 Seung H.S., Sompolinsky H. and Tishby N.; 1992, Phys. Rev. A45,6056 Sherrington D.; 1990, in “1989 Lectures on Complex Systems” ed. E. Jen (Addison-Wesley) p.415 Sherrington D.; 1992, in “Electronic Phase Transitions” ed. W. Hanke and Yu.V. Kopaev (North-Holland) p.79 Sherrington D. and Kirkpatrick S.; 1975, Phys. Rev. Lett 35, 1972 Thouless D.J., Anderson P.W. and Palmer R.; 1977, Phil. Mag. 35, 1972 Watkin T.L.H.; 1992 “Optimal Learning with a Neural Network”, to be published in Europhys. Lett. Watkin T.L.H., Rau A. and Biehl M.; 1992 “The Statistical Mechanics of Learning a Rule”, to be published in Revs. Mod. Phys. Watkin T.L.H., Sherrington D.; 1991, Europhys. Lett. 14,791 Wong K.Y.M. and Sherrington D.; 1987, J. Phys. A20, L793 Wong K.Y.M. and Sherrington D.; 1988, Europhys. Lett. 7 , 197 Wong K.Y.M. and Sherrington D.; 1989, J. Phys A22, 2233 Wong K.Y.M. and Sherrington D.; 1990a, J. Phys. A23, L175 Wong K.Y.M. and Sherrington D.; 1990b, J. Phys. A23, 4659 Wong K.Y.M. and Sherrington D.; 199Oc, Europhys. Lett. 10,419
This Page Intentionally Left Blank
Mathematical Approaches to Neural Networks J.G. Taylor (Editor) 8 1993 Elsevier Science Publishers B.V. All rights reserved.
293
Dynamics of Attractor Neural Networks
Ton Coolen and David Sherrington Department of Physics, University of Oxford, Theoretical Physics, 1 Keble Road, Oxford, OX1 3NP Abstract We illustrate the use of techniques from non-equilibrium statistical mechanics for studying dynamical processes in symmetric and non-symmetric attractor neural networks.
1. INTRODUCTION Although techniques from equilibrium statistical mechanics can provide much detailed quantitative information on the behaviour of large interconnected networks of neurons, they also have some serious restrictions. The first (obvious) one is that, by definition, they will only provide information on equilibrium properties. For associative memories, for instance, it is not clear how one can calculate quantities like sizes of domains of attraction without studying dynamics. The second (more serious) restriction is that for equilibrium statistical mechanics to apply the dynamics of the system under study must obey a property called detailed balance. For Ising spin neural networks in which the dynamics is a stochastic alignment to local fields (or post-synaptic potentials) which are linear ih the neural state variables, this requirement implies immediately symmetry of the interaction matrix. From a physiological point of view this is clearly unacceptable. The dynamics of symmetric systems can be understood in terms of the minimisation of some scalar quantitity (in equilibrium to be identified with the free energy). For nonsymmetric systems, although the microscopic probability distribution will again evolve in time to some equilibrium, it will no longer be possible to think in terms of some scalar quantity being minimised. Ergodicity breaking in the thermodynamic limit (i.e. on finite timescales) may now manifest itself in the form of limit-cycle attractors or even in chaotic trajectories. One must study the dynamics directly. The common strategy of all non-equilibrium statistical mechanical studies is rather simple: try to derive and solve the dynamical laws for a suitable smaller set of relevant macroscopic quantities from the dynamical laws of the underlying microscopic system. This can be done in two ways. The first route consists of calculating from the microscopic stochastic equations a differential equation for the macroscopic probability distribution, which subsequently is to be solved. The second route consists of solving the macroscopic stochastic equations directly; from this solution one then calculates the values of the macroscopic quantities. For such programmes to work the interaction matrix must either have a suitable structure of some sort, or contain (frozen) disorder, over which suitable averages can be performed (or a combination of both). A common feature of many sta-
294
tistical mechanical models for neural networks is separability of the interaction matrix, which naturally leads to a convenient description in terms of macroscopic order parameters.
2. THE MACROSCOPIC PROBABILITY DISTRIBUTION
In this section we will show how one can calculate from the microscopic stochastic evolution equations (at the level of individual neurons) differential equations for the probability distribution of suitably defined macroscopic state variables. We will investigate which are the conditions for the evolution of these macroscopic state variables to (a) become deterministic in the limit of infinitely large networks and, in addition, (b) be governed by a closed set of dynamic equations. For simplicity in illustrating the techniques we will restrict ourselves to systems of McCulloch-Pitts/Ising spin neurons a;E {-1, 1) (a;= 1 indicates that neuron i is firing with maximum frequency, u; = -1 indicates that it is at rest). The N-neuron network state will be denoted by the vector u E {-1, l}N;the probability to find the system at time t in state u by pt(u). The evolution in time of this microscopic state probability is governed by a stochastic process in the form of a master equation:
in which the Fj are ‘spin-flip’ operators: F j f ( u ) E f(a1,. . .,-uj,. . .,U N ) . The process (1) becomes a stochastic local field alignment if for the transition rates wj(u) of the transitions u -+ Fju we make the usual choice: 1 w ~ ( u ) - [l - t a n h ( p ~ j h j ( ~ ) ) ] 2
hj(U)
JjkUk
-
w;
(2)
k
where p = 1/T (the ‘temperature’ T being a measure of the amount of stochastic noise) and the quantities hj(u) are the local alignment fields (or post-synaptic potentials). There are clearly many alternative ways of defining a stochastic dynamics for such systems, most of which are Markov processes (i.e. with discrete time steps). The advantage of the above choice is simply that we are now dealing with differential equations (as opposed to discrete mappings) from the very beginning.
2.1 A Toy Model
Let us first illustrate the basic ideas with the help of a simple toy model: J.. =
-J7 . tr . 3
1 3 - N
w;= 0
(the variables q; and & are arbitrary, but may not depend on N). For 77; = [ i = 1 we recover the infinite range ferromagnet (J > 0) or anti-ferromagnet (J < 0); for q; = [; E {-1,l) (random) and J > 0 we recover the Liittinger (1976) or Mattis (1976) model
295
(equivalently: the Hopfield (1982) model with only one stored pattern). Note, however, that the interaction matrix is non-symmetric as soon as a pair ( i j ) exists, such that v;(j # qjt; (in general, therefore, equilibrium statistical mechanics does not apply). The local fields become h;(u) = Jv;m(u) with m ( u ) E & x k (kuk. Since they depend on the microscopic state u only through the value of m , the latter quantity appears to constitute a natural macroscopic level of description. The ensemble probability of finding the macroscopic state m ( u ) = m is given by Pt[ml
= C P t ( U ) 6 [m-m(u)l U
Its time derivative is obtained by inserting (1):
Inserting the expressions (2) for the transition rates and the local fields gives:
In the thermodynamic limit N + 00 only the first term survives. The solution of the resulting differential equation for Pt [m] is:
Pt[m]= / d m o %[m016 [m-m'(t)l
This solution describes deterministic evolution, the only uncertainty in the value of m is due to uncertainty in initial conditions. If at t = 0 the quantity m is known exactly, this will remain the case for finite timescales; m turns out to evolve in time according to (3).
2.2 Arbitrary S y n a p t i c Interactions
We will again define our macroscopic dynamics according to the master equation (l),but we will now allow for less trivial choices of the interaction matrix. We want to calculate the evolution in time of a given set of macroscopic state variables n(u)f (CiI(u), . ..,Cin(u))in the thermodynamic limit N + 00. At this stage there are no restrictions yet on the form or the number n of these state variables nk.0); such conditions, however, naturally arise if we require the evolution of the variables 0 to obey a closed set of deterministic laws, as we will show below. The ensemble probability of finding the system in macroscopic state f2 is given by:
The time derivative of this distribution is obtained by inserting (1) and can be written as
This expansion (to be interpreted in a distributional sense, i.e. only to be used in expressions of the form ∫dΩ P_t(Ω) G(Ω) with sufficiently smooth functions G(Ω), so that all derivatives are well-defined and finite) will only make sense if the single spin-flip shifts Δ_jk in the state variables Ω_k are sufficiently small. This is to be expected from a physical point of view: for finite N any state variable Ω_k(σ) can only assume a finite number of possible values; only in the limit N → ∞ may we expect smooth probability distributions for our macroscopic quantities (the probability distribution of state variables which depend on only a small number of spins, however, will not become smooth, whatever the system size). The first (l = 1) term in the series (4) is the flow term; retaining only this term leads us to a Liouville equation which describes deterministic flow in Ω space, driven by the flow field F^(1). Including the second (l = 2) term as well leads us to a Fokker-Planck equation which (in addition to the flow) describes diffusion in Ω space of the macroscopic probability density P_t[Ω], generated by the diffusion matrix F^(2). According to (4) a sufficient condition for a given set of state variables Ω(σ) to evolve in time deterministically in the limit N → ∞ is:
(since now for N → ∞ only the l = 1 term in (4) is retained). In the simple case where the state variables Ω_k are all of the same type, in the sense that the shifts Δ_jk are of the same order in the system size N (i.e. there is a monotonic function Δ_N such that Δ_jk = O(Δ_N) for all j, k), the above criterion becomes:
If for a given set of macroscopic quantities the condition (5) is satisfied we can for large N describe the evolution of the macroscopic probability density by the Liouville equation:
the solution of which describes deterministic flow:
d/dt Ω*(t) = F^(1)[Ω*(t); t],   Ω*(0) = Ω_0   (7)
In taking the limit N → ∞, however, we have to keep in mind that the resulting deterministic theory is obtained by taking this limit for finite t. According to (4) the l > 1 terms do come into play for sufficiently large times t; for N → ∞, however, these times diverge by virtue of (5). The equation (7) governing the (deterministic) evolution in time of the macroscopic state variables Ω on finite timescales will in general not be autonomous; tracing back the origin of the explicit time dependence in the right-hand side of (7), one finds that in order to calculate F^(1) one needs to know the microscopic probability density p_t(σ). This, in turn, requires solving the master equation (1) (which is exactly what one tries to avoid). However, there are elegant ways of avoiding this pitfall. We will now discuss two constructions that allow for the elimination of the explicit time dependence in the right-hand side of (7) and thereby turn the state variables Ω and their dynamic equations (7) into an autonomous level of description. The first way out is to choose the macroscopic state variables Ω in such a way that there is no explicit time dependence in the flow field F^(1)[Ω; t] (if possible). According to the definition of the flow field this implies making sure that there exists a vector field Φ[Ω] such that
(with Δ_j ≡ (Δ_j1, ..., Δ_jn)), in which case the time dependence of F^(1) drops out and the macroscopic state variables Ω evolve in time according to:
This is the construction underlying the approach in papers like Buhmann and Schulten (1988), Riedel et al (1988) and Coolen and Ruijgrok (1988). The advantage is that no restrictions need to be imposed on the initial microscopic configuration; the disadvantage is that, for the method to apply, a suitable separable structure of the interaction matrix is required. If, for instance, the macroscopic state variables Ω_k depend linearly on the microscopic state variables σ (i.e. Ω_k(σ) = Σ_{j=1}^N c_{kj} σ_j), we obtain (with the transition rates (2)):
in which case it turns out that the only further condition necessary for (8) to hold is that all local fields h_k must (in leading order in N) depend on the microscopic state σ only through the values of the macroscopic state variables Ω (since the local fields depend linearly on σ, this, in turn, implies that the interaction matrix must be separable). If it is not possible to find a set of macroscopic state variables that satisfies both conditions (5) and (8), additional assumptions or restrictions are needed. One natural assumption that allows us to close the hierarchy of dynamical equations and obtain an autonomous
flow for the state variables Ω is to assume equipartitioning of probability within the Ω-subshells of the ensemble, which allows us to make the replacement
p_t(σ) → P_t[Ω(σ)] / Σ_{σ'} δ[Ω(σ) − Ω(σ')]
with the result that the flow field in (7) becomes an average of Σ_j w_j(σ)Δ_j(σ) over the subshell {σ | Ω(σ) = Ω}, so that it depends on Ω only and no longer explicitly on time.
Whether or not the above way of closing the set of equations is allowed will depend on the extent to which the relevant stochastic vector Σ_j w_j(σ)Δ_j(σ) is constant within the Ω-subshells of the ensemble. At t = 0 there is no problem, since one can always choose the initial microscopic distribution p_0(σ) to obey equipartitioning. In the case of extremely diluted networks, introduced by Derrida et al (1987), this situation is subsequently maintained: due to the extreme dilution, no correlations can build up in finite time and equipartitioning is sustained (see also the review paper by Kree and Zippelius 1991). The advantage of extreme dilution is that less strict requirements on the structure of the interaction matrix are involved; the disadvantage is that the required sparseness of the interactions (compared to the system size) does not correspond to biological reality.
3. SEPARABLE MODELS
In this section we will show how the formalism described in the previous section can be applied to networks for which the matrix of interactions J_ij has a separable form (which includes most symmetric and non-symmetric Hebbian-type attractor models). We will restrict ourselves to models with w_i = 0; the introduction of non-zero thresholds is straightforward and does not pose new problems.
3.1 Description at the Level of Sublattice Magnetisations
The following type of models was introduced by van Hemmen and Kühn (1986) (for symmetric choices of the kernel Q). The dynamical properties (for arbitrary choices of the kernel Q) were studied by Riedel et al (1988):
J_ij = (1/N) Q(ξ_i; ξ_j),   ξ_i ≡ (ξ_i^1, ..., ξ_i^p)   (9)
The components ξ_i^μ are assumed to be drawn from a finite discrete set A, which contains n elements (again the variables ξ_i^μ are not allowed to depend on N). They typically represent or contain the information ('patterns') to be stored or processed. The Hopfield model, for instance, corresponds to choosing Q(x; y) ≡ x·y and A ≡ {−1, 1}. One can now introduce a partition of the system {1, ..., N} into n^p so-called sublattices I_η:
I_η ≡ {i | ξ_i = η},   ∪_η I_η = {1, ..., N},   η ∈ A^p
The number of spins in sublattice I_η will be denoted by |I_η| (this number will have to be large). If we choose as our macroscopic state variables the magnetisations within these sublattices, we are able to express the alignment fields h_k solely in terms of macroscopic quantities:
If all relative sublattice occupation numbers p_η = |I_η|/N are of the same order in N (which, for example, is the case if the vectors ξ_i have been drawn at random from the set A^p) we may write Δ_jη = O(n^p N^{-1}) and use (6). The evolution in time of the sublattice magnetisations is then found to be deterministic in the thermodynamic limit if
lim_{N→∞} p / log N = 0
Furthermore, condition (8) holds, since
We may conclude that the evolution in time of the sublattice magnetisations is governed by the following autonomous set of differential equations:
A nice overview of applications and analysis of these equations (and generalisations thereof) for various choices of the kernel Q(·;·) can be found in review papers like van Hemmen and Kühn (1991) and Kühn and van Hemmen (1991). The above procedure does not require symmetry of the interaction matrix. In the symmetric case Q(x; y) = Q(y; x) one can easily verify that the system will approach equilibrium (if the kernel Q is positive definite, for instance, by inspection of the Liapunov function L{m_η}):
which is bounded and obeys:
Also for some very specific non-symmetric cases Liapunov functions do exist, as shown by Herz et al (1991).
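The autonomous sublattice equations referred to above can be integrated directly. A minimal numerical sketch follows; the explicit form used, dm_η/dt = tanh(β Σ_{η'} p_{η'} Q(η; η') m_{η'}) − m_η, is an assumption consistent with the description in the text (the equation itself is not reproduced above), and the kernel, alphabet and parameter values are illustrative.

```python
import numpy as np

# Sketch of the sublattice-magnetisation dynamics for a separable model
# J_ij = Q(xi_i; xi_j)/N, under the assumed form
#   dm_eta/dt = tanh( beta * sum_eta' p_eta' * Q(eta; eta') * m_eta' ) - m_eta.

def integrate_sublattices(Q, etas, occup, beta=2.0, t_max=10.0, dt=0.01, seed=1):
    rng = np.random.default_rng(seed)
    m = rng.uniform(-0.1, 0.1, len(etas))          # small random initial magnetisations
    for _ in range(int(t_max / dt)):
        h = np.array([sum(occup[l] * Q(etas[k], etas[l]) * m[l]
                          for l in range(len(etas))) for k in range(len(etas))])
        m += dt * (np.tanh(beta * h) - m)
    return m

# Example: p = 1 stored pattern, components xi in {-1, +1}, Hopfield kernel Q(x; y) = x*y,
# giving just two sublattices with equal occupation numbers.
etas = [(-1,), (+1,)]
occup = [0.5, 0.5]
Q = lambda x, y: x[0] * y[0]
print(integrate_sublattices(Q, etas, occup))
# -> approaches (m*, -m*) or (-m*, m*), with |m*| solving m = tanh(beta*m);
#    non-zero below the critical temperature T_c = 1 (here T = 0.5).
```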
Note that from the sublattice magnetisations one can easily calculate the so-called 'overlap' order parameters (which can be written as averages over the n^p sublattice magnetisations):
Whether or not, in turn, there exists an autonomous set of laws at the level of overlaps depends on the form of the kernel Q(·;·).
3.2 Description at the Level of Overlaps
Equation (10) suggests that at the level of overlaps there will be, in turn, an autonomous set of dynamical laws if the kernel Q is bilinear, i.e. Q(x; y) = Σ_{μν} x_μ A_{μν} y_ν, or:
J_ij = (1/N) Σ_{μν} ξ_i^μ A_{μν} ξ_j^ν   (12)
Symmetric and non-symmetric models of this type were introduced and studied by Coolen and Ruijgrok (1988) and Shiino et al (1989); also the pseudo-inverse model (Kohonen 1984) is of the form (12). Now the components ξ_i^μ need not be drawn from a finite discrete set, as we will see (as long as they do not depend on N). The Hopfield model corresponds to choosing A_{μν} = δ_{μν} and ξ_i^μ ∈ {−1, 1}. The alignment fields h_k can now be written in terms of the overlap order parameters m_μ:
h_k(σ) = ξ_k · A m(σ),   m ≡ (m_1, ..., m_p)
Since for the present choice of macroscopic variables we find Δ_jμ = O(N^{-1}), the evolution in time of the overlap vector m becomes deterministic in the thermodynamic limit if (according to (6)):
lim_{N→∞} p / √N = 0
Again condition (8) holds, since
In the thermodynamic limit the evolution in time of the overlap vector m is governed by an autonomous set of differential equations; if the vectors ξ_k are drawn at random according to some distribution ρ(ξ), these dynamical laws become:
As in the case of the more familiar methodology of equilibrium statistical mechanics, the macroscopic laws one obtains are coupled and non-linear.
[Figure 1. Flow diagrams in the (m_1, m_2)-plane obtained by numerically solving eqns. (13) for p = 2, at T = 0.1, 0.6 and 1.1. Upper row: A_{μν} = δ_{μν} (the Hopfield model); lower row: a non-symmetric choice of A. For both models the critical temperature is T_c = 1.]
Only for very special cases will one be able to calculate all solutions analytically (Coolen et al 1989, Coolen 1991, Jonker and Coolen 1991). Generalisations of (13) to systems with spatial structure (where the order parameters also become position dependent: m(t) → m(x, t)) can be found in Coolen (1990) and Coolen and Lenders (1992). Figure 1 shows, in the (m_1, m_2)-plane, the result of solving the macroscopic laws (13) numerically for p = 2 and two choices of the matrix A. The first choice (upper row) corresponds to the Hopfield model; as the temperature increases, the amplitudes of the four attractors (corresponding to the two patterns ξ^1, ξ^2 and their mirror images −ξ^1, −ξ^2) continuously decrease, until at the critical temperature T_c = 1 they merge into the trivial attractor m = (0, 0). The second choice corresponds to a non-symmetric model (i.e. without detailed balance); at the macroscopic level of description (on finite timescales) the system clearly does not approach equilibrium; macroscopic order now manifests itself in the form of a limit-cycle (provided the temperature T is below the critical temperature T_c = 1 where this limit-cycle is destroyed). To what extent the laws (13) agree with the results of actual simulations in finite systems is illustrated in figure 2. Again symmetry of the interaction matrix is not required. For specific non-symmetric choices of the matrix A limit-cycle solutions exist (Coolen and Ruijgrok 1988, Nishimori et al 1990); competition between such limit-cycles and ordinary fixed-point attractors
[Figure 2. Comparison between simulation results for finite systems (N = 1000 and N = 2000) and the analytical prediction (13) for the evolution of the order parameters m (p = 2, T = 0.8), shown in the (m_1, m_2)-plane.]
gives rise to interesting phase transitions (Coolen and Sherrington 1992). In the symmetric case A_{μν} = A_{νμ} the system will approach equilibrium; the Liapunov function (11) now becomes:
L{m_μ} = ½ m·Am − (1/β) ⟨log cosh[β(ξ·Am)]⟩_ξ
If one is willing to pay the price of restricting oneself to the limited class of models (12) (as opposed to the more general class (9)) and to the more global level of description in terms of p overlap parameters m_μ instead of n^p sublattice magnetisations m_η, then there are two rewards. Firstly, there will be no restrictions on the stored quantities ξ_i^μ (for instance, they are allowed to be real-valued); secondly, the number p of patterns stored can be much larger for the deterministic autonomous dynamical laws to hold (p ≪ √N instead of p ≪ log N, which from a biological point of view is too restrictive).
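Before turning to the path-integral approach, here is a small numerical sketch of the overlap dynamics. The form integrated below, dm/dt = ⟨ξ tanh(βξ·Am)⟩_ξ − m, is assumed from the description of eqns (13) above rather than quoted from the text, and the non-symmetric matrix A in the example is an illustrative choice, not necessarily the one used in Figure 1.

```python
import numpy as np
from itertools import product

# Sketch of the macroscopic overlap dynamics for the bilinear separable model (12),
# under the assumed form dm/dt = < xi * tanh(beta * xi . A m) >_xi - m,
# with xi drawn uniformly from {-1,+1}^p.

def overlap_flow(A, beta, m0, t_max=20.0, dt=0.01):
    patterns = np.array(list(product([-1, 1], repeat=A.shape[0])), dtype=float)
    m = np.array(m0, dtype=float)
    traj = [m.copy()]
    for _ in range(int(t_max / dt)):
        drive = np.mean(patterns * np.tanh(beta * (patterns @ A @ m))[:, None], axis=0)
        m += dt * (drive - m)
        traj.append(m.copy())
    return np.array(traj)

if __name__ == "__main__":
    beta = 2.0                                      # T = 0.5 < T_c = 1
    A_hopfield = np.eye(2)                          # symmetric (Hopfield) case
    A_nonsym = np.array([[1.0, 1.0], [-1.0, 1.0]])  # illustrative non-symmetric choice
    print(overlap_flow(A_hopfield, beta, [0.3, 0.1])[-1])  # settles near a pattern state
    print(overlap_flow(A_nonsym, beta, [0.3, 0.1])[-5:])   # keeps rotating (limit-cycle-like)
```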
4. PATH INTEGRAL FORMALISM
A different approach is based on a formal solution of the microscopic dynamical laws (1) developed by Sommers (1987), in the spirit of a dynamical method for Langevin spins based on the calculation of a generating functional (De Dominicis, 1978), as applied to spin-glasses by Sompolinsky and Zippelius (1982) and Crisanti and Sompolinsky (1987). Now the step from the microscopic level of description in terms of spin states σ to the macroscopic level of description in terms of dynamic order parameters is taken after having (formally) solved the microscopic master equation. The local fields h_i are defined in a more general way,
h_i(σ; t) = Σ_{j=1}^N J_ij σ_j + θ_i(t),   J_ii ≡ 0,
in order to enable the calculation of time-dependent response functions and correlation functions by differentiation with respect to the time-dependent external fields θ_k(t). The solution of the master equation as given by Sommers holds for a specific type of initial conditions:
and takes the form of a path integral:
in which the complex functions m_i(t; θ, s) are the solutions of:
d/dt m_i = tanh[β h_i(s(t); t)] − m_i + i s_i [1 − m_i²],   m_i(0) = m_i
This expression is obtained by performing manipulations on the formal solution of the master equation:
in which T̂ is the time-ordering operator. We will not discuss the details of the derivation of (15) from (16), since firstly it is somewhat involved and, secondly, although there appears to be agreement in the literature on the final result (15), there is still discussion about which is the proper derivation (see Sommers, 1987, and Lusakowski, 1991). If, finally, we insert the unit operator
and introduce additional variables ĥ_i(τ) for representing the δ-functions as integrals, we obtain
d/dt m_i = tanh[β h_i(t)] − m_i + i s_i [1 − m_i²],   m_i(0) = m_i   (17)
The advantage of this (rather complicated) expression (17) for the solution p_t(σ) of the master equation is that the interaction variables {J_ij} appear linearly in the exponent, which, in cases where the interactions contain quenched disorder, enables one to average expectation values of self-averaging physical quantities quite easily. Application of Sommers' result to the dynamics of the Hopfield model by Rieger et al (1988) led to a dynamical verification of the equilibrium picture obtained by Amit et al (1985). The order parameters in this description are not only the familiar overlaps m_μ but, in addition, time-dependent correlation and response functions. More recently Shim et al (1991) applied a modified version of (15) to symmetric Hopfield-type systems with transmission delays. Note, finally, that the restriction to initial conditions of the type (14) is not dramatic, since any initial distribution (and corresponding t > 0 solution) can be written as an integral over distributions of the type (14). A similar method was used by Horner et al (1989), who obtained approximate dynamic mean-field equations for symmetric Hopfield-type models. Here the key object is a suitably defined generating functional Z, rather than the formal solution of the master equation. From Z, which is defined as an average over all trajectories generated by the microscopic stochastic process and all initial conditions, one can calculate observables and time-dependent response functions by functional differentiation. The path integral methods described in this section are powerful tools, in that they generate exact dynamical expressions at the microscopic level of description. Therefore they can in principle be applied in notorious cases where no alternative dynamical method is yet available (like having an extensive number p = αN of stored patterns in an N-neuron attractor model). The common ingredient of the applications of these methods to neural network dynamics that have appeared in the literature so far, however, is that in the aforementioned notorious cases, in order to arrive at analytical results, it appeared to be necessary either to restrict oneself to semi-static situations (Rieger et al, 1988) or to resort to interpolations between such semi-static situations and exact short-time solutions (Horner et al, 1989).
5. OUTLOOK
In this contribution we have restricted ourselves to McCulloch-Pitts/Ising spin models of neural networks. Most of the techniques we have described, however, can also be applied to dynamical models which employ different types of neural variables, different types of interactions (or interaction architectures) or different types of stochastic dynamical rules. We will briefly discuss some references in which such extensions can be found. Dynamical studies of diluted and fully connected models with non-Ising neural variables have been performed, for instance, for Langevin spins (Kree and Zippelius, 1987), phase variables (Noest, 1988), Potts variables (Bollé and Mallezie, 1989), continuous ('graded response') neurons (Mertens, 1991; Jedrzejewski and Komoda, 1992) and discrete multi-state spins (Bollé et al, 1992). Applications of dynamical methods to systems with multi-spin interactions (where the local alignment fields are no longer linear in the neural state variables) can be found in, for instance, Kanter (1988), Kühn et al (1989) and Tamarit et al (1991). The dynamics of models with layered feed-forward interaction architectures (where the analysis is simplified by the architecture, which allows for iteratively solving the dynamics layer by layer) has been studied by Domany et al (1989) (for Ising spin neurons) and Shim et al (1992) (for Potts spins). The macroscopic analysis in terms of dynamical order parameters for separable fully connected Ising spin models with
parallel stochastic dynamics has been performed by Bernier (1991) (the coupled non-linear differential equations for the macroscopic state variables now become coupled non-linear discrete mappings).
Fortunately, many interesting unsolved problems remain. There are, for instance, the as yet unexplained remanence effects in attractor networks where numerical simulations show that above the critical storage capacity line (in contrast to what the equilibrium theory predicts) the retrieval order parameter is found not to vanish (Amit and Gutfreund, 1987).
REFERENCES
Amit D.J., Gutfreund H. and Sompolinsky H.; 1985, Phys. Rev. A32, 1007
Amit D.J., Gutfreund H. and Sompolinsky H.; 1985, Phys. Rev. Lett. 55, 1530
Amit D.J. and Gutfreund H.; 1987, Annals of Physics 173, 30
Bernier O.; 1991, Europhys. Lett. 16, 531
Buhmann J. and Schulten K.; 1987, Europhys. Lett. 4, 1205
Coolen A.C.C.; 1990, in "Statistical Mechanics of Neural Networks", ed. L. Garrido (Berlin, Springer), 381
Coolen A.C.C.; 1991, Europhys. Lett. 16, 73
Coolen A.C.C., Jonker H.J.J. and Ruijgrok Th.W.; 1989, Phys. Rev. A40, 5295
Coolen A.C.C. and Ruijgrok Th.W.; 1988, Phys. Rev. A38, 4253
Coolen A.C.C. and Lenders L.G.V.M.; 1992, J. Phys. A25, 2577
Coolen A.C.C. and Sherrington D.; 1992, to be published in J. Phys. A
Crisanti A. and Sompolinsky H.; 1987, Phys. Rev. A36, 4922
De Dominicis C.; 1978, Phys. Rev. B18, 4913
Derrida B., Gardner E. and Zippelius A.; 1987, Europhys. Lett. 4, 167
Domany E., Kinzel W. and Meir R.; 1989, J. Phys. A22, 2081
van Hemmen J.L. and Kühn R.; 1986, Phys. Rev. Lett. 57, 913
van Hemmen J.L. and Kühn R.; 1991, in "Models of Neural Networks", ed. E. Domany, J.L. van Hemmen and K. Schulten (Berlin, Springer), 1
Herz A.V.M., Li Z. and van Hemmen J.L.; 1991, Phys. Rev. Lett. 66, 1370
Hopfield J.J.; 1982, Proc. Natl. Acad. Sci. USA 79, 2554
Horner H., Bormann D., Frick M., Kinzelbach H. and Schmidt A.; 1989, Z. Phys. B76, 381
Jedrzejewski J. and Komoda A.; 1992, Europhys. Lett. 18, 275
Jonker H.J.J. and Coolen A.C.C.; 1991, J. Phys. A24, 4219
Kanter I.; 1988, Phys. Rev. A38, 5972
Kree R. and Zippelius A.; 1987, Phys. Rev. A36, 4421
Kree R. and Zippelius A.; 1991, in "Models of Neural Networks", ed. E. Domany, J.L. van Hemmen and K. Schulten (Berlin, Springer), 193
Kühn R. and van Hemmen J.L.; 1991, in "Models of Neural Networks", ed. E. Domany, J.L. van Hemmen and K. Schulten (Berlin, Springer), 213
Kühn R., van Hemmen J.L. and Riedel U.; 1989, J. Phys. A22, 3123
Lusakowski A.; 1991, Phys. Rev. Lett. 66, 2543
Luttinger J.M.; 1976, Phys. Rev. Lett. 37, 778
Mattis D.C.; 1976, Phys. Lett. A56, 421
Mertens S.; 1991, J. Phys. A24, 337
Nishimori H., Nakamura T. and Shiino M.; 1990, Phys. Rev. A41, 3346
Noest A.J.; 1988, Europhys. Lett. 6, 469
Noest A.J.; 1988, Phys. Rev. A38, 2196
Riedel U., Kühn R. and van Hemmen J.L.; 1988, Phys. Rev. A38, 1105
Rieger H., Schreckenberg M. and Zittartz J.; 1988, Z. Phys. B72, 523
Shiino M., Nishimori H. and Ono M.; 1989, J. Phys. Soc. Japan 58, 763
Shim G.M., Choi M.Y. and Kim D.; 1991, Phys. Rev. A43, 1079
Shim G.M., Kim D. and Choi M.Y.; 1992, Phys. Rev. A45, 1238
Sommers H.J.; 1987, Phys. Rev. Lett. 58, 1268
Sommers H.J.; 1987, Habilitationsschrift, Essen
Sompolinsky H. and Zippelius A.; 1982, Phys. Rev. B25, 6860
Tamarit F.A., Stariolo D.A. and Curado E.M.F.; 1991, Phys. Rev. A43, 7083
Information Theory and Neural Networks
John G. Taylor and Mark D. Plumbley
Centre for Neural Networks, Department of Mathematics, King's College London, Strand, London WC2R 2LS, UK
1 Introduction
Ever since Shannon's "Mathematical Theory of Communication" [40] first appeared, information theory has been of interest to psychologists and physiologists, to try to provide an explanation for the process of perception. Attneave [8] proposed that visual perception is the construction of an economical description of a scene from a very redundant initial representation. Barlow [9] suggested that lateral inhibition in the visual pathway may reduce the redundancy of an image, so information can be represented more efficiently. More recently, Linsker with his 'Infomax' principle [21, 22], Atick and Redlich [5, 6], and Plumbley and Fallside [30, 32] have continued with this approach with considerable success. There have also been important advances in data compression techniques associated with principal component analysis. The original work of Oja [23] has now been extended to the analysis of higher-order statistics by Taylor and Coombes [45], and these techniques are presently attracting a great amount of interest. They have implications for the rapid feature analysis of patterns which are not linearly separable, and for higher order curve and surface fitting. Finally there is a very general approach to the information-theoretic analysis of neural networks, using differential geometric ideas, by Amari [2]. This appears to be the deepest usage of information theory in neural networks, although only very preliminary results are available at present.
2 Information Theory and Learning
2.1 Information Theory
Let us briefly introduce some concepts from information theory. In particular, we are interested in the entropy of a random variable, and the mutual information between two random variables [40].
Consider an experiment with n possible outcomes x, 1 ≤ x ≤ n, with respective probabilities p_x. The entropy, H(X), of this system is defined by the formula
H(X) = −Σ_{x=1}^n p_x log p_x   (1)
Entropy gives us a measure of the uncertainty in a system. If the probabilities of the various outcomes are equal, then the entropy will be maximised. On the other hand, if one of the outcomes is certain to happen, so its probability is one and the other outcomes have probability zero, then the entropy will be zero. (For this to work properly in the limit, we take p_x log p_x → 0 in the limit p_x → 0.) As an example, for a fair coin toss X, with n = 2 and p_head = p_tail = 1/2, we have
H(X) = −(½ log ½ + ½ log ½) = log 2.
However, suppose we have seen the coin land with e.g. heads uppermost: call this observation Y. We now know what the outcome of the toss is, so we have p_head = 1 and p_tail = 0. If we measure the entropy now, we get
H(X|Y) = −(1 log 1 + 0 log 0) = 0 + 0 = 0
i.e. no uncertainty, since we know that the coin landed with heads uppermost. We use H(X|Y) to denote the entropy in X given that we know Y. We shall see shortly that the information in the observation Y about the coin toss X is the difference between these two entropies. More formally, if we have two random variables X and Y which take values x and y in the ranges 1 ≤ x ≤ n and 1 ≤ y ≤ m, and have joint probabilities p_{xy}, their joint entropy is defined to be
H(X, Y) = −Σ_{x,y} p_{xy} log p_{xy}   (2)
and the conditional entropy of X given Y is
H(X|Y) = −Σ_{x,y} p_{xy} log(p_{xy}/p_y) = H(X, Y) − H(Y)   (3)
If X and Y are independent, then p_{xy} = p_x p_y, so
H(X|Y) = −Σ_{x,y} p_{xy} log(p_x p_y / p_y) = −Σ_x p_x log p_x = H(X).
The mutual information I(X, Y) between X and Y is defined by the formula
I(X, Y) = H(X) − H(X|Y) = Σ_{x,y} p_{xy} log( p_{xy} / (p_x p_y) )   (4)
which is zero if X and Y are independent. Note that I(X, Y) = I(Y, X), so the information in Y about X is the same as the information in X about Y. Another information-theoretic measure is the I-divergence or cross-entropy distortion measure D_CE from one probability distribution to another. The cross-entropy from a 'true' probability distribution p to an 'estimated' probability distribution q is defined by the expression
D_CE(p, q) = Σ_i p_i log(p_i / q_i)
which is zero if p and q are identical, but positive otherwise. Cross-entropy is asymmetric, so that in general D_CE(p, q) ≠ D_CE(q, p). By substituting p_i = p_{xy} and q_i = p_x p_y in the expression for D_CE above, it can be seen that the mutual information between X and Y is the cross-entropy from the true probability distribution p_{xy} governing X and Y, to the probability distribution p_x p_y which would govern them if they were independent (but had the same marginal distributions). If the random variables X and Y are continuous, we use a probability density p(x, y) instead of the probability distribution p_{xy}, and an integral instead of a sum in our expressions for entropy and mutual information:
H(X) = −∫_{−∞}^{∞} p(x) log p(x) dx
For example, if X has an N-dimensional Gaussian probability density with covariance matrix C_X, its entropy will be given by
H(X) = log( (2πe)^{N/2} (det C_X)^{1/2} )   (7)
For the special case where Y is a function of X with some additive noise Φ, i.e.
Y = f(X) + Φ,
then the mutual information between Y and X is given by
I(Y, X) = H(Y) − H(Φ)   (8)
i.e. the entropy of the 'signal plus noise' less the entropy of the 'noise'. If both Y and Φ are Gaussian with covariance matrices C_Y and C_Φ respectively, we get
I(Y, X) = log((det C_Y)^{1/2}) − log((det C_Φ)^{1/2}).
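A short numerical illustration of these definitions follows; the joint probability table and covariance matrices are invented for the example and are not taken from the text.

```python
import numpy as np

# Sketch of the discrete definitions (2)-(4) and the Gaussian relation (8):
# entropy and mutual information from a joint probability table.

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def mutual_information(p_xy):
    p_x = p_xy.sum(axis=1)
    p_y = p_xy.sum(axis=0)
    return entropy(p_x) + entropy(p_y) - entropy(p_xy.ravel())  # = H(X) - H(X|Y)

p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])          # a correlated pair of binary variables (assumed)
print(mutual_information(p_xy))        # > 0 because X and Y are not independent

# Gaussian 'signal plus noise' case: I(Y,X) = 1/2 log det C_Y - 1/2 log det C_Phi
C_Y   = np.array([[2.0, 0.5], [0.5, 1.5]])   # covariance of the noisy output (assumed)
C_Phi = 0.1 * np.eye(2)                      # covariance of the additive noise (assumed)
print(0.5 * (np.log(np.linalg.det(C_Y)) - np.log(np.linalg.det(C_Phi))))
```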
2.2 Supervised Learning
2.2.1 BackProp with Mean Squared Error Distortion
One of the major uses for information theory has been in the interpretation and guidance of unsupervised neural networks: networks which are not provided with a teacher or target output which they are to emulate. However, we can see how information relates to the more familiar supervised learning schemes, and in particular to the use of Error Back Propagation ('BackProp') to minimise mean squared error (MSE) in a multi-layer perceptron (MLP).
Figure 1: Supervised neural network with distortion measure D.
Given a particular input X, supervised learning attempts to produce an output Y as 'close' as possible to some target output R, where closeness is measured with some distortion measure D (Fig. 1). Typically, the MSE distortion measure E = D_MSE(R, Y) = E(|Y − R|²) is used. BackProp calculates the derivative of E with respect to each of the weights (the adjustable parameters) in the network, and adapts each weight to move 'downhill' in this error landscape towards the minimum distortion position [37]. Consider for the moment an alternative, information-theoretic, approach. Suppose we would like the output Y to contain as much information as possible about the target R. We can show that the two approaches are related, in that using BackProp to minimise MSE represents an approximation to maximising the information I(R, Y) in Y about R. If we write Φ = Y − R, minimising MSE amounts to minimising the term
E(|Φ|²) = Trace(C_Φ), where C_Φ = E(ΦΦ^T) is the correlation matrix of Φ. On the other hand, maximising the information I(R, Y) = H(R) − H(R|Y) is equivalent to minimising the entropy of R given Y, H(R|Y), since H(R) is outside our control and is independent of the changes made in the network. It is simple to show [28] that
H(R|Y) = H(Φ|Y) ≤ H(Φ), and
H(Φ) ≤ log( (2πe)^{N/2} (det C_Φ)^{1/2} ), where N is the dimensionality of the output. Now, since (det C_Φ)^{1/N} ≤ Trace(C_Φ)/N by the arithmetic/geometric mean inequality, minimising the MSE Trace(C_Φ) will minimise an upper bound on H(R|Y), and thus maximise a lower bound on the information I(R, Y) in Y about R. Due to the number of inequalities appearing in the process, this bound may not be particularly tight, especially when the errors are 'unbalanced' in some way. See [28] for a discussion of some of the implications of this.
2.2.2 Cross-Entropy Distortion
Other distortion measures are possible in place of MSE. In particular, the information-theoretic cross-entropy distortion [43] has been suggested. The use of this distortion measure relies on both the output and the target representing probability distributions directly. Thus a single output must be restricted to lie between 0 and 1: the sigmoid function σ(u) = 1/(1 + exp(−u)) as the final stage in the output unit will conveniently achieve this. With more than one output unit, they can either be treated as independent features, in which case the total distortion measure is given by
or they can be treated as a single probability distribution, in which case the total distortion measure is given by
In the latter case, the additional restriction that the outputs must sum to one must be enforced, to ensure the outputs form a true probability distribution. Bridle's Softmax function [13]
can be used to enforce this condition.
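A minimal sketch of the two distortion variants, with the logistic sigmoid for independent output features and the softmax for outputs forming a single distribution; the target and activation values are made up for illustration.

```python
import numpy as np

# Sketch of the two cross-entropy distortion variants described above.

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def softmax(u):
    e = np.exp(u - u.max())            # subtract the max for numerical stability
    return e / e.sum()

def d_ce_independent(r, y, eps=1e-12):
    # independent binary features: sum of r log(r/y) + (1-r) log((1-r)/(1-y))
    y = np.clip(y, eps, 1 - eps)
    return np.sum(r * np.log((r + eps) / y) + (1 - r) * np.log((1 - r + eps) / (1 - y)))

def d_ce_distribution(r, y, eps=1e-12):
    # outputs and targets each treated as one probability distribution
    return np.sum(r * np.log((r + eps) / np.clip(y, eps, None)))

r = np.array([1.0, 0.0, 1.0])                   # hypothetical targets
y = sigmoid(np.array([2.0, -1.0, 0.5]))         # hypothetical sigmoid outputs
print(d_ce_independent(r, y))

r_dist = np.array([0.7, 0.2, 0.1])              # hypothetical target distribution
y_dist = softmax(np.array([1.0, 0.2, -0.5]))    # softmax enforces sum-to-one
print(d_ce_distribution(r_dist, y_dist))
```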
2.3 Unsupervised Learning: Direct Approaches
2.3.1 G-Max: Maximising Cross-Entropy
An early unsupervised learning algorithm derived from an information-theoretic approach was the G-Maximization algorithm suggested by Pearlmutter and Hinton [27]. They used a single binary neuron with N binary inputs x_i and corresponding weights w_i, i = 1, ..., N, and a single output y, which can be either '0' or '1'. The probability that the output state is '1' is
where σ(·) is the logistic sigmoid function
σ(x) = 1 / (1 + exp(−x))   (10)
They wanted the unit to discover regularities in the input patterns, i.e. patterns that occur more often than would be expected if the activities of the individual input lines were assumed to be independent.
To this end, they chose as their target function the cross-entropy distortion
between the true output probability distribution p_y and an 'independent input activity' probability distribution q_y of the output. The algorithm is run in two phases. The first phase is run with the real input data to accumulate the output statistics which define the output probability distribution p_y. The probabilities of each of the inputs are also measured for use in the second phase. In the second phase, the inputs x_i are set randomly to '1' or '0' with the same probabilities that they had in the first phase, but independently of each other. The output statistics are measured in this second phase: this defines the probability distribution q_y. Since the cross-entropy measure used can be differentiated with respect to each of the weights w_i, a steepest descent search can then be used to modify the weights to maximise the G measure. When this algorithm was tested on a 10 x 10 'retina' exposed to randomly positioned and oriented edges, the unit typically developed a centre-on surround-off response (or vice versa). This type of response is typical of retinal ganglion cells. Unfortunately, it was rather awkward to extend the G-max algorithm to more than one output unit, since some additional mechanism has to be employed to prevent all the units learning to respond to the same features of the input.
2.3.2 I-Max: Maximising Mutual Information between Output Units
More recently, Becker and Hinton [11] suggested that the information between output units could be used as the objective function for an unsupervised learning technique (Fig. 2). In a visual system, this scheme would attempt to extract higher-order features of the visual scene which are coherent over space (or time). For example, if two networks each produce a single output from two separate but neighbouring patches of a retina, the objective of their algorithm is to maximise the mutual information I(Y_1, Y_2) between these two outputs. A steepest-ascent procedure can be used to find the maximum of this mutual information function, both for binary- and real-valued units. One application of this principle is the extraction of depth from random-dot stereograms. Nearby patches in an image usually view objects of a similar depth, so if the mutual information between neighbouring patches is to be maximised, the outputs from both output units y_1 and y_2 should correspond to the information which is common between the patches, rather than that which is different. In other words, the outputs should both learn to extract the common depth information rather than any other property of the random dot patterns. For binary-valued units, with each unit similar to that used by the G-max scheme described above, the mutual information I(Y_1, Y_2) between the two output units is
so if the probability distributions P(y_1), P(y_2) and P(y_1, y_2) are measured, this mutual information can be calculated directly. Of course, it is sufficient to measure P(y_1, y_2) only,
since
P(y_1) = Σ_{y_2=0}^{1} P(y_1, y_2),
and similarly for P(y_2).
[Figure 2: Maximising Mutual Information between Output Units. Two networks, with outputs Y_1 and Y_2, receive input from neighbouring patches of a retina.]
The derivative of (11) can be taken with respect to the weights in the network for each different input pattern, so enabling the steepest-ascent procedure to be used. For real-valued outputs it would be impossible to measure the entire probability distribution P(Y_1, Y_2), so instead it is assumed that the two outputs have a Gaussian probability distribution, and that one of the outputs is a noisy version of the other, with independent additive Gaussian noise. In this case, the information I(Y_1, Y_2) between the two outputs can be calculated from the variance of one of the outputs (the 'signal') and the variance of the difference between the outputs (the 'noise') as
I(Y_1, Y_2) ≈ ½ log( σ²_{Y_1} / σ²_{Y_1 − Y_2} )   (12)
where σ²_{Y_1} is the variance of the output of the first unit, and σ²_{Y_1−Y_2} is the variance of the difference between the two outputs. If we accumulate the mean and variance of both Y_1 and Y_1 − Y_2, it is possible to find the derivative of (12) for each input pattern, with respect to each weight value. Thus the weights in the network can be updated in a steepest-ascent procedure to maximise I(Y_1, Y_2), or at least the approximation to I(Y_1, Y_2) given by (12).
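The Gaussian approximation (12) can be illustrated in a few lines; the 'common signal plus independent noise' data below is an invented stand-in for the two network outputs.

```python
import numpy as np

# Sketch of the Gaussian approximation (12): estimate I(Y1, Y2) from samples as
# 1/2 * log( var(Y1) / var(Y1 - Y2) ).  The toy data is an illustrative assumption.

rng = np.random.default_rng(0)
n = 10000
s = rng.normal(size=n)                 # common underlying signal (e.g. depth)
y1 = s + 0.3 * rng.normal(size=n)      # two noisy views of the same signal
y2 = s + 0.3 * rng.normal(size=n)

info = 0.5 * np.log(np.var(y1) / np.var(y1 - y2))
print(info)                            # larger when the common component dominates
```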
Becker and Hinton found that unsupervised networks using this principle could learn to extract depth information from random-dot stereograms with either binary- or continuous-valued shifts, as appropriate for the type of outputs used, although in some cases it helped to force the units to share weight values, further enforcing the idea that the units should calculate the same function of the input. They generalised their scheme to allow networks with hidden layers, and also to allow multiple output units, with each unit maximising the mutual information between itself and a value predicted from its neighbouring units. This latter scheme allowed the system to discover an interpolation for curved surfaces. Subsequently, Zemel and Hinton [49] generalised this procedure to allow for more than one output per module. In their network, they use 4 outputs per module to attempt to identify four degrees of freedom in 2-dimensional objects: horizontal and vertical position, orientation, and size. The mutual information measure is now
where C_{Y_1+Y_2} is the covariance matrix of the sum of Y_1 and Y_2 (now vectors, of course), and C_{Y_1−Y_2} is the covariance matrix of their difference. By measuring the degree of mismatch between the two representations, the model can tell roughly how much one half of an object is perturbed away from the position and orientation which is consistent with the other half of the object.
3 Predictive Coding
3.1 Redundancy and Visual Scenes
Soon after the introduction of his Information Theory, Shannon [42] used a prediction 'game' to investigate the information content of English text. He showed that the entropy of text is just over 1 bit per letter, much less than the 4.7 bits (= log_2 26) which would be needed if the 26 letters of English text were independent and occurred with equal frequency. This is itself much less than the 8 bits per letter used to represent text in a typical computer. This shows that there is a significant amount of redundancy in a typical passage of text: redundancy is the ratio of the information capacity used to represent a signal to the information actually present in the signal. If a communication system could take advantage of this redundancy, text could be transmitted more economically than before, since each bit used costs a certain amount of resources to transmit. Attneave [8] suggested that the process of perception may be creating an economical description of sensory inputs, by reducing the redundancy in the stimulus. As a concrete example, he considered a picture of an ink bottle on a desk, similar to Fig. 3. If a subject is asked to predict the colours in the picture, scanning along the rows from left to right, taking successive rows down the picture, most mistakes are made at the edges and corners in the picture. Applying this 'Shannon game' to the scene suggests that most information is concentrated around the edges and corners of a scene, which fits well with psychological suggestions that these features are important in a visual scene. It is also strengthened by the later observations by Hubel and Wiesel [20] of 'line detector' cells in the visual cortex.
Figure 3: A redundant visual stimulus (after Attneave [8]).
Using this redundancy reduction hypothesis, Barlow [9] suggested that lateral inhibition may be producing an economical description of the scene. By subtracting a weighted sum of surrounding sensor responses, the resulting activations are more independent (less redundant) than the original stimulus, while the total information content remains unchanged. Using lateral inhibition, the visual scene can be represented more economically.
3.2 Whitening filters and predictive coding
The original justification for the redundancy reduction hypothesis considered, for example, signals with a finite number of levels, such as just noticeable differences (JNDs). However, a related approach was formulated by Shannon for economical transmission of signals through a noisy channel [41]. Consider a system (a neural network in this case) which is transmitting its real-valued output signal Y through a channel where it will be corrupted by additive noise Φ (Fig. 4).
Figure 4: Network output signal Y is corrupted by noise Φ.
If there were no restrictions on Y, we could simply amplify it until we had overcome as much of the noise as we like. However, suppose that there is a power cost
S_T = ∫_0^B S(f) df
associated with transmitting the signal Y through the channel, where S(f) is the power spectral density of Y (the 'signal') at frequency f, and B is a bandwidth limit. Then we wish to maximise the transmitted information (assuming both signal and noise are Gaussian)
where N(f) is the power spectral density of Φ (the 'noise'), for a given power cost S_T. Using the Lagrange multiplier technique, we attempt to maximise
as a function of S(f) for every 0 ≤ f ≤ B. This is the case when
S(f) + N(f) = constant   (17)
so if N(f) is white noise, i.e. the power spectral density is uniform (or flat), the power spectral density S(f) should also be uniform [41]. A filter which performs this flattening is called a whitening filter. It is well known that a signal with flat power spectral density has an autocorrelation function R_{YY}(τ) = E(Y(t) Y(t+τ)) which is proportional to a delta function δ(τ). In other words, the time-varying output signal Y(t_1) at any time t_1 should be uncorrelated with the signal Y(t_2) at any other time t_2 ≠ t_1. This approach leads to an equivalent condition when the signal is represented over a regular grid of units in space instead of time, such as a signal from a grid of visual receptors. In this case the signal should be transformed so that the outputs of the transform are uncorrelated with each other. One way of achieving this decorrelation of outputs is to use the technique of linear predictive coding (see e.g. [26]). To see how this works, consider a time-sequence of input values x_i, x_{i-1}, ..., where x_i is the most recent value. We can form the least mean square (LMS) linear prediction x̂_i of x_i from the previous values as follows:
x̂_i = a_1 x_{i−1} + a_2 x_{i−2} + ⋯   (18)
where the coefficients a_j
are chosen to minimise the expected squared error
E[ (x_i − x̂_i)² ].   (19)
Taking the derivative of this with respect to each coefficient a_j, the condition for this minimum is
E[ (x_i − x̂_i) x_{i−j} ] = 0   (20)
for all j > 0. If we take the residual y_i = x_i − x̂_i to be the output of our transform, the LMS linear prediction gives us
E[ y_i x_{i−j} ] = 0   (21)
for all j > 0, and therefore
E[ y_i y_k ] = 0   (22)
for all k < i, since y_k = x_k − (a_1 x_{k−1} + a_2 x_{k−2} + ⋯). Thus linear predictive coding has given us the uncorrelated outputs we need.
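The following sketch fits LMS prediction coefficients on an assumed AR(1) toy signal and checks the decorrelation properties (20)-(22) of the residuals numerically; the signal model and parameters are illustrative assumptions.

```python
import numpy as np

# Sketch of LMS linear predictive coding (18)-(22): fit prediction coefficients a_j
# and verify that the residuals y_i are (approximately) uncorrelated.

rng = np.random.default_rng(0)
n, order = 20000, 2
x = np.zeros(n)
for i in range(1, n):                  # AR(1) toy input: x_i = 0.8 x_{i-1} + noise
    x[i] = 0.8 * x[i - 1] + rng.normal()

# Least-squares fit of x_i ~ a_1 x_{i-1} + ... + a_order x_{i-order}
X = np.column_stack([x[order - j:n - j] for j in range(1, order + 1)])
target = x[order:]
a, *_ = np.linalg.lstsq(X, target, rcond=None)

residual = target - X @ a              # y_i = x_i - x_hat_i
print("a =", a)                                                # close to (0.8, 0) here
print("corr(y_i, x_{i-1}) =", np.corrcoef(residual, x[order - 1:n - 1])[0, 1])
print("corr(y_i, y_{i-1}) =", np.corrcoef(residual[1:], residual[:-1])[0, 1])
```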
Figure 5: Linear decorrelating networks ( M = 2).
3.3 Local Decorrelating Algorithms
One of the early suggestions for learning in neural networks was Hebb's [19] principle, that the effectiveness of the connection between two units should be increased when they are both active at the same time. This has been used as the basis of a number of artificial neural network learning algorithms, so-called Hebbian algorithms, which increase a connection weight in proportion to the product of the unit activations at each end of the connection. If the connection weight decreases (or increases its inhibition) in proportion to the product of the unit activations, this is called anti-Hebbian learning. A number of anti-Hebbian algorithms have been proposed to perform decorrelation of output units. For example, Barlow and Földiák [10] have suggested a network with linear recurrent lateral inhibitory connections (Fig. 5(a)) with an anti-Hebbian local learning algorithm. In vector notation, we have an M-dimensional input vector x, an M-dimensional output vector y, and an M x M lateral connection matrix V. For a fixed input, the lateral connections cause the output values to evolve according to the expression
(y_i)_{t+1} = x_i − Σ_j v_{ij} (y_j)_t,   i.e.   y_{t+1} = x − V y_t   (23)
at time step t, which settles to an equilibrium when y = x − Vy, which we can write as
y = (I_M + V)^{-1} x   (24)
provided (I_M + V) is positive definite. We assume that this settling happens virtually instantaneously. The matrix V is assumed to be symmetric, so that the inhibition from unit i to unit j is the same as the inhibition from j to i, and for the moment we assume that there are no connections from a unit back to itself, so the diagonal entries of V are zero. Barlow and Földiák [10] suggested that for each input x, the weights v_{ij} between different units should be altered by a small change Δv_{ij} = η y_i y_j (i ≠ j), where η is a small update factor. In vector notation this is
ΔV = η offdiag(y y^T)   (25)
since the diagonal entries of V remain fixed at zero. This algorithm converges when E(y_i y_j) = 0 for all i ≠ j, and thus causes the outputs to become decorrelated [10]. Atick and Redlich [7] considered a similar network, but with an integrating output dy/dt = x − Vy, leading to y = V^{-1}x when it has settled. They show that a similar algorithm for the lateral inhibitory connections between different output units leads to decorrelated outputs, while reducing an information-theoretic redundancy measure. The algorithms considered so far simply decorrelate their outputs, but ignore what happens to the diagonal entries of the covariance matrix. For a signal with statistics which are position-independent, such as images on a regularly-spaced grid of receptors, we can consider the problem in the spatial frequency domain. Decorrelation is optimal, as we have seen above, and the variances of all the outputs will happen to be equal. If we do not have position-independent statistics, we can go back to the power-limited noisy channel argument, but use the actual output covariance matrix instead of working in the frequency domain. For small output noise, we can express the transmitted information as
I(Y, X) = ½ log det C_Y − ½ log det C_Φ   (27)
and the power cost as S_T = Trace(C_Y).
Using the Lagrange multiplier technique again, we wish to maximise J = I(Y, X) − ½ λ S_T, which leads to the condition [30]
C_Y = (1/λ) I_M.   (29)
In other words, not only should the outputs be decorrelated, but they should all have the same variance, E(y_i²) = 1/λ. The Barlow and Földiák [10] algorithm can be modified to achieve this, if self-inhibitory connections from each unit back to itself are allowed [30]. The algorithm becomes Δv_{ij} = η( y_i y_j − (1/λ) δ_{ij} )
i.e.
ΔV = η( y y^T − (1/λ) I_M )   (31)
which monotonically increases J as it progresses. This is perhaps a little awkward, since the self-inhibitory connections have a different update algorithm to the normal lateral inhibitory connections. As an alternative, a linear network with inhibitory interneurons (Fig. 5(b)) can be used. After an initial transient, this network settles to
y = x − Vz  and  z = V^T y   (32)
i.e.
y = (I + VV^T)^{-1} x   (33)
where v_{ij} is now the weight of the excitatory (positive) connection from y_i to z_j, and also the weight of the inhibitory (negative) connection back from z_j to y_i. Suppose that the weights in this network are updated according to the algorithm
Δv_{ij} = η( y_i z_j − (1/λ) v_{ij} )   (34)
[Figure 6: a system transmitting through a filter G(f), with both input (receptor) noise and output (channel) noise.]
which is a Hebbian (or anti-Hebbian) algorithm with weight decay, and is
ΔV = η( y z^T − (1/λ) V )
in vector notation. Then the algorithm will converge when C_Y = (1/λ) I_M, which is precisely what we need to maximise J. In fact, this algorithm will also monotonically increase J as it progresses. This network suggests that inhibitory interneurons, which are found in many places in sensory systems, may be performing some sort of decorrelation task. Not only does the condition of decorrelated equal-variance output optimise information transmission for a given power cost, but it can also be achieved by various biologically-plausible Hebb-like algorithms.
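A minimal numerical sketch of a decorrelating network of this kind follows, using the lateral-inhibition form y = (I + V)^{-1} x with the update ΔV = η(y y^T − (1/λ)I). The input statistics and learning parameters are assumptions, and the code illustrates the principle rather than reproducing the algorithms of [10] or [30] exactly.

```python
import numpy as np

# Sketch: anti-Hebbian decorrelation with variance control.  The update
# Delta V = eta * (y y^T - (1/lambda) I) should drive the output covariance
# towards C_y = (1/lambda) * I.  All statistics and parameters are illustrative.

rng = np.random.default_rng(0)
M, eta, lam = 3, 0.01, 2.0
mix = rng.normal(size=(M, M))          # mixing matrix -> correlated, unequal-variance inputs
V = np.zeros((M, M))

for _ in range(20000):
    x = mix @ rng.normal(size=M)
    y = np.linalg.solve(np.eye(M) + V, x)        # settled network output, eq. (24)
    V += eta * (np.outer(y, y) - np.eye(M) / lam)
    V = 0.5 * (V + V.T)                          # keep the lateral weights symmetric

# Estimate the output covariance after learning
ys = np.array([np.linalg.solve(np.eye(M) + V, mix @ rng.normal(size=M))
               for _ in range(5000)])
print(np.cov(ys.T))                    # approximately (1/lambda) * I = 0.5 * I
```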
3.4 Optimal filtering
Srinivasan, Laughlin and Dubs [44] suggested that predictive coding is used in the fly's visual system to perform decorrelation. They compared measurements from the fly with theoretical results based on predictive coding of typical scenes, and found reasonably good agreement at both high and low light levels. However, they did find a slight mismatch, in that the surrounding inhibition was a little more diffuse than the theory predicted. A possible problem with the original predictive coding approach is that only the output noise is considered in the calculation of information: the input noise is assumed to be part of the signal. At low light levels, where the input noise is a significant proportion of the input, the noise is simply considered to change the input power spectrum, making it flatter [44]. This assumption means that the predictive coding is an approximation to a true optimal filter: the approximation is likely to be worse either for high frequency components, where the original signal power spectral density is small, or for low light conditions, where all signal components are small. In fact, it is possible to analyse the system for both input and output noise (Fig. 6). We can take a similar Lagrange multiplier approach as before, and attempt to maximise transmitted information for a fixed power cost. Omitting the details, we get the following quadratic equation to solve for this optimal filter at every frequency f [33]
where R_c is the channel signal-to-noise power spectral density ratio, R_x is the receptor signal-to-noise power spectral density ratio S_x/N_x, and γ is a Lagrange multiplier which determines the particular optimal curve to be used. This leads to a non-zero filter gain G(f) whenever R_x > [(γ/N_c) − 1]^{-1}. For constant N_c (corresponding to a flat channel noise spectrum) there is therefore a certain cut-off point below which noisy input signals will be suppressed. Fig. 7 shows a typical optimal solution, together with its asymptotes.
Figure 7: Typical optimal filter solution, for equal white receptor and channel noise.
Plumbley [29, 31] has been investigating modifications to the decorrelating algorithms mentioned above which may learn to approximate this optimal filtering behaviour. Atick and Redlich [5] used a similar optimal filtering approach in their consideration of the mammalian visual system, minimising redundancy for fixed information rather than maximising information for fixed power. They compared their theory with the spatiotemporal response of the human visual system, and found a very good match [4]. These results suggest very strongly that economical transmission of information is a major factor in the organization of the visual system, and perhaps other sensory systems as well.
4 Principal Component Analysis and Infomax
Principal component analysis (PCA) is widely used for dimension reduction in data analysis and pre-processing, and appears under a variety of names such as the (discrete) Karhunen-Loève Transform (KLT), factor analysis, or the Hotelling Transform in image processing. Its primary use is to provide a reduction in the number of parameters used to represent a quantity, while minimising the error introduced by so doing. In the case of PCA, a purely linear transform is used to reduce the dimensionality of the data, and it is the transform which minimises the mean squared reconstruction error. This is the error which we get if we transform the output y back into the input domain to try to reconstruct the input x so that the error is minimised. Linsker's principle of maximum information preservation, 'Infomax', can be applied to a number of different forms of neural network. The analysis, however, is much simpler when we are dealing with simple networks, such as binary or linear systems. It is instructive to look at the linear case of PCA in some detail, since much effort in other fields has been directed at linear systems. We should not be too surprised to find a neural network system which can perform KLT and PCA. From one point of view, these conventional data processing methods let us know what to expect from a linear unsupervised neural network. However, the information-theoretic approach to the neural network system can help us with the conventional data processing methods. In particular, we shall find that a dilemma in the use of PCA, known as the scaling problem, can be clarified with the help of information theory.
4.1 The Linear Neuron
Arguably the simplest form of unsupervised neural network is an N-input, single-output linear neuron (Fig. 8).
Figure 8: The Oja Neuron.
Its output response y is simply the sum of the inputs x_i multiplied by their respective weights w_i, i.e.
y = Σ_{i=1}^N w_i x_i
or, in vector notation, y = w^T x
where w = [w_1, ..., w_N]^T and x = [x_1, ..., x_N]^T are column vectors. The output y is thus the dot product x·w of the input x with the weight vector w. If w is a unit vector, i.e. |w| = 1, y is the component of x in the direction of w (Fig. 9).
Figure 9: Output y as a component of x, with unit weight vector w.
We thus have a simple neuron which finds the component of the input x in a particular direction. We would now like to have a neural network learning rule for this system, which will modify the weight vector depending on the inputs which are presented to the neuron.
4.2 The Oja Principal Component Finder
A very simple form of Hebbian learning rule would be to update each weight by the product of the activations of the units at either end of the weight. For the single linear neuron (Fig. 8), this would result in a learning algorithm of the form
Δw_i = η x_i y   (39)
or in vector notation
Δw = η x y.
Unfortunately, this learning algorithm alone would cause any weight to increase without bound, so some modification has to be used to prevent the weights from becoming too large. One possible solution is to limit the absolute values that each weight w_i can take [46], while another is to renormalise the weight vector w to have unit length after each update [23]. An alternative is to use a weight decay term which causes the weight vector to tend to have unit length as the algorithm progresses, without explicitly normalising it. To see how this works, consider the following weight update algorithm, due to Oja [23]:
Δw = η( x y − w y² ) = η( x x^T w − w (w^T x x^T w) ).   (41)
When the weight vector is small, the update algorithm is dominated by the first term on the right hand side, which causes the weight to increase as for the unmodified Hebbian algorithm. However, as the weight vector increases, the second term (the 'weight decay' term) on the right hand side becomes more significant, and this tends to keep the weight vector from becoming too large. To find the convergence conditions of the Oja algorithm, let us consider the average weight update over some number of input presentations. We shall assume the input vectors x have zero mean, and we shall also assume that the weight update factor is so small that the weight itself can be regarded as approximately constant over this number of presentations. Thus the mean update is given by
where λ = w^T C_x w, and C_x = E(x x^T) is the covariance matrix of the input data x. When the algorithm has converged, the average value of Δw will be zero, so we have
C_x w = w λ   (43)
i.e. the weight vector w is an eigenvector of the input covariance matrix C_x. A perturbation analysis confirms that the only stable solution is for w to be the principal eigenvector of C_x. To find the eventual length of w we simply substitute (43) into the expression for λ, and we find that
λ = w^T (C_x w) = w^T (w λ)   (44)
i.e. provided λ is non-zero, w^T w = 1, so the final weight vector has unit length. We have therefore seen that as the Oja algorithm progresses, the weight vector will converge to the normalised principal eigenvector of the input covariance matrix (or its negative) [23]. The component of the input which is extracted by this neuron, to be transmitted through its output y, is called the principal component of the input, and is the component with largest variance for any unit-length weight vector.
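A short numerical check of this convergence result; the input covariance matrix and parameter values are assumed for the example.

```python
import numpy as np

# Sketch of the Oja update (41), Delta w = eta * y * (x - y*w): with a small learning
# rate the weight vector converges to the (unit-length) principal eigenvector of C_x.

rng = np.random.default_rng(0)
C = np.array([[3.0, 1.0],
              [1.0, 1.0]])             # assumed input covariance (zero-mean inputs)
L = np.linalg.cholesky(C)

w = rng.normal(size=2)
eta = 0.005
for _ in range(50000):
    x = L @ rng.normal(size=2)         # draw an input with covariance C
    y = w @ x
    w += eta * y * (x - y * w)         # Oja's rule

eigval, eigvec = np.linalg.eigh(C)
print("learned w      :", w / np.linalg.norm(w))
print("principal eigv :", eigvec[:, -1])   # equal up to sign
print("|w| ->", np.linalg.norm(w))         # close to 1
```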
4.3 Reconstruction Error
For our single-output system, suppose we wish to find the best estimate x̂ of the input x from the single output y = w^T x. We form our reconstruction using a vector u as follows:
x̂ = u y   (45)
where u is to be adjusted to minimise the mean squared error. If we minimise this error with respect to u for a given weight vector w, we get a minimum at
where C_x = E(x x^T) as before (assuming that x has zero mean). Our best estimate of x is then given by
where the matrix
is a projection operator, a matrix operator which has the property that Q² = Q. This means that the best estimate of the reconstruction vector x̂, from the output y = w^T x̂, is x̂ itself. Once this is established, it is possible to minimise the error with respect to the original weight vector w. Provided the input covariance matrix C_x is positive definite, this minimum occurs when the weight vector is the principal eigenvector of C_x. Thus PCA minimises the mean squared reconstruction error.
4.4
The Scaling Problem
Users of PCA are sometimes presented with a problem known as the scaling problem. The result of PCA, and related transforms such as KLT, is dependent on the scaling of the individual input components xi. When all of the input components come from a related source, such as light level receptors in an image processing system, then it is obvious that all the inputs should have the same scaling. However, when different inputs represent unrelated quantities, then the relative scaling which each input should be given is not so apparent. As an extreme example of this problem, consider two uncorrelated inputs which initially have equal variance. Whichever input has the largest scaling will become the principal component. While this extreme situation is unusual, the scaling problem does cause PCA to produce scaling-dependent results, which is rather unsatisfactory. Typically, this dilemma is solved by scaling each input to have the same variance as each other [47]. However, there is also a related problem which arises when multiple readings of the same quantity are available. These readings can either be averaged to form a single reading, or they can be used individually as separate inputs. If same-variance scaling is used, these two options again produce inconsistent results. Thus although PCA is used in many problem areas, these scaling problems may lead us not to trust it to give us a consistent result in an unsupervised learning system.
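The scaling dependence described above is easy to demonstrate. The sketch below uses invented data: two uncorrelated, equal-variance inputs, one of which is then rescaled.

```python
# Illustration of the scaling problem: rescaling one input changes the principal direction.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(10000, 2))               # two uncorrelated, equal-variance inputs

def principal_component(data):
    cov = np.cov(data, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)
    return vecs[:, -1]

print(principal_component(x))                  # direction is essentially arbitrary here
x_scaled = x * np.array([10.0, 1.0])           # rescale the first input only
print(principal_component(x_scaled))           # now the first input dominates
```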
4.5
Information Maximization
We have seen that the Oja neuron learns to perform a principal component analysis of its input, but that principal component analysis itself suffers from an inconsistency problem when the scaling of the input components is not well defined. In order to gain some insight into this problem, we shall apply Linsker's Infomax principle [21] to this situation. Consider a system with input X and output Y. Linsker's Infomax principle states that a network should adjust itself so that the information I(X, Y) transmitted to its output Y about its input X should be maximised. This is equivalent to the information in the input X about the output Y, since I(X; Y) = I(Y; X).
However, if Y is a noiseless function of X, as is the case for our linear neuron y = wᵀx, then there will be an infinite amount of information in the output Y about X, because Y represents X infinitely accurately. In order to proceed, we must assume that the input contains some noise n which prevents any of the input from being measured too accurately. Consider the case where the input signal x is zero mean Gaussian with covariance matrix C_x, and the noise n is also zero mean Gaussian, with covariance matrix C_n = σ²I, so that the noise on each input component is uncorrelated with equal variance. The output of the neuron is then the weighted sum y = wᵀ(x + n).
Writing down the information in the output y about the input signal x, we get
I(Y; X) = ½ log((S + N)/N)   (51)
where
S = E(|wᵀx|²) = wᵀ C_x w
and
N = E(|wᵀn|²) = σ² |w|².
Since (51) is monotonically increasing in S/N, I(X, Y) is maximised when w is the principal eigenvector of C_x, i.e. when it extracts the principal component of the input. This is the same condition for minimising the mean squared reconstruction error considered above, but now we have an explicit condition on the noise on the input. The condition is that the noise on each input should be uncorrelated, and each input should have the same noise variance. The scaling problem of principal component analysis is now changed to one of guessing the noise on each of the inputs, and scaling them so that this noise is approximately equal. The standard approach of scaling all inputs so that their signal variance is equal is therefore equivalent to assuming that the signal to noise ratio of all inputs is equal [28]. Of course, the assumptions that the input signal and noise are Gaussian and zero mean are very strong, but can be relaxed somewhat if information loss is considered rather than transmitted information. However, the result in each case is the same: the Oja algorithm, which finds the principal component of the input, maximises information capacity on condition that the noise on the input components is equal.
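The statement can be verified directly from (51). The sketch below evaluates the transmitted information for a few weight vectors; the covariance matrix and noise variance are invented example values.

```python
# Evaluate (51) for several weight vectors: the maximum is at the principal eigenvector.
import numpy as np

C = np.array([[3.0, 1.0], [1.0, 2.0]])        # signal covariance C_x (invented)
sigma2 = 0.5                                   # equal noise variance on each input (invented)

def info(w):
    S = w @ C @ w                              # signal power  S = w^T C_x w
    N = sigma2 * (w @ w)                       # noise power   N = sigma^2 |w|^2
    return 0.5 * np.log((S + N) / N)

vals, vecs = np.linalg.eigh(C)
for w in (vecs[:, -1], vecs[:, 0], np.array([1.0, 0.0])):
    print(w, info(w))                          # largest value for the principal eigenvector
```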
4.6
Multi-dimensional PCA
There are a number of algorithms which extend Oja's algorithm to more than one output neuron. For these we need an output vector y and a weight matrix W, such that y = W x.
If the Oja algorithm was used for each output neuron with no modification, each would find the same principal component of the input data. Some mechanism must be used to force the outputs to learn something different from each other. One possibility is to use a lateral inhibition network between the output neurons, which forces their outputs to be decorrelated [16]. An alternative is to modify the weight decay term of the Oja algorithm: this approach is used by Williams' Symmetric Error Correction (SEC) algorithm [48], Oja and Karhunen's M-output PCA algorithm [24], and Sanger's Generalised Hebbian Algorithm (GHA) [39]. In fact, these algorithms have much in common. Although the weight vectors themselves follow different dynamics, the subspace defined by the orthogonal projection
P = W(WᵀW)⁻¹Wᵀ,
which is the subspace spanned by the weight vectors to each of the outputs, moves in exactly the same way for each of these algorithms [28]. Since this subspace, rather than the weight vectors themselves, determines the change in information transmitted through the network, these three algorithms tend to increase the transmitted information in exactly the same way. All three lead to a set of weight vectors which spans the same space as the largest principal eigenvectors of the input covariance matrix, which is sufficient to maximise the transmitted information (under the equal noise conditions which we outlined above) [28]. The three algorithms differ only in the behaviour of the weight vectors themselves. In particular, the SEC algorithm [48] leads to weight vectors which are orthogonal and unit length, but have no particular relationship to the eigenvectors of the input covariance matrix. Oja and Karhunen's algorithm [24] uses a Gram-Schmidt Orthogonalisation (GSO) approach to find the principal components themselves, in order. Sanger's algorithm [39] uses GSO in a slightly different way, but also finds the principal components in order.
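As a concrete instance of the multi-output case, the following is a minimal sketch of Sanger's GHA [39]; the data, step size and number of outputs are invented for the example.

```python
# Minimal sketch of Sanger's Generalised Hebbian Algorithm for y = W x.
import numpy as np

rng = np.random.default_rng(2)
C = np.diag([4.0, 2.0, 1.0])                   # invented input covariance
X = rng.multivariate_normal(np.zeros(3), C, size=30000)

n_out = 2
W = rng.normal(size=(n_out, 3)) * 0.1
eta = 0.005
for x in X:
    y = W @ x
    # GHA: each output subtracts the reconstruction due to itself and the outputs above it
    W += eta * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)
print(W)                                       # rows approach the two leading eigenvectors (up to sign)
```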
4.7
Discussion
We have seen that linear neurons, with a modified form of Hebbian learning algorithm, can learn to find the principal component or principal subspace of their input data. We have also seen that this principal subspace maximises the information capacity of the system, under the condition that the input components have uncorrelated, independent, equal-variance Gaussian noise on all of the components. When we perform principal component analysis in practice, we also tend to use a set of outputs which are decorrelated, or at the very least not highly correlated with each other. The algorithms considered here also do this, but this does not seem to be required to maximise information capacity. We should only have to find the principal subspace of the input data: correlation between the output components should be irrelevant. The puzzle here arises because we have neglected noise which may occur after the network which we are currently considering. As we have already seen in §3.2, decorrelated outputs tend to be better protected against later noise (which may be added noise or calculation errors) than outputs which are highly correlated [10]. One of the authors has recently investigated algorithms which may be suitable with both input and output noise [29, 31]. It may be that real perceptual systems are able to take account of both noise sources at the same time.
5 Non-linear Principal Component Analysis
5.1 Introduction
In the previous section various algorithms were suggested so as to achieve principal component analysis on a set of input patterns and thereby allow for dimension reduction and data compression. At the same time it was pointed out in §4.5 that such analysis leads to maximisation of information capacity. All of this analysis was performed in the context of linear neurons, with outputs given by (37) or (38). The statistics of the input data being used were only up to second order, whilst it is suspected that higher order statistics may be being analysed by higher layers in visual cortex, each of which has at least a quadratic output from complex cells [35]. As such, at each layer visual cortex may be working on up to fourth order statistics. This could rapidly build up to quite high order analysis in several layers, as is clearly appropriate for an effective object recognition system.
5.2
Non-linear PCA
The extension of the Oja principal component analyser described in §4 to non-linear neurons has been performed in [45]. This uses, in the case of the linear and quadratic neuron, the output rule
y = w_i x_i + w_ij x_i x_j   (53)
(where the summation convention is being used, with summation on any twice repeated index) and extension of the Hebbian update rule (41) to
Δw_ij = η(y x_i x_j − y² w_ij).   (54)
On averaging over the input patterns it is possible to write an extension of (42) which preserves its form. Thus if the two-component object Ω = (w_i, w_ij) and the (N + N²) × (N + N²) matrix
C = ( C₂  C₃ )
    ( C₃  C₄ )
be introduced, where (C₂)_ij = E(x_i x_j) (C₂ = C_x of §4), (C₃)_ijk = E(x_i x_j x_k), (C₄)_ijkl = E(x_i x_j x_k x_l), then (41) and (54), with y given by (53), become
ΔΩ = η(CΩ − λΩ)   (55)
where λ = ΩᵀCΩ. The equation (55) has identical form to equation (42), but now involves up to fourth order statistics of the input patterns. It is evident that this process can be continued by adding higher order terms still into the right hand side of (53). If terms of order n are included then the object Ω is extended to Ωᵀ = (w_{i₁}, w_{i₁i₂}, ..., w_{i₁...i_n}), whilst C is enlarged to the corresponding block matrix of moments built from C₂, C₃, ..., C₂ₙ.
Equation (55) remains unchanged. Extensions may be given along the lines of §4.6 to learning eigenvectors corresponding to lower eigenvalues of C. These will be of importance in giving a better reconstructed image. However we will not discuss these in any detail here.
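One way to read (53)-(55) is that the quadratic neuron is a linear neuron acting on the augmented vector (x_i, x_i x_j). The following sketch applies the Oja update to that augmented vector on invented data; the data distribution and step size are arbitrary choices for illustration.

```python
# Sketch of the quadratic extension (53)-(55) via the augmented vector (x_i, x_i x_j).
import numpy as np

rng = np.random.default_rng(3)
X = rng.uniform(-1.0, 1.0, size=(40000, 2))
X[:, 1] += 0.5 * X[:, 0] ** 2                  # give the data some higher-order structure
X -= X.mean(axis=0)

def augment(x):
    return np.concatenate([x, np.outer(x, x).ravel()])   # (x_i, x_i x_j)

omega = rng.normal(size=2 + 4) * 0.1
eta = 0.005
for x in X:
    z = augment(x)
    y = omega @ z                              # quadratic neuron output (53)
    omega += eta * (z * y - omega * y * y)     # Oja-type update (54)/(55)

C = np.mean([np.outer(augment(x), augment(x)) for x in X[:5000]], axis=0)
vals = np.linalg.eigvalsh(C)
print(np.linalg.norm(omega), omega @ C @ omega, vals[-1])   # unit length; Rayleigh quotient near the top eigenvalue
```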
5.3 Non-linear Pattern Reconstruction
In all of these cases the learning proceeds till Ω becomes the normalised eigenvector of C with maximal eigenvalue. As in the linear case, principal component analysis may be shown to minimise the mean-squared reconstruction error, so leading again to important data compression. The reduction is not completely trivial, so we will give a brief demonstration here which extends that of §4.3 for the linear case [15]. The optimal reconstruction is expected to be a non-linear extension of (45) [34], so taking the form
x̂^α = y^α u₍₁₎ + y^α y^α u₍₂₎ + ⋯   (56)
where α is the pattern label. The value of x̂^α can be expressed in vector form using the notation
Ŷ^α(y)ᵀ = (y^α, y^α y^α, y^α y^α y^α, ...)   (57a)
ũᵀ = (u₍₁₎, u₍₂₎, u₍₃₎, ...)   (57b)
In terms of (57), x̂^α may be written as
x̂^α = ũ · Ŷ^α(y)   (58)
where the dot product in (58) is over the vector indices in (57a, b). The mean square reconstruction error, extending (46), is
E = E(||x − x̂||²) = E(||x − ũ · Ŷ(y)||²)   (59)
where y is given by the non-linear expression, extending (53), as
y = w_i x_i + w_ij x_i x_j + ⋯   (60)
If we denote combinations of indices j₁ ... j_m by j̄, of patterns γ₁, ..., γ_m by γ̄, and products x^{γ₁}_{j₁} ⋯ x^{γ_m}_{j_m} by x^{γ̄}_{j̄}, then (59) may be written in terms of these moments as equation (61). Variation of E in (61) with respect to the components of ũ leads to a linear matrix equation (62) for those components, whose solution (63) is a non-linear extension of (47) (and reduces to it in the linear case). The minimum error for this solution may be obtained from (59) and (63) as
E_min = Σ_i (C₂)_ii − (Ωᵀ C² Ω)/(Ωᵀ C Ω) ≥ 0.   (64)
This is minimised over the original weight vector when Ω is chosen as the principal eigenvector of C. In fact there are more general solutions to equation (62), but they all lead to the same minimum error (64), and hence also to the principal component choice. As in the linear case, the scaling problem enters, albeit now in a non-linear fashion. It is not as easy to resolve as in the linear case by considering additive input noise, and scaling so that the input noise variances are equal. This is because it is no longer possible to write down the analogue of (51) in the non-linear case. It may be possible to give an approximation to this for small additive noise, although no definite results are available on this yet.
5.4
Non-linear Pattern Reconstruction
An important question to be resolved concerns the quantitative benefit of using higher order statistics in pattern reconstruction. This was analysed above using only the mean square error (MSE) E of equation (59). The MSE can be reduced to zero by using all of the eigenvectors in the linear case [38]. This case can only take account of C₂, so cannot be expected to reproduce higher order statistics correctly. To assess these higher orders a suitable extension of the MSE must be introduced. One such is the Kullback-Leibler distance D(P, P̂) between the input probability distribution P(x) and the reconstruction distribution P̂(x̂) of §2.2.2, defined as
D(P, P̂) = ∫ P(x) log(P(x)/P̂(x)) dⁿx.   (65)
In the linear case one has, from (45),
P̂(x̂) = ∫ δⁿ(x̂ − u(wᵀx)) P(x) dⁿx.   (66)
Regularisation of the δ-function and use of a Gaussian distribution for P(x) allows a Gaussian distribution to be obtained for P̂, and the result that D(P, P̂) is a minimum when u is the principal eigenvector of C₂, as is w. The extension of (66) to the non-linear case is
P̂(x̂) = ∫ δⁿ(x̂ − ũ·Ŷ(y(x))) P(x) dⁿx   (67)
where we are using the notation of the previous subsection. This approach appears to be the most natural one in the context of the differential geometric approach to estimation theory [2] to be outlined in the next section; it has still to be pursued much further in this context. It is possible to indicate the power of the non-linear PCA approach outlined above if an extension is made to the error term (59). This modification is designed so that error minimisation is achieved for all of the statistics of the input, and not just the quadratic part. In terms of the notation of §5.3, the new error function (68) measures the reconstruction error not only of the pattern components x_i but also of their higher-order products x_{i₁} x_{i₂} ⋯ x_{i_m}, reconstructed by a correspondingly enlarged set of vectors ũ. On variation with respect to the free parameters of (57b), which now carry multi-vectorial indices, the natural extension of (62) is obtained as equation (69), whose solution generalises (63). As in the case of (64), the minimised error obtained from (68) has its minimum when Ω is along the eigenvector direction of C with largest eigenvalue, which is exactly the principal component of C being learnt by the rule (55). Thus this rule enables minimum reconstruction error (in the MSE sense) of the input patterns and their higher statistics (up to the order contained in C). The above analysis can be extended immediately to the case of learning lower components, by means of extending the single non-linear neuron (60) to a set of M of them. The resulting set Ωᵀ = (Ω(1), ..., Ω(M)) is a multi-tensor, and the learning rule (73) takes the form of one or other of the PCA-subspace rules. Numerous extensions of (73) are possible to obtain orthogonal decompositions of the PCA subspace; they were briefly considered in §4.6, and will not be discussed here further.
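Returning to the linear Gaussian case of (66), the Kullback-Leibler comparison can be sketched numerically: for input N(0, C_x), the regularised linear reconstruction x̂ = Qx is N(0, QC_xQᵀ + εI), using the projection operator Q of §4.3. The covariance matrix and the small regulariser ε below are invented values standing in for the δ-function regularisation mentioned above.

```python
# Gaussian KL sketch: D(P, P_hat) is smallest when w is the principal eigenvector.
import numpy as np

C = np.array([[3.0, 1.0], [1.0, 2.0]])         # invented input covariance C_x
eps = 0.05                                      # invented regulariser

def kl_gaussians(S0, S1):
    # D( N(0,S0) || N(0,S1) ) for zero-mean Gaussians
    n = S0.shape[0]
    return 0.5 * (np.trace(np.linalg.solve(S1, S0)) - n
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

def kl_for_weight(w):
    Q = np.outer(C @ w, w) / (w @ C @ w)        # projection operator of section 4.3
    S_hat = Q @ C @ Q.T + eps * np.eye(2)       # regularised reconstruction covariance
    return kl_gaussians(C, S_hat)

vals, vecs = np.linalg.eigh(C)
print(kl_for_weight(vecs[:, -1]), kl_for_weight(vecs[:, 0]))   # smaller for the principal eigenvector
```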
5.5
Discussion
It is relevant to note here that neurons in general are non-linear, and the effect of a sigmoid non-linearity on the neuron output on the determination of the principal components of the second order input statistics has already been studied [25]. In general improved stability of the PCA algorithms was found to result from this non-linearity. In this section we have not discussed this aspect of non-linear neurons in the PCA context, but focussed on the ability of such neurons, with suitably adaptive higher-order weights, to learn higher than second order statistics of their inputs. This was first presented in [45], and practical aspects will be discussed more fully in [15]. We have indicated here briefly how to achieve the non-linear extension of much of the linear PCA analysis of the previous section. The MSE approach to the reconstruction problem indicates how to achieve a better assessment of the importance of the higher order statistics in the pattern reconstruction process. It will be explored in more detail elsewhere [15].
6 Differential Geometry of the Manifold of Networks
6.1 Introduction
Any neural network of a given architecture is a parameterised form of mapping
x → F(w, x) = y   (74)
from the space of inputs x to that of the outputs y, where the parameters w are a finite set of real valued quantities, usually the weights and thresholds of the separate neurons. As w varies, the set of such parameterised maps F forms a space of maps N characterising all neural networks with the given architecture. It is highly relevant to discover suitable tools for describing the intrinsic structure of N, and also the manner in which it is embedded in the space S of all mappings from x to y. Such structure is of importance in describing how training algorithms change the network along a trajectory on N, or more general
architecture optimisation strategies modify the network inside the more general space S of all maps. An important approach to these questions has been developed by Amari [2] in terms of differential geometric concepts associated with statistical estimation theory. This approach uses suitable parametrisations of N, so these will be considered first. We will then, in the following sub-sections, discuss the appropriate differential geometry and consider neural networks in that framework.
6.2
Network Parametrizations
The simplest and most complete parametrization is for the case of stochastic neurons with n binary inputs x and a single output y. In that case the probability of emitting a one is a quantity which we can denote a_x:
prob(y = 1 | x) = a_x   (75a)
prob(y = 0 | x) = 1 − a_x   (75b)
The 2ⁿ quantities a = {a_x} give a complete description of the neuron. Such a neuron has been termed a PRAM [45] and discussed extensively in a series of papers (see [17] or [18] for a review). The PRAM has the particularly attractive features of (a) having a simple hardware realisability; (b) possessing continuously variable "connection weights" a_x, which can be trained by reward learning in a hardware-realisable manner [14]; and (c) representing the stochastic response arising from noisy quantal synaptic transmission in living neurons. As such, PRAMs completely fill the space S_PRAM of binary input and output stochastic neurons. Subspaces N_k of S can be formed by those PRAMs which only have non-zero memory contents a_x for |x| ≤ k.
Networks of PRAMs can be built [14], and are parametrised by the a's of the respective PRAM components; they also will in general be subsets of S_PRAM. Other subspaces of S_PRAM exist, such as that formed by linear weighted-sum neurons with
a_x = f(w · x − t)
where w, t are connection weights and threshold respectively. Parametrization of graded input or output neurons can be developed in a similar manner. The neurons of §4.1 are parametrised by a single weight vector, whilst the non-linear neuron of §5 by the multi-tensor quantity Ω = (w_i, w_ij, w_ijk, ...). In general a neuron may have an input-output transform which is parametrised by any number of parameters, so that for such cases N can be infinite-dimensional. However it would be usual to have a bound on the number of adaptive parameters in a neural net, so only finite dimensional subspaces of S would arise.
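The PRAM parametrization is easy to make concrete: the neuron is a table of 2ⁿ firing probabilities addressed by the binary input. The sketch below uses invented memory contents.

```python
# Sketch of a PRAM-style stochastic neuron: prob(y = 1 | x) = a_x for each binary pattern x.
import numpy as np

rng = np.random.default_rng(4)
n = 3
a = rng.uniform(0.0, 1.0, size=2 ** n)        # 2^n memory contents a_x (invented values)

def pram_output(x_bits):
    address = int("".join(str(b) for b in x_bits), 2)   # binary input addresses the memory
    return int(rng.random() < a[address])               # emit 1 with probability a_x

x = [1, 0, 1]
samples = [pram_output(x) for _ in range(10000)]
print(a[0b101], np.mean(samples))             # empirical firing rate approaches a_x
```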
6.3
Differential Geometry
The output of the stochastic neuron of the previous subsection can be interpreted as the random variable r_x for binary input x, so that a 2ⁿ-component random variable r = {r_x} ensues, with probability
P(r) = Π_x a_x^(r_x) (1 − a_x)^(1 − r_x).   (78)
The above expression gives a family of probability distributions for r, co-ordinatised by the parameters a of the stochastic neuron. It is possible to introduce an exponential family of co-ordinates θ^x, with
a_x = (1 + e^(−θ^x))⁻¹   (79)
in terms of which a dually-flat Riemannian structure can be defined. A dually flat manifold is defined in terms of two special sets of dually coupled coordinates θ and a. Linear curves in θ or in a are geodesics which are dual to each other. This duality is defined by the tangent vectors e_u along the co-ordinates θ^u and e^v along the a_v, with the condition e_u · e^v = δ_u^v. The structure of the manifold is determined by potential functions ψ(θ) and φ(a) with
θ^u = (∂/∂a_u) φ(a)   (81a)
a_u = (∂/∂θ^u) ψ(θ)   (81b)
An important result [1] is that a dually flat manifold admits a unique invariant divergence measure D(P, Q) between any two points P, Q, with a value expressible in terms of the potential functions (equation (82)). If the manifold is that of probability distributions then this divergence is identical with the Kullback-Leibler divergence which was mentioned briefly in the previous section. The divergence is itself an extension of the Euclidean distance to the case of the metric (82). It has the useful generalised Pythagorean property
D(P, R) = D(P, Q) + D(Q, R)   (83)
if the θ-geodesic connecting P and Q is orthogonal at Q to the dual a-geodesic connecting Q and R. Moreover it has the further valuable projection property that the point Q_P in a sub-manifold M of S which minimises the divergence to any point P in S is given by the dual geodesic projection of P onto M.
6.4
Geometry of the Neuron Manifold
In the case of the PRAM of §6.2, the θ^x and a_x co-ordinates were already defined by equations (75) and (79). The corresponding metric and potential functions of §6.3 are then [2]
ψ(θ) = Σ_x [θ^x + ln(1 + e^(−θ^x))]   (84a)
where g_uv is the Fisher information metric and g^uv its inverse. For N observations, the estimation error of an output r_x is given by
E((â_x − a_x)²) = (1/N) a_x (1 − a_x)   (85)
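Equation (85) is simply the binomial sampling variance, and can be checked empirically; the value of a_x, the number of observations N and the number of trials below are arbitrary.

```python
# Quick check of (85): the variance of the estimate of a_x from N observations is a_x(1 - a_x)/N.
import numpy as np

rng = np.random.default_rng(5)
a_x, N, trials = 0.3, 100, 20000
estimates = rng.binomial(N, a_x, size=trials) / N
print(estimates.var(), a_x * (1 - a_x) / N)    # the two numbers should agree closely
```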
The invariant divergence between two points P(a) and Q(ā) is given simply by the Kullback-Leibler information distance
D(P, Q) = Σ_x [a_x ln(a_x/ā_x) + (1 − a_x) ln((1 − a_x)/(1 − ā_x))].   (86)
It is possible to use the geometric theorems of the previous subsection to show [2] how the approximation error D(P*, P_k), for a neuron P_k of order k, so with a_x = 0 for |x| > k, can be decomposed into a sum of distances between approximations to neurons of successively higher orders
D(P*, P_k) = Σ_{m=k}^{n−1} D(P*_{m+1}, P*_m) + D(P*_k, P_k)   (87)
with P*_n = P*. The above approach has been applied to analyse learning in the Boltzmann machine [3] and to obtain [2] the error between P and the maximum likelihood estimate P*, obtained when T is the number of training examples and d the number of free parameters (2ⁿ in this case), to be D(P, P*) = d/(2T). This is an important result on generalisation, and is at the basis of the "Rule of Thumb" that there must be an order of magnitude more training patterns than free parameters in order to guarantee a generalisation error of less than 5%. More detailed analyses have been given of learning in particular situations [3], to which attention is directed. It is also possible to extend the structure to graded inputs and/or outputs, when the space of networks becomes an infinite-dimensional function space. The associated expression (78) becomes the weighted log likelihood
with p(x) the input probability distribution, a(x) the output probability for input x, and r(x) the random variable with
prob(r(x) = 1) = a(x).   (90)
The above framework can evidently be extended to more than one output.
7 Discussion
It should be clear from the foregoing that information-theoretic approaches are making important progress in analysing both artificial neural networks and possible processing strategies in early vision (at both retinal and visual cortical levels). This is particularly true for understanding preprocessing, in terms of predictive coding in the retina and PCA and decorrelation processing in early visual cortex. However there is a difficult question which must be considered before we can be satisfied with the explanations given in this chapter of neurobiological information processing, which arises because living neurons and their nets are far more complex than those considered in the present analyses. Even if we are including higher orders of non-linearity in the response function in §5 there is no inclusion of complex temporal features as might arise from channel openings and closings (alpha functions) or from neuronal geometry, or of the details of stochastic synaptic transmission, or of the many other features of living neurons. The question is therefore whether the explanations of retinal and early visual cortical processing, given in terms of information-theoretic principles, and on the basis of non-physiologically realistic learning rules, will still be valid when more realistic neurons and nets are used. Moreover learning rules themselves must have a biological reality. Will there be realistic rules which will lead to the desired connection weights? It is clear that some of the discussion in the earlier sections contravenes known biological features. Thus the anti-Hebbian learning rules of (2.12) of ref [34], and of equation (25) here, use adaptivity of inhibitory synapses, which is a feature with no experimental validation. It is possible to avoid this problem by using fixed inhibitory feedback but variable lateral excitatory connections on the inhibitory neurons. This corresponds to having the adaptive connection weight matrix V for the lateral excitatory connections in fig 3(b), but the fixed inhibitory connection matrix −C for the feedback. Thus equation (23) becomes
y = x − Cz,    z = Vᵀy   (91)
and (24) becomes
y = (1 + CVᵀ)⁻¹ x.   (92)
Then the learning algorithm of equations (25) or (26) will again give uncorrelated outputs when learning has been completed, but now only the excitatory synapses have been trained. The inhibitory interneurons will still perform decorrelation on the outputs. However care would have to be taken to ensure that the weights v_ij do not go negative, so that equation (25) would have to be modified so as to be clipped when the v_ij become zero. The resultant level of decorrelation has yet to be analysed.
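The modified architecture of (91)-(92) can be sketched as follows. The Hebbian form of the lateral update used here is an assumption (the original rule (25) is defined earlier in the chapter and not reproduced here), and the data, learning rate and fixed inhibitory matrix C are invented example values.

```python
# Rough sketch of (91)-(92): fixed inhibition -C, adaptive excitatory lateral weights V, clipped at zero.
import numpy as np

rng = np.random.default_rng(6)
mix = np.array([[1.0, 0.6], [0.6, 1.0]])       # make the two inputs correlated
X = rng.normal(size=(20000, 2)) @ mix

C = np.eye(2)                                  # fixed inhibitory feedback strength
V = np.zeros((2, 2))                           # adaptive excitatory lateral weights
eta = 0.01
for x in X:
    y = np.linalg.solve(np.eye(2) + C @ V.T, x)      # y = (1 + C V^T)^{-1} x, eq. (92)
    dV = eta * np.outer(y, y)                  # assumed Hebbian update on the lateral weights
    np.fill_diagonal(dV, 0.0)                  # only lateral (off-diagonal) weights adapt
    V = np.maximum(V + dV, 0.0)                # clip so the weights stay non-negative

Y = X @ np.linalg.inv(np.eye(2) + C @ V.T).T
print(np.corrcoef(Y, rowvar=False))            # off-diagonal correlation is strongly reduced
```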
Another aspect where biological realism could be introduced is associated with PCA in both sections 4 and 5. This in particular involves the presence of the decay term proportional to the square of the output in (41) or (54). Both of these expressions involve a global assessment of output, and its use in a local manner at each synapse. To avoid this one could take a decay term depending only on the weight value at the synapse [36], as
Δw_ij = η(x_i y_j − w_ij³)   (93)
This has been shown to result in asymptotic weight values proportional to a high root of the principal eigenvector components of the correlation function; its value for PCA is not yet clarified. These modifications only go a very short way towards responding to the question raised earlier. There is no reason, however, why approximations to living neurons cannot be used to advance understanding in this area; the use of more realistic neurons will ultimately have to be faced up to. Use of non-trivial temporal features [12] and other properties is increasingly being modelled, so that this aspect of the programme can be started. Finally we note there are many directions for future work. We have only scratched the surface of this very important approach to neural networks, leaving out numerous avenues being actively followed by others. However we hope that we have given in this chapter enough of a survey to indicate the nature of the approach.
Acknowledgements One of the authors (MDP) is supported by a Temporary Lectureship from the Academic Initiative of the University of London. The other author (JGT) would like to thank DRA for financial support to allow part of this work to be done.
References
[1] S.-I. Amari. Differential geometry of a parametric family of invertible linear systems - Riemannian metric, dual affine connections and divergence. Mathematical Systems Theory, 20:53-82, 1987.
[2] S.-I. Amari. Dualistic geometry of the manifold of higher-order neurons. Neural Networks, 4:443-451, 1991.
[3] S.-I. Amari, K. Kurata, and H. Nagaoka. Geometry of Boltzmann machine manifolds. Mathematical Engineering Technical Report 90-19, University of Tokyo, Faculty of Engineering, 1990.
[4] J. J. Atick and A. N. Redlich. Quantitative tests of a theory of retinal processing: Contrast sensitivity curves. Technical Report IASSNS-HEP-90/51, NYU-NN-90/2, School of Natural Sciences, Institute for Advanced Study, Princeton; Center for Neural Science, New York University, 1990.
[5] J. J. Atick and A. N. Redlich. Towards a theory of early visual processing. Neural Computation, 2:308-320, 1990.
[6] J. J. Atick and A. N. Redlich. What does the retina know about natural scenes? Neural Computation, 4:196-210, 1992.
[7] J. J. Atick and A. N. Redlich. Convergent algorithm for sensory receptive field development. Neural Computation, 5:45-60, 1993.
[8] F. Attneave. Some informational aspects of visual perception. Psychological Review, 61:183-193, 1954.
[9] H. B. Barlow. Three points about lateral inhibition. In W. Rosenblith, editor, Sensory Communication, pages 782-786. MIT Press, 1961.
[10] H. B. Barlow and P. Földiák. Adaptation and decorrelation in the cortex. In The Computing Neuron, pages 54-72. Addison-Wesley, Wokingham, England, 1989.
[11] S. Becker and G. E. Hinton. Spatial coherence as an internal teacher for a neural network. Technical Report CRG-TR-89-7, Department of Computer Science, University of Toronto, Dec. 1989.
[12] P. C. Bressloff and J. G. Taylor. Spatio-temporal pattern processing in a compartmental model neuron. Physical Review E, 1993. (in press).
[13] J. S. Bridle. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In F. F. Soulié and J. Herault, editors, Neurocomputing - Algorithms, Architectures and Applications, pages 227-236, Berlin, 1990. Springer-Verlag.
[14] T. G. Clarkson, D. Gorse, and J. G. Taylor. Hardware realisable models of neural processing. In Proceedings of the IEE First International Conference on Artificial Neural Networks, pages 242-246, 1989.
[15] S. Coombes, R. Petersen, J. G. Taylor, and S. Wright. Non-linear principal component analysis and pattern reconstruction. In preparation.
[16] P. Földiák. Adaptive network for optimal linear feature extraction. In Proceedings of the International Joint Conference on Neural Networks, IJCNN-89, pages 401-405, Washington DC, 18-22 June 1989.
[17] D. Gorse and J. G. Taylor. On the equivalence and properties of noisy neural and probabilistic RAM nets. Physics Letters A, 131:326-332, 1988.
[18] D. Gorse and J. G. Taylor. A review of the theory of PRAMs. In Proceedings of the Weightless Neural Networks Conference, York, UK, Apr. 1993. (To appear).
[19] D. O. Hebb. The Organization of Behavior. Wiley, New York, 1949.
[20] D. H. Hubel and T. N. Wiesel. Receptive fields of single neurones in the cat's striate cortex. Journal of Physiology, 148:574-591, 1959.
[21] R. Linsker. Self-organization in a perceptual network. IEEE Computer, 21(3):105-117, Mar. 1988.
[22] R. Linsker. Local synaptic learning rules suffice to maximize mutual information in a linear network. Neural Computation, 4:691-702, 1992.
[23] E. Oja. A simplified neuron model as a principal component analyser. Journal of Mathematical Biology, 15:267-273, 1982.
[24] E. Oja and J. Karhunen. On stochastic approximation of the eigenvectors and eigenvalues of the expectation of a random matrix. Journal of Mathematical Analysis and Applications, 106:69-84, 1985.
[25] E. Oja, H. Ogawa, and J. Wangviwattana. Learning in nonlinear constrained Hebbian networks. In T. Kohonen et al., editors, Artificial Neural Networks, pages 385-390. Elsevier, 1991.
[26] A. Papoulis. Probability, Random Variables and Stochastic Processes. McGraw-Hill, second edition, 1984.
[27] B. A. Pearlmutter and G. E. Hinton. G-maximization: An unsupervised learning procedure for discovering regularities. In J. S. Denker, editor, Proceedings of Neural Networks for Computing, pages 333-338. American Institute of Physics, 1986.
[28] M. D. Plumbley. On information theory and unsupervised neural networks. Technical Report CUED/F-INFENG/TR.78, Cambridge University Engineering Department, Cambridge, UK, 1991.
[29] M. D. Plumbley. Approximating optimal information transmission using local Hebbian algorithms in a double feedback loop. In Proceedings of the International Conference on Artificial Neural Networks, ICANN'93, Amsterdam, Sept. 1993. (To appear).
[30] M. D. Plumbley. Efficient information transfer and anti-Hebbian neural networks. Neural Networks, 1993. (in press).
[31] M. D. Plumbley. A Hebbian/anti-Hebbian network which optimizes information capacity by orthonormalizing the principal subspace. In Proceedings of the IEE Artificial Neural Networks Conference, ANN-93, Brighton, UK, May 1993. (To appear).
[32] M. D. Plumbley and F. Fallside. An information-theoretic approach to unsupervised connectionist models. In D. Touretzky, G. Hinton, and T. Sejnowski, editors, Proceedings of the 1988 Connectionist Models Summer School, pages 239-245, San Mateo, CA, 1988. Morgan Kaufmann.
[33] M. D. Plumbley and F. Fallside. The effect of receptor signal-to-noise levels on optimal filtering in a sensory system. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, ICASSP-91, volume 4, pages 2321-2324, Toronto, Canada, May 1991.
[34] T. Poggio. On optimal nonlinear associative recall. Biological Cybernetics, 19:201-209, 1975.
[35] D. A. Pollen, J. P. Gaska, and L. D. Jacobson. Physiological constraints on models of vision. In R. M. J. Cotterill, editor, Models of Brain Function, pages 115-136. Cambridge University Press, 1989.
[36] H. Riedel and D. Schild. The dynamics of Hebbian synapses can be stabilized by a nonlinear decay term. Neural Networks, 5:459-463, 1992.
[37] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. 1: Foundations. Bradford Books/MIT Press, Cambridge, MA, 1986.
[38] T. D. Sanger. Optimal unsupervised learning in a single-layer linear feedforward neural network. Neural Networks, 2:459-473, 1989.
[39] T. D. Sanger. An optimality principle for unsupervised learning. In D. S. Touretzky, editor, Advances in Neural Information Processing Systems 1, pages 11-19. Morgan Kaufmann, San Mateo, CA, 1989.
[40] C. E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:379-423, 623-656, 1948.
[41] C. E. Shannon. Communication in the presence of noise. Proceedings of the IRE, 37:10-21, 1949.
[42] C. E. Shannon. Prediction and entropy of printed English. Bell System Technical Journal, 30:50-64, 1951.
[43] S. A. Solla, E. Levin, and M. Fleisher. Accelerated learning in layered neural networks. Complex Systems, 2, 1988.
[44] M. V. Srinivasan, S. B. Laughlin, and A. Dubs. Predictive coding: a fresh view of inhibition in the retina. Proceedings of the Royal Society of London, Series B, 216:427-459, 1982.
[45] J. G. Taylor and S. Coombes. Learning higher order correlations. Neural Networks, 6(3):423-428, 1993.
[46] C. von der Malsburg. Self-organization of orientation sensitive cells in the striate cortex. Kybernetik, 14:85-100, 1973.
[47] S. Watanabe. Pattern Recognition: Human and Mechanical. John Wiley & Sons, New York, 1985.
[48] R. J. Williams. Feature discovery through error-correction learning. ICS Report 8501, University of California, San Diego, 1985.
[49] R. S. Zemel and G. E. Hinton. Discovering viewpoint-invariant relationships that characterize objects. In Advances in Neural Information Processing Systems 3, San Mateo, CA, 1991. Morgan Kaufmann.
Mathematical Analysis of a Competitive Network for Attention
J. G. Taylor and F. N. Alavi
Centre for Neural Networks, King's College London, The Strand, London WC2R 2LS, United Kingdom
Abstract
A neurobiologically based model of a competitive network is analysed mathematically. The network is shown to possess the property of sustaining global competition between inputs. Various qualitative aspects of the network activity are supported by simulations.
1
The Nature of the problem
Attention requires some sort of global competition to be occurring amongst the choices of objects to attend to in the external world or in the internal signals from the body. This competitive aspect has already been noted as a fundamental constituent of many papers on the subject, both from the general fields of psychology and physiology and from the more specific aspect of connectionist modelling. However, none of these references, and the many more using competitive nets to explain other aspects of neurobiological processing, appear to have any strong foundation in neurobiological fact. A non-trivial proportion of neurons in cortex, suggested at about 25%, are inhibitory, but long-range connections between cortical areas appear to be excitatory (Douglas and Martin 1991), although with feedback inhibition coming from local lateral inhibition. There is good evidence for inhibitory effects in orientation sensitivity in early cortex (Martin 1988) but this has at most been suggested as arising from half a hypercolumn distance, or about 0.5 mm, and giving fine-tuning of earlier sensitivity (Wortgotter et al. 1991). Another place in which competition may be searched for is in the thalamus. This is composed of numerous nuclei acting in some manner as way-stations for input going to primary cortical areas (vision through the lateral geniculate nucleus LGN, for example) and as transit stations for activity in later cortical areas. There is also a considerable amount of input to thalamus from the mid-brain and the systems in the cortex denoted as the limbic systems and occupying the rim around the cortex (including at least the
hippocampus, amygdala, hypothalamus and cingulate gyrus). There are inhibitory interneurons in the thalamus, but very little in the way of lateral connections between different thalamic nuclei. Even in a given nucleus, any inhibitory interneurons only have short-ranged ramifications, nor is there much in the way of long-range connections between the excitatory relay cells. It appears, therefore, that the absence of any long-range inhibitory connections leading to the production of competition between activity in the numerous areas (occipital, parietal, temporal and frontal lobes) involved in guiding the various aspects of attention (Posner and Petersen 1990) requires that one search elsewhere than in the cortex. A particularly desirable feature of the competitive mechanism would be that it leads not only to efficient competition but also to global guidance. By this we mean the ability of a winning input to a portion of cortex to control activity in other parts of cortex at the same (or in very closely subsequent) time. This latter activity which loses the competition is not destroyed, but is allowed to exist only in certain forms, guided by the winning input. Global guidance is not a standard property of WTA-type networks¹. Indeed it appears to be very different and not compatible with any known WTA architecture. One possible class of neural networks able to implement global guidance to some degree is that of nets with spatial instability. Such nets have been known since the time of the work on visual hallucinations (Ermentrout and Cowan 1978) and from the wealth of examples of instability and pattern formation in reaction-diffusion equations (Murray 1989). We present here results on a particular brain network, the nucleus reticularis thalami (NRT), which we propose functions by means of its special lateral inhibitory connectivity so as to achieve global guidance as well as a more local competitive form of action. The manner in which such global guidance can be used in attentive processing is still to be analysed.
1.1
The Nucleus Reticularis of the Thalamus (NRT)
The NRT, present in all mammals, is a thin, curved sheet of cells so situated between thalamic relay cells and their cortical target sites as to be highly suggestive of its possible controlling influence on cortical input and activity. It is pierced by thalamo-cortical and returning cortico-thalamic fibres in a roughly topographic fashion, especially for inputs to and from primary sensory cortical areas. The main organisational principle (Jones 1975) is that the NRT can be considered as a series of overlapping sectors, each related to a particular dorsal thalamic nucleus (or nuclei). The axons penetrating NRT give off excitatory collaterals to it, whilst NRT cells themselves only feed back inhibitorily in a roughly topographic fashion to the thalamic relay cells from which they have received
¹WTA = Winner Takes All
their collaterals. The main NRT cell neurotransmitter is GABA, which is a well-known inhibitor. The NRT structure eminently qualifies it to be some sort of integrative filter modulating thalamic and cortical activity, the filter itself controlled by cortical, mid-brain and some brain stem structures (which also have inputs to NRT). That NRT performs a control function has been remarked on briefly in numerous papers. For example, as noted by Schiebel (1980): "Situated like a thin nuclear sheet draped over the lateral and anterior surfaces of the thalamus, it has been likened to the screen grid interposed between cathode and anode in the triode or pentode vacuum tube". There are various features of the detailed circuitry and functionality which must be properly accounted for in any serious modelling:
1. the nature of the inter-connectivity on the NRT sheet is itself species-specific. Thus in the rat, fine structure analysis shows only axo-dendritic synapses of presumed excitatory or inhibitory type (Ohara and Lieberman 1985), but that of the cat and monkey have very clear dendro-dendritic synapses (Ohara 1988; Deschênes et al. 1989), some of these even being reciprocal (Deschênes et al. 1989). It should be added that these dendro-dendritic synapses occur near cell bodies, and not on their distal dendrites.
2. the exact nature of the inhibitory effect of the GABA-ergic NRT cells is unclear. Thus evidence has been presented (McCormick and Prince 1986; Spreafico et al. 1988) that local application of GABA to NRT neurons causes depolarisation of the cell membrane rather than hyperpolarisation. Direct evidence of lateral inhibition of NRT neurons on each other was earlier provided by Ahlsén and Lindström (1982). Moreover, the equilibrium potential E_Cl of the chloride conductance (the ion channel by which GABA influences the membrane potential) is about −65 mV (McCormick and Prince 1986) whilst the average membrane potential is about −56 mV (Avanzini et al. 1989). Thus the effect of GABA on neurons at or above their resting potential will be expected to be inhibitory. It is interesting to note that this effect will become excitatory if the membrane potential goes below E_Cl. This may be a useful control mechanism to prevent NRT neurons becoming so hyperpolarised as to be in their bursting state. Throughout this paper we will assume the action of GABA on NRT is inhibitory, and that the neurons have membrane potentials always above E_Cl (although this latter will not be explicitly discussed in the modelling). The effect of GABA-ergic feedback in thalamic relay cells and interneurons will be considered later.
3. the continuous nature of the NRT sheet is reasonably well supported by anatomical
and cytoarchitectonic evidence, except for that part of it adjacent to the visual input (through the lateral geniculate nucleus, LGN) termed the perigeniculate nucleus (PGN). Presently, the weight of opinion is in favour of complete connectivity between PGN and NRT.
4. there are at least two modes of action of thalamic relay and NRT neurons:
(i) relay-like behaviour, defined by tonic firing in response to inputs. (ii) a phasic bursting discharge behaviour, with the ability to maintain rhythmic burst discharges at about 6 to 8 Hz. There is great relevance of mode (ii) in thalamic and cerebral spindling (sequences of rhythmic bursting) activity in sleep and in the sleep-wake transition (Avanzini et al. 1989). Since we are mainly interested here in the waking state, only mode (i) of the NRT (and thalamic) neurons will be considered explicitly.
5. there is an important species-specific difference in the NRT feedback to thalamus. In the rat, only LGN has inhibitory interneurons, but other thalamic nuclei with output to NRT are known to possess very few non-relay cells (Barbaresi et al. 1986; Harris and Hendrickson 1987). In the cat and monkey, however, such interneurons are apparently widespread. Inhibitory NRT feedback on these latter interneurons was used as an important part of the model of Steriade, Domich and Oakson (1986) and later by La Berge (1990). This model assumed disinhibition of interneuronal input control by inhibitory NRT feedback. However, in rat, without such interneurons, one might expect NRT inhibition to feed back directly on to input thalamic relay cells, so having just the opposite effect on thalamic inputs. We will discuss this problem when we turn to the details of our model, and in particular consider it in the context of stability.
6. there are also interesting features of the global wiring diagram in which NRT is concerned. Thus NRT is claimed not to be connected at all to the anterior thalamic nuclei in cats (Jones 1975; Paré et al. 1987), but there is apparently such connectivity in rats (Ohara and Lieberman 1985). The question of how important NRT inputs are to conscious processing will be discussed later in the paper.
7. there is (Spreafico et al. 1991) a clear change of cellular type in NRT from cells with round dendritic fields (termed R-type), of about 200 µm across, at the anterior pole to more elongated large fusiform cells (F-type) moving posteriorly, to even more elongated small fusiform cells (f-type) on the medial and lateral borders of the sheet. F and f-type cells have dendrites running either vertically or horizontally
in the plane of the NRT sheet (the elongated dendritic fields of f-type usually being horizontal), the f-type extending up to 300-400 µm in length from the cell body. The f-cells seem to be located especially in the region of the NRT related to sensory thalamic nuclei, and receive afferents from sensory cortex.

Property of NRT                                             Rat    Cat, Primate
1. dendro-dendritic synapses                                No     Yes
2. inhibitory interneurons in non-sensory thalamic nuclei   No     Yes
3. inhibitory interneurons in LGN                           Yes    Yes
4. NRT connected to anterior thalamic nuclei                Yes    No

Table 1: Species differences between various structural properties of NRT.
8. it has also been discovered in the rabbit (Montero et al. 1977; Crabtree and Killackey 1989; Crabtree 1992) and in the cat (Crabtree and Killackey 1989; Crabtree 1989, 1991, 1992) that inputs from local regions of cortex lie parallel to the plane of the nucleus creating maps perpendicular to the reticular sheet. Thus correct modelling of NRT should take account of this three-dimensional character of the nucleus. However, our modelling of the cortex to be presented herewith is only at the level of considering it as a two-dimensional sheet; to be consistent, NRT can therefore only be regarded as two-dimensional. It is hoped that the NRT model presented here will be extended to a three-dimensional structure related to the cortical structure in the appropriate manner (Crabtree 1992). The most important aspects of these features of NRT are summarised in Table 1.
1.2
Neural Modelling
In order to deduce properties of neural models it is usual to proceed either by general arguments or, alternatively and more precisely, by simulating them using simplified models of the neurons which they contain. The latter approach becomes difficult if there are many neurons and/or many coupled nets. Since we wish to attempt to consider both of these latter cases we appear to be forced to attempt to use the more general method. We will try to be more precise, however, by using the techniques of dynamical systems theory to allow us to deduce qualitative results for a class of models. This is a method which has now become well established and for which there are numerous texts and reviews. The main concepts are attractors (fixed points, cycles, strange attractors) and stability (bifurcations of various types). These have been discussed in the cortical context in a
general fashion in Ermentrout and Cowan (1978), and many more discussions are now appearing, which are too numerous to mention. In specific cases quantitative results are being obtained which permit even more precise descriptions of various biological neural processes. The main model of the neuron we will use is the leaky integrator neuron, with output, describing its mean firing rate, given by a suitable sigmoidal function of the membrane potential. The input is a linear sum of the outputs of other neurons; a minimal numerical sketch of such a neuron is given at the end of this subsection. This is now standard in the field of artificial neural networks, and has proved increasingly effective in many industrial applications. The model is obviously an enormous simplification of any biologically realistic neuron. However, we argue in support of simplicity on two grounds. Firstly, general features of simulations of nets of highly complex neurons have been duplicated by using the simple neurons we are advocating (Liljenström 1991). Secondly, if it is possible to achieve insight into basic principles for new paradigms of information processing that might be used by the brain in terms of such simple model neurons, then it could give valuable guidance to further research. One would then have to ascertain if the use of biologically more realistic neurons would make such processing as effective, if not more so. It also might help in developing further new paradigms for ever more global information processing, again with the purpose of guiding understanding of additional styles of brain processing. These new paradigms could occur on using more global wiring diagrams to understand the functionality of an even larger range of brain regions than those under the initial detailed scrutiny, but involved with these latter in an important manner. We will describe such extensions later. Part of the important advances being made by artificial neural networks is in their ability to learn any suitably smooth function with simple sigmoidally responding neurons and with a one hidden layer feedforward architecture (Hornik, Stinchcombe and White 1989). This aspect of neural networks allows an adaptive feedforward net to be arbitrarily flexible. On the face of it, such powers would seem to make easier the problem of modelling attention. However, such a method appears difficult to make effective, since the criteria for an attentional net are not simple to specify. One might claim that a single competitive net will achieve effective results (Koch and Ullman 1985). However, it is well known (Posner and Petersen 1990) that attention has several stages, containing at least the steps of disengagement, movement and reengagement. There is also the difference between exogenous and endogenous attention. Several nets therefore must be involved in the overall action of attending. This makes the desired input-output transforms of each net less obvious. Moreover, it is likely that not until after being attended to does recognition and memory storage of an object occur. Thus the on-going net learning should not
take place until after attention. Adaptive processes may be necessary to develop suitable feature detectors for preprocessing (Linsker 1988) but that need not be regarded as part of the activity we must consider most specifically during actual attentional processing; the study of the development of feature detectors has not led to any perceptible insight into attention. In the face of these difficulties, we will use the hints which can be gleaned from lesion studies and gross circuit diagrams (Posner and Petersen 1990) and from the details of the micro-neuroanatomy we have presented above, and the neurophysiology of the appropriate neural circuits, as contained, for example, in Steriade et al. (1986) and Steriade et al. (1991).
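The leaky integrator neuron mentioned above can be sketched in a few lines; the time constant, weights and inputs below are invented values chosen only for illustration.

```python
# Minimal sketch of a leaky integrator neuron with a sigmoidal firing-rate output.
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

tau, dt = 10.0, 1.0                            # membrane time constant and time step (arbitrary units)
w = np.array([0.8, -0.3, 0.5])                 # input connection weights (invented)
u = 0.0                                        # membrane potential
inputs = np.tile(np.array([1.0, 0.5, 0.2]), (100, 1))

for x in inputs:
    du = (-u + w @ x) / tau                    # leaky integration of the summed input
    u += dt * du
print(u, sigmoid(u))                           # potential settles near w.x; output is its sigmoid
```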
2
The Model
In order to carry out a competition between several neurons, it is necessary to have a mechanism for enhancing the activity of the most active one (or ones) in comparison to the remainder. Such a process could be achieved solely by increasing the winning neurons' activity over that of the remainder. However, such augmentation would lead to the danger of an epileptic type of response ensuing, brought about by runaway feedback excitation. To avoid this, some measure of lateral inhibition appears to be essential, and has always been present in computational models of neuron processing. The standard mechanism for this has been a Mexican hat form for the dependence of the lateral connection weights, with a short-ranged excitation giving way at longer distances to inhibition. The former enhances activity in a local region, the latter achieves the reduction of lesser, distant, activity. It was argued earlier that the competition in neural networks in the brain necessary to implement attention can only arise from long-ranged inhibition. This particularly seems needed to achieve the focussing of attention on a single object. Moreover, the inability to attend to more than one single theme at a time seems only to emphasise the property of attention as a very narrow filter. However, since various features of an object may be attended to at one and the same time, it would seem that the input to the attentional filter must come from different processing areas. Such an aspect is an intrinsic part, for example, of the feature integration theory of Treisman (Treisman and Gelade 1980). Moreover, different objects are expected to be represented in different parts of associative areas. The inhibition between different object representations competing for attention is then expected to require long-ranged inhibitory connections.
It is difficult to discover such long-ranged inhibitory effects in cortex or thalamus, whilst the general properties of a lateral inhibitory sheet of neurons, the NRT, were presented in the previous section; NRT seems a natural candidate to fit our requirements. It is the NRT which we posit achieves the long-ranged lateral inhibition needed by attention. This is not a new hypothesis, at least as far as the involvement of NRT as a gate for controlling the input to cortex from thalamus is concerned. A quotation was already given from Schiebel (1980), and there are a host of others in the literature of neuroanatomy and neurophysiology which argue the same feature (Yingling and Skinner 1977). What is claimed to be new here is the possibility that the NRT functions not only as a gate but also as a global gate. In other words, by virtue of the very nature of the structure and action of NRT, astride the main thalamo-cortical and cortico-thalamic pathways, NRT acts so as to: (1) allow only a single object at a time to be processed by crucial brain structures during directed attention (2) help change the particular focus of attention from one to another of a set of inputs, in particular the reengage part of the directed attention process (Posner and Petersen 1990) (3) allow for rapid response to significant or novel objects outside the directed attention paradigm (in the exogenous or 'bottom-up' attentional phase). To show how these three different processing activities could be achieved by means of NRT may be regarded as premature. The recent discoveries of Crabtree et al. (1989, 1991, 1992) indicate that the NRT has a more complex structure than had originally been contemplated. The studies of Llinas and Ribary (1991) and many others indicate that the nature of the thalamic relay cell transmission, both on anatomical and physiological grounds, has still to be properly understood. In the remainder of this section, we will present an extremely simplified version of our model of the thalamus-NRT-cortex (TNC) complex. We expect that this still contains the basic principles of the more complete living system. Our stripped down model has to achieve three processing activities, the first a simpler version of the second: (a) to enable competition between different thalamic inputs or cortical activities to be carried out, (b) to enable global competition to arise between all thalamic inputs and cortical activities, (c) to respond so that a novel input rapidly wins the competition. Our first simplification will neglect cortical activity altogether, so that the basic structure we are considering is that of Figure 1. Input I_j enters the thalamic relay cells T_j, whose excitatory output O_j feeds to the corresponding NRT cell N_j vertically above
Tj. The cell Nj sends excitatory activity back solely to Tj, but also sends inhibitory signals to other nearby cells in NRT. The NRT cells thereby violate Dale's Law (that all outputs from a given neuron have the same sign), but that can easily be corrected by including inhibitory interneurons in the thalamus, for which the effect of NRT is disinhibition of thalamic cells; the inclusion of such interneurons is done in the more complete model in the next section, but will be neglected here initially for reasons of simplicity. That the interneurons do lead to a net disinhibitory effect of NRT on thalamus is supported by experimental evidence (Steriade et al. 1986). The neglect of lateral connections of thalamus to NRT or NRT feedback to thalamus may be justified on the basis that the neurons of Figure 1 are regarded as corresponding to the averaged activity of groups of neurons in thalamus or NRT. Alternatively, it may be claimed that lateral connections, say with a Gaussian spread, will tend to smear activity out somewhat; the model of Figure 1 contains the ideal, sharper, mode of information processing, which may be satisfactory for a simplified set of inputs. Let us suppose that all cells have roughly the same thresholds and connection strengths in Figure 1. We may then argue that at the position of maximum of input I, the resultant excitation of the NRT cell N will be greater than that of the other neighbouring NRT cells. After transient activity, the membrane potential of the thalamic cells will stabilise at values for which that of the cell at the maximum of the input will be larger than those of the others by a margin greater than that by which the maximum input exceeds the lower, nearby, inputs. This can be worked out mathematically in the simple case of linear neurons; for suitable excitatory symmetric connection weights between thalamus and NRT, the magnification coefficient $m$, equal to the ratio of the activity in $T_1$ to that in $T_2$ (for only two cells in Figure 1), is equal to $(lI - 1)/(l - I)$, where $I$ is the ratio $I_1/I_2$ and $l = (1 - a^2 - AB)/(aAB)$, with $A$ and $B$ the connection weights from NRT to thalamus and from thalamus to NRT respectively, and $a$ the lateral inhibitory weight from cell 1 to cell 2 or vice versa. For example, for $A = B = a = \tfrac{1}{2}$, $m$ takes the value $(4I - 1)/(4 - I)$, which gets arbitrarily large as $I$ approaches 4 (with saturation effects, neglected in the linear analysis, ultimately putting an upper bound on $m$). We conclude that symmetric inhibitory lateral connections between NRT cells, with excitatory NRT-thalamic feedback, set up competition between thalamic inputs. There is enhancement of the direct input by the feedback, together with a reduced inhibitory contribution from nearby input. In the above example, with the numbers quoted, the potentials $V_{T1}$, $V_{T2}$ of the thalamic cells 1 and 2 are $\tfrac{8}{5}I_1 - \tfrac{2}{5}I_2$ and $\tfrac{8}{5}I_2 - \tfrac{2}{5}I_1$ respectively. The magnification factor increases the excitatory input on $T_1$ relative to that on $T_2$, whilst the inhibitory contributions will be larger at the weaker input than the stronger
one. The above features, of strengthening the stronger input, and of larger inhibition from that input being fed back onto the positions of weaker input, can be seen in the more general case of linear neurons with general connection matrices $A$ (NRT to thalamus), $B$ (thalamus to NRT) and $-C$ (NRT to NRT). The static membrane potential equations from Figure 1 are
$$V_T = I + A \cdot V_N, \qquad (1)$$
$$V_N = B \cdot V_T - C \cdot V_N, \qquad (2)$$
where $V_T$ and $V_N$ are the vectors of membrane potentials for thalamic and NRT cells, with solution
$$V_T = \bigl[1 - A(1+C)^{-1}B\bigr]^{-1} I \simeq I + A(1+C)^{-1}B\,I + \dots, \qquad (3)$$
where the third expression in (3) is valid for small $A$ or $B$, and $(1+C)^{-1}$ exists for positive definite $C$. For $A$ and $B$ close to diagonal matrices, we can discern in (3) the same effect as in the simpler model, with enhancement of inputs by roughly the factor $[\det(1+C)]^{-1}$, and inhibition by neighbouring inputs with coefficients of the order of the off-diagonal elements of $\operatorname{adj}(1+C)$ times $[\det(1+C)]^{-1}$, with further factors equal to the diagonal elements of $A$ and $B$. Thus again the architecture of Figure 1 will lead to competition between inputs.
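The competitive effect expressed by (3) is easily checked numerically. The following minimal sketch (in Python; the network size, connection strengths and inputs are purely illustrative choices, not values taken from the model) solves (3) for a small set of inputs and shows that the ratio of the winning thalamic activity to that of its nearest competitor exceeds the corresponding input ratio:

import numpy as np

# Minimal sketch of the linear thalamus-NRT competition of equation (3):
#   V_T = [1 - A (1 + C)^{-1} B]^{-1} I
# A: NRT -> thalamus excitation, B: thalamus -> NRT excitation,
# C: lateral NRT -> NRT inhibition.  All values below are illustrative.

n = 5                                        # number of thalamic/NRT cells
alpha, beta, gamma = 0.5, 0.5, 0.5           # feedback and lateral strengths
A = alpha * np.eye(n)
B = beta * np.eye(n)
C = gamma * (np.ones((n, n)) - np.eye(n))    # all-to-all lateral inhibition

I = np.array([1.0, 1.2, 2.0, 1.1, 0.9])      # inputs; cell 2 receives the maximum

M = np.eye(n) - A @ np.linalg.inv(np.eye(n) + C) @ B
V_T = np.linalg.solve(M, I)

print("inputs             :", I)
print("thalamic potentials:", np.round(V_T, 3))
print("input ratio  I_max/I_next :", I.max() / np.sort(I)[-2])
print("output ratio V_max/V_next :", V_T.max() / np.sort(V_T)[-2])
# The output ratio exceeds the input ratio: the strongest input is magnified
# relative to its competitors, as in the two-cell formula m = (lI - 1)/(l - I).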
2.1 The Third Simplified Model
It is possible to enhance the sensitivity of the system by introducing non-linear neurons, both by means of the gain factor and, more particularly, by a thresholding effect. This is seen clearly for the thalamic cells, since the lateral inhibitory spread on NRT could reduce the membrane potentials of those cells for which the input is non-maximal below their effective thresholds, and so cut off their outputs completely. This can be seen in detail for the case of the two sets of NRT-thalamus neurons treated earlier in the linear case, where the extension of (1) and (2) is the system (for no lateral spread of thalamus-NRT connections) given by:
$$V_{T1} - A\,f\!\left(\frac{B}{1-a^{2}}\bigl(f(V_{T1}) - f(V_{T2})\bigr)\right) = I_1, \qquad (4)$$
$$V_{T2} - A\,f\!\left(\frac{B}{1-a^{2}}\bigl(f(V_{T2}) - f(V_{T1})\bigr)\right) = I_2, \qquad (5)$$
with $f$ a non-linear sigmoid function with effective threshold $t$ (so $f(z) < 0$ for $z < t$, and $f(\infty) = +1$, $f(-\infty) = -1$). The equations (4) and (5) are seen to have the consistent solutions
$$f(V_{T1}) = +1, \qquad f(V_{T2}) = -1,$$
provided that
$$f\!\left(I_1 + A\,f\!\left(\frac{2B}{1-a^{2}}\right)\right) = +1, \qquad f\!\left(I_2 + A\,f\!\left(-\frac{2B}{1-a^{2}}\right)\right) = -1. \qquad (6)$$
It is clear that there is a range of values of the parameters $A$, $B$, $a$, $t$, and inputs $I_1 > I_2$, for which (6) has a solution, in particular by choosing $A$ and $B$ large enough or $a$ close enough to 1. These features are expected to extend to more general non-linear competitive implementations of the model.
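A small numerical check of this saturated, winner-take-all state can be made along the following lines; the tanh output standing in for $f$, the effective NRT drive $2B/(1-a^2)$ read from (6), and every parameter value are illustrative assumptions rather than the chapter's own choices:

import numpy as np

# Consistency check for the saturated solution of equations (4)-(6).
# The sigmoid, threshold and parameter values are illustrative, not the chapter's.

def f(x, t=0.5, gain=8.0):
    return np.tanh(gain * (x - t))   # output in (-1, +1), effective threshold t

A, B, a = 1.0, 1.0, 0.6
I1, I2 = 1.2, 1.0                    # I1 > I2
coup = 2.0 * B / (1.0 - a**2)        # NRT drive when f(V_T1) - f(V_T2) = 2

# Candidate winner-take-all state: f(V_T1) = +1, f(V_T2) = -1.
V_T1 = I1 + A * f(+coup)
V_T2 = I2 + A * f(-coup)

print("V_T1, V_T2       :", round(V_T1, 3), round(V_T2, 3))
print("f(V_T1), f(V_T2) :", round(f(V_T1), 3), round(f(V_T2), 3))
# The state is self-consistent when f(V_T1) is close to +1 and f(V_T2) to -1,
# which happens for A, B large enough or a close enough to 1, as in condition (6).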
2.2 A Simplified Global Competitive Model
The features of the model of Figure 1 we have explored so far lead to competition between thalamic neurons for which there is a lateral inhibitory connection between the corresponding cells on the NRT net. However, NRT cells have not been observed to have connections across the whole NRT sheet, in spite of their extensive dendritic trees (especially in the rostral part of NRT). Thus the range over which this competition can effectively take place will be limited. The localised properties of such competition were analysed over a decade ago (Ermentrout and Cowan 1978), in terms of the even more simplified model of a single sheet with excitatory input and a Mexican-hat style of lateral coupling; this model results from that of Figure 1 by collapsing the thalamic cells onto the corresponding NRT ones, and further violating Dale's Law. The range over which the competition can occur is of the order of the range of connections on the sheet. This feature of the sheet is not satisfactory for global control of the form of global guidance described earlier. It would seem that without considerable extensions of the lateral connections on NRT (or a small enough sheet, as might be the case in the rat) global guidance could not arise, and attentional control would be weak. A number of localised activities could then be supported on cortex, in disagreement with the ability to attend only to a single object at a time. The important feature of NRT, noted at least for more advanced mammals, above the rat, in Table 1, is the presence of dendro-dendritic synapses. These latter allow the NRT to be considered as a totally connected net even without the presence of axon collaterals of the form considered in Figure 1. Such dendro-dendritic synapses arise between horizontal cells in the outer plexiform layer of the retina (Dowling 1987), in the form of electrical gap junctions. They were modelled as linear resistors in the mathematical model of the retina in Taylor (1990). On NRT, the dendro-dendritic synapses appear to be chemical ones (as high-magnification electron micrographs show vesicles on one or on both sides of the synapses). Such synapses need to be modelled in a non-linear fashion. In general, the dendro-dendritic synaptic contributions to the membrane potential $V_N(\mathbf{r})$ at
a particular cell at the point $\mathbf{r}$ on the NRT sheet might be approximated as a sum of contributions from the nearby synapsing cells, each depending on the membrane potential differences between the cells. Thus, a typical form would be
$$\sum_{\mathbf{r}'} F\bigl(V_N(\mathbf{r}) - V_N(\mathbf{r}')\bigr), \qquad (7)$$
in terms of some non-linear function $F$. Working with only small changes of potentials, $F$ can be linearised, to give the contributions
$$G \sum_{\mathbf{r}'} \bigl(V_N(\mathbf{r}) - V_N(\mathbf{r}')\bigr), \qquad (8)$$
where $G = F'(0)$ is positive in the case of inhibitory action in the dendro-dendritic synapse. For values of $\mathbf{r}'$ close to $\mathbf{r}$, the continuum limit of the NRT system can be taken, and the expression (8) reduces to (Taylor 1990)
$$-a^{2}\nabla^{2}V_N + O(a^{4}), \qquad (9)$$
where $\nabla^{2}$ is the two-dimensional Laplacian operator in Cartesian coordinates and $a$ is a real constant determined by the average spacing between the neurons of the net. The resulting equations for the action of the linearised cells are the extension of (2) obtained by adding the dendro-dendritic synaptic contribution on the R.H.S. This leads to (1) and (2) modified by (9):
$$V_T = I + A \cdot V_N, \qquad (1')$$
$$V_N + a^{2}\nabla^{2}V_N = B \cdot V_T - C \cdot V_N. \qquad (2')$$
Upon neglect of the lateral connection matrix $C$, and with $A$ and $B$ diagonal, we obtain the simpler equation
$$V_N + b^{2}\nabla^{2}V_N = J, \qquad (10)$$
where $b^{2} = (1 - AB)^{-1}a^{2}$, $J = (1 - AB)^{-1}B\,I$, and we assume $AB < 1$ to prevent infinite gain in the thalamus-NRT feedback loop system. The expression (10) is the basis of the simple version of the global gating model. What can we deduce from (10) about the dependence of the response of the NRT cells' potentials (and hence, by (1'), that of the thalamic cells) on the input? We claim that (10) instantiates a form of competition in the spatial wavelength parameters of incoming inputs $J$. Physical systems with this underlying description have been investigated in a number of cases: spatially inhomogeneous superconducting states on tunnel injection of quasiparticles (Iguchi and Langenburg 1980); a Peierls insulator under strong dynamic excitation of electron-hole pairs or in the presence of electromagnetic radiation (Berggren and Huberman 1978). These models are related to ones for growth and dispersal in a population (Cohen and Murray 1981; Murray 1989). We may see precise forms of competition arising from the NRT modelled by (10) by looking at this equation for inputs made of sums of plane waves. Thus, if $J$ is composed of a set of separate waves of wavenumbers $\mathbf{k}_1, \dots, \mathbf{k}_n$, so $J = \sum_j c_j \sin \mathbf{k}_j \cdot \mathbf{r}$, then, for suitable coefficients $c'_j$, from (10),
$$V_N = \sum_{j=1}^{n} c'_j \sin \mathbf{k}_j \cdot \mathbf{r}. \qquad (11)$$
Thus NRT activity will also display the same spatial oscillation as the input, but now with amplification of those waves with
$$\mathbf{k}_j^{2} = \frac{1}{b^{2}}. \qquad (12)$$
Such augmentation corresponds to a process of competition on the space of the Fourier transform of inputs, where the Fourier transform $\tilde{I}(\mathbf{k})$ for an input $I(\mathbf{r})$ involves the global recombination
$$\tilde{I}(\mathbf{k}) = \frac{1}{2\pi}\int d^{2}r\, e^{i\mathbf{k}\cdot\mathbf{r}}\, I(\mathbf{r}). \qquad (13)$$
It is in this manner that we can see how NRT can exercise global control, by singling out those components $\tilde{I}(\mathbf{k})$ of $I(\mathbf{r})$, by (13), for which (12) is true. Other values of $\mathbf{k}$ do not have such amplification. In other words, it would appear that the NRT would oscillate spatially with wavelength $2\pi b$, with net strength given by the component of the input with the same wavelength. The way in which global control arises now becomes clear. Only those inputs which have special spatial wavelength oscillations are allowed through to the cortex, or are allowed to persist in those regions of cortex strongly connected to the NRT: the thalamus-NRT system acts as a spatial Fourier filter. There is evidence for this in that feature detectors occur in a regular manner across striate cortex (Hubel and Wiesel 1962), and face coding appears to have a spatial lattice structure (Harries and Perrett 1991). Other explanations of the spatial periodicity of striate cortex feature detectors have been proposed (Durbin and Mitchison 1990), but these are consistent with our present proposal, which may only add a further spatial instability to that explored in those references by non-NRT processes. It would seem that the model of Figure 1 with dendro-dendritic synapses, as described by equations (1') and (2'), can satisfy criterion (b) of §2, at least to a limited extent. The model can be extended immediately to allow activity (c) of §2 to be implemented by addition of a further direct input arising from a net assessment of peripheral visual inputs.
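The wavelength selection expressed by (10)-(12) can be illustrated in a few lines of code; the value of $b$ and the input wavenumbers below are arbitrary illustrative choices:

import numpy as np

# Sketch of the wavelength competition of equations (10)-(12), in one dimension.
# For an input J = sum_j c_j sin(k_j x), the linear filter
#   V_N + b^2 d^2V_N/dx^2 = J
# returns each mode scaled by 1/(1 - b^2 k_j^2): modes with k_j^2 near 1/b^2
# are strongly amplified and win the competition.  All values are illustrative.

b = 1.0
modes = {0.30: 1.0, 0.95: 1.0, 2.00: 1.0}    # wavenumber -> input amplitude c_j

for k, c in modes.items():
    response = c / (1.0 - (b * k) ** 2)      # since d^2/dx^2 sin(kx) = -k^2 sin(kx)
    print(f"k = {k:4.2f}   |c_j| = {c:.2f}   |response| = {abs(response):6.2f}")

# The k = 0.95 component (k^2 close to 1/b^2) dominates the NRT activity even
# though all three inputs have the same amplitude: the Fourier-filter behaviour
# described around equations (12) and (13).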
There are known diffuse neurochemically identifiable afferents (cholinergic, GABAergic, serotonergic, and noradrenergic) which enter NRT. These are known to modify NRT cell firing. In some parts, these inputs may be used to switch off on-going NRT activity (by inhibition) or to win any on-going competition (by excitation). This could be modelled by an additional input on the R.H.S. of (2). It is relevant to note that NRT activity is expected to be more sensitive to such inputs than to that arising indirectly from T in (1). There is still the question of the stability of the solutions. We also have to discuss how flexible the competitive process is. That will be done in terms of the more complete model which we outline in the next section. This will indicate that global control may also depend on the input amplitudes, as well as their wavelengths, in the non-linear regime.
3 Modelling the Global Competitive Gate
3.1 The Model
3.1.1 Formulation
We discussed in the previous section various simplified versions of the overall model to be presented in this chapter. There were also some unanswered questions raised by these models, in particular that of the stability and flexibility of the system. In this section, we wish to present the more complete model and discuss in general how it functions. Furthermore, we will show how it possesses features able to give a broad range of responses to inputs. Most specifically, we wish to show that parameter ranges can be varied by internal and external factors. The basis of the model is best seen in terms of its wiring diagram in Figure 2. The model is an extension of that in Figure 1 by means of: (i) addition of the cortical layer C, (ii) addition of interneurons IN in the thalamus, (iii) extension of the input to the IN cells as well as the T cells, with NRT feedback taken solely to the IN cells. The latter connectivity is an approximation to the results of the analysis of Steriade and colleagues (1986), in which the effect of GABA on GABAergic cells (from N-cell feedback to IN cells) is claimed to be an order of magnitude more effective than that of GABA on excitatory cells. We note that the model may be regarded as the framework for more extensive modelling with layered cortex and NRT. However, at this stage, too much complexity would be introduced too rapidly by such structural richness. We now turn to the detail of the equations used to describe the thalamus-NRT-
cortex interacting system, whose general structure is shown in Figure 2. Excitatory neurons are assumed to be at a coordinate position $\mathbf{r}$ (using the same coordinate frame for thalamus, NRT and cortex) on the thalamic and cortical sheets and to have membrane potentials labelled $u_T(\mathbf{r})$ and $u_C(\mathbf{r})$, and outputs $f_i(u_i)$ ($i = T$ and $C$ respectively), where $f_i(x) = \bigl[1 + \exp\bigl(-(x - \theta_i)/T_i\bigr)\bigr]^{-1}$, with threshold $\theta_i$ and temperature $T_i$. Inhibitory neurons in the thalamus and on the NRT (with the same coordinate positions) have membrane potentials denoted by $v_T(\mathbf{r})$, $v_N(\mathbf{r})$, with similar sigmoidal outputs $g_T$, $g_N$ to the other neurons. Connection weights from the $j$'th excitatory (inhibitory) neuron at $\mathbf{r}'$ to the $i$'th excitatory (inhibitory) neuron at $\mathbf{r}$ are denoted by $W_{ij}^{EE}(\mathbf{r},\mathbf{r}')$, $W_{ij}^{EI}(\mathbf{r},\mathbf{r}')$, $W_{ij}^{IE}(\mathbf{r},\mathbf{r}')$, $W_{ij}^{II}(\mathbf{r},\mathbf{r}')$. The resulting leaky integrator neurons (LINs) satisfy equations (14)-(17); that for the thalamic relay cells, for example, is
$$\dot{u}_T(\mathbf{r}) = -\frac{1}{\tau_T}u_T(\mathbf{r}) + \sum_{\mathbf{r}'}\Bigl[W_{TC}^{EE}(\mathbf{r},\mathbf{r}')\,f_C\bigl(u_C(\mathbf{r}')\bigr) + W_{TT}^{EI}(\mathbf{r},\mathbf{r}')\,g_T\bigl(v_T(\mathbf{r}')\bigr)\Bigr] + \frac{1}{\tau_T}I(\mathbf{r}) \qquad (17)$$
(where we are using the notation $\dot{u} = \partial u/\partial t$). We have rescaled all the connection weights, and the input on the thalamic cells, so that the decay constants $\tau_i$ drop out for stationary activity. Moreover, the dendro-dendritic contribution has yet to be included in (15); it is given by the analysis of Taylor (1990), as cited earlier, to be equal in the linearised limit to
$$-G_2\sum_{\mathbf{r}'}\bigl(v_N(\mathbf{r}) - v_N(\mathbf{r}')\bigr). \qquad (18)$$
A purely rectangular net with four neighbours at horizontal and vertical distances $a_\pm$, $b_\pm$ (so $\mathbf{r}' = \mathbf{r} + (a_\pm, b_\pm)$) gives, in the continuum limit (Taylor 1990), the dendro-dendritic term
$$G_2\Bigl(\mathbf{v}\cdot\nabla v_N + \tfrac{1}{2}\boldsymbol{\Lambda}^{2}\cdot\nabla_2 v_N\Bigr) + \text{higher order terms}, \qquad (19)$$
where $\mathbf{v} = (a_+ - a_-,\; b_+ - b_-)^{T}$, $\boldsymbol{\Lambda}^{2} = (a_+^{2} + a_-^{2},\; b_+^{2} + b_-^{2})^{T}$, $\nabla_2 = (\partial_x^{2}, \partial_y^{2})^{T}$, and $G_2$ is a negatively-valued constant in (19). The equation (15) thus reduces to a negative Laplacian net, with a linear derivative term proportional to $\mathbf{v}$. The inputs to the net depend, however, in a non-linear manner on the other potentials, as given by (14), (16) and (17).
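That the nearest-neighbour sum does behave like the Laplacian term appearing in (9) and (19) can be verified directly; the test potential, spacing and constant below are illustrative choices:

import numpy as np

# Check that the linearised dendro-dendritic sum over nearest neighbours,
#   G * sum_{r'} (v(r') - v(r)),
# approaches G * a^2 * v''(r) on a regular net of spacing a (cf. (8), (9), (19)).
# The test potential and all constants are illustrative.

a = 0.05                                   # spacing between NRT cells
x = np.arange(-2.0, 2.0 + a, a)
v = np.exp(-x**2)                          # smooth test membrane potential
G = -1.0                                   # negative for inhibitory synapses

neighbour_sum = G * (np.roll(v, 1) + np.roll(v, -1) - 2.0 * v)   # discrete sum
continuum = G * a**2 * (4.0 * x**2 - 2.0) * np.exp(-x**2)        # G a^2 v''(x)

interior = slice(1, -1)                    # ignore the wrap-around end points
err = np.max(np.abs(neighbour_sum[interior] - continuum[interior]))
print("max |discrete - continuum| on interior:", float(err))
print("max |continuum| for scale             :", float(np.max(np.abs(continuum))))
# The difference is O(a^4), i.e. the 'higher order terms' of equation (19).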
3.1.2 Analysis of the Model
The model of Figure 2, expressed mathematically by equations (14)-(19), is expected to have similar properties to those of the various models associated with Figure 1 and discussed in §2. That is clear for the asymptotic solutions satisfying the equations obtained by setting the left-hand sides of (14)-(17) all to zero. The main differences between this static system and the earlier equations of §2 are: (i) the presence of the C-layer and its feedback, (ii) the non-linearity of all neurons, (iii) the presence of the inhibitory interneurons IN, with NRT feedback to them rather than to the T-cells (as in Figure 1). We will briefly discuss each of these features in turn. The cortical layer can be seen as a mechanism for achieving extra input amplification, by means of the thalamus-cortex-thalamus feedback loop, as has already been discussed by La Berge and colleagues (1992). Indeed, the thalamus-NRT feedback loop functions in a similar manner, where the factor $1/(1 - AB)$ arises as the gain factor in the model of §2. This amplification may make the competition on NRT more effective (La Berge et al. 1992). This will be borne out by simulation of our complete system. Non-trivial transforms in C will be expected to modify this result, but will not be considered here. The non-linear functionality of neurons was already argued to be a means of increasing the sensitivity of the system in §2. We will appeal to this same argument here. The inhibitory interneurons have been included so as to be able to implement Dale's Law properly. This is clearly satisfied in Figure 2 and equations (14)-(17). The effect of NRT activity on the IN cells will therefore be that of disinhibition, which is a mode of action used in numerous parts of the brain. Its net effect, for suitable parameter ranges of cell thresholds, is similar to that of excitation of NRT activity directly on the thalamic relay cells, as observed experimentally, for example, by Steriade et al. (1986). In this range of parameters, then, the disinhibitory NRT-IN-T activity of Figure 2 reduces to the net excitatory NRT-T activity of Figure 1. It would appear that the model of Figure 2 should therefore have similar properties to that of Figure 1, in particular supporting both local and global competition and being able to account for endogenous activity by the addition of direct input to NRT on the R.H.S. of (15). However, there is still the crucial question of the stability, and of the more detailed input-dependence, of plane waves excited globally across the NRT. Stability can be analysed by looking at higher-order terms that might arise in (19), and also by considering a linearised analysis of the lateral connection term involving the con-
nection weight $W^{II}(\mathbf{r},\mathbf{r}')$ on the R.H.S. of (15). The former of these gives (assuming symmetry in the $x$ and $y$ directions) a quartic correction $-c_2(\nabla^{2})^{2}v_N$ to the Laplacian term (20), and the latter, for a Gaussian spread of lateral inhibition, gives a convolution product with a difference-of-Gaussians weight (21), (22), with $b < a$. The dispersion relation for the temporal dependence $e^{\lambda t}$ of the NRT membrane potential in (15), neglecting all but the lateral connections as a first approximation (since the other terms do not play a crucial role), becomes, for the Fourier mode $\mathbf{k}$,
$$\lambda = -G_2\,\mathbf{k}\cdot\mathbf{k} - c_2(\mathbf{k}\cdot\mathbf{k})^{2} - F(\mathbf{k}), \qquad (23)$$
with $F(\mathbf{k})$ the Fourier transform of the lateral NRT connection weights (24), whose general shape is that shown in Figure 3(a).
In Figure 3(a), the general shape of $F(\mathbf{k})$ is plotted, and in Figure 3(b), that of $\lambda(\mathbf{k})$. Instability arises for values of $\mathbf{k}$ for which $\lambda(\mathbf{k}) \geq 0$. This is the interval $(k_1, k_2)$ in Figure 3(b). As argued with clarity in Murray (1989), inhomogeneous NRT activity with wave numbers in the interval $(k_1, k_2)$ will be expected to grow. The stability of the resulting globally inhomogeneous activity depends on the more detailed non-linear system (14)-(18), and presently can only be ascertained by simulation. Results of the latter will be presented in §3.2; they will show the system is stable. The dependence of the winning activity on the wavenumber of maximum growth singled out by the competition on NRT will be discussed further in the next sub-section. There is, finally, the non-trivial dependence on the inputs of the inhomogeneous mode winning the competition on the NRT sheet. Thus, whilst equation (23) would seem to indicate a lack of dependence of this mode on the input, the full non-linear system does indeed have such a dependence. There does not seem to be much mathematical analysis of this problem in the literature, although dependence on boundary and initial conditions for reaction-diffusion models has been studied (Arcuri and Murray 1986). We will turn to answer this question by simulation in the next sub-section; we propose to consider it mathematically elsewhere.
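The unstable band of Figure 3(b) can be exhibited by evaluating (23) numerically; the difference-of-Gaussians form chosen for $F(\mathbf{k})$ and all parameter values are assumptions for illustration only, not the connection weights of the model:

import numpy as np

# Evaluate the dispersion relation (23), lambda(k) = -G2*k^2 - c2*k^4 - F(k),
# for an illustrative difference-of-Gaussians transform F(k).  The parameter
# values and the precise form of F are assumptions, not the chapter's.

G2, c2 = -1.0, 0.05            # G2 < 0 (inhibitory dendro-dendritic synapses)
A1, a1 = 1.2, 1.0              # narrow Gaussian in k
A2, a2 = 0.9, 3.0              # broad Gaussian in k

k = np.linspace(0.0, 6.0, 601)
F = A1 * np.exp(-(k / a1) ** 2) - A2 * np.exp(-(k / a2) ** 2)
lam = -G2 * k**2 - c2 * k**4 - F

unstable = k[lam >= 0.0]
if unstable.size:
    print(f"unstable band: k1 ~ {unstable.min():.2f}, k2 ~ {unstable.max():.2f}")
else:
    print("no unstable band for these parameters")
# Wavenumbers inside (k1, k2) grow, giving the spatially inhomogeneous NRT
# activity sketched in Figure 3(b); outside the band lambda(k) < 0.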
3.2 Simulations
3.2.1 The Simulation Model
The simulation model we investigated is illustrated in Figure 2. It is essentially one-dimensional, corresponding to lines of thalamic, NRT and cortical neurons. In future work, it is intended to extend the simulation work to the more realistic case of two-dimensional sheets of these neurons. A simplified version of Figure 2 is presented in Figure 4. It is useful for simulation purposes to think of the boxed subsystem in Figure 4 as a single module that can be linearly replicated (with bidirectional lateral links among the NRT units). This allows the size of the simulation to be scaled to suit the available computing power. Within each module, we have inter-unit signal flow as specified by the arrow-tipped curves. Every tip labelled with a '+' signifies an excitatory signal, while every '-' corresponds to an inhibitory signal. Within every module, the time evolution of each neural unit's voltage is determined by the solution of equations (14)-(17). Within a computing context, of course, a discretised version of these equations has to be integrated over some suitably small time step (we used the Runge-Kutta fourth-order integration routine). Such a scheme entails in practice that we replace the hard-limiting non-linearity by a smooth analytical function. We adopted the following form, as used by La Berge et al. (1992):
$$f(x_A) = h_A\,Y(x_A - \theta_A)\bigl[1 - \exp\{-\rho_A(x_A - \theta_A)\}\bigr], \qquad (25)$$
where $h_A$ is a scaling parameter, $Y(\cdot)$ is the step function, $\theta_A$ is the threshold for unit $A$ and $\rho_A$ is its inverse temperature. In order to proceed with the simulations, it was found necessary to obtain limits on the ranges of the large number of parameters involved. To obtain an idea of what constitutes 'good' parameter ranges, consider the equations (14)-(17) in the static limit (obtained by setting $du_i/dt = dv_i/dt = 0$, or, equivalently, by allowing the $\tau_i$'s, $i = T, N, C$, to go to zero):
$$u_C = W^{CC}\,Y(u_C - \theta_C), \qquad (14')$$
$$v_N = W^{NT}\,Y(u_T - \theta_T) + W^{NC}\,Y(u_C - \theta_C), \qquad (15')$$
$$v_T = W^{TN}\,Y(v_N - \theta_N), \qquad (16')$$
$$u_T = I_k + W^{TT}\,Y(v_T - \theta_T) + W^{TC}\,Y(u_C - \theta_C). \qquad (17')$$
Note that in these equations we have taken the full non-linearity $Y(\cdot)$ for the function in (25). To distinguish between zero and finite external inputs, we define further $I_k^{0} = 0$ and $I_k^{1} = I_k \neq 0$. In order to set up the subsystem so that we can achieve global control,
we require that units T, N_k and C be switched 'off' whenever unit I is 'on' in the case of zero external input, and vice versa in the case of finite external input². Looking at (17') separately for the cases $I_k = I_k^{0}$ and $I_k = I_k^{1}$, it is easy to show that the thresholds of each of the units have to lie in certain ranges (26).
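The following sketch indicates how a single module of this kind might be stepped in time; the wiring follows the qualitative description of Figure 2 (input to T and IN, reciprocal T-C connections, collaterals to N, and N acting through IN), but every weight, threshold, time constant and output-function parameter is an illustrative placeholder rather than a value used in the runs reported below:

import numpy as np

# Minimal sketch of one thalamus-IN-NRT-cortex module (cf. Figure 4) stepped
# with a fourth-order Runge-Kutta rule.  All numerical values are placeholders.

def out(x, h=1.0, theta=0.1, rho=4.0):
    # Smooth output of equation (25): h * Y(x - theta) * [1 - exp(-rho(x - theta))]
    return np.where(x > theta, h * (1.0 - np.exp(-rho * (x - theta))), 0.0)

def deriv(state, I, tau=1.0):
    uT, uC, vN, vIN = state
    duT  = (-uT + I + 1.2 * out(uC) - 1.5 * out(vIN)) / tau    # relay cell T
    duC  = (-uC + 1.1 * out(uT)) / tau                         # cortical cell C
    dvN  = (-vN + 0.8 * out(uT) + 0.8 * out(uC)) / tau         # NRT cell N
    dvIN = (-vIN + 0.9 * I - 1.0 * out(vN)) / tau              # interneuron IN
    return np.array([duT, duC, dvN, dvIN])

def rk4_step(state, I, dt=0.02):
    k1 = deriv(state, I)
    k2 = deriv(state + 0.5 * dt * k1, I)
    k3 = deriv(state + 0.5 * dt * k2, I)
    k4 = deriv(state + dt * k3, I)
    return state + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0

state = np.zeros(4)
for step in range(2000):                 # 40 time units of constant input
    state = rk4_step(state, I=1.0)

uT, uC, vN, vIN = state
print("steady state  uT=%.3f  uC=%.3f  vN=%.3f  vIN=%.3f" % (uT, uC, vN, vIN))
# With these placeholder values the relay and cortical units settle into an
# amplified 'on' state; coupling many such modules through the lateral NRT
# terms (dendro-dendritic and DOG) is what produces the competition studied
# in the simulations below.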
3.2.2 Simulation Results
The series of simulation results we carried out reflects in part the major aim of establishing the existence of the global control model for the NRT. In Figures 5-15, we present the results obtained from actual simulation runs. The simulation code we developed was very modular in design, allowing for easy scaling to more powerful simulations, which could, with some reconfiguration, also be performed on parallel processing machines. It is hoped in future to extend the simulations to the more realistic case of two-dimensional sheets of thalamus, NRT and cortical tissue, and this can easily be accommodated. The x-axes in these plots represent the spatial positions of a line of 100 neurons. The y-axes represent OUT(u_T) and OUT(v_T) for the thalamic excitatory and inhibitory neurons respectively (where OUT has its usual ANN interpretation), and user-scaled³ raw voltage output from the v_N and u_C neurons in the NRT and cortex respectively. Time delays for signal propagation between neurons were set to zero (although they could easily be incorporated in any more realistic future runs). The very simplest situation one can envisage with this system is that where there is no lateral coupling between NRT neurons. Such a system (Figure 5) is very useful for evaluating the effect of feedback strength between neurons of the three layers. We see that each vertical module acts in essence like an amplifier for its particular input signal. Since there exist a large number of free parameters in this system (between 20-40, depending on how the simulations are configured), it is useful to determine suitable ranges for the allowed values these can take, in addition to the constraints in (26). We do this by introducing lateral connectivity (given by equations (18), (22)) and experimenting with different values of the amplitudes, thresholds and temperatures of the neurons, and the range of the spreads for the lateral terms. The last of these turns out to be a stability parameter, in that excessively long-ranged influence of the lateral connectivity terms leads to unrealistically large output voltages in the NRT neurons. We identify such behaviour as the non-linear regime of operation. This behaviour is illustrated in Figures 6-8, where we successively increase the spread from 5 to 50. Every positively valued segment of the NRT wave acts to allow an input through to the cortex, while every negatively valued segment acts to restrict it. This is best illustrated in Figure 7. For moderate values of the spread, we find that the spread has a second attribute as a mechanism for local control. For large values of the spread, the NRT activity begins to grow disproportionately (Figure 8). The spread also plays a part in the outcome of amplitude competition, as in Figure 9, for instance. Of the two plateaus representing strong and weak inputs, the stronger input dominates the weaker one. The spatial wavelengths set up in the NRT region are much smaller than the spread of the plateaus, yet exercise strong control over the allowed cortical response. There is partially global control over the allowed set of inputs propagating up to the cortex, determined principally by wave-like activity in the NRT, itself a function of the spread. It is interesting to note here the relative effects of the difference-of-Gaussians (DOG) and dendro-dendritic terms in the emergence of waves on the NRT sheet. Figure 10 illustrates the result we obtain by eliminating the dendro-dendritic term, while Figure 11 shows the activity upon removal of the DOG term. We see that the DOG term is clearly less influential in both spatial waveform generation and the development of strong patterns of activity in the cortical layer. This is to be expected, however, since the axon collaterals in the DOG representation are not long-ranged. The global control mechanism predicted in §2 is illustrated in Figures 12-15. The spatially global wave of activity on the NRT exercises wavelength-dependent control on the signals propagating forth to the cortex. The cortical activity persisting does not reflect the inputs very strongly (Figure 13), as it did for the partially global control in Figure 9, being influenced instead by the oscillatory character of the NRT activity. A significant aspect of this system, when operating in a global control phase, is its sensitive dependence on classes of inputs. We have seen that during such a phase, the cortex only sees the winner of the competition taking place on the NRT. Equivalently, the NRT (according to the theory of §2) is acting like a non-linear filter that allows one out of all the Fourier components presented to it to propagate through. We expect therefore that the selection of this particular Fourier component would be critically influenced by factors such as the wavelength and the amplitude of the input (we noted this towards the end of §2 as well). It is difficult to establish with certainty, however, which of these two is the more significant, since our simulation results show only trends in the output, and not actual magnitudes thereof, as mentioned earlier.
²The 'on' state corresponds to OUT(unit) = 1, while the 'off' state is OUT(unit) = 0, with OUT having its usual artificial neural network meaning.
³As a result, the vertical scales in any of the plots are not in proportion. This, however, is not a concern, since we are interested in the general behaviour of the system rather than particular values of output voltages.
4 Discussion
The theoretical and simulation results presented in the previous section show that the simplified model of the thalamus-NRT-cortex complex does achieve the requirements (a), (b) and (c) outlined in §2, so as to allow rapid local and global competition to occur between thalamic inputs. There is still much analysis to be done on this model, both theoretically and by simulation. Thus the following questions, among many, need to be answered:
1. delineation of the domain of parameter space in which the competition between inputs occurs in momentum space rather than on the separate coordinate-dependent amplitudes, due to non-linearities, or vice versa;
2. the crucial parameters on which the speed of the outcome of the competition depends;
3. the effects of including more realistic properties of neurons, such as stochasticity, spikes, neuronal geometry, etc.;
4. the effect of the more complete structure of cortex and NRT, and of more realistic modelling of thalamic glomeruli, on the nature of the system;
5. the effects of information transformation in cortex on the details of the operation of the competition.
The answer to 2 is of relevance to ongoing experimental work on the bringing to sensory awareness of sensations of touch by direct cortical electrical stimulation (Libet et al. 1964), as has been pointed out recently by one of us (Taylor 1993a). The answer is expected to be in terms of an exponential increase in time, with a time-constant depending upon the injected current, as the stability analysis of §2.1 indicates. The answer to 3 is relevant to the level of neuro-biological realism of the modelling, but is not expected to make much difference to the principles of the system. Questions 4 and 5 may allow considerable extension of the model, especially if one takes account of cortical activity at working memory centres being injected as thalamic input; these working memory units may, along with primary cortical input areas, be considered as the source of sensory awareness (Taylor 1993b). Finally, question 1 requires further work
also involved with the investigation of the other questions. We hope to be able to consider answers to these questions in due course.
Acknowledgements
One of us (F.N.A.) would like to thank the Science and Engineering Research Council (SERC) of the U.K. for grant GR/F92251, which enabled part of this work to be carried out.
Figure 1. The structure of the first model of the thalamus-NRT-cortex complex, in which the cortex is dropped, and competition occurs only between the inputs Ij to the thalamic relay cells Tj; the strengths Oj of the outputs indicate the winners and losers of the competition carried out between the corresponding NRT neurons Nj by means of their inhibitory lateral connections.
Figure 2. The wiring diagram of the main model of the thalamus-NRT-cortex complex. Input Ij is sent both to the thalamic relay cell Tj and to the inhibitory interneuron INj, which latter cell also feeds to Tj. Output from Tj goes up to the corresponding cortical cell Cj, which returns its output to Tj. Both the axons TjCj and CjTj send axon collaterals to the corresponding NRT cell Nj. There is axonal output from Nj to INj, as well as collaterals to neighbouring NRT cells; there are also dendro-dendritic synapses between the NRT cells.
Figure 3a. Dependence of the Fourier transform Φ(k) of the lateral NRT connection weights on the wave variable k. Initially positive, the value of Φ(k) becomes negative, so producing possible instability according to equation (23).
Figure 3b. Dispersion relation for v_N in wavenumber space. The interval (k1, k2) contains the wave numbers for which instability occurs, and spatially inhomogeneous activity is expected to arise.
[Figure 4 diagram: the (k-1)'th, k'th and (k+1)'th subsystems, each receiving its own input.]
Figure 4. Schematic of the "modularised" version of Figure 2, as used in obtaining the results of the computer simulations.
Figure 5. Simulation run with no lateral coupling between NRT neurons. The input essentially feeds through to the cortex, as might be expected.
[Plot traces in Figures 6-15: NRT (volts), OUT(Thalam.), OUT(Inhib.), Input (volts).]
Figure 6. Simulation run with lateral connectivity (both dendro-dendritic and DOG) introduced. The value chosen for the spread is small here. Wave activity is beginning to appear on NRT.
Figure 7. As for Figure 6, but with moderate values for the spread. The NRT is clearly influencing what is allowed to propagate through to the cortex.
Figure 8. As for Figure 6, but with large values for the spread. The activity on the NRT is beginning to take on a non-linear mode of operation. There is still, however, control over what is allowed to go through to the cortex.
Figure 9. Simulation run showing amplitude competition. Of the strong and weak inputs being fed in, only the strong survives the journey to the cortex. There is partially global control exercised over this by the activity on the NRT.
Figure 10. Simulation run showing the development of activity on NRT with only a DOG form for the lateral connectivity. (Compare with Figure 11).
Figure 11. Simulation run showing the development of activity on NRT with only a dendro-dendritic form for the lateral connectivity. (Compare with Figure 10).
Figure 12. Simulation run showing full global control with a spatially constant input. The activity on the cortex reflects the activity on the NRT, and is not dependent on the form of the input.
Figure 13. Simulation run showing full global control with semi-constant spatial input. Again, the cortex activity is influenced by the NRT alone.
Figure 14. Simulation run showing full global control with short-wavelength periodic input.
Figure 15. Simulation run showing full global control with medium-wavelength periodic input.
5 References
Ahlsén, G. and Lindström, S. (1982). Mutual Inhibition between Perigeniculate Neurons, Brain Res., 236, 482-486.
Arcuri, P. and Murray, J. D. (1986). Pattern sensitivity to boundary and initial conditions in reaction-diffusion models, Math. Biol., 24, 141-165.
Avanzini, G., de Curtis, M., Ferruccio, P. and Spreafico, R. (1989). Intrinsic Properties of Nucleus Reticularis Thalami Neurons of the Rat Studied in vitro, J. Physiol., 416, 111-122.
Berggren, K. F. and Huberman, B. A. (1978). Peierls state far from equilibrium, Phys. Rev. B, 18, 3369-3375.
Barbaresi, P., Spreafico, R., Frassoni, C. and Rustioni, A. (1986). GABA-ergic neurons are present in the dorsal column nuclei but not in the ventroposterior complex of rats, Brain Res., 382, 305-326.
Cohen, D. S. and Murray, J. (1981). A Generalised Diffusion Model for growth and dispersal in a population, Math. Biol., 12, 237-249.
Crabtree, J. W. (1989). Evidence for topographic maps within the visual and somatosensory sectors of the thalamic reticular nucleus: A comparison of cat and rabbit, Soc. Neurosci. Abs., 15, 1393.
Crabtree, J. W. (1991). Maps within the cat's somatosensory thalamus, Soc. Neurosci. Abs., 17, 623.
Crabtree, J. W. (1992). The somatotopic organization within the rabbit's thalamic reticular nucleus, Eur. J. Neurosci., 4, 1343-1351.
Crabtree, J. W. and Killackey, H. P. (1989). The topographic organization and axis of projection within the visual sector of the rabbit's thalamic reticular nucleus, Eur. J. Neurosci., 1, 94-109.
Deschênes, M., Madariaga-Domich, A. and Steriade, M. (1989). Dendro-dendritic synapses in the cat reticularis thalami nucleus: A Structural basis for Thalamic Spindle Synchronisation, Brain Res., 334, 165-168.
Douglas, R. J. and Martin, K. A. (1991). A Functional Microcircuit for Cat Visual Cortex, J. Physiol., 440, 735-769.
Dowling, J. (1987). The Retina. Harvard University Press.
Durbin, R. and Mitchison, G. (1990). A dimension reduction framework for understanding cortical maps, Nature, 343, 644-647.
Ermentrout, G. B. and Cowan, J. D. (1978). Some Aspects of the 'Eigenbehaviour of Neural Nets', Studies in Mathematics: The Mathematical Assoc. of America, 15, 67-117.
Harries, M. H. and Perrett, D. I. (1991). Visual Processing of Faces in Temporal Cortex: Physiological Evidence for a Modular Organisation and Possible Anatomical Correlates, J. Cog. Neurosci., 3, 9-23.
Harris, R. M. and Hendrickson, A. E. (1987). Local circuit neurons in the rat ventrobasal thalamus - A GABA immunocytochemical study, Neurosci., 21, 229-236.
Hornik, K., Stinchcombe, M. and White, H. (1989). Multilayer feedforward networks are universal approximators, Neural Networks, 2, 359-368.
Hubel, D. H. and Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex, J. Physiol., 160, 106-154.
Iguchi, I. and Langenburg, D. N. (1980). Diffusive Quasiparticle Instability toward Multiple-Gap States in a Tunnel-Injected Nonequilibrium Superconductor, Phys. Rev. Lett., 44, 486-489.
Jones, E. G. (1975). Some Aspects of the Organisation of the Thalamic Reticular Complex, J. Comp. Neurobiol., 162, 285-308.
Koch, C. and Ullman, S. (1985). Shifts in Selective Visual Attention: Towards the Underlying Circuitry, Human Neurobiol., 4, 219-227.
La Berge, D. (1990). Thalamic and Cortical Mechanisms of Attention suggested by recent Positron Emission Tomographic Experiments, J. Cog. Neurosci., 2, 358-373.
La Berge, D., Carter, M. and Brown, V. (1992). A Network Simulation of Thalamic Circuit Operations in Selective Attention, Neural Computation, 4, 318-331.
Libet, B., Alberts, W. W., Wright, E. W., Delattre, L. D., Levin, G. and Feinstein, B. (1964). Production of threshold levels of conscious sensation by electrical stimulation of human somato-sensory cortex, J. Neurophys., 27, 546-578.
Liljenström, H. (1991). Modelling the Dynamics of Olfactory Cortex using Simplified Network Units and Realistic Architectures, Int. J. Neurosci., 1-2, (to appear).
Linsker, R. (1988). Self-organisation in a Perceptual Network, Computer, 21, 105-117.
Llinas, R. and Ribary, U. (1991). Ch. 7 in Induced Rhythms in the Brain. Eds. E. Basar and T. Bullock, Birkhauser, Boston.
Martin, K. A. (1988). From Single Cells to Single Circuits in the Cerebral Cortex, Quart. J. Exp. Physiol., 73, 637-702.
McCormick, D. A. and Prince, D. A. (1986). Acetylcholine induces burst firing in thalamic reticular neurons by activating a potassium conductance, Nature, 319, 402-405.
Montero, V. M., Guillery, R. W. and Woolsey, C. N. (1977). Retinotopic organization within the thalamic reticular nucleus demonstrated by a double label autoradiographic technique, Brain Res., 138, 407-421.
Murray, J. M. (1989). Mathematical Biology, Springer-Verlag.
Ohara, P. T. and Lieberman, A. R. (1985). The Thalamic Reticular Nucleus of the Adult Rat: experimental anatomical studies, J. Neurocyt., 14, 365-411.
Ohara, P. T. (1988). Synaptic Organisation of the Thalamic Reticular Nucleus, J. Elect. Mic. Tech., 10, 283-292.
Park, D., Steriade, M., Deschênes, M. and Oakson, G. (1987). Physiological Characteristics of Anterior Thalamic Nuclei, a group devoid of inputs from Reticular Thalamic Nucleus, J. Neurophysiol., 57, 1669-1685.
Posner, M. I. and Petersen, S. E. (1990). The Attention System of the Human Being, Ann. Rev. Neurosci., 13, 25-42.
Schiebel, A. B. (1980), in Reticular Formation Revisited, eds. J. A. Hobson and B. A. Brazier, Raven Press, New York.
Spreafico, F., De Curtis, M., Frassoni, C. and Avanzini, G. (1988). Electrophysiological Characteristics of Morphologically identified Reticular Thalamic Neurons from Rat Slices, Neurosci., 27, 629-638.
Spreafico, F., Battaglia, G. and Frassoni, C. (1991). The Reticular Thalamic Nucleus (RTN) of the Rat: Cytoarchitectural, Golgi, Immunocytochemical and Horseradish Peroxidase Study, J. Comp. Neuro., 304, 478-490.
Steriade, M., Domich, L. and Oakson, G. (1986). Reticularis Thalami Neurons Revisited: Activity Changes During Shifts in States of Vigilance, J. Neurosci., 6, 68-81.
Steriade, M., Curró Dossi, R. and Oakson, G. (1991). Fast Oscillations (20-40 Hz) in thalamocortical systems and their potentiation by Mesopontine Cholinergic Nuclei in the Cat, Proc. Nat. Acad. Sci., 88, 4396-4400.
Taylor, J. G. (1990). A Silicon Model of Vertebrate Retinal Processing, Neural Networks, 3, 171-178.
Taylor, J. G. (1993a). A Competition for Sensory Awareness? King's College London Preprint.
Taylor, J. G. (1993b). Goals, Drives and Consciousness. King's College London Preprint.
Treisman, A. and Gelade, G. (1980). Cog. Sci., 12, 99-136.
Wörgötter, F., Niebur, E. and Koch, C. (1991). Isotropic Connections Generate Functional Asymmetrical Behaviour in Visual Cortical Cells, J. Neurophysiol., (in press).
Yingling, C. D. and Skinner, J. E. (1977). Gating of Thalamic Inputs to Cerebral Cortex by Nucleus Reticularis Thalami, Prog. Clin. Neurophysiol., 1, 70-96.