Neural Networks and Pattern Recognition
Neural Networks and Pattern Recognition Edited by Omid Omidvar National Institute of Standards and Technology Judith Dayhoff University of Maryland
ACADEMIC PRESS San Diego London Boston New York Sydney Tokyo Toronto
This book is printed on acid-free paper.
Copyright © 1998 by Academic Press. All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission in writing from the publisher.
ACADEMIC PRESS, 525 B Street, Suite 1900, San Diego, CA 92101-4495, USA; 1300 Boylston Street, Chestnut Hill, MA 02167, USA. http://www.apnet.com
ACADEMIC PRESS LIMITED, 24-28 Oval Road, London NW1 7DX, UK. http://www.hbuk.co.uk/ap/
Library of Congress Cataloging-in-Publication Data: Omidvar, Omid. Neural networks and pattern recognition / Omid Omidvar, Judith Dayhoff. p. cm. Includes bibliographical references and index. ISBN 0-12-526420-8 (alk. paper) 1. Neural networks (Computer science) 2. Pattern recognition systems. I. Dayhoff, Judith E. II. Title. QA76.87.O45 1998 006.3'2 dc21 97-25466 CIP
Printed in the United States of America 97 98 99 00 01 IC 9 8 7 6 5 4 3 2 1
Contents

Preface
Contributors

1 Pulse-Coupled Neural Networks
  1 Introduction
  2 Basic Model
  3 Multiple Pulses
  4 Multiple Receptive Field Inputs
  5 Time Evolution of Two Cells
  6 Space to Time
  7 Linking Waves and Time Scales
  8 Groups
  9 Invariances
  10 Segmentation
  11 Adaptation
  12 Time to Space
  13 Implementations
  14 Integration into Systems
  15 Concluding Remarks
  16 References

2 A Neural Network Model for Optical Flow Computation
  1 Introduction
  2 Theoretical Background
  3 Discussion on the Reformulation
  4 Choosing Regularization Parameters
  5 A Recurrent Neural Network Model
  6 Experiments
  7 Comparison to Other Work
  8 Summary and Discussion
  9 References

3 Temporal Pattern Matching Using an Artificial Neural Network
  1 Introduction
  2 Solving Optimization Problems Using the Hopfield Network
  3 Dynamic Time Warping Using Hopfield Network
  4 Computer Simulation Results
  5 Conclusions
  6 References

4 Patterns of Dynamic Activity and Timing in Neural Network Processing
  1 Introduction
  2 Dynamic Networks
  3 Chaotic Attractors and Attractor Locking
  4 Developing Multiple Attractors
  5 Attractor Basins and Dynamic Binary Networks
  6 Time Delay Mechanisms and Attractor Training
  7 Timing of Action Potentials in Impulse Trains
  8 Discussion
  9 Acknowledgments
  10 References

5 A Macroscopic Model of Oscillation in Ensembles of Inhibitory and Excitatory Neurons
  1 Introduction
  2 A Macroscopic Model for Cell Assemblies
  3 Interactions between Two Neural Groups
  4 Stability of Equilibrium States
  5 Oscillation Frequency Estimation
  6 Experimental Validation
  7 Conclusion
  8 Appendix
  9 References

6 Finite State Machines and Recurrent Neural Networks: Automata and Dynamical Systems Approaches
  1 Introduction
  2 State Machines
  3 Dynamical Systems
  4 Recurrent Neural Networks
  5 RNN as a State Machine
  6 RNN as a Collection of Dynamical Systems
  7 RNN with Two State Neurons
  8 Experiments: Learning Loops of FSM
  9 Discussion
  10 References

7 Biased Random-Walk Learning: A Neurobiological Correlate to Trial-and-Error
  1 Introduction
  2 Hebb's Rule
  3 Theoretical Learning Rules
  4 Biological Evidence
  5 Conclusions
  6 Acknowledgments
  7 References

8 Using SONNET 1 to Segment Continuous Sequences of Items
  1 Introduction
  2 Learning Isolated and Embedded Spatial Patterns
  3 Storing Items with Decreasing Activity
  4 The LTM Invariance Principle
  5 Using Rehearsal to Process Arbitrarily Long Lists
  6 Implementing the LTM Invariance Principle
  7 Resetting Items Once They Can Be Classified
  8 Properties of a Classifying System
  9 Simulations
  10 Discussion
  11 References

9 On the Use of High-Level Petri Nets in the Modeling of Biological Neural Networks
  1 Introduction
  2 Fundamentals of PNs
  3 Modeling of Biological Neural Systems with High-Level PNs
  4 New/Modified Elements Added to HPNs to Model BNNs
  5 Example of a BNN: The Olfactory Bulb
  6 Conclusions
  7 References

10 Locally Recurrent Networks: The Gamma Operator, Properties, and Extensions
  1 Introduction
  2 Linear Finite Dimensional Memory Structures
  3 The Gamma Neural Network
  4 Applications of the Gamma Memory
  5 Interpretations of the Gamma Memory
  6 Laguerre and Gamma II Memories
  7 Analog VLSI Implementations of the Gamma Filter
  8 Conclusions
  9 References
Preface

This volume represents a turning point in neural network advancements. The first neural networks posed, such as the multilayer perceptron, were static networks that classified static patterns (fixed vectors) and resulted in a network output that was yet another static pattern, another fixed-valued vector. Neither pattern changed with time. Today the field of neural networks is advancing beyond these static neural networks, to more advanced concepts that incorporate time-dynamics in their inputs, outputs, and internal processing. Neural networks now can accept, as input, time-varying signals, even multichannel signals that correspond to a vector or image that changes over time, and often provide classification of data that varies over time. Some networks produce results that are time-dynamic, including oscillations and temporal patterns, and sometimes self-sustained activity can be a signature unique to the network's structure or to the patterns that stimulate the network.

What are the elements and architectures that make it possible to advance from static architectures to dynamic computation? What approaches provide increased capabilities for neural networks? These questions are answered, in part, by this volume.

Pulse-coupled neural networks incorporate processing elements, neurons, that communicate by sending pulses to one another. Pulse-coupled neural networks can represent spatial information in the time structure of their output pulse trains and can segment an image into multi-neuron time-synchronous groups. Johnson, Ranganath, Kuntimad, and Caulfield, in Chapter 1, illustrate these capabilities and show the architectural structure of the pulse-coupled networks.

Motion perception is an essential capability for advanced organisms, yet the ability to detect motions and image flow computationally is a difficult problem.
In Chapter 2, Li and Wang propose a recurrent neural network model that can be operated asynchronously in parallel to achieve a real-time solution.

In Chapter 3, temporal pattern matching is performed when dynamic time warping is combined with a Hopfield network. Unal and Tepedelenlioglu show how a dynamic programming algorithm that compares an input test signal with a reference template signal, reducing the nonlinear time misalignments between the two patterns, can be implemented with a neural network approach to achieve an optimum match between two patterns.

Dynamic attractors in neural networks with prolonged, self-sustained
activity are the subject of Chapter 4. Different attractors can be evoked by different network structures and different stimulus patterns, with a wide range of flexibility. Dynamic attractors can also be trained into a network. Authors Dayhoff, Palmadesso, Richards, and Lin demonstrate potential enhancements in computational paradigms for which dynamic networks show promise.

A macroscopic model of oscillations in ensembles of neurons that characterizes very large networks of neurons is presented in Chapter 5. In this chapter, Ghosh, Chang, and Liano study the interaction between two neuron groups and show how to predict the presence of oscillations and their frequencies.

The relationship between automata and recurrent neural networks is developed in Chapter 6 by Tino, Horne, Giles, and Collingwood. Recurrent neural networks can be trained to mimic finite state machines, and mathematical relationships that demonstrate their ability to act as automata can be proven. The enormous potential, then, of appropriately trained recurrent networks becomes apparent.

In Chapter 7, Anderson shows a putative neurobiological model that correlates with trial-and-error learning. He demonstrates the plausibility for synaptic weights to be trained during random fluctuations in their strengths and concomitant changes in the synapses. He argues for the biological plausibility of such a model.

Segmentation of continuous sequences is addressed in Chapter 8, with the SONNET 1 network, which incorporates temporal decay on the input activation values. These networks learn to segment temporal patterns as the patterns are presented (e.g., as temporal signals) and learn to segment the patterns with no a priori knowledge of when a pattern begins or ends. In this case, the network performs a transformation of temporal events into spatial patterns of activity.
Models of living neural systems are related to models developed for complex engineering operations in Chapter 9, where Venkatesh, Pandya, and Hsu show how to extend the concepts of Petri nets to encompass high-level structures found in biological neurons and in biological neural systems. The result is a new class of high-level Petri nets (HPNs). Chapter 10 attests to the high potential of locally recurrent networks for processing time-varying signals. In this chapter, Principe, Celebi, DeVries, and Harris review the gamma neural network structure and show variations such as the Laguerre and Gamma II memory networks. The functionality of these networks is identified, and their structure is described as a class of neural topologies that are intermediate between purely feedforward static networks and globally recurrent networks. The gamma operators are capable of adapting the time scale of the memory to best match the properties of the data. Altogether, this volume incorporates landmark results on how neural
network models have evolved from simple feedforward systems with no temporally dynamic activity into advanced neural architectures with self-sustained activity patterns, simple and complicated oscillations, specialized time elements, and new capabilities for analysis and processing of time-varying signals. The enormous potential of these advanced architectures becomes apparent through the compendium of applications that appear here, including speech recognition, pattern classification, image analysis, temporal pattern matching, and the modeling of neurobiological systems.

Judith Dayhoff
Omid Omidvar
Contributors

• Russell W. Anderson, Smith-Kettlewell Eye Research Institute, 2232 Webster Street, San Francisco, CA 94115, and Biomedical Engineering, University of Northern California, Petaluma, CA
• H. J. Caulfield, Alabama A&M University, Department of Physics, Normal, AL 35762
• Samel Celebi, Lucent Technologies-Bell Labs Innovations, Middletown, NJ 07748
• Hung-Jen Chang, Department of Molecular and Cell Biology, University of California at Berkeley, Berkeley, CA 94720
• Pete C. Collingwood, School of Computing & Management Sciences, Sheffield Hallam University, Hallam Business Park, 100 Napier St., Sheffield, S11 8HD, United Kingdom
• Judith E. Dayhoff, Institute for System Research, University of Maryland, College Park, MD 20742
• Joydeep Ghosh, Department of Electrical and Computer Engineering, Engineering Sciences Building, The University of Texas at Austin, Austin, TX 78712-1084
• C. Lee Giles, NEC Research Institute, 4 Independence Way, Princeton, NJ 08540; Institute for Advanced Computer Studies, University of Maryland, College Park, MD 20742
• John G. Harris, Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL 32611
• Bill G. Horne, NEC Research Institute, 4 Independence Way, Princeton, NJ 08540
• Sam Hsu, Department of Computer Science and Engineering, Florida Atlantic University, Boca Raton, FL 33431
• J. L. Johnson, U. S. Army Missile Command, Weapons Sciences Directorate, AMSMI-RD-WS-PL, Redstone Arsenal, AL 35898-5248
• Govinda Kuntimad, Boeing North America, Rocketdyne Division, Huntsville, AL 35806, USA
• Hua Li, Computer Engineering Department, College of Engineering, San Jose State University, San Jose, California 95192
• Kadir Liano, Pavilion Technologies, Austin, Texas
• Daw-Tung Lin, Computer Science Department, Chung Hua Polytechnic University, Hsin-Chu, 30 Tung-Shiang, Taiwan
• Albert Nigrin, 6942 Clearwind Ct., Baltimore, MD 21209
• Peter J. Palmadesso, Plasma Physics Division, Naval Research Laboratory, Washington, D.C. 20375
• Abhijit Pandya, Department of Computer Science and Engineering, Florida Atlantic University, Boca Raton, FL 33431
• Jose C. Principe, Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL 32611
• H. Ranganath, University of Alabama in Huntsville, Computer Sciences Department, Huntsville, AL
• Fred Richards, Entropic Research Laboratory, Inc., 600 Pennsylvania Ave. S.E., Suite 202, Washington, D.C. 20003
• Nazif Tepedelenlioglu, Department of Electrical and Computer Engineering, Florida Institute of Technology, 150 W. University Blvd., Melbourne, FL 32901
• Peter Tino, Dept. of Computer Science and Engineering, Slovak Technical University, Ilkovicova 3, 812 19 Bratislava, Slovakia; NEC Research Institute, 4 Independence Way, Princeton, NJ 08540
• Fatih A. Unal, National Semiconductor, National Semiconductor Drive, Mail Stop C1-495, Santa Clara, CA 95052
• Kurapati Venkatesh, Center for Manufacturing Systems, Department of Mechanical Engineering, New Jersey Institute of Technology, Newark, NJ 07104
• Bert De Vries, David Sarnoff Research Center, CN5300, Princeton, NJ 08543-5300
• Jun Wang, Industrial Technology Department, The University of North Dakota, Grand Forks, ND 58202
Chapter 1

Pulse-Coupled Neural Networks

J. L. Johnson
H. Ranganath
G. Kuntimad
H. J. Caulfield

ABSTRACT A pulse-coupled neural network using the Eckhorn linking field coupling [1] is shown to contain invariant spatial information in the phase structure of the output pulse trains. The time domain signals are directly related to the intensity histogram of an input spatial distribution and have complex phase factors that specify the spatial location of the histogram elements. Two time scales are identified. On the fast time scale the linking produces dynamic, quasi-periodic, fringe-like traveling waves [2] that can carry information beyond the physical limits of the receptive fields. These waves contain the morphological connectivity structure of image elements. The slow time scale is set by the pulse generator, and on that scale the image is segmented into multineuron time-synchronous groups. These groups act as giant neurons, firing together, and by the same linking field mechanism as for the linking waves can form quasi-periodic pulse structures whose relative phases encode the location of the groups with respect to one another. These time signals are a unique, object-specific, and roughly invariant time signature for their corresponding input spatial image or distribution [3]. The details of the model are discussed, giving the basic Eckhorn linking field, extensions, generation of time series in the limit of very weak linking, invariances from the symmetries of the receptive fields, time scales, waves, and signatures. Multirule logical systems are shown to exist on single neurons. Adaptation is discussed. The pulse-coupled nets are compatible with standard nonpulsed adaptive nets rather than competitive with them in the sense that any learning law can be used. Their temporal nature results in adaptive associations in time as well as over space, and they are similar to the time-sequence learning models of Reiss and Taylor [4].
Hardware implementations, optical and electronic, are reviewed. Segmentation, object identification, and location methods are discussed and current results given. The conjugate basic problem of transforming a time signal into a spatial distribution, comparable in importance to the transformation of a spatial distribution into a time signal, is discussed. It maps the invariant time signature into a phase versus frequency spatial distribution and is the spatial representation of the complex histogram. A method of generating this map is discussed. Image pattern recognition using this network is shown to have the power of syntactical pattern recognition and the simplicity of statistical pattern recognition.
1 Introduction
The linking field model of Eckhorn et al. [1] was proposed as a minimal model to explain the experimentally observed synchronous feature-dependent activity of neural assemblies over large cortical distances in the cat cortex [5]. It is a cortical model. It emphasizes synchronizations of oscillatory spindles that occur in the limit of strong linking fields and distinguishes two major types: (1) forced, or stimulus-locked, synchronous activity and (2) induced synchronous activity. Forced activity is produced by abrupt temporal changes such as movement. Induced activity occurs when the pulse train structures of the outputs of groups of cells are similar [6]. The model is called "linking field" because it uses a secondary receptive field's input to modulate a primary receptive field's input by multiplication in order to obtain the necessary coupling that links the pulse activity into synchronicity.

This paper is concerned with the behavior of the linking field model in the limit of weak-to-moderate linking strengths [2], [7]. Strong linking is characterized by synchronous bursts of pulses. When the linking strength is reduced, the neurons no longer fire in bursts but still have a high degree of phase and frequency locking. This is the regime of moderate linking strength. Further reduction continuously lowers the degree of linking to a situation where locking can occur only for small phase and frequency differences. This is the weak linking regime.

A major result of this research is the finding that in the weak linking regime it is possible to encode spatial input distributions into corresponding temporal patterns with enough structure to have object-specific time series for each input pattern. The pulse phase patterns in the time series are often found to be periodic. In both simulations and in an optical hybrid laboratory demonstration system, periodicity is observed to be the rule rather than the exception.
The time series can be made insensitive to translation, rotation, and scale changes of the input image distribution by an appropriate choice of the structure of the receptive field weight patterns. Substantial insensitivity to scene illumination and image distortion has also been observed in simulations.
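[Figure 1: block diagram of the model neuron. The dendritic tree receives linking inputs and feeding inputs from other neurons; the linking stage forms the factor $1 + \beta_j L_j$; the pulse generator contains the threshold and a step function and sends the output to other neurons.]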
FIGURE 1. The model neuron. The model neuron has three parts: The dendritic tree, the linking, and the pulse generator. The dendritic tree is subdivided into two channels, linking and feeding. All synapses are leaky integrator connections. The inputs are pulses from other neurons and the output is a pulse. The linking input modulates the feeding input. When a pulse occurs in the linking input it briefly raises the total internal activity Uj and can cause the model neuron to fire at that time, thus synchronizing it with the neuron transmitting the linking pulse. (Reprinted with permission from [1]).
2 Basic Model
This section reviews the basic model as discussed in Eckhorn et al. [1], [5], [6], [8], [9], and [10]. The model neuron is a neuromime [11], modified with two receptive fields per neuron and a linking mechanism added. It is shown in Figure 1. There are three parts to the model neuron: the dendritic tree, the linking modulation, and the pulse generator. Each part will be described separately, and then the operation of the complete model will be discussed.
2.1 The Dendritic Tree
The dendritic tree is divided into two principal branches in order to make two distinct inputs to the linking part of the $j$th neuron. They are the primary input, termed the feeding input $F_j$, and the secondary input, which is the linking input $L_j$. These are given in equations (2) and (1), respectively, for the case of continuous time. For discrete time steps, the digital filter
model is used, as given in the appendix of Eckhorn et al. [1]. (The simulations reported here used the discrete model. The equations are given in Section 9.) Each input is a weighted sum from the synaptic connections on its dendritic branch. The synapses themselves are modeled as leaky integrators. An electrical version of a leaky integrator is a capacitor and a resistor in parallel, charged by a brief voltage pulse and decaying exponentially. Likewise, when a synapse receives a pulse, it is charged, and its output amplitude rises steeply. The amount of rise depends on the amplitude gain factor assigned to the synapse. It then decays exponentially according to its time constant. These postsynaptic signals are summed to form the total signal out of that branch of the dendritic tree, as indicated in Figure 1. The amplitude gain factors and the decay time constants of the synapses characterize the signals. The synapses in the feeding branch are assumed [1] to have smaller time constants than those of the linking branch. (Here a time constant $\alpha$ is the decay rate in $e^{-\alpha t}$, so a smaller value means a slower decay.) This assumption lets the feeding signal have a long decay tail on which the spike-like linking input can operate through the linking modulation process. The linking and feeding inputs are given by
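$$L_j = \sum_k L_{kj} = \sum_k \left(W_{kj}\, e^{-\alpha^L_{kj} t}\right) * Y_k(t), \tag{1}$$

$$F_j = \sum_k F_{kj} = \sum_k \left(M_{kj}\, e^{-\alpha^F_{kj} t}\right) * Y_k(t) + I_j, \tag{2}$$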
where $W_{kj}$ and $M_{kj}$ are the synaptic gain strengths, or weights, for the $k$th synapse of the linking and feeding receptive fields, respectively, to the $j$th neuron. $Y_k(t)$ is the input pulse, or pulse train, from the $k$th neuron; $\alpha^L_{kj}$ and $\alpha^F_{kj}$ are the time constants; and $f_1 * f_2$ denotes the convolution integral operation for any two functions $f_1$ and $f_2$. Note that both the feeding and linking fields can receive inputs from the $k$th neuron. $I_j$ is an analog feeding input to the $j$th neuron. It is shown here as a distinct single term but in general can be a weighted sum like the pulsed inputs. If the inputs $Y_k(t)$ are allowed to be arbitrary functions of time, then $I_j$ can be included in the weighted sum over the $F$'s as a step function in time $Step(t - t_0)$. Each neuron thus has two receptive fields, linking and feeding. Both fields are dendritic tree structures and can overlay the same areas around the neuron. However, their weighted sums enter the neuron via distinct channels and are combined internally by the linking, as discussed below.
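The leaky-integrator synapses and the two weighted sums can be sketched in discrete time. This is a minimal illustration, not the authors' simulation code: it assumes a simple recursive filter in which the state is multiplied by $e^{-\alpha\,\Delta t}$ each step and incremented by the gain whenever a pulse arrives, and it uses one shared time constant per branch. The function and parameter names (`leaky_integrator`, `dendritic_inputs`, `link_w`, `feed_w`) are hypothetical.

```python
import math


def leaky_integrator(pulses, gain, alpha, dt=1.0):
    """Discrete stand-in for (gain * exp(-alpha * t)) * Y(t), Y a 0/1 pulse train.

    Assumption: each step the state decays by exp(-alpha * dt), then an
    arriving pulse charges it by `gain`.
    """
    decay = math.exp(-alpha * dt)
    state = 0.0
    trace = []
    for y in pulses:
        state = state * decay + gain * y
        trace.append(state)
    return trace


def dendritic_inputs(pulse_trains, link_w, feed_w, alpha_link, alpha_feed,
                     analog_input=0.0, dt=1.0):
    """Return (L, F): linking and feeding inputs of one neuron over time.

    pulse_trains[k] is the 0/1 output of presynaptic neuron k; link_w[k] and
    feed_w[k] play the roles of W_kj and M_kj; analog_input plays the role of I_j.
    """
    n_steps = len(pulse_trains[0])
    linking = [0.0] * n_steps
    feeding = [analog_input] * n_steps
    for k, train in enumerate(pulse_trains):
        l_k = leaky_integrator(train, link_w[k], alpha_link, dt)
        f_k = leaky_integrator(train, feed_w[k], alpha_feed, dt)
        for t in range(n_steps):
            linking[t] += l_k[t]
            feeding[t] += f_k[t]
    return linking, feeding
```

Choosing `alpha_feed` smaller than `alpha_link` reproduces the qualitative picture in the text: the feeding trace keeps a long tail while the linking trace is a brief spike.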
2.2 The Linking
The linking modulation (see Figure 1) is obtained by adding a constant positive bias to the linking input and multiplying that by the feeding input. The bias is taken to be unity. This bias has many uses as we will see, and
one of them is obvious. The linking input cannot drive the internal activity to zero. The total internal activity $U_j$ of the neuron is

$$U_j = F_j (1 + \beta_j L_j), \tag{3}$$
where $\beta_j$ is the linking strength. For convenience, it is broken out separately here, but strictly speaking, it could be incorporated in the synaptic weights. $U_j$ is a function of time. Under the above assumption that the feeding input has a smaller time constant than that of the linking input, the general behavior of $U_j$ is that the linking inputs appear as spike-like modulations riding on a quasi-constant carrier formed by the feeding input. The internal activity $U_j$ thus is briefly raised above the feeding input level whenever a linking input occurs (Figure 2), and it can then trigger the neuron to fire. This effect is responsible for the synchronous activity found in the network as a whole. Equation (3) also establishes a correspondence between the linking field model and higher-order networks. If equations (1) and (2) are inserted into equation (3), there will be product terms of the form $M_{kj} W_{ij} Y_k Y_i$ within a double sum. This is a second-order network [12]. This implies that if a pulse output model rather than an average firing rate output model is used in higher-order nets, time-synchronous behavior should be observed. The work on adaptive higher-order nets [13] may be applicable to adaptation in pulse-coupled nets as well.
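The correspondence with second-order networks can be checked numerically. The sketch below is illustrative only: it takes the already-filtered per-synapse feeding and linking signals at a single instant as plain numbers (hypothetical inputs `feeding_terms` and `linking_terms`) and compares the product form of the internal activity with its expansion into first-order terms plus a double sum of pairwise products.

```python
def internal_activity(feeding_terms, linking_terms, beta):
    """Product form: U = F * (1 + beta * L), with F and L the branch sums."""
    F = sum(feeding_terms)
    L = sum(linking_terms)
    return F * (1.0 + beta * L)


def internal_activity_expanded(feeding_terms, linking_terms, beta):
    """Expanded form: first-order feeding terms plus beta times a double sum
    over all feeding/linking pairs, i.e. the second-order product terms."""
    first_order = sum(feeding_terms)
    second_order = sum(f * l for f in feeding_terms for l in linking_terms)
    return first_order + beta * second_order
```

Because the linking terms are non-negative and the bias is unity, the product form never falls below the feeding sum, which is the "cannot drive the internal activity to zero" property noted above.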
2.3 The Pulse Generator
The pulse generator uses a threshold discriminator followed by a pulse former, and a variable threshold that is dependent upon the prior pulse output of the generator itself. When the neuron emits a pulse, a portion of it feeds back to the threshold, which is yet another leaky integrator, as shown in Figure 1. One or more output pulses recharge the threshold to a high level. This quickly raises it above the current value of the internal activity $U_j$, which in turn causes the threshold discriminator to turn off the pulse former, and the neuron stops emitting pulses. The recharged threshold then decays exponentially according to its time constant and amplitude gain factor until it drops below the internal activity again, triggering a new output pulse or pulse burst from the neuron (Figure 2). This is the pulse generator model illustrated in Eckhorn et al. [1], [8], [9], and [10], and given analytically in the appendix of [3]. One important result of the model is that under constant stimulation, the pulse former produces a train of uniformly spaced pulses. The spacing represents the refractory period $\tau_r$ of the neuron within which time a new pulse cannot occur. This will give an upper saturation limit to the maximum output pulse frequency. The pulse generator is modeled by a leaky integrator threshold $\theta_j$ (equation (4)), a threshold discriminator in the form of a sigmoidal envelope (equation (5)),
and a pulse former (equations (6) and (8)):

$$\theta_j = \left(V_T\, e^{-\alpha_T t}\right) * Y_j(t) + \theta_0, \tag{4}$$

$$Y_j(t) = \left(Sig(U_j(t) - \theta_j(t))\, P(t)\right) * e^{-\alpha_P t}, \tag{5}$$

$$P(t) = \sum_n pulse(t - n\tau_r), \tag{6}$$

where $Sig(z)$ is a hyperbolic tangent sigmoidal envelope for the pulse train $P(t)$ out of the pulse former. The sigmoid function and the pulse function $pulse(t - n\tau_r)$ are

$$Sig(z) = \frac{1}{1 + e^{-\lambda z}}, \tag{7}$$

$$pulse(t - n\tau_r) = K \int_{-\infty}^{t} \left(\delta(t' - n\tau_r) - \delta(t' - n\tau_r - \tau_w)\right) dt'. \tag{8}$$

Equation (8) defines a square pulse of height $K$ whose leading and trailing edges are formed by two delta functions separated by width $\tau_w$. It has a constant area of $K$. $V_T$ and $\alpha_T$ are the amplitude gain and the time constant of the leaky integrator threshold, and $\theta_0$ is a threshold offset. $\lambda$ is the scale of the sigmoid argument, and $\alpha_P$ is the time constant for the convolution of equation (5). The number $n$ refers to the pulse number. In order to have a good dynamic range of pulse periods it is desirable to require

$$\alpha_T \tau_r < 1. \tag{9}$$

[Figure 2: timing diagram showing the threshold $\theta_j$, the pulse period $\tau_j$, the pulse capture zone, and the output pulse $Y_j$.]
FIGURE 2. Pulse generation and linking. The threshold is recharged when it decays below the internal activity $U_j = F_j(1 + \beta_j L_j)$. The output pulse is formed as the threshold turns the step function of equation (5) on and then off as the threshold goes below $U_j$, starts recharging, and then rises above $U_j$. If a linking pulse occurs in the capture zone time, it causes the threshold to recharge sooner than otherwise, and the neuron fires a pulse synchronized with the arrival of the linking pulse. (Reprinted with permission from [3]).
The system of equations (4)-(8) explicitly shows the causality in the pulse generator and that the pulses are finite. Now idealize it. First let $\tau_w$ go to zero. This makes $P(t)$ into a train of delta function pulses. Perform the convolution of equation (5) and take the limit of both $\alpha_P$ and $K$ going to infinity, in such a way that their ratio is constant, to obtain yet another delta function limit, and finally, take $\lambda$ approaching infinity to get a single equation for $Y_j$ that replaces equations (5)-(8):

$$Y_j(t) = \sum_n \delta(t - n\tau_r)\, Step(U_j(t) - \theta_j(t)). \tag{10}$$

$Step(\cdot)$ is defined as 1 when its argument is positive, and 0 otherwise. Equations (10) and (4) are the idealized pulse generator. Its input has a lower limit of $\theta_0$. An upper limit can be established by asking for the largest value of the input that will just barely recharge the threshold back to that level in a decay time $\tau_r$, the minimum pulse period. Equation (4) gives $U e^{-\alpha_T \tau_r} + V_T > U$, from which

$$U < \frac{V_T}{1 - e^{-\alpha_T \tau_r}} = U_{max} \approx \frac{V_T}{\alpha_T \tau_r} \tag{11}$$

under the dynamic range requirement of equation (9). Figure 3 summarizes the properties of the pulse generator. There, the digital filter form (equation (29)) of the time convolutions was used.
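The idealized generator (a step-function discriminator plus a leaky-integrator threshold) is easy to simulate in discrete time. The sketch below is an illustration under stated assumptions rather than the chapter's digital filter form: the threshold's decaying part is multiplied by $e^{-\alpha_T \Delta t}$ each step, a pulse adds $V_T$ to it, and one time step stands in for the minimum spacing between pulses. The name `simulate_pulse_generator` and its parameters are hypothetical.

```python
import math


def simulate_pulse_generator(U, n_steps, v_t, alpha_t, theta_0=0.0, dt=1.0):
    """Return the time steps at which a neuron with constant internal
    activity U emits a pulse.

    Assumed discrete model: threshold = charge + theta_0, where `charge`
    decays by exp(-alpha_t * dt) per step and is recharged by v_t on a pulse.
    """
    decay = math.exp(-alpha_t * dt)
    charge = 0.0          # decaying part of the threshold
    fire_times = []
    for n in range(n_steps):
        charge *= decay
        theta = charge + theta_0
        if U > theta:     # step-function discriminator
            fire_times.append(n)
            charge += v_t  # the output pulse recharges the threshold
    return fire_times
```

With `U = 1.0`, `v_t = 5.0`, and `alpha_t = 0.1`, the pulse train settles to one pulse every 18 steps, i.e. uniformly spaced pulses under constant stimulation; a larger `U` shortens the spacing, and an input at or below `theta_0` produces no pulses.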
[Figure 3: block diagram of the pulse generator, showing the threshold with its decay loop, the sigmoid envelope, and the pulse former in a second decay loop. The low-pass filter decay loops correspond to the time constants in the convolution integrals.]
FIGURE 3. The pulse generator. The internal activity $U$ feeds a sigmoidal envelope. When $U > \theta$ the envelope becomes high, allowing the pulse former to make an output of uniformly spaced pulses. These are the cell's output. The envelope and pulse former are in a decay loop with a large time constant. This loop ensures causality, i.e., it gives a small time delay between the pulse output and the recharging of the threshold (upper feedback loop). The threshold is another leaky integrator, recharged by the pulse output. An idealization (see text) reduces the sigmoidal envelope to a step function and makes the pulse former's output into a train of delta function pulses. (Reprinted with permission from [26]).

2.4 Pulse Periods

The firing rate of a single neuron is a sigmoidal function of the feeding input. This is shown by obtaining the pulse period $\tau_j$ of a neuron. It is the time required for the threshold to decay from its recharged initial height down to the internal activity level (Figure 2). Consider equation (4) when the threshold is recharged with a single pulse by an amount $V_T$. For a constant feeding input $F$ and a zero linking input, the decay time back down to $F$ is

$$\tau_j = \frac{1}{\alpha_T} \ln\left(1 + \frac{V_T}{F - \theta_0}\right). \tag{12}$$

The refractory period is added to the decay time to obtain the total pulse period. The pulse firing rate $f_j$ is then

$$f_j = \frac{1}{\tau_j + \tau_r}. \tag{13}$$
As shown in Figure 4, the firing rate is a sigmoid function [14]: it increases more slowly than linear up to $\theta_0$, then rises quickly (this is the center of the "S" shape), and finally goes to saturation. Its monotonically increasing behavior shows that the original input feeding distribution can be approximately recovered at any time by taking an average over many pulse periods, because the pulse frequency is faster for stronger (more intense) feeding inputs. The sigmoidal nonlinearity will cut off values below $\theta_0$ and act as a squashing function near saturation, so the overall function is a sigmoidal mapping of the internal activity to the output when pulse-averaging is done.

[Figure 4: plot of pulse frequency against internal activity $U_j$.]
FIGURE 4. Pulse frequency $f_j$ as a function of the internal activity $U_j$. The pulse frequency is a sigmoidal function of the internal activity. Addition of a refractory time period $\tau_r$ makes the frequency saturate at the refractory frequency. A bias offset $\theta_0$ will shift the curve's origin to that bias point. (Reprinted with permission from [3]).

When linking pulses are present, their strength defines a capture zone in the neuron receiving the linking pulse. From Figure 2, the capture zone time interval is

$$T_c = \frac{1}{\alpha_T} \ln\left(1 + \frac{F}{F - \theta_0}\, \beta L\right), \tag{14}$$

where $\beta$ is the linking strength. If a linking pulse is received in this interval, it will briefly raise the internal activity level and cause the receiving neuron to fire at the arrival time of the linking pulse (Figure 2). The receiving neuron will frequency lock to the transmitting neuron if their pulse periods $\tau_1$ and $\tau_2$ are similar. If the neurons have the same frequency ($\tau_1 = \tau_2$), they will phase lock when their phase difference $\phi$ is within the capture zone time period:

$$\text{Frequency lock: } |\tau_2 - \tau_1| < T_c, \qquad \text{Phase lock: } |\phi| < \alpha_T T_c. \tag{15}$$
T h e r e is also a forbidden zone immediately after each linking pulse. For a^ much greater t h a n a ^ , t h e length of t h e forbidden zone is equal t o t h a t of t h e c a p t u r e zone (see Figure 2). This completes t h e description of t h e basic model neuron. T h e threshold time constants used by Eckhorn are intermediate in value between t h e linking a n d feeding time constants. T h e pulse-coupled linking field model
contains synaptic weights but does not require any learning law. On the other hand, any learning law can be used. The frequency function of equation (12) gives the desired nonlinear response in the limit of averaging over many pulses, so this model reduces to the usual nonpulsed networks in that limit. It has the weighted interconnects, the internal sums, and the sigmoid nonlinearity. The simple pulse generator used in simulations by Eckhorn and others [1] corresponds to a two-cell oscillator [15], [16] where the threshold acts as an inhibitor cell with a slow response and the step function as an excitatory cell with a fast response. The three parts of the model (the dendritic tree, the linking, and the pulse generator) act together to weight and sum inputs in the receptive fields, modulate one input channel with a second input channel, and form the output pulses, which in turn are received by other neurons through their receptive fields. In the remainder of this paper the same threshold time constant $\alpha_T$ will be used for all neurons, the same linking time constant $\alpha_L$ used for all linking fields, the same feeding time constant $\alpha_F$ used for all feeding fields, and the same linking strength $\beta$ used for all neurons unless otherwise stated. The subscript j will be suppressed except where necessary.
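The capture zone and the two locking conditions of the basic model can be sketched as small helper functions. This is an illustration under stated assumptions rather than the authors' implementation: the threshold offset is taken to be zero, so that the capture zone reduces to $\ln(1 + \beta L)/\alpha_T$, and the phase-lock bound is read as $\alpha_T\tau_c$. All function names and numbers are hypothetical.

```python
import math

def capture_zone(beta, L, alpha_T):
    """Capture zone duration, assuming zero threshold offset."""
    return math.log(1.0 + beta * L) / alpha_T

def frequency_locks(tau_1, tau_2, tau_c):
    """Frequency lock: the two pulse periods differ by less than tau_c."""
    return abs(tau_2 - tau_1) < tau_c

def phase_locks(phi, alpha_T, tau_c):
    """Phase lock for equal periods (bound read as alpha_T * tau_c)."""
    return abs(phi) < alpha_T * tau_c

# Hypothetical values: unit linking pulse, linking strength 0.2.
tau_c = capture_zone(beta=0.2, L=1.0, alpha_T=1.0)
print(tau_c)
print(frequency_locks(1.0, 1.1, tau_c), frequency_locks(1.0, 1.3, tau_c))
```

A stronger linking pulse or a larger linking strength widens the capture zone, so neurons with more dissimilar periods can still lock.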
3
Multiple Pulses
Suppose that at time zero a cell receives linking pulses from N other cells, all arriving at the same time, and that a single firing is inadequate to raise its threshold above the composite linking pulse. It will continue to fire until it exceeds the linking pulse height, as shown in Figure 5(a). Let M be the number of pulses required. For simplicity take $\theta = F$ at $t = 0$. Then from equations (1), (3), and (4),

$$ F + V_T\int_0^{(M-1)\tau_r} \sum_{m=0}^{M-1} \delta(t - m\tau_r)\, e^{-\alpha_T[(M-1)\tau_r - t]}\,dt > F\left(1 + N\beta e^{-\alpha_L\tau_r(M-1)}\right), $$

where $M - 1$ has been used because the time interval for the cell to fire M times is $(M - 1)\tau_r$. The left-hand side yields a finite sum of exponential decays. Expressing this in closed form leads to the result that

$$ V_T\left(\frac{1 - e^{-\alpha_T\tau_r M}}{1 - e^{-\alpha_T\tau_r}}\right) > \beta F N e^{-\alpha_L\tau_r(M-1)}. \tag{16} $$

This gives M in terms of N. If M is small enough so that all the exponentials can be expanded (see condition of equation (9)), then M is approximately given by
$$ M \approx \frac{\beta F N}{V_T + \alpha_L\tau_r\beta F N}. $$

But it is not that simple. Suppose that the N pulses came from the same group containing the cell and they all had the same feeding input F. Then every cell in the group must send M pulses to the others. The situation, shown in Figure 5(b), is that each cell receives N pulses at a time, N being the number of cells in the group, for M times, with a separation of $\tau_r$ between times. The cells must pulse their way over a much larger linking pulse than in the previous case. Let M' be the number of pulses required. The linking pulse is now, at $t = (M' - 1)\tau_r$,

$$ N\sum_{m=0}^{M'-1} e^{-\alpha_L\tau_r m} = N\,\frac{1 - e^{-\alpha_L\tau_r M'}}{1 - e^{-\alpha_L\tau_r}}. $$

Applying the condition that the threshold must be greater than this gives, after some rearrangement,

$$ \frac{1 - e^{-\alpha_T\tau_r M'}}{1 - e^{-\alpha_L\tau_r M'}} > \frac{\beta F N}{V_T}\left(\frac{1 - e^{-\alpha_T\tau_r}}{1 - e^{-\alpha_L\tau_r}}\right). \tag{17} $$

Unfortunately, since as shown in Figure 5(b) this condition depends on the gradual saturation of the envelope of the linking pulses, a first-order expansion may be inappropriate for the left-hand side. An asymptotic approximation comes from noting that the left-hand side is of order unity if $M' \gg 1$. This gives a rough upper limit of

$$ 1 > \left(\frac{\alpha_T\beta F}{\alpha_L V_T}\right)N. $$

This is similar to equation (11) when equation (3) is used in it to explicitly include F and $\beta$. The limit of equation (17) is above that of equation (11), which was the pulse saturation limit. This shows that the model can handle all multiple pulses under the pulse saturation limit. A somewhat better approximation is to assume that $\alpha_T\tau_r M'$ is small. This allows the expansion to first order of the numerator on the left hand side of equation (17):

$$ \alpha_L\tau_r M' > \frac{\beta F N}{V_T}\left(1 - e^{-\alpha_L\tau_r M'}\right), $$

which is of the form $x > a(1 - e^{-x})$, where $x = \alpha_L\tau_r M'$ and $a = \beta F N/V_T$. Finally, the value of N can be related to the receptive field kernel (equation (2)) as

$$ N \approx N_{RF} = \iint W_L(\vec{r} - \vec{r}\,')\,Y(\vec{r}\,', t)\,d^2r', $$

which, with equation (11) or (17), shows that the integral of the receptive field kernel W needs to be finite if the slab is not bounded.

(a) A cell receives a composite linking pulse from an external group and fires M times for the threshold to exceed the internal activity U.

(b) A cell receives a composite linking pulse from its own group. It fires M' times, as do all the other cells in the group, causing more linking pulses. The linking pulse envelope saturates, allowing the threshold to finally exceed the internal activity U.

FIGURE 5. Multiple pulses. Two cases are shown. In 5(a), a cell receives N linking pulses simultaneously, as would occur when the cell is not part of the group of N cells making the pulses. It must fire M times to overcome the composite linking pulse. In 5(b), the cell is a member of a group of N + 1 cells. Since every member must fire multiple pulses, each fires M' times, and each firing generates an additional linking pulse of size N, which the cell must attempt to overcome by firing again. It succeeds eventually because the linking pulse train envelope saturates more quickly than the threshold pulse train, allowing the threshold to catch up after M' pulses. (Reprinted with permission from [26]).
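The pulse count for the external-group case of Figure 5(a) can be checked by stepping the threshold directly. This is a rough sketch under stated assumptions, not the authors' simulation: the threshold starts at $F$, gains $V_T$ at each firing, and decays exponentially between firings spaced $\tau_r$ apart, while the composite linking pulse of N simultaneous inputs decays with rate $\alpha_L$. The first-order estimate $\beta F N/(V_T + \alpha_L\tau_r\beta F N)$ is evaluated alongside it. Names and parameter values are hypothetical.

```python
import math

def pulses_needed(F, N, beta, V_T, alpha_T, alpha_L, tau_r, max_pulses=10000):
    """Count firings until the threshold exceeds the internal activity."""
    theta = F                      # threshold equals F at t = 0
    for M in range(1, max_pulses + 1):
        theta += V_T               # recharge at the M-th firing
        t = (M - 1) * tau_r        # time of the M-th firing
        U = F * (1.0 + N * beta * math.exp(-alpha_L * t))
        if theta > U:
            return M
        theta *= math.exp(-alpha_T * tau_r)   # decay until the next firing
    raise RuntimeError("threshold never exceeded the linking pulse")

def pulses_first_order(F, N, beta, V_T, alpha_L, tau_r):
    """First-order estimate of the pulse count."""
    return beta * F * N / (V_T + alpha_L * tau_r * beta * F * N)

params = dict(F=1.0, N=50, beta=0.2, V_T=1.0, alpha_L=1.0, tau_r=0.01)
M_direct = pulses_needed(alpha_T=0.1, **params)
M_approx = pulses_first_order(**params)
print(M_direct, round(M_approx, 2))
```

With these values the direct count and the first-order estimate agree to within about one pulse, which is the level of agreement a first-order expansion can be expected to give.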
4
Multiple Receptive Field Inputs
The pulse-coupled neural network is a dendritic model. The inputs from the receptive fields enter along the length of the dendrite, the linking modulation occurs at the point of entry, and the dendritic signal flows toward the cell body. There can be many inputs. The internal activity U is in general of the form

$$ U = F(1 + \beta_1 L_1)(1 + \beta_2 L_2)(1 + \beta_3 L_3)\cdots(1 + \beta_n L_n). \tag{18} $$
This is for one dendrite. A cell can have many dendrites. They are summed to form the total input to the cell, and can be excitatory or inhibitory. If the products are carried out, the internal activity has all possible products of all the receptive fields. These are products of weighted sums of inputs, as shown in equations (1) and (2). It is seen, then, that these are general higher-order networks. Eckhorn argues that the inputs far out on the dendrite have small synaptic time constants, while those close to the cell body have large synaptic time constants, so there is a transition from "feeding" to "linking" inputs along the length of the dendrite. The receptive fields can overlap, they can be offset, and each one can have its own kernel defining its weight distribution. Now, a given weight distribution W can give the same weighted response for more than one input distribution. This corresponds to a logical "OR" gate in that sense. The linking modulation uses an algebraic product, which corresponds to a logical "AND" gate. The inhibitory inputs give logical complementation. In this view (Figure 6), each neuron is a large multirule logical system. This property was used to achieve exact scale, translation, and rotation invariance as shown by the simulations discussed later.
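The statement that carrying out the products yields all possible products of the receptive field inputs can be illustrated with a short sketch. It evaluates the product form of the internal activity for one dendrite and compares it with the sum of its expanded terms grouped by order. The function names and input values are hypothetical.

```python
import math
from itertools import combinations

def internal_activity(F, betas, links):
    """Product form: F times the product of (1 + beta_i * L_i)."""
    U = F
    for b, L in zip(betas, links):
        U *= 1.0 + b * L
    return U

def expanded_terms(F, betas, links):
    """Sum of all k-fold products of the beta_i * L_i, for each order k."""
    x = [b * L for b, L in zip(betas, links)]
    return {
        order: F * sum(math.prod(c) for c in combinations(x, order))
        for order in range(len(x) + 1)
    }

F = 2.0
betas = [0.1, 0.2, 0.3]      # hypothetical linking strengths
links = [1.0, 0.5, 2.0]      # hypothetical receptive field sums
U = internal_activity(F, betas, links)
terms = expanded_terms(F, betas, links)
print(U, terms)
```

The order-0 term is the feeding input alone, the order-1 terms are the usual pairwise modulations, and the higher orders are the cross products that make the network higher-order.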
5
Time Evolution of Two Cells
[Figure 6 diagram: receptive fields feeding a dendrite that leads to the cell body. Each dendrite is a logical rule: the RF weighted sums act as "OR" and the linking product acts as "AND".]

FIGURE 6. The linking field model neuron is a multirule logical system. A dendrite receives inputs from many receptive fields along its length. Each input modulates the dendritic signal by the factor $(1 + \beta_n L_n)$ for the nth input. The receptive fields can give the same signal for more than one input distribution and thus correspond to a logical "OR". The product term in the modulation factors corresponds to a logical "AND". These logic gate correspondences are not exact, but they can be used effectively, as shown by the example discussed in the text. (Reprinted with permission from [3]).

This section shows how to follow the time evolution of the pulse outputs of a two-cell system. As each cell fires, it can capture the other cell and shift its phase. By constructing an iterative map of the phase shifts from one pulse to the next, the time of firing can be predicted. The map plots the current phase versus the next phase. The simplest form of the map, discussed here, is constrained to one-way linking. There are two cells. The first one has a feeding input $F_1$ and the second has $F_2$. The first cell sends a linking input to the second, but not vice versa. It is assumed that the linking pulses are Kronecker delta functions (0 or 1), with no decay tail. The threshold is assumed to recharge instantly by an amount $V_T$ from the point where it intersected the internal activity. In this case the forbidden zone is equal to the capture zone. To form the map, first construct the threshold diagram of Figure 7. Pulses can intersect the internal activity outside the forbidden zone, including on the leading vertical edge of the linking pulse. This then defines an upper trace, where the recharged threshold can begin its decay back down to the internal activity. The upper trace is simply the lower one, raised up by a distance $V_T$. It is effectively a launch platform from which the recharged threshold begins its downward decay. When the threshold again intersects the lower trace, it recharges and comes to a new location on one of the upper traces at a later time. This generates a mapping from one upper trace to another, and it can be used to make the iterative map with which to follow the time behavior of the system. Let the total length along the trace be X. Note that this consists of a horizontal (H) section followed by a short vertical (V) section corresponding to the
leading edge of the linking pulse (Figure 7). Let the remapped length be Y. If the threshold launches from the horizontal part of X, it can hit either a horizontal or vertical part of Y, and the same is true for launch from the vertical part of X. The mapping accordingly will be linear (horizontal to horizontal, vertical to vertical), exponential (horizontal to vertical), or logarithmic (vertical to horizontal). There are five distinct cases, depending on where the mapping starts and ends. They are

Case I: HV - HH - VH - VV
Case II: HV - HH - VH
Case III: HH - HV - HH - VH
Case IV: HH - HV - VV - VH
Case V: HH - HV - VV

The iterative map for Case I is shown in Figure 8. It is piecewise continuous and has an upper section and a lower section. All the curve sections can be written parametrically in terms of the inputs $F_1$, $F_2$, the time constants $\alpha_T$, $\alpha_L$, the linking strength $\beta$, the linking period $\tau_L$ and pulse period $\tau_T$, the capture zone time period $\tau_c$ (which is also the forbidden zone in this case), and the number N of linking periods spanned by the threshold pulse period. The map of Figure 8 can be followed, step by step, by reference to the traces shown in Figure 7. Suppose a pulse begins on the upper trace's horizontal region and maps to the next lower trace's vertical region, following the b-b decay curve of Figure 7, for example. This would be an HV transition in Figure 8. It is reset by $V_T$ to the corresponding upper trace. From there, it decays and hits the horizontal section of the next lower trace, as indicated by the e-e decay curve of Figure 7. This is a VH transition. It is again reset to the upper trace by $V_T$, decays to a horizontal section through an HH transition (the a-a decay curve of Figure 7), resets to the upper trace, again decays to another horizontal section (HH), resets, and this time maps from a horizontal section to a vertical section (HV) as shown in Figure 8. This follows the two-cell system through one cycle around the phase map of Figure 8. Note that although it has again reached an HV transition, it occurs at a different point than the first HV transition. If the system approaches a limit cycle in Figure 8, this means that the corresponding cell has a periodic pulse train output.
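The one-way linking case can be followed numerically with an event-driven sketch. This is an illustration under the idealizations stated in the text, plus some labeled assumptions: the linking pulses are instantaneous with unit height and arrive with the period of the first cell, the threshold of the second cell decays as a pure exponential with zero offset, and a linking pulse that arrives before the capture zone has no effect. The function name and parameter values are hypothetical.

```python
import math

def simulate_one_way(F1, F2, beta, V_T=1.0, alpha_T=1.0, t0=0.0, n_fires=60):
    """Firing times of cell 2, which receives linking pulses from cell 1."""
    tau_1 = math.log(1.0 + V_T / F1) / alpha_T   # period of cell 1
    tau_c = math.log(1.0 + beta) / alpha_T       # capture zone of cell 2
    t, theta = t0, F2 + V_T                      # cell 2 has just recharged
    k = int(t0 // tau_1) + 1                     # next linking pulse index
    fires = []
    for _ in range(n_fires):
        # Time at which the threshold would reach F2 without linking.
        t_nat = t + math.log(theta / F2) / alpha_T
        # Pulses arriving before the capture zone opens are ignored.
        while k * tau_1 < t_nat - tau_c:
            k += 1
        t_link = k * tau_1
        if t_link < t_nat:
            # Captured on the vertical edge: recharge from the crossing point.
            theta = theta * math.exp(-alpha_T * (t_link - t)) + V_T
            t = t_link
            k += 1
        else:
            # Natural firing on the horizontal section.
            t, theta = t_nat, F2 + V_T
        fires.append(t)
    return fires, tau_1

fires, tau_1 = simulate_one_way(F1=1.0, F2=0.95, beta=0.2, t0=0.3)
intervals = [b - a for a, b in zip(fires, fires[1:])]
print(intervals[:3], intervals[-3:])
```

In this example the second cell starts out firing at its own slightly longer period, drifts in phase until a linking pulse falls inside its capture zone, and from then on fires with the period of the first cell.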
5.1
The Linking Decay Tail Is an Unstable Region
A geometrical argument can be used to show that the linking decay tail is an unstable region. Suppose there are two mutually linked cells, both fed by the same input F. Then they pulse at the same basic frequency. Now suppose that they are out of phase such that they link on each other's linking decay tail, as shown in Figure 9(a): Each cell's threshold intersects the internal activity level of the other cell beyond the capture zone. Consider first cell #1. It links on the decay tail of the linking input from cell #2 at point $A_1$, recharges to the upper trace, decays, and links again at point $B_1$. The diagram shows a composite trace combining the upper and lower traces for cell #1, with points $A_1$ and $B_1$ both on it. A similar composite trace is true for cell #2. Now consider both cells, as shown in Figure 9(b). The difference $A_2 - A_1$ is the change in time separation between the firing of the two cells. Due to the difference in the height of the linking trace at points $A_1$ and $A_2$, $A_2 - A_1$ will in general not be zero. (There is a single point on the decay tail where this difference is zero, but it is an unstable point.)

[Figure 7 diagram: threshold diagram showing the upper trace, the lower trace, and the time axis.]

FIGURE 7. Two cells with one-way linking. The top figure shows the threshold diagram for the cell receiving an idealized linking pulse from the other cell. The second cell does not receive linking from the first cell (two-way linking is shown in Figure 9). The threshold recharges from the lower trace by $V_T$, defining an upper trace as well. When the threshold decays from the upper trace to the lower and then is recharged back to the upper trace, it defines a mapping between upper traces that can be used to track the time evolution of the pulse history of the system. (Reprinted with permission from [26]).
FIGURE 8. Iterative map. The horizontal axis is the total distance along the upper trace of Figure 7, from which the threshold can begin its decay, and the vertical axis is the distance along the upper trace where the pulse returns after it has recharged. There are five distinct cases, and each case is defined by the particular values of the two-cell system and its two feeding input strengths. For each case there are four possible transitions, HH, HV, VH, VV, corresponding to the initial and final locations on the traces of Figure 7. H indicates horizontal, V indicates vertical. These transitions are discussed in the text. (Reprinted with permission from [26]).

It is clear from the diagram that the firing time $B_1$ of cell #1 will move closer to the leading edge of the linking pulse from cell #2 by an amount $A_2 - A_1$. The same is true for $B_2$. The cells constantly try to catch up with each other by firing more frequently, but each one's gain helps the other one gain more, and the overall effect is that they repel each other out of the decay tail region. After several cycles, one of the cells' thresholds will decay into the leading edge of the linking pulse from the other cell and thus will fire at essentially the same time as that cell. Since both have the same
feeding input, they will be phase locked together from this time on. This shows how two cells with the same feeding input will always become phase locked together, regardless of their initial phase difference, due to the finite decay tails of the linking pulses.
6
Space to Time
Consider a group of weakly linked neurons. Suppose at time zero all the neurons fire together. As time goes by they will occasionally link in different combinations, as illustrated in Figure 10. Each neuron has its own basic firing rate due to its particular feeding input. Suppose further that at time T the neurons' combined firing rates and linking interactions cause them all to fire together a second time. This duplicates the initial state at time zero. Then everything that happened during time T will happen again in the same order, and all the neurons will fire together again at time 2T. This will continue, resulting in periodic behavior of the group with period T. The assumption of a single exact repetition of a given state (all the neurons fire together, for example) leads to the conclusion that everything that happened between the repetitions must necessarily happen again in the same order, in a permanently periodic way, for every neuron in that group. If all the outputs of the group are linearly summed, the result will be a single periodic time series that is the signature of that spatial input distribution. This is the time series $S(t)$ for that group of neurons [7]. The length of time required for periodicity is primarily governed by the ratio $\tau_c/\tau_{typ}$, where $\tau_{typ}$ is the characteristic pulse period of the input image. (For large $\beta$ the ratio can be much greater than one, in which case the group links on every pulse and is completely periodic.) Two other factors that promote periodicity in a two-neuron system are linking in quasiharmonic ratios and linking on the decay tail of the linking pulses. For quasiharmonic pulse rates such that

$$ |m\tau_2 - n\tau_1| < \tau_c, \qquad m, n\ \text{integers}, \tag{19} $$
the two neurons will periodically link about every $m\tau_2$ seconds. When two mutually linked neurons link on the decay tails of the linking pulses (Figure 9), the cycle-to-cycle behavior is that they actively expel each other from this region into the leading edge linking region. While both effects promote periodicity, they do not guarantee it. The time required to achieve periodicity, and the overall period length, can be large for large, weakly-linked slabs.

The following interpretation of the time series relates it to the input image's intensity histogram. The network maps intensity to frequency. The size of an isointensity image patch determines how many neurons fire at that corresponding frequency, so patch size maps to amplitude. The image's intensity histogram counts the number of pixels with a given intensity, while the amplitude of a given frequency counts the number of neurons firing at that rate. The frequency spectrum of the time signal is the intensity histogram of the input image as mapped through the sigmoidal response. Although this argument holds exactly only for a system with zero linking, a linked system will generate an intensity-quantized histogram whose envelope generally follows that of the analog input image. This is true for discrete pulse models and for continuous oscillator models, and for any other model where the output frequency is proportional to the input signal strength.

(a) Threshold diagram for cell #1, showing origin of composite trace diagram.

(b) Interaction of cell #1 and cell #2. $B_2$ actually occurs in time on the next cycle, at the point ($B_2$).

FIGURE 9. Two cells each linking on the other's linking pulse decay tail. Upper and lower traces are defined for each cell, and a composite trace is constructed that shows for each cell its map points A and B from one recharging point to the next (a). Both cells have the same feeding input strength F. Figure 9(b) uses the composite traces for both cells to show their interaction. Each cell's second recharging point B shifts the linking pulse time for the pulse that it sends to the other, with the result that both cells' firing points steadily move closer to the leading edge of the linking pulses until one or the other locks in the capture zone. The cells are then phase locked. When finite linking decay exists, as assumed here, this interaction shows that two cells with the same feeding input strength will always become phase locked. (Reprinted with permission from [26]).

FIGURE 10. Formation of a periodic time series. Neurons 1-4 all fire together at t = 0. As time passes, they occasionally link in various combinations. If at time T they again link so as to fire together, the situation will be the same as at t = 0. The system will repeat its behavior, generating a time series. The linear sum of the group's outputs is the periodic time signature of the input distribution to neurons 1-4. (Reprinted with permission from [3]).

For a linked slab, the coherent periodicity of the time signal suggests that there must exist phase factors as well as frequency and amplitude. Suppose that the time signal $S(t)$ is expressed as a sum of delta function pulses:

$$ S(t) = \sum_n \sum_{k=1}^{K} a_k\,\delta(t - nT - \phi_k), \tag{20} $$

where T is the periodicity, $a_k$ is the amplitude of the kth subgroup, and $\phi_k$ is the time offset of the subgroup of cells with amplitude $a_k$. The time offset
is between zero and T, and there are K subgroups that are linked into the overall repetition period T. If a Fourier transform is taken, it factors into a sum of complex phases and a sum representing the repetition period:

$$ \mathrm{F.T.}(S) = \left[\sum_{k=1}^{K} a_k e^{i\omega\phi_k}\right]\left[\sum_n e^{i\omega nT}\right]. \tag{21} $$
The corresponding "histogram" must in some form include the phases as well as the amplitudes. Other transforms may be more appropriate; the Fourier transform was used here for illustrative purposes. This discussion shows that the geometrical content of an image, as well as its intensity, is encoded in the time signal, and that distance-dependent linking action provides a way to include syntactical information. The time signals are object-specific. They are a signature, or identification code, that represents a two-dimensional image as a time-domain signal that can be processed by the neural net. The signatures have some degeneracy, but this can be an advantage rather than a drawback, since certain classes of degeneracy can also be viewed as invariance.
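The histogram interpretation for the zero-linking case can be sketched in a few lines. This is an illustration only. It assumes a pulse rate of the form $1/(\tau + \tau_r)$ with $\tau = \ln(1 + V_T/F)/\alpha_T$ (zero threshold offset), and the small image and all names are hypothetical. Each pixel drives one neuron, and the amplitude at a given frequency is the number of neurons firing at that rate.

```python
import math
from collections import Counter

def pulse_rate(F, V_T=1.0, alpha_T=1.0, tau_r=0.05):
    """Assumed intensity-to-frequency map (zero threshold offset)."""
    return 1.0 / (math.log(1.0 + V_T / F) / alpha_T + tau_r)

# Hypothetical 3x3 image with three isointensity patches.
image = [[1, 1, 2],
         [2, 2, 4],
         [4, 4, 4]]
pixels = [p for row in image for p in row]

histogram = Counter(pixels)                       # pixels per intensity
spectrum = Counter(pulse_rate(p) for p in pixels) # neurons per frequency

for f in sorted(spectrum):
    print(f"frequency {f:.4f}: amplitude {spectrum[f]}")
```

Because the rate map is monotonic, sorting the spectrum by frequency reproduces the intensity histogram sorted by intensity, with the frequency axis warped by the sigmoidal response.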
7
Linking Waves and Time Scales
The linking pulses are transmitted very quickly as compared to the firing rates of the cells. If the receiving cells are within their capture zone, they will be induced to fire upon receipt of the linking inputs, and their output pulses can in turn capture other cells. This causes a wave of pulses to sweep across a region of the slab. The propagation of the wave will follow the underlying feeding input distribution, generally flowing down gradients and firing larger or smaller areas of cells according to how many are within their capture zones. The time profile of the firing history will reflect the shape of the underlying feeding spatial distribution, and, for the case of the feeding input being an image intensity pattern, be related to the geometry of the image, as shown in Figure 11. The repetition rate of a linking wave, i.e., how often it sweeps through an area, is determined by the intensity in that area. On a time scale that shows the linking wave profiles, the profiles can be taken as elementary signatures identifying their areas. On a time scale that compresses the linking wave profiles into a single time bin, the repetition period of each area can be used to segment that subregion of the total image. These segmented areas are in effect "giant neurons," i.e., synchronous groups. The linking still exists, and these groups transmit and receive composite linking pulses. They have their own group capture zones and behave like single neurons in many ways, with the exception that their output pulse is no longer a binary 1 or 0 but instead has an amplitude that is equal to the number of individual cells comprising the synchronous group.
Accordingly, group linking waves can exist. This is discussed in the next section. The time profile on this scale is the signature of the group of linked groups, and on yet another still-larger time scale the repetition period of the group of groups can be used to segment it into a supergroup. At this point the interpretation from an image processing standpoint is that the syntactic information of a large composite image has been encoded into an object-specific signature for that image. In principle, further time scales can be incorporated indefinitely in a self-similar manner, leading to groups of supergroups, supergroups of supergroups, and so on, each having its own time signature and segmentation time scale. This is indicated by Figure 12. It reduces the fundamental problem of image understanding to one of time correlation of time signatures, which may be a solvable problem. It has implications for how the brain works to send and receive signals. The Eckhorn linking field and in general all higher-order networks when used with pulsed neuronal models provide a specific mechanism to generate the essential time signals that carry syntactic information about arbitrary spatial distributions.
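The way a linking wave sweeps through the cells that are within their capture zones can be sketched on a one-dimensional chain. This is a toy illustration, not the authors' simulation: it assumes nearest-neighbor linking of unit strength, instantaneous pulses, and a cell that fires as soon as its linked internal activity $F(1 + \beta)$ reaches its current threshold. The function name and the values are hypothetical.

```python
def linking_wave(F, theta, beta, seed):
    """Return the cells fired at each step of a wave started at `seed`."""
    fired = [False] * len(F)
    fired[seed] = True
    frontier = [seed]
    steps = [[seed]]
    while frontier:
        captured = []
        for i in frontier:
            for j in (i - 1, i + 1):          # nearest neighbors only
                if 0 <= j < len(F) and not fired[j]:
                    # Within the capture zone if the linked activity
                    # reaches the current threshold.
                    if F[j] * (1.0 + beta) >= theta[j]:
                        fired[j] = True
                        captured.append(j)
        if captured:
            steps.append(sorted(captured))
        frontier = captured
    return steps

F = [1.0] * 6
theta = [1.1, 1.1, 1.0, 1.1, 1.5, 1.1]   # cell 4 is far from firing
wave = linking_wave(F, theta, beta=0.2, seed=2)
print(wave)
```

The wave spreads outward from the seed and stops at the cell whose threshold is still too high, which also shields the cell beyond it. The list of cells per step is the kind of time profile that the text takes as the signature of the area.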
8
Groups
On a time scale that segments groups of cells, multiple pulses occur even for very weak linking strengths. Consider an idealized situation (Figure 13) where there are two groups, A and B, containing A and B cells, respectively. Assume for simplicity that each group sends a linking pulse of amplitude A' or B' to the other. Look at a cell in group A. Let $M'_A$ be the number of multiple pulses of group A. Then equation (17) gives an estimate $M'_A = \beta F_A A'/V_T$ for large numbers of multiple pulses. The repetition period of group A is longer than that for an individual cell because its threshold must rise via multiple pulses within the group to overcome A'. Approximately, it can be obtained from equation (12) by substituting $M'_A V_T$ for $V_T$. Now look at the linking inputs, and write the total internal activity:

$$ U_A = F_A\left(1 + \beta(A'Y_A + B'Y_B)\right). \tag{22} $$

The Y's give the moments in time when the groups' pulses occur, each at its own characteristic period. The groups A and B rescale all their characteristic times in proportion to the group sizes. The capture zone for group A with respect to group B, for example, is now

$$ \tau_c = \frac{1}{\alpha_T}\ln(1 + \beta B') $$

and the decay time of group A is

$$ \tau_A = \frac{1}{\alpha_T}\ln\left(1 + \frac{M'_A V_T}{F_A}\right) = \frac{1}{\alpha_T}\ln(1 + \beta A'). \tag{23} $$

[Figure 11 panels: signature on time scale $\Delta\tau$; segmentation on time scale $\tau$.]

FIGURE 11. Linking waves. An elementary region generates a linking wave that sweeps through it. The time history of the wave amplitude as summed over the region depends on the geometry of the area and is its signature. The repetition rate of the wave defines a time scale on which the elementary area can be segmented. (Reprinted with permission from [26]).
[Figure 12 diagram labels: elementary image patch, image feature, composite object.]
FIGURE 12. Time scales. Linking waves for elementary areas make signatures for them. On a time scale where these areas are segmented, the signatures are compressed into a single time bin and become a composite pulse. The composite pulses link as groups (see Figure 13) to make linking waves on a group of elementary groups. The time history of the amplitude of these waves is the signature for the group of groups. Increasing the time scale so that these signatures are in turn compressed into a single time bin leads to supergroups, which in turn link together and form linking waves on that time scale. The process continues, leading to signatures for entire images as suggested by the figure. (Reprinted with permission from [26]).
The period of group A is the sum of the time required for the pulse burst and the decay time. This is a major change from the operation in the single pulse regime. There, the period depended on the individual cells' feeding inputs, while here it depends on the linking input from its own group. Since that linking input will be proportional to the area of the group and not its intensity, the behavior of a system of groups in the multiple pulse regime is driven by the sizes of the areas rather than only by their intensities. The intensity, however, will partially control the number of pulses in the bursts from each group (see equation (17)) and thus will enter into the period via M'. The size of the capture zone is still a function of the linking input, so the ratio of it to the group's period will determine the degree of linking among groups. This ratio can still be small, which defines the linking to be in a "weak linking" regime again. Even though the system emits multiple pulses and synchronous bursts, it is still in a "weak linking" mode on this larger time scale of group interactions. The system for groups is scaled in proportion to the number of cells in each group (with allowance for multiple pulses), giving a larger time scale on which linking among groups occurs, but in the same way as linking occurs for individual cells. This is illustrated in Figure 13.
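The group quantities can be put into numbers with a short sketch. It uses the burst estimate $M'_A = \beta F_A A'/V_T$ and the group decay time of equation (23). The group capture zone is taken here, as an assumption made by analogy with the single-cell case, to be $\ln(1 + \beta B')/\alpha_T$. All names and parameter values are hypothetical.

```python
import math

def burst_count(beta, F_A, A_link, V_T):
    """Estimated number of pulses in a burst of group A."""
    return beta * F_A * A_link / V_T

def group_decay_time(M_A, V_T, F_A, alpha_T):
    """Decay time with the recharge V_T replaced by M_A * V_T."""
    return math.log(1.0 + M_A * V_T / F_A) / alpha_T

def group_capture_zone(beta, B_link, alpha_T):
    """Assumed capture zone of group A for a linking pulse from group B."""
    return math.log(1.0 + beta * B_link) / alpha_T

beta, alpha_T, V_T, F_A = 0.01, 1.0, 0.5, 2.0
A_link, B_link = 100.0, 20.0     # composite linking amplitudes A' and B'

M_A = burst_count(beta, F_A, A_link, V_T)
tau_A = group_decay_time(M_A, V_T, F_A, alpha_T)
tau_c = group_capture_zone(beta, B_link, alpha_T)
print(M_A, tau_A, tau_c, tau_c / tau_A)
```

With these numbers the decay time depends only on the group's own linking amplitude A', and the ratio of capture zone to decay time stays well below one, which is the "weak linking" regime on the group time scale described above.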
9
Invariances
If there are symmetries in a receptive field weight pattern such that any two parts of it are the same, then an exchange of the corresponding parts of an input image distribution will not change the sum of the product of the field and the image. The exchanged parts of the input image will still be multiplied by the same weight values, because the weight pattern was the same in those two regions. The exchange symmetry of the weight pattern makes the output of that field invariant against the corresponding exchange operation acting on the input image. This is because the neuron's output is determined by the internal activity $U_j$, which is a function of the feeding and linking inputs. They, in turn, are weighted sums. In general, if the image changes in a way that fits the symmetries of the feeding and linking receptive fields so that the internal activity doesn't change, then the neuronal output will be invariant against those changes. The utility of this is that the symmetries of the receptive fields then correspond to invariances of the time signal generated by the input image [7] because the time signal is driven by the internal activity. This is a very general principle. It can be used to make desirable time signal invariances by an appropriate choice of receptive field symmetries. The pulse-coupled network produces time series that encode in their phase structure the two-dimensional spatial input distribution, including its geometrical connectivity relationships. Symmetries can be introduced in the receptive fields to make the time signature of an image invariant against translation, rotation, and scale. Simulation results also show that there is a significant insensitivity to scene illumination and distortion, and further that there is some limited insensitivity to changes in the overlying patterns (shadows) on a given image shape.

FIGURE 13. Group linking. Two groups A and B send linking pulses to each other. Their thresholds must recharge to a height that exceeds their own group action (Figure 5), and so they reach heights approximately equal to their group linking amplitudes. These are much greater than for single-cell recharging. But the inter-group linking pulses are also much larger, and as a result the relative heights of both the thresholds and the linking inputs scale with group size. The ratio of the capture zone and the group periods can still be small, giving effectively "weak linking" despite the presence of multiple pulses. The detailed structure of the amplitudes is shown in Figure 5; it is simplified here for clarity. (Reprinted with permission from [26]).

The design objective is to make the internal activity invariant by introducing geometrical symmetries into the receptive field weight pattern. (1) For translational invariance use the same receptive field weight pattern at every neuron. (2) For rotational invariance make the receptive field patterns circularly symmetric. A translated and rotated image then covers a different set of neurons, but due to the translational and rotational symmetry of their receptive fields, sees the same receptive field patterns as before. The time signal is a sum over all the neurons, so it doesn't matter which neurons are used. (3) For scale invariance use an inverse square radial falloff. This does not make the internal activity invariant against distances $r$, but rather against scale changes as represented by the factor $k$ in the rescaled distance $kr$.
To see this, consider an optical image that is rescaled by a change in the object distance (Figure 14). In this case, the intensity per image patch is constant. The number of cells affected by the rescaled patch is changed, but not their output pulse frequency. A neuron receiving the input at the rescaled location of the original image patch is driven by the same intensity as the neuron at the original location. For a rescaling factor of $k$, $Y(k\mathbf{R}) = Y(\mathbf{R})$. The linking input to that neuron, using an inverse square kernel, is

$$
\begin{aligned}
L(k\mathbf{R}) &= \int_0^{2\pi}\!\!\int_{\rho_0}^{\infty} \frac{1}{(k\rho)^2}\, Y\big(k(\mathbf{R}+\boldsymbol{\rho})\big)\, k\rho\, k\, d\rho\, d\theta \\
&= \int_0^{2\pi}\!\!\int_{\rho_0}^{\infty} \frac{1}{\rho^2}\, Y\big(k(\mathbf{R}+\boldsymbol{\rho})\big)\, \rho\, d\rho\, d\theta = L(\mathbf{R}).
\end{aligned}
\tag{24}
$$

This removes the scale factor dependence $k$ from the integrand. The lower integration limit $\rho_0$ is fixed and does not scale, so the above relation is not an exact equality in some cases. This will be discussed below.
FIGURE 14. Geometry for scale invariance (panels: original optical image and rescaled optical image, each showing the image patch). A neuron at $\mathbf{R}$ receives a linking contribution from a neuron at $\boldsymbol{\rho}$. When the image is rescaled, the image patch at $\mathbf{R}$ goes to $k\mathbf{R}$ and the patch at $\boldsymbol{\rho}$ goes to $k\boldsymbol{\rho}$. Only the latter patch is shown. For the case of an optical image rescaled by a change in the object distance, the intensity per image patch is constant. The object is to design a linking receptive field such that $L(k\mathbf{R}) = L(\mathbf{R})$. (Reprinted with permission from [3]).

If the feeding field is a single pixel (this is not essential and is done here only for simplicity), then $F(k\mathbf{R}) = F(\mathbf{R})$. The internal activity of the rescaled image is thus the same as that for the unscaled image:

$$
U(k\mathbf{R}) = F(k\mathbf{R})\big(1 + \beta L(k\mathbf{R})\big) = F(\mathbf{R})\big(1 + \beta L(\mathbf{R})\big) = U(\mathbf{R}).
\tag{25}
$$
There is a problem that must be resolved before complete scale invariance is achieved. It appears to be less important for large images on a fine grid of cells, but when the isointensity patch size covers less than approximately 10 × 10 cells in the simulations, it has some effect. The problem is that the local group around a neuron also changes in scale. The linking input due to the local group accordingly varies with scale, making the internal activity change as well. The cause is the fixed inner edge $\rho_0$ of the linking field. It does not scale. External groups do not have this property because all their boundaries shift accordingly as the image scale is changed. For simplicity consider a neuron at the center of its local patch, which is surrounded by an external patch, making two concentric circles, as shown in Figure 15. Let $\rho_0$ be the fixed inner edge of the local patch, and $Y_1$ and $Y_2$ the pulse activities in the local and external patches, respectively. Then

$$
L = 2\pi \int_{\rho_0}^{r} \frac{Y_1}{\rho^2}\,\rho\,d\rho + 2\pi \int_{r}^{R} \frac{Y_2}{\rho^2}\,\rho\,d\rho
= 2\pi Y_1 \ln\frac{r}{\rho_0} + 2\pi Y_2 \ln\frac{R}{r}.
\tag{26}
$$
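Equation (26) can be checked numerically, and the same check shows what happens when $r$ and $R$ are rescaled while $\rho_0$ is held fixed. The following is a minimal sketch, not part of the original simulations: the function names and the numerical values of $Y_1$, $Y_2$, $\rho_0$, $r$, and $R$ are illustrative assumptions.

```python
import math

def linking_closed_form(y1, y2, rho0, r, big_r):
    """Right-hand side of equation (26)."""
    return 2 * math.pi * (y1 * math.log(r / rho0) + y2 * math.log(big_r / r))

def linking_numeric(y1, y2, rho0, r, big_r, steps=20000):
    """Midpoint-rule estimate of 2*pi * integral of (Y / rho**2) * rho d(rho)."""
    def radial_integral(y, lo, hi):
        h = (hi - lo) / steps
        return sum(y / (lo + (i + 0.5) * h) * h for i in range(steps))
    return 2 * math.pi * (radial_integral(y1, rho0, r) + radial_integral(y2, r, big_r))

# Illustrative values only (assumed, not from the text).
y1, y2, rho0, r, big_r = 1.0, 0.5, 1.0, 4.0, 10.0
base = linking_closed_form(y1, y2, rho0, r, big_r)

for k in (1.0, 2.0, 4.0):
    # r and R scale with the image; the inner edge rho0 does not.
    scaled = linking_closed_form(y1, y2, rho0, k * r, k * big_r)
    # The two printed numbers agree: the excess is 2*pi*Y1*ln(k).
    print(k, scaled - base, 2 * math.pi * y1 * math.log(k))
```

The external term $2\pi Y_2 \ln(R/r)$ is unchanged by the rescaling; only the local term picks up the $\ln k$ dependence.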
Under a scale change, $r$ and $R$ become $kr$ and $kR$, but $\rho_0$ is fixed. The linking input to the center neuron then has a scale factor dependence proportional to $Y_1 \ln(k)$. This is the problem. The solution is to make the internal activity distinguish between the local and the external groups, and to make both scale-invariant. The local group can be made independent of scale by using a nearest-neighbor receptive field with a fixed outer limit so it fits in the image's characteristic isointensity patch size. To distinguish between local and external groups, however, it is necessary to use the generalized linking field model with multiple linking fields as well as excitatory and inhibitory dendritic inputs. The dendritic signals are summed in the cell body and can be either excitatory or inhibitory. The weighted sums in the receptive fields correspond to fuzzy OR-gates, while the products from the linking modulation correspond to fuzzy AND-gates. This view will be used to construct a "semi-exclusive OR" to let the neuron distinguish between the local and the external linking inputs. Use two dendrites, each having two linking inputs. One dendrite is excitatory, the other is inhibitory. The same linking inputs $L_1$ and $L_2$ are used on both, and both are fed by the same feeding input $F$, but the linking strength coefficients are all different:

$$
\begin{aligned}
U_{\mathrm{exc}} &= +\alpha_1 F (1 + \beta_1 L_1)(1 + \beta_2 L_2), \\
U_{\mathrm{inh}} &= -\alpha_2 F (1 + \beta_3 L_1)(1 + \beta_4 L_2), \\
U_{\mathrm{total}} &= U_{\mathrm{exc}} + U_{\mathrm{inh}}.
\end{aligned}
\tag{27}
$$
Choose the $\alpha$'s and $\beta$'s such that they are all positive and such that the total internal activity has the form

$$
U_{\mathrm{total}} = F\left(1 + \beta L_1 + \beta'\left[1 - \frac{L_1}{L_{1(\max)}}\right] L_2\right).
\tag{28}
$$

FIGURE 15. Geometry used to show that the fixed inner radius $\rho_0$ of the local group $L_1$ causes a dependency on the rescaling factor $k$. The external group $L_2$ is in the annulus from $r$ to $R$, while $L_1$ extends from $\rho_0$ to $r$. (Reprinted with permission from [3]).

For the values $\beta = 0.2$, $\beta' = 0.3$, and $L_{1(\max)} = 40$ used in the simulations, one possible set of coefficients is $\alpha_1 = 2$, $\alpha_2 = 1$, $\beta_1 = 1$, $\beta_2 = 219/640$, $\beta_3 = 1.8$, and $\beta_4 = 123/320$. $L_{1(\max)}$ is the maximum possible value of the local-neighborhood linking input $L_1$, and $L_2$ is a linking input from a larger and more extended receptive field such as the inverse square field. $L_1$ gives the input from the local group, and $L_2$ gives the input from external groups that do not contain the neuron being linked. When the entire local group fires, $L_1 = L_{1(\max)}$, and the neuron sees only its nearest neighbors. When the local group is quiet, $L_1 = 0$, and the neuron can receive the $L_2$ linking from the external groups. Suppose the rescaled image patch now makes several new adjacent groups out of the local group, all with the same frequency. If they are in phase, the neuron's local group will mask them. If they are not in phase then they will link with the local group through the second linking input and be captured by the local group. Then they will be in phase, and the local group has effectively enlarged to include them but without altering the internal activity seen by a given neuron. When the outer limit of $L_1$ is chosen to overlap the inner limit on $L_2$, the inner boundary of the external group is always the outer boundary of the composite local group, as desired. The system's architecture has translation, rotation, and scale invariance. It is a third-order network, which has been shown [17] to be the minimum order necessary for achieving these invariances all at the same time. An open problem is to derive specific geometrical rules in terms of the synaptic weights through equations (1), (2), and the internal activity equation, for these invariances.
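The quoted coefficient set can be checked against equations (27) and (28) with exact rational arithmetic. The sketch below is only a verification aid; the function names and the test points are assumptions introduced here, while the coefficient values are those given in the text.

```python
from fractions import Fraction as Fr

# Coefficients quoted in the text for equation (27).
a1, a2 = Fr(2), Fr(1)
b1, b2, b3, b4 = Fr(1), Fr(219, 640), Fr(9, 5), Fr(123, 320)
# Target parameters of equation (28).
beta, beta_p, l1_max = Fr(1, 5), Fr(3, 10), Fr(40)

def u_two_dendrites(f, l1, l2):
    """Equation (27): excitatory plus inhibitory dendrite."""
    u_exc = a1 * f * (1 + b1 * l1) * (1 + b2 * l2)
    u_inh = -a2 * f * (1 + b3 * l1) * (1 + b4 * l2)
    return u_exc + u_inh

def u_target(f, l1, l2):
    """Equation (28): the desired 'semi-exclusive OR' form."""
    return f * (1 + beta * l1 + beta_p * (1 - l1 / l1_max) * l2)

# Arbitrary (assumed) test points (F, L1, L2).
checks = [(Fr(1), Fr(0), Fr(0)), (Fr(1, 2), Fr(40), Fr(7)), (Fr(3), Fr(13), Fr(5, 2))]
all_match = all(u_two_dendrites(*c) == u_target(*c) for c in checks)
print(all_match)
```

Expanding equation (27) gives the coefficients $\alpha_1 - \alpha_2 = 1$, $\alpha_1\beta_1 - \alpha_2\beta_3 = 0.2$, $\alpha_1\beta_2 - \alpha_2\beta_4 = 0.3$, and $\alpha_1\beta_1\beta_2 - \alpha_2\beta_3\beta_4 = -0.3/40$, which are exactly the terms of equation (28). When $L_1 = L_{1(\max)}$ the $L_2$ term vanishes, which is the masking behavior described above.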
9.1 Invariance Simulation Results
This model was simulated [3] on a PC. The array size was 33 × 33, and the images were made of five blocks, each with its own intensity, which were rearranged to form the different test images. A cross shape and a "T" shape were used. They differed only in their geometrical arrangement, or syntax, an observation that will turn out to be of vital importance in our discussion of pattern recognition. Each block contained from five to eleven cells on a side, depending on the scale factor, and the background was set to zero in all cases. No noise was added. Analysis of the grid size indicated that reasonable results could be expected down to a 5 × 5 block size for rotation, and the scale increments were chosen so that the blocks measured 5, 7, 9, and 11 cells on a side. The nearest neighbor linking field for $L_1$ was a 3 × 3 square (center excluded), while the outer radius of the inverse square linking field for $L_2$ was fixed at 10 and the inner radius at 1. The simulation's equations were written for discrete time steps using the digital filter form from reference [1]. They are

$$
\begin{aligned}
F &= \mathrm{Image}(j,k)/255, \\
L_{\mathrm{local}}(t+1) &= A_1 L_{\mathrm{local}}(t) + V_L L_1(t), \\
L_{\mathrm{ext}}(t+1) &= A_1 L_{\mathrm{ext}}(t) + V_L L_2(t), \\
\theta(t+1) &= A_2\,\theta(t) + V_T Y(t), \\
Y(t) &= \mathrm{Step}\big(U_{\mathrm{total}}(t) - \theta(t)\big),
\end{aligned}
\tag{29}
$$

where $U_{\mathrm{total}}$ is given by equation (28). The parameter values were $A_1 = \exp(-1/t_1)$, $A_2 = \exp(-1/t_2)$, $t_1 = 1$, $t_2 = 5$, $V_L = 5$, $V_T = 20$, $\beta = 0.2$, $\beta' = 0.3$, $L_{1(\max)} = 40$, and $\mathrm{Image}(j,k)$ was the input image. The results are shown in Figures 16 through 21. The most important result was that the time signatures were object-specific. Each test image generated a distinct periodic time signal that would never be confused with the signal from the other class (cross or "T"). This showed that the pulse-coupled net encoded the images in accordance with their geometrical configuration.
This is because both images were built of the same five blocks arranged in different geometrical configurations. Good invariance was achieved for translation, rotation, and scale. The time signatures of the two test images were easily distinguished in all cases except for the smallest rescaled "T" (Figure 17). Its patch size was 5 × 5. A grid coarseness analysis had indicated that below a 7 × 7 size the grid effects would be significant. The rotated "T" images, likewise, were sensitive to these effects, but their signatures were still distinct from those of the cross image (Figure 16) for patch sizes greater than 5 × 5. The rotated "T" images were translated, as well, to fit in the small slab grid of 33 × 33 cells, so Figure 17 also indicates translational invariance.

The images were tested with different scene illumination levels. It was found that their time signatures (Figure 18) were essentially invariant over a factor of two hundred in illumination. This was not expected, as the ratio of the capture zone time to the neuronal period changes in this case. What happens is that the signature period varies, as expected, but the signature itself remains the same. Detailed examination of these runs after the fact gives a possible explanation: The signatures reflect the propagation of linking waves through the scene object. These waves follow gradients, and changes in the overall scene illumination did not change the relative gradient patterns. There was less variation in the signatures due to scene illumination changes than for other image changes.

Figure 19 shows the effect of image distortion. A coordinate transform of the form $x' = x + 0.01xy$, $y' = y + 0.01xy$ was used to approximate an out-of-plane rotation of about 30 degrees with some perspective added. The signatures retained their characteristic forms sufficiently for the cross and the "T" images to still be correctly classified by their signatures. Again, this suggests a close relationship between the image morphology and the time signature. The insensitivity to distortion arises because the signature generation is more of an area effect than an edge or angle effect.

Image intensity overlays were investigated next. The 9 × 9-scale "T" image was altered by transposing the two lower blocks. This would correspond to a shadow across the image, for example. The result, shown in Figure 20, is not invariant, but shows a distinct correspondence of the new signature to the original. Figure 21 shows the effect of combined image changes. Translation, rotation, scale, scene illumination, and distortional changes were made as indicated in the figure. The new signatures were similar enough to the originals for the altered images to be correctly classified as a cross or a "T" by using only the signatures. They are clearly not strictly invariant, but show a substantial insensitivity to the geometrical changes while retaining their object-specific character.
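The discrete-time update of equation (29) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code used for the reported results: the unit weights of the 3 × 3 field, the reading of the inner and outer radii of the inverse square field, the strict threshold comparison, the update order within a time step, the time constants 1 and 5, and the block layout and intensities of the test image are all choices made here. The time signature is taken to be the number of neurons that pulse at each step.

```python
import math

BETA, BETA_P, L1_MAX = 0.2, 0.3, 40.0
V_L, V_T = 5.0, 20.0
A1, A2 = math.exp(-1.0 / 1.0), math.exp(-1.0 / 5.0)   # assumed time constants 1 and 5

# 3 x 3 nearest-neighbor field, center excluded, unit weights (assumed).
LOCAL = [(dy, dx, 1.0) for dy in (-1, 0, 1) for dx in (-1, 0, 1) if (dy, dx) != (0, 0)]
# Inverse-square field for 1 < distance <= 10 (assumed reading of the radii).
EXT = [(dy, dx, 1.0 / (dy * dy + dx * dx))
       for dy in range(-10, 11) for dx in range(-10, 11)
       if 1 < dy * dy + dx * dx <= 100]

def field_sum(fired, offsets, n):
    """Weighted sum of the pulses in `fired`, as received by each cell of an n x n grid."""
    out = [[0.0] * n for _ in range(n)]
    for y, x in fired:
        for dy, dx, w in offsets:
            yy, xx = y + dy, x + dx
            if 0 <= yy < n and 0 <= xx < n:
                out[yy][xx] += w
    return out

def time_signature(image, steps=40):
    """Number of pulsing neurons per time step for a square 8-bit image."""
    n = len(image)
    feed = [[v / 255.0 for v in row] for row in image]
    l_loc = [[0.0] * n for _ in range(n)]
    l_ext = [[0.0] * n for _ in range(n)]
    theta = [[0.0] * n for _ in range(n)]
    signature = []
    for _ in range(steps):
        fired = []
        for y in range(n):
            for x in range(n):
                u = feed[y][x] * (1 + BETA * l_loc[y][x]
                                  + BETA_P * (1 - l_loc[y][x] / L1_MAX) * l_ext[y][x])
                if u > theta[y][x]:            # Step(U_total - theta), strict comparison
                    fired.append((y, x))
        signature.append(len(fired))
        l1, l2 = field_sum(fired, LOCAL, n), field_sum(fired, EXT, n)
        fired_set = set(fired)
        for y in range(n):
            for x in range(n):
                l_loc[y][x] = A1 * l_loc[y][x] + V_L * l1[y][x]
                l_ext[y][x] = A1 * l_ext[y][x] + V_L * l2[y][x]
                theta[y][x] = A2 * theta[y][x] + (V_T if (y, x) in fired_set else 0.0)
    return signature

def cross_image(n=33, block=5, top=9, left=9):
    """Five blocks of different (illustrative) intensities arranged as a cross."""
    img = [[0] * n for _ in range(n)]
    layout = {(0, 1): 150, (1, 0): 100, (1, 1): 250, (1, 2): 200, (2, 1): 121}
    for (by, bx), val in layout.items():
        for y in range(block):
            for x in range(block):
                img[top + by * block + y][left + bx * block + x] = val
    return img

print(time_signature(cross_image()))
```

Because every neuron uses the same receptive fields and the zero background never pulses, shifting the cross inside the grid leaves this signature unchanged, which is the translational invariance described in the text.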
FIGURE 16. Periodic time signatures and invariances for the cross image. The signatures are the periodic part of the total output time signal of the pulsed array. SC is the scale factor and AC is the rotation angle in degrees. Good scale invariance was found for scales over 1:0.46, and for large rotations of 30 and 45 degrees. The five blocks arranged to form the image were scaled from 11 × 11, 9 × 9, 7 × 7, to 5 × 5 block sizes. The 33 × 33 slab had a background intensity level of zero. Grid coarseness effects were expected for 7 × 7 and smaller block sizes in scale, and for 14 × 14 block sizes in rotation. Grid effects were not severe in this image. (Reprinted with permission from [3]).
FIGURE 17. Periodic time signature and invariances for the "T" image. Same setup as for Figure 16, but with the five blocks rearranged to form a "T". The signature was very distinct as compared to the first case, showing that the net makes unique time signatures for different images even when they are rearrangements of the same components. The scale invariance was good down to the 7 × 7 block size. The rotated images' signatures still followed the overall "T" signature shape in contrast to the cross signature. Their variation from ideal is strictly due to grid effects. (Reprinted with permission from [3]).
FIGURE 18. Intensity invariance (panels for each test image: BRIGHT, $I = 2 I_0$; ORIGINAL; DIM, $I = 0.01 I_0$; all at SC = 0.82, AC = 0). The 9 × 9 block size images were multiplied by an intensity factor $I_0$ corresponding to a change in scene illumination. From $I_0 = 2$ to 0.01 the signature was invariant in its shape, though the period of the signature varied from 13 to 40 time units. (Reprinted with permission from [3]).
10 Segmentation
Image segmentation, the task of partitioning an image into its component parts, may be defined as the process of decomposing a given image $F$ into disjoint nonempty regions, or subimages, $R_1, R_2, \ldots, R_k$ such that

• $R_1 \cup R_2 \cup \cdots \cup R_k = F$;
• $R_i$ is connected for all $i$;
• All pixels belonging to $R_i$ are similar, based on some meaningful similarity measure $M$;
• Pixels belonging to $R_i$ and $R_j$ are dissimilar based on $M$.

In general, image segmentation is a challenging problem. Intensity variations within regions, fuzzy and incomplete boundaries, changing viewing conditions, and the presence of random noise are a few of the factors that make image segmentation a difficult task. In the past, researchers have used classical deterministic and nondeterministic methods, knowledge and rule based systems, and trainable neural networks to build automatic image segmentation systems. A recent survey paper by N. R. Pal and S. K. Pal summarizes many image segmentation techniques reported in the literature [18]. Fast and accurate image segmentation is essential to obtain meaningful results from image analysis or computer vision systems. The next few sections describe how pulse-coupled neural networks (PCNN) may be used for segmenting digital images.

FIGURE 19. Image distortion. A coordinate transform approximating a 30-degree out-of-plane rotation was used for both test images. Their signatures were still distinct and recognizable as belonging to the correct image classification. (Reprinted with permission from [3]).
FIGURE 20. Signature of "T" image with two blocks interchanged (panels: ORIGINAL and NEW, both at SC = 1, AC = 0). The two lower blocks of the full-scale unrotated "T" image were interchanged, simulating the effect of a shadow moving down the image. The new signature is similar to that of the 7 × 7 block size "T" image and still has an initial peak followed by a valley and then a higher peak. In contrast, the cross image's second peak was lower than its first peak, so this signature would still be classified as a "T" and not a cross. (Reprinted with permission from [3]).
10.1 Modified Pulse-Coupled Neuron
An area is segmented by the PCNN when a linking wave sweeps through it in a time short compared to the overall repetition rate of that area, so the linking activity is the primary process in segmentation. In order to emphasize the linking action, the feeding inputs will be constrained to be small compared to the threshold gain $V_T$. Special attention will be given to the linking strength $\beta$ and the radius $r$ of the linking field, as well. The pulse generator and the dendritic tree are accordingly modified to reflect this emphasis. The number of neurons in the network is equal to the number of pixels in the image to be segmented. For each pixel in the image there is a corresponding neuron. Let $X_j$ and $N_j$ be the $j$th image pixel and its corresponding neuron, respectively. The segmentation model is as follows:

1. The feeding, or primary, input to $N_j$ is the intensity value of $X_j$ or simply $X_j$. There are no leaky integrators in the feeding branch of the dendritic tree. If desired, the average intensity of a local neighborhood centered on $X_j$ may also be used as the feeding input to $N_j$.

2. Each neuron receives a linking input from its neighbors. Let $S_j$ denote the group of neurons that are linked with $N_j$. Usually, a circular linking field of radius $r$ centered on $N_j$ is used: all neurons that are within a distance of $r$ from $N_j$ are linked to $N_j$. Other neurons are not linked to $N_j$. The outputs of all the leaky integrators in the linking branch of the dendritic tree decay at the same rate, as determined by the linking field decay time constant $\alpha_L$. The linking contribution of $N_k$ to $N_j$ is given by equation (1). Usually, the weights $W_{kj}$ are inversely proportional to the distance or the square of the distance between $N_j$ and $N_k$.

3. The feeding input $X_j$ and the linking input $L_j$ are combined via equation (3) to produce the total internal activity $U_j(t)$ for the neuron $N_j$. At present, the value of $\beta$ is the same for all neurons for a given image. However, it may be ultimately desirable to use different values of $\beta$ for different regions, based on the regional intensity distribution. Then $\beta$ can be viewed as an adaptive weight that adjusts to each image region for optimum segmentation.

4. The pulse generator of the neuron consists of a step-function generator and a threshold signal generator. The output of the step-function generator $Y_j(t)$ goes to 1 when the internal activity $U_j(t)$ is greater than the threshold signal $\theta_j(t)$. This charges the threshold according to equation (4). Since $V_T$ is much larger than $U_j(t)$, the output of the neuron changes back to zero. The pulse generator produces a single pulse at its output whenever $U_j(t)$ exceeds $\theta_j(t)$.

FIGURE 21. Effect of combined image changes. The original images were located at coordinates (16,16) with scale factors of unity, unrotated, and with no distortion (RD is the approximate out-of-plane rotation). The altered images were located at (18,18) and (20,14), with SC = 0.64, AC = 45, $I_0$ = 0.5, and RD = 33. The signatures were sufficiently insensitive to the combined changes for the images still to be correctly classified. (Reprinted with permission from [3]).

There are two major differences between this model and the original. The latter has the ability to produce a train of output pulses.
The model used here for segmentation produces only one pulse, which is approximated by a unit impulse function. The second difference is in the recharging of the threshold. Because the internal activity $U_j(t)$ is much smaller than the threshold gain factor $V_T$, the recharging is done by setting the threshold to $V_T$ rather than to $\theta_j(t) + V_T$. If two successive firings of $N_j$ occur at times $t_1$ and $t_2$, then

$$
\theta_j(t) = V_T\, e^{-\alpha_T (t - t_1)}, \qquad t_1 < t < t_2.
\tag{30}
$$

This new threshold mechanism is equivalent to the old one when the input signal level is much smaller than the threshold gain factor, as can be seen by looking at the pulse period $T_j$:

$$
T_j = \frac{1}{\alpha_T}\ln\!\left(1 + \frac{V_T}{X_j}\right) \approx \frac{1}{\alpha_T}\ln\!\left(\frac{V_T}{X_j}\right), \qquad \text{for } X_j \ll V_T.
$$
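A quick numerical comparison shows how close the two period expressions are when $X_j \ll V_T$. The sketch below is illustrative only: the function names and the values of $V_T$, $\alpha_T$, and $X_j$ are assumptions, and the "original" period assumes an unlinked neuron whose threshold equals $X_j$ at the moment of firing.

```python
import math

def period_original(x, v_t, alpha_t):
    """Period when the threshold is recharged to theta + V_T (theta = x at firing)."""
    return math.log(1.0 + v_t / x) / alpha_t

def period_segmentation(x, v_t, alpha_t):
    """Period when the threshold is reset to V_T, as in equation (30)."""
    return math.log(v_t / x) / alpha_t

v_t, alpha_t = 1000.0, 0.1          # illustrative values, not taken from the text
rel_diff = {}
for x in (1.0, 10.0, 100.0):
    t_old = period_original(x, v_t, alpha_t)
    t_new = period_segmentation(x, v_t, alpha_t)
    rel_diff[x] = (t_old - t_new) / t_old
    print(x, t_old, t_new, rel_diff[x])
```

The relative difference grows as $X_j$ approaches $V_T$, which is why the feeding inputs are kept small compared to the threshold gain.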
On the segmentation time scale, neurons corresponding to pixels of each image region are forced to pulse together periodically. The pulse rate of a region is determined by the feeding and linking inputs to its neuron group. Therefore, it is important to understand the mathematics associated with the firing rate of a neuron using the segmentation model approximation of equation (30). Consider first a totally unlinked PCNN. Such a network may be realized by setting the linking strength $\beta$ to zero. The activity internal to $N_j$ is then simply $X_j$. Initially, at time $t = 0$, $\theta_j(0) = 0$ for all $j$. Assuming that $X_j$ is greater than zero, all neurons fire at $t = 0$. From then on, each neuron fires periodically, and the period is determined by the feeding input, $V_T$, and $\alpha_T$. Since $V_T$ and $\alpha_T$ are constants, the period is a function of the intensity of $X_j$. The intensity $I$ and the corresponding period $T(I)$ are related by

$$
T(I) = \frac{1}{\alpha_T}\big(\ln(V_T) - \ln(I)\big).
\tag{31}
$$
For a given $I$, $T(I)$ may be increased by increasing $V_T$ or decreasing $\alpha_T$. It is often convenient to express $T(I)$ as the number of decay time constants. The period in number of decay time constants is

$$
\tau(I) = \ln(V_T) - \ln(I).
\tag{32}
$$

The plot of $\tau(I)$ as a function of $\ln(I)$ is a straight line with slope $-1$ and intercept $\ln(V_T)$. If $\tau(I)$ is known, one can compute $\tau(aI)$, $\tau(I + b)$, and $\tau(aI + b)$:

$$
\tau(aI) = \tau(I) - \ln(a),
\tag{33}
$$

$$
\tau(I + b) = \tau(I) - \ln(1 + b/I),
\tag{34}
$$

$$
\tau(aI + b) = \tau(I) - \ln(a + b/I).
\tag{35}
$$
It is interesting to note that $\tau(I) - \tau(aI + b)$ is independent of $V_T$. Also, $\tau(I) - \tau(aI)$ is independent of both $V_T$ and $I$. The approximation of equation (30) makes the system less dependent on the prior activity of the threshold, and its behavior is more strongly governed by the linking. Now consider the effect of linking. Let $t_1, t_2, t_3, \ldots$ mark the times at which the neuron $N_j$ fires. For $t_i < t < t_{i+1}$, let $I$ and $L_j(t)$ be the feeding and linking inputs to the neuron, respectively. The linking input increases the internal activity of the neuron from $I$ to $I(1 + \beta L_j(t))$. Accordingly, in the interval $t_{i+1} - t_i$ the period reduces from $T(I)$ to $T'(I)$:

$$
T'(I) = T(I) - \frac{1}{\alpha_T}\ln\big(1 + \beta L_j(t_{i+1})\big).
\tag{36}
$$

If the decay rate of $L_j(t)$ is large and much greater than the decay rate of $\theta_j(t)$, the following statements can be made:

1. $L_j(t)$ may be approximated as an impulse train, whose magnitudes are proportional to the number of linking input pulses at time $t$.

2. If a subset of neurons belonging to $S_j$ fire at $t_j$ and fail to capture $N_j$ at that time, then the subset will not capture $N_j$ later in the interval $t_i < t_j < t_{i+1}$. In other words, there is no linking decay tail, and the receiving neuron's output is unaltered if the linking pulse is outside the capture zone (equation (14)).
10.2 Image Segmentation
The image segmentation approach using pulse-coupled neural networks is described in this section. Figure 22 shows an image consisting of two regions $R_1$ and $R_2$. Spatially connected object pixels form $R_1$. Similarly, spatially connected background pixels form $R_2$. Perfect segmentation is possible if there exists a linking radius $r$ and a linking coefficient $\beta$ that will force all neurons belonging to $R_1$ to pulse together periodically with period $T_1$ and all neurons belonging to $R_2$ to pulse together periodically with period $T_2$. Of course, $T_1$ is not equal to $T_2$. If all pixels of $R_1$ are of intensity $I_1$ and all pixels of $R_2$ are of intensity $I_2$, the segmentation problem becomes trivial. A pulse-coupled neural network with $\beta$ equal to zero will do the job. Neurons of $R_1$ will fire together at times $t = nT(I_1)$, where $n$ is an integer greater than or equal to zero.

In practice, image segmentation is not this simple. Images that consist of two regions will have bimodal histograms. Assume that $[I_1, I_2]$ and $[I_3, I_4]$ are the intensity ranges of the background ($R_2$) and object ($R_1$) pixels, respectively. If $I_3 > I_2$, simple thresholding can be used to achieve perfect segmentation. When $I_3 < I_2$, thresholding techniques do not produce a perfect result. Optimal thresholding techniques minimize or attempt to minimize the error. The error may be defined as the number of pixels incorrectly classified during segmentation. The presence of linking inputs makes pulse-coupled neural networks fairly insensitive to noise and minor local intensity variations. As a result, the PCNN is expected to produce better segmentation results.

FIGURE 22. An example of a perfect image segmentation: (a) input image; (b) segmented object region; (c) segmented background region. (Reprinted with permission from [27]. © IEEE 1995.)

Consider the segmentation of the digital image in Figure 22. Assume $I_2 > I_3$ and $I_1 > 0$. At $t = 0$, all neurons fire and charge the outputs of all the threshold units to $V_T$. The group of neurons corresponding to object pixels of intensity $I_4$ fire first at time $t_1 = T(I_4)$. This type of firing, which is mainly due to the feeding input, is called natural firing. The natural firing at $t_1$ leads to the following:

1. Object neurons for which the following inequality is true are captured at $t = t_1$:

$$
X_j\big(1 + \beta L_j(t_1)\big) > I_4.
\tag{37}
$$

Subscript $j$ is used to represent object pixels and neurons.

2. Background neurons for which the following inequality is not true are also captured at $t_1$:

$$
X_k\big(1 + \beta L_k(t_1)\big) < I_4.
\tag{38}
$$

Subscript $k$ is used to represent background pixels and neurons.

3. Object pixels not captured at $t_1$ fire in several groups after $t_1$. The number of groups and the exact time at which each group fires are determined by the intensity distribution of $R_1$, $\beta$, and $r$.

4. Neurons corresponding to background pixels of intensity $I_2$, which are not captured so far, fire at $t_2 = T(I_2)$. This primary firing has no effect on neurons that have already fired ($V_T$ is large compared to the image intensity). However, all background neurons that are in the capture zone of this primary firing will fire at $t_2$:

$$
X_k\big(1 + \beta L_k(t_2)\big) > I_2.
\tag{39}
$$
Other background neurons organize into several groups and fire after $t_2$.

If inequality (37) is true for all $N_j$ (object neurons), and inequalities (38) and (39) are true for all $N_k$ (background neurons), the input image is perfectly segmented even when $I_2 > I_3$. The value of the linking input to $N_j$, $L_j(t_1)$, depends on the composition of $S_j$ and the number of fired neurons at $t_1$. For pixels like $P_1$, where all members of $S_j$ are object neurons, $L_j(t_1)$ is relatively large. For pixels like $P_4$, where $S_j$ consists of mostly background pixels, $L_j(t_1)$ is small. Let $L_{\min 1} = \min L_j(t_1)$, $L_{\min 2} = \min L_k(t_2)$, and $L_{\max 2} = \max L_k(t_1)$. The values of $L_{\min 1}$, $L_{\min 2}$, and $L_{\max 2}$ depend on $r$ and on the object-background boundary geometry. All three increase in value as $r$ increases. However, the rate of increase varies depending on the boundary geometry. Perfect segmentation of the input image is possible if there exist $\beta$ and $r$ such that the following inequalities are true:

$$
I_3\big(1 + \beta L_{\min 1}(t_1)\big) > I_4,
\tag{40}
$$

$$
I_2\big(1 + \beta L_{\max 2}(t_1)\big) < I_4,
\tag{41}
$$

$$
I_1\big(1 + \beta L_{\min 2}(t_2)\big) > I_2.
\tag{42}
$$

The above conditions when satisfied guarantee a perfect result for the worst case.
However, the solution may not be unique, and perfect segmentation is not always possible. Inequality (40), when not true, leads to the fragmentation of $R_1$. Similarly, if inequality (42) is not true, $R_2$ gets fragmented. Some background neurons (perhaps those near the object boundary) fire with object neurons, making $R_1$ look larger than its actual size when inequality (41) is not true. A challenge is to find optimal parameters $\beta^*$ and $r^*$ that minimize the error. The determination of $\beta^*$ and $r^*$ requires adaptation and is not addressed in this chapter.
10.3 Segmentation Results
A pulse-coupled network was simulated on a SUN workstation. A number of real and artificial images were used. The study focused on the effects of intensity variation within regions, extent of intensity overlap, noise and smoothing, and boundary geometry. Each artificial test image, an array of size 64 × 64, consisted of two regions, an object and a background. The object was a 32 × 32 subimage located at the center of the image. The object's intensity range was $[I_3, I_4]$. The remaining pixels of the image formed the background, and its intensity range was $[I_1, I_2]$. The object intensity range overlapped the background intensity range: $I_4 > I_2 > I_3 > I_1$. Since the object was rectangular, the boundary geometry was simple to handle. For $r = 1$ only four pixels (top, bottom, left, right) were in the linking field. It can be shown for that case that $L_{\min 1} = 2$, $L_{\min 2} = 3$, and $L_{\max 2} = 1$. Perfect segmentation is possible if $\beta$ is in the range $[\beta_1, \beta_2]$, where

$$
\beta_1 = \max\big[(I_4/I_3 - 1)/2,\ (I_2/I_1 - 1)/3\big],
\tag{43}
$$

$$
\beta_2 = (I_4/I_2 - 1).
\tag{44}
$$

If $\beta_2$ is not greater than $\beta_1$, then perfect segmentation is not possible. Note that the solution range of $\beta$ changes with $r$. A number of artificial images were created by varying the object and background intensity ranges and the extent of overlap. Figure 22(a) shows an input for which $I_1 = 100$, $I_2 = 175$, $I_3 = 150$, and $I_4 = 250$. From equations (43) and (44) the solution range for $\beta$ is $[1/3, 3/7]$. The image was segmented using $r = 1$ and $\beta = 0.35$. The segmented image as determined by the synchronous firing of neurons is shown in Figures 22(b) and 22(c). The PCNN gave a perfect result because a solution range for $\beta$ existed. If the intensity distribution of the image is such that $\beta_1$ is greater than or equal to $\beta_2$, a perfect segmentation is not possible. Then the best $\beta$ can be determined by trial and error. The PCNN was tested using low-resolution TV and infrared (IR) images of tanks and helicopters for this case. Each image consisted of one target in a fairly noisy background. The network successfully segmented each image into background and target.
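The solution range quoted for the image of Figure 22(a) can be reproduced from equations (43) and (44) with exact fractions. This is a small sketch for checking the arithmetic; the function name `beta_range` and its keyword defaults are assumptions that encode the $r = 1$ values $L_{\min 1} = 2$, $L_{\min 2} = 3$, and $L_{\max 2} = 1$.

```python
from fractions import Fraction

def beta_range(i1, i2, i3, i4, l_min1=2, l_min2=3, l_max2=1):
    """Bounds on beta from inequalities (40)-(42); defaults are the r = 1 case."""
    lower = max((Fraction(i4, i3) - 1) / l_min1,      # from (40)
                (Fraction(i2, i1) - 1) / l_min2)      # from (42)
    upper = (Fraction(i4, i2) - 1) / l_max2           # from (41)
    return lower, upper

lower, upper = beta_range(100, 175, 150, 250)
print(lower, upper)   # 1/3 3/7

beta = Fraction(7, 20)    # the value 0.35 used in the text
conditions = (150 * (1 + 2 * beta) > 250,    # (40)
              175 * (1 + 1 * beta) < 250,    # (41)
              100 * (1 + 3 * beta) > 175)    # (42)
print(conditions)
```

With other intensity ranges the function may return a lower bound that is not below the upper bound, which corresponds to the case in which no perfect segmentation exists for $r = 1$.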
It is obvious that wide and excessively overlapping intensity ranges have an adverse effect on image segmentation. The segmentation error can be greatly reduced by shrinking the object and background intensity ranges and also by reducing the extent of overlap in the intensity ranges. A reduction in the intensity range reduces the value of $\beta_1$. Now more image pixels satisfy the desired inequalities, increasing the number of pixels correctly classified. If the value of $\beta_2$ then exceeds the value of $\beta_1$, a perfect segmentation is possible. When the spread is due to noise, a smoothing algorithm can be used. Neighborhood averaging smooths regions but blurs edges. A median filter suppresses random noise and also preserves edges. The PCNN is also capable of smoothing images without blurring the edges. The technique is to run the net and adjust the feeding input intensity of the pixels based on the local neuronal firing pattern. If a neuron Nj fires and a majority of its eight nearest neighbors do not fire, then the intensity is changed as follows:
1. If five or more neighbors are brighter than Xj, c is added to the value of Xj, where c is a small integer constant.
2. If five or more neighbors are darker than Xj, c is subtracted from the value of Xj.
3. If five or more neighbors are of the same intensity as Xj, the threshold signal of Xj is set to the threshold value of its neighbors. This compensates for the phase shift.
A 128 x 128 image of Bambi, shown in Figure 23(a), is smoothed using the neighborhood average, a median filter, and the PCNN algorithm. The smoothed images are shown in Figures 23(b), 23(c), and 23(d). The PCNN filtered the noise without affecting the edges. In comparison, the neighborhood average blurred the edges. The median filter broke some edges and merged parallel lines running close to each other by filling in the dark spaces that existed between them. The PCNN performed better than the other two methods.
Theoretical results and simulations show that pulse-coupled neural networks can be used for segmenting digital images. The possibility of obtaining a perfect result even when the intensity ranges substantially overlap is a new and exciting result. The net can also be used to filter random noise without blurring edges. Since the network is compatible with electronic and optical hardware implementation techniques, it is a strong candidate for real-time image processing.
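The three intensity-adjustment rules of the PCNN smoothing technique can be sketched as a single pass over the image. This is an illustration under stated assumptions rather than the authors' implementation: the function name is hypothetical, "a majority" is read as five or more of the eight neighbors, border pixels are skipped, the rules are tested in the order listed, and for rule 3 the threshold is set to the mean threshold of the equal-intensity neighbors (the text does not say which neighbor's value is used).

```python
def pcnn_smoothing_pass(x, fired, theta, c=1):
    """Apply the three adjustment rules once and return new (x, theta).

    x     : list of lists of integer feeding intensities
    fired : list of lists of booleans, True where a neuron fired this step
    theta : list of lists of threshold values
    c     : small integer step (the value 1 is an assumption)
    """
    rows, cols = len(x), len(x[0])
    x_new = [row[:] for row in x]
    theta_new = [row[:] for row in theta]
    offsets = [(dr, ds) for dr in (-1, 0, 1) for ds in (-1, 0, 1)
               if (dr, ds) != (0, 0)]
    for r in range(1, rows - 1):            # borders skipped for simplicity
        for s in range(1, cols - 1):
            if not fired[r][s]:
                continue
            nbrs = [(r + dr, s + ds) for dr, ds in offsets]
            silent = sum(1 for a, b in nbrs if not fired[a][b])
            if silent < 5:                  # most neighbors fired as well
                continue
            brighter = sum(1 for a, b in nbrs if x[a][b] > x[r][s])
            darker = sum(1 for a, b in nbrs if x[a][b] < x[r][s])
            equal = [(a, b) for a, b in nbrs if x[a][b] == x[r][s]]
            if brighter >= 5:               # rule 1
                x_new[r][s] = x[r][s] + c
            elif darker >= 5:               # rule 2
                x_new[r][s] = x[r][s] - c
            elif len(equal) >= 5:           # rule 3 (mean is an assumption)
                theta_new[r][s] = sum(theta[a][b] for a, b in equal) / len(equal)
    return x_new, theta_new


# A uniform patch with one dark noise pixel that fires alone.
image = [[100] * 5 for _ in range(5)]
image[2][2] = 90
fired = [[False] * 5 for _ in range(5)]
fired[2][2] = True
theta = [[1.0] * 5 for _ in range(5)]
smoothed, new_theta = pcnn_smoothing_pass(image, fired, theta)
print(smoothed[2][2])  # 91
```

Repeating the pass while the net runs would move the isolated pixel toward its surroundings by c per firing, whereas pixels that fire together with most of their neighbors are left unchanged.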
FIGURE 23. An example of image smoothing: (a) input image; (b) image after smoothing with PCNN algorithm; (c) image after neighborhood smoothing; (d) image after median filtering. (Reprinted with permission from [27]. © IEEE 1995.)
11
Adaptation
The Eckhorn linking field model contains synaptic weights but does not require a specific learning law. Any learning law, or none, can be used. (The Hebbian decay learning law is too rudimentary and is not considered here. It fails to retain the adapted weights after learning is complete. More realistic models such as the Grossberg competitive law [14] or a saturable law [19], either associative or causal, are more useful.) Any synaptic weight in the linking field model can be made adaptive, but for simplicity only the feeding field weights will be considered. The linking field weights will be fixed as the inverse square pattern in order to retain the invariance
properties discussed earlier. Suppose a wave of pulses sweeps over a region in which the feeding weights are adaptive (Figure 24). As the wave passes over a given cell, the cell is turned on and receives feeding input pulses. These weights adapt, memorizing the local pattern of the wave crest around the cell. The cells that had been active just prior to this time have been reset, and they are turned off. But the leaky integrator synapses connecting them to the currently on cells still have a residual signal on them, and those connections adapt to that strength. Likewise, the connections from the group of cells that had been active still earlier have an even more decayed signal strength, and the active cell will adapt to them as well. Each time the linking wave sweeps over the cell in question, more adaptation occurs. Whenever it is on, it sees the same pattern of active cells and decayed signals from the previously active cells due to the periodic nature of the established wave pattern. After adaptation is complete, suppose that a cell is stimulated and fires. It recalls the wave-crest pattern in its local neighborhood and also sends a pulse to the cells that had fired next as the wave passed over them after leaving the cell. These connections were adapted during training. The cell forward-biases them through the adapted feeding connections and further gives them an additional input through the linking field channel. This can cause them to fire next, just as the original linking wave had done. The process continues, each wave crest forward-biasing the next, and the slab not only recalls the wave pattern but also sets it in motion again [7]. A time average of the slab's pulse activity then approximately recovers the original spatial distribution that generated the linking wave. The waves are binary fringe-like patterns very similar in appearance to holographic fringes.
This suggests that it may be possible to store many wave patterns in an adaptive slab in the same sense that many holograms can be superimposed on a single photographic plate. It may be possible to have a slab with relatively few adaptive interconnects and to use the linking modulation to fill in the patterns when they are recalled. Figure 25 shows some wave patterns generated by a light square (lower left) and a light spot (lower right) on a light background. The network stores and recalls the traveling waves. It can also do the same for sequences of images. Use a distribution of feeding time constants such that some of the feeding synapses have very long decay times. Present one image of a sequence and allow its linking waves to become established and memorized, and then do the same for the next image of the sequence. Some of the synaptic connections will overlap the images in time. Now when the first image is recalled, those connections will also stimulate the wave pattern of the next image, and it will be recalled in turn. This is the mechanism used in the time sequence memory model of Reiss and Taylor [4], except that pulses are used here. In that model an intermediate slab with leaky integrator decay characteristics was used to provide the
overlap in time, and then adaptively associated with the current input image. Then when the first few images of the sequence were applied to the adapted system, they formed the decaying time overlap image, which in turn recalled the next image in the sequence. It was then fed back to the intermediate slab to make the next overlap, and so on, until the entire sequence had been recalled.
Consider a slab on which several wave patterns have been adapted, either superimposed or in different locations on the slab. Is it possible to selectively recall a given pattern using only its time signal as input? This would mean that the slab could access any memory in parallel. Suppose the time signal of one of the encoded patterns is globally broadcast to the entire slab. It will stimulate all the patterns to attempt to regenerate their waves. As they start up, those that have different time signals will interfere with the incoming signal. The pattern with the same time signal will also interfere, since it will not generally be in phase with the incoming signal. None of the patterns will be able to establish themselves. They will continue to compete for resonance with the input. Eventually, the pattern with the matching signal may start up in the right phase. It will establish itself at the expense of the others because it will be locked in with the incoming signal and will proceed to generate its traveling wave pattern. A time average of the slab pulse activity then recovers the original input scene. This argument shows how a pulse-coupled adaptive neural network can in principle achieve parallel memory access. It is recognized that it must be verified before it can be claimed to be a viable mechanism for global recall, but it is a specific possibility.
FIGURE 24. Adaptation. (a) A linking wave sweeps over a cell, turning it on. Its feeding synapses adapt to the current wave pattern and also to the decayed inputs from previously on cells whose signal is still present on the leaky integrator synapses connecting them to the on cell. (b) After adaptation the cell fires. It recalls the wave-crest pattern and forward-biases the cells that need to fire next in order to recreate the wave motion. It also sends a linking modulation to them. The wave crest that should fire next can be stimulated in preference to the one that fired previously, and the wave motion as well as the wave-crest shape can be regenerated. (Reprinted with permission from [26].)
FIGURE 25. Linking waves from an optical hybrid laboratory demonstration system. The underlying image is a light square (lower left) and a light spot (lower right) on a light background. Coherent, locally periodic linking waves are generated as the system attempts to pulse at a frequency driven by the input intensity at each pixel while also attempting to obey the linking requirement. To satisfy both requirements the waves evolve and bifurcate into complex fringe-like patterns. (Reprinted with permission from [2].)
12
Time to Space
The pulse-coupled neural network generates a time signal that encodes a spatial distribution. Is it possible to make a network that forms a spatial distribution from a time signal? If so, then the cycle would be complete: space to time to space. The time signal is periodic and coherent. The intensity of the input maps to frequency in the time signal, while the geometrical relationships are encoded by the linking into phases in the time signal. The desired mapping should have a frequency coordinate and a phase coordinate for each amplitude component. Wavelet transforms [20], [21], [22] retain both phase and frequency information, so these transforms may be appropriate for the pulse-coupled time signals. Wavelet transforms can be done optically [23]. A way to do it with a third-order linking field is discussed below. It is not required that the resulting spatial distribution be identical to the original one that generated the time signal, but rather that it be reasonably object-specific. Then the time-to-space transform becomes the second half of a spatial remapping transform. The resulting spatial distribution can in turn make another time signal, and so on, so that an input is transmitted from one place to another as a time signal and at each place is operated on by spatial interactions. This is a parallel processor in one sense, but in another sense, it is a serial processor like a digital computer. It has the advantages of the parallel processing and adaptation inherent in a neural network, yet it can perform the sequential operations necessary for causal logic operations. It does not need predefined features. It generates its own syntactical features. These are very insensitive to geometrical distortions, yet they can be object-specific. The key is weak linking. In this linking regime it is possible to make periodic, coherent, object-specific time signals, and from them the rest follows.
12.1
A Model for Time-to-Space Mapping
This model uses a third-order pulse-coupled neural network. It consists of two slabs P and Q, as shown in Figure 26(a). The P-slab generates a spatial signal distribution of frequencies in the vertical direction and phases in the horizontal direction. The Q-slab receives a globally broadcast time signal at every cell and a one-to-one input from the P-slab. These are multiplied by a linking modulation in front of each Q-slab model neuron, making it a third-order node (Figure 26(b)). The product of the global time signal input and the P-slab signal input constitutes the feeding input to the Q-slab cell. The P-slab consists of rows of horizontally linked cells
with linking only in the forward direction, as shown in Figure 26(c). When the leftmost cell in each row fires, a linking wave sweeps across its row. The length of the row is such that the wave reaches the other side at the same time that the leftmost cell fires again. The rows have a feeding input $I$ that increases with increasing row number. The result is that the P-slab sustains horizontally propagating waves along each row that have a repetition rate that increases with increasing row number. Each row represents a different frequency, and the distance along each row represents the phase at that frequency. Consider a time signal input $S(t)$ globally broadcast to the Q-slab. Suppose one of its frequency components $\nu$ has phase $\phi$. Then it will be coincident into the Q-slab cell with the P-slab's nonzero input on the $\nu$th row and at the $\phi$th distance along that row, and the linking product will be nonzero for that Q-slab cell. This construction satisfies the basic requirements for converting a time signal to a spatial distribution.
FIGURE 26. A time-to-space architecture: (a) time-to-space network architecture; (b) Q-slab third-order cell; (c) P-slab second-order cell. A two-slab system is used. The P-slab has one-way linking across each row. Just as the last cell in a row fires, the first cell fires again. The length of the row and the feeding input of the row are chosen such that each row has a repetition rate that increases with row number. The P-slab cells are second-order neurons. The Q-slab neurons are third-order cells. A time signal $S(t)$ is globally broadcast to the Q-slab and multiplied by the input from the P-slab at each point. A pulse in the time signal with a given frequency and phase will be coincident with one of the pulses from the P-slab at a location corresponding to its frequency and phase, giving a nonzero feeding input to the Q-slab cell at that location. This produces a distribution on the Q-slab whose geometry is a function of the frequency and phase content of the time signal. (Reprinted with permission from [26].)
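The coincidence mechanism of the two-slab model can be illustrated with a small discrete-time sketch. It is a toy model under several assumptions that are not in the text: time advances in integer steps, each P-slab row is as many cells long as its period so that the one-way wave wraps around exactly when the leftmost cell fires again, pulses are unit impulses, and the Q-slab linking modulation is omitted so that each Q-slab cell simply accumulates the product of the broadcast signal and its P-slab input. The function names and the row periods 11, 7, and 5 are hypothetical.

```python
def p_slab_positions(periods, t):
    """Position of the traveling pulse in each P-slab row at time step t."""
    return [t % period for period in periods]


def time_to_space(signal, periods):
    """Accumulate coincidences between a broadcast signal and the P-slab.

    signal  : list of 0/1 pulse values, one per time step
    periods : row periods in time steps, listed in decreasing order so
              that the repetition rate increases with row number
    Returns one list of accumulated counts per row.
    """
    q_slab = [[0] * period for period in periods]
    for t, s in enumerate(signal):
        if s == 0:
            continue
        for row, pos in enumerate(p_slab_positions(periods, t)):
            q_slab[row][pos] += s   # product of S(t) and the P-slab pulse
    return q_slab


periods = [11, 7, 5]
# A periodic time signal with period 7 and phase 3.
signal = [1 if t % 7 == 3 else 0 for t in range(385)]
q = time_to_space(signal, periods)
peak = max((count, row, pos)
           for row, counts in enumerate(q)
           for pos, count in enumerate(counts))
print(peak)  # (55, 1, 3)
```

All 55 pulses land on the cell in the period-7 row at distance 3, while in the other rows the same pulses are spread evenly over every position. The periods were chosen to be pairwise coprime; a row whose period divides the signal period would also collect every pulse at a single cell.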
13
Implementations
The nonadaptive pulse-coupled neural network has been implemented as a hybrid optical laboratory demonstration system [2], [7] and as a linear eight-element electronic array. The optical system used a liquid crystal television (LCTV) spatial light modulator from a commercially available projection television to perform the linking modulation. The scene was reimaged to an intermediate focal plane and then sent through the LCTV located slightly beyond the focus so that it was out of focus. This allowed each pixel of the LCTV to modulate a small local area of the input image, effectively forming the linking receptive field by the defocusing circle. The input image was then reimaged into a standard video camera and its signal sent to a framegrabber in a 386 PC. The signal was compared to the current value of the computed threshold in the computer, and an output array was formed that contained a one or a zero, depending on whether or not the input was below the threshold. This array represented the pulses. It was used to update the threshold array, recharging at those pixels that had a pulse output, and then sent through the framegrabber back to the LCTV. A bright pixel there indicated that the neuron for that pixel had fired, and it multiplied the incoming scene to perform the linking modulation for the next processing cycle. Each cycle took about ten seconds, which gave time to examine in detail the traveling linking wave patterns that formed. The electronic chip array had eight neurons in a linear array. Each was linked to its two nearest neighbors and had a feeding input as well. Four arrays were built. Two were entirely electronic, and two had photodetectors at each cell for the feeding inputs and ferroelectric spatial light modulator pads for outputs. Preliminary tests of the all-electronic arrays showed a
pulse output range from 2 Hz to 1 MHz and that the nearest-neighbor linking was active. Further tests are in progress at this time. The optical implementation is attractive in that it allows access to the linking wave patterns for study, but it suffers from the limit of video frame rates. The best that it can do is 30 Hz for the maximum pulse frequency. On the other hand, electronic two-dimensional array architectures are entirely within current technology. The linking field receptive weight pattern can be approximated by a resistive plane or grid that is common to all the cells. It can also have local 3 x 3 linking fields in addition to the larger resistive plane field. Electronic arrays have the major advantage of high pulse rates, at or above the 1 MHz rate already demonstrated. The time signal is the sum of all the pulse activity, so the output can be a single wire. The linking modulation is straightforward, and the pulse generator architecture is electronically simple.
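The processing cycle of the hybrid optical system (modulate the scene with the previous pulses, compare with the threshold, form the pulse array, recharge the threshold) can be mimicked in software. The sketch below is not the laboratory code. It assumes a multiplicative modulation of the form scene x (1 + beta x link), a 3 x 3 neighborhood standing in for the defocusing circle, an exponentially decaying threshold, and a pulse wherever the modulated input exceeds the threshold; the function name and all parameter values are hypothetical.

```python
def pcnn_cycle(scene, theta, pulses, beta=0.2, decay=0.9, recharge=50.0):
    """Run one processing cycle and return (new_pulses, new_theta).

    scene  : list of lists of input intensities (feeding input)
    theta  : list of lists of current threshold values
    pulses : list of lists of 0/1 pulses from the previous cycle
    """
    rows, cols = len(scene), len(scene[0])
    new_pulses = [[0] * cols for _ in range(rows)]
    new_theta = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for s in range(cols):
            # Linking input: previous pulses in the 3 x 3 neighborhood.
            link = 0
            for dr in (-1, 0, 1):
                for ds in (-1, 0, 1):
                    a, b = r + dr, s + ds
                    if (dr, ds) != (0, 0) and 0 <= a < rows and 0 <= b < cols:
                        link += pulses[a][b]
            # Linking modulation multiplies the incoming scene.
            u = scene[r][s] * (1.0 + beta * link)
            fired = 1 if u > theta[r][s] else 0
            new_pulses[r][s] = fired
            # Threshold decays and is recharged where a pulse occurred.
            new_theta[r][s] = decay * theta[r][s] + recharge * fired
    return new_pulses, new_theta


scene = [[10.0] * 3 for _ in range(3)]
theta = [[5.0] * 3 for _ in range(3)]
pulses = [[0] * 3 for _ in range(3)]
pulses, theta = pcnn_cycle(scene, theta, pulses)   # every cell fires
first = [row[:] for row in pulses]
pulses, theta = pcnn_cycle(scene, theta, pulses)   # recharged thresholds block firing
print(first[1][1], pulses[1][1])  # 1 0
```

In the optical system the neighborhood sum is performed by the defocused LCTV rather than by a loop, and only the threshold comparison and update run on the computer.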
14
Integration into Systems
Two key features of the pulse-coupled neural network are that it does not require training and that it can operate very fast. This makes it suitable as a preprocessor because it can decrease the temporal complexity of many problems due to its high-speed parallel operation while producing an invariant output suitable for use by an adaptive classifier or by sequential iconic logical processors. The retina is an example of a preprocessor. It is nonadaptive and so can operate on any visual image. It is a hard-wired processor with parallel, high-speed action. It does immense bandwidth reduction, edge enhancement, noise reduction, and spectral decomposition and transmits the preprocessed results, all in real time. There is some evidence that the human vision preprocessor has further properties in terms of ability to tolerate significant distortions. For instance, a 1993 special issue of Science on human vision [24] states: "Recognition of objects from their visual images is a key function of the primate brain. This recognition is not a template matching between the input image and stored images like the vision in lower animals but is a flexible process in which considerable changes in images, resulting from different illumination, viewing angle, and articulation of the object, can be tolerated." If the retina does in fact produce the invariant time signals of the pulse-coupled net, a view supported by the simple symmetries in the nonadaptive receptive fields being the cause of the invariances, then the "tolerance" is in the preprocessor itself. When viewed as an image preprocessor, the pulse-coupled neural network bridges the most fundamental division in pattern recognition: the division between the syntactical and the statistical approaches. In statistical pattern recognition, the properties (features) of the
scene are measured and used to form a multidimensional feature vector in an $N$-dimensional hyperspace. Each set of measurements forms a vector in the space. If the features form groups (i.e., if they are "good" features), then surfaces in the hyperspace can be found that "optimally" separate the groups. Then a given input feature vector can be classified as belonging to one of the groups. The problem is that the features must be correctly defined, and this has been a major problem in statistical pattern recognition. Syntactical pattern recognition goes beyond statistical pattern recognition by considering, and indeed emphasizing, the relationships among features. Since the number of possible relationships is exponential in $N$, this is an incomparably richer, more powerful method. It is also much harder: the number of groups is also exponential! But if the geometrical relationships are made independent of the possible geometrical distortions, then the syntactical approach yields a natural grouping method in which the large number of possibilities becomes an advantage rather than a drawback. The pulse-coupled neural nets provide the invariances essential for syntactical pattern recognition. They do this in a surprising way. The features they use are not features of the input pattern. Rather, they are features of the pulse code generated by the net when the image is presented to it. The simulations using a cross and a "T" shape illustrate this. The features are the pulse phase patterns, and they are syntactical: "Where does the bar cross the post?" The image itself no longer is used, only the syntactically derived periodic time signal. This serves as the input to a statistical pattern classifier, and the pattern it classifies is the phase structure of the time signal, not the image pattern. When a time-to-space mapping is also possible, the pulse-coupled neural network becomes more than a preprocessor.
A spatial input $I_0$ is first transformed into a time signal and then transmitted to another location where it is retransformed into a spatial distribution $S_0$ again. The new pattern will not necessarily be the same as the original, but since the time signal had invariances encoded into it, the new pattern will also be invariant against the same distortions and so will be of reduced dimensionality in the sense of information content. The information that is lost is information about the distortions. The syntactical information about the geometrical input pattern is preserved, so the new pattern is an idealization or generalization of the original. Now suppose the pattern is again transformed into another time signal, transmitted, and made into a second spatial pattern $S_1$. It will preserve the syntactical information of the preceding pattern. As an example, consider the information about the scale of an input image. The first transform pair $(I_0, S_0)$ is scale invariant with respect to the pulse phase pattern, but the amplitude of the time signal connecting them was proportional to the area covered by the image $I_0$, and so the amplitude of $S_0$ still has an area dependence. However, the second transform $(S_0, S_1)$ will be invariant with respect to amplitude, as shown in the discussions
earlier, so $S_1$ will not depend on the original image area either by phase structure or by amplitude and will be completely independent of any scale effect in the original image. Each successive transform $(S_n, S_{n+1})$ results in a more invariant pattern. If the time-to-space transform is poorly chosen, this could result in a final pattern that is invariant with respect to everything, including syntax. This is not desirable! On the other hand, it may be possible to choose a time-to-space transform that becomes stable yet still contains the fundamental syntactical information of the original image $I_0$. If so, then in the asymptotic limit the transform pair will become idempotent: $S_N = S_{N+1}$. This will be a point attractor, and all the distortions of $I_0$ that map to it will define its basin of attraction. It will be an idealized, or platonic, icon that represents the object itself rather than a view of the object. The existence of platonic icons is shown by this argument to be critically dependent on the choice of the time-to-space mapping. The repeated transformation process, however, will always make the resultant icon more and more invariant, and since it will always be an icon, there must always be at least some syntactical information in it. Now, whenever there is a spatial distribution in a net, it is possible to perform spatial operations on it via weighted receptive fields. Thus the repeated iconic transforms can undergo processing each time they are mapped to a spatial distribution, making the pulse-coupled neural net into a full processor rather than a preprocessor. Further, since each iconic transform is sequential in time, the system possesses causality. This leads to the view of a powerful processing system combining the capabilities of parallel and serial processing techniques, where information is transmitted as time signals and operated on as spatial distributions.
15
Concluding Remarks
This work begins with the Eckhorn linking field model and then investigates the new regime of weak linking to find the existence of time signals that encode spatial distributions in their phase structure. The signals are generally periodic. They are a signature for the image that generated them. They are a syntactical signature, made by the network itself, and its temporal features are features that are about the image, not in the image. The pulse-coupled nets are a general higher-order network that provides an object-specific and reasonably invariant time signature for spatial input distributions. Multiple time scales exist, and for each time scale at which a signature exists, the next time scale permits segmentation of the part of the image generating that signature. Conditions for perfect segmentation are given and verified through simulations. The time signal may represent a possible means of communication within the brain, a way to transmit and
receive information. It is analogous to the characteristic acoustic tone of a given musical instrument, in a sense bestowing a different "sound" on each distinct two-dimensional input image. The musical analogy is reinforced by the observation that pulse frequency harmonics are more stable against noise when linked; i.e., the "harmony of thought" may be literally true [25]. The time signal can be transformed back into spatial distributions and operations performed on it, and these in turn generate another time signal to be sent to other processing areas of the brain. It reduces the basic problem of image understanding to that of correlation on an invariant time signal. Much research remains to be done, but the pulse-coupled model and its time signals are a significant step forward in the understanding of the brain.
16
REFERENCES
[1] R. Eckhorn, H. J. Reitboeck, M. Arndt, and P. Dicke, "Feature Linking via Synchronization Among Distributed Assemblies: Simulations of Results from Cat Cortex," Neural Computation 2, 293-307 (1990).
[2] J. L. Johnson and D. Ritter, "Observation of Periodic Waves in a Pulse-Coupled Neural Network," Optics Letters 18 (15), 1253-1255 (1993).
[3] J. L. Johnson, "Pulse-Coupled Neural Nets: Translation, Rotation, Scale, Distortion, and Intensity Signal Invariance for Images," Applied Optics 33 (26), 6239-6253 (1994).
[4] M. Reiss and J. G. Taylor, "Storing Temporal Sequences," Neural Networks 4, 773-787 (1991).
[5] R. Eckhorn, R. Bauer, M. Rosch, W. Jordan, W. Kruse, and M. Munk, "Functionally Related Modules of Cat Visual Cortex Show Stimulus-Evoked Coherent Oscillations: A Multiple Electrode Study," Invest. Ophthalmol. Visual Sci. 29 (12), 331 (1988).
[6] R. Eckhorn, "Stimulus-Evoked Synchronizations in the Visual Cortex: Linking of Local Features into Global Figures?" In Neural Cooperativity, J. Kruger (editor), Springer Series in Brain Dynamics, Springer-Verlag, Berlin (1989).
[7] J. L. Johnson, "Waves in Pulse-Coupled Neural Networks," Proc. World Congress on Neural Networks, Vol. 4, p. IV-299. INNS Press (1993).
[8] R. Eckhorn, H. J. Reitboeck, M. Arndt, and P. Dicke, "A Neural Network for Feature Linking via Synchronous Activity: Results from Cat Visual Cortex and from Simulations." In Models of Brain Function, R. M. J. Cotterill (editor), pp. 255-272. Cambridge University Press (1989).
[9] R. Eckhorn and T. Schanze, "Possible Neural Mechanisms of Feature Linking in the Visual System: Stimulus-Locked and Stimulus-Induced Synchronizations." In Self-Organization, Emerging Properties and Learning, A. Babloyantz (editor), Plenum Press, New York (in press).
[10] P. W. Dicke, "Simulation Dynamischer Merkmalskopplungen in Einem Neuronalen Netzwerkmodell," Inaugural Dissertation. Biophysics Department, Philipps University, Renthof 7, D-3550 Marburg (1992).
[11] A. S. French and R. B. Stein, "A Flexible Neural Analog Using Integrated Circuits," IEEE Trans. Biomed. Eng. BME-17, 248-253 (1970).
[12] C. Giles and T. Maxwell, "Learning, Invariance, and Generalization in High-Order Neural Networks," Applied Optics 26 (23), 4972-4978 (1987).
[13] C. Giles, C. Miller, D. Chen, H. Chen, G. Sun, and Y. Lee, "Learning and Extracting Finite State Automata with Second-Order Recurrent Neural Networks," Neural Computation 2 (3), 393-405 (1992).
[14] S. Grossberg, Studies of Mind and Brain, Reidel Publishing Company, Dordrecht, Holland (1982).
[15] S. Grossberg and D. Somers, "Synchronized Oscillators During Cooperative Feature Linking in a Cortical Model of Visual Perception," Neural Networks 4, 453-466 (1991).
[16] N. Farhat and M. Eldefrawy, "The Bifurcating Neuron," Digest of the Annual Optical Society of America Meeting, San Jose, CA, p. 10 (1991).
[17] C. Giles, R. Griffin, and T. Maxwell, "Encoding Geometrical Invariances in Higher-Order Neural Networks," Proc. IEEE 1st Int. Neural Inf. Proc. Syst. Conf., Denver, CO, p. 301 (1987).
[18] N. R. Pal and S. K. Pal, "A Review on Image Segmentation Techniques," Pattern Recognition 26 (9), 1277-1294 (1993).
[19] J. L. Johnson, "Globally Stable Saturable Learning Laws," Neural Networks 4, 47-51 (1991).
[20] I. Daubechies, "The Wavelet Transform, Time-Frequency Localization, and Signal Analysis," IEEE Trans. Inf. Theory IT-10, 961-1005 (1990).
[21] S. Mallat, "Multiresolution Approximations and Wavelet Orthonormal Bases of $L^2(R)$," Trans. Am. Math. Soc. 3 (15), 69-87 (1989).
[22] C. K. Chui, An Introduction to Wavelets, Academic Press, Boston (1992).
[23] H. J. Caulfield and H. H. Szu, "Parallel Discrete and Continuous Wavelet Transforms," Opt. Eng. 31, 1835-1839 (1992).
[24] Keiji Tanaka, "Neuronal Mechanisms of Object Recognition," Science 262, 685-688 (1993).
[25] F. H. Rauscher, G. L. Shaw, and K. N. Ky, "Music and Spatial Task Performance," Nature 365, 611 (1993).
[26] J. L. Johnson, "Pulse-coupled neural networks," SPIE Critical Review Volume CR-55, Adaptive Computing: Mathematics, Electronics, and Optics, S. S. Chen and J. H. Caulfield (Eds.), pp. 47-76, Orlando, FL, 1994.
[27] H. S. Ranganath, G. Kuntimad, and J. L. Johnson, "Pulse-Coupled Neural Networks for Image Processing," Proc. IEEE Southeastcon 95, IEEE Press, Raleigh, NC, 1995.
Chapter 2
A Neural Network Model for Optical Flow Computation
Hua Li
Jun Wang
ABSTRACT Optical flow computation in dynamic image processing can be formulated as a minimization problem by a variational approach. Because solving the problem is computationally intensive, we reformulate the problem in a way suitable for neural computing. In this paper, we propose a recurrent neural network model that may be implemented in hardware with many processing elements (neurons) operating asynchronously in parallel to achieve a possible real-time solution. We derive and prove the properties of the reformulation, as well as analyze the asymptotic stability and convergence rate of the proposed neural network. Experiments using both the test patterns and the real laboratory images are conducted.
1
Introduction
Motion perception is one of the essential visual functions of biological organisms. Motion information processing, as generally believed, occurs at a relatively early stage of perception [Sekuler 1975], because a rapid response to a moving object is often more important than the precise recognition of what has moved. In addition, the need to search for food and to avoid becoming the prey of other animals demands real-time processing. In this regard, it is not enough to come up with solutions that merely give the correct output for a given input. A solution must be available within milliseconds of the problem's presentation, and actions must be forthcoming within a few hundred milliseconds [Churchland 1992]. So far, the human vision system outperforms any sophisticated computer vision system in motion perception. Motion detection and motion parameter estimation are challenging problems because a huge set of image data has to be processed in real time. For example, a typical 512-by-512 black-and-white image has to be processed at the rate of 30 frames per second, or equivalently, about 8 megabytes of image data have to be processed every second, which is about
the size of a telephone book of a city with a population of 300,000. Second, most of the mathematical formulations and computational models of a biological vision system are ill-posed in the sense of Hadamard, and the regularization processes they require add to the computational load.
In dynamic image processing, there is often a need for detecting motion and estimating motion parameters in real time in order for a system (e.g., a robot) to interact with a changing environment. Most existing image processing algorithms for motion applications, however, are too computationally intensive to provide a real-time solution.
Recently, biologically inspired algorithms and hardware have been developed for motion-related vision applications. In the area of early vision computing, Poggio and Koch have conducted interesting research [1,14]. Mead et al. have built a resistive network, an "electronic retina," to compute image flow [12]. Many other works have been reported, including the design of an analog network for simulating a function of human visual peripheral processes of motion perception [10], image segmentation [11], simulation of human eye saccadic movement [20], and vertebrate retinal processing [17].
Optical flow, introduced by Gibson in the 1950s [4], is a two-dimensional vector field induced by relative motion between an observer and viewed objects. Under an egocentric coordinate system, the pattern of the flow provides the motion-related information. Based on this theory, Horn [6] and Thompson [18], among others, have developed mathematical models for optical flow computation on a pixel-by-pixel basis. Nagel and Enkelmann [13] have investigated the "smoothness constraint." Kearney et al. [9] have performed the error analysis for optical flow computation. Recently, Wohn, Wu, and Brockett [23] have developed a new iterative transformation technique to compute full image flow.
Snyder [16] has shown that Nagel's weight matrix is the only physically plausible constraint and has further derived a general form of the "smoothness constraint."
In this paper, we reformulate the optical flow computation in such a way that the motion information can be mapped to the node states of a recurrent neural network. The computation is distributed over the processing elements. The stabilized activation states of the network represent the solution. We provide theoretical analysis of the asymptotic stability and convergence of the network. The proposed network can operate asynchronously in parallel. In addition, the regular structure of each processing element makes it possible to implement the proposed neural network in VLSI for real-time processing.
2 Theoretical Background
In order to derive a computational formula suitable for neural computing, we start from the problem formulation.
2.1 Optical Flow as a Minimization of Functionals
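Let $E(x, y, t)$ be an image intensity function at position $(x, y)$ and time $t$. By Taylor expansion, it can be derived rather easily that
$$\frac{dE(x,y,t)}{dt} = E_x u + E_y v + E_t + o(h),$$
where $E_x = \partial E/\partial x$, $E_y = \partial E/\partial y$, $E_t = \partial E/\partial t$, $u = dx/dt$, $v = dy/dt$, and $o(h)$ is a higher-order term. The problem of finding $u$ and $v$ is ill-posed in the sense of Hadamard. Regularization is utilized to convert the problem to a well-posed one by imposing a smoothness constraint. Therefore, computation of optical flow is formulated as a minimization of functionals [5],
$$\min \iint_{\Omega} \left[ (E_x u + E_y v + E_t)^2 + \alpha \left( \left(\frac{\partial u}{\partial x}\right)^2 + \left(\frac{\partial u}{\partial y}\right)^2 + \left(\frac{\partial v}{\partial x}\right)^2 + \left(\frac{\partial v}{\partial y}\right)^2 \right) \right] dx\,dy, \tag{1}$$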
where $\alpha > 0$ is a regularization parameter and $\Omega$ is the image plane on which the optical flow is to be computed. From the theory of calculus of variations, the Euler necessary condition of equation (1) gives
$$\begin{cases} \nabla^2 u = \alpha (E_x u + E_y v + E_t) E_x, \\ \nabla^2 v = \alpha (E_x u + E_y v + E_t) E_y, \end{cases} \tag{2}$$
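where $\nabla^2 = \partial^2/\partial x^2 + \partial^2/\partial y^2$ is the Laplacian operator. These coupled elliptic partial differential equations give the solution to equation (1). They are subject to a natural boundary condition,
$$\begin{cases} \left(\dfrac{\partial u}{\partial x}, \dfrac{\partial u}{\partial y}\right)^t \left(\dfrac{dy}{ds}, -\dfrac{dx}{ds}\right) = 0, \\[2mm] \left(\dfrac{\partial v}{\partial x}, \dfrac{\partial v}{\partial y}\right)^t \left(\dfrac{dy}{ds}, -\dfrac{dx}{ds}\right) = 0, \end{cases} \tag{3}$$
where $s$ denotes the boundary of the image plane $\Omega$, and $(\partial u/\partial x, \partial u/\partial y)^t$ and $(\partial v/\partial x, \partial v/\partial y)^t$ are column vectors.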
2.2 Formulation for Neural Computing
Applying the finite difference method to equation (2), we have the following difference equations:
$$\begin{cases} (-4 - \alpha E_x^2)\,u(x,y) + u(x,y-1) + u(x+1,y) + u(x-1,y) + u(x,y+1) - \alpha E_x E_y\,v(x,y) = \alpha E_x E_t, \\ (-4 - \alpha E_y^2)\,v(x,y) + v(x,y-1) + v(x+1,y) + v(x-1,y) + v(x,y+1) - \alpha E_x E_y\,u(x,y) = \alpha E_y E_t. \end{cases} \tag{4}$$
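The step from equation (2) to equation (4) can be spelled out as follows. This is a sketch that assumes unit grid spacing and the standard five-point approximation of the Laplacian, neither of which the text states explicitly.

```latex
\begin{align*}
% Five-point approximation of the Laplacian (assumed, unit grid spacing):
\nabla^2 u(x,y) &\approx u(x+1,y) + u(x-1,y) + u(x,y+1) + u(x,y-1) - 4\,u(x,y) \\
% Substituting into the first line of equation (2):
u(x+1,y) + u(x-1,y) + u(x,y+1) + u(x,y-1) - 4\,u(x,y)
  &= \alpha E_x^2\,u(x,y) + \alpha E_x E_y\,v(x,y) + \alpha E_x E_t \\
% Collecting the unknowns on the left-hand side:
(-4 - \alpha E_x^2)\,u(x,y) + u(x+1,y) + u(x-1,y) + u(x,y+1) + u(x,y-1)
  - \alpha E_x E_y\,v(x,y) &= \alpha E_x E_t
\end{align*}
```

The second line of equation (4) follows in the same way from the second line of equation (2), with the roles of $u$ and $v$ (and of $E_x$ and $E_y$) interchanged.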
In view of the fact that optical flow computation is almost always performed on a square or a rectangular image, the natural boundary condition can be simplified as
$$S_0, S_2: \begin{cases} u(x,y+1) - u(x,y-1) = 0, \\ v(x,y+1) - v(x,y-1) = 0, \end{cases} \qquad S_1, S_3: \begin{cases} u(x+1,y) - u(x-1,y) = 0, \\ v(x+1,y) - v(x-1,y) = 0, \end{cases} \tag{5}$$
where $S_i$, $i = 0, \ldots, 3$, is the boundary of a given rectangular region. Figure 1 illustrates the region that gives the above boundary condition. From the difference equations and the boundary conditions, we can derive a linear algebraic system $AX = b$. For example, to compute optical flow on a 2 x 2 image, by labeling each pixel from 1 to 4 as illustrated in Figure 1, the matrix equations in Figure 2 are used.
[Figure 1: a 2-by-2 region with pixels labeled 1 to 4, bounded by $S_0$, $S_1$, $S_2$, and $S_3$.]
FIGURE 1. Illustration of a 3-by-3 image, upon which a 2-by-2 subregion will be used as the input for optical flow computation. Note that the boundaries ($S_i$, $i = 0, 1, 2, 3$) of this given 2-by-2 region define a rectangular region that simplifies the mathematical manipulation of the boundary condition.
FIGURE 2. Matrix equations to compute optical flow on a 2 x 2 image.
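The assembly of $AX = b$ from equations (4) and (5) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `build_system`, the raster ordering of the unknowns (all $u$ first, then all $v$), and the derivative values are assumptions made here. The derivative values are chosen so that $E_x + E_y + E_t = 0$ at every pixel, which means a uniform flow $u = v = 1$ should satisfy the system.

```python
def build_system(Ex, Ey, Et, alpha):
    """Assemble A and b for a K-by-K window (K >= 2) from equations (4) and (5).

    Ex, Ey, Et are K-by-K nested lists of derivative estimates.
    Unknown ordering (an assumption of this sketch): u in raster order, then v.
    """
    K = len(Ex)
    n = K * K
    A = [[0.0] * (2 * n) for _ in range(2 * n)]
    b = [0.0] * (2 * n)

    def idx(x, y):
        return y * K + x

    def reflect(c):
        # Boundary condition (5): a neighbour outside the window is replaced
        # by the neighbour on the opposite side, e.g. u(x-1, y) = u(x+1, y).
        if c < 0:
            return 1
        if c >= K:
            return K - 2
        return c

    for y in range(K):
        for x in range(K):
            p = idx(x, y)
            ex, ey, et = Ex[y][x], Ey[y][x], Et[y][x]
            A[p][p] += -4.0 - alpha * ex * ex          # u-equation diagonal
            A[n + p][n + p] += -4.0 - alpha * ey * ey  # v-equation diagonal
            A[p][n + p] += -alpha * ex * ey            # coupling u-row to v
            A[n + p][p] += -alpha * ex * ey            # coupling v-row to u
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                q = idx(reflect(x + dx), reflect(y + dy))
                A[p][q] += 1.0
                A[n + p][n + q] += 1.0
            b[p] = alpha * ex * et
            b[n + p] = alpha * ey * et
    return A, b


# Hypothetical derivative values for a 2-by-2 window (not taken from the text).
Ex = [[0.0, 0.0], [3.0, 2.0]]
Ey = [[0.0, 3.0], [0.0, 2.0]]
Et = [[-(ex + ey) for ex, ey in zip(rx, ry)] for rx, ry in zip(Ex, Ey)]

A, b = build_system(Ex, Ey, Et, alpha=0.01)
ones = [1.0] * len(b)
residual = [sum(a * z for a, z in zip(row, ones)) - bi for row, bi in zip(A, b)]
print(len(A), len(A[0]))                 # matrix dimensions
print(max(abs(r) for r in residual))     # residual of the flow u = v = 1
```

On a 2-by-2 window every pixel touches two boundaries, so each off-diagonal neighbour coefficient is doubled by the reflection in `reflect`; on larger windows this happens only in the boundary rows.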
3 Discussion on the Reformulation
In this section, we discuss some properties to assist the analysis and computation of optical flow. These properties are given to detail the algorithm construction and to show the nature of the formulation suitable for neural computing. The proofs of most of the properties are straightforward.
Property 1. The dimension of the matrix $A$, $\dim A$, is related to the size of the $K$-by-$K$ square region (image) by $\dim A = 2K^2$, where $K^2$ is the number of pixels of the region (image).
Proof: At each pixel location $(x, y)$ there are two unknowns, $u(x,y)$ and $v(x,y)$, to be determined. From equation (4), two linearly independent equations are needed to solve for them. Therefore, for a given $K \times K$ image $E(x,y,t)$, $x, y = 1, 2, \ldots, K$, there are $2(K \times K)$ linearly independent equations, which results in $\dim A = 2K^2$.
Property 2. The matrix $A$ is a sparse matrix. Except at the boundary, the ratio of nonzero elements to total elements of each row is at most $3/K^2$.
Proof: A constructive method can be employed to prove this property. From $(-4 - \alpha E_x^2)u(x,y) + u(x+1,y) + u(x-1,y) + u(x,y+1) + u(x,y-1) - \alpha E_x E_y v(x,y) = \alpha E_x E_t$ of equation (4), where $x, y = 1, 2, \ldots, K$, a single row of matrix $A$ can be constructed at a time. The nonzero elements of the row come from the coefficients of the $u$'s and $v$'s. Hence, there are only six nonzero elements regardless of $K$. Since the size of each row is $2K^2$ from Property 1, the ratio of nonzero elements to the total number of elements of any given row is equal to $6/(2K^2) = 3/K^2$.
With these properties, we may check equation (6). One should notice that due to the small size of the image (2 x 2), each row of the matrix is affected by the boundary conditions. For example, at pixel position 1 (see Figure 1), $u(x-1,y)$ relates to the $S_1$ boundary. Its coefficient must be determined by following the boundary condition of equation (5). Similarly, $u(x,y+1)$ relates to the $S_2$ boundary, and its coefficient is determined accordingly by equation (5).
Property 2 holds except at the boundary. In other words, when K > S, one should be able to observe Property 2 well.
Remark: The matrix is symmetric, and its bandwidth is equal to $K^2$. The bandwidth of a sparse matrix is defined as the maximum distance, in terms of the number of entries, between two nonzero entries in a row of the given matrix. From Properties 1 and 2 it is easy to observe that the matrix bandwidth is equal to the total number of $u$'s (which is equal to $K^2$). The
symmetry of the matrix comes from the cellular structure of the $u$'s and $v$'s given in equation (4). For example, the computation of $u(x,y)$ involves its neighbors $u(x+1,y)$, $u(x-1,y)$, $u(x,y+1)$, and $u(x,y-1)$. Based on the properties, we have constructed the matrix $A$ of the linear algebraic system $AX = b$ on a 32 x 32 window, as illustrated in Figure 3. The matrix $A$ has a desirable regular structure, and the nonzero entries are located on five subdiagonal positions and a main diagonal, as predicted by the properties developed above.
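Properties 1 and 2 can be checked numerically for an interior pixel. The sketch below is illustrative only: the helper `u_row_columns` and the column ordering (all $u$ unknowns in raster order, followed by all $v$ unknowns) are assumptions, and the pixel (5, 7) is an arbitrary interior location in a 32 x 32 window.

```python
def u_row_columns(x, y, K):
    """Column indices of the nonzero coefficients in the u-equation of (4)
    at an interior pixel (x, y), 0-based.

    Assumed ordering: u unknowns in raster order, then v unknowns.
    """
    n = K * K
    p = y * K + x
    return sorted([
        p,        # u(x, y), the diagonal entry
        p + 1,    # u(x+1, y)
        p - 1,    # u(x-1, y)
        p + K,    # u(x, y+1)
        p - K,    # u(x, y-1)
        n + p,    # v(x, y), the coupling entry
    ])


K = 32
dim_A = 2 * K * K                       # Property 1
cols = u_row_columns(5, 7, K)
nonzeros = len(cols)                    # six entries, independent of K
ratio = nonzeros / dim_A                # Property 2: 6 / (2 K^2) = 3 / K^2
coupling_offset = (K * K + 7 * K + 5) - (7 * K + 5)  # v(x,y) column minus u(x,y) column

print(dim_A, nonzeros, ratio, coupling_offset)
```

With this ordering the coupling entry sits exactly $K^2$ columns away from the diagonal, which is the offset the Remark identifies with the number of $u$ unknowns.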
4 Choosing Regularization Parameters
Before describing the recurrent neural network model, we need to address the aperture problem. The aperture problem is an important problem in image flow computation. It refers to the ambiguity in determining the true velocity using a local motion detector. This ambiguity can be well observed from the original formulation, $dE(x,y,t)/dt = E_x u + E_y v + E_t + o(h)$, which can be rewritten in vector dot product form as $(E_x, E_y) \cdot (u, v) = -E_t$ for $dE(x,y,t) \approx 0$, under the smoothness constraint. This indicates that $(u, v)$ cannot be uniquely determined when $(E_x, E_y)$ is perpendicular to $(u, v)$ [5]. It is generally accepted that any vision system, whether biological or artificial, exhibits the aperture problem [7]. A work by Snyder [16] has shown an interesting result on the smoothness constraints based on Horn and Schunck's original formulation. Recently, Wohn et al. [23] have explicitly defined normal flow and full flow. Their iterative approach starts from normal flow and successively estimates full flow until the process converges. In this section, we provide a way for regulating the aperture problem. The regularization parameter $\alpha$ of the "smoothness constraint" in the second term of equation (1) controls the convergence of the image flow computation.
Property 3. The regularization for handling the aperture problem can be achieved by choosing a proper $\alpha$ such that, for all $i$, $\alpha$ is determined by
$$f_i(\alpha) = \left|-4 - \alpha E_i^2\right| - \sum_{j \neq i} |a_{ij}| > 0, \tag{7}$$
where $a_{ij}$ is an element of matrix $A$.
Proof: This property is a direct result of applying a well-known property from numerical analysis to this particular linear algebraic system. The property states that the absolute value of the main diagonal element of a row should be greater than or equal to the sum of the absolute values of all other elements in that row in order to ensure the convergence of iterations (in this situation, it can be very difficult to find the eigenvalues of the given linear algebraic system due to the size of the large sparse matrix) [2]. For the given problem, equation (7) gives
$$\left|-4 - \alpha E_i^2\right| \geq \sum_{j \neq i} |a_{ij}|. \tag{8}$$
Following the definition of $f(\alpha)$, it is immediately seen that $f(\alpha) > 0$ satisfies the above condition. In practice, a small value of $\alpha$ should be chosen, usually in the range of [10~^, 10"^].
FIGURE 3. Top: The test pattern image with a 32 x 32 window. Bottom: The sparse matrix A for computing optical flow constructed by equations (4) and (5).
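The margin $f_i(\alpha)$ of equation (7) is easy to evaluate row by row once $A$ is available. The sketch below does this for the 8 x 8 matrix of Figure 6 as transcribed from the figure (with $\alpha = 0.01$); the function name `dominance_margins` is an assumption introduced here, and the transcription itself may carry extraction errors.

```python
# Matrix of Figure 6, as transcribed (rows 1-4: u-equations, rows 5-8: v-equations).
A = [
    [-4.0,  2.0,  2.0,   0.0,   0.0,  0.0,   0.0,  0.0],
    [ 2.0, -4.0,  0.0,   2.0,   0.0,  0.0,   0.0,  0.0],
    [ 2.0,  0.0, -4.09,  2.0,   0.0,  0.0,   0.0,  0.0],
    [ 0.0,  2.0,  2.0,  -4.04,  0.0,  0.0,   0.0,  0.04],
    [ 0.0,  0.0,  0.0,   0.0,  -4.0,  2.0,   2.0,  0.0],
    [ 0.0,  0.0,  0.0,   0.0,   2.0, -4.09,  0.0,  2.0],
    [ 0.0,  0.0,  0.0,   0.0,   2.0,  0.0,  -4.0,  2.0],
    [ 0.0,  0.0,  0.0,   0.04,  0.0,  2.0,   2.0, -4.04],
]


def dominance_margins(matrix):
    """Return f_i = |a_ii| - sum_{j != i} |a_ij| for every row i."""
    margins = []
    for i, row in enumerate(matrix):
        off_diagonal = sum(abs(a) for j, a in enumerate(row) if j != i)
        margins.append(abs(row[i]) - off_diagonal)
    return margins


margins = dominance_margins(A)
for i, f in enumerate(margins, start=1):
    print(f"row {i}: f = {f:+.4f}")
```

For this particular matrix the margin is positive only in the two rows whose diagonal is -4.09; in the remaining rows it is zero up to rounding, so those rows meet the weak form (8) but not the strict inequality in (7).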
5 A Recurrent Neural Network Model
Based on the mathematical formulation, we introduce a recurrent neural network model for optical flow computation.
5.1 The Neural Network Architecture
The proposed recurrent neural network for optical flow computation consists of $2K^2$ massively connected neurons. The state equation of the network can be described by the following vector-form differential equation:
$$C\dot{z}(t) = -Wz(t) + \theta, \tag{9}$$
where $C$ is a scalar capacitive parameter, $z \in R^{2K^2}$ is the activation state vector, $W = A^2$ is the connection weight matrix, and $\theta = Ab$ is the biasing threshold vector of the proposed neural network. One of the desirable features of this configuration is that the neural network can be implemented in hardware to perform parallel computation. For example, each of the $2K^2$ neurons can be implemented by three operational amplifiers: a summer, an integrator, and an inverter. The connection weight $w_{ij}$ between neurons $i$ and $j$ can be implemented by a feedback resistor $R_f$ and a connection resistor $R_{ij}$ such that $w_{ij} = R_f/R_{ij}$; i.e., $R_{ij} = R_f/w_{ij} = R_f / \sum_{k=1}^{n} a_{ki} a_{kj}$, where $a_{ij}$ is the element in the $i$th row and the $j$th column of $A$. The threshold $\theta_i$ of neuron $i$ can be implemented by a voltage source with the biasing voltage $\theta_i$ [22]. The architecture of the proposed recurrent neural network is shown in Figure 4, where the network has $n = 2K^2$ neurons. This simple implementation scheme allows the design of a high-density network that can be implemented in VLSI to obtain a possible real-time solution.
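The behavior of the state equation (9) can be illustrated with a small software simulation. The sketch below is an assumption-laden stand-in for the analog circuit: it integrates (9) with a forward Euler step, forms the weights as $w_{ij} = \sum_k a_{ki} a_{kj}$ and the thresholds as $\theta_i = \sum_k a_{ki} b_k$ (which equal $A^2$ and $Ab$ when $A$ is symmetric), and uses a hypothetical 2 x 2 system rather than an optical flow matrix. The names `simulate_network`, `dt`, and `steps` are introduced here for illustration.

```python
def simulate_network(A, b, C=1.0, dt=0.01, steps=5000):
    """Integrate C * dz/dt = -W z + theta with forward Euler, starting at z = 0.

    W and theta are built from A and b as in Section 5.1.
    dt must be small relative to C / (largest eigenvalue of W) for stability.
    """
    n = len(A)
    W = [[sum(A[k][i] * A[k][j] for k in range(n)) for j in range(n)]
         for i in range(n)]
    theta = [sum(A[k][i] * b[k] for k in range(n)) for i in range(n)]
    z = [0.0] * n
    for _ in range(steps):
        dz = [(-sum(W[i][j] * z[j] for j in range(n)) + theta[i]) / C
              for i in range(n)]
        z = [zi + dt * di for zi, di in zip(z, dz)]
    return z


# Hypothetical symmetric test system: 2*z1 + z2 = 3, z1 + 3*z2 = 5.
A = [[2.0, 1.0],
     [1.0, 3.0]]
b = [3.0, 5.0]

z = simulate_network(A, b)
print(z)  # steady state of the simulated network
```

For this test system the steady state should agree with the direct solution of $Az = b$, namely $z = (0.8, 1.4)$; the point of the sketch is only that the equilibrium of (9) satisfies the linear system.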
FIGURE 4. The configuration of the proposed recurrent neural network.
5.2 Stability and Convergence Rate
Proposition: The recurrent neural network for solving a system of linear algebraic equations is asymptotically stable in the large; i.e., $\forall z(0)$, $\exists \bar{z}$ such that $\lim_{t \to \infty} z(t) = \bar{z}$.
There is more than one way to prove the above proposition. For example, by a traditional approach we may define an energy function, $L(z) = (Az - b)^t (Az - b)/2$, and prove that $L(z)$ is a strict Liapunov function. Or, we may simply examine the eigenvalue characteristic derived from the proposed recurrent neural network to show the asymptotic stability, as given below.
Proof: Since $A$ is symmetric, $A^2 = A^t A$ and hence $W = A^2$ is symmetric. Therefore, the eigenvalues of $A^2$ are always real. Furthermore, since the eigenvalues of $A^2$ are always nonnegative, i.e., the eigenvalues of $-A^2$ are always nonpositive, the linear neural system is always asymptotically stable in the large.
The proposed recurrent network for computing image flow is essentially a linear dynamic system. According to linear systems theory [8], the convergent trajectory of the activation state $z(t)$ can be described as
$$z_i(t) = \sum_{j=1}^{2K^2} c_{ij}(t)\, e^{-\lambda_j t / C} + \bar{z}_i \tag{10}$$
for $i, j = 1, 2, \ldots, 2K^2$, where $\lambda_j$ is an eigenvalue of $W$, and $c_{ij}(t)$ are constants or polynomials in $t$, depending on the initial condition and the uniqueness of the eigenvalues of $W$.
It should be pointed out that there is another interesting result related to this work. Recently we have also shown that the steady state of the proposed recurrent neural network represents a solution to the set of simultaneous linear equations (i.e., $AX = b$ with $X = \bar{z}$ by equation (9)) if and only if $A$ is of full rank (i.e., $\mathrm{rank}(A) = \dim(A) = 2K^2$) [21]. The analysis of the optical flow formulation given in the previous section reveals that this condition (the condition of full rank) can be satisfied with a suitable regularization coefficient $\alpha$.
According to linear systems theory [8], the convergence rate of the proposed recurrent neural network is dominated by the term in $z(t)$ with the largest time constant, which corresponds to the smallest eigenvalue of $C^{-1}A^2$, $\min\{C^{-1}\lambda_i;\ i = 1, \ldots, 2K^2\}$. From the engineering point of view, the linear neural system can reach its steady state in $5/\min\{C^{-1}\lambda_i;\ i = 1, \ldots, 2K^2\}$ seconds. Furthermore, since the positive capacitive parameter $C$ is directly proportional to the stabilization time required by the linear neural system, the convergence rate of the solution process can be controlled by selecting
a sufficiently small capacitance parameter $C$. The convergence rate of the proposed neural network also depends on the regularization parameter $\alpha$. Specifically, the smaller the $\alpha$, the slower the convergence rate, which will be demonstrated in the next section. Therefore, there is a trade-off between the need for regularization (smaller $\alpha$) and the need for faster convergence (relatively bigger $\alpha$).
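The settling-time estimate $5/\min\{C^{-1}\lambda_i\}$ can be made concrete on a small case. The sketch below is illustrative only: it uses a hypothetical symmetric 2 x 2 weight matrix (not one derived from image data), the closed-form eigenvalues of a symmetric 2 x 2 matrix, and helper names (`eig_sym_2x2`, `settling_time`) introduced here.

```python
import math


def eig_sym_2x2(W):
    """Eigenvalues (smallest, largest) of a symmetric 2x2 matrix."""
    a, off, d = W[0][0], W[0][1], W[1][1]
    mean = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, off)
    return mean - radius, mean + radius


def settling_time(W, C):
    """Five time constants of the slowest mode: 5 / min(lambda_i / C)."""
    lam_min, _ = eig_sym_2x2(W)
    return 5.0 * C / lam_min


# Hypothetical weight matrix W = A^2 for A = [[2, 1], [1, 3]].
W = [[5.0, 5.0],
     [5.0, 10.0]]

lam_min, lam_max = eig_sym_2x2(W)
t_unit = settling_time(W, C=1.0)
t_small = settling_time(W, C=1e-6)
print(lam_min, lam_max)
print(t_unit, t_small)
```

The estimate scales linearly with $C$, which is the sense in which a smaller capacitance gives faster stabilization; it is governed by the smallest eigenvalue of $W$, which is where the dependence on $\alpha$ enters.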
6 Experiments
In order to demonstrate the characteristics of the proposed recurrent neural network, the experiments have been conducted in two phases: experiments using artificially generated test patterns and experiments using real laboratory images. A pair of test patterns is given in Figure 5. These two patterns were used as two consecutive image frames captured at time slices $t - dt$ and $t$. The second pattern was diagonally shifted by 1 pixel to simulate a motion, and its intensity was slightly altered to simulate random disturbance. $E_x$, $E_y$, and $E_t$ were computed first by using 3 x 3 kernels. With $\alpha = 0.01$, the optical flow was computed by using the proposed recurrent neural network. At equilibrium state, $z$ gives the vector components $u$ and $v$, for $z = (z_1, z_2, \ldots, z_8)^t = (u_1, \ldots, u_4, v_1, \ldots, v_4)^t$. The computations used to compute the optical flow field are illustrated in Figures 6-8, and the optical flow field determined by the vectors is shown in Figure 9, which illustrates that the flow pattern matches the diagonal motion. In the figure, we define the distance between two diagonally connected pixels to be $\sqrt{2}$, and the computational result matches this definition ($\sqrt{u^2 + v^2}$), where $u$ and $v$ are given from the column vector $z$.
The laboratory images were then used. The image shown in Figure 10 has 256 x 240 resolution with 1 byte per pixel. The object of interest was displaced to a new position to create a motion after the digitization of the first image. A 32 x 32 window was chosen. $E_x$, $E_y$, and $E_t$ were computed within the window before the computation of $z(t)$. Following the criterion in Section 4, the crucial regularization parameter $\alpha$ was made equal to 0.01. The experimental result is given in Figure 11, which agrees with the motion.
7 Comparison to Other Work
Our work described in this paper includes the reformulation of Horn's model for possible neural network implementation. Horn's original model is based on the optimization of an objective function on a global scale, as
[Figure 5: two 2-by-2 frames, I(x, y, t) and I(x, y, t+dt), drawn on x-y axes; pixel values as extracted: 10, 8, 7, 7, 7, 10, 5, 7.]
FIGURE 5. Illustrated here are two frames of the small (2-by-2) artificially generated images. Note that the second frame of the image is shifted diagonally at time t + dt. Then the partial derivatives $E_{x_i}$, $E_{y_i}$, and $E_{t_i}$ are computed (repeating the boundary elements).
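$$\begin{pmatrix}
-4 & 2 & 2 & 0 & 0 & 0 & 0 & 0 \\
2 & -4 & 0 & 2 & 0 & 0 & 0 & 0 \\
2 & 0 & -4.09 & 2 & 0 & 0 & 0 & 0 \\
0 & 2 & 2 & -4.04 & 0 & 0 & 0 & 0.04 \\
0 & 0 & 0 & 0 & -4 & 2 & 2 & 0 \\
0 & 0 & 0 & 0 & 2 & -4.09 & 0 & 2 \\
0 & 0 & 0 & 0 & 2 & 0 & -4 & 2 \\
0 & 0 & 0 & 0.04 & 0 & 2 & 2 & -4.04
\end{pmatrix}
\begin{pmatrix} z_1 \\ z_2 \\ \vdots \\ z_8 \end{pmatrix}
=
\begin{pmatrix} 0 \\ 0 \\ -0.09 \\ 0.04 \\ 0 \\ -0.09 \\ 0 \\ 0.04 \end{pmatrix}$$
FIGURE 6. The linear algebraic system constructed for solving the optical flow.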
[Figure 7: plots of the components of z versus iteration, 0 to 500,000.]
FIGURE 7. The computation is performed by using a recurrent neural network. Note that the energy function of the proposed neural network is a strict Liapunov function. It decreases monotonically as the number of iterations increases. The plots of vector Z are shown. Note that it takes a large number of iterations to reach the final result. Since the network can be implemented in hardware, the number of iterations is not really the concern. As pointed out in the study, the speed of convergence can be controlled by choosing different fi and regularization parameter $\alpha$. But trade-offs have to be made to ensure the "smoothness" constraint.
[Figure 8: plot of energy (0.000 to 0.005) versus iteration (0 to 500,000).]
FIGURE 8. Energy as a function of iteration.
defined by equation (1). As a result of the global optimization, the model is less sensitive to local variation and random noise. The algorithm based on the reformulation is implemented as a recurrent neural network. The network can operate concurrently, in asynchronous fashion, for potential real-time application. The behavior of the network, such as convergence, convergence speed, and stability, is analyzed. An analog VLSI implementation of the network is possible because of the regularity of the network structure.
Our work is based on a mathematical formulation with a smoothness constraint. This constraint is widely adopted in many currently pursued models. The constraint can be further divided into the condition of smooth motion over any given short sampling time interval, and the requirement of smooth change of illumination. Obviously, in real life, the requirement of smooth change of illumination may or may not be satisfied. Therefore, there is a need to develop an illumination-invariant model. The existing Fourier analysis technique is computationally intensive, and it does not provide accurate results. It has been reported recently that Tsao and Chen [19] have proposed a computational model for optical flow computation based on Gabor phase functions. They demonstrated that the proposed method works for synthetic test pattern images.
[Figure 9: image flow with $\alpha = 0.01$.]
FIGURE 9. The computational result.
8 Summary and Discussion
In this paper, we have reformulated the optical flow computation in such a way that the optical flow can be mapped to activation states of a recurrent neural network. The advantage of this proposed approach is that the computation of optical flow is distributed to each simple and regular processing element of the network. The solution of optical flow is provided as the stabilized activation state of the network. The network operates concurrently, and it can be implemented in analog VLSI. Analog VLSI has some remarkable features, which include (1) fast computational speed, (2) lower power consumption, (3) smaller size in silicon implementation, and (4) simpler circuit configurations for realizing the same functionality. But in general, the design of an analog VLSI circuit takes a longer time, and the computational accuracy is not as good as that of the digital counterpart. The current state-of-the-art analog VLSI technology can deliver about 6-bit resolution (for example, Intel's neural chip, 80170NX, based on CHMOS III EEPROM technology, has 10,240 modifiable analog weights in 4-quadrant analog multiplier synapses with over 6-bit precision).
Compared to the resistive network for optical flow computation in Hutchinson et al. [7], the proposed network here is based on Horn's functional analysis approach. The network can be implemented in a standard recurrent neural network, which is a desirable feature. It should also be pointed out that the massive connections of each neuron to every other neuron in this design may limit the size of images. It can be derived that the number of connections needed for each neuron is on the order of $K^2$.
In this study, we have also proven properties necessary for constructing a recurrent neural network and conducted experiments on both the test patterns and the laboratory images that confirm our theoretical analysis. Our future work includes the further investigation of the neural network architecture to reduce the size of the network. We are also working on the analog VLSI implementation of the algorithm.
FIGURE 10. The laboratory scene.
FIGURE 11. The optical flow of the laboratory scene. Note that the computation is performed within a 32-by-32 window with $\alpha = 0.001$.
9 References
1. M. Bertero and T. Poggio, "Ill-Posed Problems in Early Vision," Proc. of IEEE, Vol. 76, No. 8, pp. 869-889, 1988.
2. R. Burden, J.D. Faires, and A.C. Reynolds, Numerical Analysis, Weber and Schmidt Press, Boston, 1981.
3. P.S. Churchland and T.J. Sejnowski, The Computational Brain, MIT Press, Cambridge, MA, 1992.
4. J. Gibson, The Ecological Approach to Visual Perception, Houghton Mifflin Company, Boston, 1979.
5. B.K.P. Horn, Robot Vision, MIT Press, Cambridge, MA, 1986.
6. B.K.P. Horn and B.G. Schunck, "Determining Optical Flow," Artificial Intelligence, Vol. 17, pp. 185-203, 1981.
7. J. Hutchinson, C. Koch, J. Luo, and C. Mead, "Computing Motion Using Analog and Binary Resistive Networks," Computer, Vol. 21, No. 3, pp. 52-63, March, 1988.
8. T. Kailath, Linear Systems, Prentice Hall, Englewood Cliffs, NJ, 1980.
9. J.K. Kearney, W.B. Thompson, and D.L. Boley, "Optical Flow Estimation: An Error Analysis of Gradient-Based Methods with Local Optimization," IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 9, No. 2, pp. 229-244, 1987.
10. H. Li and C.H. Chen, "Simulating a Function of Visual Peripheral Processes with an Analog VLSI," IEEE MICRO, Vol. 11, No. 5, pp. 8-15, 1991.
11. A. Lumsdaine, J. Wyatt, and I. Elfadel, "Nonlinear Analog Networks for Image Smoothing and Segmentation," Proc. of IEEE Int. Symp. Circuits and Systems, Vol. 2, pp. 987-991, 1990.
12. C. Mead, Analog VLSI and Neural Systems, Addison-Wesley, Reading, MA, 1989.
13. H. H. Nagel and W. Enkelmann, "An Investigation of Smoothness Constraints for the Estimation of Displacement Vector Fields from Image Sequences," IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 8, No. 5, pp. 565-593, 1986.
14. T. Poggio and C. Koch, "Ill-Posed Problems in Early Vision: From Computational Theory to Analogue Networks," Proc. of Royal Society of London Series B, Vol. 226, pp. 303-323, 1985.
15. R. Sekuler, "Visual Motion Perception," in Handbook of Perception, Vol. V, Seeing, edited by E.C. Carterette and M.P. Friedman, Academic Press, New York, 1975.
16. M.A. Snyder, "On the Mathematical Foundations of Smoothness Constraints for the Determination of Optical Flow and for Surface Reconstruction," IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 13, No. 11, pp. 1105-1114, 1991.
17. J.G. Taylor, "A Silicon Model of Vertebrate Retinal Processing," Neural Networks, Vol. 3, pp. 171-178, 1990.
18. W.B. Thompson and S. Barnard, "Low-Level Estimation and Interpretation of Visual Motion," Computer, IEEE Computer Society, pp. 20-28, August, 1981.
19. T.R. Tsao and V.C. Chen, "A Neural Computational Scheme for Extracting Optical Flow from the Gabor Phase Differences of Successive Images," Proc. of IJCNN 1992, IV-450, Baltimore, MD, 1992.
20. D.B. Tweed and T. Vilis, "The Superior Colliculus and Spatiotemporal Translation in the Saccadic System," Neural Networks, Vol. 3, pp. 75-86, 1990.
21. J. Wang and H. Li, "Solving Simultaneous Linear Equations Based on a Recurrent Neural Network," International J. of Information Science, Vol. 76, No. 3/4, pp. 255-278, Elsevier Publishing Co., New York, 1993.
22. J. Wang, "Electronic Realization of a Recurrent Neural Network for Solving Simultaneous Linear Equations," Electronics Letters, Vol. 28, No. 5, pp. 493-495, 1992.
23. K.Y. Wohn, J. Wu, and R.W. Brockett, "A Contour-Based Recovery of Image Flow: Iterative Transformation Method," IEEE Trans. on PAMI, Vol. 13, No. 8, pp. 746-760, 1991.
Chapter 3
Temporal Pattern Matching Using an Artificial Neural Network
Fatih A. Unal
Nazif Tepedelenlioglu
ABSTRACT A traditional optimization method used for pattern matching is dynamic time warping, which is a dynamic programming algorithm that compares an input test signal with a reference template signal and obtains an optimum match. The dynamic time warping algorithm reduces the nonlinear time misalignments between the two patterns and consequently accomplishes a better comparison, as opposed to an ordinary direct template matching method that might yield a larger distance between the two patterns despite the similarity. While effective in pattern recognition, the dynamic time warping algorithm is lacking in that the processing time becomes a major consideration for real-time applications as the number and the size of the patterns increase. A parallel computing architecture becomes the only avenue to deal with the heavy computational load. It is shown in what follows that the dynamic time warping pattern matching algorithm can be effectively implemented using the Hopfield network, whereby one defines a dynamic time warping energy function to achieve an optimum match between two patterns. The energy function is mapped to the Hopfield network's Liapunov function to derive the connection weights and the bias inputs.
1 Introduction
Pattern recognition systems consist of four functional units: a feature extractor (to select and measure the representative properties of raw input data in a reduced form), a pattern matcher (to compare an input pattern to reference patterns using a distance measure), a reference templates memory (against which the input pattern is compared), and a decision maker (to make the final decision as to which reference template is the closest to the input pattern) [2]. Among these units, the most crucial component is the pattern matcher, which finds the best match and the associated distance
between the unknown test input and the reference patterns. Patterns are finite sequences of real numbers, the sequence index usually being interpreted as time. The rate of success in the matching process depends strongly on how close the test pattern is to one of the reference templates. Often, due to the distortion and noise introduced during the handling of the test pattern, this desired similarity may deteriorate, and consequently matching errors begin to occur. Among the possible causes of distortion that result in significant matching errors if not compensated for are the nonlinear shifts introduced to the time scale of the test pattern. Dynamic time warping (DTW) is an algorithm [5] used to eliminate the nonlinear shifts in the time scale of temporal patterns. It reduces the nonlinear time misalignments between the two patterns by finding an optimal warping path and achieves a better comparison than an ordinary direct template matching method, which might yield a large distance. It is widely utilized in pattern recognition areas such as speech recognition, speaker verification, and speaker recognition, and it contributes significantly to the performance of these speech processing systems [8, 10, 11, 13]. While effective in pattern matching, the DTW algorithm is lacking in that the processing time becomes a major consideration for real-time applications as the length of the patterns increases. A parallel computing architecture becomes the only avenue to achieve the high computational rate. A possible remedy toward this end would be the use of a Hopfield network, which can be interpreted as one form of parallel computing. It is a fully connected single-layer feedback neural network with symmetric connection weights [4]. The Hopfield network can be regarded as a compromise between finding the best warping path at a considerable computational cost and rapidly finding an acceptable suboptimal solution.
Although the Hopfield network has so far been mentioned only in relation to the DTW problem, the approach presented here is flexible enough to apply to other optimization problems. The organization of the chapter is as follows: The Hopfield network and a general procedure to solve optimization problems with the Hopfield network are described in Section 2. The implementation of the DTW algorithm using the Hopfield network is explained in Section 3. Section 4 contains the computer simulation results, and the conclusions are drawn in Section 5.
3. Pattern Matching Using an Artificial Neural Network
2 Solving Optimization Problems Using the Hopfield Network

The embodiment of the Hopfield network is shown in Figure 1. As seen from the figure, the network consists of neurons with self-feedback in a single-layer structure, and the full connection is achieved through symmetric weights. The behavior of this system is described by the differential equation

$$\dot{\mathbf{u}} = -\frac{\mathbf{u}}{\tau} + \mathbf{W}\mathbf{v} + \mathbf{b}, \qquad (1)$$

where the inputs of the neurons are denoted collectively by the vector $\mathbf{u}$, the outputs by the vector $\mathbf{v}$, the connection weights between the neurons by the matrix $\mathbf{W}$, and the bias inputs by the vector $\mathbf{b}$; $\tau$ determines the rate of decay of the neurons. Also, the input-output characteristics of the neurons are taken as

$$v_i = g(u_i) = \frac{1}{2}\left(1 + \tanh\left(\frac{u_i}{u_T}\right)\right), \qquad (2)$$

where $u_T$ determines the steepness of the sigmoidal activation function $g$ and is called the temperature [4]. The corresponding graph is shown in Figure 2. Hopfield showed that this network, with a symmetric $\mathbf{W}$, forces the outputs of the neurons to follow a path through the state space on which the quadratic Liapunov function

$$L(\mathbf{v}) = -\frac{1}{2}\mathbf{v}^{T}\mathbf{W}\mathbf{v} - \mathbf{b}^{T}\mathbf{v} + \frac{1}{\tau}\sum_i \int_0^{v_i} g^{-1}(a)\,da \qquad (3)$$

monotonically decreases with respect to time as the network evolves in accordance with equation (1), and the network converges to a steady state that is determined by the choice of the weight matrix $\mathbf{W}$ and the bias vector $\mathbf{b}$. That is, $dL/dt \le 0$ [3]. The Liapunov function $L(\mathbf{v})$ can be interpreted as the energy of the network. Note that

$$\frac{du_i}{dt} = -\frac{\partial L(\mathbf{v})}{\partial v_i} \qquad (4)$$

can be derived from equations (1) and (3). Thus, the Hopfield network corresponds to a gradient system that seeks a minimum of the Liapunov function $L(\mathbf{v})$. The network converges to a stable state when a minimum is reached. So, $dL(\mathbf{v})/dt = 0$ implies $dv_i/dt = 0$, and this is achieved when the network reaches a stable state.
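The monotonic decrease can be checked in a few lines. The sketch below assumes that the integral term of the Liapunov function is weighted by $1/\tau$, that $\mathbf{W}$ is symmetric, and that $g$ is strictly increasing, so that $g^{-1}(v_i) = u_i$:

```latex
\begin{aligned}
\frac{\partial L}{\partial v_i}
  &= -\sum_j W_{ij} v_j - b_i + \frac{1}{\tau}\, g^{-1}(v_i)
   = -\Bigl(-\frac{u_i}{\tau} + \sum_j W_{ij} v_j + b_i\Bigr)
   = -\frac{du_i}{dt}, \\
\frac{dL}{dt}
  &= \sum_i \frac{\partial L}{\partial v_i}\,\frac{dv_i}{dt}
   = -\sum_i \frac{du_i}{dt}\; g'(u_i)\,\frac{du_i}{dt}
   = -\sum_i g'(u_i)\Bigl(\frac{du_i}{dt}\Bigr)^{2} \le 0 .
\end{aligned}
```

The last inequality uses only $g'(u_i) > 0$, and equality holds exactly when every $du_i/dt$ vanishes, which is the stable state described above.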
FIGURE 1. The Hopfield network.
This characteristic of the network is exploited to solve optimization problems. Usually, a quadratic energy function $E(\mathbf{v})$ composed of a cost function, and possibly some constraints, is defined for the optimization problem at hand and equated to the Liapunov function $L(\mathbf{v})$ to determine the connection weights $\mathbf{W}$ and the bias inputs $\mathbf{b}$. It should be noted that the performance of the network (where it converges) critically depends on the choice of the cost function and the constraints and their relative magnitude, since they determine $\mathbf{W}$ and $\mathbf{b}$, which in turn determine where the network settles down.
FIGURE 2. Sigmoidal activation function.

Table 1 shows the procedure that is used to set up a Hopfield network to solve an optimization problem. Each step in the procedure is briefly addressed in the next section when the implementation of DTW is described. The decay (or damping) term $-\mathbf{u}/\tau$ in equation (1) corresponds to the integration term of equation (3). One has to include an energy component in the energy function that will balance this integration term if the Liapunov function given by equation (3) is used. Otherwise, the convergence of the system can be disturbed [7, 15], and thus the performance of the Hopfield network may be lowered. In this study, the decay term (or equivalently the integration term) is ignored, as in most of the studies reported so far, and the following differential equation and the corresponding Liapunov function are used for the Hopfield network:

$$\dot{\mathbf{u}} = \mathbf{W}\mathbf{v} + \mathbf{b} \qquad (5)$$

and

$$L(\mathbf{v}) = -\frac{1}{2}\mathbf{v}^{T}\mathbf{W}\mathbf{v} - \mathbf{b}^{T}\mathbf{v}. \qquad (6)$$
3 Dynamic Time Warping Using Hopfield Network
This section introduces the concept of the DTW and the use of the Hopfield network to implement it.
Step 1. Find a neural network representation for the problem
Step 2. Determine a number representation with the neurons
Step 3. Define a Liapunov function L(v) for the Hopfield network
Step 4. Devise an energy function E(v) for the optimization problem
Step 5. Derive the connection weights W and the bias inputs b by equating L of Step 3 and E of Step 4
Step 6. Compute the energy function coefficients c

TABLE 1. A general procedure to solve an optimization problem with a Hopfield network.
3.1 Dynamic Time Warping
As mentioned in the introduction, DTW is a pattern matching algorithm that compares an input test pattern with a reference pattern template and obtains an optimum match subject to certain constraints [5]. An associated distance is also determined during the process. The DTW algorithm eliminates the nonlinear time misalignments between the two patterns and consequently achieves a better comparison than an ordinary direct template matching procedure, which might yield a larger distance between the two patterns despite their similarity [12]. In effect, the algorithm removes nonlinear variations along the time axis to compensate for nonlinear temporal distortions. In speech processing applications, such distortions may arise from variations in the speaking rates of the speakers. The algorithm can be formulated as a minimum-cost path problem, as illustrated in Figure 3. The problem is thus transformed into one of finding an optimal alignment path $m = w(n)$ between a reference signal $r(n)$ and a test signal $t(m)$ over a 2-D finite Cartesian grid of size $N \times N$, where $N$ is the length of the signals, and $n$ and $m$ are the discrete time indices for the reference and the test signals, respectively. Each node $v(n, m)$ has a specified cost $d(n, m)$ that corresponds to the distance between the reference signal sample $r(n)$ and the test signal sample $t(m)$. The problem is to obtain the minimum-cost path from $v(0, 0)$ to $v(N-1, N-1)$. In order to implement an effective and efficient DTW algorithm, it is necessary to specify a number of factors and constraints on the solution [9], which may vary with the application field. In what follows, to fix ideas, we assume that the application field is speech recognition, in which case the constraints become:
(a) Endpoint constraints:

$$w(0) = 0, \qquad w(N-1) = N-1. \qquad (7)$$

(b) Local path constraints: The following are Itakura path constraints, which are illustrated in Figure 4 [5]:

$$0 \le w(n) - w(n-1) \le 2, \qquad w(n-1) - w(n-2) > 0 \ \ \text{if} \ \ w(n) - w(n-1) = 0. \qquad (8)$$

These constraints guarantee that the average slope of the path lies between 1/2 and 2, provide path monotonicity, and prevent excessive compression and expansion of the time scales, as shown in Figure 3.

(c) Global path constraints:

$$m_L(n) \le m \le m_H(n), \qquad (9)$$

where

$$m_H(n) = \min\left\{2n,\ \left\lceil \tfrac{1}{2}n + \tfrac{1}{2}(N-1) \right\rceil,\ N-1\right\}, \qquad m_L(n) = \max\left\{\left\lfloor \tfrac{1}{2}n \right\rfloor,\ 2n - (N-1),\ 0\right\}, \qquad (10)$$

and $\lceil x \rceil$ denotes the smallest integer greater than or equal to $x$, and $\lfloor x \rfloor$ denotes the greatest integer less than or equal to $x$. Note that these global constraints elicit the parallelogram in Figure 3, whose sides have slopes 1/2 and 2 and emanate from the points $n = 0$, $m = 0$ and $n = N-1$, $m = N-1$, and in which the optimal warping path $w(n)$ lies. Strictly speaking, the actual slopes of the line segments connecting the grid nodes can be 0, 1, or 2 only. However, for convenience, one can assume average slopes of 1/2 and 2 for the edges of the parallelogram that is shown in Figure 3.

(d) Local distance measure: The absolute difference metric is used as the distance measure, which is implemented in the form

$$d(r(n), t(m)) = |r(n) - t(w(n))|. \qquad (11)$$
Consequently, the total distance along the optimal path $w(n)$ from the grid point $(0, 0)$ to the grid point $(N-1, N-1)$ can be written as

$$D = \min_{w(n)} \left\{ \sum_{n=0}^{N-1} d(r(n), t(w(n))) \right\}. \qquad (12)$$
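The global path constraints can be made concrete with a short sketch. The function names below are hypothetical, and the rounding of the slope-1/2 edges (lower bound rounded down, upper bound rounded up) is an assumption; it was chosen because it reproduces the pattern of admissible entries in the N = 10 distance matrices of Tables 4 and 6.

```python
def global_bounds(n, N):
    """Return (m_L, m_H), the admissible range of m = w(n) for column n.

    Assumption: the slope-1/2 lower edge is rounded down and the slope-1/2
    upper edge is rounded up; only integer arithmetic is used.
    """
    m_low = max(n // 2, 2 * n - (N - 1), 0)
    # (n + N) // 2 equals ceil((n + N - 1) / 2) for integers n and N.
    m_high = min(2 * n, (n + N) // 2, N - 1)
    return m_low, m_high


def parallelogram_mask(N):
    """mask[m][n] is True when grid node v(n, m) lies inside the parallelogram."""
    mask = [[False] * N for _ in range(N)]
    for n in range(N):
        m_low, m_high = global_bounds(n, N)
        for m in range(m_low, m_high + 1):
            mask[m][n] = True
    return mask


if __name__ == "__main__":
    # Print the N = 10 grid with m = 9 on the top row, as in the tables.
    for row in reversed(parallelogram_mask(10)):
        print("".join("#" if inside else "." for inside in row))
```

For N = 10 this marks 40 of the 100 grid nodes as admissible; all other nodes can be excluded from the search before any distances are compared.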
FIGURE 3. A DTW example depicting an optimal alignment path w(n) to match r(n) to t(m). The horizontal axis carries the reference pattern r(n) and the vertical axis the test pattern t(m). Each grid node v(n, m) is labeled with the associated distance d(r(n), t(m)) = |r(n) - t(m)| between the nth component of the pattern r and the mth component of the pattern t. The optimal warping path m = w(n) and the parallelogram resulting from the constraint equations are marked. Reprinted with permission from [17]. © IEEE 1992.

FIGURE 4. Itakura path constraints for the DTW. Reprinted with permission from [17]. © IEEE 1992.

With all these constraints in mind, we can reiterate the definition of the DTW problem as finding an optimal warping path $m = w(n)$ through the grid points $v(n, m)$ in Figure 3 to match the reference pattern $r(n)$ with the test pattern $t(m)$ subject to the constraints given by equations (7), (8), (9), and (10) such that the total distance $D$ given by equation (12) is minimized. Thus, for the particular example illustrated in Figure 3, the optimal warping path $m = w(n)$ (indicated by the solid line) goes through the grid nodes $v(0, 0)$, $v(1, 1)$, $v(2, 1)$, $v(3, 3)$, $v(4, 4)$, and $v(5, 5)$ and corresponds to the best match between the two patterns with the associated total distance 10. Note that none of the other valid paths (that satisfy the constraints) within the parallelogram have a smaller total distance. The results of the DTW for the example shown in Figure 3 are summarized in Table 2.
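The example can be verified numerically. The following sketch checks the endpoint and local path constraints for a candidate path and sums the local distances along it, using the values listed in Table 2. The helper names are hypothetical, and the global constraints are not checked here.

```python
def check_path(w):
    """Return a list of violated constraints for a warping path w[0..N-1]."""
    N = len(w)
    problems = []
    if w[0] != 0 or w[-1] != N - 1:
        problems.append("endpoint constraint (7)")
    for n in range(1, N):
        step = w[n] - w[n - 1]
        if not 0 <= step <= 2:
            problems.append(f"slope {step} at n={n} violates (8)")
        # A zero-slope step must not follow another zero-slope step.
        if step == 0 and n >= 2 and w[n - 1] - w[n - 2] == 0:
            problems.append(f"two successive zero slopes at n={n} violate (8)")
    return problems


def path_distance(r, t_on_path):
    """Sum of |r(n) - t(w(n))| along a given path (the sum inside equation (12))."""
    return sum(abs(a - b) for a, b in zip(r, t_on_path))


# Values read from Table 2 (the Figure 3 example).
w = [0, 1, 1, 3, 4, 5]
r = [4, 5, 6, 3, 7, 1]
t_on_path = [2, 4, 4, 3, 5, 4]      # t(w(n)) for n = 0..5

print(check_path(w))                # []
print(path_distance(r, t_on_path))  # 10
```

The empty list confirms that the path satisfies constraints (7) and (8), and the sum reproduces the total distance of 10 quoted above.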
3.2 Hopfield Network Implementation
Our purpose in this section is to demonstrate that the Hopfield network can be used to realize the DTW algorithm [17]. To achieve this objective, we follow the procedure given in Table 1: First, a neural network representation for the DTW algorithm is found. Once the representation is chosen, a DTW energy function $E(\mathbf{v})$ is devised to obtain an optimum match between the unknown input test and the reference patterns. Note that there is no need to determine a number representation with the neurons, since the steady state outputs of the neurons suffice to define the final warping path, and hence step 2 in Table 1 is omitted. The DTW algorithm is then mapped onto the Hopfield network by equating the energy function $E(\mathbf{v})$ to the Liapunov function $L(\mathbf{v})$, and the connection weights $\mathbf{W}$ and the bias inputs $\mathbf{b}$ are found by matching the linear and the quadratic terms in both functions. The energy function coefficients have to be determined in such a way as to obtain a balance among the energy function components to achieve a high quality result while maintaining the validity of the solution. For this purpose a method is developed that computes the energy function coefficients systematically. The reader is referred to [16] for details of this method.

  n    m = w(n)    r(n)    t(m)    d(r(n), t(m)) = |r(n) - t(m)|
  0       0          4       2          2
  1       1          5       4          1
  2       1          6       4          2
  3       3          3       3          0
  4       4          7       5          2
  5       5          1       4          3
  Total distance D                     10

TABLE 2. DTW results for the example in Figure 3.

Every grid point on the $(n, m)$ plane in Figure 3 can be represented by a neuron. Therefore, a two-dimensional array representation is used for the network. The neuron outputs will be denoted by subscripts $x$ (for ordinate $m$) and $i$ (for abscissa $n$) showing the row and the column indices, respectively. Hence, by scrutinizing the warping path through the grid nodes and considering the objective function (12) and the constraints (7) through (10), which are described in Section 3.1, the following energy function $E(\mathbf{v})$ can be constructed for the DTW algorithm:

$$
\begin{aligned}
E(\mathbf{v}) ={}& c_0 \sum_{x=0}^{N-1}\sum_{i=0}^{N-1}\sum_{y=0}^{N-1} (d_{x,i} + d_{y,i+1})\, v_{x,i}\, v_{y,i+1} \\
&+ \frac{c_1}{2} \sum_{x=0}^{N-1}\sum_{i=0}^{N-1} \sum_{\substack{y=0 \\ y \ne x,\ y \ne x+1,\ y \ne x+2}}^{N-1} v_{x,i}\, v_{y,i+1} \\
&+ \frac{c_2}{2} \sum_{x=0}^{N-1}\sum_{i=0}^{N-1}\sum_{\substack{y=0 \\ y \ne x}}^{N-1} v_{x,i}\, v_{y,i} \\
&+ \frac{c_3}{2} \left( \sum_{x=0}^{N-1}\sum_{i=0}^{N-1} v_{x,i} - N \right)^{2} \\
&+ \frac{c_4}{2} \sum_{x=0}^{N-1}\sum_{i=0}^{N-1} \sum_{\substack{j=0 \\ j \ne i,\ |i-j| \ne 1}}^{N-1} v_{x,i}\, v_{x,j} \\
&+ 2 c_5 \sum_{x=0}^{N-1}\sum_{i=0}^{N-1} v_{x,i}\,(1 - v_{x,i}),
\end{aligned}
\qquad (13)
$$
where modulo $N$ arithmetic is used for the subscripts wherever applicable, i.e., $N \equiv 0$. The $c_0$ term stands for the objective function that minimizes the total distance between the two signals associated with the optimal warping path through the grid points, as required by equation (12). This component will be pulled down to a minimum, since the energy function $E(\mathbf{v})$ is anchored to the Liapunov function $L(\mathbf{v})$, which decreases monotonically during the operation of the neural network as described in Section 2. The Itakura path slope constraint given by equation (8) is satisfied when the $c_1$ component equals zero. Namely, the slopes of the line segments between the grid nodes that form the final warping path $w(n)$ will be pushed to 0, 1, or 2 when this energy component is minimized. Every signal sample in the reference pattern should be visited once while matching with the test signal when the DTW is applied. Hence, a constraint that allows only a single neuron with output 1 in each column is needed; this is the $c_2$ term. When this part of the energy function is minimized, there will be at most one active neuron (output 1) in each column. Note that this is a necessary but not sufficient condition, since the constraint is satisfied even if all the neurons have output 0. Therefore, a sufficient condition is added to the energy function as the $c_3$ component. By minimizing the $c_3$ energy term, the neural network will end up having $N$ active neurons at the time the network converges to a minimum state. The $c_4$ component enforces the second Itakura constraint in equation (8). As the network settles down to a minimum energy configuration, this energy term approaches zero, and successive zero-slope line segments are avoided in each row.
Finally, minimizing the $c_5$ component of the energy function forces the neurons to have 0 or 1 output when the network reaches a stable state, as in [14]. Now, it is straightforward to show that the connection weights and the bias inputs can be obtained by equating equation (13) to equation (6) as

$$
\begin{aligned}
W_{xi,yj} ={}& -c_0 (d_{x,i} + d_{y,j})(\delta_{i+1,j} + \delta_{i-1,j}) \\
& - c_1\, \delta_{i+1,j}\,(1 - \delta_{x,y})(1 - \delta_{x+1,y})(1 - \delta_{x+2,y}) \\
& - c_2\, \delta_{i,j}\,(1 - \delta_{x,y}) - c_3 \\
& - c_4\, \delta_{x,y}\,(1 - \delta_{i,j})(1 - \delta_{i+1,j})(1 - \delta_{i-1,j}) + 4 c_5\, \delta_{i,j}\, \delta_{x,y}
\end{aligned}
\qquad (14)
$$

and

$$b_{xi} = c_3 N - 2 c_5, \qquad (15)$$

where $\delta_{x,y}$ is the Kronecker delta function, which is equal to 1 when $x = y$, and 0 otherwise. Modulo $N$ arithmetic is used for the subscripts wherever applicable. The derivation of (14) and (15) is omitted to save space. The reader is referred to [16] for details.

Even though it is possible to extend the concept of DTW to multidimensional signals, in this study we confine ourselves to one-dimensional signals only. The formulation is completed to establish the necessary theoretical framework, and work is underway to conclude the experimental studies for dynamic spatial warping (DSW), which is the extension of DTW to two-dimensional signals. We plan to apply this method to image recognition fields including fingerprint, signature, and handwritten character recognition.
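As an illustration of the matching step (the full derivation is in [16]), consider two components written in a form that is consistent with the $-c_3$ and $4c_5$ entries of equation (14) and with equation (15). The exact form in which these components appear in equation (13) is an assumption here, and $S = \sum_{x,i} v_{x,i}$ is shorthand introduced only for this sketch.

```latex
\begin{aligned}
\frac{c_3}{2}\,(S - N)^2
  &= \frac{c_3}{2} \sum_{x,i}\sum_{y,j} v_{x,i}\, v_{y,j}
     \;-\; c_3 N \sum_{x,i} v_{x,i} \;+\; \frac{c_3 N^2}{2}, \\
2 c_5 \sum_{x,i} v_{x,i}\,(1 - v_{x,i})
  &= -2 c_5 \sum_{x,i}\sum_{y,j} \delta_{x,y}\,\delta_{i,j}\, v_{x,i}\, v_{y,j}
     \;+\; 2 c_5 \sum_{x,i} v_{x,i}, \\
L(\mathbf{v})
  &= -\frac{1}{2} \sum_{x,i}\sum_{y,j} W_{xi,yj}\, v_{x,i}\, v_{y,j}
     \;-\; \sum_{x,i} b_{xi}\, v_{x,i}.
\end{aligned}
```

Comparing the quadratic terms with those of $L(\mathbf{v})$ gives the contributions $-c_3$ and $+4c_5\,\delta_{i,j}\,\delta_{x,y}$ to $W_{xi,yj}$, and comparing the linear terms gives $b_{xi} = c_3 N - 2c_5$. The constant $c_3 N^2/2$ does not affect the dynamics.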
4 Computer Simulation Results
The dynamical behavior of the Hopfield network model is represented by equation (5). The equation is solved numerically using Euler's method by replacing the derivative in equation (5) by the quotient of the forward differences as follows:

$$u_{xi}^{(k+1)} = u_{xi}^{(k)} + \Delta t \left( \sum_{y,j}^{N-1} W_{xi,yj}\, v_{yj}^{(k)} + b_{xi} \right), \qquad (16)$$
where $0 \le x, i \le N-1$. This equation will be used throughout this work to simulate the operation of the DTW Hopfield network. We also replace the sigmoidal activation function by the following piecewise linear function:

$$
g(u_i) =
\begin{cases}
0 & \text{if } u_i < -0.5, \\
u_i + 0.5 & \text{if } -0.5 \le u_i \le 0.5, \\
1 & \text{if } u_i > 0.5.
\end{cases}
\qquad (17)
$$
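A minimal sketch of the iteration, assuming the forward-Euler update of du/dt = Wv + b and the piecewise linear activation described above, is given below. The two-neuron network is hypothetical and was chosen only because its two energy minima, the states (1, 0) and (0, 1), are easy to see by inspection; it is not the DTW network.

```python
def g(u):
    """Piecewise linear activation: 0 below -0.5, u + 0.5 in between, 1 above 0.5."""
    return min(1.0, max(0.0, u + 0.5))


def energy(W, b, v):
    """Liapunov function without the decay term: -1/2 v^T W v - b^T v."""
    n = len(v)
    quad = sum(W[i][j] * v[i] * v[j] for i in range(n) for j in range(n))
    return -0.5 * quad - sum(b[i] * v[i] for i in range(n))


def euler_step(W, b, u, dt):
    """One forward-Euler update of du/dt = W v + b."""
    v = [g(ui) for ui in u]
    n = len(u)
    return [u[i] + dt * (sum(W[i][j] * v[j] for j in range(n)) + b[i])
            for i in range(n)]


# Hypothetical 2-neuron network with mutual inhibition.
W = [[0.0, -2.0], [-2.0, 0.0]]
b = [1.0, 1.0]
u = [0.1, -0.1]                      # a small asymmetry breaks the tie
energies = []
for _ in range(50):
    energies.append(energy(W, b, [g(ui) for ui in u]))
    u = euler_step(W, b, u, dt=0.1)

v_final = [g(ui) for ui in u]
print(v_final)                       # [1.0, 0.0]
```

In this small case the recorded energies never increase and settle at -1, the value at the corner state. With a discrete step this is not guaranteed in general, which is one reason the step size matters in practice.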
Although the theory is developed assuming a sigmoidal function, it turns out that the above piecewise linear function works just as well, and it is more efficient to evaluate. Therefore, in the interest of speed, it was used in the experiments throughout. From our experiments, we observe that it takes fewer iterations to converge to a solution with this activation function, and the quality of the results is not adversely affected.

From the computer simulations, we find that the initial inputs to the neurons affect the quality of the results significantly. If the neurons are initialized in a way consistent with the $c_3$ term of the DTW energy equation (13), better results are achieved. To avoid the symmetric stuck conditions [4], noise is added to the inputs; hence the inputs are uniformly distributed random variables in the range

$$u_0 - 0.1\,\delta u \le u_{xi} \le u_0 + 0.1\,\delta u, \qquad (18)$$

where $\delta u$ is the noise, which is uniformly distributed in [0, 1], and $u_0 = 1/N - 0.5$. The above initial values are used for the neurons that reside inside the parallelogram defined by the path constraints that are addressed in Section 3.1. The outside neurons are clamped to zero because of the same constraints. Furthermore, the neurons at the origin and at $n = N-1$, $m = N-1$ are clamped to 1 because of the endpoint constraints. During the operation of the network these neurons always have these fixed states and force other neurons to acquire better final states.

Another important factor is the step size $\Delta t$, which is used in the iterative solution of equation (16). A step size of 0.02 is used in our experiments, and this value seems to be sufficient with $N = 10$, $N$ being the number of signal samples that are matched. The number of iterations needed to reach a solution increases with smaller step sizes, while the quality of the solutions does not improve.
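The initialization can be sketched as follows. Several details are assumptions rather than statements from the text: the value u0 = 1/N - 0.5 (which gives each free neuron an initial output near 1/N under the piecewise linear activation), the symmetric noise range around u0, the rounding of the parallelogram bounds, and the representation of a clamped neuron by an input of -0.5 (output 0) or +0.5 (output 1). The function name and the `seed` parameter are hypothetical.

```python
import random


def initial_inputs(N, seed=0):
    """Initial neuron inputs u[x][i] (row x = m, column i = n) and a clamp table."""
    rng = random.Random(seed)
    u0 = 1.0 / N - 0.5
    u = [[-0.5] * N for _ in range(N)]            # outside the parallelogram: output 0
    clamped = [[True] * N for _ in range(N)]
    for i in range(N):
        m_low = max(i // 2, 2 * i - (N - 1), 0)   # assumed rounding of the bounds
        m_high = min(2 * i, (i + N) // 2, N - 1)
        for x in range(m_low, m_high + 1):
            du = rng.random()                     # noise amplitude in [0, 1)
            u[x][i] = rng.uniform(u0 - 0.1 * du, u0 + 0.1 * du)
            clamped[x][i] = False
    for corner in (0, N - 1):                     # endpoint constraints
        u[corner][corner] = 0.5                   # output 1
        clamped[corner][corner] = True
    return u, clamped


u, clamped = initial_inputs(10)
free = sum(not c for row in clamped for c in row)
print(free)   # 38
```

With N = 10 the sketch leaves 38 free neurons: the 40 nodes inside the parallelogram minus the two clamped endpoints. A simulation loop would skip the update for every entry flagged in `clamped`.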
4.1 Performance Measurement with Random Signals
To evaluate the performance of the network, uniformly distributed random reference and test signals are generated. From these signals a distance matrix $d$ is produced using equation (11). The distances are normalized to the unit square. Using $d$, the optimal warping path corresponding to the global minimum total distance and the path with the global maximum distance are determined by going through all of the possible paths within the parallelogram, as shown in Figure 3. Then the DTW Hopfield network is employed to find the optimal path. A distance measure is defined to compare the results as follows:

$$d_{GM} = \frac{\mathrm{min}_{NN} - \mathrm{min}_{G}}{\mathrm{max}_{G} - \mathrm{min}_{G}} \times 100, \qquad (19)$$

where $\mathrm{min}_{G}$ and $\mathrm{max}_{G}$ are the global minimum and maximum distances corresponding to the best and worst warping paths that are found by means of the traditional DTW using exhaustive search, and $\mathrm{min}_{NN}$ is the distance corresponding to the path found by the neural network. The quantity $d_{GM}$ is the percentage of the distance to the global minimum and represents the independent variable on the horizontal axis in Figure 5 and Figure 7. The y-axis denotes the number of occurrences out of 500 runs.

Two tests are run to measure the performance of the DTW Hopfield network with the constraint coefficients $c_1 = 13.8$, $c_2 = 13.8$, $c_3 = 4.5$, $c_4 = 6.3$, $c_5 = 1.0$. In the first test, the objective function coefficient is taken as $c_0 = 2.0$. Then the same test is repeated with a more dominant objective function coefficient, $c_0 = 4.0$, to demonstrate its impact on the solution validity and quality. Figure 5 shows the test results for the first set of coefficients. It is seen that the network converged to a valid solution 96% of the time. Figure 6 displays the corresponding iteration histogram. With the second set of coefficients the results summarized in Figures 7 and 8 are obtained. Using this set, the DTW Hopfield network reaches a valid solution 72% of the time, but the quality of the paths found is superior to that in the prior case. The reason for this is that while the constraint coefficients enforce the validity of the warping path, the objective function coefficient $c_0$ competes with them to minimize the total distance associated with the path. Thus, the quality of the DTW path can be improved by increasing the value of $c_0$, but this results in more frequent invalid paths.

For both cases, the network converges to a valid solution in fewer than 50 iterations, and the results achieved show that the network is capable of matching the reference and test signals effectively.
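Scoring a single run involves two steps: deciding whether the final neuron states form a valid path, and, if so, computing the distance percentage to the global minimum. The sketch below takes "valid" to mean the conditions that the constraint components are described as enforcing in Section 3.2 (one active neuron per column, N active neurons in total, slopes of 0, 1, or 2, and no two successive zero slopes); that reading, the function names, and the 0.5 output threshold are assumptions.

```python
def decode_states(v, threshold=0.5):
    """Turn an N x N array of neuron outputs v[x][i] (row x = m, column i = n)
    into a warping path w, or return None when the state is not a valid path."""
    N = len(v)
    active = [[out > threshold for out in row] for row in v]
    if sum(sum(row) for row in active) != N:       # N active neurons in total
        return None
    w = []
    for i in range(N):
        rows = [x for x in range(N) if active[x][i]]
        if len(rows) != 1:                         # exactly one per column
            return None
        w.append(rows[0])
    if w[0] != 0 or w[-1] != N - 1:                # endpoint constraints
        return None
    for i in range(1, N):
        step = w[i] - w[i - 1]
        if step not in (0, 1, 2):                  # admissible slopes
            return None
        if step == 0 and i >= 2 and w[i - 1] == w[i - 2]:
            return None                            # two successive zero slopes
    return w


def distance_percentage(min_nn, min_g, max_g):
    """Distance percentage to the global minimum: 0 is the best valid path,
    100 is the worst valid path."""
    if max_g == min_g:
        raise ValueError("all valid paths have the same total distance")
    return (min_nn - min_g) / (max_g - min_g) * 100.0


# The state that encodes the path of Table 2, and hypothetical distances
# chosen only to illustrate the scale of the percentage.
path = [0, 1, 1, 3, 4, 5]
v = [[1.0 if path[i] == x else 0.0 for i in range(6)] for x in range(6)]
print(decode_states(v))                                        # [0, 1, 1, 3, 4, 5]
print(distance_percentage(min_nn=2.5, min_g=2.0, max_g=4.0))   # 25.0
```

Counting the runs for which `decode_states` returns a path gives the validity rate, and a histogram of `distance_percentage` over those runs corresponds to the kind of plot described for Figures 5 and 7.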
FIGURE 5. Performance measurement results with coefficients c0 = 2.0, c1 = 13.8, c2 = 13.8, c3 = 4.5, c4 = 6.3, c5 = 1.0. Horizontal axis: distance percentage to global minimum.

FIGURE 6. Iteration histogram with coefficients c0 = 2.0, c1 = 13.8, c2 = 13.8, c3 = 4.5, c4 = 6.3, c5 = 1.0. Horizontal axis: number of iterations.

FIGURE 7. Performance measurement results with coefficients c0 = 4.0, c1 = 13.8, c2 = 13.8, c3 = 4.5, c4 = 6.3, c5 = 1.0. Horizontal axis: distance percentage to global minimum.

FIGURE 8. Iteration histogram with coefficients c0 = 4.0, c1 = 13.8, c2 = 13.8, c3 = 4.5, c4 = 6.3, c5 = 1.0. Horizontal axis: number of iterations.

4.2 Comparisons with Direct Template Matching

The purpose of this experiment is to demonstrate the superiority of the pattern matching performed by the DTW Hopfield network over ordinary direct template matching. First, direct template matching is applied to the reference signal r and the test signals t1 and t2, which are shown in Figures 9 and 10. Then, the experiment is carried out with the energy function coefficients c0 = 4.0, c1 = 13.8, c2 = 13.8, c3 = 4.5, c4 = 6.3, c5 = 1.0 using the DTW Hopfield network. The results are shown in Tables 3 through 7. Table 3 displays the samples of the signals, the total distances, and the local distances between the samples of the reference and the test signals. The absolute difference distance metric is used to calculate the local distances and the distance matrices. The resultant distance matrices are given in Tables 4 and 6. According to the direct template matching results in Table 3, the test signal t2 is more similar to the reference signal r than the test signal t1, since its total distance to r is smaller. Figures 11 and 12 illustrate the effect of DTW on the test signals, and the corresponding warping functions are displayed in Tables 5 and 7. After warping, the total distance for t1 (1.93, Table 5) is smaller than that for t2 (3.77, Table 7), which reverses the ranking obtained by direct matching. As the results show clearly, the DTW Hopfield network can compare signals intelligently and achieve better results than the ordinary direct template matching.

  n    r(n)    t1(n)    t2(n)    d(r(n), t1(n))    d(r(n), t2(n))
  0    20.0    14.0     15.0          6.0               5.0
  1    15.0     3.0     15.0         12.0               0.0
  2     5.0     1.0     12.5          4.0               7.5
  3     0.0     4.0     10.0          4.0              10.0
  4     4.9    11.0     10.0          6.1               5.1
  5    14.9    13.0     10.0          1.9               4.9
  6    20.0    14.0     10.0          6.0              10.0
  7    15.0    19.0      7.5          4.0               7.5
  8     5.0    16.0      5.0         11.0               0.0
  9     0.0     7.0      5.0          7.0               5.0
  Total distances                    62.0              55.0

TABLE 3. Results of the direct template matching.
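For comparison, the same signals can be aligned with a conventional dynamic programming DTW. The sketch below is not the Hopfield network implementation: it is a plain dynamic program under the endpoint and Itakura constraints, with a zero-slope first step allowed, and it uses the raw absolute differences rather than the normalized distances of Tables 4 and 6, so its totals are not directly comparable to those reported for Tables 5 and 7. The global parallelogram is not imposed separately here, on the assumption that the endpoints and the local constraints already confine the path to a region of that shape.

```python
INF = float("inf")


def dtw_itakura(r, t):
    """Minimum total distance and path m = w(n) under the endpoint and Itakura
    constraints.

    best[n][m][z] is the cheapest cost of reaching node (n, m), where z = 1
    means the last step had slope 0, so the next step must not be 0 again.
    """
    N = len(r)
    best = [[[INF, INF] for _ in range(N)] for _ in range(N)]
    back = [[[None, None] for _ in range(N)] for _ in range(N)]
    best[0][0][0] = abs(r[0] - t[0])
    for n in range(1, N):
        for m in range(N):
            d = abs(r[n] - t[m])
            for step in (0, 1, 2):
                pm = m - step
                if pm < 0:
                    continue
                z = 1 if step == 0 else 0
                # A zero step may only follow a non-zero step (pz == 0).
                for pz in ((0,) if step == 0 else (0, 1)):
                    cost = best[n - 1][pm][pz] + d
                    if cost < best[n][m][z]:
                        best[n][m][z] = cost
                        back[n][m][z] = (pm, pz)
    z = 0 if best[N - 1][N - 1][0] <= best[N - 1][N - 1][1] else 1
    total, m, path = best[N - 1][N - 1][z], N - 1, []
    for n in range(N - 1, -1, -1):                 # trace the path backwards
        path.append(m)
        if n > 0:
            m, z = back[n][m][z]
    return total, path[::-1]


# Signals from Table 3.
r = [20.0, 15.0, 5.0, 0.0, 4.9, 14.9, 20.0, 15.0, 5.0, 0.0]
t1 = [14.0, 3.0, 1.0, 4.0, 11.0, 13.0, 14.0, 19.0, 16.0, 7.0]
t2 = [15.0, 15.0, 12.5, 10.0, 10.0, 10.0, 10.0, 7.5, 5.0, 5.0]

for name, t in (("t1", t1), ("t2", t2)):
    direct = sum(abs(a - b) for a, b in zip(r, t))
    warped, path = dtw_itakura(r, t)
    print(name, round(direct, 1), round(warped, 1), path)
```

Because the diagonal path w(n) = n is itself admissible, the warped total can never exceed the direct template matching total for the same pair of signals.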
FIGURE 9. The reference signal r and the test signal t1, plotted against n. Signal r is marked by △, and t1 is marked by ▽.

FIGURE 10. The reference signal r and the test signal t2, plotted against n. Signal r is marked by △, and t2 is marked by □.

  t1 \ r    0     1     2     3     4     5     6     7     8     9
    9     1.41  1.41  1.41  1.41  1.41  1.41  1.41  1.41  0.17  0.52
    8     1.41  1.41  1.41  1.41  1.41  1.41  0.22  0.13  0.84  1.41
    7     1.41  1.41  1.41  1.41  1.06  0.35  0.00  0.35  1.06  1.41
    6     1.41  1.41  1.41  1.04  0.69  0.02  0.37  0.02  1.41  1.41
    5     1.41  1.41  1.41  0.97  0.61  0.09  0.45  0.09  1.41  1.41
    4     1.41  1.41  0.47  0.82  0.47  0.24  0.60  1.41  1.41  1.41
    3     1.41  1.41  0.06  0.30  0.06  0.76  1.12  1.41  1.41  1.41
    2     1.41  0.99  0.28  0.07  0.28  0.99  1.41  1.41  1.41  1.41
    1     1.41  0.84  0.13  0.22  1.41  1.41  1.41  1.41  1.41  1.41
    0     0.37  0.02  1.41  1.41  1.41  1.41  1.41  1.41  1.41  1.41

TABLE 4. Local distance matrix for r and t1. The distances are normalized to the unit square.

  x \ i     0     1     2     3     4     5     6     7     8     9
    9     0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  1.00  1.00
    8     0.00  0.00  0.00  0.00  0.00  0.00  0.00  1.00  0.00  0.00
    7     0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
    6     0.00  0.00  0.00  0.00  0.00  0.00  1.00  0.00  0.00  0.00
    5     0.00  0.00  0.00  0.00  0.00  1.00  0.00  0.00  0.00  0.00
    4     0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
    3     0.00  0.00  0.00  0.00  1.00  0.00  0.00  0.00  0.00  0.00
    2     0.00  0.00  0.00  1.00  0.00  0.00  0.00  0.00  0.00  0.00
    1     0.00  0.00  1.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
    0     1.00  1.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00

TABLE 5. The final neuron states v(x,i) for r and t1 matching. The total distance corresponding to this path is 1.93. The DTW Hopfield network reached this state in 14 iterations.

  t2 \ r    0     1     2     3     4     5     6     7     8     9
    9     1.41  1.41  1.41  1.41  1.41  1.41  1.41  1.41  0.12  0.47
    8     1.41  1.41  1.41  1.41  1.41  1.41  0.94  0.59  0.12  1.41
    7     1.41  1.41  1.41  1.41  0.35  0.35  0.71  0.35  0.35  1.41
    6     1.41  1.41  1.41  0.94  0.59  0.12  0.47  0.12  1.41  1.41
    5     1.41  1.41  1.41  0.94  0.59  0.12  0.47  0.12  1.41  1.41
    4     1.41  1.41  0.59  0.94  0.59  0.12  0.47  1.41  1.41  1.41
    3     1.41  1.41  0.59  0.94  0.59  0.12  0.47  1.41  1.41  1.41
    2     1.41  0.12  0.82  1.18  0.82  0.12  1.41  1.41  1.41  1.41
    1     1.41  0.35  1.06  1.41  1.41  1.41  1.41  1.41  1.41  1.41
    0     0.00  0.35  1.41  1.41  1.41  1.41  1.41  1.41  1.41  1.41

TABLE 6. Local distance matrix for r and t2. The distances are normalized to the unit square.

  x \ i     0     1     2     3     4     5     6     7     8     9
    9     0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  1.00  1.00
    8     0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
    7     0.00  0.00  0.00  0.00  0.00  0.00  0.00  1.00  0.00  0.00
    6     0.00  0.00  0.00  0.00  0.00  0.00  1.00  0.00  0.00  0.00
    5     0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
    4     0.00  0.00  0.00  0.00  1.00  1.00  0.00  0.00  0.00  0.00
    3     0.00  0.00  1.00  1.00  0.00  0.00  0.00  0.00  0.00  0.00
    2     0.00  1.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
    1     0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
    0     1.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00

TABLE 7. The final neuron states v(x,i) for r and t2 matching. The total distance corresponding to this path is 3.77. The DTW Hopfield network reached this state in 20 iterations.

FIGURE 11. The reference signal r and the warped test signal t1; r is marked by △, t1 is marked by ▽.

FIGURE 12. The reference signal r and the warped test signal t2; r is marked by △, t2 is marked by □.

5 Conclusions

The main objective of this study is to show that the Hopfield network can be used to implement the DTW algorithm, which compares two signals to accomplish the best match under certain constraints. The results obtained in Section 4 verify that the method proposed is a good candidate for this purpose, and the DTW Hopfield network has the capability to make fast and intelligent comparisons in the pattern matching phase of a pattern recognition process. The procedure proposed in Section 2 provides a methodical approach to solve optimization problems using the Hopfield network. Most of the steps in this procedure are straightforward, except the neural network representation and the definition of the energy function. There can be more than one valid neural network representation and energy function for a given problem. In some applications, definition of the energy function may not require any work at all. In others, the neural representation may be quite intricate.

The energy function defined in equation (13) is neither unique nor claimed to be the best energy function for the DTW problem. It may be possible to find better energy functions. Combining some of the constraint components and/or incorporating them into the objective function would reduce the number of energy function coefficients. But then it would not be possible to control the effects of these components independently. It should be noted that the components of the energy function compete and cooperate with each other, while the neural network descends with the Liapunov function, as dictated by the energy function, toward a stable minimum energy state. The energy function coefficients $c_0$ through $c_5$ define the characteristics of this falling motion. There is a delicate balance among these components, which are weighted by the energy function coefficients. It would be interesting to study the effects of changing the energy function coefficients dynamically (as a function of energy) as the neural network evolves toward a solution state. This could help the DTW Hopfield network reach a lower minimum with a faster convergence rate.

An advantage of the DTW Hopfield network over the traditional DTW algorithm is that the pattern matching time is independent of the pattern size. It should be noted that the time required for pattern matching with the conventional DTW is directly proportional to the number of possible valid paths within the parallelogram. The size of the parallelogram, and consequently the number of paths inside it, is a function of the pattern size. On the other hand, for the DTW Hopfield network, the number of iterations does not increase as the size and the complexity of the problem grow. Only the number of neurons and their connections expand.
This aspect of the DTW Hopfield network was tested for patterns with sizes up to ten, and the number of iterations to converge to a solution was in the neighborhood of twenty, as given in Section 4.1. Twenty iterations would be completed within an elapsed time of a few characteristic time constants of the network (the decay time of a neuron), which is on the order of microseconds [1, 6]. The effect of the objective function (relative to the constraint components) could be reduced by calibrating the energy coefficients if maintaining a valid result has a higher priority than the quality of the solution. In this study, a piecewise linear activation function was utilized instead of the standard hyperbolic tangent function to improve the simulation time. This did not adversely affect the performance of the DTW Hopfield network. The activation function could be digitized, and further advantages could be gained in the implementation of the neural network with digital VLSI (Very Large Scale Integration) technology. The significance of this study is that for the first time, we have shown that the DTW algorithm can be realized using a Hopfield network. As the results in Section 4 confirm, the DTW Hopfield network is capable of comparing signals intelligently and can be employed in the pattern matching phase of a pattern recognition system.
6 References

[1] S. Bhama and M. H. Hassoun, "Continuous Hopfield computational network: hardware implementation," Int. J. Electronics, vol. 69, no. 5, pp. 603-612, 1990.
[2] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis. Wiley, New York, 1973.
[3] J. J. Hopfield, "Neurons with graded response have collective computational properties like those of two-state neurons," Proc. Natl. Acad. Sci. U.S.A., vol. 81, pp. 3088-3092, May 1984.
[4] J. J. Hopfield and D. W. Tank, "Neural computation of decisions in optimization problems," Biol. Cybernet., vol. 52, pp. 141-152, 1985.
[5] F. Itakura, "Minimum prediction residual principle applied to speech recognition," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-23, pp. 67-72, February 1975.
[6] B. Kamgar-Parsi, J. A. Gualtieri, J. E. Devaney and B. Kamgar-Parsi, "Clustering with neural networks," Biol. Cybernet., vol. 63, pp. 201-208, 1990.
[7] K. C. Lee, N. Funabiki and Y. Takefuji, "A parallel improvement algorithm for the bipartite subgraph problem," IEEE Trans. Neural Networks, vol. 3, no. 1, pp. 139-145, January 1992.
[8] S. E. Levinson and D. B. Roe, "A perspective on speech recognition," IEEE Commun. Mag., pp. 28-34, January 1990.
[9] C. Myers, L. R. Rabiner and A. E. Rosenberg, "Performance tradeoffs in dynamic time warping algorithms for isolated word recognition," IEEE Trans. Acoust., Speech, Signal Process., vol. 28, pp. 623-635, December 1980.
[10] J. M. Naik, "Speaker verification: A tutorial," IEEE Commun. Mag., pp. 42-48, January 1990.
[11] D. O'Shaughnessy, "Speaker recognition," IEEE ASSP Mag., pp. 4-17, October 1986.
[12] H. Sakoe and S. Chiba, "Dynamic programming algorithm optimization for spoken word recognition," IEEE Trans. Acoust., Speech, Signal Process., vol. 26, pp. 43-49, February 1978.
[13] H. F. Silverman and D. P. Morgan, "The application of dynamic programming to connected speech recognition," IEEE ASSP Mag., pp. 6-25, July 1990.
[14] M. Takeda and J. W. Goodman, "Neural networks for computation: number representations and programming complexity," Appl. Opt., vol. 25, pp. 3033-3046, September 1986.
[15] Y. Takefuji and K. C. Lee, "Artificial neural networks for four-coloring map problems and K-colorability problems," IEEE Trans. Circuits Syst., vol. 38, pp. 326-333, March 1991.
[16] F. A. Unal, "Pattern matching using an artificial neural network," Doctoral Dissertation, Electrical and Computer Engineering, Florida Institute of Technology, Melbourne, FL, December 1992.
[17] F. A. Unal and N. Tepedelenlioglu, "Dynamic time warping using an artificial neural network," IJCNN'92 Proceedings, vol. 4, pp. 715-721, Baltimore MD, June 7-11, 1992.
Chapter 4
Patterns of Dynamic Activity and Timing in Neural Network Processing
Judith E. Dayhoff
Peter J. Palmadesso
Fred Richards
Daw-Tung Lin
ABSTRACT This chapter addresses topics on the dynamic behavior of neural networks as they oscillate and produce specific timing patterns in their activity. A network of simple processing units is capable of producing prolonged self-sustained oscillations and even chaotic behavior. Modulation of a controlled parameter causes the temporal dynamics to increase in complexity until chaos is reached. An external stimulus (a pattern) can be applied to a chaotic network, resulting in a simpler, limit cycle attractor, which can be recognized in a pattern-to-oscillation map. Since random networks tend to have only one observed dynamic attractor, we have designed a weight perturbation schedule to develop multiple dynamic attractors from different initial states of the network. The result is to create different basins of attraction for different patterns or pattern groups. We can observe a tremendous flexibility not only in evoked attractors (usually oscillations) but in their basins of attraction, the collections of states that lead to the same attractor. Attractor training has been done in networks with time-delay mechanisms, where an a priori chosen dynamic attractor can be trained into the network. A comparison to temporal processing in biological systems is discussed.
1
Introduction
Humans and most animals are extraordinarily capable in the temporal domain. We operate with self-sustained activity, generate oscillating patterns such as walking and swimming, think and imagine for chosen periods of time, and recognize sensed patterns in spite of temporal shifts and time-warping. This extensive set of temporal capabilities is supported by a neural system architecture with a built-in capability to operate in the temporal domain. From which anatomical and physiological structures do these temporal properties arise, and what are their operational properties? One can imagine time delays, recurrent loops, and firing threshold dynamics all created and modulated by cell anatomy and complex biochemical relationships, but the questions remain of how the underlying architectural components cause and modulate temporal dynamics and temporal processing and how the temporal processing allows for complex time-dependent behaviors.
Dynamic neural net architectures are capable of producing prolonged self-sustained activity, with changes in activations continuing across the network over long periods of time. Variations in activity may go on indefinitely, even in the absence of external stimuli. Recurrent loops in the neural connections contribute to such self-sustained activity. Networks with closed loop connections are possible with only a single layer, or they may have multiple layers, but they must depart from the traditional feedforward structure that does not allow closed-circuit connections. The resulting dynamic networks can be modulated to move among different modes of self-sustained activity, including simple oscillations, limit cycles, transients to dynamic attractors, and chaos.
Previous investigators have shown that chaos tends to be inevitable in large models of neural networks in which interconnection weights are random and asymmetric ($w_{ij} \neq w_{ji}$) [SCS88]. Alternative paths to chaos were identified for a progression of activity patterns that started with fixed-point stable states and bifurcated to chaotic oscillations [DCQS93] [CDQS94]. Even when interconnections were sparse, the same types of behavior resulted; the network did not have to be fully connected. Thus, self-sustained activity can be produced and modulated in neural networks with uncorrelated weights and with a realistic density of interconnections.
There is a rich potential for computational paradigms that could take advantage of this self-sustained activity, temporal dynamics, and dynamic attractors in neural networks. This computational potential is completely untapped in static neural networks, where feedforward architectures are used, or in recurrent configurations that are allowed to relax to a single stable state, a nondynamic fixed-point attractor. With a dynamic network paradigm, different initial states would evoke different end-state oscillations, or different external inputs would modulate those oscillations. Thus we have ways of obtaining a pattern-to-oscillation map that could potentially be used for pattern classification, representation, and identification. Important issues remain to be resolved about dynamic neural networks before they can be fully understood and fully developed. Issues include how the weights and interconnection topology determine the dynamic self-sustained activity of the network. The number of attractors in the resulting network, the boundaries of the basins of attraction, and the training of
attractors into the networks are also important for the development of computational paradigms. Most artificial neural network models applied today are powerful pattern recognition tools but do not have interesting temporal activity. Traditional feedforward networks are static, processing fixed patterns as inputs (vectors) and producing fixed patterns, one at a time, as output (other vectors). These static networks are powerful because they can be taught a number of pattern pairs simultaneously, and they imprint the memory and pattern mapping capability into the weight matrix. But their ability is only to map one fixed pattern to another at one time; temporal patterns are not spontaneously generated, learned, or associated. The complex dynamics generated by dynamic networks have yet to be fully tapped for learning and computational purposes. Pattern recognition, associative memory, signal generation, trajectory production for robotics, and time series prediction are a few of the cogent applications that are natural for dynamic networks. An abundance of temporally varying signals can be found in a multitude of application domains. Models of biological systems and even consciousness have been proposed to involve dynamic neural networks [Day94]. The ability of an advanced organism to remain conscious in the presence or absence of external stimuli points to some kind of self-sustained activity occurring within the brain. Conscious experience involves a sequence of events that the organism experiences; somehow the sustained neural activity enables this sequence to be experienced and often remembered. Simulation studies have been aimed at understanding how neural activity can be sustained and what types of prolonged activity patterns arise. Ultimately, we would like to understand how self-sustaining neural circuits could support the ongoing activities and memory formation involved in conscious experience.
Since biological sensorimotor systems are so impressive in their spatiotemporal abilities, an examination of how biological systems may be modeled as dynamic networks is warranted. To capture and eventually understand biological capabilities, we must identify the temporal components and structures that make their dynamic behavior possible. In addition to recurrent, closed-loop configurations that lead to limit cycles and chaos, there are anatomical and physiological structures that are intriguingly temporal in nature. Time delays and their adaptation, variations in thresholds over time, and the spacing of action potentials in nerve impulse trains name a few of the most important structures. Trains of action potentials spaced over time offer an order of intricacy and complexity not available in most artificial networks, which have a nonpulsed structure. Impulse trains carry the communication between nerve cells in living systems and, taken together, must be responsible for representation of information and processing in the brain. Temporal patterns and coincidences between multiple impulse trains may have special significance in coding and processing schemes.
In this chapter we consider the development of dynamic behavior in neural networks and show how even a network of simple processing units, inspired by real neurons, is capable of producing oscillations and chaos. Section 2 illustrates how the movement of a controlled parameter causes modulation of the degree and type of temporal dynamics in the network's activity and describes dynamic neural networks that have prolonged self-sustained activity arising from dynamic attractors that oscillate. Different dynamic modes can be developed and controlled in such a neural network, and even a simple architecture with random weights can be forced to develop chaos from a fixed-point attractor (a "stable state" of the network). Section 3 shows how an external stimulus (a pattern) applied to a chaotic network can lock the network into a limit cycle attractor. Unique patterns then can evoke different limit cycles. Section 4 addresses networks with multiple dynamic attractors. Since random networks tend to have only one dynamic attractor, we have designed a weight perturbation schedule to develop multiple dynamic attractors. In Section 5 we analyze the tremendous flexibility in the basins of attraction, the collections of states that lead to the same oscillation, or attractor. Dynamic networks can have high capacity for attractors and for differing basin boundaries. Section 6 shows attractor training in networks with time-delay mechanisms. The trained weights give the network an a priori chosen dynamic attractor. Time delays are components in biological systems, as are action potentials and impulse trains. Section 7 discusses the diversity of roles that impulse timing could play in the temporal dynamics of biological neural systems, and Section 8 concludes this overview of temporal mechanisms by discussing the impact of dynamic activity patterns on neural network processing.
2
Dynamic Networks
Dynamic neural networks have an extensive armamentarium of behaviors, including dynamic attractors (finite-state oscillations, limit cycles, and chaos) as well as fixed-point attractors (stable states) and the transients that arise between attractor states. The transitions that occur from one neural state to another while a network is in a dynamic attractor comprise self-sustained activity. A wide variety of such activity is possible, with each attractor having its own dynamic pattern of changing activations. In addition, each attractor has a basin of attraction (a set of states that lead to that attractor), and tremendous variations occur in the boundaries of the basins of attraction and in the transients that lead to each attractor. We have explored a method of developing dynamic attractors in a neural network and of modulating the network into a chaotic state. The neural
units were simple biologically inspired units, each performing a weighted sum followed by a nonlinear squashing function. Thus we did not use the approach of building oscillators and chaotic components into a network to ensure the presence of dynamics; rather, we allowed the dynamics to develop naturally as a result of the network's processing units and interconnections. We have examined a variety of paths from single fixed-point attractors to chaos. Although we use the approach of Doyon et al. [DCQS93], who showed progressions for 128-neuron networks, we show here observations of the dynamics of networks with 64 neurons, which appear to have more variability. The neural networks were single-layer networks with recurrent connections, where reciprocal connections did not have to be the same (i.e., $w_{ij} \neq w_{ji}$ was allowed), and continuous-valued activations were allowed. The networks were fully connected or sparsely connected. Activations were determined by

$$a_j(t+1) = f\Big(\sum_{i=1}^{N} g\, w_{ji}\, a_i(t)\Big), \tag{1}$$

where $a_j(t)$ = activation of unit $j$ at time $t$, $w_{ji}$ = weight to unit $j$ from unit $i$, $N$ = the number of processing units, the function $f$ is a squashing function, and $g$ is a multiplier. We have used a symmetric sigmoid function

$$f(x) = \Big(\frac{1}{1 + e^{-x}} - 0.5\Big) \cdot 2.0, \tag{2}$$

which allows activation values to vary from -1.0 to 1.0. The parameter $g$ is a multiplier for the weights and can be set to any value greater than zero. Both the interconnections and weights were chosen at random. Networks were denoted as $(N, K)$, where $N$ was the total number of processing units, and $K$ the number of incoming interconnections for each unit. The $K$ units that sent interconnections to each unit were selected at random, and the values of the weights were selected from a uniform random distribution $[-1/K, 1/K]$. The parameter $g$ has two interpretations. The first is as a multiplier for all weights. Thus, the original set of weights becomes amplified or deamplified depending on whether $g > 1$ or $g < 1$. In this interpretation, the incoming sum for neuron $j$ is

$$S_j = \sum_{i} (g\, w_{ji})\, a_i, \tag{3}$$

and the modulated weight is $g\, w_{ji}$. The neuron then performs the squashing function to determine its next activation value:

$$a_j(t+1) = f(S_j). \tag{4}$$
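The construction of an $(N, K)$ network and the update rule in (1)-(4) can be illustrated with a minimal sketch. This is not the authors' simulation code. The function names, the random seed, the random initial state, and the choice to let a unit receive a connection from itself are all assumptions made for illustration.

```python
import numpy as np

def symmetric_sigmoid(x):
    """Equation (2): squashes any real x into the open interval (-1, 1)."""
    return (1.0 / (1.0 + np.exp(-x)) - 0.5) * 2.0

def random_network(n_units, k_inputs, rng):
    """Build W with W[j, i] = weight to unit j from unit i.

    Each unit receives exactly k_inputs connections from randomly chosen
    units (self-connections are allowed here, which is an assumption),
    with weights drawn uniformly from [-1/K, 1/K].
    """
    weights = np.zeros((n_units, n_units))
    for j in range(n_units):
        senders = rng.choice(n_units, size=k_inputs, replace=False)
        weights[j, senders] = rng.uniform(-1.0 / k_inputs, 1.0 / k_inputs, size=k_inputs)
    return weights

def step(activations, weights, g):
    """Equation (1): a_j(t+1) = f(sum_i g * w_ji * a_i(t))."""
    return symmetric_sigmoid(g * (weights @ activations))

rng = np.random.default_rng(0)          # arbitrary seed, for repeatability only
W = random_network(64, 16, rng)         # a (64, 16) network
a = rng.uniform(-1.0, 1.0, size=64)     # hypothetical random initial state
for _ in range(200):
    a = step(a, W, g=0.9)
print(float(np.abs(a).max()))           # largest activation magnitude after 200 steps
```

With $g = 0.9$ the activations in this sketch decay toward the fixed point at the origin, because each row of $gW$ has absolute sum at most $g$ and the sigmoid slope never exceeds 0.5.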
FIGURE 1. Symmetric sigmoid function $f(x)$, with the line $y = x$. (Plot title: Sigmoid, $g = 1$.)

The second interpretation for $g$ is as a scaling of the x-axis in the squashing function. Organizing equation (1) differently, we get

$$R_j = \sum_{i=1}^{N} w_{ji}\, a_i(t), \tag{5}$$

where $R_j$ is now the incoming sum for unit $j$ and

$$a_j(t+1) = f_g(R_j), \tag{6}$$

where

$$f_g(x) = f(gx) \tag{7}$$
is a sigmoid squashing function that is rescaled with respect to the x-axis. Thus the weight is not modulated by $g$, but instead, the horizontal scale of the sigmoid is modulated by $g$. Assume that $f$ is the symmetric sigmoid function given in (2), the most commonly used squashing function. Then Figure 1 shows the sigmoid function, with asymptotes at -1 and 1, and with $x = 0$ as its horizontal midpoint. Note that the maximum slope of the sigmoid function is 0.5, since $f'(0) = 0.5$. This slope is less than one, the slope of the line $y = x$, which is also shown in Figure 1. Now consider the function $f_g(x)$, the sigmoid modified by the multiplier $g$ (as in (7)). Figures 1-4 plot $f_g(x)$ for different values of $g$. The maximum slope of the modified sigmoid is $f_g'(0) = g/2$. When $g > 2$, this slope exceeds one, so the curve rises more steeply than 45 degrees at $x = 0$, which causes two pockets with positive area to form between the 45-degree line and the modified sigmoid. In Figure 2, $g = 5$, which causes pockets to form. In Figure 3, $g = 10$, which causes larger pockets to form. When $g < 2$, the sigmoid becomes more flattened, and no pockets of positive area form (Figure 4). The pockets induce more complex dynamics for the activity of the individual neurons and of the entire network, which is consistent with the fact that the parameter $g$ has been shown to be key to producing chaos in large random networks. In large networks of neurons, for example when $(N, K) = (64, 16)$, small values of $g$ (usually $g < 1$) yield a single fixed-point attractor. Increasing $g$ eventually gives limit cycle attractors and chaos.

FIGURE 2. Modified sigmoid function $f_g(x)$, with $g = 5$.

FIGURE 3. Modified sigmoid with $g = 10$.

Consider a model of a single neuron with a recurrent loop, as in Figure 5. Assume that the activation of this unit is $a(t)$ at time $t$. Using the second interpretation given above for $g$,

$$a(t+1) = f_g(w\, a(t)), \tag{8}$$

where $w$ is the weight on the recurrent loop. Suppose $w = 1$. Then

$$a(t+1) = f_g(a(t)). \tag{9}$$
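The recurrence in (9) can be iterated directly to see the role of the threshold $g = 2$. The sketch below is illustrative only. It relies on the identity $f(x) = \tanh(x/2)$, which follows from (2), so that $f_g'(0) = g/2$. The function names, starting values, and the number of steps are assumptions chosen for the example.

```python
import math

def f(x):
    """Symmetric sigmoid of equation (2); algebraically equal to tanh(x / 2)."""
    return (1.0 / (1.0 + math.exp(-x)) - 0.5) * 2.0

def orbit(a0, g, n_steps, w=1.0):
    """Iterate a(t+1) = f_g(w * a(t)) = f(g * w * a(t)), equations (7)-(8)."""
    values = [a0]
    for _ in range(n_steps):
        values.append(f(g * w * values[-1]))
    return values

# g = 1 (slope 0.5 at the origin) and g = 5 (slope 2.5 at the origin).
for g, a0 in [(1.0, 1.95), (1.0, -1.8), (5.0, 0.1), (5.0, -0.2)]:
    final = orbit(a0, g, 100)[-1]
    print(f"g = {g}, a(0) = {a0}: a(100) = {final:.6f}")
```

For $g = 1$ both orbits shrink toward zero. For $g = 5$ the orbit started at a positive value settles on the positive nonzero fixed point of $a = f_g(a)$, and the orbit started at a negative value settles on its mirror image.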
FIGURE 4. Modified sigmoid with $g = 0.8$.

FIGURE 5. A single processing unit with a recurrent loop.

If $g < 2$, then Figures 1 and 4 apply. If $g > 2$, then Figures 2 and 3 apply. Figure 6 shows a map of the sequential activation states of the neuron in Figure 5 for a case where $g < 2$, with the modified sigmoid $f_g$. The 45° line $y = x$ is also shown, as well as the orbits of two points, $x_a = 1.95$ and $x_b = -1.8$. Paths to successive values for $(x, f_g(x))$ and $(f_g(x), f_g(x))$ are shown, followed by paths to $(f_g^{n}(x), f_g^{n+1}(x))$ and $(f_g^{n+1}(x), f_g^{n+1}(x))$ for successive values of $n$ ($n = 1, 2, 3, \ldots$). This graphical analysis of the orbits of $x$ illustrates the presence of a single attracting fixed point, $f_g(0) = 0$, as each of the initial points moves along its path to (0,0). Figure 7 similarly shows a graphical analysis of the orbits of $x$ for $f_g(x)$ when $g = 5$. The orbits of two points, $x = 0.1$ and $x = -0.2$, are shown. There are three intersections between $y = x$ and $f_g(x)$, each a fixed point. The point at the origin is repelling, and the other two points are attracting.

FIGURE 6. The sigmoid $f(x)$. Paths to its single attractor are shown. (Plot title: Sigmoid, $g = 1$.)

FIGURE 7. Modified sigmoid for $g = 5$. Paths to its two attractors are shown. (Plot title: Compressed Sigmoid, $g = 5$.)

A network with 64 units, each with 16 incoming connections, (64,16), was constructed with random initial weights [PD95]. Transitions from fixed-point attractors to chaotic attractors were observed as $g$ was increased starting from values below 1.0. Figure 8 shows such a progression, from a fixed-point attractor to a chaotic attractor, with average activation $a(t)$ graphed against $a(t+1)$, to form a map of the dynamics. Use of the average activation in the plot was chosen to project the many dimensions (64 activation levels) to a single measured observation over time. For low $g$ ($g = 0.9$), a single fixed-point attractor was observed, shown in Figure 8(a). This graph has a single point at (0,0). When $g$ was increased to 1.0, a limit cycle appeared (Figure 8(b)). A limit cycle appears as a dense set of points along one or more closed loops. When $g$ increased to 1.1, a protrusion appeared at the two tips of the limit cycle graph (Figure 8(c)). When $g$ was increased to 1.18, asymmetric changes occurred in the two corners (Figure 8(d)). In Figure 8(e), $g$ was increased to 1.2, and the limit cycle appears like scribbling in a closed loop. When $g = 1.3$ (Figure 8(f)), the network is locked into a 2-state oscillation, and when $g$ is increased to 1.5, a bifurcation occurs to form a limit cycle that appears as 2 rings (Figure 8(g)). When $g = 1.69$, a locking occurs into a 14-state oscillator (Figure 8(h)), and each point becomes a ring when $g = 1.7$ (Figure 8(i)). When $g$ is increased to 1.8 (Figure 8(j)) and 2.1 (Figure 8(k)), chaotic behavior is observed. This type of exploration illustrates the tremendous range and complexity of dynamic attractors and the ability to exert some control over their appearance, through varying the single parameter $g$. Although Doyon et al. [DCQS93] identified the value of $g$ at which the first bifurcation occurs, and showed four distinct paths to chaos in 128-neuron networks, they studied larger networks than are examined here. In our work, we made observations on smaller networks, with 64 neurons, and found more elaborate paths to chaos, such as the path shown in Figure 8. In many cases the transitions between limit cycles and the transitions to chaotic activity happened rapidly.
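The projection used for these maps, pairs of successive average activations collected after transients have passed, can be sketched as follows. This is an illustration under stated assumptions, not a reproduction of Figure 8: the helper name `return_map` is hypothetical, and the two-unit weight matrix is a hand-made example chosen because its behavior is easy to verify, not a network from the study. Which regime a given $g$ produces depends on the weights, and the sketch makes no claim about that for random networks.

```python
import numpy as np

def symmetric_sigmoid(x):
    """Equation (2)."""
    return (1.0 / (1.0 + np.exp(-x)) - 0.5) * 2.0

def return_map(weights, g, a0, n_transient, n_points):
    """Return n_points rows (mean a(t), mean a(t+1)) after discarding transients."""
    a = np.asarray(a0, dtype=float)
    for _ in range(n_transient):
        a = symmetric_sigmoid(g * (weights @ a))
    means = [a.mean()]
    for _ in range(n_points):
        a = symmetric_sigmoid(g * (weights @ a))
        means.append(a.mean())
    means = np.array(means)
    return np.column_stack([means[:-1], means[1:]])

# Hypothetical two-unit network in which each unit drives the other with opposite sign.
W_demo = np.array([[0.0, -1.0],
                   [1.0, 0.0]])
points = return_map(W_demo, g=5.0, a0=[0.5, 0.2], n_transient=200, n_points=40)
distinct = {(round(float(x), 6) + 0.0, round(float(y), 6) + 0.0) for x, y in points}
print(len(distinct))   # number of distinct points visited on the map
```

A fixed point shows up as a single point on such a map, an n-state oscillation as a finite set of points, and a limit cycle or chaotic attractor as a dense set. The demonstration network settles into a 4-state oscillation, so its map contains four distinct points.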
Figure 9(a) shows a limit cycle at g = 1.880 that changes to a qualitatively different limit cycle when g = 1.889 (Figure 9(c)), with a frequency locking in between (Figure 9(b)) at g = 1.8865. Only small changes in g were required at these transition points. In some cases, increasing the number of iterations calculated allowed activity that appeared chaotic to resolve into a limit cycle. Often, the number of transients in these cases is too large to be practical to implement. Thus, for practical use to be made of dynamic neural networks, a chaotic response would be considered to be behavior that appeared to be chaotic in a limited predefined time frame.
3
Chaotic Attractors and Attractor Locking
We have considered the problem of how to utilize the attractors in dynamic neural networks to perform pattern recognition and classification. To solve this problem, a pattern communicated to the network must change the dynamic activity in an observable fashion. The attractor that the network enters is observable, through graphing and projecting as in Figure 8, and could ultimately be identified with an automatic matching algorithm. Different attractors appear different. We have a potential pattern classification device when different patterns produce different attractors in a dynamic neural network and when similar patterns produce similar attractors.

FIGURE 8. Progression from fixed point to chaos in a random (64,16) network. The horizontal axis is average activation at time $t+1$, $a(t+1)$, and the vertical axis is average activation at time $t$, $a(t)$. (a) $g = 0.9$; (b) $g = 1.0$; (c) $g = 1.1$; (d) $g = 1.18$; (e) $g = 1.2$; (f) $g = 1.3$; (g) $g = 1.5$; (h) $g = 1.67$; (i) $g = 1.7$; (j) $g = 1.8$; (k) $g = 2.1$.

FIGURE 9. A quick transition between limit cycles. (a) $g = 1.880$; (b) $g = 1.8865$; (c) $g = 1.889$. The horizontal axis is average activation at time $t+1$, $a(t+1)$, and the vertical axis is average activation at time $t$, $a(t)$.

We initially explored whether different initial patterns would produce different attractors in the random networks described above. The initial state of the network was set so that neuron activation levels matched the pattern vector to be classified ($a_i(0) \leftarrow e_i$). Thus a pattern $E = (e_1, e_2, \ldots, e_N)$ became the initial state of the network. The network was then updated by (1) and (2) for a thousand or more iterations, to pass transients. The resulting attractor was then observed. In our networks with random weights (64 nodes), usually only one attractor was observed, which was reached from a wide variety of initial states. Sometimes, there were two limit cycle attractors, but they were symmetric with one another, having a 180-degree rotational symmetry about the origin. Figure 10 shows a pair of symmetric limit cycle attractors, drawn separately in Figures 10(a) and 10(b). In this case, each initial condition resulted in one of the two different but symmetric limit cycles. Although different initial conditions could evoke different (but symmetric) limit cycles in this case, this scenario does not offer enough flexibility to discriminate patterns in general.
FIGURE 10. A pair of symmetric limit cycle attractors, shown in panels (a) and (b). The horizontal axis is average activation at time $t+1$, $a(t+1)$, and the vertical axis is average activation at time $t$, $a(t)$.

A few times, different attractors were observed from different initial states during our simulations of networks with random weights. This circumstance was very rare among our observations, and the parameter $g$ was tuned precisely near a bifurcation point to produce this effect. Then sometimes different initial states led to attractors that were different in appearance. Figure 11 shows such an example. Figure 11(a) shows an 8-state oscillator. This occurs at $g = 1.72$ in a (64,16) random network. Figure 11(b) shows the same network, except that a different initial state was selected at random. The 8 points appear to have bifurcated into 8 rings, which are asymmetric about the origin. A third random initial state evoked the limit cycle in Figure 11(c), which is symmetric to that in Figure 11(b). These results do not resolve how to classify patterns in random networks when the pattern to be classified is presented as the initial state of the network. To classify an external pattern by means of evoking different dynamic attractors, we explored a different approach for presenting a pattern to the network, in which the pattern is treated as an external stimulus [PD95], [QDS95]. To include an external stimulus, the updating equation (1) can be modified as follows:

$$a_j(t+1) = f\Big(\sum_{i=1}^{N} g\, w_{ji}\, a_i(t)\Big) + \alpha e_j, \tag{10}$$
where $E = (e_1, e_2, \ldots, e_N)$ is the external input, and $\alpha$ the strength of the external pattern. The input $E$ is then applied at every time step, and its strength is modulated by the multiplier $\alpha$, which is usually fixed over time. The vector $E$ can then be considered as the pattern to be classified. A network is initially put in a chaotic oscillation, as described in Section 2, by increasing parameter $g$ until the network reaches chaotic behavior. The chaotic net does not have an external stimulus, and it updates by (1). Typically, we do not increase $g$ more than is necessary to produce chaotic behavior, so the network can be said to be at the "edge of chaos." An external input $E$ is then applied to the chaotic net, and the network uses (10) to update. The externally applied input often "locks" the chaotic network into a simpler attractor, usually a limit cycle. We call this scenario "attractor locking". The attractor that results depends on the characteristics of the externally applied pattern and the network weights. The attractor does not depend on the state of the net at which the external input was first applied; the external input applied at any state during the chaotic oscillation was observed to yield the same attractor. Thus, in our observation, the chaotic behavior can continue for any amount of time without the external input, and regardless of when $E$ is applied, the same dynamic attractor appears. Figure 12 shows a progression in which a chaotically behaving network (Figure 12(a)) receives an external stimulus pattern (Figure 12(b)). The application of the same pattern at an increased strength is shown in Figure 12(c). Figure 13 shows the result of applying a different pattern as external stimulus to the chaotic network used in Figure 12(a). The resulting limit cycle is clearly different from those in Figures 12(b)-(c). 
The points scattered outside of the closed figure in Figure 13 are transients that occurred after the external pattern was applied and before the limit cycle was entered. These transients demonstrate how few transients (about 25) were needed to transfer the network into the new limit cycle. Transients were stripped in Figure 12.

FIGURE 11. A case where different initial states lead to different attractors, from a random (64,16) network. (a) An 8-state oscillator; (b) A limit cycle with 8 closed loops; (c) A limit cycle symmetric to part (b). The horizontal axis is average activation at time $t+1$, $a(t+1)$, and the vertical axis is average activation at time $t$, $a(t)$.

It is expected that a wide range of different pattern inputs can evoke unique dynamic attractors. With suitable sensitivity studies we will be able to learn the extent to which similar patterns, when applied as external stimuli, evoke the same attractor. Thus there is the potential for performing pattern classification with attractor locking of dynamic neural networks. The classification of the pattern would be read off as the attractor evoked when the pattern is applied to the network. There is some potential for gaining superior classification from dynamic neural networks because dynamical systems can have complicated basin boundaries. The attractor locking scenario may represent a paradigm that could occur biologically. Freeman has suggested that neural circuits of the olfactory system operate in a chaotic mode in the absence of a stimulus, and that when a stimulus is applied, the network enters a simpler oscillation (a "wing" of the chaotic attractor), whereby recognition or classification occurs [YF90]. The idea of a "wing" of the attractor is analogous to the limit cycle evoked by the applied pattern in the attractor locking scenarios described above. It is also possible that classification occurs very quickly when a network transitions from an initial chaotic state [YFBY91]. A fast locking into a limit cycle attractor was observed in Figure 13, where the small numbers of transients are shown in addition to the limit cycle.
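The attractor-locking procedure (free running under (1), then driven running under (10)) can be sketched as below. This is an illustrative reading of equation (10) with the stimulus term added outside the squashing function. The function names, the network size, the seed, and the stimulus values are assumptions, and the sketch does not claim that this particular network is chaotic or that it locks.

```python
import numpy as np

def symmetric_sigmoid(x):
    """Equation (2)."""
    return (1.0 / (1.0 + np.exp(-x)) - 0.5) * 2.0

def step_with_stimulus(a, weights, g, pattern, alpha):
    """Equation (10): a_j(t+1) = f(sum_i g * w_ji * a_i(t)) + alpha * e_j."""
    return symmetric_sigmoid(g * (weights @ a)) + alpha * pattern

def run_locking(weights, g, pattern, alpha, a0, n_free, n_driven):
    """Run n_free steps with no input (alpha = 0), then n_driven steps with input.

    Returns the average activation recorded at every driven step.
    """
    a = np.asarray(a0, dtype=float)
    for _ in range(n_free):
        a = step_with_stimulus(a, weights, g, pattern, 0.0)
    trace = []
    for _ in range(n_driven):
        a = step_with_stimulus(a, weights, g, pattern, alpha)
        trace.append(a.mean())
    return np.array(trace)

rng = np.random.default_rng(2)                  # arbitrary seed
W = rng.uniform(-1.0, 1.0, size=(8, 8))         # hypothetical small dense network
E = rng.uniform(-1.0, 1.0, size=8)              # hypothetical external pattern
trace = run_locking(W, g=2.05, pattern=E, alpha=1.1,
                    a0=rng.uniform(-1.0, 1.0, size=8), n_free=100, n_driven=50)
print(trace[-5:])
```

Plotting successive entries of `trace` against each other gives the same kind of map as in Figures 12 and 13. Because the stimulus is added after squashing, driven activations in this sketch can fall outside the interval from -1 to 1.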
4
Developing Multiple Attractors
Multiple attractors were developed in a dynamic network by a weight modification scheme dependent on past performance. The goal was to develop more than one dynamic attractor in the neural network and to constrain the basins of the attractors so that a set of chosen initial states would be forced into different basins. The chosen states would not have to occur in the attractor of its basin but must be somewhere in the basin. With this scheme, we could find weights that produce multiple dynamic attractors that could be accessed from different initial states. A pattern classification scheme in which a pattern is imposed as an initial state to the network would require different initial states to lead to different attractors. A static version of the same approach would be the Hopfield associative memory, where initial states of the network (vector patterns) evoke a memory state (a stable state) [Hop82], [Hop84]. This paradigm has limited capacity, and memory states must become fixed-point attractors of the network. More capacity and flexibility were sought by modifying the approach in two ways: (1) to use dynamic attractors in addition to allowing fixed-point attractors and (2) no longer to require that a memory pattern be a part of the attractor.

FIGURE 12. An attractor locking progression. (a) Chaotic state of a random network with no external pattern applied. Parameter $g$ has been raised just enough to be at the "edge of chaos" ($g = 2.05$). (b) An external stimulus pattern has been applied, with strength multiplier 1.1. (c) The same external pattern is applied at a stronger strength, with multiplier 1.15. The horizontal axis is average activation at time $t+1$, $a(t+1)$, and the vertical axis is average activation at time $t$, $a(t)$.

FIGURE 13. Application of a second external pattern to the same chaotic network as in Figure 12. Transients are shown as well as the limit cycle. Strength multiplier was 1.3.

The neural network weights were changed according to a perturbation schedule that depends on previous success in developing new attractors [DPR94]. The networks were rewarded for increasing the number of attractors by keeping the altered weights. First, a set of $n_B$ states, $B = \{B_1, B_2, \ldots, B_{n_B}\}$, is chosen. The goal is then to develop a network such that $B_i$, when used as an initial state, gives a different attractor compared with any $B_j$ ($i \neq j$). The results of the perturbation analysis depend on the set of initial states $B_i$ used. This set should include a representative sample of initial states needed for an application. In our test cases we generated initial states from a uniform random distribution $[-1, 1]$. The network was initialized with small random weights. Throughout the perturbation schedule, small random numbers were added to randomly selected subsets of weights in the network. At each iteration a set of $q$ weights $\{w_1, w_2, \ldots, w_q\}$ was selected at random and perturbed as

$$w_i \leftarrow w_i + m\, \epsilon_i, \tag{11}$$
where e^ wa^ a random variable from a uniform distribution and m was a multiplier. The new network was then tested with the set of initial states Bi {i = 1,2, ...,725). For each initial state, the network was iterated forwards past transients, using (1) and (2), until an attractor or final set of states was reached and the attractor was classified as fixed-point ("order 0"), an n-cycle oscillator ("order n"), or a final state with no observed repeats. Performance was then evaluated according to the number of distinct attractors. If performance was better with the perturbed weights, then the new weights were saved. Otherwise, the perturbations were discarded. Dur-
4. Patterns of Dynamic Activity and Timing
123
FIGURE 14. Development of multiple attractors in a 10-neuron network is subjected to 10,000 perturbations. There axe twenty pattern vectors in the set of initial states B. The number of attractors observed in each iteration is plotted as a function of iteration. ing the perturbation schedule, values of q and m that tended to increase the number of attractors were explored. Preliminary explorations showed that particular ranges of each of these values were auspicious for increasing the attractors in a network. Figure 14 shows results from a 10-neuron network subjected to 10,000 perturbations. The set of initial states had 100 patterns, chosen at random from a uniform distribution [—1,1]. Figure 14 shows the number of attractors found on each iteration. At each iteration, the best network was saved. Figure 15 shows the number of attractors in the best network as a function of iteration. In the best network, after 10,000 iterations, 82 attractors were observed. Most were 2-state oscillators, but many were fixed-points. To exploit the rich dynamic behavior of neural networks for computational purposes, we must first be able to build dynamic attractors into a network through weight adjustment. Although previous work demonstrated transitions from fixed-point to chaotic attractors [DCQS93], it is necessary to exploit how multiple distinct attractors could be developed and accessed through unique initial states. Here we used the number of attractors as the performance criterion when adjusting weights in a recurrent neural network. Attractors could be fixed-point or n-cycle oscillators in this experiment. We showed that a perturbation schedule could increase the number of attractors in a network and could organize their basins so that a set of a priori specified patterns were in distinct basins. The number of attractors developed could easily exceed the capacity of the Hopfield network associative memory. 
In the example above, ten neurons had twenty distinct attractors, whereas the Hopfield associative memory has a limit of about 0.15n memories (n the number of neurons) [MPRV87]. The dynamic networks used
here offer a wide repertoire of differing attractors and basins. These networks, with multiple dynamic attractors, demonstrate capacity that could ultimately be tapped for engineering tasks, where the attractors could represent pattern classes, memories, optimization solutions, or control actions. Different initial states would drive the network into these different final attractors, and a set of a priori given states could be preselected to be in the same or differing basins of attraction.

FIGURE 15. The maximum number of attractors observed so far, at each iteration, is graphed as a function of iteration. This data is from the same simulation as Figure 14, with 10,000 perturbations on a 10-neuron network.
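The perturbation schedule of this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used for Figures 14 and 15: the update rule `step` uses tanh of the weighted sum as a stand-in for equations (1) and (2), the function names are hypothetical, and a fixed point is reported here as a period-1 orbit rather than "order 0".

```python
import math
import random


def step(weights, state):
    """One synchronous update. tanh of the weighted sum is an ASSUMED
    stand-in for the chapter's update equations (1) and (2)."""
    return [math.tanh(sum(w * s for w, s in zip(row, state))) for row in weights]


def attractor_signature(weights, state, transient=150, max_period=6, tol=1e-6):
    """Iterate past transients, then look for a repeat within max_period steps.
    Returns (period, rounded orbit points) or None if no repeat is observed."""
    for _ in range(transient):
        state = step(weights, state)
    orbit = [state]
    for _ in range(max_period):
        orbit.append(step(weights, orbit[-1]))
    for period in range(1, max_period + 1):
        if all(abs(a - b) < tol for a, b in zip(orbit[0], orbit[period])):
            # Sort the orbit points so the signature does not depend on phase.
            points = sorted(tuple(round(x, 3) for x in p) for p in orbit[:period])
            return period, tuple(points)
    return None


def count_attractors(weights, initial_states):
    """Number of distinct attractors reached from the chosen initial states."""
    found = {attractor_signature(weights, s) for s in initial_states}
    found.discard(None)  # final states with no observed repeat are not counted
    return len(found)


def develop_attractors(n, initial_states, iterations, q, m, seed=0):
    """Perturb q randomly chosen weights per iteration and keep the change only
    if the attractor count increases. Returns (weights, history of best count)."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n)] for _ in range(n)]
    pairs = [(i, j) for i in range(n) for j in range(n)]
    best = count_attractors(weights, initial_states)
    history = [best]
    for _ in range(iterations):
        trial = [row[:] for row in weights]
        for i, j in rng.sample(pairs, q):
            trial[i][j] += m * rng.uniform(-1.0, 1.0)  # w <- w + m * epsilon
        score = count_attractors(trial, initial_states)
        if score > best:  # otherwise the perturbation is discarded
            weights, best = trial, score
        history.append(best)
    return weights, history
```

Because a perturbation is kept only when the count strictly increases, the returned history is non-decreasing, which is the quantity graphed in Figure 15. Rounding the orbit to three decimals is a crude way to decide that two attractors are the same; a more careful comparison would be needed near rounding boundaries.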
5 Attractor Basins and Dynamic Binary Networks
To develop the use of dynamic attractors in computational paradigms, it is helpful to characterize the wide repertoire of attractors and basins of attraction that can be generated by neural networks. The neural network's capacity for attractors and for differing basins of attraction must be assessed. Different neural networks, with different numbers of neurons and differing weights, are expected to have different numbers and types of attractors as well as different basin boundaries between those attractors. Here we show results of studying basins of attraction in recurrent binary networks. Binary neural networks were proposed and studied in earlier work [Ama72a] [Ama72b] [AM88] [Ami89] [Dem89] [HTV94] [Koh74] [VPPS90], but here we emphasize the flexibility possible in attaining different sets of basins of attraction.

Figures 16-19 show network transition graphs (NT-graphs) for neural networks with three neurons. An NT-graph has as nodes all the states possible for a binary network and as edges all transitions that the network could make from one state to another. The figures have eight nodes, representing the eight ($2^3$) possible binary states of the 3-neuron network. The NT-graph depends on the neural network weights. However, many different weight matrices can yield the same NT-graph. Each NT-graph shows all oscillations and fixed-point attractors, along with their basins of attraction. In Figure 16, there is one oscillator, oscillating among four states, with all eight states in its basin of attraction. In Figure 17, there are two fixed-point attractors and two 3-state oscillators. In Figure 18, there are two fixed-point attractors and one 2-state oscillator, and Figure 19 also has two fixed-point attractors and one 2-state oscillator. A basin class is a set of neural networks that have the same basins. The attractors in those basins and the paths to those attractors, however, may be different. The networks that produce the NT-graph in Figure 18 are in the same basin class as those for Figure 19, as the basins are the same even though the attractors and paths differ.

FIGURE 16. Network transition graph for a neural network with n = 3 processing units. There is one attractor, a 4-state oscillator, and thus one basin. The basin class is /01234567/, where nodes are numbered 0-7 and / delimits basins.

FIGURE 17. Network transition graph for a neural network with n = 3 processing units. There are four attractors, consisting of two fixed points and two 3-state oscillators, and thus four basins, with basin class /0/124/356/7/. Adapted with permission from [DP95].

FIGURE 18. Network transition graph for a neural network with n = 3 processing units. There are three attractors, consisting of two fixed points and one 2-state oscillator, and thus three basins, with basin class /0347/15/26/. Adapted with permission from [DP95].

FIGURE 19. Network transition graph for a neural network with n = 3 processing units. There are three attractors, consisting of two fixed points and one 2-state oscillator, with basin class /0347/15/26/. Its basin class is the same as in Figure 18. Adapted with permission from [DP95].

To study the multiplicity of basin classes in dynamic binary networks, networks were constructed with random weights, and the state-to-state transitions of each network were simulated. The basin class was then computed and compared to basin classes previously observed. Two thousand networks of each size were simulated, with three, four, and five neurons. Figure 20 shows the results, where the number of unique basin classes is plotted as a function of the number of networks simulated. Figure 20(a) applies to networks with three neurons, and the number of basin classes observed rises to 17. For networks with four neurons (Figure 20(b)), the number of basin classes rises steeply to more than 600, with the slope getting smaller, indicating that the number is leveling off. In Figure 20(c), the slope is high for the first 2000 networks, with a new basin class observed for almost every network simulated. More than 1400 basin classes were observed among just 2000 networks.

The "basin class capacity" is the capacity of a set of neural networks to exhibit a variety of different basin classes. Thus binary neural networks have a basin class capacity of over a thousand classes when the network has only five neurons, and the increase in basin class capacity with increasing numbers of neurons appears to be very rapid. Our results indicate that dynamic networks have extremely high basin class capacity, even when we consider binary networks alone [DP95]. This high capacity leads to much flexibility in the way that the basins of attraction divide the set of neural network states. For future applications, adjustment of the basin boundaries could be more important than the attractor in the basin. The basin boundaries determine which attractor the network goes into, and in an application, the attractor could represent the answer, result, or memory recalled from the neural network's computation. The exact nature of the attractor (fixed or oscillating) and its location (particular state(s) involved) can be less important than the basin boundaries. Since we aim eventually to have paradigms that allow adjustment and training of attractor basin boundaries, we have accomplished the first step towards this aim, which is to explore how many sets of basins and basin boundaries are possible with weight adjustments.
FIGURE 20. The number of basin classes found as a function of the number of networks simulated. Two thousand networks of each size (n = 3, 4, and 5) were simulated. Weights were randomly selected from -1 to 1, and reciprocal weights were allowed to be unequal ($w_{ij} \neq w_{ji}$). (a) For n = 3 there is initially a steep rise; then the graph flattens off with little increase in the number of basin classes found. (b) For n = 4, there is an initial steep rise that begins to become less steep by the 2000th network. (c) For n = 5, the initial steepness continues throughout the first 2000 nets, and the flattening out must occur later. Part (c) reprinted with permission from [DP95].
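The basin-class computation of this section can be illustrated with a short Python sketch. It rests on assumptions that the text does not fix: states are taken as +/-1 vectors updated synchronously with a threshold at zero, states are numbered by their bits with neuron 0 as the most significant bit, and all function names are hypothetical. Under a deterministic synchronous update each state has exactly one successor, so the NT-graph is stored as a successor table.

```python
import random


def successor_table(weights):
    """Next state for each of the 2**n binary states under a synchronous
    threshold update. States are +/-1 vectors, numbered by their bits with
    neuron 0 as the most significant bit (an ASSUMED convention)."""
    n = len(weights)
    table = []
    for index in range(2 ** n):
        state = [1 if (index >> (n - 1 - i)) & 1 else -1 for i in range(n)]
        nxt = 0
        for row in weights:
            fires = sum(w * s for w, s in zip(row, state)) > 0
            nxt = (nxt << 1) | int(fires)
        table.append(nxt)
    return table


def basin_partition(table):
    """Group states by the attractor they eventually reach."""
    label = {}  # state -> smallest state on the attractor it leads to
    for start in range(len(table)):
        path, position, s = [], {}, start
        while s not in label and s not in position:
            position[s] = len(path)
            path.append(s)
            s = table[s]
        # Either we ran into an already labelled state or we closed a new cycle.
        rep = label[s] if s in label else min(path[position[s]:])
        for p in path:
            label[p] = rep
    basins = {}
    for s, rep in label.items():
        basins.setdefault(rep, []).append(s)
    return tuple(sorted(tuple(sorted(b)) for b in basins.values()))


def basin_class_string(partition):
    """Format as in the figure captions, e.g. /0/124/356/7/ (readable for n = 3)."""
    return "/" + "/".join("".join(map(str, b)) for b in partition) + "/"


def count_basin_classes(n, trials, seed=0):
    """Number of distinct basin partitions among random networks, with
    weights drawn uniformly from -1 to 1 and no symmetry imposed."""
    rng = random.Random(seed)
    seen = set()
    for _ in range(trials):
        weights = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
        seen.add(basin_partition(successor_table(weights)))
    return len(seen)
```

As a check on the notation, a successor table with two fixed points (states 0 and 7) and two 3-state cycles (1, 2, 4 and 3, 5, 6) yields the string /0/124/356/7/, the basin class quoted for Figure 17. The counts returned by `count_basin_classes` depend on the assumed update rule and need not match those in Figure 20.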
6 Time Delay Mechanisms and Attractor Training
Biological systems have anatomical and physiological mechanisms that force time delays on interconnections, thus providing a way of putting time delays in the neural network's computational circuitry. Propagation of action potentials (APs) from soma to synapse takes time, and speeds and distances vary. The diffusion of chemicals across the synapse takes time, as do postsynaptic potentials (PSPs) and integration of membrane potentials at the neuronal cell body and dendrites. In consideration of these biological mechanisms for time delays, we have used artificial neural networks that incorporate time delays on interconnections. The time-delay neural network (TDNN), proposed by Waibel, can have arbitrary delays on any connection and multiple connections between two units with different delays on each connection [WLH90]. The TDNN is trained by adapting weights. The adaptive time-delay neural network (ATNN) adapts time delays in addition to weights [DD93], [DD91], [LDL92b], [LDL92a]. We have explored how an ATNN with a recurrent loop can be trained to have an a priori specified dynamic attractor, and thus to act as a dynamic network [LDL94], [LDL95], [LDL93]. This ATNN used time delays along all interconnections and adapted both its weights and time delays during training. A connection from the network's output layer was made to the input, so that the results of the output neurons were used as input during the next time step. At each iteration, the network input was a segment of a trajectory, and the network produced the values for the next position along the trajectory. There were two output units and two input units, each pair specifying $x(t)$ and $y(t)$ of the trajectory in the $x,y$-plane. Figure 21 was generated when the network was trained on a circular, closed-loop trajectory.
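The role of delays on interconnections can be made concrete with a small forward-pass sketch in Python. It is an illustration under assumptions rather than the TDNN or ATNN formulation of the cited papers: time is discrete, delays are whole steps, tanh is an assumed activation, and the function names are hypothetical. A unit may receive several connections from the same source, each with its own delay and weight.

```python
import math


def delayed_net_input(history, connections, t):
    """Sum weight * source_output(t - delay) over all incoming connections.

    history[source][k] is the output of unit `source` at step k.
    connections is a list of (source, delay, weight); the same source may
    appear several times with different delays.
    """
    net = 0.0
    for source, delay, weight in connections:
        k = t - delay
        if 0 <= k < len(history[source]):  # skip delays reaching outside the record
            net += weight * history[source][k]
    return net


def unit_output(history, connections, t):
    """Squash the delayed net input; tanh is an assumed choice of activation."""
    return math.tanh(delayed_net_input(history, connections, t))
```

In a TDNN only the weights in `connections` would be trained; in an ATNN the delays would be adapted as well, which this sketch does not attempt.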
Figure 21(a) shows the network during training and Figure 21(b) shows the results after training was completed, when the network learned to generate the circular trajectory correctly. The network was given an initial segment of the circle, and it completed the figure, using its own output as successive inputs. Thus, at each point along the circle, the network took a segment of the circle and predicted the next point. Figure 21(c) shows the noise resilience of the trained network. At the beginning of the experiment, an initial segment was generated with noise added to a portion of the circle. This segment was submitted to the neural network, and surprisingly, the network was still able to generate the circular trajectory, regenerating the circle on the second time around almost perfectly (Figure 21(d)). Figures 21(e) and 21(f) show initial segments that are smaller and larger than the circle, respectively, and the network spirals to generate the trained circle. These results suggest that the trajectory trained into the ATNN network is in fact an attractor of the network. The circular trajectory can be considered a limit cycle attractor because of the repeated sampling of points along the circular figure. Each time around the
FIGURE 21. Trajectory generated by a network trained to produce a circular trajectory. (a) The trajectory generated during training on the circle. (b) The trajectory generated after training, which closely follows the circle. (c) Initial segment is noisy, but the network recovers the circular trajectory. (d) The next circle after (c) is generated almost perfectly. (e) The initial segment is from a smaller circle, and the network's trajectory spirals out to the original circle. (f) The initial segment is from a larger circle, and the network's trajectory spirals in to the original circle. Legend in each panel: target trajectory, reproduced trajectory, starting point. Parts (a) and (b) reprinted with permission from [LDL93]. Parts (e) and (f) reprinted with permission from [LDL95].
circle, a different set of points can be generated so as to fill in the circular drawing. Since the networks always arrived at the trajectory for which they were trained, the initial arcs used were within the basin of attraction for the trained figure, which is the attractor. Once a chosen attractor is trained into a neural network, the network's ability to produce that attractor can be utilized for application purposes. Thus an autonomous or controlled system, such as a robotic arm, a vehicle, or other moving object, could be trained to generate repetitive desired motions, and it could attain this repetitive motion from arbitrary starting trajectories. The starting trajectories would have to be in the basin of attraction for the attractor that is the trained repetitive motion. The results here indicate that the basin of attraction can be quite large, and thus the trained motions would be quite stable and would be able to recover in the face of perturbations.
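The closed-loop arrangement described in this section, in which the network's output is fed back as its next input, can be sketched in Python as follows. The predictor below is not a trained ATNN: it is a hand-written, hypothetical stand-in whose limit cycle is the unit circle, used only to show how a feedback loop started from a smaller or larger arc spirals onto the attractor. The parameters `turn` and `pull` are illustrative assumptions.

```python
import math


def stand_in_predictor(window, turn=0.1, pull=0.2):
    """Hypothetical stand-in for a trained network: from the latest point of
    the window, advance the angle and pull the radius toward 1."""
    x, y = window[-1]
    radius = math.hypot(x, y)
    angle = math.atan2(y, x) + turn
    radius += pull * (1.0 - radius)
    return radius * math.cos(angle), radius * math.sin(angle)


def generate(predict_next, segment, steps, window_size=3):
    """Closed loop: each predicted point is appended and fed back as input."""
    trajectory = list(segment)
    for _ in range(steps):
        trajectory.append(predict_next(trajectory[-window_size:]))
    return trajectory


def arc(radius, count, turn=0.1):
    """Initial segment: `count` points on a circle of the given radius."""
    return [(radius * math.cos(k * turn), radius * math.sin(k * turn))
            for k in range(count)]
```

Calling `generate(stand_in_predictor, arc(0.5, 3), steps=100)` spirals outward to the unit circle, and starting from `arc(1.5, 3)` spirals inward, which mirrors the qualitative behavior in Figures 21(e) and 21(f). With a trained network, `predict_next` would be replaced by the network's forward pass over the recent segment.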
7 Timing of Action Potentials in Impulse Trains
An entirely new realm of possibilities arises when impulses are used to communicate signals between neurons. This construct occurs in biology, where neurons produce action potentials (APs) that travel along axons. These APs are fast waves of depolarization that travel at speeds exceeding other biological mechanisms for communication. Impulse trains are trains of action potentials spaced over time, with varying time intervals between them. The brain thus includes a massively parallel impulse train generator and processor. Simultaneously generated impulse trains can have patterns that are a function of the activity of ensembles of neurons. Patterns and synchronies in these impulse trains furnish important putative codes for information transmission and processing in the brain. Models can incorporate spiking neurons, temporal patterns, or coincidences in the impulse trains, and sometimes attractor states [LAM+96]. Usually, artificial neural models use activation level parameters, which are continuous real-valued numbers that are communicated from one processing unit to another. A naive assumption is that the activation level in neural network models reflects firing rates in biological neural systems. While firing rates appear to play an information encoding role in some biological subsystems, it seems likely that a more complex processing scheme is enabled by the action potentials of neurons, based on a set of computational schemes that goes beyond simple firing rate encoding. Simultaneously recorded nerve impulse trains appear as in Figure 22. Typically, the waveform is the same on each impulse recorded from the same neuron and, as a result, is not expected to carry information. Thus, the placement of impulses in time must represent, process, and carry the
information.

FIGURE 22. Simultaneously recorded nerve impulse trains (simulated data). Traces are labeled cell 1, cell 2, cell 3, cell 4, ..., cell 30.

Temporal patterns have been examined in nerve impulse trains. Favored patterns are firing patterns that repeat in exact or approximate form over an extended period of time (Figure 23). Their occurrences may be placed arbitrarily in time, or they may be periodic, occurring at equal intervals. Methods have been developed for identification of recurring temporal patterns that are statistically significant [DG83a], [DG83b]. These methods overcome the problem that some number of coincidental recurrences are expected at random. According to statistical tests, the methods identify neural recordings in which recurring patterns appear more often than chance would predict. Favored patterns have been found in single unit recordings and in multiple unit recordings [DG83a], [DG83b], [AG88], [FFH90]. This research has shown the presence of favored temporal patterns in neural recordings that include a variety of preparations (crayfish claw, cat visual cortex, cat brainstem). These intriguing results contribute to the accumulating studies and analysis of nerve impulse timing [NZJE96], [Les96], [SZTM96], [JSB97], [RWSB96], [MZ093], [Hop95], [SZ95], [TGK94], [Day87]. Temporal patterns are consistent with models that include dynamic attractors, as oscillating attractors can produce repeating temporal patterns among one or more neurons. A temporal pattern could be elicited in exact or approximate form each time that a section of an oscillating attractor is revisited. In multiple unit recordings, it is cogent to evaluate data for the presence
FIGURE 23. Particular firing patterns re-occur in the nerve impulse trains above, with some variation in interspike interval on each occurrence. The third line shows the pattern at the top occurring with an extra spike. Panel labels: example firing pattern ("word"); occurrence with extra spike; horizontal axis: time. Data was simulated. Reprinted with permission from [Day87]. © 1987 IEEE.
of temporal synchronies (and other patterns) among groups of two or more units. A synchrony would occur when each neuron in a group fires an impulse at approximately the same time. The study of multiunit synchronies is well motivated for the following reasons. Neurons are natural recognizers of synchrony arriving at presynaptic sites, as synchronous stimuli sum more effectively when postsynaptic potential peaks coincide. Synchronous groups can stimulate postsynaptic activity faster than individual neurons. Synchronies play a role in LTP learning, and synchronous groups are consistent with models of neural processing. In addition, synchronous groups can multiplex firing rate codes. Methods for identification of synchronies have been developed, synchronies have been observed in biologically recorded systems, and evidence of ensemble coding has been found [Day95], [GPD85], [LHM+92], [GKES89], [RCF96], [CDSS97], [GSM96]. In a synchrony code representation, an ensemble of near-coincidental firing would represent information or its processing during cognitive tasks. The event of synchronous firing, however, would last only an instant unless repeated. Repetitions could occur at regular periods or irregularly over time. Clearly, the brain has a mechanism to sustain a representation over an arbitrary period of time because we can imagine an image or consider an idea for any chosen length of time. Thus the proposed synchrony code could allow for sustained representations by repetitions of the synchronous firing. Repetitions could in turn be caused by oscillations, or attractors, in the network dynamics. Thus, synchronies are consistent with models of dynamic attractors that oscillate to produce repeated synchronous events. Some models of networks of spiking neurons have shown synchronies, temporal patterns, or oscillations and attractors [PCD96], [MR96], [TH96], [Kel95].
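To make the notion of a multiunit synchrony concrete, the following Python sketch scans a set of impulse trains for moments at which several units fire within a short window. It is a simple counting illustration, not the statistical detection methods of the cited papers: it applies no significance test against chance coincidences, and the function name, the window parameter, and the integer time stamps are assumptions made for the example.

```python
def find_synchronies(trains, window, min_units):
    """Return (start_time, units) for each group of impulses in which at least
    `min_units` different units fire within `window` time units of the first.

    trains[u] is the list of impulse times recorded from unit u.
    """
    events = sorted((t, unit) for unit, train in enumerate(trains) for t in train)
    found = []
    i = 0
    while i < len(events):
        start = events[i][0]
        j = i
        units = set()
        # Collect every impulse that falls within the window opened at `start`.
        while j < len(events) and events[j][0] - start <= window:
            units.add(events[j][1])
            j += 1
        if len(units) >= min_units:
            found.append((start, sorted(units)))
            i = j  # continue after this synchronous group
        else:
            i += 1
    return found
```

For three trains with impulse times `[10, 50, 90]`, `[11, 70]`, and `[12, 91]` (say, in milliseconds) and a window of 3, the impulses at 10, 11, and 12 form one synchrony across all three units. A real analysis would also have to ask how many such coincidences are expected at random given the firing rates.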
8 Discussion
The research described here is motivated by the temporal dynamics of living neural systems and especially by the temporal abilities of humans and higher animals. Our brains can respond to time-varying signals, can generate time-varying patterns, can process information (think) over time, can represent concepts and images mentally for arbitrary intervals of time, and have differing states of ongoing, self-sustained activity (awake, aroused, sleeping). Furthermore, we seem to automatically have time-related skills and dynamics such as recognition of spatiotemporal patterns as they occur; coordination of internal processing in the brain, in spite of no apparent controlling time clock; and the presence of self-sustained dynamic activity in many areas of the brain, through oscillation (e.g., respiratory neurons) or other more complex continuing activity ("spontaneous activity"). This
extensive array of temporal capabilities and time-varying activity points to a temporally dynamic neural network underlying these processes. To date, many neural models show pattern mapping abilities but lack the dynamics and temporal behavior of the systems they are intended to model. We have explored a series of paradigms that concern dynamic activity in neural networks. We have illustrated how a simple model of a neural network can develop dynamic attractors, self-sustained activity, and chaos. Control over parameter $g$, a weight multiplier, allows modulation of the dynamics with a progression from a simple fixed-point attractor to chaos. Once we generate chaotic activity patterns in a neural network, we can apply a stimulus pattern and lock the network into a limit cycle attractor. This scenario poses a potential way to perform pattern recognition and signal classification. Because dynamic systems can have complicated basin boundaries for their attractors, there is reason to expect increased performance and generalization capabilities from this type of approach. Developing multiple attractors in a neural network can be accomplished via an accretional method with weight perturbations. In the resulting network, each of a set of initial states evokes its own attractor. Computational tasks in pattern classification and associative memory could be accomplished through differing initial states evoking differing dynamic attractors. In dynamic binary networks, exploration of attractor basins and the flexibility of those basins of attraction showed capacities for attractors to be considerably higher than the number of memories in the static Hopfield network (0.15n). With as few as five neurons in a dynamic binary network, thousands of basin classes (divisions of patterns into different basins) can be accomplished. To train a specific attractor into a neural network, a neural network with time delays was trained to generate a closed-loop trajectory.
The trained network generates this trajectory in spite of noisy starting conditions, and with differing initial segments. The result is a robust signal and path generator for communications and control applications. Impulse trains add a new dimension of spatiotemporal processing in biological neural systems. Temporal patterns of nerve impulses and synchronies among ensembles of neurons are putative codes for information processing and representation. The firing activity of neurons and neural ensembles could reflect transients and dynamic attractors superimposed on the impulse train structure of biological neural processing. The general problem of recognition and generation of spatiotemporal signals appears solvable with dynamic neural networks, although much research remains to be done. The ability to generate and train self-sustained activity, based on dynamic oscillating attractors, is shown in the preliminary results described here. As biological systems have indisputable power in the temporal domain, we experiment with tapping their mechanisms for artificial systems. Mechanisms that appear in biological systems include time delays, recurrent loops, and the adjustment of synaptic strengths. Our models lead to self-sustained activity, dynamic attractors, and the training of those attractors. Whereas observations of living neural systems catch them in the act of evolving increasingly powerful structures, we are beginning to develop a spectrum of dynamic and temporal neural networks that have far more potential than previous networks. Ultimately, we hope to exploit, in human-made systems, the mechanisms responsible for the power of biological systems in the temporal domain.
9 Acknowledgments
J. Dayhoff was supported by the Naval Research Laboratory (Special Project on Nonlinear Systems and Contract N00014-90K-2010), the National Science Foundation (Grants CDR-88-03012 and BIR9309169), the Institute for Systems Research at the University of Maryland, and the Air Force Office of Scientific Research (Summer Faculty Research Program, Phillips Laboratory, Kirtland Air Force Base). P. Palmadesso and F. Richards acknowledge support from the Office of Naval Research. D.-T. Lin was supported by the Applied Physics Laboratory of Johns Hopkins University. Thanks go to Greg Tarr, Lenore McMackin, Ed Ott, B. Doyon, B. Cessac, Manuel Samuelides, and Ira Schwartz for stimulating discussion on this and related topics.

10 References
[AG88] M. Abeles and G. L. Gerstein. Detecting spatiotemporal firing patterns among simultaneously recorded single neurons. Journal of Neurophysiology, 60(3):909-924, 1988.
[AM88] S.-I. Amari and K. Maginu. Statistical neurodynamics of associative memory. Neural Networks, 1:63-73, 1988.
[Ama72a] S.-I. Amari. Characteristics of random nets of analog neuron-like elements. IEEE Trans. on Systems, Man, and Cybernetics, 2(5):643-657, 1972.
[Ama72b] S.-I. Amari. Learning patterns and pattern sequences by self-organizing nets of threshold elements. IEEE Trans. on Computers, 21(11):1197-1206, 1972.
[Ami89] D. Amit. Modelling Brain Function. Cambridge University Press, Cambridge, U.K., 1989.
[CDQS94] B. Cessac, B. Doyon, M. Quoy, and M. Samuelides. Mean-field equations, bifurcation map and chaos in discrete time neural networks. Physica D, 74:24-44, 1994.
[CDSS97] D. Contreras, A. Destexhe, T. J. Sejnowski, and M. Steriade. Spatiotemporal patterns of spindle oscillations in cortex and thalamus. Journal of Neuroscience, 17(3):1179-1196, 1997.
[Day87] J. E. Dayhoff. Detection of favored patterns in the temporal structure of nerve cell connections. Proceedings First International Conference on Neural Networks, 3:63-77, 1987.
[Day94] J. E. Dayhoff. Artificial neural networks: biological plausibility. Abstracts, Toward a Scientific Basis for Consciousness, University of Arizona, Tucson, Arizona, 1994.
[Day95] J. E. Dayhoff. Synchrony detection in neural assemblies. Biological Cybernetics, 71(3):263-270, 1995.
[DCQS93] B. Doyon, B. Cessac, M. Quoy, and M. Samuelides. Control of the transition of chaos in neural networks with random connectivity. International Journal of Bifurcation and Chaos, 3(2):279-291, 1993.
[DD91] S. P. Day and M. Davenport. Continuous-time temporal back-propagation with adaptive time delays. Neuroprose archive, Ohio State University. Accessible on Internet via anonymous ftp on archive.cis.ohio-state.edu, in pub/neuroprose/day.tempora.ps, August 1991.
[DD93] S. P. Day and M. R. Davenport. Continuous-time temporal back-propagation with adaptive time delays. IEEE Trans. on Neural Networks, 4(2):348-354, March 1993.
[Dem89] A. Dembo. On the capacity of associative memories with linear threshold functions. IEEE Trans. on Information Theory, 35(4):709-720, 1989.
[DG83a] J. E. Dayhoff and G. L. Gerstein. Favored patterns in spike trains. I. Detection. Journal of Neurophysiology, 49(6):1334-1348, June 1983.
[DG83b] J. E. Dayhoff and G. L. Gerstein. Favored patterns in spike trains. II. Application. Journal of Neurophysiology, 49(6):1349-1363, June 1983.
[DP95] J. E. Dayhoff and P. J. Palmadesso. Capacity for basin flexibility in dynamic binary networks. Proceedings of World Congress on Neural Networks (WCNN), 1:365-368, 1995.
[DPR94] J. E. Dayhoff, P. J. Palmadesso, and F. Richards. Developing multiple attractors in a recurrent neural network. Proceedings of World Congress on Neural Networks (WCNN), 4:710-715, 1994.
[FFH90] R. D. Frostig, Z. Frostig, and R. M. Harper. Recurrent discharge patterns in multiple spike trains. Biological Cybernetics, 62:487-493, 1990.
[GKES89] C. M. Gray, P. Konig, A. K. Engel, and W. Singer. Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature, 338:334-337, 1989.
[GPD85] G. L. Gerstein, D. H. Perkel, and J. E. Dayhoff. Cooperative firing activity in simultaneously recorded populations of neurons: Detection and measurement. Journal of Neuroscience, 5(4):881-889, April 1985.
[GSM96] D. M. Gothard, W. E. Skaggs, and B. L. McNaughton. Dynamics of mismatch correction in the hippocampal ensemble code for space: interaction between path integration and environmental cues. Journal of Neuroscience, 16(24):8027-8040, 1996.
[Hop82] J. J. Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, USA, 79, 1982.
[Hop84] J. J. Hopfield. Neurons with graded responses have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, USA, 81, 1984.
[Hop95] J. J. Hopfield. Pattern recognition computation using action potential timing for stimulus representation. Nature, 376:33-36, 1995.
[HTV94] J. Hao, S. Tan, and J. Vandewalle. A new approach to the design of discrete Hopfield associative memories. Journal of Artificial Neural Networks, 1(2):247-266, 1994.
[JSB97] D. Jaeger, E. D. Schutter, and J. M. Bower. The role of synaptic and voltage-gated currents in the control of Purkinje cell spiking: a modeling study. Journal of Neuroscience, 17(1):91-106, 1997.
[Kel95]
J. A. S. Kelso. Dynamic Patterns: The Self-Organization of Brain and Behavior. MIT Press, Cambridge, MA, 1995.
[Koh74]
T. Kohonen. An adaptive associative memory principle. IEEE Trans. on Computers, C-23:444-445, 1974.
[LAM+96]
A. V. Lukashin, B. R. Amirikian, V. L. Mozhaev, G. L. Wilcox, and A. P. Georgopoulos. Modeling motor cortical operations by an attractor network of stochastic neurons. Biological Cybernetics, 74:255-261, 1996.
[LDL92a]
D.-T. Lin, J. E. Dayhoff, and P. A. Ligomenides. Adaptive time-delay neural network for temporal correlation and prediction. In Intelligent Robots and Computer Vision XI: Biological, Neural Net, and 3-D Methods, Proc. SPIE, volume 1826, pages 170-181, Boston, November, 1992.
[LDL92b]
D.-T. Lin, J. E. Dayhoff, and P. A. Ligomenides. A learning algorithm for adaptive time-delays in a temporal neural network. Technical Report SRC-TR-92-59, Systems Research Center, University of Maryland, College Park, MD 20742, May 15 1992.
[LDL93]
D.-T. Lin, J. E. Dayhoff, and P. A. Ligomenides. Learning spatiotemporal topology using an adaptive time-delay neural network. In World Congress on Neural Networks, volume 1, pages 291-294, Portland, OR, 1993. INNS, New York.
[LDL94]
D.-T. Lin, J. E. Dayhoff, and P. A. Ligomenides. Prediction of chaotic time series and resolution of embedding dynamics with the ATNN. In World Congress on Neural Networks, volume 2, pages 231-236, San Diego, CA, 1994. INNS Press, New York.
[LDL95]
D.-T. Lin, J. E. Dayhoff, and P. A. Ligomenides. Trajectory production with the adaptive time-delay neural network. Neural Networks, 8(3):447-461, 1995.
[Les96]
R. Lestienne. Determination of the precision of spike timing in the visual cortex of anaesthetised cats. Biological Cybernetics, 74:55-61, 1996.
[LHM+92]
B. G. Lindsey, Y. M. Hernandez, K. F. Morris, R. Shannon, and G. L. Gerstein. Dynamic reconfiguration of brain stem neural assemblies: respiratory phase-dependent synchrony versus modulation of firing rates. Journal of Neurophysiology, 67:923-930, 1992.
[MPRV87]
R. J. McEliece, E. C. Posner, E. R. Rodemich, and S. S. Venkatesh. The capacity of the Hopfield associative memory. IEEE Trans. on Information Theory, 33:461-482, 1987.
[MR96]
I. Meilijson and E. Ruppin. Optimal firing in sparsely-connected low-activity attractor networks. Biological Cybernetics, 74:479-485, 1996.
[MZO93]
J. W. McClurkin, J. A. Zarbock, and L. M. Optican. Temporal codes for colors, patterns, and memories. Cerebral Cortex, 10:443-467, 1993.
[NZJE96]
H. Napp-Zinn, M. Jansen, and R. Eckmiller. Recognition and tracking of impulse patterns with delay adaptation in biology-inspired pulse processing neural net (bpn) hardware. Biological Cybernetics, 74:449-453, 1996.
[PCD96]
O. Parodi, P. Combe, and J.-C. Ducom. Temporal coding in vision: coding by the spike arrival times leads to oscillations in the case of moving targets. Biological Cybernetics, 74:497-509, 1996.
[PD95]
P. J. Palmadesso and J. E. Dayhoff. Attractor locking in a chaotic network: stimulus patterns evoke limit cycles. Proceedings of World Congress on Neural Networks (WCNN), 1:254-257, 1995.
[QDS95]
M. Quoy, B. Doyon, and M. Samuelides. Dimension reduction by learning in a discrete time chaotic neural network. Proceedings of World Congress on Neural Networks (WCNN), 1:300-303, 1995.
[RCF96]
R. Ratnam, C. J. Condon, and A. S. Feng. Neural ensemble coding of target identity in echolocating bats. Biological Cybernetics, 75:153-162, 1996.
[RWSB96]
F. Rieke, D. Warland, R. D. R. V. Steveninck, and W. Bialek. Spikes: Exploring the Neural Code. MIT Press, Cambridge, MA, 1996.
[SCS88]
H. Sompolinsky, A. Crisanti, and H. J. Sommers. Chaos in random neural networks. Physical Review Letters, 61(3):259-262, 1988.
[SZ95]
C. F. Stevens and A. Zador. Neural coding: The enigma of the brain. Current Biology, 5:1370-1371, 1995.
[SZTM96]
D. Scheuer, J. Zhang, G. M. Toney, and S. W. Mifflin. Temporal processing of aortic nerve evoked activity in the nucleus of the solitary tract. Journal of Neurophysiology, 76(6):3750-3757, 1996.
[TGK94]
D. W. Tank, A. Gelperin, and D. Kleinfeld. Odors, oscillations and waves: Does it all compute? Science, 265:1819-1820, 1994.
[TH96]
P. Tass and H. Hermann. Synchronized oscillations in the visual cortex—a synergetic model. Biological Cybernetics, 74:31-39, 1996.
[VPPS90]
S. S. Venkatesh, G. Pancha, D. Psaltis, and G. Sirat. Shaping attraction basins in neural networks. Neural Networks, 3:613-623, 1990.
[WLH90]
A. Waibel, K. J. Lang, and G. E. Hinton. A time-delay neural network architecture for isolated word recognition. Neural Networks, 3:23-43, 1990.
[YF90]
Y. Yao and W. J. Freeman. Model of biological pattern recognition with spatially chaotic dynamics. Neural Networks, 3(2):153-170, 1990.
[YFBY91]
Y. Yao, W. J. Freeman, B. Burke, and Q. Yang. Pattern recognition by a distributed neural network: an industrial application. Neural Networks, 4:103-121, 1991.
Chapter 5
A Macroscopic Model of Oscillation in Ensembles of Inhibitory and Excitatory Neurons
Joydeep Ghosh
Hung-Jen Chang
Kadir Liano

ABSTRACT Very large networks of neurons can be characterized in a tractable and meaningful way by considering the average, or ensemble behavior, of groups of cells. This paper develops a mathematical model to characterize a homogeneous neural group at a macroscopic level, given a microscopic description of individual cells. This model is then used to study the interaction between two neuron groups. Conditions that lead to oscillatory behavior in both excitatory and inhibitory groups of cells are determined. Using Fourier series analysis, we obtain approximate expressions for the frequency of oscillations of the average input and output activities and quantitatively relate them to other network parameters. Computer simulation results show these frequency estimations to be quite accurate.
Supported by NSF grant ECS-9307632, AFOSR contract F49620-93-1-0307 and ARC contract DAAH04-95-10494.

1 Introduction
Biological neural networks consist of very large numbers of neurons. The human brain has over $10^{11}$ neurons, with an average connectivity in the thousands. Faced with numbers of this magnitude, it is impossible and meaningless to model every single neuron and its interactions with the entire system. To gain insight into the complex functions performed by neural systems, an understanding of the overall network in terms of ensemble, or group, behavior and group interaction is required that is not overwhelmed by the details of individual neurons [Ama72, Ede87]. Such macroscopic
models are useful in studying collective behavior of biological neural systems, and in particular, macroscopic oscillations in cell assemblies.

Oscillatory phenomena have been widely observed in cortical circuits at similar frequencies (in the 35-60 Hz range) and at many different spatial scales. They occur in single neurons, within small (10-100 cells) neural networks, and in large (over 10K cells) networks. Oscillations are considered fundamental to memory storage and temporal association functions in biological organisms [Bow90, vB86, Pav73]. Large-scale rhythmic/oscillatory phenomena are integral to the dynamic timing mechanisms for heartbeat, EEG waves, breathing, walking, and other activities. Recent experiments by Gray and Singer [GS89] and by Traub, Miles, and Wong [TMW89], among others, show oscillations occurring at the level of local populations of cortical neurons. Rhythmic patterns emerge even though single neurons may fire asynchronously. Some of the remarkable characteristics of cortical oscillations are the synchronization of oscillations between spatially disparate cell assemblies, and phase locking [AB89, GS89]. Such experiments, together with previous theoretical investigations [vB86], give credence to the labeling hypothesis, wherein cell assemblies are established through oscillations and labeled by their phase and/or frequency. The labeling hypothesis postulates that neural information processing is intimately related to the temporal relationships between the labels of different populations.

To study and characterize the behavior of large populations of neurons, several researchers have developed macroscopic models that characterize the aggregate population behavior, in a way similar to how statistical mechanics has been used to obtain global quantities like pressure and temperature starting from molecular-level description of gases.
Pioneering research in developing macroscopic models of large neural ensembles has been performed by Amari, who studied characteristics of random networks of threshold logic elements (McCulloch-Pitts formal neurons) [Ama71] and, subsequently, of continuous-time analog neurons [Ama72]. The weights and thresholds in these networks were random variables, and they did not change as the ensemble evolved; i.e., no learning mechanisms were investigated. Using some simplifying assumptions including the stochastic independence among cell membrane potentials, Amari showed that a homogeneous random net is monostable or bistable. Moreover, oscillations could emerge from the interactions between two random nets consisting of excitatory and inhibitory classes of elements, respectively. At about the same time, Wilson and Cowan [WC72] showed the presence of oscillations in coupled subpopulations of inhibitory and excitatory neurons with refractory periods, but with nonadaptive weights. Amari's results were later expressed in a rigorous mathematical framework by Geman [Gem82]. Similarly, Wilson and Cowan's system has been further analyzed by other researchers [Som88], and oscillator models have been developed for specific circuits
such as the visual cortex [SW90]. Notable among recent research along these lines is the work by Atiya and Baldi [AB89], who consider interacting cell assemblies of continuous time analog neurons as well as "integrate-and-fire" type neurons. If the assemblies are arranged in layers, with feedback from the topmost to the bottommost layer (thus resulting in a ring structure), and the number of inhibitory layers is odd, then oscillations arise easily if the cell gains are high enough. Also, provided that the cell time constants are very similar within a layer, all the cells belonging to that layer tend to phase lock in a few time constants. As before, learning mechanisms are not incorporated in their framework.

An alternative approach to obtaining oscillatory phenomena in neural networks is to use a more involved model of the individual cells that results in these cells becoming nonlinear neuronal oscillators by themselves. For example, Kammen, Koch, and Holmes [KKH90] assume a population of neuronal oscillators firing repetitively in response to synaptic input that is purely excitatory. They investigate two basic neuronal network architectures, namely, the linear chain model and comparator model, which incorporate either nearest neighbor or global feedback interactions. They conclude that nonlocal feedback plays a fundamental role in the initial synchronization and dynamic stability of the oscillations. Baird [Bai90] discusses a generic model of oscillating cortex that assumes a minimal coupling structure. The network has explicit excitatory neurons with local inhibitory interneuron feedback that form a set of nonlinear oscillators coupled only by long range excitatory connections. He argues that an oscillatory associative memory function can be realized in such a system by using a local Hebb-like learning rule. Due to the complex characterization of individual cells, it is difficult to obtain a useful macroscopic description using such approaches. In Chang et al.
[CGL92], the authors have presented a macroscopic model for a homogeneous cell assembly wherein each individual cell is an analog neuron whose characteristics are given by a well-known model [Hop84]. This model is distinguished from previous work by the fact that it relates in quite some detail the macroscopic variables to biologically motivated cell parameters, and even more so by the incorporation of adaptive weight dynamics. The latter factor makes it possible to achieve rhythmic patterns in ensemble activity even in an isolated homogeneous cell assembly with no external periodic inputs. The model not only predicts such situations, but it is also able to estimate the frequency of oscillation and indicate how the parameters could be changed to obtain a desired oscillation frequency. Here, we apply a model similar to that developed in Chang et al. [CGL92] to study the interaction of excitatory and inhibitory neuron groups. For simplicity, the weights are not adapted, though this possibility is kept open for future research. The main contribution of this report is to quantify the
situations that lead to stable macroscopic behavior and to estimate the oscillation frequencies. The frequency estimates are observed through simulations to be quite accurate. We begin in the next section by summarizing the macroscopic model. Section 3 applies this model to analyze a system with inhibitory and excitatory neurons. Stability analysis of this system is performed in Section 4, and the frequency of oscillation is estimated in Section 5 using a first-order approximation. Simulation results presented in Section 6 support the mathematical analysis given in previous sections.
2 A Macroscopic Model for Cell Assemblies
2.1 Description of Individual Cells
Macroscopic models for neuronal assemblies depend on the characterization of individual neurons as well as on the network architecture that defines how these neurons interact with one another. The model of individual neurons should be biologically plausible without incorporating details that do not significantly affect macroscopic behavior such as ensemble oscillations. These cells should at least be able to integrate information over time. Thus connectionist-type cells, where the instantaneous output is a linear or sigmoidal function of a weighted sum of inputs at that instant, are too simplistic and clearly inadequate. The next level of complexity is to model a set of $n$ asynchronous cells by $n$ coupled first-order differential equations. A popular generic form is [Cow67, Ama72, GC83, Hop84]:
\[ \tau_i \frac{du_i}{dt} = -u_i + \sum_{k=1}^{n} w_{ik}\, g_k(u_k) + h_i, \qquad 1 \le i \le n, \tag{1} \]
where $u_i$ is the internal state of the $i$th neuron and represents the short-time averaged value of the membrane potential; $\tau_i$ is a time constant; $w_{ik}$ represents the weight, or synaptic strength, from neuron $k$ to $i$; $h_i$ is a threshold; and $g_k(\cdot)$ is the input-output "squashing" cell-activation function for the $k$th neuron. The output can be interpreted as the mean firing rate of the neuron. This system can converge to equilibrium points if the weights are symmetric [GC83, Hop84]; if all weights are positive [Hir89]; or if the weights and thresholds are independent, normally distributed random variables [Ama72]. However, oscillatory behavior has not been established for this system. While the model given by equation (1) does not capture many features of biological neurons, it is quite popular because of its simple nature and because it is easily amenable to hardware implementation [Hop84]. Further detail can be added by considering "integrate-and-fire" type neurons [AB89], by applying cable theory, or by using compartmental models
[KS89]. However, such detailed modeling is found to be overkill when considering large systems of neurons.

FIGURE 1. Microscopic model of a neural cell.

In this article, we use a Hopfield-type neural model, a form of equation (1) in which each cell is represented by an electrical circuit, as depicted in Figure 1. The synaptic strength from cell $k$ to cell $i$ is denoted by a conductance $w_{ik}(t)$, and $R_i$, $C_i$ are the cell membrane resistance and capacitance of cell $i$. Let $\theta_i$ be the threshold for neuron $i$ to fire, $U_i(t) = u_i - \theta_i$ the effective membrane potential, $a_i = 1/(R_i C_i)$ the reciprocal of the cell time constant, $T_{ik} = w_{ik}/C_i$ the effective connection strength from cell $k$ to cell $i$, $I_i(t)$ the effective external current into cell $i$, and $v_i(t)$ the cell output, which can be considered as a short time average of its firing rate. By applying Kirchhoff's current law to the cell input and simple algebraic manipulation, one obtains
\[ \frac{dU_i(t)}{dt} = -a_i \left( U_i(t) + \theta_i \right) + \sum_{k=1}^{n} T_{ik}(t)\, v_k(t) + \frac{I_i(t)}{C_i}. \tag{2} \]
The model may also include a formula for weight adaptation. In this paper, we acknowledge that the weights could be time-varying without explicitly using this fact. A popular choice for the input-output transfer function for individual cells is
\[ v_i(t) = \frac{1}{1 + e^{-2\lambda U_i(t)}} = g[U_i(t)]. \tag{3} \]
Equations (2) and (3) thus characterize the microscopic model.
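As a rough illustration of how equations (2) and (3) together define the microscopic model, the sketch below advances a small network with forward-Euler steps. The function names, the step size, the two-cell example, and the logistic form `1 / (1 + exp(-2 * gain * U))` (evaluated through `tanh` to avoid overflow) are assumptions made for illustration only; `drive[i]` stands for the external term $I_i(t)/C_i$.

```python
import math


def transfer(U, gain):
    """Logistic cell output 1 / (1 + exp(-2*gain*U)), written with tanh for safety."""
    return 0.5 * (1.0 + math.tanh(gain * U))


def euler_step(U, a, theta, T, drive, gain, dt):
    """One forward-Euler step of
    dU_i/dt = -a_i * (U_i + theta_i) + sum_k T[i][k] * v_k + drive_i."""
    v = [transfer(u, gain) for u in U]
    n = len(U)
    U_next = []
    for i in range(n):
        coupling = sum(T[i][k] * v[k] for k in range(n))
        dU = -a[i] * (U[i] + theta[i]) + coupling + drive[i]
        U_next.append(U[i] + dt * dU)
    return U_next


def simulate(U0, a, theta, T, drive, gain=1.0, dt=0.01, steps=2000):
    """Return the state after `steps` Euler steps (hypothetical helper)."""
    U = list(U0)
    for _ in range(steps):
        U = euler_step(U, a, theta, T, drive, gain, dt)
    return U


if __name__ == "__main__":
    # Two uncoupled cells with no external drive relax toward U_i = -theta_i.
    zero_T = [[0.0, 0.0], [0.0, 0.0]]
    print(simulate([1.0, -1.0], [1.0, 1.0], [0.5, -0.2], zero_T, [0.0, 0.0]))
```

With all couplings and drives set to zero, each cell simply decays toward $-\theta_i$ at rate $a_i$, which is a convenient sanity check before nonzero weights are introduced.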
We note that the output of a neuron is a train of action potentials, or spikes, propagated through its axon. An action potential occurs when the membrane potential reaches a particular threshold. At that juncture, the membrane potential is "discharged" to a resting value, and the neuron is prevented from firing for a short refractory period, after which the membrane potential can charge up again. While several detailed models exist for this input-output behavior, it is deemed sufficient, for the purposes of studying macroscopic phenomena, to model the neuron as having a continuous output voltage $v_i$ that is a function of the input voltage $U_i$ at the same instant. If the refractory period is much shorter than the interval between successive firings, then $v_i(t)$ can be interpreted as the short-term average firing rate of the $i$th neuron [Ami89] and be obtained by a smooth function, such as that in equation (3).
2.2 Characterization of Cell Assemblies
Several researchers have studied the properties of large neural networks by aggregating them into interacting groups, or clusters [Ama72, Ede87, Ama90, CG93]. Typically, a group exhibits more homogeneity among its constituent neurons, has higher internal connectivity, and/or is used to represent a particular hypothesis [GH89]. Well-known examples include the neuronal groups of Edelman [Ede87] and the neural clusters used for distortion-invariant pattern matching by von der Malsburg [vdM88]. In this section, we apply methods in statistics to build a macroscopic neural model. Macroscopically, the behavior of a cell assembly can be characterized in terms of the time or ensemble averaged behavior of the homogeneous collection of neurons that populate it. For large assemblies, we can consider an individual cell parameter $x_i(t)$ as an instance of a random process $x(t)$. Thus, for any specific time instant $t$, $x(t)$ is a random variable. The corresponding system-level parameter of interest is the sample mean, $\bar{x}(t)$, or the expectation $\langle x(t) \rangle$. A macroscopic model reduces the system description from a set of $2n$ coupled equations to a few equations involving expectations and standard deviations of these random variables. The Law of Large Numbers shows that the sample average approaches the expectation for large $n$. In the derivations below, the expectation and sample average are interchangeable; i.e.,
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \approx \langle x \rangle \]
and
\[ \frac{d\bar{x}(t)}{dt} = \frac{d}{dt} \left[ \frac{1}{n} \sum_{i=1}^{n} x_i(t) \right] = \frac{1}{n} \sum_{i=1}^{n} \frac{dx_i(t)}{dt} = \overline{\frac{dx_i(t)}{dt}}. \tag{4} \]
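By applying equation (4) to equation (2), we obtain
\[ \frac{d\bar{U}(t)}{dt} = -\overline{a\,U(t)} - \overline{a\,\Theta(t)} + \overline{T(t)\,v(t)}, \tag{5} \]
where
\[ \Theta_i(t) = \theta_i - R_i I_i(t) \]
and
\[ T_i(t) = \sum_{k=1}^{n} T_{ki}(t). \]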
At first glance, the macroscopic equation (5) seems intractable because of the coupled parameters. Fortunately, it seems that when $n$ is large, a knowledge of any one of the random variables would contribute very little information about the other random variables [Gem82]. Based on this insight, Amari introduced the concept of a symmetric, random net to describe networks composed of one homogeneous class of analog elements described by equation (1), in which all the weights, $T_{ij}$, are independent, identically distributed (i.i.d.) random variables. The thresholds are also i.i.d. but subject to a probability distribution different from that for the weights. For such systems, a "local chaos" hypothesis analogous to the one widely used in statistical mechanics has been proposed. This hypothesis assumes that the solutions of individual equations (in our case, equation (2)) within the system are mutually independent. The chaos hypothesis was originally postulated by Rozonoer [Roz69]. Since then, it has been supported by mathematical analysis and by simulations on large systems of randomly coupled equations [Ama72, Gem82], and is key to the development of an elegant macroscopic model of cell assemblies. For our more detailed model of equation (2), the comments made above provide a natural definition of a cell assembly to be a symmetric, random network of homogeneous neurons. This implies that the parameters $a_i$, $C_i^{-1}$, $T_i(t)$, $U_i(t)$, $I_i(t)$, and $\theta_i$ are from independent distributions at any given time instant $t$. The output $v_i(t)$ of course depends on $U_i(t)$ but is independent of the other parameters. For such a system, the coupled parameters $(a_i, U_i)$, $(a_i, \Theta_i)$, $(T_i, v_i)$ in equation (5) can be separated to yield
\[ \frac{d\bar{U}(t)}{dt} = -\bar{a}\,\bar{U}(t) - \bar{a}\,\bar{\Theta}(t) + \bar{T}(t)\,\bar{v}(t). \tag{6} \]
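The step from equation (5) to equation (6) replaces the average of a product by the product of averages, which is justified when the two factors are independent. The following sketch, under assumed and purely illustrative distributions for $a_i$ and $U_i$, measures how small the resulting error is for a large sample; the function name and the chosen distributions are not from the text.

```python
import random


def mean(values):
    """Sample average of a list of numbers."""
    return sum(values) / len(values)


def factorization_gap(n, seed=0):
    """Draw n independent (a_i, U_i) pairs and return
    |mean(a*U) - mean(a)*mean(U)|, the error made by separating the average."""
    rng = random.Random(seed)
    a = [rng.uniform(0.5, 1.5) for _ in range(n)]   # assumed spread of a_i
    U = [rng.gauss(0.3, 1.0) for _ in range(n)]     # assumed spread of U_i
    joint = mean([ai * ui for ai, ui in zip(a, U)])
    return abs(joint - mean(a) * mean(U))


if __name__ == "__main__":
    for n in (100, 10000, 100000):
        print(n, factorization_gap(n))
```

The gap is the magnitude of the sample covariance of the two factors, so for independent draws it shrinks roughly like $1/\sqrt{n}$, which is the sense in which the separation becomes accurate for large assemblies.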
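As presented in Chang et al. [CGL92], we know that the standard deviation of $U_i$, $\sigma_U(t)$, approaches some constant $\sigma$ as $t \to \infty$, where $\sigma_U^2(t) = \langle (U - \langle U \rangle)^2 \rangle \approx \frac{1}{n} \sum_{i=1}^{n} (U_i - \bar{U})^2$, and $U_i(t)$ becomes a linear combination of $R_i$ and $\theta_i$ as $t \to \infty$. Thus, if $R_i$ and $\theta_i$ are normally distributed, then $U_i(t)$ is also normally distributed. The average output can then be computed by
\[ \bar{v}(t) = \int_{-\infty}^{\infty} g(U_i)\, \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{\left(U_i - \bar{U}(t)\right)^2}{2\sigma^2} \right) dU_i = \bar{g}\left[\bar{U}(t)\right]. \tag{7} \]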
Though the macroscopic transfer function $\bar{g}(\cdot)$ seems very complex as compared to the sigmoidal microscopic transfer function $g(\cdot)$ relating the output of a single neuron to its input, it turns out that these two functions have a similar shape, as shown in Figure 2. In particular, we can prove that $\bar{g}(\cdot)$ is monotonically increasing by showing that the first derivative of $\bar{g}(\cdot)$ is always positive [CGL92], with
\[ 0 < \bar{g}(\bar{U}) < 1. \tag{8} \]

FIGURE 2. (a) Microscopic transfer function; (b) Macroscopic transfer function.

The above discussion has been mathematically justified in Chang et al. [CGL92] and can be summarized in the following lemma:
Lemma 1. The shape of the sigmoidal microscopic transfer function $g[U_i(t)] = 1/(1 + e^{-2\lambda U_i(t)})$ is conserved by the macroscopic transfer function $\bar{g}[\bar{U}(t)]$ of a large homogeneous population of Hopfield-type neurons as $t \to \infty$.
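The claim that the averaged transfer function keeps a sigmoidal shape can be checked numerically. The sketch below estimates the Gaussian average of a logistic cell output with a plain trapezoid rule; the logistic form with exponent `2 * gain * U`, the integration limits, and all names are assumptions chosen for illustration, not the authors' procedure.

```python
import math


def g(U, gain):
    """Assumed logistic cell output 1 / (1 + exp(-2*gain*U)), via tanh."""
    return 0.5 * (1.0 + math.tanh(gain * U))


def g_bar(U_mean, sigma, gain, n=4001, y_max=6.0):
    """Trapezoid-rule estimate of the Gaussian average of g:
    (1/sqrt(pi)) * integral of g(U_mean + sqrt(2)*sigma*y) * exp(-y*y) dy."""
    h = 2.0 * y_max / (n - 1)
    total = 0.0
    for j in range(n):
        y = -y_max + j * h
        weight = 0.5 if j in (0, n - 1) else 1.0
        total += weight * g(U_mean + math.sqrt(2.0) * sigma * y, gain) * math.exp(-y * y)
    return total * h / math.sqrt(math.pi)


if __name__ == "__main__":
    for U_mean in (-3.0, -1.0, 0.0, 1.0, 3.0):
        print(U_mean, g(U_mean, 2.0), g_bar(U_mean, 1.0, 2.0))
```

The printed pairs show the averaged curve rising monotonically between 0 and 1, only more gently than the single-cell curve. For a very large gain the estimate approaches `0.5 * (1 + erf(U_mean / (sqrt(2) * sigma)))`, the Gaussian average of a step function.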
FIGURE 3. Interactions between excitatory and inhibitory neuronal assemblies.

3 Interactions between Two Neural Groups
In this section, we use the macroscopic state equation of a neuron population derived in the previous section to study a simple model involving the interactions between two groups of neurons, as shown in Figure 3. The P population consists of excitatory neurons, and the N population consists of inhibitory neurons. We discuss the qualitative behavior of the equilibrium states of both the excitatory and inhibitory neurons. The equilibrium points of the system are obtained by combining the equilibrium curves of both excitatory and inhibitory neuron populations. These curves also determine the number of fixed points in the system. The connections between the two neuron groups as shown in Figure 3, as well as other notations used in this section, are defined below.
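\[
\begin{aligned}
\bar{T}_{PP}(t) &= \frac{1}{n_P} \sum_{k=1}^{n_P} \left[ \sum_{i=1}^{n_P} T_{PP_{ik}}(t) \right] > 0, &
\bar{T}_{PN}(t) &= \frac{1}{n_N} \sum_{k=1}^{n_N} \left[ \sum_{i=1}^{n_P} T_{PN_{ik}}(t) \right] < 0, \\
\bar{T}_{NP}(t) &= \frac{1}{n_P} \sum_{k=1}^{n_P} \left[ \sum_{i=1}^{n_N} T_{NP_{ik}}(t) \right] > 0, &
\bar{T}_{NN}(t) &= \frac{1}{n_N} \sum_{k=1}^{n_N} \left[ \sum_{i=1}^{n_N} T_{NN_{ik}}(t) \right] < 0, \\
\bar{a}_P &= \frac{1}{n_P} \sum_{i=1}^{n_P} a_{P_i}, &
\bar{a}_N &= \frac{1}{n_N} \sum_{i=1}^{n_N} a_{N_i}, \\
\bar{\Theta}_P(t) &= \frac{1}{n_P} \sum_{i=1}^{n_P} \Theta_{P_i}(t), &
\bar{\Theta}_N(t) &= \frac{1}{n_N} \sum_{i=1}^{n_N} \Theta_{N_i}(t).
\end{aligned}
\]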
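Assuming that the number of excitatory neurons is close to the number of inhibitory neurons, i.e., $n_P \approx n_N$, the coupled differential equations describing the macroscopic behavior of the system can be derived from equation (6):
\[
\begin{aligned}
\frac{d\bar{U}_P(t)}{dt} &= -\bar{a}_P \bar{U}_P(t) - \bar{a}_P \bar{\Theta}_P(t) + \bar{T}_{PP}(t)\,\bar{v}_P(t) + \bar{T}_{PN}(t)\,\bar{v}_N(t) \\
&= f_P\left(\bar{U}_P(t), \bar{U}_N(t), t\right),
\end{aligned}
\tag{9}
\]
\[
\begin{aligned}
\frac{d\bar{U}_N(t)}{dt} &= -\bar{a}_N \bar{U}_N(t) - \bar{a}_N \bar{\Theta}_N(t) + \bar{T}_{NN}(t)\,\bar{v}_N(t) + \bar{T}_{NP}(t)\,\bar{v}_P(t) \\
&= f_N\left(\bar{U}_P(t), \bar{U}_N(t), t\right).
\end{aligned}
\tag{10}
\]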
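To make the two-population equations concrete, the sketch below integrates equations (9) and (10) with constant parameters by forward Euler. The transfer function `g_bar` (the Gaussian average of a step-function cell), the parameter values in `PARAMS`, the step size, and all names are assumptions chosen for illustration; they are not taken from the text.

```python
import math


def g_bar(U, sigma):
    """Assumed macroscopic transfer function (step-function cells, Gaussian spread sigma)."""
    return 0.5 * (1.0 + math.erf(U / (math.sqrt(2.0) * sigma)))


def rhs(UP, UN, p):
    """f_P and f_N of equations (9) and (10) for constant parameters in dict p."""
    vP = g_bar(UP, p["sigma"])
    vN = g_bar(UN, p["sigma"])
    fP = -p["aP"] * (UP + p["thetaP"]) + p["TPP"] * vP + p["TPN"] * vN
    fN = -p["aN"] * (UN + p["thetaN"]) + p["TNN"] * vN + p["TNP"] * vP
    return fP, fN


def simulate(UP, UN, p, dt=0.001, steps=60000):
    """Forward-Euler trajectory of the two averaged potentials."""
    trajectory = []
    for _ in range(steps):
        fP, fN = rhs(UP, UN, p)
        UP, UN = UP + dt * fP, UN + dt * fN
        trajectory.append((UP, UN))
    return trajectory


# Hypothetical parameter values, chosen only so that the example has something to show.
PARAMS = {
    "sigma": 1.0 / math.sqrt(2.0 * math.pi),
    "aP": 1.0, "aN": 1.0,
    "thetaP": 0.0, "thetaN": 1.25,
    "TPP": 3.0, "TPN": -3.0, "TNP": 3.0, "TNN": -0.5,
}

if __name__ == "__main__":
    tail = simulate(0.1, 0.0, PARAMS)[-20000:]
    UP_values = [point[0] for point in tail]
    print("U_P range over the last 20 time units:", min(UP_values), max(UP_values))
```

For this particular parameter choice the origin is a fixed point of both equations, yet a trajectory started nearby does not settle there, so the printed range stays clearly away from zero width. Other parameter choices can instead relax to a fixed point.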
The system is in equilibrium when both $f_P$ and $f_N$ are equal to zero after time $t$ is greater than some finite $t_0$. This means that the system stays in a stationary state $\bar{U}_P(t) = A$ and $\bar{U}_N(t) = B$ for all $t > t_0$. The equilibrium point can be shown graphically by plotting the curves $f_P = 0$ and $f_N = 0$ on the $(\bar{U}_P, \bar{U}_N)$ plane. Intersections of the two curves specify the equilibrium points of the system. In this paper, we assume that the aggregate values of effective synaptic strengths $\bar{T}_{PP}$, $\bar{T}_{PN}$, $\bar{T}_{NP}$, and $\bar{T}_{NN}$, as well as effective biases $\bar{\Theta}_P$ and $\bar{\Theta}_N$, asymptotically reach constant values. Under this assumption, the equilibrium states may exist because the intersections then become time-invariant. The assumption is trivially valid if there is no learning and if the average external forcing function is constant. Also, large groups of neurons with self-regulatory mechanisms are expected to exhibit such behavior [Ede87]. The following paragraphs describe the features of the two curves at equilibrium. We start off by analyzing the curve $f_P = 0$. For convenience, let us define
\[ h_P\left(\bar{U}_P(t)\right) = -\bar{a}_P \bar{U}_P(t) - \bar{a}_P \bar{\Theta}_P + \bar{T}_{PP}\,\bar{g}\left(\bar{U}_P(t)\right). \tag{11} \]
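Then,
\[ h_P'\left(\bar{U}_P(t)\right) = -\bar{a}_P + \bar{T}_{PP}\,\bar{g}'\left(\bar{U}_P(t)\right). \tag{12} \]
By equation (8) we have
\[ h_P'\left(\bar{U}_P(t)\right) < 0 \quad \text{for all } \bar{U}_P(t) \quad \text{if} \quad \frac{\bar{T}_{PP}}{\bar{a}_P} \sqrt{\frac{2}{\pi}}\; \Omega(\lambda, \eta_P) < 1, \tag{13} \]
where
\[ \Omega(\lambda, \eta) = \frac{1}{\eta} \int_0^{\infty} \frac{y \sinh\left(2\sqrt{2}\,\lambda \eta y\right) e^{-y^2}}{1 + \cosh\left(2\sqrt{2}\,\lambda \eta y\right)}\, dy. \]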
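Also $h_P(\bar{U}_P(t))$ is continuous and changes from $-\infty$ to $+\infty$ as $\bar{U}_P(t)$ varies from $+\infty$ to $-\infty$. Thus $h_P(\bar{U}_P(t))$ is monotonically decreasing. From equation (9) at equilibrium,
\[ f_P\left(\bar{U}_P(t), \bar{U}_N(t), t\right) = 0 \iff \bar{g}\left(\bar{U}_N(t)\right) = -\frac{1}{\bar{T}_{PN}}\, h_P\left(\bar{U}_P(t)\right), \]
in which $\bar{g}(\bar{U}_N(t))$ is strictly monotonically increasing. From the above discussion, it can be concluded that the relation between $\bar{U}_P(t)$ and $\bar{U}_N(t)$ is monotonically decreasing. Moreover, since $0 < \bar{g}(\bar{U}_N(t)) < 1$ and $h_P(\bar{U}_P(t))$ is not bounded, the curve $f_P = 0$ is bounded on $\bar{U}_P$ but not on $\bar{U}_N$. Figure 4 shows the graph $f_P = 0$ satisfying (13). When (13) is not satisfied, i.e., $\frac{\bar{T}_{PP}}{\bar{a}_P} \sqrt{\frac{2}{\pi}}\, \Omega(\lambda, \eta_P) > 1$, $h_P(\bar{U}_P(t))$ may not always be monotonically decreasing. Since $h_P'(\bar{U}_P(t))$ is a unimodal function, $h_P(\bar{U}_P(t))$ cannot have more than three roots. Let $\bar{U}_{P+}$ and $\bar{U}_{P-}$ stand for the local maximum and minimum of $h_P(\bar{U}_P(t))$. By (8), $\bar{g}'(\bar{U}_P(t))$ is symmetric with respect to the origin, so
\[ \bar{U}_{P+} = \bar{g}'^{-1}\!\left( \frac{\bar{a}_P}{\bar{T}_{PP}} \right) \quad \text{and} \quad \bar{U}_{P-} = -\bar{g}'^{-1}\!\left( \frac{\bar{a}_P}{\bar{T}_{PP}} \right). \]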
FIGURE 4. Generic graph of $f_P = 0$ with a single root.

FIGURE 5. Generic graphs of $f_P = 0$ with three roots, panels (a) through (d).
FIGURE 6. Generic graph of $f_N = 0$.

From the above discussion, it is clear that $h_P(\bar{U}_P(t))$ is monotonically increasing over the interval $[\bar{U}_{P-}, \bar{U}_{P+}]$ and monotonically decreasing outside this interval. Since $0 < \bar{g}(\bar{U}_N(t)) < 1$, the curve $f_P = 0$ is defined only for $h_P(\bar{U}_P(t))$ that satisfies
\[ 0 < h_P\left(\bar{U}_P(t)\right) < -\bar{T}_{PN}. \tag{14} \]
As $\bar{U}_P(t)$ tends to the bounds of (14), $\bar{U}_N(t)$ tends to infinity. The two conditions that determine the shape of the curve $f_P = 0$ are listed below:
\[ 0 < h_P\left(\bar{U}_{P+}\right) < -\bar{T}_{PN}, \tag{15} \]
\[ 0 < h_P\left(\bar{U}_{P-}\right) < -\bar{T}_{PN}. \tag{16} \]
If both conditions above are satisfied, the curve $f_P = 0$ is shown in Figure 5(a). It has the general shape shown in Figure 5(b) if equation (15) is not satisfied, and that in Figure 5(c) if equation (16) is not satisfied. However, if neither condition is satisfied, then the general shape of curve $f_P = 0$ is shown in Figure 5(d). For the curve $f_N = 0$, equation (10) can be written as
\[ \bar{g}\left(\bar{U}_P(t)\right) = -\frac{1}{\bar{T}_{NP}}\, h_N\left(\bar{U}_N(t)\right), \tag{17} \]
where
\[ h_N\left(\bar{U}_N(t)\right) = -\bar{a}_N \bar{U}_N(t) - \bar{a}_N \bar{\Theta}_N + \bar{T}_{NN}\,\bar{g}\left(\bar{U}_N(t)\right). \tag{18} \]
Since $\bar{g}(\bar{U}_P(t))$ increases monotonically and $h_N(\bar{U}_N(t))$ decreases monotonically, equation (17) denotes the monotonically increasing relation between $\bar{U}_P(t)$ and $\bar{U}_N(t)$. Moreover, since $0 < \bar{g}(\bar{U}_P(t)) < 1$ and $h_N(\bar{U}_N(t))$ is not bounded, the curve $f_N = 0$ is bounded on $\bar{U}_N(t)$ but not on $\bar{U}_P(t)$, as shown in Figure 6. We conclude this section with the following lemma:
Lemma 2. For a system of coupled excitatory and inhibitory cell assemblies governed by equations (9) and (10), there exists at least one and at most five equilibrium points (except for degenerate cases).
Notice that the equilibrium of the system is shown graphically by the intersection of the curves $f_P = 0$ and $f_N = 0$. The lemma can then be easily proved by intersecting Figure 6 with either Figure 4 or Figure 5.
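The graphical argument can be mimicked numerically: follow the curve $f_N = 0$, evaluate $f_P$ along it, and look for sign changes. The sketch below does this with bisection. The transfer function, the parameter values, the scan range, and all function names are assumptions for illustration; the inner root is unique here only because $f_N$ decreases in $\bar{U}_N$ when the assumed $\bar{T}_{NN}$ is not positive.

```python
import math


def g_bar(U, sigma):
    """Assumed macroscopic transfer function (Gaussian average of a step)."""
    return 0.5 * (1.0 + math.erf(U / (math.sqrt(2.0) * sigma)))


def f_P(UP, UN, p):
    return (-p["aP"] * (UP + p["thetaP"]) + p["TPP"] * g_bar(UP, p["sigma"])
            + p["TPN"] * g_bar(UN, p["sigma"]))


def f_N(UP, UN, p):
    return (-p["aN"] * (UN + p["thetaN"]) + p["TNN"] * g_bar(UN, p["sigma"])
            + p["TNP"] * g_bar(UP, p["sigma"]))


def bisect(func, lo, hi, iters=100):
    """Root of func on [lo, hi], assuming func(lo) and func(hi) differ in sign."""
    f_lo = func(lo)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        f_mid = func(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def UN_on_fN_curve(UP, p, bound=50.0):
    """U_N with f_N(U_P, U_N) = 0 (unique when T_NN <= 0, since f_N then decreases in U_N)."""
    return bisect(lambda UN: f_N(UP, UN, p), -bound, bound)


def equilibria(p, lo=-5.0, hi=5.0, n=1000):
    """Scan U_P along the f_N = 0 curve and refine every sign change of f_P."""
    def along_curve(UP):
        return f_P(UP, UN_on_fN_curve(UP, p), p)

    h = (hi - lo) / n
    xs = [lo + j * h for j in range(n + 1)]
    values = [along_curve(x) for x in xs]
    roots = []
    for j in range(n):
        if (values[j] > 0) != (values[j + 1] > 0):
            UP = bisect(along_curve, xs[j], xs[j + 1], iters=60)
            roots.append((UP, UN_on_fN_curve(UP, p)))
    return roots


# Hypothetical parameter values for illustration.
PARAMS = {
    "sigma": 1.0 / math.sqrt(2.0 * math.pi),
    "aP": 1.0, "aN": 1.0, "thetaP": 0.0, "thetaN": 1.25,
    "TPP": 3.0, "TPN": -3.0, "TNP": 3.0, "TNN": -0.5,
}

if __name__ == "__main__":
    print(equilibria(PARAMS))
```

A grid scan of this kind can miss roots that lie closer together than the grid spacing or that touch zero without crossing it, so it is a check on the lemma for a given parameter set rather than a proof.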
4 Stability of Equilibrium States
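In this section, the stability of equilibrium states is studied using Liapunov's theorem. We then apply the Poincaré-Bendixson theorem along with Dulac's criterion [AVK66] to analyze the system defined in Section 3. In the previous section, we showed that there exists at least one fixed equilibrium point and assumed that the $\bar{T}$'s and $\bar{\Theta}$'s become constant. From Liapunov's stability theorem on differential equations [AVK66], the stability of an equilibrium state $(A, B)$ of the system described by equations (9) and (10) can be tested by the matrix
\[ M = \begin{pmatrix} \dfrac{\partial f_P}{\partial \bar{U}_P} & \dfrac{\partial f_P}{\partial \bar{U}_N} \\[2ex] \dfrac{\partial f_N}{\partial \bar{U}_P} & \dfrac{\partial f_N}{\partial \bar{U}_N} \end{pmatrix} \tag{19} \]
evaluated at $(A, B)$. If the determinant of $M$ is positive and the trace of $M$ is negative, then the equilibrium state is stable; it is unstable otherwise. Combining equations (9), (10), and (19),
\[ M = \begin{pmatrix} -\bar{a}_P + \bar{g}'\left(\bar{U}_P(t)\right) \bar{T}_{PP} & \bar{g}'\left(\bar{U}_N(t)\right) \bar{T}_{PN} \\[1ex] \bar{g}'\left(\bar{U}_P(t)\right) \bar{T}_{NP} & -\bar{a}_N + \bar{g}'\left(\bar{U}_N(t)\right) \bar{T}_{NN} \end{pmatrix}. \tag{20} \]
By condition (8) and the definitions given in the beginning of Section 3,
\[ \frac{\partial f_P}{\partial \bar{U}_N} < 0 \quad \text{and} \quad \frac{\partial f_N}{\partial \bar{U}_N} < 0, \]
so that
\[ \det(M) > 0 \iff -\frac{\partial f_P / \partial \bar{U}_P}{\partial f_P / \partial \bar{U}_N} < -\frac{\partial f_N / \partial \bar{U}_P}{\partial f_N / \partial \bar{U}_N}. \]
If we denote the slopes of the curves $f_P = 0$ and $f_N = 0$ by $S_P$ and $S_N$ respectively, then
\[ \det(M) > 0 \iff S_P\left(\bar{U}_P(t), \bar{U}_N(t)\right) < S_N\left(\bar{U}_P(t), \bar{U}_N(t)\right). \]
The above discussion can be summarized in the following theorem.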
Theorem 1. The equilibrium state $(A, B)$ of the system described by equations (9) and (10) is stable when both conditions below hold and is unstable otherwise.
\[ S_P(A, B) < S_N(A, B) \qquad (\det(M) > 0). \tag{21} \]
\[ \bar{g}'(A)\,\bar{T}_{PP} + \bar{g}'(B)\,\bar{T}_{NN} < \bar{a}_P + \bar{a}_N \qquad (\operatorname{trace}(M) < 0). \tag{22} \]
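The determinant and trace test is straightforward to evaluate once a form for $\bar{g}'$ is fixed. The sketch below assumes the Gaussian-average-of-a-step transfer function, whose derivative is a Gaussian density; that choice, the parameter values, and the function names are illustrative assumptions rather than the authors' settings.

```python
import math


def g_bar_prime(U, sigma):
    """Derivative of the assumed transfer 0.5 * (1 + erf(U / (sqrt(2) * sigma)))."""
    return math.exp(-U * U / (2.0 * sigma * sigma)) / (math.sqrt(2.0 * math.pi) * sigma)


def jacobian(A, B, p):
    """Matrix M of equation (20) evaluated at the point (A, B)."""
    dP = g_bar_prime(A, p["sigma"])
    dN = g_bar_prime(B, p["sigma"])
    return [[-p["aP"] + dP * p["TPP"], dN * p["TPN"]],
            [dP * p["TNP"], -p["aN"] + dN * p["TNN"]]]


def is_stable(A, B, p):
    """True when det(M) > 0 and trace(M) < 0 at (A, B)."""
    M = jacobian(A, B, p)
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    trace = M[0][0] + M[1][1]
    return det > 0 and trace < 0


# Hypothetical parameter values for illustration.
STRONG = {"sigma": 1.0 / math.sqrt(2.0 * math.pi), "aP": 1.0, "aN": 1.0,
          "TPP": 3.0, "TPN": -3.0, "TNP": 3.0, "TNN": -0.5}
WEAK = dict(STRONG, TPP=0.5)

if __name__ == "__main__":
    print("strong self-excitation stable?", is_stable(0.0, 0.0, STRONG))
    print("weak self-excitation stable?", is_stable(0.0, 0.0, WEAK))
```

The function only evaluates the test at the point it is given; whether that point is actually an equilibrium depends on the biases and has to be established separately.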
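To discuss the general behavior of our system governed by equations (9) and (10), let us consider a region bounded by a circle:
\[ \Re = \left(\bar{U}_P - A\right)^2 + \left(\bar{U}_N - B\right)^2. \]
We have
\[
\begin{aligned}
\frac{d\Re}{dt} &= 2\left[ -\bar{a}_P \left(\bar{U}_P - A\right) \bar{U}_P - \bar{a}_N \left(\bar{U}_N - B\right) \bar{U}_N \right] \\
&\quad + 2\left[ \bar{T}_{PP}\,\bar{g}(\bar{U}_P) + \bar{T}_{PN}\,\bar{g}(\bar{U}_N) - \bar{a}_P \bar{\Theta}_P \right] \left(\bar{U}_P - A\right) \\
&\quad + 2\left[ \bar{T}_{NN}\,\bar{g}(\bar{U}_N) + \bar{T}_{NP}\,\bar{g}(\bar{U}_P) - \bar{a}_N \bar{\Theta}_N \right] \left(\bar{U}_N - B\right) \\
&= 2\left[ \bar{T}_{PP} \{ \bar{g}(\bar{U}_P) - \bar{g}(A) \} + \bar{T}_{PN} \{ \bar{g}(\bar{U}_N) - \bar{g}(B) \} - \bar{a}_P \{ \bar{U}_P - A \} \right] \{ \bar{U}_P - A \} \\
&\quad + 2\left[ \bar{T}_{NN} \{ \bar{g}(\bar{U}_N) - \bar{g}(B) \} + \bar{T}_{NP} \{ \bar{g}(\bar{U}_P) - \bar{g}(A) \} - \bar{a}_N \{ \bar{U}_N - B \} \right] \{ \bar{U}_N - B \}.
\end{aligned}
\]
Since $\bar{g}(\cdot)$ is bounded, if region $\Re$ is large enough, $d\Re/dt < 0$ when $\bar{U}_P(t)$ and $\bar{U}_N(t)$ become large. According to the Poincaré-Bendixson theorem [AVK66], if $(\bar{U}_P(t), \bar{U}_N(t))$ is a solution of the system such that it exists and stays in $\Re$ for $t > t_0$ for some finite $t_0$, then (a) the solution is periodic, (b) the solution spirals toward a periodic solution as $t \to \infty$, or (c) the solution terminates at any one of the equilibrium points. A formal statement of this theorem is given in the Appendix. An alternative mathematical analysis to show the existence of periodic solutions can be found in Geman [Gem82]. Figure 7 shows the phase portrait corresponding to each of the cases above. If we can choose an annular region $\Re'$ that excludes all equilibrium points and assume that among these limit cycles there are no "semistable" ones (they are only possible in "noncoarse" systems), then if all paths enter the region $\Re'$ as $t$ increases, there is at least one stable limit cycle. On the other hand, if all paths leave the region $\Re'$ as $t$ increases, there exists at least one unstable limit cycle. We have shown that oscillatory solutions can exist in our model consisting of excitatory and inhibitory neurons. Furthermore, we can use Dulac's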
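criterion [AVK66] to identify the situations where no periodic solution exists.

FIGURE 7. Phase plot of all three possible solutions.

If
\[
\frac{\partial \left[ Q f_P \right]}{\partial \bar{U}_P} + \frac{\partial \left[ Q f_N \right]}{\partial \bar{U}_N}
= \left[ \bar{g}'\left(\bar{U}_P(t)\right) \bar{T}_{PP} + \bar{g}'\left(\bar{U}_N(t)\right) \bar{T}_{NN} - \left(\bar{a}_P + \bar{a}_N\right) \right] Q
+ \frac{\partial Q}{\partial \bar{U}_P}\, f_P + \frac{\partial Q}{\partial \bar{U}_N}\, f_N
\]
has the same sign in region $\Re$, there exists no limit cycle in $\Re$ (see Appendix). Here $Q(\bar{U}_P(t), \bar{U}_N(t))$ is any continuous function with continuous derivatives. Let $Q = 1$. Then
\[
\begin{aligned}
\frac{\partial f_P}{\partial \bar{U}_P} + \frac{\partial f_N}{\partial \bar{U}_N}
&< \bar{g}'\left(\bar{U}_P(t)\right) \bar{T}_{PP} - \left(\bar{a}_P + \bar{a}_N\right) \\
&\le \bar{g}'(0)\, \bar{T}_{PP} - \left(\bar{a}_P + \bar{a}_N\right) \\
&= \Omega(\lambda, \eta_P) \sqrt{\frac{2}{\pi}}\; \bar{T}_{PP} - \left(\bar{a}_P + \bar{a}_N\right).
\end{aligned}
\]
By Dulac's criterion, we obtain the following lemma.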
Lemma 3. If
\[ \Omega(\lambda, \eta_P) \sqrt{\frac{2}{\pi}}\; \bar{T}_{PP} < \bar{a}_P + \bar{a}_N, \]
then there is no periodic solution for equations (9) and (10) with constant $\bar{T}$'s and $\bar{\Theta}$'s. This means that the neural system must converge to a fixed point. From Lemma 2, we obtain at most five equilibrium points for this system in general. If any of the conditions in (21) and (22) are not satisfied, these equilibrium points become unstable, and any perturbation from the equilibrium causes the system to oscillate.
5 Oscillation Frequency Estimation
Finding the frequency of oscillation, in situations when it occurs, is a natural extension in understanding the aggregate behavior of interacting neuron populations. However, since such systems are highly nonlinear, finding a simple closed-form solution is difficult. In this section, we derive an expression for the fundamental frequency of oscillation. Simulation results given in Section 6 show this estimate to be quite accurate. If the system is periodic, the solution to equations (9) and (10) can be approximated by the zeroth and first harmonics of their Fourier series:
$$\begin{aligned}
U_P(t) &= \alpha_P + \beta_P \sin \omega t, \\
U_N(t) &= \alpha_N + \beta_N \sin(\omega t + \delta), \\
v_P(t) &= g(U_P(t)) = \alpha'_P + \beta'_P \sin \omega t, \\
v_N(t) &= g(U_N(t)) = \alpha'_N + \beta'_N \sin(\omega t + \delta).
\end{aligned} \tag{23}$$
Here, $\omega$ is the first harmonic frequency, and $\delta$ is the phase difference between $U_P(t)$ and $U_N(t)$. Since there is no delay in our model, $U_P(t)$ and $v_P(t)$, as well as $U_N(t)$ and $v_N(t)$, have the same phase. Substituting equations (23) into (9) and (10), we obtain
$$\begin{aligned}
\beta_P\,\omega \cos(\omega t) &= -a_P \alpha_P - a_P \beta_P \sin \omega t - a_P \theta_P + T_{PP}\,\alpha'_P + T_{PP}\,\beta'_P \sin \omega t + T_{PN}\,\alpha'_N + T_{PN}\,\beta'_N \sin(\omega t + \delta), \\
\beta_N\,\omega \cos(\omega t + \delta) &= -a_N \alpha_N - a_N \beta_N \sin(\omega t + \delta) - a_N \theta_N + T_{NN}\,\alpha'_N + T_{NN}\,\beta'_N \sin(\omega t + \delta) + T_{NP}\,\alpha'_P + T_{NP}\,\beta'_P \sin \omega t.
\end{aligned} \tag{24}$$
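For simplicity, let $\lambda = \infty$; i.e., let the activation function of an individual neuron be a step function. Equation (7) can then be written as follows:
$$v = g(U) = \frac{1}{2}\left[ 1 + \Phi\!\left( \frac{U}{\sqrt{2}\,\eta} \right) \right],$$
where the error function is defined as before. Series representation of the error function is given by
$$\Phi(x) = \frac{2}{\sqrt{\pi}} \sum_{k=1}^{\infty} (-1)^{k-1} \frac{x^{2k-1}}{(2k-1)(k-1)!}.$$
Let $x = p + q \sin \omega t$. Then the coefficient of the constant term is
$$\frac{2}{\sqrt{\pi}} \sum_{k=1}^{\infty} (-1)^{k-1} \frac{p^{2k-1}}{(2k-1)(k-1)!} = \Phi(p),$$
and the coefficient of $\sin \omega t$ is
$$\frac{2}{\sqrt{\pi}}\, q \sum_{k=1}^{\infty} (-1)^{k-1} \frac{p^{2k-2}}{(k-1)!} = \frac{2}{\sqrt{\pi}}\, q\, e^{-p^2}.$$
Taylor's expansion of the function $\Phi(t) = \Phi(p + q \sin \omega t)$ yields
$$\Phi(t) = \Phi(p) + \frac{2}{\sqrt{\pi}}\, q\, e^{-p^2} \sin \omega t + \text{higher-order terms}.$$
Using this result, the Fourier coefficients of the average outputs and average inputs in equation (23) can be related in the following equations:
$$\begin{aligned}
v_P(t) &= \frac{1}{2}\left[ 1 + \Phi\!\left( \frac{\alpha_P}{\sqrt{2}\,\eta_P} \right) \right] + \frac{\beta_P}{\sqrt{2\pi}\,\eta_P} \exp\!\left( -\frac{\alpha_P^2}{2\eta_P^2} \right) \sin \omega t + \cdots, \\
v_N(t) &= \frac{1}{2}\left[ 1 + \Phi\!\left( \frac{\alpha_N}{\sqrt{2}\,\eta_N} \right) \right] + \frac{\beta_N}{\sqrt{2\pi}\,\eta_N} \exp\!\left( -\frac{\alpha_N^2}{2\eta_N^2} \right) \sin(\omega t + \delta) + \cdots.
\end{aligned} \tag{25}$$
From equations (24) and (25), the following relations can be derived by matching the coefficients of $\cos \omega t$, $\sin \omega t$, and the constant term.
$$\beta_P\,\omega = T_{PN}\,\beta'_N \sin \delta, \tag{26}$$
$$0 = -a_P \beta_P + T_{PP}\,\beta'_P + T_{PN}\,\beta'_N \cos \delta, \tag{27}$$
$$0 = -a_P \alpha_P - a_P \theta_P + T_{PP}\,\alpha'_P + T_{PN}\,\alpha'_N, \tag{28}$$
$$\beta_N\,\omega \cos \delta = -a_N \beta_N \sin \delta + T_{NN}\,\beta'_N \sin \delta, \tag{29}$$
$$-\beta_N\,\omega \sin \delta = -a_N \beta_N \cos \delta + T_{NN}\,\beta'_N \cos \delta + T_{NP}\,\beta'_P, \tag{30}$$
$$0 = -a_N \alpha_N - a_N \theta_N + T_{NN}\,\alpha'_N + T_{NP}\,\alpha'_P, \tag{31}$$
$$\alpha'_P = \frac{1}{2}\left[ 1 + \Phi\!\left( \frac{\alpha_P}{\sqrt{2}\,\eta_P} \right) \right], \tag{32}$$
$$\beta'_P = \frac{\beta_P}{\sqrt{2\pi}\,\eta_P} \exp\!\left( -\frac{\alpha_P^2}{2\eta_P^2} \right), \tag{33}$$
$$\alpha'_N = \frac{1}{2}\left[ 1 + \Phi\!\left( \frac{\alpha_N}{\sqrt{2}\,\eta_N} \right) \right], \tag{34}$$
$$\beta'_N = \frac{\beta_N}{\sqrt{2\pi}\,\eta_N} \exp\!\left( -\frac{\alpha_N^2}{2\eta_N^2} \right). \tag{35}$$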
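The first-order expansion of the error function used above can be checked numerically. The following is a minimal sketch, assuming Python's standard `math.erf` as the error function $\Phi$; the function names, the sample count, and the values of `p` and `q` are illustrative choices, not part of the original derivation. It estimates the $\sin \omega t$ Fourier coefficient of $\Phi(p + q \sin \omega t)$ by a Riemann sum over one period and compares it with $\frac{2}{\sqrt{\pi}} q e^{-p^2}$.

```python
import math


def first_harmonic_coefficient(p, q, samples=4096):
    """Numerically estimate the sin(wt) Fourier coefficient of erf(p + q*sin(wt)).

    Uses (1/pi) * integral over one period, approximated by a Riemann sum.
    """
    total = 0.0
    for k in range(samples):
        phase = 2.0 * math.pi * k / samples
        total += math.erf(p + q * math.sin(phase)) * math.sin(phase)
    return 2.0 * total / samples


def first_order_estimate(p, q):
    """First-order coefficient (2/sqrt(pi)) * q * exp(-p^2) from the Taylor expansion."""
    return 2.0 / math.sqrt(math.pi) * q * math.exp(-p * p)


if __name__ == "__main__":
    p, q = 0.3, 0.05  # illustrative values; the estimate assumes small q
    print(first_harmonic_coefficient(p, q), first_order_estimate(p, q))
```

The agreement degrades as `q` grows, since the neglected higher-order terms also contribute to the first harmonic.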
Using the above equations and some algebraic manipulations, we arrive at the expression below for the frequency of oscillation:
$$\omega = \sqrt{ -\frac{T_{PN}\,T_{NP}}{2\pi\,\eta_P\,\eta_N} \exp\!\left( -\frac{\alpha_P^2}{2\eta_P^2} - \frac{\alpha_N^2}{2\eta_N^2} \right) - \left[ \frac{T_{PP}}{\sqrt{2\pi}\,\eta_P} \exp\!\left( -\frac{\alpha_P^2}{2\eta_P^2} \right) - a_P \right]^2 }, \tag{36}$$
where $\alpha_P$ and $\alpha_N$ can be obtained from
$$-a_P \alpha_P + \frac{1}{2} T_{PP}\,\Phi\!\left( \frac{\alpha_P}{\sqrt{2}\,\eta_P} \right) + \frac{1}{2} T_{PN}\,\Phi\!\left( \frac{\alpha_N}{\sqrt{2}\,\eta_N} \right) = a_P \theta_P - \frac{1}{2} T_{PP} - \frac{1}{2} T_{PN}, \tag{37}$$
$$-a_N \alpha_N + \frac{1}{2} T_{NN}\,\Phi\!\left( \frac{\alpha_N}{\sqrt{2}\,\eta_N} \right) + \frac{1}{2} T_{NP}\,\Phi\!\left( \frac{\alpha_P}{\sqrt{2}\,\eta_P} \right) = a_N \theta_N - \frac{1}{2} T_{NN} - \frac{1}{2} T_{NP}. \tag{38}$$
This first-order approximation of the frequency is accurate when the condition given below is satisfied:
$$a_P + a_N = \frac{T_{NN}}{\sqrt{2\pi}\,\eta_N} \exp\!\left( -\frac{\alpha_N^2}{2\eta_N^2} \right) + \frac{T_{PP}}{\sqrt{2\pi}\,\eta_P} \exp\!\left( -\frac{\alpha_P^2}{2\eta_P^2} \right). \tag{39}$$

6 Experimental Validation
Simulations were performed to validate the results of the last two sections for $\lambda = \infty$. For this special case, the different types of solutions as given by the stability analysis of Section 4 are experimentally observed. We then show that for a system with periodic solutions, the frequency estimation in Section 5 is fairly accurate. As stated in Section 4, there exist three types of solutions for equations (9) and (10): periodic, spiral to periodic, and fixed point. Simulation results showing all three solutions are presented in Figures 8, 9, and 10. Each figure consists of six graphs:
1. Average excitatory input over time, $U_P$.
2. Average inhibitory input over time, $U_N$.
3. Phase portrait of the average excitatory versus inhibitory input.
4. Average excitatory output over time, $v_P$.
5. Average inhibitory output over time, $v_N$.
6. Phase portrait of the average excitatory versus inhibitory output.
For the periodic case, the frequency can be approximated by equation (36) in Section 5. To simplify the computation, one of the equilibrium points is set at (0,0). Using equations (9) and (10) under these conditions, we obtain the following relations:
$$a_P \theta_P - \frac{1}{2} T_{PP} - \frac{1}{2} T_{PN} = 0,$$
$$a_N \theta_N - \frac{1}{2} T_{NN} - \frac{1}{2} T_{NP} = 0.$$
It is easy to see that $\alpha_P = 0$ and $\alpha_N = 0$ is one of the solutions for equations (37) and (38). For this case, equations (36) and (39) can be reduced to
$$\omega = \sqrt{ -\frac{T_{PN}\,T_{NP}}{2\pi\,\eta_P\,\eta_N} - \left( \frac{T_{PP}}{\sqrt{2\pi}\,\eta_P} - a_P \right)^2 }, \tag{40}$$
$$a_P + a_N = \frac{1}{\sqrt{2\pi}} \left( \frac{T_{NN}}{\eta_N} + \frac{T_{PP}}{\eta_P} \right). \tag{41}$$
Simulations confirm that amplitude and frequency, as well as phase shift, can be adjusted by changing the parameters used in equations (9) and (10). A wide range of the frequency of oscillation is obtained by changing the $T$'s and $a$'s while setting $\eta_P = \eta_N = 1$. Numerical results from the simulations are compared with theoretical results from equation (40) in Table 1.
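The theoretical frequencies can be reproduced directly from equation (40). The sketch below is illustrative only: the function name and argument names are assumptions, and it further assumes that the tabulated frequency is $f = \omega / 2\pi$, which is what the theory column of Table 1 appears to correspond to when the tabulated parameters are inserted.

```python
import math


def oscillation_frequency(T_PP, T_PN, T_NP, a_P, eta_P=1.0, eta_N=1.0):
    """Frequency f = omega / (2*pi), with omega from equation (40).

    Returns None when the radicand is not positive, i.e. when the
    first-harmonic estimate predicts no oscillation.
    """
    radicand = (-T_PN * T_NP / (2.0 * math.pi * eta_P * eta_N)
                - (T_PP / (math.sqrt(2.0 * math.pi) * eta_P) - a_P) ** 2)
    if radicand <= 0.0:
        return None
    return math.sqrt(radicand) / (2.0 * math.pi)


if __name__ == "__main__":
    # (T_PP, T_PN, T_NP, a_P) taken from the first rows of Table 1
    for row in [(5.6, -5.6, 1.0, 1.32), (4.0, -6.0, 1.4, 0.7), (8.0, -12.0, 4.0, 0.7)]:
        print(row, oscillation_frequency(*row))
```

Note that $T_{NN}$ and $a_N$ do not enter equation (40) explicitly; they are constrained through condition (41).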
7 Conclusion
In this chapter, we have applied a macroscopic model of cell assemblies to study the qualitative behavior of a simple system consisting of two interacting groups of excitatory and inhibitory neurons. The theory not only is able
FIGURE 8. Periodic solution.
FIGURE 9. Spiral to periodic solution.
FIGURE 10. Fixed point solution.
$T_{PP}$    $T_{PN}$    $T_{NP}$    $T_{NN}$    $a_P$   $a_N$   $f$ (Theory)   $f$ (Simulation)
5.6       -5.6      1.0       -1.0      1.32    0.5     0.037          0.039
4.0       -6.0      1.4       -2.0      0.7     0.1     0.116          0.107
8.0       -12.0     4.0       -6.0      0.7     0.1     0.190          0.176
10.0      -16.0     5.0       -8.0      0.7     0.1     0.220          0.195
10.8      -10.8     3.0       -3.0      2.8     0.31    0.270          0.273
12.8      -12.8     5.0       -5.0      2.8     0.31    0.351          0.352
30.0      -30.0     25.0      -25.0     1.8     0.11    0.636          0.635
171.8     -171.8    163.0     -163.0    2.5     0.61    1.557          1.660
TABLE 1. Frequency of Oscillation with $\eta_P = \eta_N = 1$.

to determine the situations that lead to oscillatory behavior, but also is able to give a good estimation for the oscillation frequency for such situations. The oscillations in our system stem from the competitive-cooperative dynamics of the neuron groups similar to those studied by Wilson and Cowan [WC72], without any imposed constructs such as the use of periodic forcing functions or oscillator neurons. The accuracy of the frequency estimation provides incentive to study more complex systems involving several neuronal groups, as well as more intricate phenomena such as phase locking. We believe that quantitative studies of rhythmic behavior will increase in significance with improved understanding of the role of temporal activities in information organization and processing in the brain.
8 Appendix
The Poincaré-Bendixson Theorem [AVK66]: Let $\mathcal{R}$ be a closed bounded region consisting of nonsingular points of a 2 x 2 system $\dot{x} = X(x)$ such that some positive half-path $\mathcal{H}$ of the system lies entirely within $\mathcal{R}$. Then either $\mathcal{H}$ is itself a closed path, or it approaches a closed path, or it terminates at an equilibrium point.
Dulac's Criterion [AVK66]: For the system $\dot{x} = X(x, y)$, $\dot{y} = Y(x, y)$, there are no closed paths in a simply-connected region in which $\frac{\partial (QX)}{\partial x} + \frac{\partial (QY)}{\partial y}$ is of one sign, where $Q(x, y)$ is any function having continuous first partial derivatives.
9 REFERENCES
[AB89] A. Atiya and P. Baldi. Oscillations and synchronizations in neural networks: An exploration of the labeling hypothesis. International Journal of Neural Systems, 1:103-124, 1989.
[Ama71] S. I. Amari. Characteristics of randomly connected threshold-element networks and network systems. Proceedings of the IEEE, 59:35-47, 1971.
[Ama72] S. I. Amari. Characteristics of random nets of analog neuron-like elements. IEEE Transactions on Systems, Man, and Cybernetics, 2:1443-1463, 1972.
[Ama90] S. I. Amari. Mathematical foundations of neurocomputing. Proceedings of the IEEE, 78:1443-1463, 1990.
[Ami89] D. J. Amit. Modeling Brain Function. Cambridge University Press, Cambridge, U.K., 1989.
[AVK66] A. A. Andronov, A. A. Vitt, and S. E. Khaikin. Theory of Oscillators. Dover, New York, 1966.
[Bai90] B. Baird. Associative memory in a simple model of oscillating cortex. In D. Touretzky, editor, Advances in Neural Information Processing Systems II, pages 69-75. Morgan Kaufmann, San Mateo, CA, 1990.
[Bow90] J. M. Bower. Reverse engineering the nervous system: an anatomical, physiological and computer based approach. In S. Zornetzer, J. Davis, and C. Lau, editors, An Introduction to Neural and Electronic Networks, pages 3-24. Academic Press, San Diego, CA, 1990.
[CG93] H.-J. Chang and J. Ghosh. Pattern association and pattern retrieval in a continuous neural system. Biological Cybernetics, 69(1):77-86, 1993.
[CGL92] H.-J. Chang, J. Ghosh, and K. Liano. A macroscopic model of neural ensembles: Learning-induced oscillations in a cell assembly. International Journal of Neural Systems, 3(2):179-198, 1992.
[Cow67] J. D. Cowan. A mathematical theory of central nervous activity. Ph.D. thesis, Univ. of London, 1967.
[Ede87] G. M. Edelman. Neural Darwinism. Basic Books, New York, 1987.
[GC83] S. Grossberg and M. Cohen. Absolute stability of global pattern formation and parallel memory storage by competitive neural networks. IEEE Transactions on Systems, Man, and Cybernetics, 13:815-826, 1983.
[Gem82] S. Geman. Almost sure stable oscillations in a large system of randomly coupled equations. SIAM Journal on Applied Mathematics, 42:695-703, 1982.
[GH89] J. Ghosh and K. Hwang. Mapping neural networks onto message-passing multicomputers. Journal of Parallel and Distributed Computing, 6:291-330, April 1989.
[GS89] C. M. Gray and W. Singer. Stimulus-specific neuronal oscillations in orientation columns of cat visual cortex. Proceedings of the National Academy of Sciences, USA, 86:1698-1702, 1989.
[Hir89] M. W. Hirsch. Convergent activation dynamics in continuous time networks. Neural Networks, 2:331-350, 1989.
[Hop84] J. J. Hopfield. Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, USA, 81:3088-3092, 1984.
[KKH90] D. Kammen, C. Koch, and P. J. Holmes. Collective oscillations in the visual cortex. In D. Touretzky, editor, Advances in Neural Information Processing Systems II, pages 77-83. Morgan Kaufmann, San Mateo, CA, 1990.
[KS89] C. Koch and I. Segev. Methods in Neuronal Modeling, from Synapses to Networks. MIT Press, Cambridge, MA, 1989.
[Pav73] T. Pavlidis. Biological Oscillators: Their Mathematical Analysis. Academic Press, New York, 1973.
[Roz69] L. I. Rozonoer. Random logical nets I, II and III. Automatiki i Telemekhaniki, 5:137-147, 1969.
[Som88] H. Sompolinsky. Statistical mechanics of neural networks. Physics Today, pages 70-80, 1988.
[SW90] H. G. Schuster and P. Wagner. A model for neuronal oscillations in the visual cortex. Biological Cybernetics, 64:77-82, 1990.
[TMW89] R. D. Traub, R. Miles, and R. K. S. Wong. Model of the origin of rhythmic population oscillations in the hippocampal slice. Science, 243:1319-1325, 1989.
[vB86] C. von der Malsburg and E. Bienenstock. Statistical coding and short term synaptic plasticity: A scheme for knowledge representation in the brain. In E. Bienenstock, F. Fogelman, and G. Weisbuch, editors, Disordered Systems and Biological Organization, pages 247-272. Springer, Berlin, 1986.
[vdM88] C. von der Malsburg. Pattern recognition by labeled graph matching. Neural Networks, 1:141-148, 1988.
[WC72] H. R. Wilson and J. D. Cowan. Excitatory and inhibitory interactions in localized populations of model neurons. Biophysical Journal, 12:1-24, 1972.
Chapter 6
Finite State Machines and Recurrent Neural Networks
Automata and Dynamical Systems Approaches
Peter Tiňo
Bill G. Horne
C. Lee Giles
Pete C. Collingwood
ABSTRACT We present two approaches to the analysis of the relationship between a recurrent neural network (RNN) and the finite state machine $\mathcal{M}$ the network is able to exactly mimic. First, the network is treated as a state machine and the relationship between the RNN and $\mathcal{M}$ is established in the context of the algebraic theory of automata. In the second approach, the RNN is viewed as a set of discrete-time dynamical systems associated with input symbols of $\mathcal{M}$. In particular, issues concerning network representation of loops and cycles in the state transition diagram of $\mathcal{M}$ are shown to provide a basis for the interpretation of the learning process from the point of view of bifurcation analysis. The circumstances under which a loop corresponding to an input symbol $x$ is represented by an attractive fixed point of the underlying dynamical system associated with $x$ are investigated. For the case of two recurrent neurons, under some assumptions on weight values, bifurcations can be understood in the geometrical context of intersection of increasing and decreasing parts of curves defining fixed points. The most typical bifurcation responsible for the creation of a new fixed point is the saddle node bifurcation.
1 Introduction
The relationship between recurrent neural networks (RNN) and automata has been treated by many [Min61], [Jor86], [CSSM89], [DGS92], [Elm90], [GMC+92a], [Cas93], [WK92b], [ZGS93], [MF94], [DM94], [HH94]. Activations of state units represent past histories, and clusters of these activations can represent the states of the generating automaton [GMC+92b]. In this contribution, the relationship between an RNN and a finite state machine it exactly mimics is investigated from two points of view. First (Section 5), the network is treated as a state machine. The concept of state equivalence is used to reduce the infinite, non-countable set of network states (activations of RNN state neurons) to a finite factor state set corresponding to the set of states of $\mathcal{M}$. Second (Section 6), the RNN is viewed as a set of discrete-time dynamical systems associated with input symbols of $\mathcal{M}$. The dynamical systems operate on $(0,1)^L$, where $L$ is the number of recurrent neurons of the RNN. In our experiments, loops and cycles corresponding to an input symbol $x$ of $\mathcal{M}$ have stable representation as attractive fixed points and periodic orbits, respectively, of the dynamical system associated with the input $x$. Suppose there is a loop associated with an input $x$ in a state $q$ of $\mathcal{M}$. Denote the set of network states equivalent to $q$ by $(q)_{\mathcal{N}}$. Then, if there is a vertex $v \in \{0,1\}^L$ such that $v$ is in the closure of $(q)_{\mathcal{N}}$, the loop is likely to be represented by an attractive fixed point^1 "near" $v$.
Related work was independently done by Casey [Cas93], [Cas95a]. In his setting, an RNN is assumed to operate in a noisy environment (representing, for example, a noise corresponding to round-off errors in computations performed on a digital computer). RNNs are trained to perform grammatical inference. It is proved that the presence of a loop in the state transition diagram of the automaton^2 necessarily implies the presence of an attractive set inside RNN state space (see the discussion in Section 6). It is also shown that the method for extraction of an automaton from a trained RNN introduced in [GMC+92a] is consistent: the method is based on dividing RNN state space into equal hypercubes, and there is always a finite number of hypercubes that one needs in order to unambiguously cover regions of equivalent network states.
In Section 7 a more detailed analysis of the case when the RNN has two state neurons is presented. Under some conditions on weight values, the number, position, and stability types of fixed points of the underlying dynamical systems are analyzed and the bifurcation mechanism is clarified. The most typical bifurcation responsible for the creation of a new fixed point is the saddle node bifurcation. A mechanism of correct behavior of an RNN for short input strings, when for long strings the network is known to generalize poorly, is investigated in Section 8. In such cases, a correct state transition diagram of an FSM the network was trained with can still be extracted [GMC+92a]. A tool called the state degradation diagram is developed to illustrate how regions of network state space, initially acting as if they assumed the role of states of the FSM in which there is a loop associated with an input symbol $x$, gradually degrade upon repeated presentation of $x$.
^1 of the corresponding dynamical system
^2 recognizing the same language as the RNN
Sections 2 and 3 bring brief introductions to state machines and dynamical systems, respectively. Section 4 is devoted to the model of RNN [nHG95] used for learning FSMs.
2 State Machines
This section introduces the concept of a state machine, which is a generalized finite state machine with a possibly uncountable number of states. When viewed as automata, RNNs can be described in terms of state machines. A state machine (SM) is a 6-tuple $\mathcal{M} = (X, Y, S, f_s, f_o, s_0)$, where
• $X$ is a nonempty finite set called the input set,
• $Y$ is a nonempty finite set called the output set,
• $S$ is a nonempty set called the set of internal states,
• $f_s$ is a map $f_s : S \times X \to S$ called the next-state function,
• $f_o$ is a map $f_o : S \times X \to Y$ called the output function,
• $s_0 \in S$ is called the initial state.
SMs with a finite internal state set are called finite state machines (FSMs). We assume that the reader is familiar with the notion of a monoid of words over a finite set. Following the standard notation, $\Lambda$, $X^*$, $X^+$, and $uv$ denote the empty word, the set of all words over $X$, the set of all nonempty words over $X$, and the concatenation of words $u$ and $v$, respectively. At every moment $\mathcal{M}$ is in exactly one state $s \in S$. When an element $x \in X$ is read in, the machine changes its state to $f_s(s, x)$ and yields the output $f_o(s, x)$. The processing of any input word $w \in X^+$ by $\mathcal{M}$ always starts with $\mathcal{M}$ being in the initial state. If for some $x \in X$ and $s \in S$, it holds that $f_s(s, x) = s$, then it is said that there is an $x$-loop in the state $s$. If there exist $m$ ($m \ge 2$) distinct states $s_1, \ldots, s_m \in S$ and an input $x \in X$, such that $f_s(s_i, x) = s_{i+1}$ for all $i = 1, \ldots, m-1$ and $f_s(s_m, x) = s_1$, then the set $\{s_1, \ldots, s_m\}$ is said to be an $x$-cycle of length $m$ passing through the states $s_1, \ldots, s_m$. It is convenient to extend the domain of $f_s$ and $f_o$ from $S \times X$ to $S \times X^*$ and $S \times X^+$, respectively:
• $\forall s \in S$; $f_s(s, \Lambda) = s$,
• $\forall s \in S$, $\forall w \in X^*$, $\forall x \in X$; $f_s(s, wx) = f_s(f_s(s, w), x)$ and $f_o(s, wx) = f_o(f_s(s, w), x)$.
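The definitions above translate almost directly into code. The following is a minimal sketch, not the authors' implementation: the class name `StateMachine`, its method names, and the parity example are all illustrative assumptions. It represents $f_s$ and $f_o$ as plain functions and implements the extended maps, the $x$-loop test, and a search for an $x$-cycle through a given state.

```python
from dataclasses import dataclass
from typing import Callable, Hashable, Sequence


@dataclass(frozen=True)
class StateMachine:
    next_state: Callable[[Hashable, Hashable], Hashable]  # f_s : S x X -> S
    output: Callable[[Hashable, Hashable], Hashable]      # f_o : S x X -> Y
    initial: Hashable                                     # s_0

    def run(self, state, word: Sequence):
        """Extended next-state map f_s(s, w); the empty word leaves s unchanged."""
        for symbol in word:
            state = self.next_state(state, symbol)
        return state

    def last_output(self, state, word: Sequence):
        """Extended output map f_o(s, wx) = f_o(f_s(s, w), x) for nonempty words."""
        if len(word) == 0:
            raise ValueError("f_o is only defined on nonempty words")
        return self.output(self.run(state, word[:-1]), word[-1])

    def has_loop(self, state, symbol) -> bool:
        """True if there is an x-loop in `state`, i.e. f_s(s, x) = s."""
        return self.next_state(state, symbol) == state

    def x_cycle(self, state, symbol, max_length=1000):
        """Return the x-cycle (length >= 2) passing through `state`, or None."""
        cycle = [state]
        current = self.next_state(state, symbol)
        while current != state:
            if len(cycle) >= max_length or current in cycle:
                return None  # the orbit never returns to `state`
            cycle.append(current)
            current = self.next_state(current, symbol)
        return cycle if len(cycle) >= 2 else None


# Hypothetical example: a two-state parity machine over the inputs '0' and '1'.
def parity_step(state, symbol):
    if symbol == "0":
        return state
    return "odd" if state == "even" else "even"


parity = StateMachine(next_state=parity_step, output=parity_step, initial="even")
```

In this example the input '0' produces a loop in each state, while the input '1' produces a cycle of length 2 through both states.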
Yet further generalization of $f_o$ is useful: $\forall s \in S$, $\forall w = x_1 x_2 \ldots x_n \in X^+$;
$$f_o^+(s, w) = f_o(s, x_1)\, f_o(s, x_1 x_2) \ldots f_o(s, x_1 x_2 \ldots x_n).$$
A distinguishing sequence of $\mathcal{M}$ is a word $w \in X^+$ such that there are no two states $s_1, s_2$ of $\mathcal{M}$ for which $f_o^+(s_1, w) = f_o^+(s_2, w)$. The behavior of $\mathcal{M}$ is a map $B_{\mathcal{M}} : X^+ \to Y$: $\forall w \in X^+$; $B_{\mathcal{M}}(w) = f_o(s_0, w)$. A state $s_2 \in S$ is said to be accessible and $x$-accessible from the state $s_1 \in S$ if there exists some $w \in X^*$ and $w \in \{x\}^*$, respectively, such that $s_2 = f_s(s_1, w)$. $\mathcal{M}$ is said to be connected if every state $s \in S$ is accessible from $s_0$. The set of all states that are $x$-accessible from a state $s \in S$ is denoted by $Acc(x, s)$. An $x$-cycle $\gamma = \{s_1, \ldots, s_m\}$ is said to be $x$-accessible from a state $p \in S$ if $\gamma \subseteq Acc(x, p)$. An input word $w \in X^*$ is leading to a state $q$ if $f_s(s_0, w) = q$. An input word leading to $q$ is minimal if there is no input word leading to $q$ of shorter length. We shall also need some concepts concerning state and machine equivalence. Let $\mathcal{M}_i = (X, Y, S_i, f_s^i, f_o^i, s_{0i})$, $i = 1, 2$, be two SMs. States $s_1 \in S_1$ and $s_2 \in S_2$ are said to be equivalent if there is no nonempty word over $X$ that would cause $\mathcal{M}_1$ to give different output from that given by $\mathcal{M}_2$, provided that $\mathcal{M}_1$ and $\mathcal{M}_2$ started from $s_1$ and $s_2$, respectively. This is formally represented by the equivalence relation $E(\mathcal{M}_1, \mathcal{M}_2) \subseteq S_1 \times S_2$:
$$(s_1, s_2) \in E(\mathcal{M}_1, \mathcal{M}_2) \quad \text{iff} \quad \forall w \in X^+;\ f_o^1(s_1, w) = f_o^2(s_2, w).$$
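The set $\{p \in S_2 \mid (q, p) \in E(\mathcal{M}_1, \mathcal{M}_2)\}$ of all states of $\mathcal{M}_2$ that are equivalent to a state $q \in S_1$ of $\mathcal{M}_1$ is denoted by $[q]_{E(\mathcal{M}_1, \mathcal{M}_2)}$. When $\mathcal{M}_1 = \mathcal{M}_2 = \mathcal{M}$, the equivalence relation $E(\mathcal{M}, \mathcal{M})$ partitions the state set $S$ of $\mathcal{M}$ into the set of disjoint equivalence classes $S/E(\mathcal{M}, \mathcal{M})$. $\mathcal{M}_1$ and $\mathcal{M}_2$ are said to be equivalent if for every state $s_1 \in S_1$ there exists a state $s_2 \in S_2$ such that $(s_1, s_2) \in E(\mathcal{M}_1, \mathcal{M}_2)$, and vice versa. If there exists a bijection $b_S : S_1 \to S_2$ satisfying
• $\forall s \in S_1$, $\forall x \in X$; $b_S(f_s^1(s, x)) = f_s^2(b_S(s), x)$ and $f_o^1(s, x) = f_o^2(b_S(s), x)$,
• $b_S(s_{01}) = s_{02}$,
then $\mathcal{M}_1$ and $\mathcal{M}_2$ are said to be isomorphic. Isomorphic SMs can be considered identical since they differ only in the names of states. An SM is said to be reduced if no two of its states are equivalent to each other. A reduced SM equivalent to $\mathcal{M} = (X, Y, S, f_s, f_o, s_0)$ is $(X, Y, S/E(\mathcal{M}, \mathcal{M}), f'_s, f'_o, [s_0]_{E(\mathcal{M}, \mathcal{M})})$ with $f'_s : S/E(\mathcal{M}, \mathcal{M}) \times X^* \to S/E(\mathcal{M}, \mathcal{M})$ and $f'_o : S/E(\mathcal{M}, \mathcal{M}) \times X^+ \to Y$ defined as follows:
$$\forall s \in S, \forall w \in X^*; \quad f'_s([s]_{E(\mathcal{M}, \mathcal{M})}, w) = [f_s(s, w)]_{E(\mathcal{M}, \mathcal{M})}, \tag{1}$$
$$\forall s \in S, \forall w \in X^+; \quad f'_o([s]_{E(\mathcal{M}, \mathcal{M})}, w) = f_o(s, w). \tag{2}$$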
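For a finite state set, the equivalence classes $S/E(\mathcal{M}, \mathcal{M})$ can be computed by iterated partition refinement. The sketch below is one standard way to do this and is not taken from the text: the function name, the dictionary-based encoding of $f_s$ and $f_o$, and the three-state example are all assumptions made for illustration. Two states stay in the same block as long as they produce the same output on every input symbol and their successors lie in the same blocks.

```python
def equivalence_classes(states, inputs, next_state, output):
    """Map each state to the index of its class in S/E(M, M) for a finite SM.

    `next_state` and `output` are dictionaries keyed by (state, symbol).
    """
    block_of = {s: 0 for s in states}  # start with a single block
    while True:
        signatures = {}
        refined = {}
        for s in states:
            # A state's signature: its current block plus, for every input
            # symbol, the output produced and the block of the successor.
            key = (block_of[s],
                   tuple((output[(s, x)], block_of[next_state[(s, x)]])
                         for x in inputs))
            refined[s] = signatures.setdefault(key, len(signatures))
        if len(signatures) == len(set(block_of.values())):
            return refined  # no block was split: the partition is stable
        block_of = refined


# Hypothetical three-state machine in which 'a' and 'b' are equivalent.
example_next = {("a", "0"): "b", ("a", "1"): "c",
                ("b", "0"): "a", ("b", "1"): "c",
                ("c", "0"): "c", ("c", "1"): "a"}
example_out = {("a", "0"): 0, ("a", "1"): 1,
               ("b", "0"): 0, ("b", "1"): 1,
               ("c", "0"): 1, ("c", "1"): 0}
example_blocks = equivalence_classes(["a", "b", "c"], ["0", "1"],
                                     example_next, example_out)
```

The reduced machine of equations (1) and (2) then has one state per block, with transitions and outputs read off from any representative of the block.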
3 Dynamical Systems
Analysis of dynamical systems (DSs) via state space structures plays an important role in experimenting and interpreting complex systems. Most of the important qualitative behaviors of a nonlinear system can be made explicit in the state space with a state space analysis. In this paper only discrete-time DSs (i.e., DSs evolving in discrete time) will be considered. Our theoretical knowledge about nonlinear DSs is far from complete. The state space of a nonlinear DS often consists of qualitatively different regions. It is useful to take into account the geometric information about the structures and spatial arrangements of these regions. Among the most important characteristics of a DS are the fixed points, periodic orbits, their stability types, and the spatial arrangement of the corresponding stability regions. We review some of the basic concepts in DS theory. A discrete-time DS can be represented as the iteration of a (differentiable, invertible) function $f : A \to A$ ($A \subseteq \Re^n$), i.e.,
$$x_{t+1} = f(x_t), \quad t \in Z, \tag{3}$$
where $Z$ denotes the set of all integers. For each $x \in A$, the iteration (3) generates a sequence of distinct points defining the orbit, or trajectory, of $x$ under $f$. Hence, the (forward) orbit of $x$ under $f$ is the set $\{f^m(x) \mid m \ge 0\}$. For $m \ge 1$, $f^m$ is the composition of $f$ with itself $m$ times. $f^0$ is defined to be the identity map on $A$. A point $x_* \in A$ is called a fixed point of $f$ if $f^m(x_*) = x_*$ for all $m \in Z$. A point $x_* \in A$ is a periodic point of $f$ if $f^q(x_*) = x_*$ for some $q \ge 1$. The least such value of $q$ is called the period of the point $x_*$ and the orbit of $x_*$. The set $\{x_*, f(x_*), \ldots, f^{q-1}(x_*)\}$ is said to be a periodic orbit of $x_*$ of period $q$. Notice that a fixed point is a periodic point of period 1, and a periodic point of $f$ with period $q$ is a fixed point of $f^q$. If $x_*$ is a periodic point of period $q$ for $f$, then so are all of the other points in the orbit of $x_*$.
Fixed and periodic points can be classified according to the behavior of the orbits of points in their vicinity. A fixed point $x_*$ is said to be asymptotically stable (or an attractive point of $f$) if there exists a neighborhood $O(x_*)$ of $x_*$ such that $\lim_{m \to \infty} f^m(x) = x_*$ for all $x \in O(x_*)$. As $m$ increases, trajectories of points near an asymptotically stable fixed point tend to it. The basin of attraction of an attractive fixed point $x_*$ is the set $\{x \in A \mid \lim_{m \to \infty} f^m(x) = x_*\}$. A fixed point $x_*$ of $f$ is asymptotically stable only if for each eigenvalue $\lambda$ of $Df(x_*)$, the Jacobian of $f$ at $x_*$, $|\lambda| < 1$ holds. The eigenvalues of
$Df(x_*)$ govern whether or not the map $f$ in a vicinity of $x_*$ has contracting or expanding directions. Eigenvalues larger in absolute value than 1 lead to expansion, whereas eigenvalues smaller than 1 lead to contraction. If all the eigenvalues of $Df(x_*)$ are outside the unit circle, $x_*$ is a repulsive point, or repellor. All points from a neighborhood of a repellor move away from it as $m$ increases, or equivalently, move towards it as $-m$ decreases.^3 If some eigenvalues of $Df(x_*)$ are inside and some are outside the unit circle, $x_*$ is said to be a saddle point. There is a set $W^s$ of points $x$ such that the trajectory of $x$ tends to $x_*$ for $m \to \infty$. $W^s$ is called the stable invariant manifold of $x_*$. Similarly, the unstable invariant manifold of $x_*$, $W^u$, is the set of points $x$ such that the trajectory of $x$ tends to $x_*$ for $m \to -\infty$. Since any periodic point of period $q$ can be thought of as a fixed point of $f^q$, these remarks apply to periodic points as well. An absorbing set of a set $B \subseteq A$ under the map $f$ is a set $P$ such that for all $x \in B$, there exists $m_0 \ge 0$ for which $f^m(x) \in P$ for all $m \ge m_0$. For a given $x \in B$, the least such value of $m_0$ is called the absorption level of $x$ in $P$ under the map $f$. An absorption region of $P$ under the map $f$ is defined as follows:
$$A_f(P) = \{x \in A \mid \text{there exists } m_0 \ge 0 \text{ such that } f^m(x) \in P \text{ for all } m \ge m_0\}.$$
When $A \subseteq \Re$ or $A \subseteq \Re^2$, it is useful to code with colors (or different gray levels) the absorption levels of points from $A_f(P)$ in $P$. We will refer to such a diagram as an absorption diagram of $P$ under the map $f$. $B \subseteq A$ is said to be a positively invariant set of $f$ if $f(B) \subseteq B$; i.e., trajectories of points from $B$ stay in $B$. Trivially, $A$ is a positively invariant set of $f$, but in an effort to understand the dynamics of (3), we are usually interested in finding as minimal a positively invariant set^4 as possible. If $B$ is open and^5 $f(\bar{B}) \subset B$, then the set $\tilde{B} = \bigcap_{m \ge 0} f^m(\bar{B})$ is not only positively invariant, but also attracting, meaning that there is a neighborhood of $\tilde{B}$ such that all orbits starting in that neighborhood converge to $\tilde{B}$. Attractive fixed points and periodic orbits are trivial examples of attractive sets. Much more complicated attractive sets can be found in the dynamical systems literature under the name strange attractors [Dev86].^6 As in the case of an attractive fixed point, the basin of attraction of an attractive set $B$ is the set of all points whose orbits converge to $B$. If $B \subseteq A$ is a positively invariant set of $f$ then it is certainly an absorbing set of itself under $f$. $B$ may be an attracting set of $f$, or it may contain an attractive set of $f$,^7 or none of the two.^8 To learn more about the theory of DSs, see, for example, Guckenheimer and Holmes [GH82].
^3 $f^{-m} = (f^{-1})^m$
^4 in the sense of inclusion
^5 $\bar{B}$ denotes the closure of $B$.
^6 Loosely speaking, strange attractors are attractive sets that are topologically distinct from (i.e., cannot be transformed by a homeomorphism to) the trivial attractive sets mentioned above.
FIGURE 1. RNN model used for learning FSMs.
4 Recurrent Neural Networks
The RNN presented in Figure 1 was shown to be able to learn mappings that can be described by finite state machines [nHG95]. A binary input vector $I^{(t)} = (I_1^{(t)}, \ldots, I_N^{(t)})$ corresponds to the activations of $N$ input neurons. There are two types of hidden neurons in the network:
• $K$ hidden nonrecurrent neurons $H_1, \ldots, H_K$, activations of which are denoted by $H_j^{(t)}$, $j = 1, \ldots, K$.
• $L$ hidden recurrent neurons $S_1, \ldots, S_L$, called state neurons. We refer to the activations of state neurons by $S_i^{(t)}$, $i = 1, \ldots, L$. The vector $S^{(t)} = (S_1^{(t)}, \ldots, S_L^{(t)})$ is called the state of the network.
^7 Note that this does not necessarily imply that $B$ is part of the basin of attraction of an attractive set contained in $B$. Think of an attractive periodic orbit inside $B$ that encircles a repelling fixed point.
^8 The identity map constitutes a simple example.
$W_{iln}$, $Q_{jln}$, and $V_{mk}$ are real-valued weights, and $g$ is a sigmoid function $g(x) = 1/(1 + e^{-x})$. The activations of hidden nonrecurrent neurons are determined by
$$H_j^{(t)} = g\Big( \sum_{l,n} Q_{jln}\, S_l^{(t)}\, I_n^{(t)} \Big).$$
The activations of state neurons at the next time step $(t+1)$ are computed as follows:
$$S_i^{(t+1)} = g\Big( \sum_{l,n} W_{iln}\, S_l^{(t)}\, I_n^{(t)} \Big) = S_i(S^{(t)}, I^{(t)}). \tag{4}$$
The output of the network at time $t$ is the vector $(O_1^{(t)}, \ldots, O_M^{(t)})$ of activations of $M$ output neurons $O_1, \ldots, O_M$. The network output is determined by
$$O_m^{(t)} = g\Big( \sum_{k} V_{mk}\, H_k^{(t)} \Big) = O_m(S^{(t)}, I^{(t)}). \tag{5}$$
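One time step of the network defined by equations (4) and (5) can be sketched in a few lines. The code below is illustrative rather than the authors' implementation: the function names and the nested-list layout of the weights (`W[i][l][n]`, `Q[j][l][n]`, `V[m][k]`) are assumptions, and no training is included.

```python
import math


def sigmoid(x):
    """g(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def second_order_layer(weights, state, inp):
    """Activations g(sum over l, n of weights[i][l][n] * state[l] * inp[n])."""
    return [sigmoid(sum(w_iln * state[l] * inp[n]
                        for l, row in enumerate(w_i)
                        for n, w_iln in enumerate(row)))
            for w_i in weights]


def rnn_step(W, Q, V, state, inp):
    """Return (next_state, output) for the current state S(t) and input I(t)."""
    hidden = second_order_layer(Q, state, inp)      # nonrecurrent neurons H(t)
    output = [sigmoid(sum(v * h for v, h in zip(v_m, hidden)))
              for v_m in V]                         # equation (5)
    next_state = second_order_layer(W, state, inp)  # equation (4)
    return next_state, output
```

For a fixed one-hot input vector, `rnn_step` restricted to its first return value is exactly the map on $(0,1)^L$ whose fixed points and periodic orbits are studied in the dynamical systems approach.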
Network states are elements of the $L$-dimensional open interval $(0,1)^L$, the internal region of the $L$-dimensional hypercube. A unary encoding of symbols of both the input and output alphabets is used, with one input and one output neuron for each input and output symbol, respectively. The bijection defining the encoding of $N$ input symbols into $N$-dimensional binary vectors with just one active bit is denoted by $c_I$. Similarly, the bijection that defines the encoding of $M$ output symbols into $M$-dimensional one-active-bit binary vectors is denoted by $c_O$. The vector $I^{(t)} = (I_1^{(t)}, \ldots, I_N^{(t)}) \in \{0,1\}^N$ of activations of input neurons corresponds to the input symbol $c_I^{-1}(I_1^{(t)}, \ldots, I_N^{(t)})$. Activation of each output neuron is from the open interval $(0,1)$. A threshold $\Delta \in (0, \frac{1}{2})$ is introduced such that any value from $(0, \Delta)$ is assumed to be an approximation of 0, and any value from $(1 - \Delta, 1)$ represents the value 1. A mapping $r : (0,1) \to \{0, 1, -1\}$ is defined as follows:^9
$$r(x) = \begin{cases} 0 & \text{if } x \in (0, \Delta), \\ 1 & \text{if } x \in (1 - \Delta, 1), \\ -1 & \text{otherwise.} \end{cases}$$
^9 $-1$ represents the don't know output of an output neuron.
Interpretation of network output in terms of output symbols of the FSM it models is performed via mapping $D$:^10
$$D(y_1, \ldots, y_M) = \begin{cases} c_O^{-1}(y_1, \ldots, y_M) & \text{if } y_i \in \{0, 1\},\ i = 1, \ldots, M, \\ * & \text{otherwise.} \end{cases}$$
If the output of the network, $O^{(t)} = (O_1^{(t)}, \ldots, O_M^{(t)})$, falls into $((0, \Delta) \cup (1 - \Delta, 1))^M$, then it corresponds to the output symbol
$$D(r(O_1^{(t)}), \ldots, r(O_M^{(t)})) = c_O^{-1}(r(O_1^{(t)}), \ldots, r(O_M^{(t)})) = c_O^{-1}(R(O_1^{(t)}, \ldots, O_M^{(t)})),$$
where the map $R$ is the component-wise application of the map $r$. Each input word (a word over the input alphabet of the FSM used for training) is encoded into the input neurons one symbol per discrete time step $t$, yielding the corresponding output, as well as the network new state. Training is performed via optimization with respect to the error function
$$E = \frac{1}{2} \sum_{m} \left( T_m^{(t)} - O_m^{(t)} \right)^2,$$
where $T_m^{(t)} \in \{0, 1\}$ is the desired response value for the $m$-th output neuron at the time step $t$. For a more detailed explanation of the training procedure see Tiňo et al. [nHG95].
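The decoding of network outputs into output symbols can be sketched as follows. The names `r`, `decode`, and `DONT_KNOW`, the default threshold value, and the representation of $c_O^{-1}$ as a list lookup are illustrative assumptions. One further assumption is labeled in the code: a thresholded vector that is not a valid one-active-bit code is also mapped to the don't know symbol, since $c_O^{-1}$ is only defined on such codes.

```python
DONT_KNOW = "*"


def r(x, delta):
    """Threshold map r: 0 near zero, 1 near one, -1 (don't know) otherwise."""
    if 0.0 < x < delta:
        return 0
    if 1.0 - delta < x < 1.0:
        return 1
    return -1


def decode(outputs, symbols, delta=0.1):
    """Apply R component-wise and then D, with symbols[m] coded by active bit m.

    Assumption: vectors that threshold to something other than exactly one
    active bit are decoded as DONT_KNOW.
    """
    bits = [r(o, delta) for o in outputs]
    if -1 in bits or sum(bits) != 1:
        return DONT_KNOW
    return symbols[bits.index(1)]
```

For example, with two output symbols, activations close to (1, 0) decode to the first symbol, while an activation near 0.5 on either neuron yields the don't know symbol.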
5
RNN as a State Machine
In this section we assume that an RNN $\mathcal{N}$ of the type described above has learned to exactly mimic the behavior of the reduced, connected FSM $\mathcal{M} = (X, Y, Q, \delta, \lambda, s_0)$ it was trained with. It follows that there exists a network state $S^0$ for which the network output will always be in $((0, \Delta) \cup (1 - \Delta, 1))^M$ upon presentation of any input word, and such that the following correspondence holds (time is set to $t = 1$):^11

$$\forall w = x_1 \ldots x_n \in X^+: \quad \lambda(q_i, x_i) = D(R(O^{(i)})) \quad \text{for all } i = 1, \ldots, n, \qquad (6)$$

where

• $q_1 = s_0$,
• $S^{(1)} = S^0$,
• $q_{i+1} = \delta(q_i, x_i)$, $i = 1, \ldots, n - 1$,
• the network input $I^{(i)}$ at the time step $i$ is the code $c_I(x_i)$ of the $i$th input symbol $x_i$ of the input word $w$.

^10 It is assumed that $*$ does not belong to the set of output symbols of the FSM modeled by the RNN; $*$ stands for the "don't know" output of the net.
^11 In practical terms, during the learning phase, the network is trained to respond to a special "reset" input symbol $\#$ ($\# \notin X$) by changing its state to a state equivalent to $s_0$, the initial state of $\mathcal{M}$ (more details in [nHG95]). $S^0$ is the "next-state" computed in the layer of recurrent state neurons when the symbol $\#$ is presented to the network input after the training process has been completed.

Automata theory provides us with the ability to connect structural and behavioral equivalence of automata [Shi87]. In particular, it can be shown that for any couple $(\mathcal{M}_1, \mathcal{M}_2)$ of connected FSMs with equal input as well as output sets the following holds: if $B_{\mathcal{M}_1} = B_{\mathcal{M}_2}$, then $\mathcal{M}_1$ and $\mathcal{M}_2$ are equivalent and their reduced forms are isomorphic. To investigate the correspondence between $\mathcal{N}$ and $\mathcal{M}$ in this context, we represent the network $\mathcal{N}$ as an SM $\mathcal{N} = (X, Y \cup \{*\}, \mathcal{S}, \tau, \nu, S^0)$, where the maps $\nu$ and $\tau$ are defined as follows: for any $S = (S_1, \ldots, S_L) \in \mathcal{S}$ and any $x \in X$,

$$\nu(S, x) = D(R(O_1(S, c_I(x)), \ldots, O_M(S, c_I(x))))$$

and

$$\tau(S, x) = (S_1(S, c_I(x)), \ldots, S_L(S, c_I(x))),$$

with $O_i$ and $S_j$ defined by (5) and (4) respectively. From (6) it follows that

$$\forall w \in X^+: \quad \lambda^+(s_0, w) = \nu^+(S^0, w). \qquad (7)$$
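The correspondence (7) can be checked numerically on a toy example. The sketch below is only an illustration under stated assumptions: the two-state "parity" FSM, the one-neuron next-state map `tau`, and the choice that the output neuron simply copies the next state are all hypothetical and are not the network of equations (4) and (5). The rounding margin plays the role of $\Delta$, and `"*"` is the don't-know output.

```python
import math
from itertools import product

MARGIN = 0.1  # plays the role of Delta in the text


def sigmoid(p):
    return 1.0 / (1.0 + math.exp(-p))


# Hypothetical FSM: 'b' toggles between two states, 'a' is a loop in both.
FSM_NEXT = {("even", "a"): "even", ("even", "b"): "odd",
            ("odd", "a"): "odd", ("odd", "b"): "even"}


def fsm_output(state, x):
    # Output symbol 1 when the next state is "odd", otherwise 0.
    return 1 if FSM_NEXT[(state, x)] == "odd" else 0


def lambda_plus(word, state="even"):
    """Output string of the FSM for an input word."""
    out = []
    for x in word:
        out.append(fsm_output(state, x))
        state = FSM_NEXT[(state, x)]
    return out


def tau(state, x):
    """Toy one-neuron next-state map (an assumption, not equation (4))."""
    sign = 1.0 if x == "a" else -1.0
    return sigmoid(sign * 10.0 * (state - 0.5))


def nu(state, x):
    """Decoded output: round the activation, or answer don't-know."""
    activation = tau(state, x)  # toy choice: output neuron copies next state
    if activation < MARGIN:
        return 0
    if activation > 1.0 - MARGIN:
        return 1
    return "*"


def nu_plus(word, state):
    """Decoded output string of the toy network for an input word."""
    out = []
    for x in word:
        out.append(nu(state, x))
        state = tau(state, x)
    return out


def behaviors_agree(start_state, max_len):
    """Check lambda^+(s_0, w) == nu^+(S^0, w) for all words up to max_len."""
    return all(lambda_plus(w) == nu_plus(w, start_state)
               for n in range(1, max_len + 1)
               for w in product("ab", repeat=n))
```

With these assumptions, a start state such as `0.02` behaves like the FSM's initial state on every tested word, whereas `0.5` immediately yields the don't-know output. An exhaustive check up to a finite word length is of course only evidence of equivalence, not a proof.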
The set $\mathcal{S} = (0,1)^L$ of states of $\mathcal{N}$ can be partitioned into the set of equivalence classes corresponding to the equivalence relation $E(\mathcal{N}, \mathcal{N})$. By presenting inputs to the network and considering only the decoded network outputs, it is impossible to distinguish between equivalent network states. $[S^0]_{E(\mathcal{N},\mathcal{N})}$ is the set of all network states equivalent to $S^0$. Denote the set of network states accessible from states from $[S^0]_{E(\mathcal{N},\mathcal{N})}$ by $\mathcal{S}_{acc}$. Note that for every state $S \in \mathcal{S}_{acc}$ and for each input word $w \in X^+$, $\nu^+(S, w)$ does not contain the don't know symbol $*$. From $\mathcal{N}$, a reduced, connected SM $\mathcal{N}_1 = (X, Y, \mathcal{S}_{acc}/E(\mathcal{N},\mathcal{N}), \tau_1, \nu_1, [S^0]_{E(\mathcal{N},\mathcal{N})})$ is constructed, where $\tau_1$ and $\nu_1$ are defined according to (1) and (2) respectively, and respectively restricted to $\mathcal{S}_{acc}/E(\mathcal{N},\mathcal{N}) \times X^*$ and $\mathcal{S}_{acc}/E(\mathcal{N},\mathcal{N}) \times X^+$. $\mathcal{N}_1$ has the same behavior as $\mathcal{M}$. It is easy to see that the number of states of $\mathcal{N}_1$ is finite and hence $\mathcal{N}_1$ is an FSM. It follows that $\mathcal{N}_1$ and $\mathcal{M}$ are isomorphic. The set $[q]_{E(\mathcal{M},\mathcal{N})}$ of all network states equivalent to the state $q$ of $\mathcal{M}$ is denoted by $(q)_{\mathcal{N}}$. States of an SM code the information about "what has happened so far in the course of input word processing." From that point of view, all network states from $(q)_{\mathcal{N}}$ code the same information, the information that is coded by the state $q$ of $\mathcal{M}$.

So far we have dealt with the existence issues concerning nonempty regions of network states equivalent to states of the FSM the network is capable of exactly mimicking. For a "constructive" approach to the determination of $(q)_{\mathcal{N}}$, the regions $\mathcal{N}_x^y$ of network state space are identified for which the network $\mathcal{N}$ gives the (decoded) output $y$, provided that the code of the input symbol $x$ is presented at the network input. In particular, $\mathcal{N}_x^y = \{S \in \mathcal{S} \mid \nu(S, x) = y\}$. Note that for each $x \in X$ and $y \in Y$, $\mathcal{N}_x^y$ is an open set. For a given input word $w = x_1 x_2 \ldots x_n \in X^+$, the set $\mathcal{N}_w^{\lambda^+(q,w)}$ of all network states originating the output equal to $\lambda^+(q, w)$ is

$$\mathcal{N}_w^{\lambda^+(q,w)} = \mathcal{N}_{x_1}^{\lambda(q, x_1)} \cap \bigcap_{i=2}^{n} \left(\tau_{x_{i-1}} \circ \cdots \circ \tau_{x_2} \circ \tau_{x_1}\right)^{-1}\left(\mathcal{N}_{x_i}^{\lambda(q_i, x_i)}\right), \qquad (8)$$

where $q_1 = q$, $q_{i+1} = \delta(q_i, x_i)$, and

$$\tau_x(S) = \tau(S, x) \quad \text{for each } x \in X. \qquad (9)$$

By $f^{-1}(A)$, where $f$ is a map and $A$ is a set, we denote the set of all points whose images under $f$ are in $A$. For any $x \in X$, $\tau_x$ is continuous, and so is the composition $\tau_{x_m} \circ \cdots \circ \tau_{x_2} \circ \tau_{x_1}$ for any word $x_1 x_2 \ldots x_m \in X^+$. It follows that the sets $\mathcal{N}_w^{\lambda^+(q,w)}$ are open. However, the set

$$(q)_{\mathcal{N}} = \bigcap_{w \in X^+} \mathcal{N}_w^{\lambda^+(q,w)} \qquad (10)$$

of network states equivalent to the state $q$ of $\mathcal{M}$ is not necessarily open, since an infinite, countable intersection of open sets is not guaranteed to be open.^12 If $(q)_{\mathcal{N}}$ is open, $(q)_{\mathcal{N}} \neq \emptyset$ implies that there exists a (finite) length $L$ of input words such that^13

$$(q)_{\mathcal{N}} = \bigcap_{|w| \le L} \mathcal{N}_w^{\lambda^+(q,w)}.$$

If there is an $x$-loop in a state $q$ of $\mathcal{M}$ with output $y = \lambda(q, x)$, then

$$\tau_x((q)_{\mathcal{N}}) \subseteq (q)_{\mathcal{N}} \subseteq \bigcap_{i \ge 0} (\tau_x^i)^{-1}\left(\mathcal{N}_x^{y}\right). \qquad (11)$$

As in Section 3, $\tau_x^i$ is the composition of $\tau_x$ with itself $i$ times; $\tau_x^0$ is defined to be the identity map.

^12 The case when trajectories in the RNN state space may be corrupted by noise is not discussed in this paper. However, we note that if $(q)_{\mathcal{N}}$ is not open, arbitrarily close to a state $S \in (q)_{\mathcal{N}}$ there is a network state not equivalent to the state $q$ of $\mathcal{M}$, and an arbitrarily small perturbation of $S$ may cause failure in the RNN modeling of $\mathcal{M}$.
^13 $|w|$ denotes the length of the word $w$, i.e., the number of symbols contained in $w$.
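Analogously, if there is an $x$-cycle of length $m$ passing through states $q_1, \ldots, q_m$ with outputs $y_i = \lambda(q_i, x)$, $i = 1, \ldots, m$, then

$$(q_1)_{\mathcal{N}} \subseteq \bigcap_{j=1}^{m} (\tau_x^{j-1})^{-1}\left[\bigcap_{i \ge 0} \left((\tau_x^m)^i\right)^{-1}\left(\mathcal{N}_x^{y_j}\right)\right]. \qquad (12)$$

Similar bounds can be found for $(q_2)_{\mathcal{N}}, \ldots, (q_m)_{\mathcal{N}}$; in particular,

$$\tau_x^m((q_j)_{\mathcal{N}}) \subseteq (q_j)_{\mathcal{N}} \subseteq \bigcap_{i \ge 0} \left((\tau_x^m)^i\right)^{-1}\left(\mathcal{N}_x^{y_j}\right), \quad j = 1, \ldots, m. \qquad (13)$$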
Some researchers have attempted to extract a learned automaton from a trained recurrent network [GMC+92a], [CSSM89], [WK92a], [nHG95]. Extraction procedures rely on the assumption that equivalent network states are grouped together in well-separated regions in the recurrent neurons' activation space. After training, the network state space is partitioned into clusters using some clustering tools, and for each $q \in Q$, the region $(q)_{\mathcal{N}}$ is approximated by (possibly) several clusters so obtained. For example, in Giles et al. [GMC+92a] the network state neurons' activation space is divided into several equal hypercubes. When the number of hypercubes is sufficiently high, each hypercube is believed to contain only mutually equivalent states. After training, Tiňo et al. [nHG95] present a large number of input words to the network input. All states the network passes through during the presentation are saved. Then the clustering of those states is performed using a Kohonen map with the "star" topology of a neural field consisting of several "branches" of neurons connected to one "central" neuron. Such a topology helped to reduce the great sensitivity to initial conditions found in vector-coding algorithms using independent cluster centers, while avoiding the time-consuming approximation of the input space topology typical of the classical regular-grid topologies of the Kohonen map [nJV94]. Other approaches to RNN state space clustering are discussed in Tiňo et al. [nHG95].

Having approximated the regions $(q)_{\mathcal{N}}$, the automaton $\mathcal{N}_1$ is constructed by determining arcs in the corresponding transition diagram, followed by nondeterminism elimination and minimization procedures. All ideas presented in this section stem from the assumption that the network $\mathcal{N}$ exactly mimics the FSM $\mathcal{M}$ it was trained with. However, it is possible that a correct automaton is extracted from a trained RNN even though the network is known to generalize poorly on long, unseen input words [GMC+92a]. This is discussed in Section 8.
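The hypercube-based extraction idea can be sketched in a few lines. The code below is a minimal illustration in the spirit of the quantization approach attributed above to Giles et al., not a reproduction of their procedure: the one-neuron map `toy_tau` is a hypothetical stand-in for a trained network, and each hypercube is represented by the first network state found inside it, which is a simplifying assumption.

```python
import math
from collections import deque


def sigmoid(p):
    return 1.0 / (1.0 + math.exp(-p))


def toy_tau(state, symbol):
    """Hypothetical one-neuron next-state map: 'a' keeps the side, 'b' flips it."""
    (s,) = state
    sign = 1.0 if symbol == "a" else -1.0
    return (sigmoid(sign * 10.0 * (s - 0.5)),)


def cell_of(state, bins):
    """Index of the hypercube containing a state, with `bins` cells per axis."""
    return tuple(min(int(s * bins), bins - 1) for s in state)


def extract_automaton(tau, alphabet, start_state, bins):
    """Breadth-first exploration of the hypercubes reachable from start_state.

    Every visited hypercube becomes one automaton state; it is represented
    by the first network state that landed in it (simplifying assumption).
    """
    start_cell = cell_of(start_state, bins)
    representative = {start_cell: start_state}
    transitions = {}
    queue = deque([start_cell])
    while queue:
        cell = queue.popleft()
        for x in alphabet:
            nxt = tau(representative[cell], x)
            nxt_cell = cell_of(nxt, bins)
            transitions[(cell, x)] = nxt_cell
            if nxt_cell not in representative:
                representative[nxt_cell] = nxt
                queue.append(nxt_cell)
    return start_cell, transitions
```

For the toy map with two cells per axis, the extracted diagram has two states with an a-loop in each and b-arcs between them. In practice the result would still have to go through the minimization step mentioned in the text, and the choice of `bins` trades resolution against the number of spurious states.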
5.1
Experiments
A number of experiments were performed in which RNNs with two or three state neurons were trained with simple FSMs. To show how the network learned to organize its state space in order to mimic a given FSM, the regions corresponding to $(q)_{\mathcal{N}}$ were detected. The network state space was "covered" with a regular grid $G$ of $R \times R$ points ($R$ is on the order of hundreds), and a finite vocabulary $\Gamma$ of distinguishing sequences of $\mathcal{M}$ was created. Regions $(q)_{\mathcal{N}}$ were approximated by grouping together those network states from the grid that for each input word from the vocabulary lead to equal output strings. In other words, $(q)_{\mathcal{N}} = \bigcap_{w \in X^+} \mathcal{N}_w^{\lambda^+(q,w)}$ were approximated by $\bigcap_{w \in \Gamma} \mathcal{N}_w^{\lambda^+(q,w)} \cap G$. For example, in Figure 3 approximations of regions of equivalent network states corresponding to states of the FSM shown in Figure 2 can be seen. Figure 3 should be compared with Figure 4, showing activations of state neurons during the presentation of a training set to the RNN after training. Generally, in our experiments, regions approximating $(q)_{\mathcal{N}}$ were observed to be connected and of "simple shape." Further study needs to be devoted to this matter. However, at least empirically and for simple tasks, our use of the Kohonen map as a clustering tool [nHG95], as well as the use of the simple clustering technique introduced in Giles et al. [GMC+92a], are supported.

FIGURE 2. FSM $\mathcal{M}$ used for training an RNN. $\mathcal{M} = (X, Y, S, f_s, f_o, s_0)$ is represented as a directed graph called the state transition diagram. The graph has a node for each state, and every node has $|X|$ ($|X|$ denotes the number of elements of a finite set $X$) outgoing arcs labeled with $x|y$ ($x \in X$, $y \in Y$) according to the following rule: the arc from the node labeled with $s_1 \in S$ to the node labeled with $s_2 \in S$ is labeled with $x|y$ if $s_2 = f_s(s_1, x)$ and $y = f_o(s_1, x)$. The node corresponding to the initial state is indicated by an arrow labeled START.
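The grid-based approximation of the regions of equivalent network states can be sketched as follows. This is a minimal illustration only: the toy one-neuron maps `toy_tau` and `toy_nu` are hypothetical, the grid is one-dimensional rather than $R \times R$, and the vocabulary is a hand-picked list of words rather than a set of distinguishing sequences derived from a trained FSM.

```python
import math
from itertools import product


def sigmoid(p):
    return 1.0 / (1.0 + math.exp(-p))


def toy_tau(state, symbol):
    """Hypothetical one-neuron next-state map."""
    (s,) = state
    sign = 1.0 if symbol == "a" else -1.0
    return (sigmoid(sign * 10.0 * (s - 0.5)),)


def toy_nu(state, symbol):
    """Hypothetical decoded output: thresholded next-state activation."""
    return 1 if toy_tau(state, symbol)[0] > 0.5 else 0


def output_string(tau, nu, state, word):
    """Decoded output string produced when `word` is read from `state`."""
    out = []
    for x in word:
        out.append(nu(state, x))
        state = tau(state, x)
    return tuple(out)


def group_grid_states(tau, nu, vocabulary, points_per_axis, dim):
    """Group grid states that give equal output strings on every vocabulary word."""
    axis = [(k + 0.5) / points_per_axis for k in range(points_per_axis)]
    regions = {}
    for state in product(axis, repeat=dim):
        signature = tuple(output_string(tau, nu, state, w) for w in vocabulary)
        regions.setdefault(signature, []).append(state)
    return regions
```

Each key of the returned dictionary is one output signature, and its value approximates one region of mutually equivalent states restricted to the grid. Two grid states with the same signature are only indistinguishable with respect to the chosen vocabulary, so a vocabulary that is too small can merge regions that a longer word would separate.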
FIGURE 3. Regions of equivalent network states. The capital letter inside each region indicates to which state of $\mathcal{M}$ the network states from that region are equivalent. $\Delta = 0.1$. The two lines stemming from the origin are the lines $\tau_a(s)_1 = 1/2$ and $\tau_a(s)_2 = 1/2$; between them is the region $\mathcal{V}_{a,(1,1)}$ (see Section 6).
FIGURE 4. Activations of state neurons when the training set is presented to the network after the training process has finished (weights are frozen).
6
RNN as a Collection of Dynamical Systems
RNNs can be viewed as discrete-time DSs. The literature dealing with the relationship between RNNs and DSs is quite rich: [Hir89], [BW92], [GF89], [Cas95a], [Cas95b], [HZ92], [Jor86], [Wan91], [WB90], [Vid93], [Bee94], and [Hir94], for example. However, as has already been mentioned, the task of complete understanding of the global dynamical behavior of a given DS is not at all an easy one. In [WB90] it is shown that networks with just two recurrent neurons can exhibit chaos, and hence the asymptotic network dynamical behavior (on a chaotic attractor) can be very complex.

In order to describe the behavior of the RNN $\mathcal{N}$ by an iterative map, we confine ourselves to only one input symbol $x$ from the input alphabet of the FSM used for training $\mathcal{N}$, the code of which is repeatedly presented to the network input. The evolution of the network is described in terms of trajectories $\{S, \tau_x(S), \tau_x^2(S), \ldots\}$ in $(0,1)^L$. The iterative map $\tau_x : (0,1)^L \to (0,1)^L$ is defined in (9). As in the previous section, here we also assume that an RNN $\mathcal{N}$ exactly mimics the behavior of a reduced, connected FSM $\mathcal{M} = (X, Y, Q, \delta, \lambda, s_0)$. In this section we deal with the problem of how certain features of $\mathcal{M}$ found in its STD (such as loops and cycles) induce some specific features (such as attractive points and periodic orbits) of network global dynamical behavior.

Assume that there is an $x$-loop in a state $q$ of $\mathcal{M}$, and $\lambda(q, x) = y$. Then according to (11), $(q)_{\mathcal{N}}$ is a positively invariant set of $\tau_x$ and hence an absorbing set of itself under $\tau_x$. From (8) it follows that under $\tau_x$, $(q)_{\mathcal{N}}$ is an absorbing set of all sets $(p)_{\mathcal{N}}$ such that $q$ is $x$-accessible from $p$. If there is an open set $B$ such that $B \subseteq (q)_{\mathcal{N}}$ and $\tau_x(B) \subseteq B$, or $(q)_{\mathcal{N}} \subseteq B$ and $\tau_x(B) \subseteq (q)_{\mathcal{N}}$, then there is an attractive set $\bigcap_{m \ge 0} \tau_x^m(B)$ of $\tau_x$ in $(q)_{\mathcal{N}}$ that constitutes a stable network representation of the $x$-loop in a state $q$ of $\mathcal{M}$.
Similarly, assume that there is an $x$-cycle $\gamma$ of length $m$ passing through states $q_1, \ldots, q_m$ with outputs $y_j = \lambda(q_j, x)$, $j = 1, \ldots, m$. Then according to (13), $(q_j)_{\mathcal{N}}$ are positively invariant sets of $\tau_x^m$, and $\bigcup_{j=1}^{m} (q_j)_{\mathcal{N}}$ is a positively invariant set of $\tau_x$. A statement concerning the existence of attractive sets of $\tau_x^m$ inside $(q_j)_{\mathcal{N}}$ (or an attractive set of $\tau_x$ inside $\bigcup_{j=1}^{m} (q_j)_{\mathcal{N}}$) can be made analogously to the statement above. Considering (8), it can be seen that under $\tau_x$, $\bigcup_{q \in \gamma} (q)_{\mathcal{N}}$ is an absorbing set of itself and of all sets $(p)_{\mathcal{N}}$ such that $\gamma$ is $x$-accessible from $p$. Observation 1 formulates these ideas in a more compact form.

Observation 1: Assume that an RNN $\mathcal{N}$ exactly mimics the behavior of a reduced, connected FSM $\mathcal{M} = (X, Y, Q, \delta, \lambda, s_0)$. Then

• If there is an $x$-loop in a state $q$ of $\mathcal{M}$, then $(q)_{\mathcal{N}} \subseteq \mathcal{N}_x^{\lambda(q,x)}$ is a positively invariant set of $\tau_x$, and^14 $\bigcup_{q \in Acc(x,p)} (p)_{\mathcal{N}} \subseteq \mathcal{A}_{\tau_x}((q)_{\mathcal{N}})$.
• If there is an $x$-cycle $\gamma$ of length $m$ passing through states $q_1, \ldots, q_m$ of $\mathcal{M}$, then $(q_j)_{\mathcal{N}}$, $j = 1, \ldots, m$, are positively invariant sets of $\tau_x^m$ and $\bigcup_{j=1}^{m} (q_j)_{\mathcal{N}}$ is a positively invariant set of $\tau_x$. $(q_1)_{\mathcal{N}}, \ldots, (q_m)_{\mathcal{N}}$ are periodically visited in the process of iteration of $\tau_x$, and $\bigcup_{\gamma \in Acc(x,p)} (p)_{\mathcal{N}} \subseteq \mathcal{A}_{\tau_x}\left(\bigcup_{j=1}^{m} (q_j)_{\mathcal{N}}\right)$.
When there was an $x$-loop in a state $q$ of $\mathcal{M}$, in all our experiments an attractive fixed point $S_*$ of $\tau_x$ "near" a vertex $v \in \{0,1\}^L$ was detected (see Section 6.1 below). If $S_* \in (q)_{\mathcal{N}}$, $S_*$ constitutes a plausible network representation of the $x$-loop. If, furthermore, $S_*$ is the only attractive set of $\tau_x$ inside $(q)_{\mathcal{N}}$, then $\bigcup_{q \in Acc(x,p)} (p)_{\mathcal{N}}$ is a subset of its basin of attraction. For each input symbol $x$ of $\mathcal{M}$ and each vertex $v = (v_1, \ldots, v_L) \in \{0,1\}^L$ define the set^15

$$\mathcal{V}_{x,v} = \left\{ s \in \Re^L \;\middle|\; \tau_x(s)_i < \tfrac{1}{2} \text{ if } v_i = 0; \ \tau_x(s)_i > \tfrac{1}{2} \text{ if } v_i = 1; \ i = 1, \ldots, L \right\}.$$

Hyperplanes $\tau_x(s)_i = 1/2$ separate $\Re^L$ into $2^L$ partitions $\mathcal{V}_{x,v}$. The map $\tau_x$ is transformed to the map $\tau_x^\mu$ by multiplying weights $W_{iln}$ by a scalar $\mu > 0$, i.e., $\tau_x^\mu(s) = \tau_x(\mu s)$. $\mu$ is also called the neuron gain. The following lemma was proved by Li [Li92]. It is stated for maps $\tau_x$ and accommodated to our notation. It tells us under what conditions one may expect an attractive fixed point of $\tau_x^\mu$ to exist "near" a vertex $v \in \{0,1\}^L$.

Lemma 1: (Li, 1992) Suppose that for some input symbol $x$ of $\mathcal{M}$ there exists a vertex $v \in \mathcal{V}_{x,v} \cap \tau_x(\mathcal{V}_{x,v})$. Then there exists a neuron gain $\mu_0$ such that for all $\mu > \mu_0$ there is an attractive fixed point of $\tau_x^\mu$ in $\mathcal{V}_{x,v} \cap \tau_x(\mathcal{V}_{x,v})$.

It was also shown that as $\mu$ tends to infinity, the attractive fixed point tends to the vertex $v$. For two recurrent neurons, under certain conditions on weights $W_{iln}$, this is made more specific in the next section (Corollary 1).

Theorem 1: In addition to the assumptions in Observation 1, assume that there is an $x$-loop in a state $q$ of $\mathcal{M}$. Suppose there is a vertex $v \in \{0,1\}^L$ such that $(q)_{\mathcal{N}} \subseteq \mathcal{V}_{x,v}$ and $v \in \tau_x((q)_{\mathcal{N}})$. Then there exists a neuron gain $\mu_0$ such that for all $\mu > \mu_0$ there exists an attractive fixed point $S_* \in \mathcal{V}_{x,v} \cap \tau_x(\mathcal{V}_{x,v})$ of $\tau_x^\mu$.

^14 Recall that $\mathcal{A}_{\tau_x}((q)_{\mathcal{N}})$ is the absorbing region of $(q)_{\mathcal{N}}$ under the map $\tau_x$.
^15 $\tau_x(s)_i$ denotes the $i$th component of $\tau_x(s)$. When viewed as an iterative map, $\tau_x$ operates on $(0,1)^L$, but here we allow $s \in \Re^L$.
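Proof: From

$$\tau_x((q)_{\mathcal{N}}) \subseteq (q)_{\mathcal{N}} \subseteq \mathcal{V}_{x,v} \quad \text{and} \quad \tau_x((q)_{\mathcal{N}}) \subseteq \tau_x(\mathcal{V}_{x,v})$$

it follows that $\tau_x((q)_{\mathcal{N}}) \subseteq \mathcal{V}_{x,v} \cap \tau_x(\mathcal{V}_{x,v})$. Hence $v \in \mathcal{V}_{x,v} \cap \tau_x(\mathcal{V}_{x,v})$. Employing Lemma 1, the result follows immediately. $\Box$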
Loosely speaking, Theorem 1 says that if arbitrarily close to a vertex $v \in \{0,1\}^L$ there is a network state from $\tau_x((q)_{\mathcal{N}}) \subseteq (q)_{\mathcal{N}} \subseteq \mathcal{V}_{x,v}$, i.e., if network states that are equivalent to the state $q$ of $\mathcal{M}$ in which there is an $x$-loop are "accumulated" around the vertex $v$ within $\mathcal{V}_{x,v}$, then if the weights are "large enough," so that $\mu_0 < 1$, an attractive fixed point of $\tau_x$ exists in a neighborhood of $v$ (Figures 3 and 5).

As mentioned in the introduction, the approach presented in Casey [Cas95a] addresses representational issues concerning recurrent neural networks trained to act as regular language recognizers. Recurrent neural networks are assumed to operate in a noisy environment. Such an assumption can be supported by an argument that in any system implemented on a digital computer there is a finite amount of noise due to round-off errors, and "we are only interested in solutions that work in spite of round-off errors" [Cas95a]. Orbits of points under a map $f$ and attractive sets of $f$ are replaced by the notions of an $\epsilon$-pseudo-orbit of points under $f$ and an $\epsilon$-pseudo-attractor of $f$. These concepts correspond to the idea that instead of the precise trajectory of a point under a map, we should consider each sequence of points (pseudo-trajectory) whose distance from the precise trajectory is less than $\epsilon > 0$. It is proved that when there is a loop in the reduced acceptor of a regular language also recognized by the network, then there must be an $\epsilon$-pseudo-attractor (and hence an attractor) of the corresponding map in the network state space. The network accepts or rejects a string of symbols if the $\epsilon$-pseudo-orbits driven by the string end in subregions denoted as accept and reject regions respectively. It is assumed that the accept and reject regions are closed in the network state space.
6.1
Experiments
To see how loops and cycles of an FSM $\mathcal{M}$ are transformed into global dynamical properties of an RNN $\mathcal{N}$ that is able to exactly mimic $\mathcal{M}$, the following experiments were performed.

Consider again the FSM $\mathcal{M}$ presented in Figure 2. In Figure 3 it can be seen how the RNN $\mathcal{N}$ with two state neurons organizes its state space $(0,1)^2$ into three distinct connected regions $(A)_{\mathcal{N}}$, $(B)_{\mathcal{N}}$, and $(C)_{\mathcal{N}}$, corresponding to states $A$, $B$, and $C$ respectively. It was observed^16 that trajectories starting in $(A)_{\mathcal{N}}$ converged to a single attractive point placed inside $(A)_{\mathcal{N}}$. The same applies to the state $C$ and its corresponding region $(C)_{\mathcal{N}}$. So the $a$-loops in the states $A$ and $C$ induce attractive points of $\tau_a$ placed inside the corresponding regions of equivalent RNN states. Actually, this represents the only stable RNN representation of loops in $\mathcal{M}$ we have observed during our simulations. $(A)_{\mathcal{N}}$ and $(C)_{\mathcal{N}}$ are absorbing sets of themselves under the map $\tau_a$. Since the state $C$ is $a$-accessible from $B$, $(C)_{\mathcal{N}}$ is an absorbing set of $(B)_{\mathcal{N}}$ under $\tau_a$. Absorption diagrams of $(A)_{\mathcal{N}}$ and $(C)_{\mathcal{N}}$ under $\tau_a$ together with the attractive points are presented in Figure 5.

^16 As before, during the simulations, the network state space was "covered" with a regular grid of points, and only the orbits starting from these points were taken into account.

FIGURE 5. Absorption diagrams of $(A)_{\mathcal{N}}$ and $(C)_{\mathcal{N}}$ under the map $\tau_a$. Network states lying in the lightest region need one or no iteration step under the map $\tau_a$ to get to their absorption set. The more iteration steps that are needed, the darker the region is, with the exception of the region "close to" the "border line" between the two absorption diagrams. The region is light so that the border contours are clearly visible. The figure should be compared with the figure in the previous section showing $(A)_{\mathcal{N}}$ and $(C)_{\mathcal{N}}$. Note the two attractive points of $\tau_a$ placed inside $(A)_{\mathcal{N}}$ and $(C)_{\mathcal{N}}$, induced by $a$-loops in states $A$ and $C$ respectively.
FIGURE 6. Absorption diagram of $(C)_{\mathcal{N}}$ under the map $\tau_b$. Network states from the two white regions do not belong to the absorption region of $(C)_{\mathcal{N}}$. The figure should be compared with the figure in the previous section showing $(C)_{\mathcal{N}}$. Note the attractive point of $\tau_b$ placed inside $(C)_{\mathcal{N}}$ induced by the $b$-loop in the state $C$, as well as two periodic points of $\tau_b$ placed inside $(A)_{\mathcal{N}}$ and $(B)_{\mathcal{N}}$, constituting an attractive periodic orbit of period two. The orbit is induced by the $b$-cycle $\{A, B\}$.

If we presented $\mathcal{M}$ only with input symbol $b$, we would end up either in a $b$-cycle of length two involving states $A$ and $B$ or in a $b$-loop in the state $C$. When, during the experiments, we started in a state from $(C)_{\mathcal{N}}$ and presented to the network input only the code of the symbol $b$, the trajectory converged to an attractive point inside $(C)_{\mathcal{N}}$. An absorption diagram of $(C)_{\mathcal{N}}$ under $\tau_b$ together with the attractive point can be seen in Figure 6. On the other hand, when started in a state from $(A)_{\mathcal{N}}$, the trajectory jumped between the sets $(A)_{\mathcal{N}}$ and $(B)_{\mathcal{N}}$, converging to a periodic orbit of length two. Again, this was observed to be the typical stable RNN representation of a cycle corresponding to an input symbol of $\mathcal{M}$. The states constituting the orbit can be seen in Figure 6.

In the second experiment, the FSM $\mathcal{M}$ shown in Figure 7 was used to generate the training set for an RNN $\mathcal{N}$ with three state neurons. The $a$-cycle $\{A, B, C, D, E\}$ of length five induced an attractive periodic orbit of $\tau_a$ of period five. Projections of the orbit to a two-dimensional subspace $(0,1)^2$ of the network state space can be seen in Figures 8, 9, and 10. To illustrate the convergence of orbits, the orbits were plotted after 60, 100, and 300 pre-iterations (Figures 8, 9, and 10 respectively). No plotting occurred during the pre-iterations.

FIGURE 7. FSM $\mathcal{M}$ whose state transition diagram contains a cycle of length five.
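The numerical detection of attractive points and periodic orbits described above (discard a number of pre-iterations, then look at where the trajectory returns) can be sketched as follows. The two one-neuron maps `tau_a` and `tau_b` are hypothetical stand-ins chosen so that one has an attractive fixed point and the other an attractive orbit of period two; they are not the trained networks from the experiments, and the tolerance and iteration counts are arbitrary assumptions.

```python
import math


def sigmoid(p):
    return 1.0 / (1.0 + math.exp(-p))


def tau_a(state):
    """Hypothetical map with attractive fixed points (an 'a-loop')."""
    (s,) = state
    return (sigmoid(10.0 * (s - 0.5)),)


def tau_b(state):
    """Hypothetical map with an attractive period-two orbit (a 'b-cycle')."""
    (s,) = state
    return (sigmoid(-10.0 * (s - 0.5)),)


def distance(p, q):
    return max(abs(a - b) for a, b in zip(p, q))


def attractive_period(step, start, pre_iterations=300, max_period=16, tol=1e-9):
    """Run pre-iterations, then return the smallest period with which the
    trajectory returns to itself (within tol), together with the state reached.
    Returns (None, state) if no period up to max_period is found."""
    state = start
    for _ in range(pre_iterations):
        state = step(state)
    probe = state
    for period in range(1, max_period + 1):
        probe = step(probe)
        if distance(probe, state) < tol:
            return period, state
    return None, state
```

Under these assumptions, iterating `tau_a` from `(0.1,)` settles on a fixed point, while iterating `tau_b` from the same state settles on an orbit of period two. A `None` result only means that no short period was found within the tolerance; it does not by itself indicate chaotic behavior.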
7
RNN with Two State Neurons
Usually, studies of the asymptotic behavior of recurrent neural networks assume some form of structure in the weight matrix describing the connectivity pattern among recurrent neurons. For example, symmetric connectivity and absence of self-interactions enabled Hopfield [Hop84] to interpret the network as a physical system having energy minima in attractive fixed points of the network. These rather strict conditions were weakened in Casey [Cas95b], where more easily satisfied conditions are formulated. Blum and Wang [BW92] globally analyze networks with asymmetrical connectivity patterns of special types. In the case of two recurrent neurons with sigmoidal activation function $g$, they give results for weight matrices with diagonal elements equal to zero.^17 Recently, Jin, Nikiforuk, and Gupta [JNG94] reported new results on absolute stability for a rather general class of recurrent neural networks. Conditions under which all fixed points of the network are attractive were determined by the weight matrix of the network.

The purpose of this section is to investigate the position and stability types of fixed points of maps $\tau_x$ under certain assumptions concerning the signs and magnitudes of weights $W_{iln}$. The iterative map under consideration can be written as follows:

$$(u_{n+1}, v_{n+1}) = (g(\alpha u_n + \beta v_n), g(\gamma u_n + \delta v_n)), \qquad (14)$$

where $(u_n, v_n) \in (0,1)^2$ is the state of the recurrent network with two state neurons at the time step $n$, and $\alpha, \delta$ and $\beta, \gamma$ are positive and negative real coefficients respectively. Thus we investigate the case when the two recurrent neurons are self-exciting ($\alpha, \delta > 0$), with the tendency to inhibit each other ($\beta, \gamma < 0$). For $c > 4$, define

$$\Delta(c) = \frac{1}{2}\sqrt{1 - \frac{4}{c}}.$$

^17 In such a case the recurrent network is shown to have only one fixed point and no "genuine" periodic orbits (of period greater than one).

FIGURE 8. Convergence of orbits of the map $\tau_a$ to an attractive periodic orbit of period five. The attractive periodic orbit constitutes a stable representation of the $a$-cycle in the FSM $\mathcal{M}$ presented in the previous figure. The orbits were plotted after 60 pre-iterations. No plotting occurred during the pre-iterations. The RNN has three state neurons. Shown are the projections of the orbits to a two-dimensional subspace $(0,1)^2$ corresponding to activations of two of the recurrent neurons.

FIGURE 9. Convergence of orbits of the map $\tau_a$ to an attractive periodic orbit of period five shown in the previous figure. This time, the number of pre-iterations is 100. No plotting occurred during the pre-iterations.

FIGURE 10. Attractive periodic orbit of period five of the map $\tau_a$, the convergence to which is illustrated in the last two figures. The attractive orbit is approximated by plotting the trajectories of $\tau_a$ after the preceding 300 non-plot iterations.
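In the following it will be shown how the network state space $(0,1)^2$ can be partitioned into regions according to the stability types of fixed points of (14) found in the regions. Regions

$$\left(0, \tfrac{1}{2} - \Delta(\alpha)\right) \times \left(0, \tfrac{1}{2} - \Delta(\delta)\right),$$

$$\left(\tfrac{1}{2} - \Delta(\alpha), \tfrac{1}{2}\right] \times \left(0, \tfrac{1}{2} - \Delta(\delta)\right) \cup \left(0, \tfrac{1}{2} - \Delta(\alpha)\right) \times \left(\tfrac{1}{2} - \Delta(\delta), \tfrac{1}{2}\right]$$

and

$$\left(\tfrac{1}{2} - \Delta(\alpha), \tfrac{1}{2}\right] \times \left(\tfrac{1}{2} - \Delta(\delta), \tfrac{1}{2}\right]$$

are denoted by $R_{00}^A$, $R_{00}^S$, and $R_{00}^R$ respectively. Regions symmetrical to $R_{00}^A$, $R_{00}^S$, and $R_{00}^R$ with respect to the line $u = 1/2$ are denoted by $R_{10}^A$, $R_{10}^S$, and $R_{10}^R$, respectively:

$$R_{10}^A = \left(\tfrac{1}{2} + \Delta(\alpha), 1\right) \times \left(0, \tfrac{1}{2} - \Delta(\delta)\right),$$

$$R_{10}^S = \left[\tfrac{1}{2}, \tfrac{1}{2} + \Delta(\alpha)\right) \times \left(0, \tfrac{1}{2} - \Delta(\delta)\right) \cup \left(\tfrac{1}{2} + \Delta(\alpha), 1\right) \times \left(\tfrac{1}{2} - \Delta(\delta), \tfrac{1}{2}\right],$$

$$R_{10}^R = \left[\tfrac{1}{2}, \tfrac{1}{2} + \Delta(\alpha)\right) \times \left(\tfrac{1}{2} - \Delta(\delta), \tfrac{1}{2}\right].$$

FIGURE 11. Partitioning of RNN state space according to stability types of fixed points of maps $\tau_x$.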
Similarly, let $R_{01}^A$, $R_{01}^S$, and $R_{01}^R$ denote the regions symmetrical to $R_{00}^A$, $R_{00}^S$, and $R_{00}^R$ with respect to the line $v = 1/2$. Finally, $R_{11}^A$, $R_{11}^S$, and $R_{11}^R$ denote regions that are symmetrical to $R_{01}^A$, $R_{01}^S$, and $R_{01}^R$ with respect to the line $u = 1/2$ (Figure 11).

Theorem 2: Suppose $\alpha > 4$, $\beta < 0$, $\gamma < 0$, $\delta > 4$, $\alpha > |\beta|$, $\delta > |\gamma|$. Then the following can be said about the fixed points of (14):

• attractive and repulsive points can lie only in $\bigcup_{i \in I} R_i^A$ and $\bigcup_{i \in I} R_i^R$ respectively. $I$ is the index set $I = \{00, 10, 01, 11\}$. If $\max\{\alpha(\delta - 4), \delta(\alpha - 4)\} < \beta\gamma$, there are no repellors.
• all fixed points in $\bigcup_{i \in I} R_i^S$ are saddle points.^18
Proof: Any fixed point $(u, v)$ of (14) satisfies

$$(u, v) = (g(\alpha u + \beta v), g(\gamma u + \delta v)). \qquad (15)$$

The Jacobian $J(u, v)$ of (14) in $(u, v)$ is given by

$$J(u, v) = \begin{pmatrix} \alpha G_1(u, v) & \beta G_1(u, v) \\ \gamma G_2(u, v) & \delta G_2(u, v) \end{pmatrix},$$

where $G_1(u, v) = g'(\alpha u + \beta v)$ and $G_2(u, v) = g'(\gamma u + \delta v)$. Since $g'(p) = g(p)(1 - g(p))$, considering (15) we have

$$(G_1(u, v), G_2(u, v)) = (u(1 - u), v(1 - v)) = \phi(u, v). \qquad (16)$$

The eigenvalues of $J$ are^19

$$\lambda_{1,2} = \frac{\alpha G_1 + \delta G_2 \pm \sqrt{D}}{2},$$

where $D = (\alpha G_1 - \delta G_2)^2 + 4 G_1 G_2 \beta\gamma$. $D$ is always positive and so is $\alpha G_1 + \delta G_2$. It follows that to identify possible values of $G_1$ and $G_2$ such that $|\lambda_{1,2}| < 1$, it is sufficient to solve the inequality $\alpha G_1 + \delta G_2 + \sqrt{D} < 2$, or equivalently,

$$2 - \alpha G_1 - \delta G_2 > \sqrt{D}. \qquad (17)$$

Consider only $G_1, G_2$ such that $\alpha G_1 + \delta G_2 < 2$, that is, $(G_1, G_2)$ lies under the line $\rho : \alpha G_1 + \delta G_2 = 2$. All $(G_1, G_2)$ above $\rho$ lead to at least one eigenvalue of $J$ greater than 1. Squaring both sides of (17), we arrive at

$$(\alpha\delta - \beta\gamma) G_1 G_2 - \alpha G_1 - \delta G_2 > -1. \qquad (18)$$
The "border" curve $\kappa : (\alpha\delta - \beta\gamma) G_1 G_2 - \alpha G_1 - \delta G_2 = -1$ in $(G_1, G_2)$-space is a hyperbola $G_2 = \kappa(G_1) = A[1 + B/(G_1 - C)]$, where

$$A = \frac{1}{\delta - \beta\gamma/\alpha}, \qquad B = C - \frac{1}{\alpha}, \qquad C = \frac{1}{\alpha - \beta\gamma/\delta}.$$

Since $0 < \delta - \beta\gamma/\alpha < \delta$ and $0 < \alpha - \beta\gamma/\delta < \alpha$, it follows that $A > 1/\delta$, $C > 1/\alpha$ and $B > 0$. $\kappa(1/\alpha) = 0$, $\kappa(0) = 1/\delta$, and $(G_1, G_2)$ satisfying (18) lie under the "left branch" and above the "right branch" of $\kappa$ (see Figure 12). It is easy to see that since we are confined to the space below the line $\rho$, only $(G_1, G_2)$ under the left branch of $\kappa$ will be considered. Indeed, $\rho$ is a decreasing line going through the point $(C, P)$ with $P = (2 - \alpha C)/\delta$, and $A - P = 2(A - 1/\delta) > 0$, so it never intersects the right branch of $\kappa$.

^18 Note that this does not exclude the existence of saddle fixed points in other regions.
^19 To simplify the notation, the identification $(u, v)$ of a fixed point in which (14) is linearized is omitted.

FIGURE 12. An illustration for the proof of Theorem 2. $(G_1, G_2)$-space is the space of derivatives of the sigmoid transfer functions with respect to the weighted sum of neurons' inputs. All $(G_1, G_2) \in (0, 1/4]^2$ below the left branch of $\kappa$ correspond to the attractive fixed points.

A necessary (but not sufficient) condition for a fixed point $(u, v)$ of (14) to be attractive is that the corresponding $(G_1, G_2) = \phi(u, v) \in (0, 1/4]^2$ lie in $(0, 1/\alpha) \times (0, 1/\delta)$, where the map $\phi$ is defined by (16). For each $(G_1, G_2) \in (0, 1/4]^2$, under $\phi$, there are four preimages:

$$(u, v) = \phi^{-1}(G_1, G_2) = \left\{ \left( \tfrac{1}{2} \pm \Delta\!\left(\tfrac{1}{G_1}\right), \tfrac{1}{2} \pm \Delta\!\left(\tfrac{1}{G_2}\right) \right) \right\}. \qquad (19)$$

The set of preimages of $(0, 1/\alpha) \times (0, 1/\delta)$ is the set $\bigcup_{i \in I} R_i^A$, $I = \{00, 10, 01, 11\}$. A fixed point $(u, v)$ of (14) is a saddle if $|\lambda_2| < 1$ and $|\lambda_1| = \lambda_1 > 1$. Since $\alpha\delta > \beta\gamma$,

$$0 < \sqrt{(\alpha G_1 + \delta G_2)^2 - 4 G_1 G_2 (\alpha\delta - \beta\gamma)} = \sqrt{D} < \alpha G_1 + \delta G_2.$$

It follows that if $\alpha G_1 + \delta G_2 < 2$, i.e., $(G_1, G_2)$ lies under the line $\rho$, $0 < \alpha G_1 + \delta G_2 - \sqrt{D} < 2$ holds, and $0 < \lambda_2 < 1$. For $(G_1, G_2)$ above the line $\rho$, i.e., $\alpha G_1 + \delta G_2 > 2$, we solve the inequality $\alpha G_1 + \delta G_2 - 2 < \sqrt{D}$, which leads to the "border" curve $G_2 = \kappa(G_1)$ we have already described. This time, only $(G_1, G_2)$ "between" the two branches of the hyperbola $\kappa$ are considered. It can be seen that in all fixed points $(u, v)$ of (14) with

$$\phi(u, v) \in \left(0, \tfrac{1}{4}\right] \times \left(0, \min\left\{A, \tfrac{1}{4}\right\}\right) \cup \left(0, \min\left\{C, \tfrac{1}{4}\right\}\right) \times \left(0, \tfrac{1}{4}\right]$$

the eigenvalue $\lambda_2 > 0$ is less than 1. This is certainly true for all $(u, v)$ such that $\phi(u, v) \in (0, 1/4] \times (0, 1/\delta) \cup (0, 1/\alpha) \times (0, 1/4]$. In particular, the preimages of $(G_1, G_2) \in (1/\alpha, 1/4] \times (0, 1/\delta) \cup (0, 1/\alpha) \times (1/\delta, 1/4]$ under $\phi$ define the region $\bigcup_{i \in I} R_i^S$ where only saddle fixed points of (14) can lie. Fixed points $(u, v)$ whose images under $\phi$ lie above the right branch of $\kappa$ are repellors. No $(G_1, G_2)$ can lie in that region if $C, A > 1/4$, that is, if $\delta(\alpha - 4) < \beta\gamma$ and $\alpha(\delta - 4) < \beta\gamma$, which is equivalent to $\max\{\alpha(\delta - 4), \delta(\alpha - 4)\} < \beta\gamma$. $\Box$

The condition $\max\{\alpha(\delta - 4), \delta(\alpha - 4)\} < \beta\gamma$ implies that when self-excitations of recurrent neurons are not significantly higher than their mutual inhibition, there are no repulsive fixed points of (14). As self-excitations $\alpha$ and $\delta$ grow, stable fixed points of (14) move closer towards $\{0,1\}^2$. More precisely:

Corollary 1: Same assumptions as in Theorem 2. All attractive fixed points of (14) lie in the $\epsilon$-neighborhood of vertices of the unit square, where

$$\epsilon = \sqrt{(0.5 - \Delta(\alpha))^2 + (0.5 - \Delta(\delta))^2}.$$
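Corollary 1 can be illustrated numerically. The following sketch iterates the map (14) from a coarse grid of initial states, collects the distinct attractive fixed points it converges to, and compares their distance from the nearest vertex with $\epsilon$. The coefficient values $\alpha = \delta = 6$, $\beta = \gamma = -3$ are an arbitrary choice satisfying the assumptions of Theorem 2, and the grid size, iteration count, and tolerances are likewise assumptions made only for this illustration.

```python
import math


def sigmoid(p):
    return 1.0 / (1.0 + math.exp(-p))


def delta(c):
    """Delta(c) = (1/2) * sqrt(1 - 4/c), defined for c > 4."""
    return 0.5 * math.sqrt(1.0 - 4.0 / c)


def step(state, alpha, beta, gamma, delta_w):
    """One iteration of the two-neuron map (14)."""
    u, v = state
    return (sigmoid(alpha * u + beta * v), sigmoid(gamma * u + delta_w * v))


def attractive_fixed_points(alpha, beta, gamma, delta_w,
                            grid=20, iterations=300, tol=1e-9):
    """Iterate from a grid of start states; keep distinct limit points."""
    found = []
    for i in range(grid):
        for j in range(grid):
            state = ((i + 0.5) / grid, (j + 0.5) / grid)
            for _ in range(iterations):
                state = step(state, alpha, beta, gamma, delta_w)
            nxt = step(state, alpha, beta, gamma, delta_w)
            converged = max(abs(a - b) for a, b in zip(state, nxt)) < tol
            is_new = all(max(abs(a - b) for a, b in zip(state, p)) > 1e-6
                         for p in found)
            if converged and is_new:
                found.append(state)
    return found


def nearest_vertex(point):
    return tuple(round(c) for c in point)


def distance_to_vertex(point):
    vertex = nearest_vertex(point)
    return math.hypot(point[0] - vertex[0], point[1] - vertex[1])
```

For the chosen coefficients the iteration finds attractive fixed points near the vertices $(1,1)$, $(1,0)$, and $(0,1)$, each well inside the $\epsilon$-neighborhood of its vertex. Forward iteration can only reveal attractive fixed points, so saddles and repellors are invisible to this sketch.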
The tendency of attractive fixed points in discrete-time RNNs with exclusively self-exciting recurrent neurons to move towards saturation values as neural gain grows is also discussed in Hirsch [Hir94].

So far, we have identified the areas of the network state space $(0,1)^2$ where (under some assumptions on weights) fixed points of (14) of particular stability types can lie. In the following, it will be shown that those regions correspond to monotonicity intervals of functions defining fixed points of (14). The reasoning about the stability type of a fixed point can be based on the knowledge of where the functions intersect. Recall that any fixed point $(u_*, v_*)$ of (14) satisfies

$$(u_*, v_*) = (g(\alpha u_* + \beta v_*), g(\gamma u_* + \delta v_*)),$$

or equivalently, $(u_*, v_*)$ lies on the intersection of two curves $v = f_{\alpha,\beta}(u)$, $u = f_{\delta,\gamma}(v)$, where $f_{c_1,c_2} : (0,1) \to \Re$,

$$f_{c_1,c_2}(\ell) = -\frac{c_1}{c_2}\,\ell + \frac{1}{c_2} \ln\frac{\ell}{1 - \ell}. \qquad (20)$$

$\lim_{\ell \to 0^+} f_{c_1,c_2}(\ell) = \infty$, $\lim_{\ell \to 1^-} f_{c_1,c_2}(\ell) = -\infty$.^20 $f_{c_1,c_2}$ is convex and concave on $(0, 0.5)$ and $(0.5, 1)$, respectively. If $c_1 \le 4$, $f_{c_1,c_2}$ is nonincreasing; otherwise, it is decreasing on $(0, 0.5 - \Delta(c_1)) \cup (0.5 + \Delta(c_1), 1)$ and increasing on $(0.5 - \Delta(c_1), 0.5 + \Delta(c_1))$. The graph of $f_{c_1,c_2}(\ell)$ is presented in Figure 13. The "bended" graph of $f_{c_1,c_2}$ for $c_1 > 4$ gives rise to a potentially complicated intersection pattern of $f_{\alpha,\beta}(u)$ and $f_{\delta,\gamma}(v)$. In the following, we shall consider only the case $c_1 > |c_2|$, since it is sufficient to explain some interesting features of the training process observed in our experiments. Note that $c_1 > |c_2|$ means that for both neurons, the self-excitation is higher than the inhibition from the other neuron.

^20 Note that since $\alpha, \delta$ and $\beta, \gamma$ are assumed to be positive and negative respectively, we have $c_1 > 0$ and $c_2 < 0$.

FIGURE 13. Graph of the function $f_{c_1,c_2}(\ell)$ when $c_2 < 0$. Solid and dashed lines represent the cases $0 < c_1 < 4$ and $c_1 > 4$ respectively. For $c_1 > 4$, the function "bends" and becomes increasing on $(1/2 - \Delta(c_1), 1/2 + \Delta(c_1))$.

Lemma 2: Assume $\alpha > 0$, $\beta < 0$, $\gamma < 0$, $\delta > 0$. If $\alpha > |\beta|$ and $\delta > |\gamma|$, then $f_{\alpha,\beta}(u)$ and $f_{\delta,\gamma}(v)$ do not intersect in $(0, 0.5)^2$.

Proof: Assume that both $f_{\alpha,\beta}(u)$ and $f_{\delta,\gamma}(v)$ lie in $(0, 0.5)^2$; otherwise, the result follows trivially. For $u \in (0, 0.5)$, both $\ln(u/(1 - u))/\beta$ and $-\alpha u/\beta$ are positive. It follows that in $(0, 0.5)^2$, $f_{\alpha,\beta}(u)$ lies above the line $v = \alpha u/|\beta|$. Similarly, in $(0, 0.5)^2$, $f_{\delta,\gamma}(v)$ lies above the line $u = \delta v/|\gamma|$. In terms of the coordinate system $(u, v)$, this can be restated as follows: in $(0, 0.5)^2$, the graph of $f_{\alpha,\beta}$ lies above the line $v = \alpha u/|\beta|$, while the graph of $f_{\delta,\gamma}$ lies below the line $v = |\gamma| u/\delta$. Since $|\gamma|/\delta < 1 < \alpha/|\beta|$, $f_{\alpha,\beta}(u)$ and $f_{\delta,\gamma}(v)$ do not intersect in $(0, 0.5)^2$. $\Box$

The correspondence between regions $R_{ij}^Q$, $i, j = 0, 1$; $Q = A, S, R$; and the regions of monotonicity of $f_{\alpha,\beta}(u)$ and $f_{\delta,\gamma}(v)$ enables us to interpret the training process as a process of "shaping" $f_{\alpha,\beta}$ and $f_{\delta,\gamma}$ so that the desired behavior of (14), as prescribed by the training set, is achieved. Denote the set $\{(u, f_{\alpha,\beta}(u)) \mid u \in (0, 0.5 - \Delta(\alpha))\}$ of points lying on the "first decreasing branch" of $f_{\alpha,\beta}(u)$ by $f_{\alpha,\beta}^{0-}$. Analogously, the set of points $\{(u, f_{\alpha,\beta}(u)) \mid u \in (0.5 + \Delta(\alpha), 1)\}$ in the "second decreasing branch" of $f_{\alpha,\beta}(u)$ is denoted by $f_{\alpha,\beta}^{1-}$. Finally, let $f_{\alpha,\beta}^{+}$ denote the set of points $\{(u, f_{\alpha,\beta}(u)) \mid u \in (0.5 - \Delta(\alpha), 0.5 + \Delta(\alpha))\}$ on the increasing part of $f_{\alpha,\beta}(u)$. Similarly, $f_{\delta,\gamma}^{0-}$, $f_{\delta,\gamma}^{1-}$ and $f_{\delta,\gamma}^{+}$ are used to denote the sets $\{(f_{\delta,\gamma}(v), v) \mid v \in (0, 0.5 - \Delta(\delta))\}$, $\{(f_{\delta,\gamma}(v), v) \mid v \in (0.5 + \Delta(\delta), 1)\}$ and $\{(f_{\delta,\gamma}(v), v) \mid v \in (0.5 - \Delta(\delta), 0.5 + \Delta(\delta))\}$, respectively. Using Theorem 2 and Lemma 2, we state the following corollary:

Corollary 2: Same assumptions as in Theorem 2. Attractive fixed points of (14) can lie only on the intersection of the decreasing parts of $f_{\alpha,\beta}$ and $f_{\delta,\gamma}$. Whenever the increasing part of $f_{\alpha,\beta}$ intersects with a decreasing part of $f_{\delta,\gamma}$ (or vice versa), it corresponds to a saddle point of (14). In particular, all attractive fixed points of (14) are from $f_{\alpha,\beta}^{0-} \cap f_{\delta,\gamma}^{1-}$, $f_{\alpha,\beta}^{1-} \cap f_{\delta,\gamma}^{0-}$, or $f_{\alpha,\beta}^{1-} \cap f_{\delta,\gamma}^{1-}$. Every point from $f_{\alpha,\beta}^{+} \cap f_{\delta,\gamma}^{1-}$ or $f_{\alpha,\beta}^{1-} \cap f_{\delta,\gamma}^{+}$ is a saddle point of (14).
Tiňo, Horne, Giles, and Collingwood
The usual scenario of the creation of a new attractive fixed point of (14) is that typical of a saddle-node bifurcation, in which a pair of fixed points, one attractive and one saddle, is created. Attractive fixed points disappear in a reverse manner: an attractive point coalesces with a saddle, and they are annihilated. This is illustrated in Figure 14. $f_{\delta,\gamma}(v)$, shown as a dashed curve, intersects $f_{\alpha,\beta}(u)$ in three points. By increasing $\delta$, $f_{\delta,\gamma}$ bends further (solid curve) and intersects with $f_{\alpha,\beta}$ in five points. (At the same time, $|\gamma|$ also has to be appropriately increased to compensate for the increase in $\delta$, so that the "bended" part of $f_{\delta,\gamma}$ does not move radically to higher values of $u$.) Saddle and attractive points are marked with squares and circles, respectively. Note that as $\delta$ increases, attractive fixed points move closer to the vertices $\{0,1\}^2$. A similar approach to determining the number and stability types of fixed points of the underlying dynamical systems in continuous-time recurrent neural networks can be found in Beer [Bee94].

FIGURE 14. Geometrical illustration of saddle-node bifurcation in RNN with two state neurons.
FIGURE 15. FSM M with four a-loops and "transition" input symbol b. Edge labels recovered from the figure (input|output): a|1, a|2, a|3, a|4, b|1.
8 Experiments: Learning Loops of FSM
An RNN with two state neurons was trained with the FSM M presented in Figure 15. In each of its four states there is an a-loop. The input symbol b causes subsequent transitions between states up to the "trap" state D. The training set representing M was constructed as follows: Transitions to states B, C, and D from the initial state A are represented by one, two, and three consecutive b's, respectively. Apart from transitions, each a-loop is represented by strings of consecutive a's up to length 5. The b-loop in the state D is represented by a string of 5 consecutive b's. To each input string $w$, its corresponding output string $\lambda^{+}(A, w)$ is determined.

During training, after each epoch, attractive sets of $\tau_a$ were numerically detected. The evolution of the position and number of attractive fixed points of $\tau_a$ in $(0,1)^2$ can be seen in Figure 16. Near the points the corresponding epoch numbers are shown. At the beginning, there is only one fixed point of $\tau_a$. A bifurcation during the 59th epoch produces two attractive fixed points. From the 138th epoch to the 321st epoch there are three attractive fixed points and two saddle points of $\tau_a$. These are determined by the intersection of the corresponding lines $f_{\alpha_a,\beta_a}$ and $f_{\delta_a,\gamma_a}$, where $\alpha_a$, $\beta_a$, $\gamma_a$, and $\delta_a$ are coefficients of the map $\tau_a$ as in (14). The episode of existence of the attractive fixed point $f^{1-}_{\alpha_a,\beta_a} \cap f^{1-}_{\delta_a,\gamma_a}$ begins when $f_{\alpha_a,\beta_a}$ is "bended" enough so that $f^{1-}_{\delta_a,\gamma_a}$ intersects with both increasing and decreasing parts $f^{+}_{\alpha_a,\beta_a}$ and $f^{1-}_{\alpha_a,\beta_a}$, respectively. At the same time, in order for the intersection $f^{1-}_{\alpha_a,\beta_a} \cap f^{+}_{\delta_a,\gamma_a}$ to exist, $f_{\delta_a,\gamma_a}$ needs also to be sufficiently "bended" (Figure 17). The degree to which $f_{\alpha_a,\beta_a}$ and $f_{\delta_a,\gamma_a}$ are "bended" is primarily controlled by $\alpha_a$ and $\delta_a$, respectively, while the vertical positions of bended parts are mainly determined by $\beta_a$ and $\gamma_a$, respectively.

During the 322nd epoch, the attractive fixed point $f^{1-}_{\alpha_a,\beta_a} \cap f^{1-}_{\delta_a,\gamma_a}$ together with the saddle point $f^{1-}_{\alpha_a,\beta_a} \cap f^{+}_{\delta_a,\gamma_a}$ disappear because the increase in $|\gamma_a|$ pushes the "bended" part of $f_{\delta_a,\gamma_a}$ inside the state space $(0,1)^2$ (Figure 18).
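The fixed-point counts quoted above can be explored numerically. The following is a minimal sketch, not the authors' detection procedure. It assumes that the map (14) has the bias-free form $(u, v) \mapsto (g(\alpha u + \beta v), g(\gamma u + \delta v))$ with the logistic sigmoid $g$, which is the form suggested by the proof of Lemma 2 but is not restated in this section. Fixed points are located by eliminating $v = f_{\alpha,\beta}(u)$ and bisecting on sign changes of the remaining scalar residual; stability is read from the eigenvalues of the Jacobian. All function names and the grid size are hypothetical choices.

```python
import numpy as np


def g(x):
    """Logistic sigmoid (assumed activation function)."""
    return 1.0 / (1.0 + np.exp(-x))


def fixed_points(alpha, beta, gamma, delta, n_grid=200_001):
    """Fixed points of the assumed two-neuron map, found by scanning u in (0, 1)."""

    def residual(u):
        # v solves the first fixed-point equation u = g(alpha*u + beta*v).
        v = (np.log(u / (1.0 - u)) - alpha * u) / beta
        # The second equation v = g(gamma*u + delta*v) gives the scalar residual.
        return g(gamma * u + delta * v) - v, v

    grid = np.linspace(1e-6, 1.0 - 1e-6, n_grid)
    r, _ = residual(grid)
    points = []
    for k in np.nonzero(np.sign(r[:-1]) * np.sign(r[1:]) < 0)[0]:
        lo, hi = grid[k], grid[k + 1]
        for _ in range(60):  # plain bisection on the bracketing interval
            mid = 0.5 * (lo + hi)
            if np.sign(residual(mid)[0]) == np.sign(residual(lo)[0]):
                lo = mid
            else:
                hi = mid
        u_star = 0.5 * (lo + hi)
        points.append((float(u_star), float(residual(u_star)[1])))
    return points


def classify(point, alpha, beta, gamma, delta):
    """Label a fixed point by the moduli of the Jacobian eigenvalues."""
    u, v = point
    jac = np.array([[alpha * u * (1 - u), beta * u * (1 - u)],
                    [gamma * v * (1 - v), delta * v * (1 - v)]])
    moduli = np.abs(np.linalg.eigvals(jac))
    if np.all(moduli < 1.0):
        return "attractive"
    if np.all(moduli > 1.0):
        return "repelling"
    return "saddle"


if __name__ == "__main__":
    coeffs = (5.21, -2.58, -2.63, 5.23)  # values listed in the caption of Figure 17
    for p in fixed_points(*coeffs):
        print(p, classify(p, *coeffs))
```

Under these assumptions, the labels printed for the epoch-150 coefficients can be compared with the three attractive fixed points and two saddle points reported for epochs 138 to 321.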
FIGURE 16. Evolution of position of attractive sets of $\tau_a$ during RNN training on FSM M (two state neurons).
The training error was 0.08, yet the only attractive sets of $\tau_a$ that were detected were two attractive fixed points $S_A$ and $S_D$ near vertices (0,1) and (1,0), corresponding to a-loops in states A and D, respectively. Starting in a small neighborhood of $S_A$ and $S_D$, upon repeated presentation of input a, the decoded network outputs are 1 and 4, with trajectories of $\tau_a$ approaching $S_A$ and $S_D$, respectively. There is no stable representation of the a-loops in states B and C; i.e., there are no positively invariant sets of $\tau_a$ leading to the network outputs 2 and 3, respectively, when input a is presented to the network. However, the net is able to simulate the training set perfectly. It follows that after it is reset (with possibly repeated presentation of the "reset" input #) and presented with b, when five consecutive a's arrive, the decoded output will be five consecutive 2's. Hence, the network must have developed a mechanism for acting as if the a-loops in B and C were represented in a stable manner, at least for strings having no more than five consecutive a's. It turns out that the underlying mechanism for pretending that there are stable representations of a-loops for short input strings involves a behavior of trajectories starting "near" the stable manifold $W^s$ of the saddle fixed point $S_S$ lying "between" attractive points $S_A$ and $S_D$, with $W^s$ constituting the border of regions of attraction of $S_A$ and $S_D$.

FIGURE 17. $f_{\alpha_a,\beta_a}(u)$ and $f_{\delta_a,\gamma_a}(v)$ after the 150th training epoch. Coefficients of the map $\tau_a$ are $\alpha_a = 5.21$, $\beta_a = -2.58$, $\gamma_a = -2.63$, $\delta_a = 5.23$.
Consider a point $S$ "near" $W^s$. Due to the continuity of $\tau_a$, the orbit of $S$ under $\tau_a$ first moves towards $S_S$ along $W^s$ and then away from $S_S$ along a branch of the unstable manifold $W^u$ of $S_S$, gradually approaching one of the attractive points $S_A$, $S_D$. To which of the two points the trajectory actually converges is determined by the "side" of $W^s$ on which the initial point $S$ lies. Assume that the trajectory of $S$ converges to $S_A$. If we slightly displace $S$ into $S'$ on "the other side" of the curve $W^s$, the trajectories of $S$ and $S'$ move towards $S_S$ close to each other, but as they approach $S_S$, the trajectory of $S'$ follows the other branch of $W^u$ towards $S_D$ (see Figure 19). As we move the starting point $S$ towards $S_A$ and $S_D$, the trajectories less and less follow the pattern described above; they move towards $S_A$ and $S_D$ in a straightforward manner and approach a vicinity of $S_A$ and $S_D$, respectively, much faster than trajectories starting "near" $W^s$. (Due to the coefficients of $\tau_a$, the eigenvalues of its Jacobian in every point from $(0,1)^2$ are real, thus implying an absence of rotation in neighborhoods of fixed points.) Hence, the network is able to "cheat" by pretending stable behavior as described by the a-loop in the state B because it takes advantage of different convergence rates of orbits starting near $W^s$ and $S_D$. The decoded output of the net with input a and a state near $S_D$ is 4 (region $\mathcal{D}$), while for states involving first several steps in trajectories starting near $W^s$, the output is 2 (region $\mathcal{B}$). An analogous statement can be made about trajectories starting near $S_A$ and $W^s$ and regions $\mathcal{A}$ and $\mathcal{C}$, respectively.

Most of the time towards the end of the learning session was spent on learning the output function $\nu_a(S) = \nu(S, a)$ in closely neighboring regions of $\mathcal{B}$ and $\mathcal{C}$ so that the outputs for states from $\mathcal{B}$ and $\mathcal{C}$ are 2 and 3, respectively (see Figures 20, 21). The map $\tau_\#$ associated with the "reset" input symbol # has one attractive fixed point in the region $\mathcal{A}$. Under the "reset" map $\tau_\#$, trajectories of network states $S \in (0,1)^2$ quickly approach region $\mathcal{A}$, thus preparing the ground for processing of a new input word. The key role, however, is played by the transfer function $\tau_b$. It simulates transition between states with a-loops in M. Starting in $S \in \mathcal{A}$, $\tau_b(S) \in \mathcal{B}$ and $\tau_b^2(S) \in \mathcal{C}$ lie near $W^s$, and the behavior of $\tau_a$ in $\mathcal{B}$ and $\mathcal{C}$ appears to be stable for several iterations. Upon repeated presentation of a, $\tau_b^3(S) \in \mathcal{D}$ converges to $S_D$ with network output 4. The delicate role of $\tau_b$, responsible for transitions $\mathcal{A} \to \mathcal{B} \to \mathcal{C} \to \mathcal{D}$ with jumping on the "appropriate" sides of $W^s$ while staying close to $W^s$, together with different convergence rates of orbits under $\tau_a$ starting close to $W^s$ and near $S_A$, $S_D$, are the principal tools enabling the net to behave nicely for testing strings of smaller length, although it generalizes poorly on strings with many consecutive a's after b or bb. In particular, the outputs of the net for input strings $ba^n$ and $bba^m$ are consistent with the training set for $n \le 8$ and $m \le 10$. As further a's keep coming, trajectories of $\tau_a$ move away from $\mathcal{B}$ and $\mathcal{C}$ towards $S_D$ and $S_A$, respectively.

FIGURE 18. $f_{\alpha_a,\beta_a}(u)$ and $f_{\delta_a,\gamma_a}(v)$ after the 1000th training epoch. Coefficients of the map $\tau_a$ are $\alpha_a = 8.61$, $\beta_a = -3.96$, $\gamma_a = -3.08$, $\delta_a = 5.17$.

FIGURE 19. Illustration of a mechanism that enables RNN to "pretend" stable representation of loops in M for short input strings.

FIGURE 20. The map $(\nu_a)_2$ representing the output of the second output neuron that corresponds to the output symbol 2. Note the sharp activity change along the border of regions of attraction of $S_A$ and $S_D$.

FIGURE 21. The map $(\nu_a)_3$ representing the output of the third output neuron that corresponds to the output symbol 3. A sharp activity change along the border of regions of attraction of $S_A$ and $S_D$ is clearly visible.

To visualize the process of state degradation upon repeated presentation of input a, a state degradation diagram for input a is constructed as follows ($M_a$ denotes the set of states of M in which there is an a-loop):
• Construct a finite vocabulary $\Gamma$ of short distinguishing words for $M_a$, such that $\Gamma$ does not contain a word $u a^i v$, $i > 2$, where $u$ is leading to a state of M in which there is an a-loop. To each state $q$ of $M_a$ associate a minimal input word $m_q$ leading to $q$.
• For each $i \in \{1, 2, \ldots, N_{\max}\}$
  - For each $w \in \Gamma$
    * For each state $q \in M_a$
      • present the reset network with $m_q a^i$ and then
      • present the network with $w$ and check whether the net output equals $\lambda^{+}(q, w)$. If not, check whether there is a state $p$ of M such that the network output equals $\lambda^{+}(p, w)$. If so, draw an arrow in a diagram from $q$ to $p$.
The state degradation diagram for input a is presented in Figure 22. Note that when only short input strings are presented to the network and quantization of network state space individually captures regions $\mathcal{A}$, $\mathcal{B}$, $\mathcal{C}$, $\mathcal{D}$, a correct state transition diagram can be obtained, even though on longer input strings the net generalizes poorly.
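The construction of the state degradation diagram can be sketched as follows. This is an illustrative outline only: the Mealy machine `FSM` is a hypothetical rendering of the four-state machine of Figure 15 (the outputs on b-transitions are an assumption), `toy_net` is a stand-in for a trained network rather than an actual RNN, and its degradation thresholds are chosen to loosely mirror the behavior described in the text.

```python
def lambda_plus(fsm, q, word):
    """Output string of the Mealy machine `fsm` on `word`, starting in state q."""
    outputs = []
    for x in word:
        q, y = fsm[(q, x)]
        outputs.append(y)
    return "".join(outputs)


def degradation_arrows(fsm, states, loop_states, access_word, vocabulary,
                       net_output, symbol="a", n_max=100):
    """Collect arrows (q, p) of the state degradation diagram for `symbol`.

    `net_output(prefix, w)` must return the network's output string on w after
    the network has been reset and driven by `prefix`.
    """
    arrows = set()
    for i in range(1, n_max + 1):
        for w in vocabulary:
            for q in loop_states:
                observed = net_output(access_word[q] + symbol * i, w)
                if observed == lambda_plus(fsm, q, w):
                    continue  # state q is still emulated correctly
                for p in states:
                    if observed == lambda_plus(fsm, p, w):
                        arrows.add((q, p))
    return arrows


# Hypothetical four-state machine with a-loops; b-outputs are assumed.
FSM = {
    ("A", "a"): ("A", "1"), ("B", "a"): ("B", "2"),
    ("C", "a"): ("C", "3"), ("D", "a"): ("D", "4"),
    ("A", "b"): ("B", "1"), ("B", "b"): ("C", "1"),
    ("C", "b"): ("D", "1"), ("D", "b"): ("D", "1"),
}


def toy_net(prefix, w):
    """Stand-in for a trained network: B and C degrade after many a's."""
    q = "A"
    for x in prefix:
        q, _ = FSM[(q, x)]
    trailing = len(prefix) - len(prefix.rstrip("a"))
    if q == "B" and trailing > 8:
        q = "D"
    elif q == "C" and trailing > 10:
        q = "A"
    return lambda_plus(FSM, q, w)


if __name__ == "__main__":
    arrows = degradation_arrows(
        FSM, states="ABCD", loop_states="ABCD",
        access_word={"A": "", "B": "b", "C": "bb", "D": "bbb"},
        vocabulary=["a", "ab"], net_output=toy_net, n_max=20)
    print(sorted(arrows))
```

The vocabulary is kept to words whose outputs differ for every state, so that at most one candidate state p matches any observed output; a vocabulary without this property would produce ambiguous arrows.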
FIGURE 22. State degradation diagram for input a. $N_{\max} = 100$.

When the network with three state neurons was trained with the FSM M, it generalized correctly over the training set by forming four attractive fixed points of $\tau_a$ corresponding to loops in states A, B, C, D of M. The training process, looked at from the point of view of the asymptotic behavior of $\tau_a$, is illustrated in Figure 23. The horizontal axis corresponds to time (in epochs). The network state space $(0,1)^3$ is orthogonally projected into the 2-dimensional space of activations of a couple of state neurons. Bifurcations leading to formation of new attractive fixed points appeared during the 53rd, 115th and 121st epochs. If the network is able to exactly mimic the FSM M, the state degradation diagram for each input symbol has no arrows.

FIGURE 23. Evolution of position of attractive sets of $\tau_a$ during RNN training on FSM M (three state neurons).

As another example, consider the FSM M in Figure 24. It is an FSM taken from the database of the International Symposium on Circuits and Systems (Portland, Oregon, 1989) [BBK89]. In each of its seven states there is an a-loop with output 0, except for a-loops in states 4 and 7. The training set consists of 3500 training strings (each pairing an input word $w$ with the corresponding output word $\lambda^{+}(q_0, w)$) of input string length 3-35 and is ordered according to their length, starting with the shortest ones. The machine M is hard to learn because the training set is very sparse in output symbols other than 0. The training process is disrupted by a tendency to find a trivial solution represented by the automaton with only one state and loops for every input symbol with the output 0. An example of a part of the training set is given in Table 1.

dddeadfdaeaafaaadddaddadfeeedeaeee# -> 0000000000000000000000000000000002z
affedfeefaedeededfdefddaafeeeeeadd# -> 0000000000000000000000000000022200x
dffdadedfadaddffeeafeafdffdffefaad# -> 0000000000000000000000000000000000x
fdaadaafddafafdadfdffdeaffaa#feade# -> 0000000000000000000000000000000000z
ddfaddadfaaddddeafdafdfaeedaedaeda# -> 0000000000000000000000000000000000x
defadedefdeffdefdafdaaadeaeddaaefd# -> 0000000000000000000000000000000000x
ddfedaaffdedeaeadeefdfefaadadeaaff# -> 0000000000000000000000000000000000x
aafaaeefafeaffeeefeafaefeeadaefafa# -> 0000000000000000000000000000000000x
dddeeafffafeaadaddfdffadfeafdddefd# -> 0000000001100000000000000000000000x
fdaaddaadadffefaeadddfeddeafdddaea# -> 0000000000000000000000000000000000x
dedaddadaafeaaddaafaaefaefdeeffafe# -> 0000000000000000000000000000000000x
ddaeeafddfaaffffaeeefeadaefdfedfee# -> 0000000000000011100000000000000000x
dddedeeafdfddfaeeaddafdfafadedfaaf# -> 0000000000000000000000000000000000x

TABLE 1. A part of the training set characterizing the FSM M. Output strings are sparse in output symbols other than 0.

After 53 training epochs, the RNN with 6 state neurons is able to perform well on short test strings (training error was 0.06). Generalization on long test strings was found to be poor. Part of the problem was the unstable network representation of a-loops in M. The state degradation diagram for input a can be seen in Figure 25. The a-loops in states 4, 6, and 7 are "well represented" by fixed points $S_4$, $S_6$, and $S_7$, respectively, in that when starting in a small neighborhood of $S_q$, $q = 4, 6, 7$, the resulting output sequences of the RNN for input words $a^i w$, $w \in \Gamma$, $i > 0$, equal $\lambda^{+}(q, a^i w)$. This is not true of a-loops in states 1, 2, 3, and 5. When the net is reset and presented with $m_q a^i$, $q = 1, 2, 3, 5$, for $i > N_q$ it does not emulate $\lambda^{+}(q, a^i w)$, $w \in \Gamma$. States 5 and 3 degrade to states 1 and 2, respectively. In particular, $N_5$ = S and $N_3 = 5$. Both states 1 and 2 degrade to the attractive fixed point $S_0$ with $N_1 = 27$ and $N_2 = 40$. The network state $S_0$ does not represent any state of M even for short input strings. $S_j$, $j = 0, 4, 6, 7$, are the only attractive sets of $\tau_a$ that were detected. There are trajectories of $\tau_a$, starting near the border of regions of attraction of $S_0$ and some other attractive fixed point of $\tau_a$, that pass through the region assuming the role of state 5 of M for short input strings. Then, further towards $S_0$, they pass through the region of network states that for short input strings seem to be equivalent to the state 1 of M, finally making their way to a close neighborhood of $S_0$ and converging to it. A similar statement can be made about states 3 and 2 of M.
FIGURE 24. FSM M taken from the database of the International Symposium on Circuits and Systems (Portland, Oregon, 1989). M is the reduced form of a machine defined in the file bbara.kiss2. Inputs 01, 10, and 00 are represented as the input symbol a since in every state, they initiate the same transition with the same output. Inputs 0011, -111, and 1011 are represented as input symbols d, e, and f, respectively. Outputs 00, 01, and 10 are coded as output symbols 0, 1, and 2, respectively.
FIGURE 25. State degradation diagram for input a extended with network state $S_0$ not representing any state of M. $S_0 = (0.89, 0.01, 0.55, 0.95, 0.99, 0.92)$, $S_4 = (0.16, 0.98, 0.02, 0.87, 0.04, 0.92)$, $S_6 = (0.98, 0.03, 0.97, 0.09, 0.99, 0.87)$, $S_7 = (0.94, 0.98, 0.95, 0.01, 0.05, 0.15)$. $N_{\max} = 100$.
9 Discussion
Two views on the relationship between an RNN and an FSM M such that the RNN exactly mimics M have been presented. First, the network was treated as a state machine. The notion of regions of equivalent network states that are also equivalent to a state of M links the first approach with the second, dynamical systems approach to the RNN. Our experiments suggest that in the most usual stable RNN $\mathcal{N}$, representations of loops and cycles in M can be described as follows: An x-loop in a state $q$ of M induces an attractive fixed point of $\tau_x$ inside $(q)_{\mathcal{N}}$, and an x-cycle $\{q_1, \ldots, q_m\}$ of M induces an attractive periodic orbit of period $m$ of $\tau_x$ periodically visiting $(q_1)_{\mathcal{N}}, \ldots, (q_m)_{\mathcal{N}}$.

The present paper provides us with the opportunity to look at the learning process from the point of view of bifurcation analysis. If the network is supposed to operate as an FSM, its state space must have multiple attractor basins to store distinct internal states. The network solves the task of FSM simulation by location of point and periodic attractors and the shaping of their respective basins of attraction [Cum93]. Before training, the connection weights are set to small random values, and as a consequence, the network has only one attractor basin. This implies that the network must undergo several bifurcations [Doy92a]. This can have an undesirable effect on the training process, since the gradient descent learning may get into trouble. At bifurcation points, the output of a network can change discontinuously with the change of parameters, and therefore convergence of gradient descent algorithms is not guaranteed [Doy92b].

In the following, a possible application of these ideas to the problem of determining the complexity of language recognition by neural networks will be discussed briefly. Any FSM with binary output alphabet {0,1} can function as a recognizer of a regular language. A word over the input alphabet belongs to the language only if the output symbol after presentation of the word's last symbol is 1. Hence, the network output is used to decide whether a word belongs to the language or not. One of the most promising neural acceptors of regular languages [Shi87] is the second-order RNN introduced by Giles et al. [GMC+92a]. However, the practical aspects of the acceptance issue are still unclear [SSG92]. The difficulty of acceptance of a given language by a neural network (the neural complexity of the language) can be quantified by the minimal number of neurons needed to recognize the language. In the context of Mealy machines and threshold networks, a similar problem was attacked by Alon et al. [ADO91] and Horne and Hush [HH94]. An attempt to predict the minimal second-order RNN size so that the network can learn to accept a given regular language is presented in Siegelmann et al. [SSG92]. The predicted numbers of neurons were shown to correlate well with the experimental findings. Essentially, a good starting point for the estimation of neural complexity of a given regular language is the representation of the language with the reduced recognizer.
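The acceptance criterion for an FSM with binary output can be illustrated with a short sketch. The two-state machine below is a hypothetical example, not one of the machines from the figures: it outputs 1 only on a b read after an even number of a's, so it recognizes the words that end in b and contain an even number of a's.

```python
def accepts(fsm, start, word):
    """A word is accepted if the output on its last symbol is 1."""
    q, last_output = start, None
    for x in word:
        q, last_output = fsm[(q, x)]
    return last_output == 1


# Hypothetical Mealy machine: state "E" = even number of a's so far, "O" = odd.
PARITY_FSM = {
    ("E", "a"): ("O", 0),
    ("O", "a"): ("E", 0),
    ("E", "b"): ("E", 1),
    ("O", "b"): ("O", 0),
}

if __name__ == "__main__":
    for word in ["b", "ab", "aab", "aba", ""]:
        print(repr(word), accepts(PARITY_FSM, "E", word))
```

The empty word produces no output symbol and is therefore rejected in this sketch; whether the empty word should belong to the language is a convention the text does not fix.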
The most usual, very rough, approach to the neural complexity estimation takes into account only the number of states of such a recognizer [SSG92]. What plays a principal role in making the internal structure of a regular language rich is
• the number of input symbols of the recognizer,
• the number of loops associated with each input symbol,
• the number and corresponding lengths of cycles associated with each input symbol,
• the relationship among loops and/or cycles (i.e., an $x_1$-cycle is passing through a state $q$ in which there exists an $x_2$-loop, etc.).
In every recognizer of a regular language, for each input symbol there exists at least one loop or a cycle. During the training process, the weights of a network are modified so that the corresponding attractive sets evolve in dynamical systems defined by the iterative maps $\tau_x$. A hint for a lower bound on the minimal number of neurons can be obtained by exploring the possibilities of the existence of attractive points and/or periodic orbits that are to be induced during the training process. The expected relationship among their basins of attraction has to be taken into account at the same time [Cas93].

As an example consider the FSMs $M_1$ and $M_2$ in Figures 26 and 27, respectively. Apparently, the internal structure of a regular language accepted by $M_2$ is "more complex" than that accepted by $M_1$. In the latter case, only one attractive fixed point of $\tau_a$ is sufficient to represent the a-loop in the state E. The same applies to the b-loop in E and the map $\tau_b$. In the former case, an attractive periodic orbit of period four of the map $\tau_a$ and four attractive points of the map $\tau_b$ have to be induced. Even though the FSM $M_2$ has only four states, the RNN needed four state neurons to accomplish a successful learning. On the other hand, two state neurons were sufficient for the RNN to learn the FSM $M_1$.

FIGURE 26. Acceptor of the language $L = L_1 \cup L_2$, where $L_1 = \{a,b\}^n b$, $n \in \{0, 2, 4, 5, 6, \ldots\}$, and $L_2 = \{a,b\}^m a$, $m \in \{1, 3\}$.

FIGURE 27. Acceptor of the language L = L^, where Ls = b*ab* U {b*afb-^ U {b*a)\

A mechanism underlying generalization loss on longer input strings due to unstable representation of loops in an FSM to be learned was investigated. It was shown that even in such cases a correct state transition diagram of the FSM can potentially be extracted even though the network performs badly on longer input strings (as reported by Giles et al. [GMC+92a]). The state degradation diagram for an input symbol x illustrates how regions of network state space, initially acting as if they assumed the role of states of the FSM in which there is an x-loop, gradually degrade upon repeated presentation of x. The degradation may lead to a network state not representing any state of the FSM even for short input strings.

Zeng et al. [ZGS93] and Das and Mozer [DM94] view the RNN state space quantization as an integral part of the learning process in which the network is trained to mimic a finite state machine. In particular, in [ZGS93] the activation pattern of state units is mapped at each time step to the nearest corner of a hypercube as if state neurons had a hard threshold activation function. Das and Mozer [DM94] used a "soft" version of the Gaussian mixture model in a supervised mode as a clustering tool. The mixture model parameters were adjusted so as to minimize the overall performance error of the whole system (recurrent network + clustering tool). Both Zeng et al. and Das and Mozer report better asymptotic behavior for long, unseen test input strings. It would be interesting to investigate such approaches to training an RNN on finite state problems as a form of "dynamical self-reinforcement" learning encouraging bifurcations to attractive fixed points and periodic orbits of the underlying dynamical systems.
Footnote on the "soft" version of the Gaussian mixture model used by Das and Mozer [DM94]: Instead of the center with the greatest posterior probability given a pattern of state units' activation, a linear combination of centers is used, where each center is weighted by its posterior probability given the current network state.

Acknowledgments
Thanks to Maria Markosova, Pavol Brunovsky, and Phil Holmes for useful discussions on dynamical systems. The work of Mike Casey and Randall Beer contributed greatly to the preparation of this chapter.

10 REFERENCES
[ADO91] N. Alon, A.K. Dewdney, and T.J. Ott. Efficient simulation of finite automata by neural nets. Journal of the Association for Computing Machinery, 38(2):495-514, 1991.
[BBK89] F. Brglez, D. Bryan, and K. Kozminski. Combinational profiles of sequential benchmark circuits. In Proceedings of the International Symposium on Circuits and Systems, Portland, OR, May 1989.
[Bee94] R.D. Beer. On the dynamics of small continuous-time recurrent networks. Technical Report CES-94-18, Case Western Reserve University, Cleveland, OH, 1994.
[BW92] E.K. Blum and X. Wang. Stability of fixed points and periodic orbits and bifurcations in analog neural networks. Neural Networks, (5):577-587, 1992.
[Cas93] M.P. Casey. Computation dynamics in discrete-time recurrent neural networks. In Proceedings of the Annual Research Symposium, volume 3, pages 78-95, UCSD, La Jolla, CA, 1993. Institute for Neural Computation.
[Cas95a] M.P. Casey. Computation in Discrete-Time Dynamical Systems. PhD thesis, University of California, San Diego, Department of Mathematics, March 1995.
[Cas95b] M.P. Casey. Relaxing the symmetric weight condition for convergent dynamics in discrete-time recurrent networks. Technical Report INC-9504, Institute for Neural Computation, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0112, 1995.
[CSSM89] A. Cleeremans, D. Servan-Schreiber, and J.L. McClelland. Finite state automata and simple recurrent networks. Neural Computation, 1(3):372-381, 1989.
[Cum93] F. Cummins. Representation of temporal patterns in recurrent neural networks. In Proceedings of the Fifteenth Annual Conference of the Cognitive Science Society, pages 377-382, 1993.
[Dev86] R.L. Devaney. An Introduction to Chaotic Dynamical Systems. Benjamin/Cummings Publishing Company, Inc., Menlo Park, CA, 1986.
[DGS92] S. Das, C.L. Giles, and G.Z. Sun. Learning context-free grammars: Capabilities and limitations of a recurrent neural network with an external stack memory. In Proceedings of The Fourteenth Annual Conference of Cognitive Science Society. Indiana University, 1992.
[DM94] S. Das and M.C. Mozer. A unified gradient-descent/clustering architecture for finite state machine induction. In J.D. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Information Processing Systems 6, pages 19-26. Morgan Kaufmann, San Mateo, CA, 1994.
[Doy92a] K. Doya. Bifurcations in the learning of recurrent neural networks. In Proceedings of the 1992 IEEE International Symposium on Circuits and Systems, pages 2777-2780, 1992.
[Doy92b] K. Doya. Bifurcations in the learning of recurrent neural networks. In Proc. of 1992 IEEE Int. Symposium on Circuits and Systems, pages 2777-2780, 1992.
[Elm90] J.L. Elman. Finding structure in time. Cognitive Science, 14:179-211, 1990.
[GF89] M. Garzon and S. Franklin. Global dynamics in neural networks. Complex Systems, (3):29-36, 1989.
[GH82] J. Guckenheimer and P. Holmes. Nonlinear Oscillations, Dynamical Systems, and Bifurcations of Vector Fields. Springer-Verlag, Berlin, 1982.
[GMC+92a] C.L. Giles, C.B. Miller, D. Chen, H.H. Chen, G.Z. Sun, and Y.C. Lee. Learning and extracting finite state automata with second-order recurrent neural networks. Neural Computation, 4(3):393-405, 1992.
[GMC+92b] C.L. Giles, C.B. Miller, D. Chen, G.Z. Sun, H.H. Chen, and Y.C. Lee. Extracting and learning an unknown grammar with recurrent neural networks. In J.E. Moody, S.J. Hanson, and R.P. Lippmann, editors, Advances in Neural Information Processing Systems 4, pages 317-324. Morgan Kaufmann, San Mateo, CA, 1992.
[HH94] B.G. Horne and D.R. Hush. Bounds on the complexity of recurrent neural network implementations of finite state machines. In J.D. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Information Processing Systems 6, pages 359-366. Morgan Kaufmann, San Mateo, CA, 1994. Also submitted to Neural Networks.
[Hir89] M.W. Hirsch. Convergent activation dynamics in continuous time networks. Neural Networks, 2(5):331-349, 1989.
[Hir94] M.W. Hirsch. Saturation at high gain in discrete time recurrent networks. Neural Networks, 7(3):449-453, 1994.
[Hop84] J.J. Hopfield. Neurons with a graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Science USA, 81:3088-3092, May 1984.
[HZ92] S. Hui and S.H. Zak. Dynamical analysis of the brain-state-in-a-box neural models. IEEE Transactions on Neural Networks, (1):86-94, 1992.
[JNG94] L. Jin, P.N. Nikiforuk, and M.M. Gupta. Absolute stability conditions for discrete-time recurrent neural networks. IEEE Transactions on Neural Networks, (6):954-963, 1994.
[Jor86] M.I. Jordan. Attractor dynamics and parallelism in a connectionist sequential machine. In Proceedings of the Eighth Conference of the Cognitive Science Society, pages 531-546, Hillsdale, NJ, 1986. Erlbaum.
[Li92] L.K. Li. Fixed point analysis for discrete-time recurrent neural networks. In Proceedings of IJCNN, volume 4, pages 134-139, Baltimore, 1992.
[MF94] P. Manolios and R. Fanelli. First order recurrent neural networks and deterministic finite state automata. Neural Computation, 6(6):1155-1173, 1994.
[Min61] R.C. Minnick. Linear-input logic. IRE Transactions on Electronic Computers, EC-13:6-16, 1961.
[nHG95] P. Tiňo, B.G. Horne, and C.L. Giles. Fixed points in two-neuron discrete time recurrent networks: Stability and bifurcation considerations. Technical Report UMIACS-TR-95-51, Institute for Advanced Computer Studies, University of Maryland, College Park, MD 20742, 1995.
[nJV94] P. Tiňo, I.E. Jelly, and V. Vojtek. Non-standard topologies of neuron field in self-organizing feature maps. In Proceedings of the AIICSR'94 conference, Slovakia, pages 391-396. World Scientific Publishing Company, 1994.
[Shi87] M.W. Shields. An Introduction to Automata Theory. Blackwell Scientific Publications, London, UK, 1987.
[SSG92] H.T. Siegelmann, E.D. Sontag, and C.L. Giles. The complexity of language recognition by neural networks. In J. van Leeuwen, editor, Algorithms, Software, Architecture (Proceedings of IFIP 12th World Computer Congress), pages 329-335, Amsterdam, 1992. North-Holland.
[Vid93] M. Vidyasagar. Location and stability of the high-gain equilibria of nonlinear neural networks. IEEE Transactions on Neural Networks, 4(4):660-672, July 1993.
[Wan91] X. Wang. Period-doublings to chaos in a simple neural network: An analytical proof. Complex Systems, (5):425-441, 1991.
[WB90] X. Wang and E.K. Blum. Discrete-time versus continuous-time models of neural networks. Journal of Computer and Systems Sciences, 45:1-19, 1990.
[WK92a] R.L. Watrous and G.M. Kuhn. Induction of finite-state automata using second-order recurrent networks. In J.E. Moody, S.J. Hanson, and R.P. Lippmann, editors, Advances in Neural Information Processing Systems 4, pages 309-316. Morgan Kaufmann, San Mateo, CA, 1992.
[WK92b] R.L. Watrous and G.M. Kuhn. Induction of finite-state languages using second-order recurrent networks. Neural Computation, 4(3):406-414, 1992.
[ZGS93] Z. Zeng, R.M. Goodman, and P. Smyth. Learning finite state machines with self-clustering recurrent networks. Neural Computation, 5(6):976-990, 1993.
Chapter 7
Biased Random-Walk Learning: A Neurobiological Correlate to Trial-and-Error
Russell W. Anderson

ABSTRACT Neural network models offer a theoretical testbed for the study of learning at the network level. The only experimentally verified learning rule, Hebb's rule, is extremely limited in its ability to train networks to perform complex tasks. An identified cellular mechanism responsible for Hebbian-type long-term potentiation, the NMDA receptor, is highly versatile. Its operation is modulated by a wide variety of conditions and may be involved in several non-Hebbian processes. We have shown that another neural network learning rule, the chemotaxis algorithm, is theoretically much more powerful than Hebb's rule and is consistent with neurobiological observations. A biased random walk in synaptic weight space is a learning rule immanent in nervous activity and may account for some types of learning, notably the acquisition of skilled movement.
1 Introduction
In their landmark paper, "A Logical Calculus of the Ideas Immanent in Nervous Activity", McCulloch and Pitts [1943] demonstrated how a network of extremely simplified ("all-or-nothing") neurons could compute any Boolean function. Mathematical analyses of recurrent neural network models have shown them to be universal computing devices [Siegelmann and Sontag 1991, 1994, 1995]. Neural network modeling has not only been helpful in understanding the collective behavior of existing networks, it also provides a theoretical framework with which one can experiment with models of learning. Rosenblatt [1958] demonstrated that these networks, when endowed with modifiable connections ("perceptrons"), could be "trained" to classify patterns (see also Arbib [1987]). Thus Rosenblatt had developed a theoretical testbed for the study of learning.

Theoretical neural network studies (mathematical analyses and empirical computer simulations) are useful for exploring the capabilities and limitations of a proposed learning rule. The only experimentally verified learning rule, Hebb's rule, has profound limitations in this respect. Engineering optimization algorithms (such as backpropagation or genetic algorithms) are capable of training neural networks to perform much more sophisticated tasks but are biologically implausible [Churchland and Sejnowski 1989; Hecht-Nielsen 1989; Stork 1989; Crick 1989a,b; Mel 1990; Anderson 1991]. Long underestimated by both the experimental and theoretical neural network communities is perhaps the most intuitive mode of learning: trial-and-error. We have shown [Bremermann and Anderson 1989, 1991] that the mathematical analog to trial-and-error, a Gaussian biased random walk in synaptic weight space, is capable of training neural networks to perform the same complex, nonlinear mappings as backpropagation.

In this paper, the biological evidence for and theoretical limitations of Hebbian learning are reviewed. Next, theoretical and empirical studies of random-walk learning rules are presented. I argue the biological plausibility of trial-and-error learning rules through a discussion of existing neurobiological data and identified molecular mechanisms. Finally, new directions of experimental research are suggested.
2 Hebb's Rule
In 1949, Hebb proposed a neuronal learning rule that could integrate associative memories into neural networks [Hebb 1949]. Hebb postulated that when two neurons in synaptic contact fire coincidentally, the synaptic knobs are strengthened. Hebb's hypothesis is appealing as a cellular mechanism for associative learning. Hebb's rule is also appealing from a genetic point of view, since it requires very little genetic "overhead" to implement in actual nervous systems. All that is required is a mechanism for distinguishing simultaneous stimuli at the cellular level. Verification has taken time, but there is now evidence that Hebbian-type long-term potentiation (LTP) (with some modifications of the original hypothesis) does indeed occur [Lynch 1986; Kennedy 1988; Stevens 1989; Bliss and Collingridge 1993]. Long-term depression (LTD) has been observed in the same system, supporting an ancillary "Hebbian covariance learning rule" [Stanton and Sejnowski 1989].
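As a minimal illustration of the rule just described, the sketch below writes Hebb's postulate and a covariance variant as weight updates. The function names, the learning rate `rate`, and the use of mean activities in the covariance form are assumptions made for illustration; they are not taken from Hebb [1949] or Stanton and Sejnowski [1989].

```python
def hebb_update(w, pre, post, rate=0.1):
    """Plain Hebb: weight i grows when input i and the output are active together."""
    return [wi + rate * xi * post for wi, xi in zip(w, pre)]


def covariance_update(w, pre, post, pre_mean, post_mean, rate=0.1):
    """Assumed covariance form: uses deviations from mean activity,
    so a weight can also decrease (an LTD-like change)."""
    return [
        wi + rate * (xi - mi) * (post - post_mean)
        for wi, xi, mi in zip(w, pre, pre_mean)
    ]


# Input 0 is active together with the output; input 1 is silent.
w_hebb = hebb_update([0.0, 0.0], pre=[1, 0], post=1)

# Input 0 is active while the output is below its mean activity.
w_cov = covariance_update(
    [0.0, 0.0], pre=[1, 0], post=0, pre_mean=[0.5, 0.5], post_mean=0.5
)
print(w_hebb, w_cov)
```

With nonnegative activities the plain form can only strengthen a connection. The covariance form also weakens a connection when pre- and postsynaptic activity deviate from their means in opposite directions, which is one reading of how LTD could support a covariance rule.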
2.1 Experimental Evidence: The NMDA Receptor
Long-term potentiation is mediated by the N-methyl-D-aspartate (NMDA) receptor. It is useful to review the mechanisms in the current model of LTP for two reasons. First, it illustrates how the proposed (Hebbian) learning rule has influenced experimental efforts. Second, the actual mechanisms discovered are subtly different from the Hebbian ideal of strengthening correlated inputs.
[Figure 1 diagram labels: coincident (depolarizing) axon terminal; active axon terminal; non-NMDA receptor; nitric oxide? arachidonic acid?; retrograde messenger synthase?; Ca²⁺/calmodulin kinase, protein kinase C, tyrosine kinase, MAP kinase.]
FIGURE 1. NMDA implementation of Hebbian learning. Simultaneous membrane depolarization and activation of the NMDA receptor allows calcium ions to flow into the cell. Calcium-dependent proteins trigger a cascade of intracellular events leading to structural and/or chemical changes postsynaptically as well as potential presynaptic changes via retrograde messengers. (Adapted from [Montague et al. 1991; Kandel and O'Dell 1992].)
According to the current model of LTP [Zalutsky and Nicoll 1990; Buonomano and Byrne 1990; Kandel and O'Dell 1992], for the NMDA receptor channel to open, two conditions must be met simultaneously: (i) the receptor must bind glutamate, and (ii) the postsynaptic cell must be depolarized through activation of non-NMDA receptors. At resting potential, the NMDA receptor channel is blocked by Mg²⁺. Depolarization removes the voltage-dependent Mg²⁺ block, allowing Ca²⁺ to flow into the cell. Ca²⁺ appears to trigger LTP through the activation of several different protein kinases (see Figure 1). There is also evidence for chemical and/or structural presynaptic changes [Zalutsky and Nicoll 1990; Edwards 1991]. Presynaptic modification is thought to be effected via retrograde messengers released across the synaptosomal junction. The retrograde messenger is presumed to be a labile,
diffusible substance synthesized and released by the postsynaptic cell. The synthesis and/or release of such messengers is also thought to be a calcium-dependent process. Several substances have been postulated to function as retrograde messengers. Among them are nitric oxide [Gally et al. 1990], hydrogen peroxide [Colton et al. 1989; Zoccarato et al. 1989] and arachidonic acid [Williams et al. 1989]. (For a review, see [Montague et al. 1991].)
Many substances have been shown to have modulatory effects on LTP. A partial list of proteins, hormones, neurotransmitters and other compounds includes glycine and D-serine [Salt 1989], serotonin [Ropert and Guy 1991], acetylcholine and noradrenaline [Bear and Singer 1986; Brocher et al. 1992], human epidermal growth factor [Abe and Saito 1992], antidepressant drugs [Birnstiel and Haas 1991], milacemide [Quartermain et al. 1991], opioids [Xie and Lewis 1991] and ethanol [Iorio et al. 1992]. Thus, it is not surprising that mental states and other factors such as "attention", blood flow, and "excitement" can influence learning. That so many compounds can modulate LTP indicates that the NMDA receptor may be a much more universal tool for synaptic modification than previously thought, and not solely employed in local, Hebbian-type learning.
Finally, the NMDA receptor clearly mediates some, but not all, forms of learning. For instance, Malenfant et al. [1991] showed that application of an NMDA receptor antagonist (MK801) could block the acquisition of a spatial maze task in a dose-dependent manner. However, MK801 did not block the acquisition of experience-based maternal behavior. The same maternal experience effects can be blocked by chemical inhibition of protein synthesis.
In summary, the NMDA receptor requires coincident events and makes possible a type of associative learning. Its discovery required intricate experiments at synaptic junctions.
It is currently unclear whether synaptic change occurs at the postsynaptic dendritic spine, the presynaptic glutamate axon terminal, the presynaptic depolarizing axon, the axonal processes themselves, or a combination of all of these structures. Several chemical compounds have been identified that can facilitate or inhibit LTP. Many compounds that modulate LTP are common physiological chemical compounds, proteins or neurotransmitters and do not necessarily originate from either the pre- or postsynaptic neuron(s). Thus, it is conceivable that several forms of learning are operating in neural tissues, and that these other forms of learning can be mediated via the NMDA receptor as well as by other, independent, neural processes.
2.2 Limitations of Hebbian Learning
Theoretically, Hebbian learning can account for some types of biological learning. Hebbian mechanisms have been shown to be sufficient to account for topographic mappings [Kohonen 1984; Grajski and Merzenich 1990], plasticity in cortical representation [Merzenich et al. 1987; Montague et al. 1991] and, when applied to "sigma-pi" neurons, some nonlinear pattern recognition tasks [Mel 1992]. But there is more to the brain than conditioned reflexes and associative memories. For anything but special cases, Hebb's rule is insufficient as a learning rule [Rosenblatt 1962; Rumelhart et al. 1986].
Since Hebbian learning requires near-simultaneous or synchronous stimuli, it is limited temporally. For many tasks, instantaneous performance results are not available. Motor control problems, for example, are inherently sequential. Temporal delays are also involved in many phenomena observed in psychophysical and electrophysiological studies of classical conditioning, such as anticipation of an unconditioned stimulus [Chester 1990; Deno 1992]. Hebbian learning would have to be combined with additional memory mechanisms or neuronal structures to account for such phenomena. Recent attempts to expand Hebbian learning rules to include short-term memory [Sutton and Barto 1981; Klopf 1989; Grossberg and Schmajuk 1989] have met with limited success [Chester 1990].¹
Since the Hebbian rule applies only to correlations at the synaptic level, it is also limited locally. Strengthening a local correlation in the context of a nonlinear mapping of several variables (such as the N-bit parity problem) often reduces overall performance. Consequently, Hebbian learning is unable to train a multilayer perceptron network to learn arbitrary, nonlinear decision boundaries [Rumelhart et al. 1986].
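One way to see the locality limitation is to ask what a single synapse can measure in the N-bit parity problem. The sketch below enumerates all 2^N binary patterns and computes the covariance between each input bit and the parity target. The function names are hypothetical, and the covariance statistic is an assumed stand-in for the pairwise correlation available to a Hebbian synapse.

```python
from itertools import product


def parity_patterns(n):
    """All 2**n binary input patterns, each paired with its parity
    (1 if an odd number of bits are set, else 0)."""
    return [(bits, sum(bits) % 2) for bits in product((0, 1), repeat=n)]


def input_target_covariances(n):
    """Covariance between each single input bit and the parity target."""
    data = parity_patterns(n)
    m = len(data)
    target_mean = sum(t for _, t in data) / m
    covariances = []
    for i in range(n):
        x_mean = sum(bits[i] for bits, _ in data) / m
        cov = sum((bits[i] - x_mean) * (t - target_mean) for bits, t in data) / m
        covariances.append(cov)
    return covariances


print(input_target_covariances(2))  # the XOR case
print(input_target_covariances(4))
```

For N of 2 or more, every covariance is zero: each input, taken alone, carries no information about the class, so a rule driven only by pairwise correlations has nothing consistent to strengthen.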
3 Theoretical Learning Rules
Current artificial neural network (ANN) research has provided valuable insights into the collective behavior of small networks of neurons [Hopfield 1984; Lehky and Sejnowski 1988, 1990; Lockery et al. 1989]. However, most of these results were obtained using more sophisticated algorithms than Hebb's rule. Learning rules employed to train ANNs are more appropriately referred to as optimization procedures. These algorithms, most of which are based on minimization of a defined error function, are capable of overcoming the limitations of Hebb's rule. Among the most popular today are genetic algorithms [Montana and Davis 1989; Fogel et al. 1990; Austin 1990] and gradient-descent learning [Rumelhart et al. 1986]. (For an overview of "connectionist" learning rules, see Hinton [1989].) Most of these algorithms have little biological basis and are used primarily for engineering problems in pattern recognition, classification, signal reconstruction, and so on.
¹ To account for more complex phenomena, such as skilled movement, many have postulated that the brain utilizes "model-reference control," that is, the brain develops an internal model of the musculature and environment to predict performance of a control signal. A Hebbian mechanism can then be used to control such a system, since presumably, the temporal delay has been removed from correlated events. Such a system may in fact be used, especially for rapid, open-loop eye and hand movements [Grossman and Goodeye 1983; Anderson and Vemuri 1992]. But the "model" must still be updated by a global supervisory signal, which takes its cues from the external environment.
Do any of the multitude of ANN learning rules have any implications for experimental neurobiology? Criticisms of the biological plausibility of ANN training algorithms are abundant in the literature [Churchland and Sejnowski 1989; Hecht-Nielsen 1989; Crick 1989a; Mel 1990; Anderson 1991]. In his article "The recent excitement about neural networks," Francis Crick [1989a] writes: "It is hardly surprising that such achievements [referring to the successes using backpropagation] have produced a heady sense of euphoria. But is this what the brain actually does? Alas, the back-prop nets are unrealistic in almost every respect.... Obviously what is really required is a brain-like algorithm which produces results of the same general character as backpropagation" (emphasis added). Bartlett Mel [1990] poses the problem this way: "[I]s it...a fundamental law that neural associative learning algorithms must be either representationally impoverished or mechanistically overcomplex?"
What are the necessary features of a biologically plausible learning rule? First, it must have a mechanism for synaptic modification that is consistent with experimental data. Second, a learning rule must not involve so much specific neural structure that an excessive number of genes are required for its coding. Lastly, to be of any use to biologists, it must be observable. Clearly, Hebb's rule satisfies these criteria, while backpropagation, to varying degrees, violates all three. The debate over the biological plausibility of backpropagation continues [Dayhoff et al. 1994; Gardner 1993].
However, as the title of this paper suggests, there is at least one other ANN learning rule that satisfies these criteria—a biased random walk [Bremermann and Anderson 1989, 1991].
3.1 Learning via Random Walks
In its most basic form, a random walk can be generated by spontaneous, random variation in synaptic strength. This way, the mechanism for synaptic change is local and independent of any higher-level teaching signals. Changes in architecture or synaptic strength are then rewarded or punished after the fact. Such a biased random walk in synaptic weight space can be considered a cellular analogue of trial-and-error.
The first attempt to apply such an algorithm to artificial neural networks was by Lewey Gilstrap, Jr., Cook and Armstrong at Adaptronics, Inc. (McLean, VA) around 1970. They called their method "guided, accelerated random search" (GARS): "[T]he accelerated random search begins by exploring the vicinity of its initial estimate. The random trials are governed by a normal distribution of probabilities which is centered on the initial point.... The accelerated random search follows an unsuccessful random step, with a step of equal magnitude in the opposite direction. By this means, a successful step is usually achieved on the second trial if not on the first random trial.... A successful step is always followed by another step in the same direction.... Each successive step is given double the magnitude of the prior step" [Barron 1968]. Barron [1968, 1970] used GARS to optimize control parameters in flight control systems. Mucciardi [1972] applied GARS to neural net-like classification structures called "neuromine nets." Mucciardi's paper presented an analysis of neuromine nets and the algorithm but provided only simple examples of its application. Interest in neural networks was waning at that time, especially because of well-known limitations of simple perceptrons acknowledged by Rosenblatt [1962] and highlighted in Perceptrons [Minsky and Papert 1969]. Unfortunately, Mucciardi and his colleagues never applied their algorithm to the complex classification problems emphasized in Perceptrons: the exclusive OR and "connectedness" problems. Another aspect of random search, overlooked by the group at Adaptronics, was its potential relevance to biology.
In 1988, we began experimenting with a similar algorithm, which we dubbed the "chemotaxis algorithm" [Bremermann and Anderson 1989, 1991]. (See Inset.) The name was chosen by analogy to the strategy employed by bacteria to find chemoattractants in a spatial concentration gradient [Bremermann 1974; Alt 1980; Koshland 1980; Berg 1983]. Subsequently, Jabri and Flower [1992] have advocated the name "weight perturbation" for essentially the same algorithm.
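The GARS procedure quoted above from Barron [1968] can be sketched as follows. This is only a reading of the quoted description, not the Adaptronics implementation: the function name, the spread `sigma`, the cap `max_doublings`, and the choice to stop accelerating at the first failed step are all assumptions.

```python
import random


def gars_minimize(error_fn, w0, sigma=0.5, n_rounds=100, max_doublings=20, seed=0):
    """Guided, accelerated random search, as read from the quoted description."""
    rng = random.Random(seed)
    w = list(w0)
    err = error_fn(w)
    for _ in range(n_rounds):
        # Random trial drawn from a normal distribution centered on the current point.
        step = [rng.gauss(0.0, sigma) for _ in w]
        trial = [wi + si for wi, si in zip(w, step)]
        trial_err = error_fn(trial)
        if trial_err >= err:
            # Unsuccessful: take a step of equal magnitude in the opposite direction.
            step = [-si for si in step]
            trial = [wi + si for wi, si in zip(w, step)]
            trial_err = error_fn(trial)
        doublings = 0
        while trial_err < err and doublings <= max_doublings:
            # Successful: accept, then try the same direction at double the magnitude.
            w, err = trial, trial_err
            step = [2.0 * si for si in step]
            trial = [wi + si for wi, si in zip(w, step)]
            trial_err = error_fn(trial)
            doublings += 1
    return w, err


def sum_of_squares(w):
    """Toy error surface with its minimum at the origin (an assumed test function)."""
    return sum(wi * wi for wi in w)


w_final, err_final = gars_minimize(sum_of_squares, [3.0, -2.0, 1.0])
print(err_final)
```

Because a step is accepted only when it lowers the error, the returned error can never exceed the starting value; the doubling simply lets a lucky direction be exploited quickly.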
We showed that a biased Gaussian random walk could, in fact, train neural networks to solve the same difficult Boolean mappings that had eluded single-layer perceptrons and Hebbian networks (exclusive OR, N-bit parity, etc.). Since then, random-walk learning has been subjected to several criticisms. Here, I discuss or refute the most common criticisms:
Criticism #1: Random walks are known to get trapped in local minima in conventional optimization problems.
In the case of neural networks, local minima are not as much of a problem as one might expect. What is a local minimum in a small network with a lower-dimensional weight space often becomes a multidimensional saddle point in higher dimensions [Baldi and Hornik 1989; Conrad and Ebeling 1992; Yu 1992]. This is because of the degeneracy inherent in neural network architectures: There are usually many more free parameters (weights) than are theoretically required to solve the task at hand.
The Chemotaxis Algorithm
The "chemotaxis training algorithm" is one possible implementation of a biased random walk in weight space. One advantage of this training method is that it does not require gradient calculations or detailed error signals. It also allows for automatic adjustment of the single learning parameter, which otherwise has to be found empirically. The network is initialized with an arbitrary set of weights, $w^{0}$, and performance $E(w^{0})$ is evaluated. A random vector $\Delta w$ is chosen from a multivariate Gaussian distribution with zero mean and unit standard deviation. This random vector is added to the current weights to create a "tentative" set of weights ($w^{t}$):
$$w^{t} = w^{0} + h\,\Delta w,$$
where $h$ is a stepsize parameter. Performance $E(w^{t})$ is then calculated for the tentative weights. If the error of the new configuration is lower than that of the original configuration, the tentative changes in the weight vector are retained; otherwise, the system reverts to its original configuration. If a successful direction in weight space is found, weight modifications continue along the same random vector until progress ceases. A new random vector is then chosen, and the process is repeated. More details are available in the cited literature.
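The steps in the inset can be sketched in a few lines. This is a minimal sketch under stated assumptions rather than the published implementation: the function name, the fixed trial budget `n_trials`, and the toy error function are hypothetical, and the automatic adjustment of the stepsize mentioned in the inset is omitted, so the stepsize (called `h` here) stays constant.

```python
import random


def chemotaxis_minimize(error_fn, w0, h=0.5, n_trials=500, seed=0):
    """Biased random walk in weight space: a tentative step is kept only if
    it lowers the error. Returns the final weights and the error history."""
    rng = random.Random(seed)
    w = list(w0)
    err = error_fn(w)
    history = [err]
    direction = None  # current random vector; None means "draw a new one"
    for _ in range(n_trials):
        if direction is None:
            # Random vector with zero-mean, unit-standard-deviation Gaussian components.
            direction = [rng.gauss(0.0, 1.0) for _ in w]
        # Tentative weights: current weights plus stepsize times the random vector.
        tentative = [wi + h * di for wi, di in zip(w, direction)]
        tentative_err = error_fn(tentative)
        if tentative_err < err:
            # Success: retain the change and keep moving along the same vector.
            w, err = tentative, tentative_err
        else:
            # Failure: keep the original weights and choose a new random vector.
            direction = None
        history.append(err)
    return w, history


def sum_of_squares(w):
    """Toy error surface with its minimum at the origin (an assumed test function)."""
    return sum(wi * wi for wi in w)


weights, history = chemotaxis_minimize(sum_of_squares, [3.0] * 5)
print(history[0], history[-1])
```

The error function is treated as a black box: only a scalar "better or worse" comparison is used, with no gradient and no per-weight error signal. In a network setting `error_fn` would evaluate the network on its training patterns for a given weight vector.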
Evolutionary optimization is also easier in high-dimensional, redundant systems [Conrad 1983]. A biased random walk can be considered a rudimentary genetic algorithm, one where the environment selects one of two possible mutant structures at each step. Conrad and Ebeling [1992] have shown that saddle points, not isolated peaks, dominate high-dimensional fitness landscapes: "Increasing the dimensionality of a system...increases the chances of finding an uphill [favorable] pathway to still higher peaks." Conrad refers to this phenomenon as "extradimensional bypass."
Criticism #2: Random walks are inefficient.
A biased random walk is also a form of gradient descent (random descent) and is quite efficient. In the case of a 3-dimensional spherical gradient (a condition that is ideal for gradient descent), the path taken to reach the optimum by the chemotaxis algorithm is, on average, only 39% longer than the optimal direct gradient path [Bremermann 1974]. Empirical studies show that the chemotaxis algorithm, while usually slower to converge, compares favorably in final network performance with backpropagation on a variety of benchmark tasks [Bremermann and Anderson 1989; Wilson 1991]. Furthermore, in cases where local minima do exist, there is no reason to expect that the chemotaxis algorithm is more prone to local minima than backpropagation [Anderson 1991; Baldi 1991]. An extensive analytical comparison of random descent and gradient descent learning is given by Baldi [1991].
Criticism #3: Random walks cannot train neural networks to solve complex, nonlinear mappings such as the exclusive OR.
This belief, reinforced by the perceived problem of local minima, is simply untrue [Bremermann and Anderson 1989] (Table 1). In addition to the benchmark problems, the chemotaxis algorithm has been applied successfully to training neural networks to solve a variety of problems: discrimination of seismic signals [Dowla et al. 1990; Anderson 1991], training "recurrent" neural networks [Anderson 1991], process control [Willis et al. 1991a,b], and motor control [Anderson and Vemuri 1992; Styer and Vemuri 1992a,b, 1995]. Experiments with other stochastic training algorithms have had similar successes [Harth and Tzanakou 1974; Tzanakou et al. 1979; Harth et al. 1988; Smalz and Conrad 1991; Jabri and Flower 1992].
Criticism #4: "Reinforcement" learning models are not biologically plausible.
Reinforcement signals are generally thought to carry only general information about the overall performance ("good," "better," "target was missed by x amount," etc.). Specific information to individual synapses as to their relative responsibility in the task would be very difficult to determine. Biological mechanisms for assigning responsibility to each individual synapse are highly unlikely [Crick 1989a].
Most proposed reinforcement learning rules are also "mechanistically overcomplex" [Bremermann and Anderson 1989, 1991]. In Barto and Sutton's reinforcement learning schemes, for example, synaptic change is generated by the reinforcement signal itself, as interpreted by an adaptive critic element [Barto et al. 1981; Barto and Sutton 1983]. Although this work has generated many interesting and nontrivial applications, the complexity of its synaptic adjustment rules makes it an unlikely candidate for a biological learning rule. Other reinforcement algorithms have similar drawbacks [Williams 1992]. Surprisingly, in a comparison between adaptive critic and chemotaxis in controlling a cart-pole system, chemotaxis performed as well as or better than the more complicated (and less biological) adaptive critic networks [Styer and Vemuri 1992a,b, 1995].

Chemotaxis Algorithm Performance
Dimension (N)    Chemotaxis (epochs)    Backpropagation (epochs)
2 (XOR)          113                    25
3                251                    33
4                962                    75
5                1259                   130
6                4169                   310
7                5789                   800
TABLE 1. Training time for the N-bit parity problem. N-bit parity can be considered a generalization of the 2-bit "exclusive OR" (XOR) problem, since class membership of a given pattern is dependent on all N inputs. Network architecture was N-(2N + 1)-1, where 2N + 1 is the number of hidden units. The networks were trained on all 2^N possible binary input patterns. Training was continued until the network responses were within 10% of the ideal Boolean values. Chemotaxis averages are taken from Bremermann and Anderson [1989]. No attempt was made to optimize algorithm parameters. Backpropagation averages are taken from Tesauro and Janssens [1988], who used optimal values for the learning and momentum parameters. Note that the computational time is double these values in the case of backpropagation.

Criticism #5: Random-walk learning is not experimentally observable.
The final, and most important, obstacle to finding biological evidence for reinforcement learning has been, and continues to be, experimental observability. This is because random walks are a nonlocal phenomenon. Experimental protocols involving single neurons, synapses, or even a small collection of interacting neurons cannot directly verify a nonlocal learning rule. Local measurements of a global phenomenon can verify only two of the necessary elements: local synaptic variation and neuromodulation (facilitation or inhibition of synaptic change). The remainder of this article addresses this issue.
4 Biological Evidence
Reinforcement learning requires three components: (i) a mechanism for the generation of synaptic change, (ii) a structure for evaluating performance, or "trainer," and (iii) a reinforcement signal. To build a case for biological plausibility, one must show that all of the necessary elements are consistent with biological observations. Two components required for random-walk learning are clearly consistent with biological observations: random synaptic variation and neural structures for evaluating performance. Indeed, it is generally believed that local random explorations account for some types of neural development [Montague et al. 1991]. In developmental models, however, the reinforcement signal is provided by the target cell. The random walk ends when a process finds its target. This type of locally reinforced random walk has the same limitations as Hebbian learning. The difference in what is being proposed here is that the reinforcement signals are not generated locally, through retrograde messengers or cell-adhesion molecules. Instead, reinforcement is generated and broadcast from "supervisory" neural structures (Figure 2).
4.1 Random Structural Variation
Cellular events are dominated by stochastic processes. It has been shown that structural variation can be guided or influenced by chemical or neural signals. What remains to be found is whether this modulation is a local phenomenon or one mediated by higher centers. Here, I cite just two examples of experimental systems that are consistent with this view.
Growth of neurites in cerebellar granule cell cultures progresses stochastically [Rashid and Cambray-Deakin 1992]. Stimulation with NMDA results in a marked increase in growth rate, while the addition of an NMDA receptor antagonist, aminophosphonovalerate (APV), causes a marked retraction of preexisting processes. Either of these effects could be directed from more distant neural structures.
In another experiment, Glanzman et al. [1990] studied an in vitro coculture of Aplysia sensory neurons and their target (L7 motor) cells. The sensorimotor cocultures were grown for 5 days and observed by fluorescence video micrographs. One group of preparations was repeatedly treated with the facilitating transmitter serotonin (5-HT) for 24 hours. At the end of the experiment, the coculture was imaged again to look for structural changes. Morphological changes (changes in the size of varicosities or new processes) at the junctions between the sensory and motor cells were rated on a subjective scale. This study was significant in that the researchers were able to directly image structural changes, rather than relying on comparisons between two different populations of neurons. In the control group, morphological changes were found to be normally distributed with a mean change of zero on their rating scale. In the cocultures treated with serotonin, however, structural change was shown to be highly biased toward increases in varicosities or processes. Furthermore, they showed that these structural changes corresponded to measurable changes in monosynaptic excitatory postsynaptic potential (EPSP) produced in L7 motor cells by firing the sensory neuron. Thus, they were able to show that both physical and electrophysiological facilitation can be induced in vitro by a single chemical signal, serotonin.
I suggest that these random variations serve a vital role in learning, that is, generating trial connections and efficacies. Serotonin release in a cluster of neurons may serve as a local "print" (or fixing) signal to retain effective changes. However, the experiment described by Glanzman et al. was not designed to differentiate between serotonin's putative role as a simple growth factor or a reinforcement signal.
Serotonin has been shown to serve a role as a neuromodulator as well as a facilitation signal. There is evidence for a brainstem serotonergic projection to the ventrobasal thalamus, thus linking the facilitatory signal to higher brain centers [Eaton and Salt 1989]. Does facilitation reinforce existing changes, or does the change occur as a result of the presence of serotonin?
[Figure 2 diagram labels: performance appraisal (higher brain centers); sensory systems; external environment; neural circuits (site of random variation); reinforcement (facilitation or inhibition signal); effector organs.]
FIGURE 2. Neural implementation of a biased random walk. Random variation in synaptic connectivity and efficacy is rewarded after the fact if performance has improved. Performance is evaluated by sensory systems (somatosensory, visual, auditory, etc.) and a nonspecific reinforcement signal is broadcast to the participating neural circuitry. The reinforcement signal could be chemical (hormonal) or neural in origin.
4.2 Reinforcement Signals
A biased random walk requires that the performance of a net be evaluated. This requirement may not be that problematic, since evaluation of performance tends to be computationally easier than improvement, and evaluation could be accomplished by other brain circuits. For example, throwing a ball requires precise coordination and timing of numerous muscles. Good performance is hard to achieve and may require extensive training. But how close a ball comes to hitting the target is relatively easy to determine. Evaluation of accuracy can be processed separately by the visual cortex, independently of networks involved in generating the movement. One portion of the brain thus could act for another system as "supervisor."
The reinforcement signal is likely to carry only general, nonspecific, information. It could be neural or chemical (hormonal) in origin. Dayhoff et al. [1992, 1993] suggest that retrograde reinforcement signals could be mediated by intracellular signaling through the neuronal cytoskeleton. Glial cells are also capable of long-range intracellular communication via Ca²⁺ signaling [Cooper 1995]. Many of the substances that have been shown to modulate LTP (including the candidate retrograde messengers) are candidate reinforcement signals as well.
To complete a model of random-walk learning, one must demonstrate that other brain centers have projections to the sites of synaptic variation that release (directly or indirectly) substances that can act to facilitate or inhibit the process of structural change. One known reverse pathway is a projection from the locus ceruleus to the olfactory bulb. Locus ceruleus neurons are activated by unconditioned stimuli [Aghajanian and Vandermaelen 1982]. Reverse pathways from the locus ceruleus are diffuse but may still serve a neuromodulatory role [Crick 1989a]. Locus ceruleus neurons release the neurotransmitter norepinephrine (NE), which when infused into the rabbit olfactory bulb can prevent or delay the habituation to unreinforced odors [Gray et al. 1986]. Several forms of use-dependent synaptic plasticity in cortical tissues require the presence of NE [Bliss et al. 1983; Bear and Singer 1986]. Sullivan and colleagues argue that "it is now clearly established that activation of NE terminals in the olfactory bulb is necessary for memory formation, but not recall" [Sullivan et al. 1992]. Taken together, these data suggest that norepinephrine signals projecting from the locus ceruleus could be functioning as a reinforcement signal.
5 Conclusions
It is self-evident that some form of trial-and-error learning is involved in the acquisition of skilled movement [Grossman and Goodeye 1959; Anderson 1981]. But training a tabula rasa of randomly connected masses of neurons to perform complex control tasks is evidently a hopeless endeavor [Anderson 1991]. High-level control of movement is thought to involve the coordination or modulation of existing central pattern generators (CPGs) [Selverston 1980]. A biased random walk can be used to optimize a crudely organized network of CPGs during the acquisition of skilled movement [Anderson 1991; Anderson and Vemuri 1992; Styer and Vemuri 1992a,b, 1995]. This is somewhat analogous to Edelman's selectionist hypothesis in that learning entails the "selection," or education, of an existing repertoire of dynamical "groups" [Edelman 1987; Crick 1989b]. Furthermore, the chemotaxis algorithm is only the most primitive form of trial-and-error; undoubtedly, more sophisticated, higher-level, neural mechanisms will have evolved to coordinate and complement this process [Smalz and Conrad 1991].
Experimental verification of this type of learning will require protocols involving collections or assemblies of neurons, rather than individual synaptic junctions, to observe the stochastic variation and the effects of putative reinforcement signals. Furthermore, a more ambitious effort must be made to link reinforcement signals backwards to their projective sources.
McCulloch and Pitts offered a solution to the embodiment problem by demonstrating the computational properties of neural networks. Hebb proposed a neurobiological correlate to associative learning or classical conditioning. Biased random walks in synaptic weight space can be seen as the neurobiological "embodiment" of trial-and-error learning. A biased random walk may some day be shown to be a learning rule immanent in nervous activity.
6 Acknowledgments
I thank Daniel Chester for calling to my attention the work done at Adaptronics, Inc. I also thank Hans J. Bremermann, Lee Segel, Michael Conrad, Judith Dayhoff, Omid Omidvar, and V. (Rao) Vemuri for their encouragement and editorial comments. This work was performed under the auspices of the U.S. Department of Energy and supported by the Center for Nonlinear Studies at Los Alamos.
7
References
K. Abe and H. Saito (1992). Epidermal growth factor selectively enhances NMDA receptor-mediated increase of intracellular Ca²⁺ concentration in rat hippocampal neurons. Brain Research 587: 102-8.
G.K. Aghajanian and C.P. Vandermaelen (1982). Intracellular identification of central noradrenergic and serotonergic neurons by a new double labeling procedure. J. of Neuroscience 2: 1786-1792.
W. Alt (1980). Biased random walk models for chemotaxis and related diffusion approximations. J. of Mathematical Biology 9: 147-177.
J.R. Anderson, Ed. (1981). Cognitive Skills and Their Acquisition. Erlbaum Associates, Hillsdale, NJ.
R.W. Anderson (1991). Stochastic optimization of neural networks and implications for biological learning. Ph.D. Dissertation, University of California, San Francisco.
R.W. Anderson and V. Vemuri (1992). Neural networks can be used for open-loop, dynamic control. Int. J. Neural Networks 3 (3): 71-84.
M.A. Arbib (1987). Brains, Machines, and Mathematics, Second Edition (First Edition: McGraw-Hill, 1964). Springer-Verlag, New York.
S. Austin (1990). Genetic solutions to XOR problems. AI Expert pp. 52-57.
P. Baldi (1991). Gradient descent learning algorithms: A general overview. JPL Technical Document.
P. Baldi and K. Hornik (1989). Neural networks and principal component analysis: learning from examples without local minima. Neural Networks 2: 53-58.
R.L. Barron (1968). Self-organizing and learning control systems. In: Cybernetic Problems in Bionics (Bionics Symposium, May 2-5, 1966, Dayton, OH), Gordon and Breach, New York, pp. 147-203.
R.L. Barron (1970). Adaptive flight control systems. In: Principles and Practice of Bionics (NATO AGARD Bionics Symposium, Sept. 18-20, 1968, Brussels, Belgium), pp. 119-167.
A.G. Barto, R.S. Sutton and P.S. Brouwer (1981). Associative search network: A reinforcement learning associative memory. Biological Cybernetics 40: 201-211.
A.G. Barto and R.S. Sutton (1983). Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics SMC-13 (5): 835-846.
M.F. Bear and W. Singer (1986). Modulation of visual cortical plasticity by acetylcholine and noradrenaline. Nature 320: 172-17.
H. Berg (1983). Random Walks in Biology. Princeton University Press, Princeton, NJ.
S. Birnstiel and H.L. Haas (1991). Acute effects of antidepressant drugs on long-term potentiation (LTP) in rat hippocampal slices. Naunyn-Schmiedebergs Archives of Pharmacology 344: 79-83.
W.W. Bledsoe (1961). The Use of Biological Concepts in the Analytical Study of Systems. Technical Report, Panoramic Research Inc., Palo Alto, CA.
T.V.P. Bliss and G.L. Collingridge (1993). A synaptic model of memory: long-term potentiation in the hippocampus. Nature 361: 31-39.
T.V.P. Bliss, G.V. Goddard and M. Riives (1983). Reduction of long-term potentiation in the dentate gyrus of the rat following selective depletion of monoamines. J. of Physiol. 334: 475-491.
H.J. Bremermann (1974). Chemotaxis and optimization. J. of the Franklin Institute (Special Issue: Mathematical Models of Biological Systems) 297: 397-404.
H.J. Bremermann and R.W. Anderson (1989). An Alternative to Backpropagation: A Simple Rule of Synaptic Modification For Neural Net Training and Memory. Technical Report: U.C. Berkeley Center for Pure and Applied Mathematics PAM-483.
H.J. Bremermann and R.W. Anderson (1991). How the brain adjusts synapses—maybe. In Automated Reasoning: Essays in Honor of Woody Bledsoe, R.S. Boyer (Ed.), Chapter 6, pp. 119-147, Kluwer Academic Publ., Boston.
S. Brocher, A. Artola and W. Singer (1992). Agonists of cholinergic and noradrenergic receptors facilitate synergistically the induction of long-term potentiation in slices of rat visual cortex. Brain Research 573: 27-36.
D.V. Buonomano and J.H. Byrne (1990). Long-term synaptic changes produced by a cellular analog of classical conditioning in Aplysia. Science 249: 420-3.
D.L. Chester (1990). A comparison of some neural network models of classical conditioning. Proc. 5th IEEE International Symposium on Intelligent Control, Philadelphia 2: 1163-1168.
P.S. Churchland and T.J. Sejnowski (1989). Neural representation and neural computation. In: Neural Connections, Mental Computations, L. Nadel, L.A. Cooper, P. Culicover, and R.M. Harnish (Eds.), pp. 15-48, MIT Press, Cambridge, MA.
C.A. Colton, L. Fagni and D. Gilbert (1989). The action of hydrogen peroxide on paired pulse and long-term potentiation in the hippocampus. Free Radical Biol. Med. 7: 3-8.
M. Conrad (1983). Adaptability (Chapter 10), Plenum Press, NY.
M. Conrad and W. Ebeling (1992). M.V. Volkenstein, evolutionary thinking and the structure of fitness landscapes. BioSystems 27: 125-128.
M.S. Cooper (1995). Intercellular signaling in neuronal-glial networks. BioSystems 34: 65-85.
F. Crick (1989a). The recent excitement about neural networks. Nature 337: 129-132.
F. Crick (1989b). Neural Edelmanism. Trends in Neurosciences 12 (7): 240-248.
E.R.F.W. Crossman (1959). A theory of the acquisition of speed-skill. Ergonomics 2 (2): 153-166.
E.R.F.W. Crossman and P.J. Goodeye (1983). Feedback control of hand-movement and Fitts' law. Quarterly Journal of Experimental Psychology 35A: 251-278.
Y. Dan and M. Poo (1992). Hebbian depression of isolated neuromuscular synapses in vitro. Science 256: 1570-1573.
J.E. Dayhoff, S.R. Hameroff, C.E. Swenberg and R. Lahoz-Beltra (1992). Biological plausibility of back-error propagation through microtubules. Technical report of the Institute for Systems Research, University of Maryland, College Park, MD 20742. SRC TR92-17.
J.E. Dayhoff, S.R. Hameroff, C.E. Swenberg, and R. Lahoz-Beltra (1993). The neuronal cytoskeleton: A complex system that subserves neural learning. In Rethinking Neural Networks, K.H. Pribram and Sir J. Eccles (Eds.). Lawrence Erlbaum Assoc.
J. Dayhoff, S. Hameroff, R. Lahoz-Beltra, and C.E. Swenberg (1994). Cytoskeletal involvement in neuronal learning: a review. Eur. Biophys. J. 23: 79-93.
D.C. Deno (1992). Control theoretic investigations of the visual smooth pursuit system. Ph.D. Thesis, Dept. EECS, U.C. Berkeley.
F.U. Dowla, S.R. Taylor and R.W. Anderson (1990). Seismic discrimination with artificial neural networks: Preliminary results with regional spectral data. Bulletin of the Seismological Society of America 80 (5): 1346-1373.
S.A. Eaton and T.E. Salt (1989). Modulatory effects of serotonin on excitatory amino acid responses and sensory synaptic transmission in the ventrobasal thalamus. Neuroscience 33 (2): 285-292.
G.M. Edelman (1987). Neural Darwinism. Basic Books, N.Y.
F. Edwards (1991). LTP is a long term problem. Nature 350: 271-272.
D.B. Fogel, L.J. Fogel and V.W. Porto (1990). Evolving neural networks. Biological Cybernetics 63: 487-493.
J.A. Gally, P.R. Montague, G.N. Reeke and G.M. Edelman (1990). The NO hypothesis: Possible effects of a rapidly diffusible substance in neural development and function. Proc. Natl. Acad. Sci. USA 87: 3547-3551.
D. Gardner (1993). Backpropagation and neuromorphic plausibility. World Congress Neural Networks II: 590-593.
D.L. Glanzman, E.R. Kandel and S. Schacher (1990). Target-dependent structural changes accompanying long-term synaptic facilitation in Aplysia neurons. Science 249: 799-802.
K.A. Grajski and M.M. Merzenich (1990). Hebb-type dynamics is sufficient to account for the inverse magnification rule in cortical somatotopy. Neural Computation.
C.M. Gray, W.J. Freeman and J.E. Skinner (1986). Chemical dependencies of learning in the rabbit olfactory bulb: Acquisition of the transient spatial pattern change depends on norepinephrine. Behavioral Neuroscience 100 (4): 585-596.
S. Grossberg and N.A. Schmajuk (1989). Neural dynamics of adaptive timing and temporal discrimination during associative learning. Neural Networks 2 (2): 79-102.
E. Harth and E. Tzanakou (1974). A stochastic method for determining visual receptive fields. Vision Research 14: 1475-1482.
E. Harth, T. Kalogeropoulos, A.S. Pandya and K.P. Unnikrishnan (1988). A universal optimization network. AT&T Technical Memorandum 11118881026-23TM.
D.O. Hebb (1949). The Organization of Behavior. Wiley, New York.
R. Hecht-Nielsen (1989). Theory of the backpropagation neural network. IJCNN, Washington, DC, June 1989, I: 593-605.
G.E. Hinton (1989). Connectionist learning procedures. Artificial Intelligence 40 (1): 143-150.
J.J. Hopfield (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proc. Natl. Acad. Sci. (USA) 81: 3088-3092.
K.R. Iorio, L. Reinlib, B. Tabakoff and P.L. Hoffman (1992). Chronic exposure of cerebellar granule cells to ethanol results in increased N-methyl-D-aspartate receptor function. Molecular Pharmacology 41: 1142-1148.
Y. Izumi, D.B. Clifford and C.F. Zorumski (1992). Inhibition of long-term potentiation by NMDA-mediated nitric oxide release. Science 257: 1273-1276.
M. Jabri and B. Flower (1992). Weight perturbation: An optimal architecture and learning technique for analog VLSI feedforward and recurrent multilayer networks. IEEE Transactions on Neural Networks 3 (1): 154-157.
E.R. Kandel and T.J. O'Dell (1992). Are adult learning mechanisms also used for development? Science 258: 243-245.
J.A. Kauer, R.C. Malenka and R.A. Nicoll (1988). NMDA application potentiates synaptic transmission in the hippocampus. Nature 334: 249-252.
M.B. Kennedy (1988). Synaptic memory molecules. Nature 335: 770-772.
A.H. Klopf (1989). Classical conditioning phenomena predicted by a drive-reinforcement model of neuronal function. In Neural Models of Plasticity: Experimental and Theoretical Approaches, J.H. Byrne and W.O. Berry (Eds.), Chapter 7, pp. 104-132, Academic Press, Orlando, FL.
T. Kohonen (1984). Self-Organization and Associative Memories. Springer Verlag, Berlin.
D. Koshland (1980). Bacterial Chemotaxis as a Model Behavioral System. Raven Press, New York.
S.R. Lehky and T.J. Sejnowski (1988). Computing 3-D Curvatures from Images of Surfaces Using a Neural Model. Nature 333: 452.
S.R. Lehky and T.J. Sejnowski (1990). Neuronal Model of Stereoacuity and Depth Interpolation Based on a Distributed Representation of Stereo Disparity. J. of Neuroscience 10 (7): 2281-2299.
S.R. Lockery, G. Wittenberg, W.B. Kristan and G.W. Cottrell (1989). Function of Identified Interneurons in the Leech Elucidated Using Neural Networks Trained by Back-Propagation. Nature 340: 468-71.
G. Lynch (1986). Synapses, Circuits, and the Beginnings of Memory. Bradford/MIT Press, Cambridge, MA.
S.A. Malenfant, S. O'Hearn and A.S. Fleming (1991). MK801, an NMDA antagonist, blocks acquisition of a spatial task but does not block maternal experience effects. Physiology and Behavior 49: 1129-1137.
W.S. McCulloch and W. Pitts (1949). A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics 5: 115-133.
B.W. Mel (1990). Connectionist Robot Motion Planning. Academic Press, Boston, San Diego.
B.W. Mel (1992). NMDA-based pattern discrimination in a modeled cortical neuron. Neural Computation 4: 502-517.
M.M. Merzenich, R.J. Nelson, J.H. Kaas, M.P. Stryker, W.M. Jenkins, J.M. Zook, M.S. Cynader and A. Schoppman (1987). Variability in hand surface representations in areas 3b and 1 in adult owl and squirrel monkeys. J. of Comparative Neurology 258 (2): 281-96.
M. Minsky and S. Papert (1969). Perceptrons: An Introduction to Computational Geometry. MIT Press, Cambridge, MA.
P.R. Montague, J.A. Gally and G.M. Edelman (1991). Spatial signaling in the development and function of neural connections. Cerebral Cortex 1 (1): 1047-3211.
D. Montana and L. Davis (1989). Training feedforward neural networks using genetic algorithms. Proc. 11th IJCAI.
P.G. Montarolo, E.R. Kandel and S. Schacher (1988). Long-term heterosynaptic inhibition in Aplysia. Nature 333: 171-174.
A.N. Mucciardi (1972). Neuromine nets as the basis for the predictive component of robot brains. In: Cybernetics, Artificial Intelligence, and Ecology, H.W. Robinson and D.E. Knight (Eds.), Fourth Annual Symposium Amer. Soc. of Cybernetics, pp. 159-193, Spartan Books.
J.C. Pearson, L.H. Finkel and G.M. Edelman (1987). Plasticity in the organization of adult cerebral cortical maps: A computer simulation based on neuronal group selection. J. of Neuroscience 7 (12): 4209-4223.
D. Quartermain, T. Nuygen, J. Sheu and R.L. Herting (1991). Milacemide enhances memory storage and alleviates spontaneous forgetting in mice. Pharmacology, Biochemistry and Behavior 39: 31-5.
N.A. Rashid and M.A. Cambray-Deakin (1992). N-methyl-D-aspartate effects on the growth, morphology and cytoskeleton of individual neurons in vitro. Brain Research 67: 301-308.
N. Ropert and N. Guy (1991). Serotonin facilitates GABAergic transmission in the CA1 region of rat hippocampus in vitro. J. of Physiology 441: 121-36.
F. Rosenblatt (1958). The Perceptron, a probabilistic model for information storage and organization in the brain. Psych. Reviews 62: 386-408.
F. Rosenblatt (1962). Principles of Neurodynamics. Spartan Books, Washington, DC.
D.E. Rumelhart, G.E. Hinton and R.J. Williams (1986). Learning internal representations by error propagation. In: Parallel Distributed Processing, D.E. Rumelhart and J.L. McClelland (Eds.), Vol. 1, MIT Press, Cambridge, MA, pp. 318-362.
T.E. Salt (1989). Modulation of NMDA receptor-mediated responses by glycine and D-serine in the rat thalamus in vivo. Brain Research 481: 403-406.
A.I. Selverston (1980). Are central pattern generators understandable? Behavioral and Brain Sciences 3: 535-571.
H.T. Siegelmann and E.D. Sontag (1991). Neural nets are universal computing devices. Technical Report SYCON-91-08, Rutgers University, Center for Systems and Control, New Brunswick, NJ.
H.T. Siegelmann and E.D. Sontag (1994). Analog computation via neural networks. Theor. Comput. Sci. 131: 331-360.
H.T. Siegelmann and E.D. Sontag (1995). On the computational power of neural nets. J. Computer Syst. Sci. 50: 132-150.
R. Smalz and M. Conrad (1991). A credit apportionment algorithm for evolutionary learning with neural networks. In: Neurocomputers and Attention, A.V. Holden and V.I. Kryukov (Eds.), Vol. 2, Manchester University Press, New York, pp. 663-673.
P.K. Stanton and T.J. Sejnowski (1989). Associative long-term depression in the hippocampus induced by hebbian covariance. Nature 339: 215-218.
C.F. Stevens (1989). Strengthening the synapses. Nature 338: 460-461.
D. Stork (1989). Is back-propagation biologically plausible? IJCNN Washington, DC. II: 241-246.
D.L. Styer and V. Vemuri (1992a). Adaptive critic and chemotaxis in adaptive control. Conf. Artificial Neural Networks in Engineering (ANNIE), St. Louis, MO.
D.L. Styer and V. Vemuri (1992b). Control by artificial neural networks using model-less reinforcement learning. Preprint: Biomedical Engineering Graduate Group, University of California, Davis (submitted to Simulations).
D.L. Styer and V. Vemuri (1995). A comparison of adaptive critic and chemotaxis methods in adaptive control. Math. Comput. Modeling 21 (1/2): 109-118.
R.M. Sullivan, D.R. Zyzak, P. Skierkowski and D.A. Wilson (1992). The role of olfactory bulb norepinephrine in early olfactory learning. Brain Res. Dev. Brain Res. 70: 279-282.
R.S. Sutton and A.G. Barto (1981). Toward a modern theory of adaptive networks: Expectation and prediction. Psychological Review 88 (2): 135-170.
G. Tesauro and B. Janssens (1988). Scaling relationships in backpropagation learning. Complex Systems 2: 39-44.
E. Tzanakou, R. Michalak and E. Harth (1979). The alopex process: visual receptive fields by response feedback. Biological Cybernetics 35: 161-174.
J.H. Williams, M.L. Errington, M.A. Lynch and T.V.P. Bliss (1989). Arachidonic acid induces a long-term activity-dependent enhancement of synaptic transmission in the hippocampus. Nature 341: 739-742.
R.J. Williams (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8: 229-256.
M.J. Willis, C. Di Massimo, G.A. Montague, M.T. Tham and A.J. Morris (1991a). Artificial neural networks in process engineering. IEEE Proceedings D 138: 256-266.
M.J. Willis, G.A. Montague, C. Di Massimo, M.T. Tham and A.J. Morris (1991b). Non-linear predictive control using optimization techniques. Proc. ACC, Boston, pp. 2788-2793.
J.M. Wilson (1991). Back-propagation neural networks: A comparison of selected algorithms and methods of improving performance. Proc. 2nd Annual Workshop Neural Networks WNN-AIND, Auburn, AL.
C.W. Xie and D.V. Lewis (1991). Opioid-mediated facilitation of long-term potentiation at the lateral perforant path-dentate granule cell synapse. Journal of Pharmacology and Experimental Therapeutics 256: 289-96.
X.H. Yu (1992). Can backpropagation error surface not have local minima. IEEE Trans. Neural Networks 3: 1019-1021.
R.A. Zalutsky and R.A. Nicoll (1990). Comparison of two forms of long-term potentiation in single hippocampal neurons. Science 248: 1619-1624.
F. Zoccarato, R. Deana, L. Cavallini and A. Alexandre (1989). Generation of hydrogen peroxide by cerebral cortex synaptosomes. Eur. J. Biochem. 180: 473-478.
Chapter 8
Using SONNET 1 to Segment Continuous Sequences of Items
Albert Nigrin¹
ABSTRACT This chapter discusses self-organizing neural networks that were designed to classify temporal sequences. Three of the major constraints that the networks were designed to satisfy are: (1) The networks must be able to both learn and classify temporal sequences at the pace at which the sequences are presented (no off-line processing), (2) the networks must be able to learn to segment patterns that have no predefined beginnings or endings, and (3) the networks must be able to incrementally learn in an unsupervised fashion without degrading previously established categories. This chapter will discuss constraints and describe architectures for achieving these goals. The transformation of sequences of temporal events into spatial patterns of activity will be shown, and the properties that classifying systems should have, to enable them to classify the transformed patterns, will be discussed. A selection of simulations is given to show that the ideas presented are plausible.
1
Introduction
In order for an autonomous agent to operate in a real-world environment, it must overcome at least three major problems. First, to allow the agent to operate in real time, it must be able to respond to events at the pace at which they occur. Second, since real-world events usually have no predefined beginning, middle, or ending, an agent must be able to form its own segmentations. And finally, since there is often no external teacher present to guide it, the agent must be able to learn its categories in an unsupervised fashion. This chapter will examine these and other issues. It will attempt to design a real-time neural network that learns to segment a never-ending stream of input items in an unsupervised fashion.²

¹ Supported in part by ARC grant DAG-29-84-K-0072.
² Some of this chapter is excerpted from Nigrin (1993).

The ability to segment patterns in real time is important in areas such as object recognition, reinforcement learning, and speech recognition. Let us consider speech recognition. First, since language is interactive, real-time operation is a must. Second, as anyone who has ever listened to an unfamiliar foreign language can attest, there seldom are any clear-cut boundaries between words of a sentence. Thus, the ability to form segmentations in the presence of extraneous information is also a must. And third, if we wish to model humans, our systems must have the ability to learn language in an unsupervised fashion. This is clear, since infants learn their native language simply by listening to it.

Obviously, it is still beyond the scope of current technology to create neural networks that can learn continuous speech. However, by considering a simplified version of the problem, it may be possible to model the continuous and real-time nature of speech without complicating the problem excessively. After the simplified problem has been solved, additional issues can be dealt with. Therefore, the input patterns will not consist of continuous speech signals. Instead, they will be composed of never-ending sequences of items. For convenience, the items will be represented by capital letters. However, it is also possible to think of the items as phonemes, numbers, musical notes, etc. These items will be sequentially presented to the network at a constant rate and intensity. Then, the neural network's task will be to learn to segment the input sequences by discovering the significant patterns embedded within them.

For example, consider the sequence below, which is repeatedly presented to the network. The letters are presented one at a time, with no breaks in the presentation. Therefore, after the last letter, Z, is presented, the first letter in the sequence, E, is immediately presented again. Notice that the lists EAT and NOW are embedded within several different locations (contexts).
Because of this, the network should learn to recognize EAT and NOW as significant patterns.

E A T B C D N O W F G H I
E A T J K L M N O W P Q R
E A T S U V N O W X Y Z
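The cyclic presentation described above can be sketched in a few lines. This sketch reads the sequence as the three rows EATBCDNOWFGHI, EATJKLMNOWPQR, and EATSUVNOWXYZ; the helper names `item_stream` and `contexts` are illustrative and not part of SONNET 1.

```python
from itertools import cycle, islice

ROWS = ["EATBCDNOWFGHI", "EATJKLMNOWPQR", "EATSUVNOWXYZ"]
PERIOD = "".join(ROWS)  # one full pass through the sequence

def item_stream():
    """Yield items forever; after the final Z the stream restarts at E."""
    return cycle(PERIOD)

def contexts(pattern, period=PERIOD):
    """Return (previous item, next item) for each occurrence in the cyclic sequence."""
    n = len(period)
    doubled = period + period  # lets a match wrap past the end of the period
    found = []
    for start in range(n):
        if doubled[start:start + len(pattern)] == pattern:
            before = period[(start - 1) % n]
            after = period[(start + len(pattern)) % n]
            found.append((before, after))
    return found

first_items = "".join(islice(item_stream(), 40))
eat_contexts = contexts("EAT")
now_contexts = contexts("NOW")
```

Each of EAT and NOW occurs three times per pass, each time with different neighbors. That variation in surrounding context, with the lists themselves held constant, is what marks them as candidate patterns.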
If the network is viewed as a black box, it can be pictured as in Figure 1. Each input line represents a discrete event or item. An item is presented to the network by briefly activating its input line, and a list is presented by sequentially presenting different items. Figure 2 shows the activation of the input lines when the network is presented the lists ABC and CAB.

FIGURE 1. Black box of the network.

FIGURE 2. (a) Sequential activation of the input lines during the presentation of the list ABC. (b) Activation of the input lines when CAB is presented. Reprinted with permission from Nigrin (1993).

In this chapter, several restrictions are made to the type of inputs that are allowed. Methods to overcome these restrictions are discussed in Nigrin (1990, 1993).

1. No item can be repeated within the same list. For example, the list ABC can be presented to the network, but the list ABA cannot. This restriction is placed for two reasons. First, in the current model only one storage location exists for each distinct item, and second, it is presently unknown how to use a single node to represent multiple occurrences of an item in an unambiguous fashion. Therefore, before an item can be repeated, the activity in its storage location must be reset.
2. Items are presented by activating input lines at fixed intensities for fixed periods of time. Therefore, the only important information that can be varied is the sequence of the items. This chapter does not deal with the possibility of different rhythms or different intensities in the presentation of items.
3. For simplicity, no noise is present during any of the simulations, and the items are not garbled in any way. However, the network was tested on noisy patterns in simulations involving static spatial patterns (Nigrin, 1993).
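The three restrictions above can be made concrete with a small sketch of how a list might be turned into input-line activations over time. The function name `present_list`, the number of time steps per item, and the unit intensity are illustrative assumptions; the chapter fixes neither a duration nor an intensity value.

```python
def present_list(items, alphabet, steps_per_item=3, intensity=1.0):
    """Return one activation row per time step, with exactly one active input line."""
    if len(set(items)) != len(items):
        # Restriction 1: an item may not repeat within a list.
        raise ValueError("an item may not repeat within a list")
    rows = []
    for item in items:
        line = alphabet.index(item)
        for _ in range(steps_per_item):      # restriction 2: fixed duration
            row = [0.0] * len(alphabet)
            row[line] = intensity            # restriction 2: fixed intensity
            rows.append(row)                 # restriction 3: no noise is added
    return rows

abc = present_list("ABC", "ABC")
cab = present_list("CAB", "ABC")
```

Under these restrictions the lists ABC and CAB activate the same three lines for the same total time, so the only thing that differs between them is the order in which the lines become active.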
FIGURE 3. The structure of the network in this chapter. $F^{(1)}$ transforms temporal sequences of events into spatial patterns of activity. $F^{(2)}$ classifies these evolving spatial patterns. Reprinted with permission from Nigrin (1993).

We will tackle this problem using a Self Organizing Neural NETwork called SONNET 1. SONNET 1 is separated into two distinct fields of units, $F^{(1)}$ and $F^{(2)}$, as shown in Figure 3. $F^{(1)}$ contains nodes that represent the various input items that can be presented to the network. Each node represents a specific item (or event) and receives external input from a single input line that represents the item. $F^{(1)}$ transforms the successive activation of different input lines into a spatial pattern of activity in short-term memory (STM) to represent the sequence of items. $F^{(1)}$ nodes represent serial order, so they will be referred to by the letter $s$. Both the name and the activity of the $i$th cell in $F^{(1)}$ are given by $s_i$. $F^{(2)}$ has cell assemblies that represent lists of items. To be more precise, $F^{(2)}$ learns to chunk the evolving spatial patterns of activity across $F^{(1)}$. Both the name and the activity of the $i$th cell assembly in $F^{(2)}$ are given by $x_i$.

Using the architecture described above, there are three major problems that must be solved. First, constraints must be found to govern the transformation of temporal events into a spatial pattern of activity. Second, once an adequate transformation has been devised, neural network mechanisms must be designed at $F^{(1)}$ to implement this transformation. And third, networks must be designed at $F^{(2)}$ that can classify these unsegmented spatial patterns in real time.

The chapter is organized in the following manner. Section 2 will discuss the general manner by which patterns are classified at $F^{(2)}$. Then, Sections 3 through 5 will discuss various constraints on transforming a sequence of temporal events into spatial patterns of activity. Sections 6 and 7 will discuss the specific architectures whereby this can be implemented. Section 8 will discuss properties that a classifying system should have to enable it to classify transformed patterns. Section 9 will present a small number of simulations to show that this scheme is a plausible one, and Section 10 will present some additional discussion.
2
Learning Isolated and Embedded Spatial Patterns
Before discussing how temporal patterns are transformed, let us first discuss the general manner by which a classifying network can learn to classify arbitrary patterns at $F^{(2)}$. First, let us consider the network. In SONNET 1, $F^{(1)}$ and $F^{(2)}$ are fully connected. Thus, every $F^{(1)}$ node sends signals to every $F^{(2)}$ node and every $F^{(2)}$ node sends signals back to every $F^{(1)}$ node. These signals are gated (multiplied) by excitatory weights. The long-term memory (LTM) weight from $s_i$ to $x_j$ will be referred to by $z_{ij}$, and the feedback weight from $x_k$ to $s_m$ will be referred to by z[]l. At $F^{(2)}$, inhibitory connections exist between all the $F^{(2)}$ nodes. Thus, when an input pattern is presented to $F^{(1)}$, the $F^{(2)}$ nodes will compete for the right to activate and classify the pattern.

The excitatory weights from $F^{(1)}$ to $F^{(2)}$ are initially small. Thus, when a novel $F^{(1)}$ pattern is initially presented, many $F^{(2)}$ nodes will weakly activate in response to it. However, after learning has taken place, a single $F^{(2)}$ node will activate strongly to represent the $F^{(1)}$ pattern. This strong activation will occur due to the fact that the learning rule causes the weights at the classifying $F^{(2)}$ node to become large and parallel to the $F^{(1)}$ pattern.
The rule for modifying the feedforward excitatory weights is given by³

$$\frac{d}{dt} z_{ji} = \epsilon_1 x_i \left[ -L z_{ji} + S_{ji} x_i \right], \tag{1}$$
where $\epsilon_1$ is a constant that governs the learning rate, $L$ is a decay constant, and $S_{ji}$ is the normalized value of $s_j$ at $x_i$. (See equation 12.)

Learning occurs in the following way. When a novel pattern is presented, some $F^{(2)}$ nodes will activate more strongly than others, due to the fact that the LTM weights are initially set to random values. The active nodes increase their LTM weights at active input lines and decrease their weights at inactive input lines. The higher the activity of an $F^{(2)}$ node, the faster its weights will increase and decrease. Since higher activity nodes increase their weights more quickly than do lower activity nodes, they boost their competitive advantage over the lower activity nodes. The asymmetry in the activity of the cells is also enhanced by the competitive structure of the network. $F^{(2)}$ nodes compete for the right to activate via lateral inhibitory connections. Higher activity nodes output more inhibition than do lower activity nodes. Consequently, they inhibit lower activity nodes more strongly than the reverse. The differences in the inhibition coupled with the differences in the learning rates cause the asymmetry in the activities of nodes to continually increase. Eventually, the asymmetries become so large that a single $F^{(2)}$ node activates fully and inhibits the remainder of the field. That node is then considered to have classified the $F^{(1)}$ pattern.

The network learning rate can be set to provide two different types of learning. Fast learning occurs when a high learning rate enables $F^{(2)}$ nodes to classify $F^{(1)}$ patterns as quickly as in a single trial. In this case an $F^{(2)}$ node simply memorizes the active $F^{(1)}$ pattern. Slow learning occurs when a lower learning rate allows the network to generalize over many different examples. This allows the network to be able to learn patterns that are embedded within larger patterns. An embedded pattern will be learned when that pattern occurs in multiple different contexts.
For example, suppose the pattern B is embedded within multiple larger patterns such as ABC, DBE, and FBG. Then, the network will learn to classify the pattern B in the following way. Suppose the $F^{(2)}$ node $x_i$ responds to the presentation of the patterns above. When ABC is presented, the LTM weights $z_{Ai}$, $z_{Bi}$, and $z_{Ci}$ will increase while the remaining weights decrease (in this example, $z_{Di}$, $z_{Ei}$, $z_{Fi}$, and $z_{Gi}$ all decrease). Similarly, when DBE is presented, $z_{Di}$, $z_{Bi}$, and $z_{Ei}$ increase while the remaining weights decrease. The weight $z_{Bi}$ increases for all the patterns that are presented. Conversely, the remaining weights increase infrequently (when their input line is active) and decrease frequently (in all contexts where their input line is not active). Therefore, $z_{Bi}$ will increase to much larger levels than the other weights, and $x_i$ will establish a category for the pattern B.

In SONNET 1, by varying a single parameter it is possible to control the number of different contexts that an embedded pattern must appear in before it is learned. For example, in the paragraph above we assumed that the parameter was set so that the appearance of B in only three contexts was sufficient for generalization to occur. However, with a different parameter choice, more than three contexts might be necessary. If this were true, then when ABC, DBE, and FBG were presented, the network would not create a category for B, but instead create categories for ABC, DBE, and FBG.

More discussion on the classifying network will be presented later, since some of the motivation for its design depends on the manner by which temporal events are transformed into spatial patterns.

³ This and future equations are presented only to give the reader concrete instantiations for various quantities in the network. However, it is possible to skip over any of the equations without any loss of continuity.
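The way slow learning isolates the embedded pattern B can be illustrated with a minimal numerical sketch. It makes several assumptions that are not taken from the chapter: the learning rule is read as $\frac{d}{dt} z_{ji} = \epsilon_1 x_i(-L z_{ji} + S_{ji} x_i)$ and integrated with unit Euler steps, a single node is held at full activity ($x_i = 1$), and the normalized input gives each active item the value $1/\sqrt{3}$. The normalization actually used by SONNET 1 is defined elsewhere (equation 12), so `normalized` here is a stand-in.

```python
import math

ITEMS = "ABCDEFG"

def normalized(pattern):
    """Unit-length input vector with equal activity on the pattern's items (assumed form)."""
    value = 1.0 / math.sqrt(len(pattern))
    return {item: (value if item in pattern else 0.0) for item in ITEMS}

def train(patterns, epsilon=0.05, decay=1.0, x_i=1.0, epochs=100):
    """Euler steps of dz/dt = epsilon * x_i * (-decay * z + S * x_i) for one node."""
    z = {item: 0.01 for item in ITEMS}  # small initial weights
    for _ in range(epochs):
        for pattern in patterns:
            s = normalized(pattern)
            for item in ITEMS:
                z[item] += epsilon * x_i * (-decay * z[item] + s[item] * x_i)
    return z

weights = train(["ABC", "DBE", "FBG"])
largest_other = max(weights[item] for item in "ACDEFG")
```

With these settings the weight for B settles near $1/\sqrt{3}$, while every other weight hovers at roughly a third of that value, because each of the other lines is active in only one of the three contexts. A larger `epsilon` would instead track whichever list was presented most recently, which corresponds to the fast-learning regime.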
3
Storing Items with Decreasing Activity
Before a temporal pattern can be classified, it must first be stored in a way that allows a neural network to process it. This section will give guidelines that show one possible way to transform a sequence of temporal events into a spatial pattern of activity. Once the network has performed this transformation, it will be able to classify the patterns using any classifying network that is sufficiently powerful.
Suppose a list of events $r_1, r_2, \ldots, r_n$ sequentially activates the $F^{(1)}$ nodes $s_1, s_2, \ldots, s_n$. After this list has been presented, some spatial pattern of activation must exist across $F^{(1)}$ to represent the temporal information of the list. (For convenience, the following examples use the list $r_1, r_2, \ldots, r_n$. However, any arbitrary sequence of items can be presented and unambiguously stored. One possible different sequence is $r_n, r_{n-1}, \ldots, r_1$.) It is clear that this activation pattern cannot be a binary one, since then it would be impossible to distinguish between different lists composed of the same items. For example, the network should be able to distinguish between the words LEFT and FELT, even though they are composed of the same items (Grossberg, 1978).
There are two obvious ways to represent order information with a spatial pattern of activity. After the list $r_1, r_2, \ldots, r_n$ has been presented, the activity across the item nodes can either be monotonically decreasing, as in equation 2, or monotonically increasing, as in equation 3.
$$s_1 > s_2 > \cdots > s_n, \tag{2}$$
$$s_1 < s_2 < \cdots < s_n. \tag{3}$$
FIGURE 4. Transformation of temporal events into spatial patterns of activity. (a) The activity of various nodes in a field after the list $r_1 r_2 r_3$ has been presented. (b) The activity of these nodes in response to the list $r_1 r_2$. (c) The activity of these nodes in response to the list $r_3 r_2 r_1$.
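The LEFT/FELT argument can be checked with a small sketch. It assumes a hypothetical four-letter alphabet, words without repeated letters, and a geometric decay (`ratio`) chosen only for illustration; none of these names come from the chapter. A binary code assigns the two words the same pattern, while a graded, decreasing code in the sense of equation 2 keeps them apart and lets the order be read back.

```python
def binary_code(word, alphabet):
    """1 if the letter occurred in the word, else 0 (order is lost)."""
    return [1 if ch in word else 0 for ch in alphabet]


def decreasing_code(word, alphabet, ratio=0.5):
    """Earlier letters get higher activity: 1, ratio, ratio**2, ...

    Assumes every letter occurs at most once in the word.
    """
    return [ratio ** word.index(ch) if ch in word else 0.0 for ch in alphabet]


def decode(code, alphabet):
    """Recover the list by sorting active nodes from highest to lowest activity."""
    active = [(a, ch) for a, ch in zip(code, alphabet) if a > 0]
    return "".join(ch for _, ch in sorted(active, reverse=True))


ALPHABET = "EFLT"  # hypothetical item field, one node per letter

left_binary = binary_code("LEFT", ALPHABET)
felt_binary = binary_code("FELT", ALPHABET)
left_graded = decreasing_code("LEFT", ALPHABET)
felt_graded = decreasing_code("FELT", ALPHABET)

print(left_binary == felt_binary)      # the binary patterns coincide
print(decode(left_graded, ALPHABET))   # order recovered from graded activity
print(decode(felt_graded, ALPHABET))
```

The geometric decay is only one way to obtain a strictly decreasing pattern; any pattern satisfying equation 2 would serve the same purpose.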
When order information is represented by a monotonically decreasing pattern, nodes representing items that occurred earlier have higher activations than nodes representing items that occurred later. The activation pattern that would result from a variety of different lists is shown in Figure 4.
Herein, order information will be represented by a decreasing pattern of activation, as in Figure 4. However, many other neural networks use exactly the opposite strategy. Networks such as those implemented by Sung and Jones (1988, 1990), and by Anderson, Merrill, and Port (1988) use an increasing pattern of activity to represent order information, as in equation 3. I believe that this has been done because increasing patterns are easier to achieve. However, as Section 6 will show, it is possible to design networks that obtain decreasing patterns of activity. Furthermore, as will be shown below, networks that use a decreasing pattern of activity do not have the fundamental problems with feedback that plague networks that use an increasing pattern of activity. (The networks that use equation 3 do not use feedback to bias their classifications.) Since I believe that networks must be able to use feedback, the transformation from temporal to spatial patterns will follow equation 2.
Let me present an example that shows why networks that use increasing patterns of activity have problems with feedback. Suppose that after the items $r_1, r_2, \ldots, r_n$ have been presented, the activation pattern $s_1, s_2, \ldots, s_n$ is monotonically increasing, as in equation 3. Furthermore, let that pattern be learned by some $F^{(2)}$ node $x_j$. After learning, both the bottom up and top down LTM weights will become parallel to the pattern of activation, so that $z_{j1} < z_{j2} < \cdots < z_{jn}$. This is shown in Figure 5a.
Now suppose that at a later time the same list is presented. After a portion of the list $r_1, r_2, \ldots, r_{k-1}$ has been presented, $x_j$ will partially activate and send feedback signals to $F^{(1)}$. Then, since $z_{j,k+1} > z_{jk}$, the gated signal received by $s_{k+1}$ will be larger than that received by $s_k$. Therefore, $s_{k+1}$ will receive more expectation signals than $s_k$. Thus, because the LTM weights have equilibrated to an increasing pattern, the $F^{(1)}$ field is biased to activate more easily to $r_{k+1}$ than to $r_k$. This occurs even though $r_k$ occurs earlier than $r_{k+1}$ and thus should be more expected! The problem becomes even more apparent when we realize that the $F^{(1)}$ node that is most biased to activate is the last item in the list, even though that item is not expected to occur for some time. This is clearly an error, since expectation signals should most bias the network to react to those items that are about to occur.
Conversely, if the items are stored with a decreasing pattern of activation, as in equation 2 (and shown in Figure 5b), this dilemma would not occur. Expectancy signals are generated correctly, since nodes representing earlier items receive more feedback than nodes representing later items. Thus, if feedback is used to generate expectation, then successive items should be stored with decreasing activation.
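The expectancy argument can be illustrated numerically. The sketch below assumes that the top down weights are simply the stored activity pattern divided by its sum (one way of being "parallel" to it) and that the gated feedback to each item node is the product of the classifying node's activity and the weight. The function names and the four-item patterns are hypothetical.

```python
def parallel_weights(pattern):
    """Weights proportional to the stored activity pattern (normalized to sum to 1)."""
    total = sum(pattern)
    return [a / total for a in pattern]


def feedback_signals(x_j, weights):
    """Gated feedback received by each item node from a classifying node x_j."""
    return [x_j * z for z in weights]


def most_primed_upcoming(signals, n_presented):
    """1-based list position of the not-yet-presented item receiving the most feedback."""
    upcoming = range(n_presented, len(signals))
    return max(upcoming, key=lambda k: signals[k]) + 1


# Four-item list; the classifying node is partially active after the first item.
increasing = parallel_weights([1.0, 2.0, 4.0, 8.0])   # storage as in equation 3
decreasing = parallel_weights([8.0, 4.0, 2.0, 1.0])   # storage as in equation 2

primed_inc = most_primed_upcoming(feedback_signals(0.5, increasing), n_presented=1)
primed_dec = most_primed_upcoming(feedback_signals(0.5, decreasing), n_presented=1)

print(primed_inc)  # position most primed under increasing storage
print(primed_dec)  # position most primed under decreasing storage
```

Under these assumptions, increasing storage primes the last item of the list most strongly, while decreasing storage primes the item that is due next.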
4
The LTM Invariance Principle
The previous section discussed one possible way to transform a temporal pattern into a spatial one. However, merely transforming a sequence of temporal inputs into a decreasing pattern of activation is not sufficient. For correct operation it is necessary to satisfy another constraint, called the LTM invariance principle. This principle can be stated as follows: "Once a sequence of input items [$r_1, r_2, \ldots, r_i$] is presented, its spatial pattern represents "past" order information. Presenting a new input [$r_{i+1}$] can reorganize the total pattern of coded STM activity at $F^{(1)}$, but . . . does not recode that part of the coded pattern which involves only past order information. In other words, new inputs can weaken the strength of past codes but do not deny the fact that the past events did occur" (Grossberg, 1978).
The LTM invariance principle applies to any representation of a sequence of items. It makes no assumptions about the manner by which input items are represented. In addition, it makes no assumptions about the rule used at the classifying layer to cluster the input. It states only that if some STM representation creates a match (arbitrarily defined) at a node in the classification layer, then that match should continue to occur after additional items are presented. (This is the case even for partial matches.)
FIGURE 5. Weights that evolve for different choices of transformations. (a) Items $r_1, r_2, \ldots, r_n$ are sequentially presented and are stored with increasing activity across the nodes that represent them. At equilibrium, the top down LTM weights from an $F^{(2)}$ node become parallel to the pattern of activation that is present while the $F^{(2)}$ node is active. (In this and all subsequent figures, larger LTM connections are indicated by larger squares incident on the cell.) (b) Same as part (a) except that items are stored with decreasing activity. Whenever modifiable feedback weights are needed, this is the correct choice for the temporal to spatial transformation. Reprinted with permission from Nigrin (1993).
One common rule that is used at the classifying layer is a dot product rule, which states that the input to the $i$th $F^{(2)}$ node is given by:
$$I_i = \sum_j s_j z_{ji}. \tag{4}$$
In this case, the LTM invariance principle reduces to the following rule: When new items are presented to a field, the total activity of the field can change, but the relative pattern of activation among the nodes activated by past items must remain constant. For example, suppose that after $r_1$ and $r_2$ activate $s_1$ and $s_2$, it is the case that $s_1 = 2$ and $s_2 = 1$. This might be coded by an $F^{(2)}$ node $x_i$ with weights $z_{1i} = 1$ and $z_{2i} = 0.5$. If the LTM invariance principle is followed, then when additional items such as $r_3$ are presented, the activities of $s_1$ and $s_2$ must stay in proportion to one another. Depending on the parameters in the system, they may rise to activities like $s_1 = 4$ and $s_2 = 2$, or $s_1 = 6$ and $s_2 = 3$. (If the activity of the field has saturated, they may even fall to activities like $s_1 = 1.2$ and $s_2 = 0.6$.) However, the relative activities between the two nodes must remain constant. This allows $x_i$ to continue to know that its list was presented, since the input vector of activities across $s_1$ and $s_2$ remains parallel to its LTM weights.
The LTM invariance principle is important for the following reason. The network cannot know a priori which subsequences will be significant and which will not. (For example, can you pick out the Turkish word in the list "alborsalab"?) In principle, all subsequences of a list are legitimate sequences in their own right and should be able to be learned by the LTM weights to $F^{(2)}$. This becomes difficult to do if the invariance principle is not followed. If each new item can change the relative pattern of activity across nodes representing previous items, then no pattern would remain in STM long enough to be coded by LTM weights.
Another problem that would result if the invariance principle were not followed has to do with the STM operation of the network. Consider some $F^{(2)}$ node that receives a large amount of excitatory input immediately after its sequence is presented.
If the relative pattern of activity can change (at $F^{(1)}$ nodes representing previous items), then after additional items are presented, that $F^{(2)}$ node may no longer receive large amounts of input. This is incorrect, since there would no longer be any way of knowing that the event represented by the $F^{(2)}$ node ever occurred. While it is true that future items should be able to weaken the significance of past events, these future items should not make it impossible to tell that these past events occurred. For example, after the word CARGO has been presented, it is still possible to tell that the word CAR has occurred. The significance of the word CAR may be reduced when embedded in the larger context. However, it is still possible to tell that the word occurred.
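The numerical example above (activities 2 and 1 coded by weights 1 and 0.5) can be checked with a short sketch. It takes the cosine between the past item activities and the LTM weights as the measure of "remaining parallel"; that choice of match measure and the function names are assumptions made here for illustration, since the principle itself leaves the match rule open.

```python
import math


def dot_input(s, z):
    """Dot product rule: the sum over j of s_j * z_j."""
    return sum(sj * zj for sj, zj in zip(s, z))


def parallelism(s, z):
    """Cosine between the activity vector and the weight vector (1.0 means parallel)."""
    norm = math.sqrt(dot_input(s, s)) * math.sqrt(dot_input(z, z))
    return dot_input(s, z) / norm


weights = [1.0, 0.5]            # weights from s_1 and s_2 to the classifying node

stored = parallelism([2.0, 1.0], weights)       # right after r_1, r_2
risen = parallelism([4.0, 2.0], weights)        # activities doubled by a later item
fallen = parallelism([1.2, 0.6], weights)       # activities reduced after saturation
violated = parallelism([4.0, 3.0], weights)     # ratio changed: invariance broken

print(stored, risen, fallen, violated)
```

Rescaling both activities leaves the measure at 1, whereas changing their ratio lowers it, which is the sense in which a violation would erase the evidence that the list occurred.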
Approaches that do not satisfy the LTM invariance principle have been used by Sejnowski and Rosenberg (1987), Jordan (1986), Miikkulainen and Dyer (1991), Elman (1990), St. John and McClelland (1990), Cottrell (1985), Elman and Zipser (1988), and Hanson and Kegl (1987). Although some of these networks could recognize patterns that were embedded in a sequence, the fact that the networks did not satisfy the invariance principle meant that they could not address the problem of segmenting overlapping patterns. For example, consider the presentation of the spoken words All turn, while noticing that the phonemes in the word alter are contained within that utterance. Since the above networks do not satisfy the invariance principle, it will be the case that after All turn has been fully presented, the pattern that initially represented the word All will no longer exist. A reliable segmentation cannot be performed until after the word turn has been fully presented, because immediately after the phonemes in All have been presented it is impossible to know whether the next phoneme will start a new word or complete the word alter. Consequently, the networks will not behave in a robust fashion.
It is possible for networks that use delay lines to satisfy the LTM invariance principle (Unnikrishnan, Hopfield, and Tank, 1992; Tank and Hopfield, 1987; Waibel, Hanazawa, Hinton, Shikano, and Lang, 1989). However, that approach has problems in hierarchical networks, since higher-level events of long duration require large amounts of network hardware to represent the required windows of time.
A rule that does satisfy the invariance principle is presented in Figure 6. The nodes $s_1$, $s_2$, $s_3$, and $s_4$ are activated sequentially by $r_1$, $r_2$, $r_3$, and $r_4$. After presentation of item $r_i$, node $s_i$ attains an activation of $\mu_i$. Concurrently, the activity of all other nodes is multiplied by the factor $\omega_i$. Since all other nodes are multiplied by the same factor, their activities remain proportional.
To satisfy the LTM invariance principle, $\mu_i$ and $\omega_i$ can be any nonnegative constants. However, to allow a classifying network to operate correctly, it is sometimes useful to place extra constraints on the values for these parameters. One useful constraint (Grossberg, 1978) is that $\forall i, j : \mu_i = \mu_j = \mu$, where $\mu$ is some constant. When this constraint is followed, then immediately after $r_i$ is presented, $s_i = \mu$, regardless of whether $r_i$ is the first, last or fifth item in the list. Allowing the activity reached by $s_i$ to be independent of $r_i$'s position in the list is a reasonable approximation, since the network should be able to respond to any item, regardless of the number of items that have previously been presented. If $\mu_i$ were allowed to diminish with list position, then the network would be unable to attend to the last item in a long list.
An additional constraint on the invariance principle has to do with the parameter $\omega_i$. Since $\mu_i$ remains the same at each list position, $\omega_i$ must be greater than 1 to allow the spatial pattern across $F^{(1)}$ to become monotonically decreasing. For example, suppose $\mu = 1$ and the list $r_1, r_2$ is presented. After $r_1$ is presented, $s_1 = 1$ and $s_2 = 0$. After $r_2$ is presented, $s_2 = 1$ and $s_1 = \omega_2$. If $\omega_2 < 1$, then $s_2 > s_1$, and the network would not achieve the decreasing activity pattern that was discussed in the last section.
$$\begin{array}{c|cccc}
\text{Time} & s_1 & s_2 & s_3 & s_4 \\ \hline
t_1 & \mu_1 & 0 & 0 & 0 \\
t_2 & \mu_1\omega_2 & \mu_2 & 0 & 0 \\
t_3 & \mu_1\omega_2\omega_3 & \mu_2\omega_3 & \mu_3 & 0 \\
t_4 & \mu_1\omega_2\omega_3\omega_4 & \mu_2\omega_3\omega_4 & \mu_3\omega_4 & \mu_4
\end{array}$$
FIGURE 6. Sequential activation of nodes $s_1$ through $s_4$ in a field that obeys the LTM invariance principle. Node $s_1$ is activated at time $t_1$ and reaches an activity of $\mu_1$. As each node $s_i$ is activated, the activity of all other nodes is multiplied by the factor $\omega_i$. (Figure taken from Grossberg (1978).)
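The storage rule of Figure 6 is simple enough to state as code. The following is a minimal sketch under the constraint that $\mu_i$ and $\omega_i$ are the same at every list position; the function name `store_list` and the values $\mu = 1$, $\omega = 2$ are illustrative choices, not parameters taken from the chapter.

```python
def store_list(n_items, mu=1.0, omega=2.0):
    """Return the field activities after each presentation (the rows of Figure 6).

    When item i is presented, every other node is multiplied by omega and
    node i is set to mu.
    """
    s = [0.0] * n_items
    history = []
    for i in range(n_items):
        s = [a * omega for a in s]  # scale all nodes (node i is still 0 here)
        s[i] = mu                   # the newly presented item reaches mu
        history.append(list(s))
    return history


rows = store_list(4, mu=1.0, omega=2.0)
for t, row in enumerate(rows, start=1):
    print(f"t{t}: {row}")

# With omega < 1 the same rule yields an increasing pattern instead.
weak = store_list(4, mu=1.0, omega=0.5)[-1]
print(weak)
```

Because every earlier node is scaled by the same factor, the ratio between any two past items is unchanged by later presentations, which is the invariance property; the sign of $\omega - 1$ then decides whether the stored pattern decreases or increases with list position.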
5
Using Rehearsal to Process Arbitrarily Long Lists
The previous section established constraints for the values of $\mu_i$ and $\omega_i$. That section determined that it should be the case that $\forall i, j : \mu_i = \mu_j = \mu$, and $\omega_i > 1$. However, these choices necessitate that we find some way to reset items that have been active for a long time, since otherwise a problem will result. In any real system, $\omega_i$ cannot be greater than 1 for all list positions, since there will eventually be a time at which the storage field saturates, and at this point, the activity of all nodes will be prevented from rising further. In fact, after saturation has occurred it will be the case that when new items are presented, the activity of nodes representing previous items must fall ($\omega < 1$) to make room for the activity of nodes representing new items. Thus, a bow can occur in the STM activation pattern. (See Figure 7.)
FIGURE 7. The activation of a field of nodes after a long list has been presented. The bow in the curve occurs at position $j$, and thus the TMS is $j$ items. Reprinted with permission from Nigrin (1993).
The number of items that occur before the occurrence of the bow is defined as the transient memory span (TMS). The TMS is a useful concept, since correct order information can be obtained for all lists whose lengths are shorter than the TMS (since they are stored with a monotonically decreasing pattern). The question then arises: How can the network retain order information for lists that are longer than the TMS? This can be achieved by introducing the concept of rehearsal (Grossberg, 1978, 1985). After the presentation of a long list, order is not necessarily confused if the items of the list can be classified and then reset. This results in no loss of information since the order information could still be obtained from the activity of the classifying nodes. For example, by using rehearsal, a network with a TMS of 4 could process the sequence THEBOYRAN even though it contains 9 letters. The network could do this by classifying the letters T, H, and E into a unitized representation of the word THE and deleting the letter representations from STM. Similarly, the network could create the chunks BOY and RAN, deleting the letters that make up those words. At no time does the network ever need to store a list in STM whose length is longer than 4.
However, if no preexisting nodes exist with which to chunk a long list, then order information would indeed be confused. This is not a problem if natural intelligence is being studied, since confusion also occurs in humans. Experiments have been done in which subjects were presented a long list of items and then asked to recall them in the correct order. It was found that subjects did well for items at the beginning and end of the list while doing poorly for items in the middle of the list (Grossberg, 1978). In fact, if the percentage of correct responses versus list position is plotted (the so-called serial position curves), a graph is obtained whose shape is identical to Figure 7. Thus, the LTM invariance principle is in good agreement with experimental data from humans.¹
FIGURE 8. On-center off-surround network used to store input patterns. Shown are the excitatory connections to $s_i$ and the inhibitory connections from it. Analogous connections exist for other cells. Reprinted with permission from Nigrin (1993).
This and previous sections have given guidelines that show how to properly transform a temporal pattern into a spatial one. The next several sections will present a network that implements these guidelines.
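The THEBOYRAN example can be traced with a toy sketch of rehearsal. It replaces the classifying field by a hypothetical set of known words (`lexicon`) and classifies as soon as the whole STM buffer spells one of them; the real network classifies through competing nodes and must also handle timing, so this is only an illustration of why the buffer never needs to exceed the TMS.

```python
def rehearse(stream, lexicon, tms=4):
    """Chunk a stream of items, resetting STM whenever the buffer is classified.

    Returns the chunks formed and the largest number of items held in STM.
    Raises OverflowError if the buffer would exceed the transient memory span.
    """
    buffer, chunks, peak = [], [], 0
    for item in stream:
        buffer.append(item)
        peak = max(peak, len(buffer))
        if len(buffer) > tms:
            raise OverflowError("STM holds more items than the TMS allows")
        word = "".join(buffer)
        if word in lexicon:      # the buffer has been classified ...
            chunks.append(word)
            buffer.clear()       # ... so its item nodes are reset
    return chunks, peak


chunks, peak = rehearse("THEBOYRAN", {"THE", "BOY", "RAN"}, tms=4)
print(chunks, peak)

# Without preexisting categories nothing is reset and the buffer overflows.
try:
    rehearse("THEBOYRAN", set(), tms=4)
    overflowed = False
except OverflowError:
    overflowed = True
print(overflowed)
```

The second call corresponds to the case described above in which no preexisting nodes exist to chunk the list, so order information cannot be maintained.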
¹ This is a good example where data from humans can help in neural network design. If this data did not exist, one might spend an inordinate amount of time trying to design a network that never confused order information. However, the data from human subjects shows that this property may be very difficult to achieve and that a designer might be better off avoiding it for the time being.
6
Implementing the LTM Invariance Principle with an On-Center Off-Surround Circuit
Section 4 presented two constraints that should be used for transforming a sequence of temporal inputs into a spatial pattern of activity ($\forall i, j : \mu_i = \mu_j = \mu$ and $\omega_i > 1$). This section will show that by arranging $F^{(1)}$ in the on-center off-surround configuration shown in Figure 8, it will be able to (approximately) satisfy both these constraints and transform a sequence of temporal events into a spatial pattern of activity (for a limited number of items). The equation for the $i$th cell in $F^{(1)}$ is given by
$$\frac{d}{dt} s_i = -A s_i + (B_s - s_i)\left[v_3 f(s_i) + I_i\right] - v_3 s_i \sum_{j \neq i} f(s_j), \tag{5}$$
where $A$, $B_s$, and $v_3$ are constants such that $A$ represents passive decay, $B_s$ is the maximum activity of $s_i$, and $v_3$ is a constant weight between cells that helps control the shape of the transformed activity pattern. $I_i$ represents external input and is nonzero only while the item represented by $s_i$ is being presented. Finally, $f(s_i)$ is the output signal of $s_i$.
As was proved in Grossberg (1973), if a linear signal (such as $f(s_i) = s_i$) is used in a network with this architecture, then once the external input to a set of nodes is shut off, the relative activities of the nodes in that set will remain constant forever. (However, as also shown in that paper, a linear signal amplifies noise. Thus, if the network is to be used in the presence of noise, a sigmoid signal should be used instead of a linear signal.) This is useful, since it means that when a linear function is used, the LTM invariance principle is automatically satisfied. For example, suppose that after $r_1$ and $r_2$ are presented, $s_1$ and $s_2$ reach some levels of activity. Then, if additional items, such as $r_3$, are presented, the total activity across $s_1$ and $s_2$ may change, but the relative activity across the cells will remain fixed forever (as long as neither $s_1$ nor $s_2$ receives any additional external input).
With certain choices for the parameters $I_i$, $A$, $v_3$, $B_s$, coupled with the amount of time that each input line is active ($K_{16}$), it is possible to achieve a decreasing pattern of activity across the nodes. For example, in equation (5) above, suppose the parameters are set such that $A = 0$, $I_i = 0.007$, $v_3 = 0.075$, $B_s = 50$, $K_{16} = 0.4$, and $f(s_i) = s_i$. Then, when a long list is presented, a decreasing pattern of activity will be obtained as shown in Figure 9. Furthermore, Figure 10 shows that for the first few items in a list the network will achieve values of $\mu \approx 0.1$ and $\omega \approx 2$. (With a different set of parameters, different behaviors can be obtained.) Notice that in the parameter choices, $B_s \gg \mu$.
This will cause the TMS to be very large and allow many items to be presented before the activity of the field saturates.
FIGURE 9. The activity of $s$ cells when a long list ($r_0, r_1, \ldots, r_n$) is presented. (a) Equation 5 is simulated with parameter values of $A = 0$, $I_i = 0.007$, $v_3 = 0.075$, $B_s = 50$, $K_{16} = 0.4$ and $f(s_i) = s_i$. With these parameters no bow is exhibited in the activity pattern. (b) Same as part (a) except that a log plot is used. Order information is represented well only in the linear region of the graph. Reprinted with permission from Nigrin (1993).
FIGURE 10. Values for $\mu_i$ and $\omega_i$ for activities of Figure 9. Although constant values for these parameters are desired, the graphs show that this occurs for only the first few items. (a) Values of $\mu_i$ vs. list position. (b) Values of $\omega_i$ vs. list position. Notice that $\omega_i$ approaches 1 as the list position increases. This makes it increasingly difficult to represent order information. Reprinted with permission from Nigrin (1993).
Let us examine the dynamics of this network and show why it satisfies the two constraints given at the beginning of the section. Suppose the list $r_1, r_2, \ldots, r_n$ is presented to the network, where each item $r_i$ activates cell $s_i$. Initially, there is no activity in the field. Then, the first item is presented by activating $I_1$. Suppose the parameters are set so that when $I_1$ is shut off, $s_1 = \mu$.
Now let us examine what occurs when the $i$th item in the list is presented, under the assumptions that the total activity of the field is small compared to $B_s$ and that $v_3 < 1$. We will see that when these assumptions are true, then $s_i$ will reach an activity of approximately $\mu$, regardless of $r_i$'s position in the list. To see that this is true, let us compare the positive input to the negative input at $s_i$. The positive input to $s_i$ is given by $(B_s - s_i)[v_3 s_i + I_i]$. Since $s_i \ll B_s$, this quantity reduces to approximately $B_s[v_3 s_i + I_i]$. The negative input is given by $v_3 s_i \sum_{j \neq i} s_j$. Since $\sum_{j \neq i} s_j$ is much less than $B_s$, it follows that the negative input is almost negligible in relation to the positive input. Thus, the positive and negative input to the cell representing the $i$th item is approximately the same as what is seen by the cell representing the first item. As long as the negative input can be considered negligible, newly presented item nodes will reach approximately the same level of activity as earlier item nodes.
Just as the network achieves $\mu_i \approx \mu$ for all items in the TMS, the network also obtains $\omega_i$ to be roughly constant and greater than 1. This happens since the activity of nodes representing past events will continually rise due to the term $(B_s - s_i) v_3 s_i$. As long as $s_i$ remains much smaller than $B_s$ and while the negative input can be discounted, that rise will occur at an almost constant rate. This achieves the goal of having $\omega_i \approx \omega > 1$ for all items in the TMS.
Furthermore, the invariance principle is satisfied. As long as $s_i$ and $s_j$ do not receive any further external input, the ratio between their activities will remain fixed forever. Presenting $r_k$ will alter the relative activity between $s_k$ and all other nodes. However, it will not alter the relative activities between any other nodes.
The size of the TMS can be varied by varying $I_i$, $v_3$, $B_s$, and $K_{16}$. In this network, there is no fundamental limitation to the maximum length of a list that can be represented. However, the ability to process longer lists does not come for free, since the dynamic range of the system must increase exponentially with the size of the maximum length list possible. Consider that if $\omega = 2$ and $\mu = 1$, then after the presentation of the 10-item list $r_1, r_2, \ldots, r_{10}$, the activity of $s_1$ equals $2^9$.
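The qualitative claims of this section can be explored with a small forward-Euler simulation. The sketch below is not the simulation behind Figures 9 and 10: it assumes a shunting equation built from the terms described in the text (passive decay, a positive input $(B_s - s_i)[v_3 s_i + I_i]$, and a negative input $v_3 s_i \sum_{j \neq i} s_j$) with a linear signal, and the presentation time `duration`, the step size `dt`, and the function name are illustrative assumptions.

```python
import math


def present_list(n_items, A=0.0, B=50.0, v=0.075, I=0.007,
                 duration=0.185, dt=0.001):
    """Present n_items one after another and return the field after each item."""
    s = [0.0] * n_items
    snapshots = []
    steps = round(duration / dt)
    for active in range(n_items):
        for _ in range(steps):
            total = sum(s)
            new_s = []
            for i, si in enumerate(s):
                ext = I if i == active else 0.0       # only one input line is on
                ds = (-A * si
                      + (B - si) * (v * si + ext)      # on-center excitation
                      - v * si * (total - si))         # off-surround inhibition
                new_s.append(si + dt * ds)
            s = new_s
        snapshots.append(list(s))
    return snapshots


snaps = present_list(4)
final = snaps[-1]
ratio_early = snaps[1][0] / snaps[1][1]   # s_1 / s_2 right after the second item
ratio_late = final[0] / final[1]          # s_1 / s_2 after two more items

print([round(a, 3) for a in final])
print(ratio_early, ratio_late)
```

With these assumed values the stored pattern decreases with list position, each new item reaches roughly the same activity as the first one did, and the ratio between the first two nodes stays fixed while later items are presented. Lengthening the list eventually drives the total activity toward $B_s$, at which point these approximations no longer hold.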
7
Resetting Items Once They Can Be Classified
The previous section showed how an on-center off-surround architecture could create a spatial pattern that accurately represented a sequence of items. Unfortunately, this representation was accurate only for a limited number of items, since the activity of $F^{(1)}$ eventually saturated. To allow unlimited numbers of items to be presented, it must be possible to reset active $F^{(1)}$ nodes after they have been classified by an $F^{(2)}$ node. (They must also be reset after the activity of the field saturates.)
Let us first discuss the reset of items once those items have been classified by some $F^{(2)}$ nodes. This will be done through the use of feedback. To allow the feedback to affect only cells in $F^{(1)}$ that are part of a classified pattern, the feedback signals will be gated by LTM weights that are modified by an equation that is similar to the equation that modified the feedforward weights. The equation that modifies the feedback weights is similar to
$$\frac{d}{dt} z^{f}_{ji} = \epsilon_1 x_j \left(-z^{f}_{ji} + S_i x_j\right), \tag{6}$$
where $\epsilon_1$ is the LTM learning rate, $x_j$ is the feedback signal from the $j$th $F^{(2)}$ cell, $z^{f}_{ji}$ is the feedback weight from $x_j$ to $s_i$, and $S_i$ is the normalized activity of $s_i$ ($S_i = s_i / \sum_k s_k$). When this equation is followed, the feedback weights will become symmetric to the feedforward weights, and consequently, anytime the feedforward weight from an $F^{(1)}$ cell to an $F^{(2)}$ cell is large, the corresponding feedback weight from the $F^{(2)}$ cell to the $F^{(1)}$ cell will also be large.
After the feedback weights have equilibrated to their desired values, it is possible to use rehearsal to process lists that are longer than the TMS. Recall that Section 5 showed how that could be done. That section showed how it was possible (in principle) to prevent $F^{(1)}$ from becoming saturated, by resetting item representations once the items had been classified into some list. This section will show how this reset can be implemented.
The key obstacle to implementing this reset has to do with the restriction of using only local information, and it is as follows. The $F^{(1)}$ patterns are classified at $F^{(2)}$. However, reset must occur at $F^{(1)}$. Since each node must operate on purely local information, how can the network know which $F^{(1)}$ nodes to reset and which to keep active? For example, suppose the list ABC has been presented to $F^{(1)}$ and that the sublist AB has been classified by $x_{AB}$. How can the network know, purely on the basis of local information, to reset $s_A$ and $s_B$, while allowing $s_C$ to remain active and be classified by some other $F^{(2)}$ node?
The answer to this question depends on the fact that the feedback weights from each $F^{(2)}$ cell assembly to $F^{(1)}$ are modified in such a way as to become symmetric to the feedforward weights into that cell assembly. By virtue of this, those $F^{(1)}$ cells that provide a large amount of input to some node $x_j$ (by being part of the pattern classified by $x_j$) receive large amounts of feedback when $x_j$ is active.
Conversely, those cells that are not part of $x_j$'s pattern do not receive large amounts of feedback. It is this distinction that allows the network to know which cells to reset and which to keep active.
This is most easily shown by diagramming the sequence of events during a classification, as is done in Figure 11. Suppose a pattern is active at $F^{(1)}$, part of which can be classified by some $F^{(2)}$ cell assembly $x_j$. Then, $x_j$ will activate and send feedback to only those portions of the $F^{(1)}$ pattern that are part of its classification. Once the activity of $x_j$ has exceeded some threshold for a short period of time, classification will be considered to have occurred, and $x_j$ will be reset (by a mechanism discussed in Nigrin (1990, 1993)).
This causes two things to occur. First, once $x_j$ is shut off, lateral signals from it will no longer inhibit other assemblies. This will allow other cells to classify the remaining (or evolving) $F^{(1)}$ pattern. Second, $x_j$ will no longer send feedback to $F^{(1)}$. It is this abrupt shutoff in feedback that provides $F^{(1)}$ with enough information to know which cells to reset. Those cells at $F^{(1)}$ that previously received large amounts of feedback turn off once their feedback abruptly terminates. Only those portions of the $F^{(1)}$ activity pattern not yet classified (that consequently never received large feedback from $F^{(2)}$) remain active. Then, other assemblies at $F^{(2)}$ compete to classify this remaining pattern.
FIGURE 11. Sequence of events during a classification. (a) Signals from $F^{(1)}$ activate a node at $F^{(2)}$. (b) An $F^{(2)}$ node that represents the first two active items of $F^{(1)}$ emits large feedback to $F^{(1)}$. (c) Feedback to $F^{(1)}$ ceases after that $F^{(2)}$ node has reached some threshold necessary for classification. (d) The $s$ cells that are part of the classified pattern are reset when feedback from $F^{(2)}$ abruptly terminates. Reprinted with permission from Nigrin (1993).
An ideal mechanism for incorporating this reset is called a gated dipole (Grossberg, 1982, 1987a, 1987b, 1988). However, to allow an easier implementation, SONNET 1 does not use the gated dipole. Instead, the following rule is used: Whenever the feedback to an $s$ cell drops abruptly by more than two thirds, that $s$ cell resets itself to make room for new items at $F^{(1)}$. By continually resetting $F^{(1)}$ nodes once they have been classified, arbitrarily long lists can be processed.
This method works well after $F^{(2)}$ has created many classifications. Unfortunately, it cannot be used before learning has taken place, since until then no $F^{(2)}$ cell will fully activate. Consequently, some additional mechanism is needed to deal with the saturation of activity at $F^{(1)}$. The simplest way to avoid this problem is for a network to reset the entire activity of $F^{(1)}$ after many items have been presented and the field begins to saturate (Nigrin, 1993). (A more complicated procedure involves resetting only the high activity nodes in $F^{(1)}$.)
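The two-thirds rule uses only quantities available at each item node, which the following sketch tries to make concrete for the ABC example. The function name, the feedback values, and in particular the `min_feedback` guard (which keeps a cell that never received meaningful feedback from being reset when a tiny signal vanishes) are assumptions added here, not part of the rule as stated.

```python
def apply_reset(prev_feedback, curr_feedback, activity,
                drop_fraction=2 / 3, min_feedback=0.05):
    """Reset every item node whose feedback fell by more than drop_fraction.

    Each node compares its own previous and current feedback, so the decision
    is purely local. Nodes with previous feedback at or below min_feedback
    are left untouched (an assumption of this sketch).
    """
    new_activity = []
    for prev, curr, act in zip(prev_feedback, curr_feedback, activity):
        abrupt_drop = prev > min_feedback and (prev - curr) > drop_fraction * prev
        new_activity.append(0.0 if abrupt_drop else act)
    return new_activity


# Item nodes s_A, s_B, s_C after the list ABC; the node coding AB feeds back
# strongly to s_A and s_B only, then shuts off once classification has occurred.
activity = [4.0, 2.0, 1.0]
feedback_on = [0.9, 0.45, 0.02]
feedback_off = [0.0, 0.0, 0.0]

after_reset = apply_reset(feedback_on, feedback_off, activity)
print(after_reset)   # s_A and s_B cleared, s_C still active

# A gradual decline of one third does not trigger a reset.
gradual = apply_reset([0.9], [0.6], [4.0])
print(gradual)
```

After the reset only the unclassified item remains active, so another classifying node can compete for it, as described above.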
8
Properties of a Classifying System
The previous sections showed how to transform a sequence of temporal inputs into a spatial pattern of activity at F^^K This section will discuss how these evolving spatial patterns are classified at F^^^. Since previous works (Nigrin, 1993) have discussed the motivation for and the construction of F^^^ in detail, this section will only briefly describe the properties that a classifying system should achieve. Some of the properties are important to any classifying system, while some are important only when processing temporal patterns. First, let us discuss the properties that are important to any classifying system. A classifying network should be able to: 1. Self-organize using unsupervised learning. A network should be able to form its own categories in response to the invariances in the environment. This allows the network to operate in an autonomous fashion, without the need for an external teacher. 2. Form stable category codes. A network should be able to learn new categories without degrading previous categories that it established. This is one aspect of what is called the "stability-plasticity"
268
Nigrin
dilemma. Networks that solve this dilemma can operate using both fast and slow learning (see next property). Conversely, those that do not are restricted to slow learning so as not to degrade previous categories (Grossberg, 1988). Networks that solve this dilemma can continue to operate when they encounter novel situations. Those that do not must be brought back into the lab and retrained on both the new and previous examples to insure that new learning does not degrade existing categories. 3. Perform fast and slow learning. A network should be able to perform fast learning to allow it to classify patterns as quickly as in a single trial, when it is clear exactly what should be learned and it is important that the network learn quickly. (For example, one should not have to touch a hot stove 500 times before learning that one will be burned.) Furthermore, a network should also be able to perform slow learning to allow it to generalize over multiple different examples. 4. Operate under the presence of noise. Networks should be able to operate in more than just laboratory conditions; they should also be able to operate in real-world environments. This requires the ability to operate in the presence of noise. Noise can occur in three different areas. It can be present within an object, within the background of the object, and within the components of the system. A network must be able to handle noise in all of these areas. 5. Scale well to large problems. There are at least two aspects to this property. First, as the size of a problem grows, the size of the required network should not grow too quickly. Second, as the number of different patterns in a training set increases, the number of required presentations for each pattern (to obtain successful classifications) should not increase too rapidly. 6. Create arbitrarily coarse or tight classifications. Patterns in a category often differ from the prototype (average) of the category. 
A network should be able to vary the acceptable distortion from the prototype in at least two ways. It should be able to globally vary the overall error that is acceptable. The network should also be able to allow different amounts of variance at different dimensions of the input pattern (the different input lines). This would allow the network to create categories that are more complex than just the nearest-neighbor variety.
There are also a few properties that are especially relevant to processing temporal patterns. These properties are the ability to perform real-time
operations and the ability to classify patterns that are embedded within larger patterns (this property is also important for static patterns). These properties will be discussed in greater detail in the following two sections.
8.1
Real-Time Operation
Classifying networks should achieve real-time operation. To make this discussion clear, the term "real time" must be defined. In this chapter, this term will be used in a more restrictive sense than that used in Carpenter and Grossberg (1987a, 1987b). There, the term real time was defined to be equivalent to no off-line processing. For example, when some pattern is presented to an ART network, that pattern is sustained until the network equilibrates. Then another pattern is presented, and so forth. When dealing with purely spatial patterns, this is an adequate formulation. However, when dealing with temporal patterns, a more restrictive definition of the term real time is necessary. Here, we will say that a network operates in real time when the network performs its classifications at the correct pace (not too slowly or too quickly) in response to the continuous evolution of input patterns. For example, suppose the sequential presentation of the letters T, H, and E has been classified by a network into a category for the word THE. Then, when the list THEDOGRAN is sequentially presented, the network should classify the pattern THE shortly after the E has been presented. Otherwise difficulties would result. If classification occurred too slowly, then the network would not be able to keep up with the items, and eventually the STM buffer would overflow. Conversely, if classification occurred too quickly (for example, immediately after the letter T), then after the word THE was learned, it would be difficult for the network to learn longer patterns like THEY. Real-time processing must continue to occur at the correct pace regardless of how much or how little competition occurs between different categories. This is not necessarily trivial to achieve.
For example, suppose that the only classification made by a network was the sequence HER by some cell x_HER. In this case, the equilibration time for the network must be calibrated so that x_HER classifies HER shortly after the R is presented. However, suppose that after some period of time the network also learns a new sequence, HE, by the node x_HE. Even though x_HER now competes with this new node, the network must still operate at the correct pace, with x_HER still classifying HER shortly after the R has been presented. This must also occur even after many additional categories such as HERO, HERD, HEN, and HELP have been learned by the network. While slight variations in classification time are acceptable, that time should increase only slightly due to the increased competition. To allow real-time operation, several changes need to be made in the operation of a classifying field. One important change concerns the manner
by which input combines at F^(2). Typically, the total gated input to the ith classifying cell (I_i) is given by the dot product rule: I_i = Σ_j s_j z_ji. Unfortunately, if this rule is used, real-time operation will be very difficult to achieve. Consider the following situation. Let the sequentially presented inputs r_1, r_2, ..., r_n form a decreasing pattern of activity across the F^(1) nodes s_1, s_2, ..., s_n. Let the activity of F^(1) be normalized and let s_j = 2s_(j+1). By equation 1, once the F^(2) node x_i has chunked this list, its LTM weights will become parallel to the STM activity across F^(1), and the total LTM at x_i will also become normalized. Let us consider how the presence or absence of the last item in the list will affect the percent change in total input to x_i. Since s_j = 2s_(j+1), s_1 = 2^(n-1) s_n. Furthermore, since input is gated by the LTM weights at x_i, the total input to x_i from s_1 is 2^(2(n-1)) times greater than from s_n (s_1 z_1i = 2^(2(n-1)) s_n z_ni). Thus, the percent increase in total input to x_i due to s_n is at most 1/2^(2(n-1)). If n = 2, then this is acceptable, since the second item increases input to x_i by 25%. However, if n = 4 (a reasonable-size chunk), the fourth item increases the input to x_i by less than 1.56%. Thus, as the lists get longer, the F^(1) cells representing later items will have increasingly less significance. This will make it very difficult for the network to distinguish between lists that differ only in the last item (for example, PAR and PART). For long lists, small amounts of noise could easily cause errors. Furthermore, these tiny differences in the amount of input received make real-time operation very difficult. This follows since different nodes that receive very similar amounts of input will need to equilibrate for long periods of time before the small differences in input result in large differences in activity.
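As a quick numerical check of the argument above, the following Python sketch evaluates the dot product rule for a list whose F^(1) activities halve from item to item, with LTM weights parallel to the activities. The function name `last_item_gain` and the choice to normalize the activities so that they sum to one are assumptions made for illustration; the ratio it reports does not depend on that normalization.

```python
def last_item_gain(n):
    """Relative increase in the dot-product input caused by the last of n items.

    Assumes s_j = 2 * s_{j+1} and weights z_j parallel to the activities,
    as in the thought experiment in the text.
    """
    raw = [2.0 ** (n - j) for j in range(1, n + 1)]  # s_1 : ... : s_n = 2^(n-1) : ... : 1
    total = sum(raw)
    s = [v / total for v in raw]                      # normalized STM activities
    z = list(s)                                       # LTM weights parallel to s
    without_last = sum(sj * zj for sj, zj in zip(s[:-1], z[:-1]))
    last = s[-1] * z[-1]
    return last / without_last


for n in (2, 3, 4, 5):
    bound = 1.0 / 2 ** (2 * (n - 1))
    print(f"n={n}: gain={last_item_gain(n):.4%}, bound={bound:.4%}")
```

For n = 2 the gain equals the 25% quoted in the text; for n = 4 it is 1/84, or about 1.19%, which lies under the 1/64 (about 1.56%) bound.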
One solution to this problem that can easily be dismissed is to decrease the difference in activation between F^(1) nodes representing successive items. This solution can be dismissed because it would obscure the differences between lists that are permutations of the same items (such as LEFT and FELT). Thus, to solve this problem, SONNET 1 resorts to a nonlinear equation to generate the total input to a cell. There are three basic properties that this nonlinear rule must achieve. First, when a large pattern is active, the input to a cell that classifies that pattern should be larger than the input to a cell that classifies a subset of that pattern. For example, when a pattern like ABC is presented, x_ABC should receive more input than x_AB. Second, when a subset of the pattern coded by an F^(2) cell is active, the input to the F^(2) cell should be reduced. This reduction must be done in the following way. If some pattern is presented that is a subset of a larger pattern, then the input to a cell that codes exactly the active pattern should be greater than the input to a cell that classifies a superset of the active pattern. For example, if AB is presented, x_AB should receive more input
than x_ABC.
Third, even when some weight is (reasonably) small, it should still be possible for input that is gated by that weight to affect the total input received by the classifying cell. For example, suppose some cell x_i classifies ABCD. Then, even if z_Ai is eight times greater than z_Di, the presence or absence of the item D should significantly influence the total input received by x_i.
One rule that satisfies these constraints has been implemented in SONNET 1. It is presented below only to illustrate one possible way to satisfy the constraints above. Other rules are possible, and in fact improvements to this rule have been proposed in Nigrin (1993). Let I_i be the total input to the ith cell in F^(2). Then I_i is given by
\[ I_i = I_i^{+} \, I_i^{\times} \qquad (7) \]

where

\[ I_i^{+} = \sum_{j \in T_i} S_{ji} \, z_{ji}, \qquad (8) \]

\[ I_i^{\times} = \max\Bigl( \prod_{j \in T_i} I_{ji}^{\times}, \; K_3 \Bigr), \qquad (9) \]

\[ I_{ji}^{\times} = K_1 + K_2 \min\Bigl( 1, \frac{S_{ji}}{Z_{ji}} \Bigr), \qquad (10) \]

where K_1 and K_2 are constants such that K_1 < 1 and K_1 + K_2 > 1. K_3 is a constant that is used to prevent the value of I_i^× from ever getting too small. (K_3 is actually an ad hoc constant that contributes to weaknesses in the current implementation.) Z_ji is the normalized LTM weight and, similarly, S_ji is the normalized input from s_j to x_i.
In these equations, T_i is the set of indices of F^(1) cells in x_i's classified pattern. For example, if x_i has classified a pattern composed of activity in s_3 and s_5, then T_i = {3, 5}. The quantity T_i is needed for the following reason.
Recall that inputs to reasonably small weights should be able to affect the total input to a classifying cell. However, inputs to extremely small weights should be ignored; otherwise, small fluctuations in noise could dramatically change the total input received by a cell. T_i is a cutoff used to discriminate which weights are significant and which are not. Its method of computation was illustrated in Nigrin (1990, 1993). The quantities I_i^+ and I_i^× serve different functions. The quantity I_i^+ increases as the weights increase. It is exactly the dot product between the normalized input vector and the weight vector. Notice that the inputs are normalized, and that this normalization is performed over only the set of F^(1) cells in T_i. This is done so that activity in F^(1) cells not in the classified pattern of an F^(2) assembly will not affect the input to the assembly. This allows patterns to be easily classified even when they are embedded within arbitrarily large patterns. The quantity I_i^× measures how well a node's LTM weights match the current F^(1) pattern. It compares the normalized input vector to the normalized weight vector. The use of I_i^× allows the presence or absence of activities that are coded by small weights to affect the total input received by the F^(2) cell.
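To make the roles of the two quantities concrete, here is a minimal Python sketch of a rule of this shape. It is an illustration under stated assumptions, not the SONNET 1 implementation: the min term of equation (10) is read as the ratio of normalized input to normalized weight, both normalizations are taken to be division by the sum over T_i, and the constants K1, K2, K3, the halving weight profile, and the helper names `total_input` and `unit_weights` are all hypothetical.

```python
import math


def total_input(s, z, K1=0.5, K2=0.7, K3=0.01):
    """Sketch of I_i = I_i^+ * I_i^x for one classifying cell.

    s: dict item -> F(1) activity (missing items are treated as inactive)
    z: dict item -> raw LTM weight; its keys play the role of T_i
    K1, K2, K3: hypothetical constants with K1 < 1 and K1 + K2 > 1
    """
    T = list(z)
    s_sum = sum(s.get(j, 0.0) for j in T)
    z_sum = sum(z[j] for j in T)
    # Assumed normalization: divide by the sum over T_i only.
    S = {j: (s.get(j, 0.0) / s_sum if s_sum > 0 else 0.0) for j in T}
    Z = {j: z[j] / z_sum for j in T}

    i_plus = sum(S[j] * z[j] for j in T)                    # eq. (8)
    product = math.prod(
        K1 + K2 * min(1.0, S[j] / Z[j]) for j in T          # eq. (10), assumed form
    )
    i_times = max(product, K3)                              # eq. (9)
    return i_plus * i_times                                 # eq. (7)


def unit_weights(pattern):
    """Hypothetical learned weights: halving per item, unit Euclidean length."""
    raw = {item: 2.0 ** -k for k, item in enumerate(pattern)}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {item: v / norm for item, v in raw.items()}


z_ab, z_abc = unit_weights("AB"), unit_weights("ABC")
abc = {"A": 4.0, "B": 2.0, "C": 1.0}   # list ABC active at F(1)
ab = {"A": 2.0, "B": 1.0}              # only AB active

print(total_input(abc, z_abc), total_input(abc, z_ab))
print(total_input(ab, z_ab), total_input(ab, z_abc))
```

With these particular constants, the cell coding ABC receives the larger input when ABC is active, and the cell coding AB receives the larger input when only AB is active, which is the ordering that the first two properties ask for. Other constants may not preserve that ordering.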
8.2
The Classification of Embedded Patterns
An additional property that classifying networks should achieve is the ability to classify patterns that are surrounded by extraneous information. This is essential in areas such as continuous speech, where there are usually no clear-cut boundaries between words. One way a network can deal with the extraneous information is to use both inhomogeneous nodes and a nonuniform pattern of connectivity between the nodes. In SONNET 1, the nodes evolve to have different input/output characteristics, and the connectivity pattern evolves so that nodes inhibit only other nodes that classify similar patterns. One possible justification for the necessity of inhomogeneous nodes concerns the predictive power of the classifying nodes and has been discussed elsewhere (Cohen and Grossberg, 1986, 1987; Marshall, 1990a, 1990b, 1992, 1995; Nigrin, 1990, 1993). Another justification arises if we analyze the structure that a network must have if it is to satisfy two simple constraints: (1) the network should be able to classify patterns that are surrounded by extraneous information; and (2) the network should be able to make clear-cut decisions. For example, suppose some F^(2) cell x_CAR represents the pattern CAR. The first constraint implies that x_CAR should receive the full input that is possible for it, even when additional items like I or S are present in an input pattern like CARIS. Otherwise, if the presence of extraneous items
reduced the input to x_CAR significantly, then x_CAR would not be able to activate when its pattern was embedded in larger patterns (as is often the case in speech signals). The second constraint implies that when multiple classifications are competing for an input pattern, the network should choose whichever cell best represents the pattern and allow that cell to fully activate, while suppressing the activity of other cells. For example, if CAR is presented to a network that has the classifications x_CAR and x_CARGO, then x_CAR should fully activate and x_CARGO should be suppressed, even though x_CARGO partially represents the input pattern. Conversely, when CARGO is presented, x_CARGO should fully activate and x_CAR should be suppressed. This is true even though the pattern that x_CAR represents is entirely present, and therefore (by the first constraint) x_CAR must receive the full input that is possible for it! To allow a single network to be able to satisfy both constraints simultaneously, it must have some kind of inhomogeneity in the structure of its classifying cells. One possible inhomogeneity that solves the problem involves the use of different cell "sizes," with larger cells classifying larger patterns and smaller cells classifying smaller patterns. Larger cells dilute their input (both excitatory and inhibitory) to a greater degree than do smaller cells. Thus, they are difficult to turn on, and they respond well only to larger patterns. However, once the larger cells are activated, they are difficult to turn off, and thus they inhibit smaller cells more easily than the reverse. For example, when the word CAR is presented, x_CARGO does not receive enough input to activate, thus allowing x_CAR to activate.
However, when the word CARGO is presented, the node x_CARGO receives enough input to activate, and through unequal competition it can suppress (mask out) the activity of x_CAR.
A second reason to prefer inhomogeneous nodes is called the temporal chunking problem (Grossberg, 1982, 1987b, 1988). Suppose that some pattern ABCD is presented at F^(1). Furthermore, suppose that all the subparts of that pattern already exist as classifications, so that different F^(2) nodes already code the patterns A, B, C, and D. If the F^(2) nodes were homogeneous, then the pattern ABCD would continually be processed as subparts instead of eventually being treated as a unified whole. (A more realistic example occurs when the network should learn the word CARGO, even after it has established categories for CAR and GO.) To prevent this, there must be some mechanism that favors the formation of larger categories.
A second area of nonuniformity in the structure of the classifying field concerns the inhibitory connections within the field (Cohen and Grossberg, 1986, 1987; Marshall, 1990a, 1990b, 1992, 1995; Nigrin, 1990, 1993). In SONNET 1, nodes compete only with other nodes that attempt to classify similar patterns. This nonuniformity increases the power of the network,
as the following example shows. Suppose that the lists AB, CD, and ABC have been learned. (Consider these lists to be abstractions for the spoken words ALL, ALTER, and TURN.) When ABC is presented, x_ABC should activate and x_AB and x_CD should be inhibited. However, when ABCD is presented, the reverse should be true. The list should be segmented as AB and CD, with x_ABC inhibited, since it is not part of the segmentation. This will not happen if the connections are homogeneous. Since x_ABC must activate whenever ABC is presented, it must be true that neither x_AB nor x_CD can individually suppress the activity of x_ABC. When ABCD is presented, only by combining inhibition can x_AB and x_CD possibly mask out x_ABC. However, if the connections are uniform, then x_AB and x_CD will inhibit each other as much as they inhibit x_ABC. Consequently, x_ABC will activate, even for ABCD. To remedy this, F^(2) nodes should inhibit only other nodes in F^(2) that respond to similar patterns, thus allowing multiple smaller nodes to combine and overpower larger ones. In the example above, x_ABC should compete with both x_AB and x_CD, but x_AB and x_CD should not compete with one another. (Another advantage to using nonuniform connections is that it allows a network to be able to classify multiple patterns simultaneously. This is a great advantage when a network is forced to operate in complex, unsegmented environments.)
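The connectivity described in this example can be written down in a few lines of Python. This is only a sketch of which nodes compete, not of how the connections are learned: the helper `inhibition_pairs` is hypothetical, and "similar" is assumed here to mean that two classified patterns share at least one item.

```python
from itertools import combinations


def inhibition_pairs(patterns):
    """Return the unordered pairs of categories assumed to inhibit each other.

    Assumption: two categories count as "similar" when their patterns share
    at least one item.
    """
    return {
        frozenset((p, q))
        for p, q in combinations(patterns, 2)
        if set(p) & set(q)
    }


pairs = inhibition_pairs(["AB", "CD", "ABC"])
for pair in sorted(tuple(sorted(p)) for p in pairs):
    print(pair)
```

Under this assumption the node for ABC competes with the nodes for AB and for CD, while AB and CD share no item and so do not inhibit one another, which leaves them free to combine against ABC when ABCD is presented.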
9
Simulations
This section will present some simulations to illustrate the operation of the network. This will show that the network operates as previously indicated. All the simulation equations and parameters are described in Nigrin (1993). Patterns were presented to the network as described in the introduction. Items were presented one at a time with fixed intensities. Immediately after the presentation of one item was finished, the next item was presented. Thus, the patterns consisted entirely of sequences, and issues involving different rhythms or input intensities were completely avoided. Since Section 6 demonstrated the behavior of F^(1) after it was presented a sequence of inputs, the only thing left to do is examine the behavior of F^(2). This will be done with five simulations. (The reader should note that SONNET 1 was simulated more extensively on static spatial patterns than on temporal patterns. Therefore, the simulations in this section should be treated as preliminary. Furthermore, due to the minimal analysis, it is likely that the network's behavior on temporal patterns could easily be improved.)
1. The first simulation will deal with the STM response of F^(2). It will demonstrate that the network can respond in a real-time fashion to
the presentation of a single list. To ensure that the parameters have not been optimized for different situations, I will compare the network's behavior when it has classified a single list against the network's behavior when it has classified multiple similar lists. This will show how the presence of similar classifications affects the dynamics of the competition. Figure 12 shows the response of x_ABC to the list ABCDEF, when ABC is the only pattern that has been classified by the network. Notice that x_ABC does not fully activate until after the D has been presented. The network response was then tested in the situation where two similar categories existed at F^(2). Figure 13 shows the network response to the list ABCDEF at the categories for both AB and ABC. Notice that earlier in the presentation, x_AB has a competitive advantage. However, this advantage is quickly eliminated once x_ABC sees its full pattern. Notice also that x_ABC takes slightly longer to equilibrate than in the previous case (reaching its full value after ABCDE rather than after ABCD).

FIGURE 12. Response of the cell x_ABC to the list ABCDEF. In this and the next three figures, the labels on the x-axis refer to the time immediately after each item has been presented. Furthermore, the s cells were not reset, either because of a classification or because of saturation. Reprinted with permission from Nigrin (1993).
FIGURE 13. Response of the cells x_AB and x_ABC to the list ABCDEF. Reprinted with permission from Nigrin (1993).

The final test involved the addition of three more categories to see if this would further slow down the processing of the network. As can be seen in Figure 14 and Figure 15, this had little effect on the output of x_ABC.
2. Since the previous simulation showed that the STM dynamics of the network could operate in real time, the remaining simulations tested whether or not the network could learn in real time. This simulation showed that the network could learn a list even when it was embedded within a larger list. The following sequence was presented to the network, where after the last item in the sequence was presented, the first item was presented again. (Thus, immediately after item 23 was presented, item 0 was presented.)

0 1 2  3  4  5  6  7  8  9
0 1 2 10 11 12 13 14 15 16
0 1 2 17 18 19 20 21 22 23
Here, each item is represented by a number instead of a letter, since in later simulations, more items will exist than there are letters in the alphabet. In this simulation, the list (0, 1, 2) occurs embedded in three different contexts. Thus, the network should learn to recognize that
list as a significant category. With the parameters set as described in Nigrin (1993), this occurred in about 10 presentations of the full list (therefore (0, 1, 2) occurred about 30 times). The values for the weights from items 0, 1, and 2 equilibrated to 0.85, 0.46, and 0.24, respectively. After learning had occurred, the F^(1) items 0, 1, and 2 were reset after they were classified. This occurred approximately three items after the list (0, 1, 2) was completely presented. For example, if the list 0 1 2 3 4 5 6 7 8 was presented, the items 0, 1, and 2 were reset while item 5 was being presented. The network was robust in its behavior. The list (0, 1, 2) was reliably learned and those items were reliably reset after they were classified. However, if the presentation of the full list continued, additional information was usually learned. For example, after learning (0, 1, 2), the network might later learn to classify the list (16, 0, 1, 2). This occurred since each longer list occurred in exactly one context. Therefore, since the network was designed to solve the temporal chunking problem, the continued presentation of a longer list could eventually cause the shorter list to be overshadowed.

FIGURE 14. Response of the cells x_AB, x_ABC, x_ACB, x_BCA, and x_ABCG to the list ABCDEF. Reprinted with permission from Nigrin (1993).

3. This simulation demonstrated that multiple lists in a training set
could be learned by the network. The network was presented the following training set in which the lists (0, 1, 2) and (24, 25, 26) were embedded within three contexts:

0 1 2  3  4  5 24 25 26  6  7  8  9
0 1 2 10 11 12 13 24 25 26 14 15 16
0 1 2 17 18 19 24 25 26 20 21 22 23

Both lists were classified by the network in approximately 10 trials. (In one simulation (0, 1, 2) was classified on the 11th trial and (24, 25, 26) was classified on the 10th trial.) Both the learning of the lists and the resetting of F^(1) items were robust. However, just as in the preceding simulation, after these lists were learned, longer lists that contained these shorter lists were classified by the network.

FIGURE 15. Superposition of x_ABC's activity in each of the last three figures. Figure legends (1, 2, and 5) indicate the total number of categories that were present at F^(2) in the different simulation runs. Reprinted with permission from Nigrin (1993).

4. This simulation was very similar to the previous one. It involved the use of exactly the same parameters as in the preceding example, and demonstrated that the network could learn lists of different lengths. The network was presented the following training set in which the lists (0, 1) and (24, 25, 26, 27) were each embedded within four contexts.
0 1  3  4  5 24 25 26 27  6  7  8  9
0 1 10 11 12 13 24 25 26 27 14 15 16
0 1 17 18 19 24 25 26 27 20 21 22 23
0 1 28 29 30 24 25 26 27 31 32 33 34
In this case, the list (0, 1) was classified on the 8th trial, and the list (24, 25, 26, 27) was classified on the 9th trial. While this simulation showed that it was possible for the network to classify lists of different lengths, the network needed more contexts than for the three-item lists. (Even with four contexts, the network occasionally made errors.) This was especially true in the case of the two-item list, since the F^(1) representation of the list was a smaller percentage of the total activity at F^(1) (see discussion in Nigrin (1993), Chapter 3).
5. The final simulation tested to see what the network response would be to the repeated presentation of the following set:
 0  1  2  3  4  5 24 25 26  6  7  8
 0  1  2 10 11 12 24 25 26 14 15 16
 0  1  2 17 18 19 24 25 26 20 21 22
 0  1  2 39 40 41 24 25 26 42 43 44
27 28 29  3  4  5 30 31 32  6  7  8
27 28 29 10 11 12 30 31 32 14 15 16
27 28 29 17 18 19 30 31 32 20 21 22
27 28 29 39 40 41 30 31 32 42 43 44
33 34 35  3  4  5 36 37 38  6  7  8
33 34 35 10 11 12 36 37 38 14 15 16
33 34 35 17 18 19 36 37 38 20 21 22
33 34 35 39 40 41 36 37 38 42 43 44
45 46 47  3  4  5 48 49 50  6  7  8
45 46 47 10 11 12 48 49 50 14 15 16
45 46 47 17 18 19 48 49 50 20 21 22
45 46 47 39 40 41 48 49 50 42 43 44
Close examination of this training set will reveal that it contains 16 three-item lists, each of which occurs in four different contexts. This list was repeatedly presented to the network. By the 16th epoch, the network had learned all the lists that were present in the training set. The network created the following categories: On the 6th epoch, the network learned the lists (0, 1, 2), (24, 25, 26), (14, 15, 16), (45, 46, 47), and (48, 49, 50) (in that order). On the 7th epoch, the network learned (3, 4, 5), (17, 18, 19), and (39, 40, 41). On the 8th epoch, the network learned the lists (30, 31, 32), (27, 28, 29), (6, 7, 8), and (42, 43, 44). On the 9th epoch, the network learned the list (10, 11, 12).
On the 13th epoch, the network learned the list (20, 21, 22). Finally, on the 16th epoch, the network learned the lists (36, 37, 38) and (33, 34, 35). Thus, in this simulation run, the network learned to perfectly segment the list. However, this was not true of all simulation runs. Using the same parameters, sometimes the network correctly classified as few as 13 of the 16 lists in the training set. Of the lists that were incorrectly handled, there were two possibilities. Either the list was not classified at all, or the classification made by the network was erroneous. (For example, the classification (0, 1, 2, 3) would be considered incorrect.) The main reason the network had difficulty was that, while the network could classify multiple lists concurrently, it could learn only one list at a time. Thus, since the list (3, 4, 5) is embedded between (0, 1, 2) and (24, 25, 26), the nodes that classified (0, 1, 2) and (24, 25, 26) could interfere with any node attempting to classify (3, 4, 5). It was for this reason that the training set consisted of lists that were all of the same length. Simulations conducted by replacing some of the lists in the training set with lists of lengths 2 and 4 have shown that variable-length lists could also be learned; however, due to the differences in list size, even more interference was created during learning.
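The structure claimed for the training set of the final simulation can be checked mechanically. The Python sketch below rebuilds one epoch of the 16-row set and counts every window of three consecutive items, wrapping from the last item back to the first as the presentation did. The group names `HEADS`, `FIRSTS`, `MIDS`, and `TAILS` and the helper `trigram_counts` are labels invented for this illustration.

```python
from collections import Counter

HEADS = [(0, 1, 2), (27, 28, 29), (33, 34, 35), (45, 46, 47)]
FIRSTS = [(3, 4, 5), (10, 11, 12), (17, 18, 19), (39, 40, 41)]
MIDS = [(24, 25, 26), (30, 31, 32), (36, 37, 38), (48, 49, 50)]
TAILS = [(6, 7, 8), (14, 15, 16), (20, 21, 22), (42, 43, 44)]

# One epoch: 16 rows, each row = head + first + mid + tail.
stream = [
    item
    for g in range(4)
    for c in range(4)
    for item in HEADS[g] + FIRSTS[c] + MIDS[g] + TAILS[c]
]


def trigram_counts(seq):
    """Count every window of three consecutive items, wrapping around the end."""
    n = len(seq)
    return Counter(
        (seq[i], seq[(i + 1) % n], seq[(i + 2) % n]) for i in range(n)
    )


counts = trigram_counts(stream)
repeated = sorted(t for t, k in counts.items() if k > 1)
print(len(stream), len(repeated), {counts[t] for t in repeated})
```

The repeated windows are exactly the 16 three-item lists, each seen four times per epoch, while every window that straddles two lists occurs only once. That is the sense in which each list appears in four different contexts.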
10
Discussion
This chapter has examined one fundamental issue: designing a neural network that could learn to segment arbitrarily long temporal patterns in real time. To allow the problem to be tackled at all, it was simplified to its bare essentials. A network was presented a continuous stream of input items and was required to learn to segment them into significant chunks. This task was broken down into two subproblems. The first subproblem was to convert a sequence of temporal events into ever-expanding spatial patterns. The second subproblem was to classify the evolving spatial patterns. Thus, the neural network was divided into two fields of cells. The input field F^(1) transformed a sequence of temporal events into a spatial pattern of activity, and the output field F^(2) classified those patterns. (Nigrin (1990, 1993) discussed how the architectures at F^(1) and F^(2) could be combined to allow the addition of extra layers above F^(2).) Within this framework, it became possible to present at least six guidelines for the construction of networks to solve the segmentation problem. (Additional constraints were presented in Nigrin (1990, 1993).) They are as follows: (1) Any transformation of a sequence of events to a spatial pattern at F^(1) should follow the LTM invariance principle. (2) When representing
a sequence of events, a monotonically decreasing pattern is a better choice than a monotonically increasing pattern. (3) One possible method for satisfying the previous two constraints is to use an on-center off-surround architecture at F^(1). (4) Rehearsal should be used to reset the activity of F^(1) nodes whose activity has been classified into a category at F^(2). (5) To allow it to operate in real time, F^(2) should combine its inputs in a nonlinear fashion instead of the traditional dot product rule. (6) The architecture of F^(2) should be nonuniform, to allow it to deal with patterns that are subsets and supersets of one another (context sensitivity). When these guidelines are observed, it becomes possible for a network to learn to segment temporally presented sequences of items. However, significant limitations still exist. One restriction is that items cannot be repeated within the same list. (The network can process the list ABC but not ABA.) Another limitation is that while the network can classify multiple patterns simultaneously, it can learn only a single pattern at a time. These problems are addressed in a network called SONNET 2 (Nigrin, 1990, 1993) by changing the manner by which classifying nodes interact. In current implementations, competition is implemented by having the classifying nodes at F^(2) compete for the right to classify input signals on active input lines. This will be changed in SONNET 2. There, competition will be implemented by having the input lines compete for the right to activate their respective classifying nodes at F^(2). Analysis has indicated that this change will dramatically increase the power of the classifying network.
11
References
Sven Anderson, John Merrill, and Robert Port. 1988. Dynamic speech categorization with recurrent networks. Technical Report 258, Indiana University, Bloomington, IN.
Gail Carpenter and Stephen Grossberg. 1987a. A massively parallel architecture for a self-organizing neural pattern recognition machine. Computer Vision, Graphics, and Image Processing, 37:54-115.
Gail Carpenter and Stephen Grossberg. 1987b. ART 2: Self-organization of stable category recognition codes for analog input patterns. Applied Optics, 26(23):4919-4930.
Michael Cohen and Stephen Grossberg. 1986. Neural dynamics of speech and language coding: Developmental programs, perceptual grouping, and competition for short-term memory. Human Neurobiology, 5(1):1-22.
Michael Cohen and Stephen Grossberg. 1987. Masking fields: A massively parallel neural architecture for learning, recognizing, and predicting multiple groupings of data. Applied Optics, 26:1866-1891.
Garrison W. Cottrell. 1985. Connectionist parsing. In Proceedings of Cognitive Science Society, pp. 201-211.
Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science, 14:179-211.
Jeffrey L. Elman and David Zipser. 1988. Discovering the hidden structure of speech. Journal of the Acoustical Society of America, 83:1615-1626.
Stephen Grossberg. 1973. Contour enhancement, short term memory and constancies in reverberating neural networks. Studies in Applied Mathematics, 52:217-257.
Stephen Grossberg. 1978. A theory of human memory: Self-organization and performance of sensory-motor codes, maps, and plans. In R. Rosen and F. Snell, editors, Progress in Theoretical Biology, Vol. 5, Academic Press, New York.
Stephen Grossberg. 1982. Studies of Mind and Brain: Neural Principles of Learning, Perception, Development, Cognition, and Motor Control. Reidel Press, Boston.
Stephen Grossberg. 1985. The adaptive self-organization of serial order in behavior: Speech, language, and motor control. In E. C. Schwab and H. C. Nusbaum, editors, Pattern Recognition by Humans and Machines, Vol. 1: Speech Perception, Academic Press, New York.
Stephen Grossberg. 1987a. The Adaptive Brain, I: Cognition, Learning, Reinforcement, and Rhythm. Elsevier Science Publishing Company Inc., North Holland, Amsterdam.
Stephen Grossberg. 1987b. The Adaptive Brain, II: Vision, Speech, Language, and Motor Control. Elsevier Science Publishing Company Inc., North Holland, Amsterdam.
Stephen Grossberg. 1988. Neural Networks and Natural Intelligence. MIT Press, Cambridge, MA.
S. J. Hanson and J. Kegl. 1987. Parsnip: A connectionist network that learns natural language grammar from exposure to natural language sentences. Ninth Annual Conference of the Cognitive Science Society, Seattle, Washington. Erlbaum Associates, Hillsdale, NJ.
Michael I. Jordan. 1986. Attractor dynamics and parallelism in a connectionist sequential machine. In Proceedings of the Eighth Annual Conference of the Cognitive Science Society, pp. 531-546, Erlbaum Associates, Hillsdale, NJ.
Jonathan A. Marshall. 1990a. A self-organizing scale-sensitive neural network. In International Joint Conference on Neural Networks, Vol. 3, pp. 649-654, San Diego.
Jonathan A. Marshall. 1990b. Representation of uncertainty in self-organizing neural networks. In International Conference on Neural Networks, pp. 809-812, Paris, France.
Jonathan A. Marshall. 1992. Development of perceptual context-sensitivity in unsupervised neural networks: Parsing, grouping and segmentation. In International Joint Conference on Neural Networks, Vol. 3, pp. 315-320, Baltimore, MD.
Jonathan A. Marshall. 1995. Adaptive perceptual pattern recognition by self-organizing neural networks: Context, uncertainty, multiplicity and scale. Neural Networks, 8:335-362, April 1995.
Risto Miikkulainen and Michael Dyer. 1991. Natural language processing with modular PDP networks and distributed lexicon. Cognitive Science, 15:343-399.
Albert Nigrin. 1990. The Stable Learning of Temporal Patterns with an Adaptive Resonance Circuit. Ph.D. thesis, Duke University.
Albert Nigrin. 1993. Neural Networks for Pattern Recognition. MIT Press, Cambridge, MA.
Terrence J. Sejnowski and C. R. Rosenberg. 1987. Parallel networks that learn to pronounce English text. Complex Systems, 1:145-168.
M. F. St. John and J. L. McClelland. 1990. Learning and applying contextual constraints in sentence comprehension. Artificial Intelligence, 46:217-258.
Chen Sung and W. Jones. 1988. Temporal pattern recognition. In IEEE 1988 International Conference on Neural Networks, I, pp. 689-696.
Chen Sung and W. Jones. 1990. A speech recognition system featuring neural network processing of global lexical features. In IJCNN 1990 Proceedings of the International Conference on Neural Networks, II, pp. 437-440.
284
Nigrin
David W. Tank and John J. Hopfield. 1987. Neural computation by concentrating information in time. Proc. Natl. Acad. Sci. USA, 84:18961900. K.P. Unnikrishnan, J. J. Hopfield, and D. W. Tank. 1992. Speaker - independent recognition using a neural network with time-delayed connections. Neural Computation, 4:108-119. A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. Lang. 1989. Phoneme recognition using time-delay neural networks. IEEE Trans. ASSP 26:328-339.
Chapter 9
On the Use of High-Level Petri Nets in the Modeling of Biological Neural Networks
Kurapati Venkatesh
Abhijit Pandya
Sam Hsu

ABSTRACT High-level Petri nets (HPNs), a class of Petri nets (PNs), are powerful and versatile tools for modeling, simulating, analyzing, designing, and controlling complex asynchronous concurrent systems. Some of the important applications of HPNs can be found in addressing problems related to computer hardware and software and flexible factory automation. In this research, an initial attempt is made to model biological neural networks (BNNs) with HPNs, since the interactions among neurons are basically asynchronous and concurrent in nature. Even though there are many types of HPNs reported in the literature, none have the constructs to model BNNs. Hence a new class of HPNs is proposed in this chapter. With this aim the analogies between BNNs and HPNs are explored. The detailed procedure of PN modeling is elucidated by modeling the mammalian olfactory bulb. By studying the dynamic behavior of Petri net models (PNMs), temporal dynamics and time-varying pattern recognition in BNNs can be investigated. Various timing durations can be associated with the transitions in PNMs to study the temporal characteristics regarding pattern recognition. This is achieved by associating three timing functions with each transition in a PNM.
1 Introduction

Modeling of biological neural networks is paramount for the clear understanding of the functioning of the brain. This is corroborated by the DARPA report [1], which emphasizes that researchers should explore biological models to guide experimental work. Models with these aims have been actively pursued [2, 3]. As the field of neural networks uses theoretical results and insights from many research areas, the DARPA report [1] also emphasizes the need for hierarchical network models that can be easily understood by people of different backgrounds. With these models, interactions among modelers, neurobiologists, engineers, cognitive researchers, physicists, mathematicians, and computer scientists can be encouraged. Current BNN models are understood by specialists and are usually specific to the systems they model [4]. For an overview of these models the reader is referred to the DARPA report [1] and Arbib [2]. General hierarchical models and model frameworks that are easily understood are direly needed for neural system models.

In this chapter, high-level Petri nets (HPNs), a class of Petri nets (PNs), are explored in order to model biological neural networks (BNNs). The model of the system obtained using PNs is called a Petri net model (PNM) of the system. The advantages of using PNs to model BNNs include hierarchical modeling and elegant graphical representation of BNNs. This will result in models that are easily understood by experts in different related areas. Using PNs, various interactions among the neurons in BNNs along with their timing values can be modeled. The resulting PNMs can be simulated and analyzed to study the dynamics of temporal patterns in BNNs.

PNs are claimed to be ideal modeling tools for simulating, analyzing, designing, and controlling complex asynchronous concurrent systems. The details of PN theory can be found in Peterson [5]. Some of the important and diversified applications of PNs can be found in performance analysis of multiprocessor systems [6]; communication protocols [7]; software design [8]; data base design [9]; process control [10]; VLSI design optimization and testing [12]; and flexible factory automation [13, 14, 15, 16, 17]. Surveys on PN applications can be found in Silva and Valette [18] and Venkatesh and Ilyas [19]. As the biological brain is also basically a complex asynchronous system, PNs can be exploited to study the brain. Modeling BNNs with PNs has many advantages, and these are detailed in the later sections of the chapter.

This chapter is organized as follows. In the next section, the fundamentals of PN modeling are briefly presented, and the need for a new class of PNs to model BNNs is discussed. In the third section, modeling of BNNs with HPNs is described by giving analogies between the elements of HPNs and the elements of BNNs. In the fourth section, the new/modified elements added to HPNs are presented. The fifth section is devoted to illustrating the detailed modeling of BNNs with HPNs by considering an example of the olfactory bulb of a rabbit. In the same section, the analysis of the obtained HPN model and the results that can be drawn from the analysis are discussed. Finally, conclusions are presented.
FIGURE 1. Simple assembly cell.
2 Fundamentals of PNs

2.1 Concepts and Terminology of PNs

PNs are graphical and mathematical tools for modeling information and control flow in event-driven systems. A PN has two types of nodes: transitions and places. Directed arcs link places to transitions and transitions to places. Tokens reside in places and are used to describe the state of the system being modeled. The following paragraph illustrates PN concepts by modeling a simple assembly cell before a formal definition of PNs is given.

The assembly cell considered is shown in Figure 1. It contains two part feeders (PF1 and PF2) and a robot (R). Part feeders supply the parts required for assembly, and the robot does all assembly operations. PF2 feeds a part to the empty assembly area automatically. The operational specifications of this system are as follows:
1. To start a cycle, robot (R) and parts must be available.
2. R transfers a part from PF1 to the assembly area and starts assembly.
3. R assembles the parts and transfers the finished product to the output buffer.
FIGURE 2. Petri net model of the assembly cell. Places: Robot ready ($p_1$), PF1 ready ($p_2$), PF2 ready ($p_3$), Assembly in progress ($p_4$), Finished product available ($p_5$). Transitions: Transfer part and start assembly ($t_1$), Finish assembly and transfer to output buffer ($t_2$). Initial marking: (1, 2, 3, 0, 0).
Figure 2 shows the PN model for this assembly cell. Conditions stated in specification 1 are modeled by three places (pictured by circles): R.ready ($p_1$), PF1.ready ($p_2$), and PF2.ready ($p_3$). Putting a token (pictured by a dot) in $p_1$ represents that R is ready; two tokens in $p_2$ mean that there are two parts in PF1; and three tokens in $p_3$ mean that three parts are available in PF2. Specification 2 is modeled by a transition (pictured by a bar): transfer.part.and.start.assembly ($t_1$). Once robot R starts assembly, a new condition results, i.e., assembly.in.progress, modeled by $p_4$. Specification 3 is modeled by finish.assembly.and.transfer.to.output.buffer ($t_2$), a transition. After R finishes assembly and transfers the finished product to the output buffer, two new conditions result. The first one is that R is free again to do the next assembly task. It is modeled by an output arc from $t_2$ to place $p_1$, which deposits a token in $p_1$ when $t_2$ "fires". The second one is that a finished product is ready at the output buffer. This is modeled by another place, finished.product.available ($p_5$).

The distribution of tokens in all places is called a marking of the PN. A marking indicates the status of all system components, called the system's state. It is formally defined as a vector whose components represent the number of tokens in the corresponding places. For example, the initial state of the system for the assembly cell is (1, 2, 3, 0, 0), which models that the robot is ready, PF1 and PF2 contain two and three parts respectively, assembly is not in progress, and there is no finished product in the output buffer. The marking changes when a transition fires, i.e., an event occurs. This results in a new marking according to the rules given later. Sometimes, weights (pictured as labels on the arcs) may also be presented in a PN to facilitate the modeling. If there is no weight on an arc, a unit weight is assumed.

Formally, a PN $Z$ is a five-tuple, $Z = (P, T, I, O, m)$, where
1. $P$ is a finite set of places;
2. $T$ is a finite set of transitions with $P \cup T \neq \emptyset$ and $P \cap T = \emptyset$;
3. $I: P \times T \to N$ is an input function that defines the set of directed arcs from $P$ to $T$, where $N = \{0, 1, 2, \ldots\}$;
4. $O: P \times T \to N$ is an output function that defines the set of directed arcs from $T$ to $P$;
5. $m: P \to N$ is a marking whose $i$th component represents the number of tokens in the $i$th place. An initial marking is denoted by $m_0$.

The execution rules of a PN include enabling and firing rules:
1. A transition $t \in T$ is enabled if and only if $m(p) \geq I(p, t)$, $\forall p \in P$.
2. Enabled in a marking $m$, $t$ fires and results in a new marking $m'$ following the rule

$m'(p) = m(p) + O(p, t) - I(p, t)$, $\forall p \in P$.

The marking $m'$ is said to be reachable from $m$. Given $Z$ and its initial marking $m_0$, the reachability set is the set of all markings reachable from $m_0$ through various sequences of transition firings. Several important PN qualitative properties such as boundedness and liveness that are related to stability and deadlock freeness can be defined, and their implications for system modeling are reported in Peterson [5] and Murata [19].
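The enabling and firing rules above can be sketched in a few lines of Python. This is only a minimal illustration, not code from the chapter: the arc weights are read off the assembly-cell description (unit weights assumed on every arc), and the names `INPUT`, `OUTPUT`, `enabled`, `fire`, and `reachability_set` are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Marking = Tuple[int, ...]  # number of tokens in p1..p5 (indices 0..4)

# INPUT[t][p] plays the role of I(p, t); OUTPUT[t][p] plays the role of O(p, t).
# Assumed structure: t1 consumes from p1, p2, p3 and produces into p4;
# t2 consumes from p4 and produces into p1 and p5. A weight of 0 means "no arc".
INPUT: Dict[str, Marking] = {
    "t1": (1, 1, 1, 0, 0),
    "t2": (0, 0, 0, 1, 0),
}
OUTPUT: Dict[str, Marking] = {
    "t1": (0, 0, 0, 1, 0),
    "t2": (1, 0, 0, 0, 1),
}


def enabled(m: Marking, t: str) -> bool:
    """Enabling rule: m(p) >= I(p, t) for every place p."""
    return all(m[p] >= INPUT[t][p] for p in range(len(m)))


def fire(m: Marking, t: str) -> Marking:
    """Firing rule: m'(p) = m(p) + O(p, t) - I(p, t) for every place p."""
    if not enabled(m, t):
        raise ValueError(f"transition {t} is not enabled in marking {m}")
    return tuple(m[p] + OUTPUT[t][p] - INPUT[t][p] for p in range(len(m)))


def reachability_set(m0: Marking) -> Set[Marking]:
    """All markings reachable from m0 (m0 included); terminates only for bounded nets."""
    seen: Set[Marking] = {m0}
    frontier: List[Marking] = [m0]
    while frontier:
        m = frontier.pop()
        for t in INPUT:
            if enabled(m, t):
                nxt = fire(m, t)
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append(nxt)
    return seen
```

Starting from the initial marking (1, 2, 3, 0, 0), `fire(m0, "t1")` gives (0, 1, 2, 1, 0), and firing `t2` from there gives (1, 1, 2, 0, 1). Because PF1 holds only two parts in this example, the exhaustive search terminates.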
2.2 Timed PNs

When a PN does not model timing information of the operations in the system, it is called an untimed PN, as shown in Figure 2. However, for the quantitative analysis and control of the system, timing information has to be included in a PN. For example, assume that the time to transfer parts and start assembly is one time unit, and the time to do assembly and transfer the finished product is two units. These are modeled by associating one time unit with $t_1$ and two with $t_2$. The time associated with a transition is called the firing duration and is shown on the right-hand side of the transition. When a PN models timing information in the system, it is called a timed PN (TPN). Figure 3 shows various timed PNMs in chronological order.

Formally, a timed PN is a net $Z$ in which each transition is associated with either a deterministic or random firing delay time. Note that the random time delays may follow a general distribution. There are two events for a transition firing, namely, start.firing and end.firing. Between these two events, the firing is in progress. The deposition of tokens to a transition's output place(s) occurs at end.firing. While the firing of a transition is in progress, the time to end firing, called the remaining firing time, decreases from the firing duration to zero, at which point firing of the transition is completed.

Instantaneous description (ID) [17] defines the state of a TPN and is a four-tuple $ID = (m, F, Q, A)$, where:
1. $m$ is a marking;
2. $F$ is a binary selector function, $F: T \to \{0, 1\}$. If $F(t) = 1$, $t$ is enabled; otherwise it is disabled;
3. $Q: T \to R^+$ is the remaining firing time function, where $R^+$ is the set of all positive integers. If $Q(t) = q$, there is $q$ amount of time to complete firing $t$. $Q$ is a cumulatively decreasing time function;
4. $A: T \to R^+$ is the active time function. If $A(t) = q'$, $t$ is said to be active for $q'$ amount of time. $A$ is a cumulatively increasing time function.

ID is useful for the quantitative and behavioral analysis of the system. The importance of ID can be observed from various PNMs shown in Figure 3 corresponding to different times. Consider, for example, Figure 3(a), modeling the assembly cell before starting assembly. At time zero, the initial marking models the initial system state. The F-function models that $t_1$ is enabled; the Q-function models that there is one time unit necessary to finish $t_1$'s firing; the A-function models that there is no transition active. After one time unit, Figure 3(b) shows the assembly cell after assembly starts. Notice that the marking has changed now, indicating that R is not ready because it is doing assembly, PF1 and PF2 contain one and two parts respectively, assembly is in progress, and there is no finished product. The F-function shows that $t_2$ is enabled, the Q-function shows that $t_2$ needs two time units to complete its firing, and the A-function shows that $t_1$ is active for one time unit. After two time units, Figure 3(c) shows the assembly cell during the assembly operation. Observe that the marking has not changed. The F-function shows that no transition is enabled; the Q-function shows that there is one time unit necessary to finish $t_2$'s firing, and the A-function models that both $t_1$ and $t_2$ are active for one time unit. After three time units, Figure 3(d) models the assembly cell after assembly is finished. The new marking indicates that R is ready, PF1 and PF2 contain one and two parts respectively, assembly is not in progress, and there is one finished product. The F-function models that $t_1$ is enabled, the Q-function shows that there is one time unit necessary to finish $t_1$'s firing, and the A-function shows that $t_1$ and $t_2$ are active for one and two time units respectively. The use of ID to analyze BNNs is discussed later, in Section 5.

FIGURE 3. Timed Petri net model of the assembly cell.
(a) Before firing of transition 1 (before assembly started). Time: 0 units; marking: (1, 2, 3, 0, 0); F-function: (1, 0); Q-function: (1, 0); A-function: (0, 0).
(b) After firing of transition 1 (after assembly started). Time: after 1 time unit; marking: (0, 1, 2, 1, 0); F-function: (0, 1); Q-function: (0, 2); A-function: (1, 0).
(c) During firing of transition 2 (during assembly). Time: after 2 time units; marking: (0, 1, 2, 1, 0); F-function: (0, 0); Q-function: (0, 1); A-function: (1, 1).
(d) After firing of transition 2 (after assembly finished). Time: after 3 time units; marking: (1, 1, 2, 0, 1); F-function: (1, 0); Q-function: (1, 0); A-function: (1, 2).
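The walk through the four panels of Figure 3 can be reproduced with a small discrete-time simulation. The sketch below is an illustration under stated assumptions, not the authors' simulator: time advances in unit steps, an enabled transition starts firing immediately, tokens are moved only at end.firing, and conflicts between transitions competing for the same tokens are not handled. The names `simulate`, `DURATION`, `INPUT`, and `OUTPUT` are hypothetical.

```python
from typing import Dict, List, Tuple

Marking = Tuple[int, ...]
Snapshot = Tuple[int, Marking, Tuple[int, ...], Tuple[int, ...], Tuple[int, ...]]

TRANSITIONS = ("t1", "t2")
DURATION: Dict[str, int] = {"t1": 1, "t2": 2}  # firing durations from the text
# Assumed arc structure of the assembly cell, unit weights on every arc.
INPUT: Dict[str, Marking] = {"t1": (1, 1, 1, 0, 0), "t2": (0, 0, 0, 1, 0)}
OUTPUT: Dict[str, Marking] = {"t1": (0, 0, 0, 1, 0), "t2": (1, 0, 0, 0, 1)}


def enabled(m: Marking, t: str) -> bool:
    return all(m[p] >= INPUT[t][p] for p in range(len(m)))


def simulate(m0: Marking, steps: int) -> List[Snapshot]:
    """Return (time, m, F, Q, A) for time = 0..steps."""
    m = m0
    remaining: Dict[str, int] = {}  # transitions in progress -> remaining firing time
    active: Dict[str, int] = {t: 0 for t in TRANSITIONS}
    snapshots: List[Snapshot] = []
    for time in range(steps + 1):
        # F: enabled and not yet firing. Q: remaining time if firing,
        # full duration if merely enabled, 0 otherwise. A: accumulated active time.
        f = tuple(int(t not in remaining and enabled(m, t)) for t in TRANSITIONS)
        q = tuple(
            remaining[t] if t in remaining else (DURATION[t] if enabled(m, t) else 0)
            for t in TRANSITIONS
        )
        a = tuple(active[t] for t in TRANSITIONS)
        snapshots.append((time, m, f, q, a))
        # start.firing for every enabled transition that is not already firing
        for t in TRANSITIONS:
            if t not in remaining and enabled(m, t):
                remaining[t] = DURATION[t]
        # advance one time unit; tokens move only at end.firing
        for t in list(remaining):
            remaining[t] -= 1
            active[t] += 1
            if remaining[t] == 0:
                m = tuple(m[p] + OUTPUT[t][p] - INPUT[t][p] for p in range(len(m)))
                del remaining[t]
    return snapshots
```

Calling `simulate((1, 2, 3, 0, 0), 3)` yields four snapshots whose marking, F, Q, and A values agree with panels (a) through (d) of Figure 3. A simulator with random firing delays or conflict resolution would need more machinery than this sketch provides.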
2.3 High-Level PNs (HPNs)

HPNs are extensions of timed PNs and are capable of modeling complex asynchronous concurrent systems. There are several classes of HPNs reported in the literature. A typical HPN is shown in Figure 4. The additional constructs in this HPN compared to earlier PNs are: (i) colors (associated with tokens), representing different conditions simultaneously; (ii) predicates (conditions embedded in the transitions), representing the conditions to be fulfilled for the occurrence of activities; (iii) labels on some arcs, modeling actions that are to be performed after firing a corresponding transition. The discussion on the theory of these classes of HPNs falls beyond the scope of this chapter. These available classes of HPNs [20, 21, 22, 23, 24, 25] are not powerful enough to model BNNs, as they cannot model some complex functions (explained in subsequent sections) taking place among the elements present in BNNs. Hence a new class of HPNs is needed to address the problem at hand. The Petri net model (PNM) can replicate the biological structure of the brain if the available HPNs are extended with additional constructs: presynaptic cleft, axon transition, dendrite transition, receptor generating an excitatory pulse, and receptor generating an inhibitory pulse.
3 Modeling of Biological Neural Systems with High-Level PNs

Before suggesting the detailed methodology for modeling BNNs with HPNs, it is important to recapitulate the interactions taking place in BNNs. The brain contains over one hundred billion neurons, which perform all of the computational and communication functions within the brain. This is achieved by transmitting the information among the neurons in the form of electrochemical signals (action potentials). Before explaining the events underlying the transmission of the signals in the brain, it is necessary to know about the elements of the neuron and its functions [11]. The neuron consists of three sections: (i) the cell body, (ii) the dendrites, and (iii) the axon, each with separate but complementary functions [26]. Functionally, the dendrites receive the signals from other cells at connection points called synapses. From there, the signals are passed on to the cell body, where they are essentially averaged with other such signals. If the average over a short time interval is sufficiently large, the cell fires, producing a pulse through its axon that is passed on to succeeding cells. Primarily, the axon carries the signal in the form of an action potential. Near its end, the axon has multiple branches, each terminating in a synapse, where the signal is transmitted to the next neuron through a dendrite or in some cases directly to a cell body [26].

FIGURE 4. A typical high-level Petri net.

The new elements/constructs as well as the available ones in HPNs that
are proposed to model BNNs are explained below and summarized in Table 1. From here on, for simplicity, these PN models are called PNMs.

1. Normal places in the PNM can represent cell bodies. In the biological neural networks (BNNs), one of the primary functions of the cell body is to receive the input signals from axons/dendrites and transmit the output signals to other axons/dendrites. Similarly, a place in a PNM receives information from several input arcs and transmits the information to output arcs.

2. Arcs in PNMs replicate the functions of axons/dendrites in BNNs. Axons/dendrites are channels through which information travels. Similarly, arcs in PNMs act as channels for transmitting the information. In the PNM, a normal directed arc models an axon carrying an excitatory pulse, and an arc with a small square at its end (called an inhibitory arc in PN terminology) models an axon carrying an inhibitory pulse.

3. Timed transitions combined with their output arcs in the PNM can model the function of axons/dendrites in BNNs. To be more specific, an output arc from an "axon transition" represents the axon, and an output arc from a "dendrite transition" represents a dendrite. Axons and dendrites connect to each other at synapses. Dendrites send signals to the cell body. Similarly, transitions in the PNM receive signals from their input places. In BNNs, all the incoming signals are averaged at the cell body. But in PNMs, the calculation of the output from a place is carried out in the axon transition. The threshold value that decides the firing of the cell body is modeled as a weight (a standard term for threshold in PN terminology) on the arc connecting the place modeling that cell body and the axon/dendrite transition. For some neurons that do not have axons (e.g., granule cells in the olfactory bulb), this output calculation is done in the dendrite transition. In PNMs, the output signal from this transition cannot be distinguished as either excitatory or inhibitory until it passes through the "predicate transitions" that model the different receptors in BNNs. (The transition is called a predicate transition because it models the receptor that checks whether the transmitter passing through it produces an excitatory pulse or an inhibitory pulse.)

4. In PNMs, weights on arcs represent (i) the threshold value of a cell body, and (ii) the number of input arcs for the place. For example, the threshold value of a cell body can be modeled as a weight on the arc connecting the place (modeling that cell body) and the transition (modeling the dendrite connecting to the cell body). The number of dendrites carrying signals to a cell body can be represented as a weight on the arc that acts as an input arc to the place modeling the cell body. For example, if there are four axons carrying excitatory signals and five axons carrying inhibitory signals to the cell body, two weights, 4 and 5, can be associated with the two input arcs carrying information to the place. Weights on arcs also represent the amount of neurotransmitter released from the presynaptic cleft to the postsynaptic cleft. Modeling the amount of transmitter released is essential, as it decides the changes in electrochemical potential at the postsynaptic membrane, which in turn affects the firing of the neuron.

5. In PNMs, tokens represent both the information and the chemical molecules passing through the axons. In BNNs, there are many types of chemical molecules passing through the axons simultaneously. Tokens with two attributes, the name of the chemical molecule and the amount at a given time, can be used to model all these types of molecules.

6. In PNMs, places represented as concentric circles model presynaptic areas. These places receive the signals from the normal transitions and transmit the signals to their output transitions. Each normal transition representing the dendrite can have many output places, representing many postsynaptic membrane areas.

7. The output transitions of the concentric circles are associated with predicates to model the chemical gates (receptors) for interneuron communication in BNNs. A transition with a "+" sign at its right models a receptor allowing the chemical molecules that produce an excitatory pulse. Similarly, a transition with a "-" sign at its right models a receptor allowing the chemical molecules that produce an inhibitory pulse.

8. Times associated with transitions represent the time delays involved in the interactions among neurons, such as the time required for the axon potential to travel from the neuron to the presynaptic membrane area and the time required for the chemical gate at a receptor to open. Transition firing sequences in the PNM represent the flow of information among the neurons.

TABLE 1. Analogy between HPNs and BNNs. (The Symbol column of the table is graphical and is not reproduced.)
Element in HPNs | Element(s)/implementation parameter(s) in BNNs
Place | (a) Cell body; (b) presynaptic cleft
Transition | (a) Axon generator; (b) dendrite generator; (c) any activity
Normal arc | Axon/dendrite generating excitatory pulse
Inhibited arc from transition | Axon/dendrite generating inhibitory pulse
Weight | (a) Threshold required to fire the neuron; (b) the number of input axons/dendrites for a neuron; (c) the amount of neurotransmitter released from presynaptic cleft to postsynaptic cleft
Token | Chemical molecule flowing or information stored in the neuron
Transition with predicate | Receptors at inter-neuron communication allowing selected type of chemicals to pass through them: (a) chemical gate allowing molecules causing excitatory pulse; (b) chemical gate allowing molecules causing inhibitory pulse
Colors for tokens | Different chemical molecules traveling through the axon
Firing sequence | Neuron firing sequence
Initial marking | Initial state of the BNN
Transition associated with time | Timing duration ($t_j$) of the activity modeled by that transition ($T_j$)
4 New/Modified Elements Added to HPNs to Model BNNs
As discussed in an earlier section, none of the available classes of HPNs have the necessary elements to model BNNs. For example, there are no specific elements to model presynaptic and postsynaptic areas, cell body, threshold required to fire a cell body, etc. In this section, new elements will be added to HPNs, and several other elements will be modified in order to make it possible to model BNNs. The formal description of this new class of HPNs and its specific use are described below.
4.1 New Types of Places

For the sake of discussion, let us assume that places can model both cell bodies and presynaptic areas. But places modeling cell bodies and presynaptic areas have to be distinguished, since the functions of the cell body and the presynaptic areas are different. The cell body is the primary building block of BNNs, at which the signals from other cell bodies are averaged and propagated if the threshold required for that cell body is accumulated. In contrast, the presynaptic area acts as a terminal point of an axon and is mainly involved in communication between neurons. Hence, there should be two different types of places to model cell bodies and presynaptic clefts:

$P = \{P_c, P_{psc}\}$, where $P_c = \{P_1, P_2, P_3, \ldots, P_n\}$ is a set of cell bodies and $P_{psc} = \{P_{psc1}, P_{psc2}, \ldots, P_{pscn}\}$ is a set of presynaptic areas.
4.2 New Types of Transitions

Timed transitions (time associated with transition) and predicate transitions (predicate associated with transition) are important elements in earlier HPNs [9]. However, these transitions alone are not sufficient to model axons and dendrites in BNNs. Hence, two additional types of transitions are proposed. The first type of transition is named the dendrite generator, as the output arcs of such transitions model dendrites. The other type of transition is called the axon generator, as the output arcs of such transitions represent axons. Further, there are some specific actions associated with such transitions. The action to be performed at such a transition is to add all the incoming signals of its parent cell body (the parent cell body for a transition is the cell body for which it is the output transition) and compare the resultant with the threshold required to fire the cell body. If the resultant sum exceeds the threshold required, then the axon generator allows the signal to pass through the axon (the output arc of the axon generator models the axon corresponding to its parent cell body). This summation of incoming signals of the parent cell body usually is done at the axon generator. When a cell body fires, the information flows in the form of an axon potential through the axon. However, there are some specific cell bodies that do not have axons. For example, in the olfactory bulb of the rabbit, granule cells do not have axons. In such cases, the aforementioned summation takes place at the dendrite generator corresponding to the granule cell:

$T = \{T_{AG}, T_{DG}, T_{AT}, T_P\}$, where

$T_{AG} = \{T_{AG1}, T_{AG2}, \ldots, T_{AGo}\}$ is a set of axon generators;

$T_{DG} = \{T_{DG1}, T_{DG2}, \ldots, T_{DGp}\}$ is a set of dendrite generators;

$T_{AT} = \{T_A, t_a\}$, where $T_A = \{T_{A1}, T_{A2}, \ldots, T_{Aq}\}$ is a set of activities modeling flow of information from presynaptic area to postsynaptic area, and $t_a = \{t_{a1}, t_{a2}, \ldots, t_{aq}\}$ is a set of time durations associated with the corresponding activities modeled by $T_A$;

$T_P = \{T_{PE}, T_{PI}\}$ is a set of receptors, where $T_{PE} = \{T_{PE1}, T_{PE2}, \ldots, T_{PEr}\}$ is a set of predicate transitions modeling receptors generating an excitatory pulse, and $T_{PI} = \{T_{PI1}, T_{PI2}, \ldots, T_{PIr}\}$ is a set of predicate transitions modeling receptors generating an inhibitory pulse.
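The summation-and-threshold action attached to an axon generator (or to a dendrite generator for cells without axons) can be sketched as follows. This is a hedged illustration only: the chapter does not specify how excitatory and inhibitory signals are combined, so the sketch assumes that inhibitory signals are subtracted, and the names `Signal`, `net_input`, and `generator_fires` are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Signal:
    amount: float     # magnitude of one incoming signal at the parent cell body
    excitatory: bool  # True for an excitatory pulse, False for an inhibitory one


def net_input(signals: Sequence[Signal]) -> float:
    """Add all incoming signals; inhibitory ones count negatively (an assumption)."""
    return sum(s.amount if s.excitatory else -s.amount for s in signals)


def generator_fires(signals: Sequence[Signal], threshold: float) -> bool:
    """The generator passes the signal on only if the sum exceeds the threshold."""
    return net_input(signals) > threshold
```

For example, two excitatory signals of 0.5 and 0.75 together with one inhibitory signal of 0.25 give a net input of 1.0, so the generator fires for a real-valued threshold of 0.9 but not for a threshold of 1.0, since the sum must exceed the threshold.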
4.3 New Type of Weights

In the conventional HPN, weights are integers. But to model BNNs, weights have to be both integers and real numbers, since they represent either the amount of transmitter released from presynaptic area to postsynaptic area or the threshold required to fire a neuron. For example, the threshold required to fire a mitral cell in the olfactory bulb of a rabbit is a real number. Hence, it is logical to keep such weights as real numbers. Weights also model the number of input neurons of one type to a neuron of another type. For example, in the case of the olfactory bulb of a rabbit, each mitral cell receives information from 200 granule cells. Hence, weights modeling such connections should be integers. Hence, there should be two different types of weights: weights that are real numbers and weights that are integers.

As in the conventional HPN, where arcs transfer information and control among the places, axons and dendrites in BNNs carry information from one neuron to other neuron(s). The normal arc with a solid line represents an axon. An arc modeling an axon/dendrite that generates an excitatory pulse has an arrow at its end. Similarly, an arc modeling an axon/dendrite that generates an inhibitory pulse has a small square at its end:

$I_A: \{P \times T\} \to S$ and $O_A: \{T \times P\} \to S$, where $I_A$ and $O_A$ represent input and output functions that define directed arcs between places and transitions, and $S$ is the set of all values of $K$, where

$K = \{IK, RK\}$ is a set of weights on arcs;

$IK$ is the set of all integers, used to model the number of input axons/dendrites for a cell body;

$RK$ is the set of all real numbers, modeling thresholds required to fire a cell body;

$M$ is the marking of an HPN from set $P$ to $Q$, i.e., $M: P \to Q$, where $M$ assigns tokens with attributes to every place;

$M(p_i)$ indicates the number of tokens in place $p_i$, each with two attributes: (1) name of chemical molecule, and (2) amount of chemical molecule;

$Q = \{0, 1, 2, \ldots\}$, and the attributes of each token are given by a set $TR$ modeling the transmitter information, where

$TR = \{TR_1, TR_2, \ldots, TR_w\}$, where $w$ is the total number of different types of transmitter molecules;

$TR_i = \{NTR_i, ATR_i\}$, where $NTR_i$ is the name of the transmitter molecule and $ATR_i$ is the amount of the transmitter molecule. Note that $ATR_i$ is a positive real number because the amount of transmitter molecule can be represented as a real number.
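A token carrying the two attributes $NTR_i$ and $ATR_i$, together with a marking that assigns such tokens to places, might be represented as in the sketch below. It is an assumption-laden illustration rather than the chapter's data model: the class `Token`, the functions `token_count` and `receptor_output`, the place name `Ppsc1`, and the molecule names `TR1` and `TR2` are all hypothetical placeholders, and treating a receptor predicate as a filter on molecule names is one possible reading of the predicate transitions described earlier.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Token:
    name: str      # NTR_i: name of the transmitter molecule
    amount: float  # ATR_i: amount of the transmitter molecule (positive real)

    def __post_init__(self) -> None:
        if self.amount <= 0:
            raise ValueError("ATR_i must be a positive real number")


# A marking maps each place name to the tokens it currently holds.
Marking = Dict[str, List[Token]]


def token_count(marking: Marking, place: str) -> int:
    """M(p_i): the number of tokens in a place."""
    return len(marking.get(place, []))


def receptor_output(marking: Marking, place: str, allowed: Set[str]) -> float:
    """Total amount of the molecules that a receptor's predicate lets through.

    `allowed` is the hypothetical set of molecule names accepted by the receptor.
    """
    return sum(tok.amount for tok in marking.get(place, []) if tok.name in allowed)
```

With a presynaptic place holding tokens ("TR1", 0.5), ("TR2", 0.25), and ("TR1", 1.0), the token count is 3 and a receptor that accepts only "TR1" passes a total amount of 1.5.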
5 Example of a BNN: The Olfactory Bulb

To illustrate the modeling concepts of HPNs described above, the olfactory bulb of a rabbit given in Shepherd [26] is considered. The neuronal elements of the olfactory bulb are shown in Figure 5. The olfactory bulb is considered in the present investigation because of its distinct laminations and several sharply differentiated neurons. Furthermore, the results of earlier studies show that the olfactory bulb goes significantly beyond the framework of the classical neuronal structure as formulated in the motoneuron model. Hence, it is appropriate to consider modeling the olfactory bulb and then using similar concepts to model other highly organized regions of the brain, such as the retina and the cerebellum. Before illustrating the concepts of modeling BNNs with HPNs, it is logical first to summarize the internal details of the olfactory bulb. In this section, the input and output connections of the olfactory bulb, the different types of neurons present in the olfactory bulb, and the connections among them are described. For a detailed description of the olfactory bulb, see Shepherd [26].
5.1 Inputs

From Figure 5, it can be noted that the afferent (peripheral) input to the olfactory bulb is through the axons of the receptor cells in the olfactory mucosa in the nasal cavity. The olfactory axons enter at the bulb surface and terminate in a layer composed of spherical regions of neuropil, called glomeruli. There are several central inputs to the bulb from the brain. Their sites of origin are indicated in Figure 5. Axons that are relatively large but few in number come from a region at the base of the brain called the diagonal band (DB). Other axons, finer and more numerous, come from the region just posterior to the bulb, the anterior olfactory nucleus (AON). Some of these come from the AON of the same side; others come from the contralateral side through the anterior commissure (AC).

FIGURE 5. Steps for modeling BNNs by HPNs. Neuronal elements of the mammalian olfactory bulb. Layers shown: olfactory mucosa, olfactory nerves, glomeruli, external plexiform layer (EPL), mitral body layer, granule layer. Inputs: afferent fibers (above) from olfactory receptors; central fibers (below) from three sources: centrifugal fibers (C) from the nucleus of the horizontal limb of the diagonal band; ipsilateral fibers from the anterior olfactory nucleus (AON); contralateral fibers from the anterior commissure (AC). Principal neurons: mitral cell (M), with primary (1°) and secondary dendrites (2°) and recurrent axon collaterals (rc); tufted cell (T). Intrinsic neurons: periglomerular short-axon cell (PG); deep short-axon cell (SA); granule cell (Gr). LOT, lateral olfactory tract. (See [26].)
5.2
Principal Neuron
The output from the olfactory bulb is directed centrally and is carried in the axons of mitral cells. Each mitral cell sends an unbranched primary dendrite to a glomerulus, where it terminates in a tuft of branches. Each mitral cell also gives rise to several secondary dendrites, which branch sparingly and terminate in the external plexiform layer (EPL). The mitral cell axons proceed to the depths of the bulb and then run posteriorly to emerge together to form the lateral olfactory tract (LOT). Within the bulb they give off two kinds of collaterals: recurrent collaterals that terminate in the EPL and deep collaterals that terminate in the granule layer (GRL). Smaller versions of mitral cells, called tufted cells, also exist, but they are not considered here because their specific function is not known [26].
5.3
Intrinsic Neurons
There are mainly three types of intrinsic neurons: periglomerular cells, granule cells, and short-axon cells. Surrounding the glomeruli are the intrinsic neurons called periglomerular (PG) cells. Each of these cells has a short, bushy dendritic tree that arborizes within one of the glomeruli. The axon of this cell distributes to neighboring glomeruli, but not to the glomerulus containing the dendritic tree of its parent cell. Below the layer of mitral cell bodies is a thick layer containing the cell bodies of granule cells. Each granule cell has a superficial process that starts and terminates in the EPL. Each granule cell also gives off an inner process that terminates deeper in the granule layer. The outstanding feature of the granule cell is that it lacks a morphological axon. (The implications of this during modeling with HPNs are described later in this section.) The detailed ratios of the principal neurons to intrinsic neurons can be seen in Cotterill [4]. The basic circuit of the olfactory bulb is shown in Figure 6.
5.4
PNM Formulation and Analysis
To illustrate the concepts of PNs described above, the olfactory bulb of a rabbit given in Shepherd [26] is used (Figure 5); its basic circuit is shown in Figure 6. Certain assumptions are made while modeling: (i) for simplicity's sake, as in Shepherd [26], the terminals of dendrites are assumed to be synaptic areas; (ii) the postsynaptic area (the synaptic area on the process that is receiving the signal) is embedded in the output transition of the place modeling the presynaptic area (the synaptic area on the process that is sending the signal). The PNM of the neuronal elements of the mammalian olfactory bulb considered is shown in Figure 7. Table 2 gives the interpretation of the PNM elements in terms of activities in the brain. Figures 5-7 represent the logical steps needed to model BNNs with HPNs. First, the biological structure is converted into a circuit that describes the information flow among cells, and then HPNs are used to model the circuit. In the PNM, each transition represents the activity of information flowing from its input place to its output place. The time duration of this activity is written at the right-hand side of the transition. For example, T1 represents the information flow from PAON1 to PPDM1, and t1 models the time duration for this activity to take place. W represents the amount of transmitter released from the presynaptic area to the postsynaptic area. For example, W1 models the amount of transmitter molecules released from PAON1 to the postsynaptic area of M1. Similarly, certain weights on the output arcs from places modeling neurons represent the threshold required to fire the neuron. For example, TM models the threshold required to fire M1.

FIGURE 6. Steps for modeling BNNs by HPNs. Basic circuit diagram for the mammalian olfactory bulb. (See [26].)

FIGURE 7. Steps for modeling BNNs by HPNs. PNM of the olfactory bulb. Legend: transition modeling a chemical gate allowing molecules that generate an excitatory pulse; transition modeling a chemical gate allowing molecules that generate an inhibitory pulse; axon/dendrite generating an inhibitory pulse.

Element   Interpretation
PAON1     Presynaptic area (P) on the axon of olfactory nerve (ON) 1
PAON2     P on the axon of ON2
PA1C      P on the axon 1 of centrifugal fibre (C)
PPDM1     P on the primary dendrite of mitral cell 1 (M1)
PDPG      P on the dendrite of periglomerular short-axon cell (PG)
M1        Mitral cell 1
PG        Periglomerular short-axon cell
PAPG      P on the axon of PG
M2        Mitral cell 2
PA2C      P on the axon 2 of C
PAAON     P on the axon of anterior olfactory nucleus (AON)
PAAC      P on the axon of anterior commissure (AC)
PSDM1     P on the secondary dendrite of M1
GR        Granule cell (GR)
PD1GR     P on the dendrite 1 of GR
PD2GR     P on the dendrite 2 of GR

Transitions: Each transition represents the activity of information flowing from its input place to its output place. The time duration of this activity is written at the right-hand side of the transition. For example, T1 represents the information flow from PAON1 to PPDM1, and t1 models the time duration for this activity to take place.

Weights:
TM        Threshold required to fire M1
TPG       Threshold required to fire PG
TGR       Threshold required to fire GR
W         Amount of transmitter released from presynaptic area to postsynaptic area. For example, W1 models the amount of transmitter molecules released from PAON1 to the postsynaptic area of M1.

TABLE 2. Interpretation of elements in the PNM shown.
5.5
Token Flow through the Olfactory Bulb PNM
Token flow in a PNM models the information flow in a BNN. To gain insight into the functioning of a PNM, consider the firing of mitral cell 1, modeled by place M1. From olfactory nerve 1 (ON1), the transmitter comes to the presynaptic area on the axon of olfactory nerve 1 (PAON1). When PAON1 receives an amount of transmitter molecules equal to or greater than W1, transition T1 fires. For each mitral cell, there are 1000 afferent axons entering the olfactory bulb. The information flow from each of these axons is exactly the same as described above. Hence, this is modeled by weight 1000 on the output arc of T1. At M1, all the signals corresponding to this information flow from ON through PAON1 are summed. In other words, the ATR1 attribute of the token in M1 after firing T1 is given as (1000*ATR1). Similarly, M1 receives 20 and 200 inhibitory signals from PDPG and GR, respectively. All these signals are summed at M1, and if the level produced by these signals is greater than the threshold required to fire M1 (TM), transition T13 fires. The information flow from M1's axon to the presynaptic area of GR is modeled by the firing of T13. The transfer of the excitatory signal caused by the firing of T13 is modeled by the firing of T14. Note that T14 models a chemical gate allowing transmitter molecules that generate an excitatory pulse. To summarize, the token movement along the path PAON1 → T1 → M1 → T13 → T14 → GR models the information flow from the olfactory nerve (ON) to the granule cell via mitral cell 1. By knowing the values of the 24 attributes of tokens in places, the exact amount of transmitter molecules, the signal status, and the status of the cell body can be determined. Similar explanations can be derived from the PNM for the firing of other neurons in the BNN.
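The summation and threshold test at place M1 can be illustrated with a short Python sketch. It is a simplification under stated assumptions, not the authors' C package: each token is reduced to a single transmitter amount, the class and function names are hypothetical, and the per-token amounts and the threshold value are invented for illustration. Only the fan-in counts 1000, 20, and 200 come from the text.

```python
from dataclasses import dataclass


@dataclass
class SynapticInput:
    """One group of identical inputs converging on a cell body."""
    source: str     # place in the PNM that sends the signal
    count: int      # arc weight: number of parallel signals
    amount: float   # transmitter amount per signal (hypothetical value)
    sign: int       # +1 excitatory, -1 inhibitory


def summed_signal(inputs):
    """Sum all weighted excitatory and inhibitory signals at the cell."""
    return sum(i.sign * i.count * i.amount for i in inputs)


def fires(inputs, threshold):
    """The cell fires when the summed signal exceeds its threshold."""
    return summed_signal(inputs) > threshold


# Fan-in counts follow the text; amounts and TM are assumptions.
m1_inputs = [
    SynapticInput("PAON1", 1000, 0.010, +1),
    SynapticInput("PDPG", 20, 0.050, -1),
    SynapticInput("GR", 200, 0.020, -1),
]
TM = 4.0  # hypothetical threshold required to fire M1

print("summed signal at M1:", summed_signal(m1_inputs))
print("T13 enabled:", fires(m1_inputs, TM))
```

In the full model each token carries many attributes and each transition has an associated firing duration; the sketch keeps only the weighted sum that decides whether T13 is enabled.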
5.6
Information That Can Be Obtained from an HPN Model
The PN approach to analysis of a system consists of two parts: modeling with a PN and analysis of the PNM by either analytical methods or simulation [5, 17, 18, 19]. The latter is applied in this chapter. The analysis of a PN using simulation can potentially be used for the discrete-event control of the system. For the quantitative analysis of PNMs, a software package has been developed in the C language. The following information related to ID (described in Section 2.2) can be obtained from it:
1. Marking of the PNM;
2. Enabled transitions in a marking and conflicts among them;
3. Remaining firing durations of transitions for completion of their firing;
4. Active firing times of transitions, all with respect to real time.
The advantages of PN modeling over other models are (i) for the analysis, the PNMs are easily understandable, as they are graphically elegant; (ii) the PNMs can be both qualitatively and quantitatively analyzed, yielding
invaluable information about the functioning of the brain; and (iii) PNMs give a general framework to express an entire class of related models.

The qualitative information yields properties such as liveness, deadlock, boundedness, and safeness. For example, if the PNM is live, it implies that at a given time at least one transition is ready to fire, which in turn implies that at least one activity in the brain is ready to take place. If there is a deadlock in the PNM, it implies that at some time not even a single transition is enabled to fire, which in turn implies that there is not even a single activity in the brain ready to take place.

Quantitative information yields (i) the number of tokens deposited in each place, representing the stimulus and status present for each neuron. This is very important because, as stated earlier, each token has attributes that describe the chemicals and their quantities that reside at the particular cell body modeled by the corresponding place; (ii) the active times and remaining times of transitions, which represent the state of the activities modeled by the transitions; (iii) the marking of the PNM along with the firing vector, active times, and remaining times of transitions, all with respect to real time, which represent the dynamic behavior of the brain. Dynamic behavior represents various states of the brain with respect to real time using ID, as described in Section 2.2. This can be used to study the characteristics of temporal patterns in BNNs, as there are three timing functions, F, Q, and A, associated with each transition in the PNM.

The time that it takes for a system to come to a stable state is called the system transient time, and the time that a system takes to complete one cycle of its operation is called the system cycle time. The simulation cost and time can be drastically reduced if PNMs of the brain are formulated prior to simulation, since the system transient time and system cycle time can be estimated. These times can be determined using a PNM by checking the attributes of places and transitions in the model with respect to real time. Once the transient time and cycle time of BNNs are determined, the simulation can be stopped and the results of the simulation can be extrapolated for longer durations without actually simulating the system under study. The details of estimating the system transient and system cycle times can be seen in Venkatesh and Ilyas [16] and Venkatesh et al. [17].

Furthermore, using the PNM, the effect of different firing sequences of neurons on the functioning of the brain can be investigated. To investigate the temporal effects of activities in the brain, various timing durations can be associated with transitions to study temporal dynamics. Note that each firing sequence in a PNM results in a unique neuron firing sequence, which in turn results in a particular sensory output. For the analysis of the PNM, a software package is being developed that comprises all the principles of PNs developed here. This package is an extension of the package reported in Venkatesh et al. [17].
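One simple way to read a transient time and a cycle time off a simulation trace is to look for the first state that repeats. The Python sketch below illustrates that idea only; it is not the estimation procedure of [16] or [17]. It assumes a deterministic model sampled at unit time steps whose recorded state (for instance the marking together with the remaining firing times) fully determines the future, and the function name is hypothetical.

```python
def transient_and_cycle(states):
    """Return (transient_time, cycle_time) for a sequence of states.

    states: iterable of hashable snapshots, one per time step.
    Returns None when no snapshot repeats within the trace.
    """
    first_seen = {}
    for step, state in enumerate(states):
        if state in first_seen:
            start = first_seen[state]
            # The system entered the cycle at `start` and needed
            # `step - start` steps to come back to the same state.
            return start, step - start
        first_seen[state] = step
    return None


# Hypothetical three-place trace: one start-up state, then a 2-step cycle.
trace = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (0, 1, 0), (0, 0, 1)]
print(transient_and_cycle(trace))
```

Once such a pair is known, the state at any later step n can be obtained from the recorded trace as the state at step transient + ((n - transient) mod cycle), which is the sense in which simulation results can be extrapolated.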
6
Conclusions
In this chapter, an initial attempt has been made to model BNNs with HPNs. The motivation for using HPNs to model BNNs lies in their efficiency and in the simple framework they provide for expressing a broad set of BNN models. The advantages of PN modeling over other models are (i) for the analysis, the PNMs are easily understandable, as they are graphically elegant; (ii) the PNMs can be both qualitatively and quantitatively analyzed, yielding invaluable information about the functioning of the brain; and (iii) PNMs give a general framework to express an entire class of related models. The analogies between the functioning of HPNs and BNNs have been explored. The formulation of the PNM corresponding to BNNs has been elucidated by modeling the olfactory bulb of a rabbit. The qualitative and quantitative results that can be drawn using the software package to be developed have been presented. By studying the dynamic behavior of PNMs, the recognition of various aspects of temporal patterns in BNNs can be investigated. Various timing durations can be associated with the transitions in PNMs to study temporal dynamics. This is achieved by associating three timing functions with each transition in the PNM. This chapter is intended to stimulate interest among researchers from different backgrounds in applying HPNs to related problems in the area of neural networks. For the class of PNs presented here, there is a need to develop theories that guarantee the well-behaved properties of PNs such as liveness, safeness, and reversibility. Also, the application of PNs for studying different examples of BNNs has to be explored.

Acknowledgments
The authors thank Oren Masory for his involvement in the initial discussions of this paper.
7
References
1. DARPA Neural Network Study, Oct. 1987-Feb. 1988, AFCEA International Press.
2. Arbib, M.A., 1987, Brains, Machines and Mathematics, Springer-Verlag, Berlin.
3. Grossberg, S., and Kuperstein, M., 1986, Neural Dynamics of Adaptive Sensory-Motor Control: Ballistic Eye Movements, Elsevier/North-Holland, Amsterdam.
4. Cotterill, R.M.J., 1988, Computer Simulation in Brain Science, Cambridge University Press, Cambridge, UK.
5. Peterson, J.L., 1989, Petri Net Theory and the Modeling of Systems, Prentice Hall, Englewood Cliffs, NJ.
6. Ajmone M.M., Balbo, G., and Conte, G., 1987, A Class of Generalized Petri Nets for the Performance Evaluation of Multi Processor System, MIT Press, Cambridge, MA.
7. Garg, K., 1985, An approach to performance specification of communication protocols using timed Petri nets, IEEE Transactions on Software Engineering, Vol. SE-11, No. 10, pp. 1216-1225.
8. Mekly, L.J., and Yau, S.S., 1980, Software design representation using abstract process networks, IEEE Transactions on Software Engineering, Vol. SE-6, No. 5, pp. 420-435.
9. Ozsu, M.T., 1985, Modeling and analysis of distributed database concurrency control algorithms using an extended Petri nets formalism, IEEE Transactions on Software Engineering, Vol. SE-11, No. 10.
10. Bruno, G., and Marchetto, M., 1986, Process-translatable Petri nets for the rapid prototyping of process control systems, IEEE Transactions on Software Engineering, Vol. SE-12, No. 2, pp. 346-357.
11. Dayhoff, J.E., 1990, Neural Network Architectures: An Introduction, Van Nostrand Reinhold, New York.
12. Distante, F., 1985, A Petri net matrix approach in VLSI functional testing, Microprocessing and Microprogramming, Vol. 16, Nos. 2-3, p. 194.
13. Venkatesh, K., Chetty, O.V.K., and Ravi Raju, K., 1990a, Simulating flexible automated forming and assembly systems, Journal of Material Processing and Technology, Vol. 24, pp. 453-462.
14. Venkatesh, K., Ravi Raju, K., and Chetty, O.V.K., 1990b, Augmenting the performance of flexible multi robot assembly systems with Petri nets, Proceedings of the International Conference on Automation, Robotics, and Computer Vision, Singapore, pp. 341-345.
15. Venkatesh, K., 1990c, Petri nets: An expeditious tool for simulation, modeling and analysis of flexible multi robot assembly systems, M.Tech. thesis, Indian Institute of Technology, Madras, India.
16. Venkatesh, K., and Ilyas, M., 1993, Modeling, controlling, and simulation of local area networks for flexible manufacturing systems using Petri nets, Computers and Industrial Engineering, Vol. 25, Nos. 1-4, pp. 155-158.
17. Venkatesh, K., Zhou, M.C., Kaighobadi, M., and Caudill, R., 1994, Augmented timed Petri nets for modeling, simulation, and analysis of robotic systems with breakdowns, Journal of Manufacturing Systems, Vol. 13, No. 4, pp. 289-301.
18. Silva, M., and Valette, R., 1990, Petri Nets and Flexible Manufacturing, Advances in Petri Nets, Lecture Notes in Computer Science, Springer-Verlag, Berlin, pp. 37-41.
19. Murata, T., 1989, Petri nets: Properties, analysis and applications, Proceedings of the IEEE, pp. 541-580.
20. Ghezzi, C., Mandrioli, D., Morasca, S., and Pezze, M., 1991, A unified high-level Petri net formalism for time-critical systems, IEEE Transactions on Software Engineering, Vol. SE-17, No. 2, pp. 160-172.
21. Madhavji, H.N., and Schafer, W., 1991, Prism-methodology and process-oriented environment, IEEE Transactions on Software Engineering, Vol. SE-17, No. 2, pp. 127-283.
22. Belli, Fevzi, and Grosspietsch, K.E., 1991, Specification of fault-tolerant system issues by predicate/transition nets and regular expressions--approach and case study, IEEE Transactions on Software Engineering, Vol. SE-17, No. 6, pp. 513-525.
23. Dotan, Y., and Arazi, B., 1991, Using flat concurrent Prolog in system modeling, IEEE Transactions on Software Engineering, Vol. SE-17, No. 6, pp. 493-512.
24. Peterka, C., and Murata, T., 1989, Proof procedure and answer extraction in Petri net model of logic programs, IEEE Transactions on Software Engineering, Vol. SE-15, No. 2, pp. 209-217.
25. Billington, J., Wheeler, R.G., and Wilbur-Ham, C.M., 1988, PROTEAN: A high-level Petri net tool for the specification and verification of communication protocols, IEEE Transactions on Software Engineering, Vol. SE-14, No. 3, pp. 301-316.
26. Shepherd, M.G., 1974, The Synaptic Organization of the Brain: An Introduction, Oxford University, Oxford.
Chapter 10
Locally Recurrent Networks: The Gamma Operator, Properties, and Extensions
Jose C. Principe
Samel Celebi
Bert de Vries
John G. Harris
ABSTRACT Locally recurrent networks have shown great potential for processing time-varying signals. This paper reviews various memory structures for time-varying signal processing with neural networks. In particular, we focus on the gamma structure and variations such as the Laguerre and gamma II memory networks. The paper presents the basic theory of memory structures and several interpretations of their function.
1
Introduction
In engineering and biology interesting patterns are often presented sequentially over time. Therefore, an information processing system needs some kind of short-term memory to store the recent past. The most common connectionist network mechanisms for short-term memory are feedforward delays (as in the Time Delay Neural Network, Lang et al., 1990) and feedback delays (as, for example, in the fully recurrent networks by Williams and Zipser, 1989). Feedforward tapped delay lines are static representations in the sense that the depth (i.e., order, number of taps) and resolution (sampling period) have to be chosen a priori. These parameters should match the characteristics of the signals and the processing goal, which in practice almost always leads to suboptimal design. Note that we should not "over-design," because too many taps lead to spurious signals in the system, which often show up as noise. In particular, for nonstationary signals (almost all real-world signals), the optimal depth and resolution may be time varying, and the fixed representation of tapped delay lines is not optimal. However, the guaranteed stability, simple training algorithms, and conceptual simplicity still make tapped delay lines the favorite memory mechanism in neural networks. In unrestricted recurrent (feedback) networks, stability is hard to control. Moreover, feedback connections introduce local minima in the error surface, which leads to poor training convergence. When feedback is restricted to the local processing units, which in turn are connected as a feedforward network, we may hope to combine some of the attractive properties of feedforward and feedback structures. In connectionist circles, several studies have appeared that investigate the properties and applicability of locally recurrent networks (Back and Tsoi, 1995). In this paper we present a summarizing review of a number of publications on the gamma locally recurrent network. We will start by analyzing the different types of memory structures and provide basic definitions. The gamma memory will then be introduced as a special case of the generalized feedforward filters, and training algorithms will be provided. Section 4 presents some applications of the gamma memory, while in Section 5 several important interpretations of the gamma memory are reviewed. In Sections 6 and 7, Laguerre and gamma II memories are presented. Finally, some important extensions to the basic gamma memory topology are presented and conclusions drawn.
2
Linear Finite Dimensional Memory Structures
In de Vries and Principe (1992), a general linear delay mechanism was defined by the convolution model as

$$y(t) = \int_{0}^{t} w(t - s)\, x(s)\, ds \tag{1}$$

in continuous time and as

$$y(n) = \sum_{k=0}^{n} w(n - k)\, x(k) \tag{2}$$

in the discrete time domain; $x(n)$ is an input signal, $w(n)$ a filter response, and $y(n)$ a memory trace. We will call $w(n)$ a memory filter if $w(n)$ is causal and normalized in the sense that $\sum_{n=0}^{\infty} |w(n)| = 1$ (de Vries and Principe, 1992). If $x(n)$ and $w(n)$ are vector signals, $w(n - k)\, x(k)$ should be read as the inner product $\sum_{i} w_i(n - k)\, x_i(k)$. If $w(n)$ is the impulse response of a finite-dimensional linear system, then it can be implemented as an autoregressive moving average (ARMA) structure. The ARMA memory can be written as

$$y(n) = \sum_{l=1}^{N} a_l\, y(n - l) + \sum_{m=0}^{M} b_m\, x(n - m). \tag{3}$$

FIGURE 1. The ARMA memory filter, with input $x(n)$ and output $y(n)$. The light shaded area is the feedforward filter, and the dark shaded area the leaky integrator.
The ARMA system in (3) has $M + N + 1$ free weights. The appropriate values for the memory filter parameters $W = \{a_l, b_m\}$ depend on the characteristics of $x(n)$ and the processing goal. An implementation of the ARMA memory is shown in Figure 1. As mentioned before, the ARMA memory can be unstable for particular choices of parameters $W$. Training of ARMA models is also a nontrivial problem. As a result, in practice it is common to use simpler structures such as the tapped delay line (a.k.a. transversal filter), which can be written as

$$y(n) = \sum_{m=0}^{M} b_m\, x(n - m), \tag{4}$$

and the leaky integrator (a.k.a. context unit, memory neuron), which evaluates to

$$y(n) = a\, y(n - 1) + x(n). \tag{5}$$
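The two simple memories can be compared directly on an impulse input. The following Python sketch is only an illustration of the tapped delay line and the leaky integrator recursions; the function names are ours, and zero initial conditions are assumed.

```python
def tapped_delay_line(x, b):
    """y(n) = sum_m b[m] * x(n - m), taking x(n) = 0 for n < 0."""
    return [
        sum(b[m] * x[n - m] for m in range(len(b)) if n - m >= 0)
        for n in range(len(x))
    ]


def leaky_integrator(x, a):
    """y(n) = a * y(n - 1) + x(n), taking y(-1) = 0."""
    y, previous = [], 0.0
    for sample in x:
        previous = a * previous + sample
        y.append(previous)
    return y


impulse = [1.0] + [0.0] * 7
# The delay line forgets the impulse after len(b) samples ...
print(tapped_delay_line(impulse, [0.5, 0.3, 0.2]))
# ... while the leaky integrator keeps an exponentially fading trace.
print(leaky_integrator(impulse, 0.5))
```

The finite response of the first and the geometric decay of the second are the two behaviors that the gamma memory interpolates between.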
The tapped delay line and leaky integrator are outlined by a lightly shaded and a more darkly shaded area, respectively, in Figure 1. The tapped delay line memory filter is stable for all real values of $b_m$, and the leaky integrator is stable for $|a| < 1$. In Principe et al. (1992), another memory filter is introduced, the generalized feedforward filter (GFF). In the GFF, the tap impulse response can be recursively computed from the previous tap by

$$g_k(n) = g(n) * g_{k-1}(n), \quad \text{for } k \geq 1, \tag{6}$$

where $*$ is the convolution operation and $g_0(n)$ is either a delta function or another predefined operator. In the $z$-transform domain, the same equation reads

$$G_k(z) = G(z)\, G_{k-1}(z). \tag{7}$$

The memory traces in the GFF are computed by

$$x_k(n) = g(n) * x_{k-1}(n) \tag{8}$$

and the filter output by $y(n) = \sum_{k=0}^{K} w_k\, x_k(n)$. The generalized feedforward filter is shown in Figure 2. In the generalized feedforward filter, the memory operator (kernel) $g(n)$ is unspecified. Clearly, by inspection of Figure 2, if $g(n)$ is stable, then the GFF is stable. Also, as we will outline in Section 3, the filter weights $\{w_k\}$ can be adapted by standard feedforward adaptation algorithms such as least mean squares or recursive least squares. The advantage of using generalized feedforward structures for memory filters is that $g(n)$ may have an adaptable parameter set, and consequently, the memory traces can be optimized with respect to a performance criterion. Note that this freedom does not exist in the case of the regular tapped delay line filter.

FIGURE 2. The generalized feedforward filter.

The gamma memory filter is a special case of the generalized feedforward filter, where

$$g(n) = \mu (1 - \mu)^{n-1}, \quad n \geq 1, \tag{9}$$

and $g_0(n) = \delta(n)$. For the $k$th tap, the impulse response evaluates to

$$g_k(n) = \binom{n-1}{k-1} \mu^{k} (1 - \mu)^{n-k}, \quad n \geq k. \tag{10}$$
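The closed form for the tap impulse responses can be checked numerically against the recursive definition by repeated convolution. The Python sketch below does this for one value of μ; the helper names are hypothetical, and the kernels are truncated to a finite number of samples, which is an approximation for the normalization check only.

```python
from math import comb


def gamma_kernel(k, mu, length):
    """Closed form: g_k(n) = C(n-1, k-1) mu^k (1-mu)^(n-k) for n >= k, else 0."""
    return [
        comb(n - 1, k - 1) * mu**k * (1 - mu) ** (n - k) if n >= k else 0.0
        for n in range(length)
    ]


def convolve(a, b):
    """Causal convolution of two sequences, truncated to len(a) samples."""
    return [sum(a[m] * b[n - m] for m in range(n + 1)) for n in range(len(a))]


mu, length = 0.5, 200
g1 = gamma_kernel(1, mu, length)

# Build g_3 recursively as g * (g * g) and compare with the closed form.
g3_recursive = convolve(g1, convolve(g1, g1))
g3_closed = gamma_kernel(3, mu, length)
max_gap = max(abs(r - c) for r, c in zip(g3_recursive, g3_closed))

print("largest difference:", max_gap)
print("sum of g_3 over the window:", sum(g3_closed))
```

The second printed value being close to one reflects the normalization required of a memory filter.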
FIGURE 3. The gamma memory, also called the gamma delay line. Reprinted with permission from Principe et al., 1994.

The functions $g_k(n)$ happen to be discrete versions of the integrands of the gamma function (de Vries and Principe, 1992). They are complete in $L_2$ space (i.e., one can approximate a finite energy signal arbitrarily closely as a weighted sum of these functions). An interesting property of this family is that the time axis is scaled by the parameter $\mu$, which means that there is a change in time scale from the input to the memory traces. As we will see, the parameter $\mu$ can be adapted to minimize the mean square output error, thus finding an optimal time scale (w.r.t. MSE) to represent the input signal (or signals in hidden layers if gamma filters are applied to hidden nodes). In the $z$-domain, the gamma kernel becomes

$$G(z) = \frac{\mu}{z - (1 - \mu)}. \tag{11}$$

Figure 3 shows the gamma memory structure and its characteristics. As can be seen from Figure 3, when $K = 1$, the gamma memory reduces to the leaky integrator, and when $\mu = 1$, the gamma memory becomes the tapped delay line. So the gamma memory unifies the tapped delay line and the leaky integrator into a single parametrized structure. In fact, the gamma memory is a delay line made of leaky integrators. When the outputs of the gamma memory are linearly combined, we obtain a gamma filter (Principe and de Vries, 1992). The describing equations for the gamma filter are

$$\begin{aligned} x_0(n) &= u(n), \\ x_k(n) &= (1 - \mu)\, x_k(n - 1) + \mu\, x_{k-1}(n - 1) \quad \text{for } k = 1, 2, \ldots, K, \\ y(n) &= \sum_{k=0}^{K} w_k\, x_k(n). \end{aligned} \tag{12}$$
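The gamma filter recursion translates almost line by line into code. The Python sketch below is a minimal illustration, assuming zero initial tap values and fixed weights and μ; the function name is ours.

```python
def gamma_filter(u, w, mu):
    """Run the gamma filter over the input sequence u.

    w holds the K + 1 tap weights w_0 .. w_K; mu is the memory parameter.
    Tap 0 is the current input; tap k mixes its own previous value with
    the previous value of tap k - 1.
    """
    K = len(w) - 1
    x_prev = [0.0] * (K + 1)
    y = []
    for sample in u:
        x = [sample] + [
            (1 - mu) * x_prev[k] + mu * x_prev[k - 1] for k in range(1, K + 1)
        ]
        y.append(sum(wk * xk for wk, xk in zip(w, x)))
        x_prev = x
    return y


impulse = [1.0] + [0.0] * 7
# mu = 1: each tap is a pure one-sample delay (tapped delay line).
print(gamma_filter(impulse, [0.5, 0.3, 0.2], mu=1.0))
# K = 1, reading only tap 1: a delayed, scaled leaky integrator.
print(gamma_filter(impulse, [0.0, 1.0], mu=0.5))
```

With μ = 1 the impulse response is just the weight vector, and with K = 1 the response of tap 1 is the kernel μ(1 − μ)^(n−1), which are the two limiting cases described in the text.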
The gamma filter is a generalization of the linear combiner, and when the weights are adapted to minimize the output mean square error, it extends the Adaline (de Vries et al., 1991). The gamma filter is also a generalization of the FIR synapse as defined by Wan (1994) and is a building block for gamma locally recurrent networks. Sometimes it is useful to write (12) as a state space model, as

$$\begin{aligned} \mathbf{x}_n &= [I + \mu\Lambda]\, \mathbf{x}_{n-1} + \mathbf{u}_n, \\ y(n) &= \mathbf{w}_n^{T} \mathbf{x}_n, \end{aligned} \tag{13}$$

where $\mathbf{x}_n = [x_0(n), x_1(n), \ldots, x_K(n)]^{T}$ is the state vector, $\mathbf{u}_n = [u(n), 0, \ldots, 0]^{T}$ is the input vector,

$$\Lambda = \begin{bmatrix} 0 & 0 & 0 & \cdots & 0 \\ 1 & -1 & 0 & \cdots & 0 \\ 0 & 1 & -1 & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & 1 & -1 \end{bmatrix}$$

is a state transition signature matrix, $\mathbf{w}_n = [w_0(n), w_1(n), \ldots, w_K(n)]^{T}$ is the filter weight vector, and $y(n)$ is the filter output. The state equation can be compacted to $\mathbf{x}_n = A\, \mathbf{x}_{n-1} + \mathbf{u}_n$, where $A = I + \mu\Lambda$. Hence, the gamma filter can be written as a linear model in state space. In nonlinear neural networks it is common to take

$$y(n) = \sigma\!\left(\mathbf{w}_n^{T} \mathbf{x}_n + b(n)\right), \tag{14}$$

where $\sigma(\cdot)$ is a nonlinear squashing function and $b$ a bias term. A gamma locally recurrent neural net, then, is a nonrecurrent (no loops) circuit of nonlinear gamma filters where $y(n)$ is computed by (14).
2.1
Analysis of Depth and Resolution
It is interesting to compare gamma delay lines with regular tapped delay lines and leaky integrators with respect to their properties as memory devices. First, let us quantify the notion of memory depth. As a convenient measure of memory depth for a $K$th-order gamma memory we take the first moment (mean value) of the last ($K$th) delay kernel in the memory. Such a measure can be interpreted as the mean sampling time for the last tap. The mean memory depth $D$ for the $K$th-order memory is thus defined as

$$D = \sum_{n=0}^{\infty} n\, g_K(n) = Z\{n\, g_K(n)\}\Big|_{z=1} = -z \frac{dG_K(z)}{dz}\bigg|_{z=1} = \frac{K}{\mu}. \tag{15}$$

Next we define the (temporal) resolution $R$ of the memory as the number of parameters of freedom (i.e., the number of tap variables) per unit of time in the memory. This is equivalent to the number of taps ($K$) divided by the mean memory depth $D$. Thus

$$R = \frac{K}{D} = \mu. \tag{16}$$

Clearly, there is a resolution versus memory depth trade-off in a linear memory structure for fixed order $K$. Such a trade-off is not possible in a nondispersive tapped delay line, since the fixed choice of $\mu = 1$ sets the depth and resolution to $D = K$ and $R = 1$, respectively. However, in the gamma memory, depth and resolution can be adapted by variation of $\mu$. The choice $\mu = 1$ represents a memory structure with maximal resolution and minimal depth. In this case, the order $K$ and depth $D$ of the memory are equal. Thus, when $\mu = 1$, the number of weights equals the memory depth. Very often this coupling leads to overfitting of the data set (using parameters to model the noise). Hence, the parameter $\mu$ provides a means to uncouple the memory order and depth.
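The value $K/\mu$ for the mean memory depth can be verified in a few lines. The derivation below is a sketch that assumes the $K$th tap transfer function is the $K$th power of the gamma kernel, $G_K(z) = G(z)^K$, as the cascade of identical stages implies, and uses $Z\{n\, g(n)\} = -z\, dG(z)/dz$ for $G(z) = \sum_n g(n) z^{-n}$.

```latex
\begin{aligned}
G_K(z) &= \left(\frac{\mu}{z-(1-\mu)}\right)^{K},\\[4pt]
\frac{dG_K(z)}{dz} &= -K\,\mu^{K}\,\bigl(z-(1-\mu)\bigr)^{-(K+1)},\\[4pt]
D = -z\,\frac{dG_K(z)}{dz}\bigg|_{z=1}
  &= K\,\mu^{K}\,\mu^{-(K+1)} = \frac{K}{\mu},
\qquad R = \frac{K}{D} = \mu .
\end{aligned}
```

At $z = 1$ the denominator $z - (1-\mu)$ equals $\mu$, which is what collapses the expression to $K/\mu$.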
3
The Gamma Neural Network
Equation (14) represents the input-output map of a nonlinear processing element (PE), which we call the gamma PE (Figure 4). The input is fed to the gamma memory and the taps linearly combined to produce the PE output. This PE can be considered an extension of the well-known McCulloch-Pitts (M-P) neuron model. The M-P PE is a static model of the neuron, while the gamma PE includes a dynamic component modeling the integration over time that is known to occur at the dendritic tree. The gamma neural network is any multilayer feedforward interconnection of nonlinear PEs and gamma PEs. When the gamma PE is restricted to the input layer, the topology is called the focused gamma neural network.
FIGURE 4. The gamma PE and a one hidden layer gamma network.
3.1
Training the Gamma Network
Training of a gamma neural network consists of adaptation of $\mathbf{w}$ and $\mu$. If we use gradient descent learning, the general procedures known as real time recurrent learning (RTRL, Williams and Zipser, 1989) and backpropagation through time (BPTT, Werbos, 1990) can be applied without restrictions to update both $\mathbf{w}$ and $\mu$. As discussed in de Vries and Principe (1992), we assume that the network consists of a feedforward circuit of the filters described by

$$\begin{aligned} \mathbf{x}_n &= [I + \mu\Lambda]\, \mathbf{x}_{n-1} + \mathbf{u}_n, \\ y(n) &= \sigma\!\left(\mathbf{w}_n^{T} \mathbf{x}_n + b(n)\right). \end{aligned} \tag{17}$$

Assume that either BPTT or RTRL has been used to compute an estimated error $e(n)$ for the output $y(n)$ of (17). Defining the local cost as $E(n) = \frac{1}{2} e^{2}(n)$, we can compute the gradients with respect to the weights as follows:

$$\frac{\partial E(n)}{\partial \mathbf{w}} = -\mathbf{x}_n\, e(n)\, \sigma'(net(n)), \tag{18}$$

where we assumed $\frac{\partial e(n)}{\partial y(n)} = -1$ and defined $net(n) = \mathbf{w}_n^{T} \mathbf{x}_n + b(n)$. For $\mu$, we derive

$$\begin{aligned} \frac{\partial E(n)}{\partial \mu(n)} &= -\mathbf{w}_n^{T} \mathbf{p}_n\, e(n)\, \sigma'(net(n)), \\ \mathbf{p}_n &= [I + \mu(n)\Lambda]\, \mathbf{p}_{n-1} + \Lambda\, \mathbf{x}_{n-1}, \end{aligned} \tag{19}$$

where we defined $\mathbf{p} = \frac{\partial \mathbf{x}}{\partial \mu}$. It is clear that for intricate topologies of gamma memories, the update equations for $\mathbf{w}$ and $\mu$ become cumbersome to derive on paper. However, in practice these derivations are not necessary. Nowadays object-oriented simulation environments take care of the derivation of the update equations when the dual topology concept is utilized to implement the update equations. The user only has to specify the local "forward" equations and the global connectivity pattern of the network. For example, the NeuroSolutions package implements arbitrarily connected gamma neural nets (also globally recurrent) as a standard option (Lefebvre and Principe, 1993; NeuroSolutions, 1994). Since the parameter $\mu$ is part of a feedback loop, we found that the mean squared error performance surface has several minima as a function of $\mu$. This sometimes leads to convergence to a local minimum when gradient descent is used. Figure 5 shows an example of the performance curve computed analytically with Mathematica for the identification of a third-order elliptic filter with the gamma filter.

FIGURE 5. Normalized mean square error (mse) as a function of $\mu$ for different orders $K$. Reprinted with permission from Principe et al., 1993. © 1993 IEEE.
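Differentiating the tap recursion $x_k(n) = (1-\mu)x_k(n-1) + \mu x_{k-1}(n-1)$ with respect to $\mu$ gives a sensitivity recursion that can run alongside the forward pass. The Python sketch below is an illustration only and not the NeuroSolutions implementation: the function name is hypothetical, $\mu$ is held constant over the run, the taps start at zero, and the result is checked against a central finite difference.

```python
def gamma_states_and_sensitivities(u, K, mu):
    """Return the final taps x_k and sensitivities p_k = d x_k / d mu."""
    x = [0.0] * (K + 1)
    p = [0.0] * (K + 1)
    for sample in u:
        new_x = [sample]
        new_p = [0.0]  # tap 0 is the raw input, independent of mu
        for k in range(1, K + 1):
            new_x.append((1 - mu) * x[k] + mu * x[k - 1])
            # Product rule applied to the line above.
            new_p.append((1 - mu) * p[k] + mu * p[k - 1] + x[k - 1] - x[k])
        x, p = new_x, new_p
    return x, p


u = [1.0, -0.5, 0.25, 2.0, 0.0, 1.5]
K, mu, h = 3, 0.6, 1e-6

_, p = gamma_states_and_sensitivities(u, K, mu)
x_plus, _ = gamma_states_and_sensitivities(u, K, mu + h)
x_minus, _ = gamma_states_and_sensitivities(u, K, mu - h)
numeric = [(a - b) / (2 * h) for a, b in zip(x_plus, x_minus)]

print("recursive:", p)
print("numeric:  ", numeric)
```

For a linear output unit, the gradient of the local cost with respect to $\mu$ is then minus the error times the weighted sum of these sensitivities; a nonlinear unit adds the factor $\sigma'(net(n))$.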
FIGURE 6. Adaptation of $\mu$ for different memory sizes (5-tap gamma memory: final $\mu$ = 0.53905; 3-tap gamma memory: final $\mu$ = 0.306825).
4
Applications of the Gamma Memory
4.1
Control of Memory Depth in an Identification Problem
To get some intuition on how a recurrent memory differs from a tapped delay line, we present the following problem. We wish to construct a dynamic neural network that will double the frequency of an input sinusoid. The neural network consists of a focused gamma network with one hidden layer, and with an input layer formed by a gamma delay line with 5 taps. We use 2 tanh nodes in the hidden layer and 1 linear output node. The input signal was a sine wave with 40 samples per period, and the target was another sine wave with 20 samples per period. Backpropagation through time over 80 samples is used to adapt all the weights, including the parameter $\mu$ of the gamma memory. This simulation was carried out using NeuroSolutions. Figure 6 shows the $\mu$ track while the dynamic net is adapting. Notice that the $\mu$ parameter starts at 1 (the default value, which corresponds to the tapped delay line). The value decreases to 0.54 in 150 iterations, yielding a mean memory depth of about 9 samples ($D = K/\mu$) and a final MSE of 0.0005. This means that with 5 taps the system is actually processing information corresponding to 9 samples, which is beyond the 5-tap limit. This memory depth was found through adaptation of $\mu$. Next we reduced the size of the gamma memory from 5 taps to 3 taps, while keeping the network architecture and task the same. This time $\mu$ converged to 0.3, yielding a mean depth of about 10 samples and a final MSE of 0.0007, which is the optimal value. In both cases, the mean memory depth converged to about the same value. Apparently, the adaptive system was able to compensate for the smaller number of taps by decreasing the value of the parameter $\mu$, thus achieving a similar memory depth. The memory resolution for the 3-tap system is worse than for the 5-tap system. A regular tapped delay line filter ($\mu = 1$) with 5 taps converges to an MSE of 0.09, and a 3-tap filter never solves this problem.
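The memory depths quoted in this experiment can be reproduced with a few lines of code. The following is a minimal sketch, assuming the gamma recursion $x_k(n) = (1-\mu)x_k(n-1) + \mu x_{k-1}(n-1)$ from de Vries and Principe (1992) and taking the mean memory depth to be the first moment of the impulse response of the last tap. The function names are illustrative only.

```python
import numpy as np

def gamma_memory(x, K, mu):
    """Return the tap signals x_0..x_K (one row per tap) of a gamma memory.

    Assumed recursion: x_0(n) = x(n),
    x_k(n) = (1 - mu) * x_k(n-1) + mu * x_{k-1}(n-1) for k = 1..K.
    """
    x = np.asarray(x, dtype=float)
    taps = np.zeros((K + 1, len(x)))
    taps[0] = x
    for n in range(1, len(x)):
        taps[1:, n] = (1.0 - mu) * taps[1:, n - 1] + mu * taps[:-1, n - 1]
    return taps

def mean_memory_depth(K, mu):
    """Closed-form mean memory depth D = K / mu, in samples."""
    return K / mu

def measured_depth(K, mu, length=4000):
    """First moment of the impulse response of tap K (numerical check of K / mu)."""
    impulse = np.zeros(length)
    impulse[0] = 1.0
    g_K = gamma_memory(impulse, K, mu)[K]
    n = np.arange(length)
    return float(np.sum(n * g_K) / np.sum(g_K))
```

With these definitions, `mean_memory_depth(5, 0.54)` is about 9.3 samples and `mean_memory_depth(3, 0.3)` is 10 samples, in line with the values reported above. Setting `mu = 1.0` turns every tap into a pure one-sample delay, which recovers the tapped delay line.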
4.2
Linear-Time Warping Control with the Gamma Memory
We performed the following experiment to show that the gamma memory can compensate automatically for linear time warping. We adapted $\mathbf{w}$ and set $\mu = 0.5$ for a gamma filter with 4 taps. The input signal was white Gaussian noise and the target a low-passed version of this signal. After convergence, the $\mathbf{w}$ vector was fixed. In the second phase of the experiment, the input signal was decimated so as to mimic a linear warping of the time axis. We want to find out whether the gamma filter can compensate for time warping by adjusting $\mu$ while $\mathbf{w}$ remains unchanged. Figure 7 shows a graph of the value of $\mu$ found through adaptation for eight decimation (and interpolation) ratios between 0.5 and 2. As we can expect from the memory depth equation (17), the recursive parameter is linearly related to the time scale. This relation is experimentally demonstrated in Figure 7: $\mu$ changed from the initial value of 0.5 to the range 0.2 to 1.1, in a linear fashion as expected. In conclusion, the $\mu$ parameter, if continuously adapted, can compensate for time warping.
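The linear relation can be made explicit with a short worked derivation. It is only a sketch under two assumptions that the text does not state in this form: the mean memory depth of a $K$-tap gamma memory is $D = K/\mu$ samples (the relation used in the previous experiment), and a warping ratio $r > 1$ means that the time axis is compressed by a factor $r$, so that the same stretch of signal occupies $D/r$ samples.

```latex
D = \frac{K}{\mu}, \qquad
D' = \frac{D}{r} = \frac{K}{\mu'}
\;\Longrightarrow\;
\mu' = r\,\mu .
% Example with \mu = 0.5 and K fixed:
%   r = 0.5  gives  \mu' = 0.25
%   r = 2    gives  \mu' = 1.0
```

Under these assumptions the predicted range for ratios between 0.5 and 2 is 0.25 to 1.0, which is of the same order as the range 0.2 to 1.1 found through adaptation.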
4.3
Other Applications
When utilized as a linear adaptive filter, the gamma filter extends Widrow's Adaline (de Vries et al., 1991), and results in a more efficient filter for echo cancellation (Palkar and Principe, 1994), system identification (Motter and Principe, 1994; Tsoi and Back, 1994), and nonlinear prediction (Mozer, 1994). Preliminary results with the gamma memory in isolated word recognition also showed that the performance of the system improved when $\mu$ is different from 1 (i.e., when it is not the tapped delay line) (Principe and Tracey, 1994). Renals (1994) also showed that the gamma memory can be used advantageously as the front end of hidden Markov models for speech recognition. The gamma memory has also been utilized in noise reduction applications to stop the training of nonlinear predictors before the noise distorted the dynamics (Kuo and Principe, 1994a, 1994b), and in a new embedding of time series for nonlinear dynamical analysis where it would reduce noise and select the appropriate time delay for the reconstruction (Kuo and Principe, 1993).
FIGURE 7. Input signal, value of $\mu$ to compensate warping, and Adaline $\mu$ used in experiment.
5
Interpretations of the Gamma Memory
One of the open issues is to choose the best set for a memory basis, for a given application (Back and Tsoi, 1995). Without a methodology to select a memory kernel, it seems very important to formulate the function of the memory from different points of view to guide the selection of the basis. We have investigated the representation provided by the gamma basis during adaptation, as a state space embedding, a representation in terms of Taylor series, and a multiscale interpretation. Hopefully, the knowledge of the input signal can be cast in one of these frameworks and will help the designer select one memory kernel versus another. The discussion will be restricted to the gamma memory, but it can be extended to the other memories presented later in this chapter.
5.1
Vector Space Interpretation of the Gamma Filter Adaptation
The vector space interpretation (3), where a signal $x(n)$ is approximated by a weighted sum of signals $x_k(n)$, is presented in full detail in Principe et al. (1994) and will be briefly reviewed here. These signals are the basis of the vector space. Let us present the most familiar connectionist memories in this framework. The context unit represents a projection of the large-dimensionality input signal $x(n)$ onto a single basis function, which is the convolution of (10) with the input. As can be expected, this representation compromises the information preserved in the memory trace $x(n)$. Changing $\mu$ to minimize the output MSE means that one is finding the best projection of $x(n)$ onto a single basis vector, i.e., onto a line. This representation is appropriate when one wants long memories but low resolution. Likewise, in the tapped delay line, we are projecting $x(n)$ onto a memory space that is uniquely determined by the input signal; i.e., once the input signal $x(n)$ is set, the axes become $x(n-k)$ and the only degree of freedom is the memory order $K$. This memory structure has the highest resolution but lacks versatility, since one can improve the input signal representation only by increasing the order of the memory. In terms of versatility, the simple context unit (or any memory with a recursive parameter) is better, since the neural system can adapt the parameter $\mu$ to better project the input signal. The memory depth is changed without changing the topology. We recently proved that the gamma basis in continuous time represents a rigid memory space, even when the parameter $\mu$ is changed to minimize the output mean square error (Celebi and Principe, 1995). This means that the relative angle among the basis vectors does not change with $\mu$. Hence, a decrease in the error must be associated with a decrease in the relative angle between the input signal and the projection space. So at least for the case of a white noise input, the recursive parameter in the gamma structure changes the span of the memory space with respect to the input signal (which can be visualized as a relative rotation between the input signal and the projection space). In terms of time domain analysis, the recursive parameter finds the length of the time window (the memory depth) containing the relevant information to decrease the output mean square error.
FIGURE 8. $\mu$ changes the relative position of the manifold to the signal vector.
FIGURE 9. State space reconstruction from the memory outputs.
5.2
State Space Interpretation of the Gamma Memory
Let us for a moment shift our attention from the time series to the dynamical system that produces it. The system state will change with time, describing a trajectory in a multidimensional space called the state space. This time evolution defines the dynamical system. How can we reconstruct some of the properties of this state space evolution from the time series? Takens (1981) proved that some properties (dynamic invariants) can be preserved if the time series is embedded into a sufficiently large reconstruction space ($N$, the size of the space, should be at least $2D+1$, where $D$ is the number of degrees of freedom of the dynamical system (Whitney, 1936)). He proposed that the point coordinates of the reconstructed trajectory be read as $N$-tuples from the time series. For instance, for a reconstruction in a 3-D space, consecutive 3-tuples of the time series should be read together, with the first time series sample being the x coordinate of the first point in the reconstruction space, the 2nd sample the y coordinate, and the 3rd sample the z coordinate; the second point is then formed in the same way from the 2nd, 3rd, and 4th samples, etc. (Figure 9). When the points of the reconstructed space are connected, a trajectory is found from which properties of the original dynamical system can be estimated (such as dimension and Lyapunov exponents). This means that many of the important dynamical properties of the original signal are preserved in this reconstruction. What is interesting is that the Takens embedding is naturally implemented by generalized feedforward structures. In fact, a TDNN with 3 taps provides $x(n)$, $x(n-1)$, and $x(n-2)$, which are exactly the coordinates needed to reconstruct the points of the trajectory in the reconstruction space. Sauer et al. (1991) generalized the Takens embedding for linear filters, which correspond to the memory filter class. We have also proposed an embedding based on the gamma memory kernel (Kuo and Principe, 1994b). This alternative view of memory structures is very enlightening, because it shows that the function of the memory at the first layer of a neural network is to provide the representation space to reconstruct the time evolution of the state of the system that produced the time series. Then the nonlinear PEs in the hidden layers extract the relationships that characterize the dynamics. This architecture has been extensively explored in time series modeling with neural networks (Lapedes and Farber, 1987). Even if the memory is placed at the hidden layers, the interpretation is the same: reconstruction of the dynamics in the projection space constructed by the nonlinear PEs. Invoking topological arguments by Takens and Whitney, this methodology suggests setting the size $N$ of the memory at twice the size of the dimension $D$ of the dynamical system that produced the time series. The dimension of the dynamical system can be estimated from the time series using, for instance, the correlation dimension algorithm (Grassberger and Procaccia, 1983). Unfortunately, this is not as straightforward as it may seem because in Takens' embedding theorem there is an unspecified parameter, the delay parameter $\tau$. Experimentally it was verified that the quality of the reconstructed trajectories varies tremendously with the choice of the delay parameter (Albano et al., 1987; Theiler, 1990). There are experimental methods to determine $\tau$ that are based on the linear and nonlinear correlation time of the input signal (Fraser and Swinney, 1986). In our previous example, $\tau = 1$, which is normally a very poor choice. For other values of $\tau$, the memory depth should span the time interval
$$N = (2D+1)\,\tau. \qquad (20)$$
This required depth may produce very large input layers, which produce very large networks. Although this is the time span where there are meaningful correlations, using the arguments of signal reconstruction, the first hidden layer PEs only need to receive $2D+1$ connections from the input layer. So, one should use uniformly sparse connections in the layer connecting the input taps to the first hidden layer PE (i.e., the nonzero connections should be taken $\tau$ samples apart). One can use larger reconstruction spaces, i.e., more than $N$ connections, but this would complicate the learning unnecessarily. We have proposed a method to determine the product $D \times \tau$ experimentally (Kuo and Principe, 1994b). Recursive memory structures also provide a natural method to embed experimental time series. Note that the time delay between the taps in the gamma memory filter is controlled independently by $\mu$. At the same time, a soft lowpass filtering is performed that can filter unwanted noise without affecting the dynamics. So we can select the size of the gamma memory equal to $N$, and let $\mu$ select the best $\tau$. The interesting question is how to find a training paradigm that adapts $\mu$ to the $\tau$ value.
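The delay-coordinate reconstruction described in this section is easy to state in code. The sketch below is only an illustration with hypothetical function names. It builds the points $[x(n), x(n-\tau), \ldots, x(n-(\mathrm{dim}-1)\tau)]$ from a scalar time series and evaluates the time span of (20).

```python
import numpy as np

def delay_embed(x, dim, tau):
    """Delay-coordinate embedding of a scalar time series.

    Row i of the result is the point
    [x(n), x(n - tau), ..., x(n - (dim - 1) * tau)] with n = (dim - 1) * tau + i.
    """
    x = np.asarray(x, dtype=float)
    span = (dim - 1) * tau
    if len(x) <= span:
        raise ValueError("time series too short for this embedding")
    n = np.arange(span, len(x))
    return np.stack([x[n - j * tau] for j in range(dim)], axis=1)

def embedding_dimension(D):
    """Size of the reconstruction space suggested in the text: 2D + 1."""
    return 2 * D + 1

def memory_span(D, tau):
    """Time interval of (20), N = (2D + 1) * tau, in samples."""
    return (2 * D + 1) * tau
```

For `tau = 1` and `dim = 3` the rows are exactly the outputs $x(n)$, $x(n-1)$, $x(n-2)$ of a 3-tap delay line. For larger `tau` the same points are obtained by reading only every `tau`-th tap, which corresponds to the uniformly sparse connections mentioned above.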
5.3
Gamma Memories as Implementation of Taylor Series
In this section we will bring a new interpretation to the contents of the gamma memory by relating them to the input signal spectrum. We will show how the information at the taps of this memory can be used as an alternative time-frequency representation. The memory traces can be regarded as the moments of the input signal. In fact, an alternative name used in the control and signal processing community for the contents of the gamma memory is the Poisson moments (Saha, 1982). Let us expand the convolution sum of (6) using the definition of the gamma kernel given in (15) as
$$x_k(n) = \mu^k \sum_{m=-\infty}^{\infty} x_w(m,n) \binom{m-1}{k-1}, \qquad (21)$$
where $x_w(m,n)$ is the delayed, inverted, and windowed version of the input signal $x(n)$, formulated as
$$x_w(m,n) = x(n-m)\,(1-\mu)^{m-k}\,u(m-k)\,u(n-m), \qquad (22)$$
where $u(n)$ is the unit step. The term $(1-\mu)^{m-k}u(m-k)$ can be regarded as a decaying window whose effective width is controlled by the parameter $\mu$. This window is responsible for emphasizing the recent past of the signal and thereby achieving locality in time, which we think is vital in processing time-varying signals. In (22) there is also a rectangular window due to the terms $u(n-m)u(m-k)$, but its effect can be safely ignored for $n-k$ larger than the decaying window width. As far as the magnitude spectrum goes (which is generally the main concern in speech recognition problems, for example), delay and sign inversion have no effect on the signal spectrum. The decaying window, however, has the side effect of blurring the spectrum of the original signal. Hence, $|X_w(e^{j\omega}, n)|$ can be seen as a low resolution approximation to the recent magnitude spectrum $|X(e^{j\omega})|$ of the original signal. Having established the relationship between the magnitude spectra of $x_w(m,n)$ and $x(n)$, let us go back to (21), which may be rewritten as
$$x_k(n) = \frac{\mu^k}{(k-1)!} \sum_{m=-\infty}^{\infty} x_w(m,n)\,\big[(m-1)(m-2)\cdots(m-k+1)\big]. \qquad (23)$$
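The windowed-sum form of the memory traces can be compared against a recursive computation. The sketch below rests on assumptions that should be checked against the definitions earlier in the chapter: the gamma kernel is taken as $g_k(m) = \binom{m-1}{k-1}\mu^k(1-\mu)^{m-k}$ for $m \ge k$, the recursion is $x_k(n) = (1-\mu)x_k(n-1) + \mu x_{k-1}(n-1)$, and the input is zero before $n = 0$. The function names are hypothetical.

```python
import math
import numpy as np

def gamma_taps_recursive(x, K, mu):
    """Tap signals x_0..x_K from the (assumed) gamma recursion."""
    x = np.asarray(x, dtype=float)
    taps = np.zeros((K + 1, len(x)))
    taps[0] = x
    for n in range(1, len(x)):
        taps[1:, n] = (1.0 - mu) * taps[1:, n - 1] + mu * taps[:-1, n - 1]
    return taps

def gamma_tap_windowed_sum(x, k, mu, n):
    """x_k(n) as mu^k * sum_m C(m-1, k-1) * x_w(m, n).

    x_w(m, n) = x(n - m) * (1 - mu)^(m - k) for k <= m <= n, and 0 otherwise;
    the two unit steps in the window are expressed by the loop limits.
    """
    total = 0.0
    for m in range(k, n + 1):
        windowed = x[n - m] * (1.0 - mu) ** (m - k)
        total += math.comb(m - 1, k - 1) * windowed
    return mu ** k * total
```

Both routes give the same value for any tap and time index, so the recursion can be used in practice while the windowed sum is kept for interpretation.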
Differentiating the z-transform of xv^(m,n) iteratively gives Xk{n) =
(24)
(k-iy.dzf^-^
2=1
Here, the term Xw{z,n) can be interpreted as a short-term z-transform that takes into account only the recent values of a:(n). Expanding (24) and writing it in matrix form, one gets ^ xi{n) =
ADXw{z,n)\^^^,
(25)
Xk(n)
1 -1 1
0 -1 2
0 1 2
0
• • •
0 0 0
A =
0 a*i
O-kk
^*{-l)*+l,
m = 1,
Ckn
r=m—1 ^
^ ^
1)!
m > 1.
Examining (25), the memory traces $x_k(n)$ can be recognized as the linear sum of the Taylor series coefficients of $X_w(z,n)$ when the series expansion is done at zero frequency. In other words, the contents of the gamma memory correspond to the derivatives of the spectrum at $z = 1$. Hence, they may be used to estimate the recent magnitude spectrum of the original signal near zero frequency. In that sense, the memory traces can be regarded as a cost-effective time-frequency representation. However, one should be wary of the finite region of support of the Taylor approximation. At frequencies away from the zero frequency, the Taylor series approximation diverges rapidly. For that reason, the memory traces make only a local representation of the spectrum. Representations at other frequencies can be obtained either by frequency shifting the signal spectrum or by employing gamma II structures (gamma memory with complex $\mu$) or bandpass filters (Principe and Tracey, 1994). With this approach, the entire magnitude spectrum can be represented as a vector of memory traces that are obtained by concatenating the outputs of several gamma II structures, each tuned to a different band. As an example of the power of this representation, we display the spectrograms of the word "greasy" from the TIMIT database (sa1.wav) obtained by the conventional short-term Fourier transform (STFT) technique and by a piecewise polynomial approximation that uses the input memory traces as its coefficients. In obtaining the memory traces we used 16 gamma II filters, each with 4 taps and tuned to frequencies in the range 0 to 4 kHz. The proposed technique preserves the main features of the conventional spectrogram method. Hence, the memory traces are time-frequency representations of temporal patterns. Whenever a gamma memory is used in the neural net, it will represent the spectrum of its input around zero frequency, with each tap estimating a higher derivative of the Taylor series of the spectral band coefficient. This feature is particularly appealing for speech recognition, where researchers have shown the importance of using the derivative of the spectral coefficients for good performance.
FIGURE 10. Comparison of the FFT with the moment reconstructed power spectrum.
¹ If a continuous-time gamma memory were used, matrix A would reduce to the identity matrix.
5.4
Gamma Memories as Multiresolution Representations
Another view of the gamma memory is as a multiresolution representation in the tap order domain and in the delay parameter ($\mu$) domain. Multiresolution representations such as wavelets perform a complete decomposition in a scale-translation domain, i.e., the signal information is preserved in the coefficients of the wavelet representation. An alternative to the translation domain is the delay domain of the gamma basis, because it effectively spans the time axis just as the translation operation does, although with a change in waveshape. The different waveshapes produce a better approximation of the signal closer to the present time, which is recommended for on-line operation. The gamma memory has basically two parameters: $\mu$, which controls the delay scale, and $k$, the tap order. Up to now we sought to adapt the delay scale $\mu$ to best represent the signal of interest. An alternate approach is to consider $\mu$ as a parameter space that can be discretized and where several different versions of the input signal will be projected. Together with $k$, the tap order, these versions will constitute a multiresolution representation in delay and tap order. The interesting question is: can we represent signals with the gamma memory kernel without loss of information in this multiresolution representation? If one can show that this multiresolution representation is a wavelet, then the substantial body of mathematics guarantees that the representation is complete. We will present the development in continuous time. The gamma kernel in continuous time reads (de Vries and Principe, 1992)
$$g_k(t) = \frac{\mu^k}{(k-1)!}\, t^{k-1} e^{-\mu t}, \qquad k = 1, \ldots, K \quad (t \ge 0). \qquad (26)$$
Since we are seeking a representation that is parametric in the delay, we will drop the dependence of $g_k(t)$ on $\mu$; i.e., we are defining a multiresolution generating function given by
$$g_k(t) = \frac{1}{(k-1)!}\, t^{k-1} e^{-t}, \qquad k = 1, \ldots, K. \qquad (27)$$
Due to the shape of the gamma kernels, which are all positive, we can immediately say by the admissibility condition (Daubechies, 1992) that (27) is not a wavelet. The admissibility condition requires that
$$\lim_{W \to \infty} \int_{-W}^{W} \frac{|\hat{g}_k(\omega)|^2}{|\omega|}\, d\omega < \infty. \qquad (28)$$
So the memory arrangement has to be slightly modified. The generating function for the multiresolution representation is obtained by computing the difference of consecutive memory taps
$$\gamma_k(t) = g_k(t) - g_{k-1}(t), \qquad (29)$$
which means that
$$\gamma_k(t) = \frac{t-(k-1)}{(k-1)!}\, t^{k-2} e^{-t}, \qquad k = 1, \ldots, K. \qquad (30)$$
One can show that $\gamma_k(t)$ now obeys the admissibility condition (28) for a wavelet basis. This wavelet is implemented by discretizing the $\mu$ parameter to lead to a bank of modified gamma memories according to (29), in which the multiresolution signals are the difference of two consecutive gamma memory taps. There are several parallel memory structures with $K+1$ taps but different $\mu$'s (Figure 11).
The importance of this view for neural networks is the following. Instead of adapting the time scale $\mu$ as we presently do, an alternative approach is to utilize multiple memory structures with fixed $\mu$'s. This scheme also preserves the information of the input, since it is a wavelet representation. It may seem a waste of memory blocks, but in classification of temporal patterns where the adaptation of $\mu$ is nontrivial, this arrangement circumvents the problem. We are currently exploring this representation in practical problems. This prefixed arrangement, in fact, resembles the method proposed by Hopfield for speech recognition (Tank and Hopfield, 1987), except that our functions are recursively computed.
FIGURE 11. A wavelet representation made of a parallel bank of gamma memories with fixed but different $\mu$'s.
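A necessary ingredient of the admissibility condition is that the generating function has zero mean, so that its spectrum vanishes at zero frequency. The sketch below checks this numerically. It is not taken from the chapter: it assumes the continuous-time kernel $g_k(t) = \mu^k t^{k-1} e^{-\mu t}/(k-1)!$ and uses a plain trapezoidal rule. The function names are illustrative.

```python
import math
import numpy as np

def gamma_kernel(t, k, mu):
    """Continuous-time gamma kernel (assumed form) for t >= 0 and k >= 1."""
    return mu ** k * t ** (k - 1) * np.exp(-mu * t) / math.factorial(k - 1)

def tap_difference(t, k, mu):
    """Difference of two consecutive kernels, g_k(t) - g_{k-1}(t), for k >= 2."""
    return gamma_kernel(t, k, mu) - gamma_kernel(t, k - 1, mu)

def integrate(f, t):
    """Trapezoidal rule on the sample grid t."""
    return float(np.sum((f[1:] + f[:-1]) * np.diff(t)) / 2.0)
```

Each gamma kernel is nonnegative with unit area, so its spectrum does not vanish at zero frequency. The difference of two consecutive kernels integrates to zero, which removes that obstacle. Zero mean alone is a necessary condition and not a full proof of admissibility.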
6
Laguerre and Gamma II Memories
There are many other linear filters fitting our definition of generalized feedforward filters that can be used as memory structures for neural networks. A recent paper by Back and Tsoi (1995) presents several variations of locally recurrent networks with applications. We present here Laguerre and gamma II filters, two structures that are closely related to gamma structures.
FIGURE 12. The block diagram for the Laguerre memory. Reprinted with permission from Principe et al., 1994.
6.1
Laguerre Memories
Laguerre functions are intimately related to the gamma function structures. In the z-domain, the $i$th Laguerre function is defined by
$$L_i(z,\mu) = \sqrt{1-(1-\mu)^2}\;\frac{\left(z^{-1}-(1-\mu)\right)^{i-1}}{\left(1-(1-\mu)z^{-1}\right)^{i}}, \qquad (31)$$
which can be decomposed as a low-pass filter $G_0(z) = \sqrt{1-(1-\mu)^2}\,/\,\left(1-(1-\mu)z^{-1}\right)$, followed by a cascade of $i-1$ similar all-pass sections $G(z) = \left(z^{-1}-(1-\mu)\right)/\left(1-(1-\mu)z^{-1}\right)$. A block diagram of the Laguerre memory is shown in Figure 12. Like the gamma memory, the Laguerre memory has one free parameter $\mu$. In fact, it can be shown that a Gram-Schmidt orthogonalization of the gamma memory leads to the Laguerre memory (Celebi and Principe, 1995). Figure 13 compares the Laguerre and the gamma kernels. The Laguerre kernels are orthogonal. As a result, for uncorrelated (white) input signals, the tap signals are also uncorrelated in the Laguerre filter. This is not the case for gamma filters, and therefore, in general, the filter weights adapt somewhat faster for Laguerre filters, depending on the input signal correlation matrix (Silva et al., 1994). Gamma kernels, on the other hand, use fewer computational resources (half the number of additions) than the Laguerre filter for the same filter order. Also, whereas traditionally Laguerre filters have been used with fixed $\mu$ (e.g., Wahlberg, 1991), the gamma filter framework has introduced LMS-adaptive $\mu$ to the Laguerre filters.
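The orthogonality of the Laguerre kernels, and the lack of it for the gamma kernels, can be verified numerically. The sketch below assumes the standard discrete Laguerre structure with pole $a = 1-\mu$: a low-pass section $\sqrt{1-a^2}/(1-az^{-1})$ followed by all-pass sections $(z^{-1}-a)/(1-az^{-1})$. For the gamma kernels it assumes the delay operator $\mu z^{-1}/(1-(1-\mu)z^{-1})$. The function names are hypothetical.

```python
import numpy as np

def first_order(x, a, b0, b1):
    """First-order IIR section: y(n) = a*y(n-1) + b0*x(n) + b1*x(n-1)."""
    y = np.zeros(len(x))
    prev_x = prev_y = 0.0
    for n, xn in enumerate(x):
        y[n] = a * prev_y + b0 * xn + b1 * prev_x
        prev_x, prev_y = xn, y[n]
    return y

def laguerre_kernels(K, mu, length=2000):
    """Impulse responses of the first K Laguerre functions, one per row."""
    a = 1.0 - mu
    impulse = np.zeros(length)
    impulse[0] = 1.0
    tap = first_order(impulse, a, np.sqrt(1.0 - a * a), 0.0)  # low-pass section
    kernels = [tap]
    for _ in range(K - 1):
        tap = first_order(tap, a, -a, 1.0)                     # all-pass section
        kernels.append(tap)
    return np.array(kernels)

def gamma_kernels(K, mu, length=2000):
    """Impulse responses of gamma taps 1..K, one per row."""
    a = 1.0 - mu
    tap = np.zeros(length)
    tap[0] = 1.0
    kernels = []
    for _ in range(K):
        tap = first_order(tap, a, 0.0, mu)                     # gamma delay operator
        kernels.append(tap)
    return np.array(kernels)
```

The Gram matrix `L @ L.T` of the Laguerre kernels comes out as the identity to numerical precision, while the Gram matrix of the gamma kernels has clearly nonzero off-diagonal entries. This is one way to see why tap signals driven by white noise are uncorrelated in the Laguerre case only.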
FIGURE 13. Gamma and Laguerre kernels.
6.2
Gamma II Memories
The gamma memory has a multiple pole that can be adaptively moved along the real axis of the z domain; i.e., the gamma memory can implement only low-pass ($0 < \mu < 1$) or high-pass ($1 < \mu < 2$) transfer functions. For some applications, however, a resonant memory structure that favors storage of a certain frequency band is desired. As an example, if we want to store the output of a filter bank, a common preprocessor in speech classification, a resonant memory structure seems adequate. Resonating circuits can be constructed by filters of minimal second order. The gamma II structure, which is shown in Figure 14, adheres to the philosophy of the generalized feedforward filter. Thus the memory parameters $\nu$ and $\mu$ are global, and the feedback is local between taps. As a result, LMS for the gamma II filter also scales by $O(K)$, and the stability conditions remain trivial. Next, a few properties of the gamma II memory element are derived. The transfer function $G_{II}(z)$ of this structure evaluates to
$$G_{II}(z) = \frac{(1+\nu)\,G(z)}{1+\nu\,G^2(z)} = \frac{\mu(1+\nu)\,[z-(1-\mu)]}{[z-(1-\mu)]^2+\mu\nu}. \qquad (32)$$
Thus $G_{II}(z)$ has a zero at $z_0 = 1-\mu$ and poles at $z_p = (1-\mu) \pm j\sqrt{\mu\nu}$. The forward gain factor $1+\nu$ ensures the normalization of the gamma II delay element. Similar to the gamma I operator, we have
$$\sum_{n=0}^{\infty} g_{II,k}(n) = G_{II,k}(z)\big|_{z=1} = [G_{II}(1)]^k = 1. \qquad (33)$$
In order to derive the stability condition, we assume $\mu > 0$ and $\nu > 0$. Then the system is stable if
333
Gamma II I domain u(l) ® -»ojt>©
-V
KH Gu(z)
m
gi(t)
.
Delay operator:
gK^Hr
^l[^-(l-^l)]
[z-(l-n)]^ + V FIGURE 14. The gamma II memory structure. G(z) represents the gamma kernel. Reprinted with permission from Principe et al, 1994.
$$(1-\mu)^2 + \mu\nu < 1. \qquad (34)$$
Equation (34) can be reduced to $\mu(\mu+\nu-2) < 0$, and together with $\mu > 0$ it follows that sufficient conditions for stability are given by
$$(0 < \mu < 1) \wedge (0 < \nu < 1). \qquad (35)$$
The gamma II can be related to a broader class of functions called the Kautz functions (Kautz, 1954).
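The stability condition can be checked by computing the poles directly. The sketch below assumes the pole polynomial $[z-(1-\mu)]^2 + \mu\nu$ implied by condition (34), and it relies on `numpy.roots` for the roots. The function names are hypothetical.

```python
import numpy as np

def gamma2_poles(mu, nu):
    """Roots of [z - (1 - mu)]^2 + mu*nu = z^2 - 2(1 - mu) z + (1 - mu)^2 + mu*nu."""
    a = 1.0 - mu
    return np.roots([1.0, -2.0 * a, a * a + mu * nu])

def is_stable(mu, nu):
    """True when both poles lie strictly inside the unit circle."""
    return bool(np.all(np.abs(gamma2_poles(mu, nu)) < 1.0))

def condition_34(mu, nu):
    """Stability condition (34): (1 - mu)^2 + mu*nu < 1."""
    return (1.0 - mu) ** 2 + mu * nu < 1.0
```

For positive $\mu$ and $\nu$ the poles form a complex conjugate pair with squared magnitude $(1-\mu)^2 + \mu\nu$, so the two tests agree. A pair such as $\mu = 1.2$, $\nu = 0.5$ is stable although it lies outside the box in (35), which illustrates that (35) is sufficient but not necessary.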
6.3
Two-Dimensional Gamma Kernels
The goal of the gamma memory was to create a signal processing structure that would display a variable memory depth for a fixed number of stages (taps), unlike the finite impulse response (FIR) filter. This was accomplished by introducing a local feedback loop around the delay operator, creating a generalized feedforward filter structure. The concept of a time warping parameter extrapolates to the spatial domain as a scale parameter that controls the region of support of the corresponding two-dimensional structure, which we call the 2-D gamma kernel. The 2-D gamma kernel is defined as
$$g_{k,\mu}(n_1,n_2) = C\, g_{k,\mu}(t)\big|_{t=\sqrt{n_1^2+n_2^2}}, \qquad (36)$$
where the constant $C$ is a normalization factor. The resulting 2-D gamma kernels have circularly symmetric shapes given by
$$g_{k,\mu}(n_1,n_2) = \frac{\mu^{k+1}}{2\pi k!}\left(\sqrt{n_1^2+n_2^2}\right)^{k-1} e^{-\mu\sqrt{n_1^2+n_2^2}}, \qquad \Omega = \{(n_1,n_2);\ -N \le n_1, n_2 \le N\}, \qquad (37)$$
where $\Omega$ is the region of support of the kernel, $k$ the kernel order, and $\mu$ the parameter that controls the shape and scale of the kernel. Figure 15 depicts the characteristic of 2-D gamma kernels in the spatial domain. The 1st order ($k = 1$) kernel has its peak at the pivot point (0,0) with an exponentially decaying amplitude. The gamma kernels with a higher order ($k > 1$) have peaks at the radius $(k-1)/\mu$, creating concentric smooth rings around the pivot point. For a fixed kernel order, the radial distances where the kernels peak are still dependent upon the parameter $\mu$, as in the 1-D case. The 2-D gamma kernels are circularly symmetric, so they lose the property of completeness. Another slight disadvantage is that there is no recursive implementation, so they have to be implemented as 2-D FIR filters. Nevertheless, they provide a very convenient set of functions to estimate the image intensity statistics in the neighborhood of the pivot point. They have been utilized with good results to enhance the constant false alarm rate (CFAR) detector for synthetic aperture radar (SAR) imagery (Kim et al., 1996). The advantage of the gamma kernel is that the parameter $\mu$, which controls the size of the stencil, can be adapted to optimize the discrimination between targets and clutter in the same way as the 1-D counterpart. Figure 16 shows a neural network implementation of the nonlinear quadratic gamma (NL-QGD) detector, which has been utilized effectively in automatic target recognition (Kim, 1996). Once again, one can recognize the 2-D gamma kernel as a preprocessor for an MLP, but this time the application is image processing.
FIGURE 15. The 2-D gamma kernels ($k = 1$, $k = 15$) for different values of the scale parameter $\mu$.
FIGURE 16. Implementation of the NL-QGD and its adaptation.
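A 2-D gamma kernel stencil can be generated directly from its closed form. The sketch below assumes the kernel $\mu^{k+1}/(2\pi k!)\, r^{k-1} e^{-\mu r}$ with $r = \sqrt{n_1^2+n_2^2}$ on the square support $-N \le n_1, n_2 \le N$. The function name and the returned radius grid are illustrative choices.

```python
import math
import numpy as np

def gamma_kernel_2d(k, mu, N):
    """Return (g, r): the 2-D gamma kernel of order k and the radius grid.

    Both arrays have shape (2N + 1, 2N + 1); the pivot point (0, 0) is at
    index (N, N).
    """
    n = np.arange(-N, N + 1)
    n1, n2 = np.meshgrid(n, n, indexing="ij")
    r = np.sqrt(n1 ** 2 + n2 ** 2)
    scale = mu ** (k + 1) / (2.0 * math.pi * math.factorial(k))
    g = scale * r ** (k - 1) * np.exp(-mu * r)
    return g, r
```

Under this assumed form the stencil is circularly symmetric and sums to approximately one when the support is large enough. For $k = 1$ the maximum sits at the pivot point, and for $k > 1$ the maximum lies on a ring of radius $(k-1)/\mu$, for example radius 8 for $k = 5$ and $\mu = 0.5$.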
7
Analog VLSI Implementations of the Gamma Filter
Since it is impossible to design an ideal delay line in continuous-time hardware, many analog designers believe that the best that can be done is to "approximate" the ideal delays using a cascade of low-pass filters. Figure 17 shows such a strategy using a cascade of transconductance amplifiers and capacitors. Indeed, such a technique is shown in Mead (1989). We have studied exactly this structure (cascaded low-pass filters) as memory elements for adaptive filters and neural networks for many years. An irony is that this structure, called the gamma filter, generally outperforms the ideal delay line with the same number of taps, since the former provides a mechanism to let the network choose the most appropriate memory depth/resolution for the task at hand. This is easily done by adapting the memory depth using the output mean square error in training, as we discussed earlier in this chapter.
FIGURE 17. Continuous time 4 tap gamma filter.
We have implemented the delay-line component of the gamma filter exactly as shown in Figure 17. Each stage consists of a transconductance amplifier connected as a follower, with its output driving a capacitor, realizing a first-order low-pass filter with a 3 dB frequency equaling $\tau$. The CMOS transamp is operated within the subthreshold region so that a large dynamic range of $\tau$ can be obtained. For speech processing applications, the necessary dynamic range of $\tau$ is from 100 Hz to 10 kHz, which can be achieved by an exponentially controlled bias voltage. The ideal and measured impulse responses for each tap of the gamma filter are very similar, as is to be expected (Juan et al., 1996). The weight adaptation can be formulated as a parametric least-squares problem that accepts an iterative solution based on the LMS gradient descent method. In order to adapt the weights $w_k$, we use the following continuous-time gradient descent update:
$$\tau_w \frac{d}{dt} w_k(t) = e(t)\, x_k(t), \qquad (38)$$
where $\tau_w$ (time constant of the weight update) $\gg \tau$. This dynamic equation requires basic primitives such as adders, multipliers, and integrators that we have efficiently implemented in analog VLSI. Although gradient descent can still be used to adapt the time-scale parameter $\tau$, the equations are more difficult. They yield a nonconvex optimization problem, and here $\tau$ will be preset.
We have experimented with our continuous-time adaptive gamma filter with this hardware for a simple system ID problem where the goal is to identify the input-output relationship of an unknown linear system. In order to make the gamma filter more resistant to consistent offsets in each stage, we have used the difference between adjacent taps as our input to the multiplier. Since we used a differential-input multiplier, this was a very easy change. The circuitry is designed to minimize the square error between the output of the two filters. The unknown system is a discrete analog circuit designed to implement a standard Sallen-Key low-pass filter. This type of problem is typically solved with adaptive filters, but rarely are continuous-time aspects considered. The input to the system was a pseudo-random bit stream that was filtered to achieve a flat spectrum noise input. The system is set up so that the gamma filter will adapt its weights to minimize the mean square error between the filter output and the output of the unknown plant. Figure 18 shows the output of the filter during and after convergence. The gamma filter is not able to exactly match the Sallen-Key circuit because the two systems have different forms. However, the circuit does a good job in approximating the unknown plant. In Juan et al. (1996) we also present the weight tracks to show convergence with different initial conditions.
FIGURE 18. Desired (solid) and gamma filter output (dashed) at the beginning of adaptation (left) and after convergence (right).
8 Conclusions
This paper provides a review of an important class of neural topologies of intermediate complexity between the purely feedforward nets and the globally recurrent networks. Since the processing elements of locally recurrent networks (LRNs) are a mixture of a nonlinear PE and a linear memory filter, the processing role of the new PE can still be studied at the local level as a projection operation of the PE activity in a linear manifold that we called the memory space. So the global input-output map is a nonlinear
combination of local linear maps. We present a generating function for the memory filters (the generalized feedforward structures) that unifies the conventional connectionist memory structures (the context unit, the time delay, and the gamma memory) and that motivates the research in other kernels that have the potential to outperform the presently studied memories. In this line of research the outstanding problems are:

• The choice of the generating function, which is equivalent to the choice of the basis vectors.
• The size of the memory kernels.
• The adaptation of the recursive parameter that controls the time scale of the memory representations.
• The mapping power of this class of neural networks.

The choice of the generating function is intrinsically related to the problem of finding the best basis to represent a given signal. This is a problem that has no known solution in signal processing. As long as the bases form a complete set and enough taps are utilized, the information contained in the input signal is preserved. However, for practical purposes we would like to minimize the number of taps that are needed to solve our problem with good performance. In the case of the LRNs, the representation of past information is only one of the components for the mapping, since the nonlinear PEs in the net play a determinative role in the overall input-output map and can compensate for a less than optimal representation of the past information. So, in our opinion, the choice of the generating function is important, but it is not as determinative as in other domains. After all, engineers have used Fourier analysis extensively in spite of the fact that complex sinusoids are hardly ever a good model for real-world signals.

Criteria to help choose the size of the memory kernel are still needed. For applications related to signal representation, the view of memory as preserving the information from the past is very appropriate.
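To make the link between memory size and time scale concrete, here is a small sketch of the discrete-time gamma memory in its usual form, x_k(n) = (1 - mu) x_k(n-1) + mu x_{k-1}(n-1) with x_0(n) = u(n). The recursion is assumed to match the one used earlier in the chapter; the function names are illustrative. With mu = 1 the structure reduces to a tapped delay line, with a single tap it is a context-unit-like leaky integrator, and the mean delay of tap K works out to K/mu.

```python
import numpy as np

def gamma_memory_impulse_responses(num_taps, mu, length):
    """Impulse responses of taps 0..num_taps of the gamma memory
    x_k(n) = (1 - mu) * x_k(n-1) + mu * x_{k-1}(n-1), with x_0(n) = u(n)."""
    responses = np.zeros((num_taps + 1, length))
    x = np.zeros(num_taps + 1)                 # tap values at time n-1
    for n in range(length):
        new = np.empty_like(x)
        new[0] = 1.0 if n == 0 else 0.0        # unit impulse at the input tap
        new[1:] = (1.0 - mu) * x[1:] + mu * x[:-1]
        x = new
        responses[:, n] = x
    return responses

def mean_delay(impulse_response):
    """First moment of a (normalized) impulse response, in samples."""
    n = np.arange(len(impulse_response))
    return float((n * impulse_response).sum() / impulse_response.sum())

if __name__ == "__main__":
    # mu = 1: tap k is a pure delay of k samples (tapped delay line).
    print(gamma_memory_impulse_responses(3, 1.0, 6))
    # mu < 1: the same three taps reach further into the past, with less resolution.
    r = gamma_memory_impulse_responses(3, 0.5, 400)
    print("mean delay of tap 3:", mean_delay(r[3]))
```

For a fixed number of taps, lowering mu trades resolution for depth, which is why the number of taps and the time scale have to be chosen together.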
The class of recursive memories has the nice property of allowing the system to find the best compromise between depth and resolution for a given memory order. So they do the best they can with the available number of taps. But this does not mean that the choice of the number of taps is arbitrary. In linear system theory the problem of best model order is solved with statistical criteria such as those of Akaike and Rissanen. These can potentially be applied to set the number of basis functions. But it would be very interesting if new, incrementally adaptive methods were devised to grow the network topologies to best match the incoming signal properties. We believe that the view of memory as the implementation of an embedding operation will be very important in the future. The advantage is that
one can utilize concepts from nonlinear dynamics to help describe signals in a different (probably more relevant) way, such as in terms of the properties of the system that created the time series. We have seen that this view is able to help us develop a criterion to set up the memory size.

The problem of the adaptation of the time scale is genuinely an adaptive system problem. Here we have a parameter that needs to be adapted, but the performance surface has many local minima. Moreover, there are applications such as classification, where the output mean square error is not an appropriate criterion to determine the time scale that best discriminates among a set of classes. So innovative ways to adapt the time scale of LRNs are necessary. Speech recognition is a key area that would benefit from these developments. We showed that the gamma memory has the potential to compensate for time warping. But this means that the recursive parameter must be adapted all the time, during both training and testing.

The universality of MLPs is a strong result that lends credibility to, and supports the continuing research interest in, this topology. We believe that a comparable result is also needed for the class of LRNs, so characterizing the functional mapping produced by LRNs seems a very important topic.

The major issue in analog neural design is how to build systems with enough precision and resolution using components that are fundamentally noisy and imprecise. Not only must we cope with noise in the input signals, but we must deal with noise in the computation itself. These considerations suggest that the Laguerre structure may provide the best low-precision implementation, since the signals at the output taps are less correlated than the corresponding signals in the gamma memory. While analog systems are necessary to interface to the fundamentally analog world, they are limited both in their ability to implement large time constants and in their algorithmic flexibility.
The fact that these two drawbacks of analog systems are exactly the strengths of their digital counterparts suggests that hybrid analog/digital systems will ultimately be necessary. The primary feedforward structures would be built in dedicated analog circuitry, while the mechanisms for updating the parameters and the choice of the learning scheme would be left to a slower digital processor. Using clever feature extraction techniques such as those described in Section 5.3, the slow digital processor could potentially sample the analog outputs at a rate much slower than the Nyquist rate. Such hybrid systems provide a reasonable compromise between the constraints of analog and digital hardware.

There is also an inescapable link to biology that we would like to mention. Leaky integrators are pervasive in the central nervous system, both in the dendritic tree and in the response of the neurons. The analysis conducted in this paper sheds light on the use of delays from a signal processing point of view, as projection operators in a space controlled by the component through the feedback parameter.
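The remark above that Laguerre taps are less correlated than gamma taps can be checked numerically for a white-noise input. The sketch below assumes the standard discrete-time Laguerre network (a first-order low-pass section followed by identical all-pass sections) and the usual gamma recursion; the pole `a = 1 - mu`, the signal length, and all function names are illustrative choices and are not taken from the chapter's circuits.

```python
import numpy as np

def gamma_taps(u, num_taps, mu):
    """Taps 1..num_taps of x_k(n) = (1 - mu) x_k(n-1) + mu x_{k-1}(n-1), x_0 = u."""
    taps = np.zeros((num_taps + 1, len(u)))
    taps[0] = u
    for k in range(1, num_taps + 1):
        for n in range(1, len(u)):
            taps[k, n] = (1.0 - mu) * taps[k, n - 1] + mu * taps[k - 1, n - 1]
    return taps[1:]

def laguerre_taps(u, num_taps, a):
    """Outputs of a discrete Laguerre network with pole a (0 < a < 1)."""
    taps = np.zeros((num_taps, len(u)))
    gain = np.sqrt(1.0 - a * a)
    # First section: low-pass sqrt(1 - a^2) / (1 - a z^-1)
    taps[0, 0] = gain * u[0]
    for n in range(1, len(u)):
        taps[0, n] = a * taps[0, n - 1] + gain * u[n]
    # Remaining sections: all-pass (z^-1 - a) / (1 - a z^-1)
    for k in range(1, num_taps):
        taps[k, 0] = -a * taps[k - 1, 0]
        for n in range(1, len(u)):
            taps[k, n] = (a * taps[k, n - 1]
                          + taps[k - 1, n - 1] - a * taps[k - 1, n])
    return taps

def max_cross_correlation(taps):
    """Largest absolute correlation coefficient between two different taps."""
    c = np.corrcoef(taps)
    off_diagonal = c - np.diag(np.diag(c))
    return float(np.max(np.abs(off_diagonal)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    u = rng.standard_normal(20000)          # white-noise test input
    print("gamma   :", max_cross_correlation(gamma_taps(u, 4, 0.3)))
    print("Laguerre:", max_cross_correlation(laguerre_taps(u, 4, 0.7)))
```

For white input the Laguerre outputs are orthogonal in expectation, so the sample correlations stay near zero, whereas adjacent gamma taps remain strongly correlated. For colored inputs the gap is expected to be smaller.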
As a final note we would like to point out that although all the work developed here on the extension of PEs with short-term memory deals with supervised networks, there are strong reasons to believe that the incorporation of memory in unsupervised nets will allow the extension to time of some of the conventional unsupervised paradigms such as principal component analysis and Kohonen self-organizing nets. We are currently exploring this path.

Acknowledgments: This work was partially supported by NSF grant ECS-9208789 and ARPA/ONR N00014-94-1-0858. The authors also want to acknowledge many generations of graduate students of the Computational Neuroengineering Laboratory, who all contributed to the ideas presented in this chapter.
9 References
Albano, A., Mees, A., Guzman, G., and Rapp, P., "Data requirements for reliable estimation of correlation dimension," in Chaos in Biological Systems, Degn, Holden, and Olsen (Eds.), 207-220, Plenum, New York, 1987.
Back, A., and Tsoi, A., "FIR and IIR synapses, a new neural network architecture for time series modelling," Neural Computation, 3(3), 375-385, 1991.
Back, A. D., and Tsoi, A. C., "A comparison of discrete time operators for nonlinear system identification," in Advances in Neural Information Processing Systems 7, G. Tesauro, D. S. Touretzky, and T. K. Leen (Eds.), 883-890, MIT Press, Cambridge, MA, 1995.
Celebi, S., and Principe, J., "Parametric least squares approximation using gamma bases," IEEE Trans. on Signal Processing, 43(3), 781-784, 1995.
Daubechies, I., Ten Lectures on Wavelets, Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 1992.
De Vries, B., and Principe, J. C., "The gamma model: a new neural model for temporal processing," Neural Networks, 5(4), 565-576, 1992.
De Vries, B., Principe, J., and Oliveira, P., "Adaline with adaptive recursive memory," Proc. 1991 IEEE Workshop Neural Networks in Signal Processing, 101-110, Princeton, NJ, 1991.
Elman, J. L., "Finding structure in time," Cognitive Science, 14, 179-211, 1990.
Fraser, A., and Swinney, H., "Independent coordinates for strange attractors from mutual information," Phys. Rev. A, 33, 1134, 1986.
Grassberger, P., and Procaccia, I., "Measuring the strangeness of strange attractors," Physica 9D, 189-208, 1983.
Haykin, S., Adaptive Filter Theory, Prentice Hall, Englewood Cliffs, NJ, 1991.
Horne, B., and Giles, C. L., "An experimental comparison of recurrent neural networks," Neural Information Processing Systems, NIPS-7, 697-704, 1995.
Jordan, M., "Attractor dynamics and parallelism in a connectionist sequential machine," Proc. 8th Annual Conf. on Cognitive Science, Erlbaum, Hillsdale, NJ, 531-546, 1986.
Juan, J., Harris, J., and Principe, J., "Analog VLSI implementations of continuous-time memory structures," Proc. IEEE Int. Symp. on Circuits and Systems, 338-340, Atlanta, GA, May 1996.
Kautz, W., "Transient synthesis in the time domain," IRE Trans. on Circuit Theory, 1, 29-39, 1954.
Kim, M., "Focus of attention based on Gamma kernels for automatic target recognition," Ph.D. Dissertation, University of Florida, 1996.
Kim, M., Fisher, J., and Principe, J., "A new CFAR stencil for target detection in SAR imagery," Proc. SPIE, 2757, 432-442, 1996.
Kuo, J-M., and Celebi, S., "Adaptation of memory depth in the gamma filter," Proc. ICASSP94, 5, 373-376, Adelaide, Australia, 1994.
Kuo, J-M., and Principe, J., "Using the Poisson filter chain to reconstruct attractors," Proc. SPIE Conf. on Chaos and Nonlinearities, 2037, 59-65, 1993.
Kuo, J-M., and Principe, J., "Noise reduction in state space using the focused gamma model," Proc. ICASSP94, 2, 533-536, 1994a.
Kuo, J-M., and Principe, J., "Reconstructed dynamics and chaotic time series modelling," Proc. IEEE World Congr. on Computational Intelligence (WCCI), 5, 3131-3136, Orlando, FL, 1994b.
Lang, K., Waibel, A., and Hinton, G., "A time delay neural network architecture for isolated word recognition," Neural Networks, 3(1), 23-44, 1990.
Lapedes, A., and Farber, R., "Nonlinear signal processing using neural networks: prediction and system modeling," Tech. Rep. LA-UR-87-2662, Los Alamos National Laboratory, Los Alamos, NM, 1987.
Lefebvre, C., and Principe, J., "Object-oriented artificial neural network implementations," Proc. World Conf. on Neural Networks, IV, 436-439, Portland, OR, 1993.
Mead, C., Analog VLSI and Neural Systems, Addison-Wesley, Reading, MA, 1989.
Motter, M., and Principe, J., "A gamma memory neural network for system identification," Proc. IEEE World Congr. on Computational Intelligence (WCCI), 5, 3232-3237, Orlando, FL, 1994.
Mozer, M., "Neural architectures for temporal sequence processing," in Predicting the Future and Understanding the Past, Weigend and Gershenfeld (Eds.), Addison-Wesley, Reading, MA, 1994.
NeuroSolutions User's Manual, NeuroDimension, Inc., Gainesville, FL, 1994.
Palkar, M., and Principe, J., "Echo cancellation with the gamma filter," Proc. ICASSP94, 3, 369-372, Adelaide, Australia, 1994.
Principe, J., de Vries, B., and Guedes de Oliveira, P., "Generalized feedforward structures: a new class of adaptive filters," Proc. ICASSP 92, IV, 244-248, San Francisco, 1992.
Principe, J., de Vries, B., and Guedes de Oliveira, P., "The gamma filters: a new class of adaptive IIR filters with restricted feedback," IEEE Trans. on Signal Processing, 41(2), 649-656, 1993.
Principe, J., Kuo, J-M., and Celebi, S., "An analysis of short term memory structures in dynamic neural networks," IEEE Trans. on Neural Networks, Special Issue in Dynamic Nets, 5(2), 331-337, 1994.
Principe, J., and Tracey, J., "Isolated word speech recognition using the gamma model," J. Art. Neural Net., 1(14), 481-489, 1994.
Principe, J., et al., "Analysis of short-term memories for neural networks," Advances in Neural Information Processing Systems (NIPS 6), Morgan Kaufmann, 1011-1018, 1994.
Renals, S., Hochberg, M., and Robinson, T., "Learning temporal dependencies in connectionist speech recognition," Neural Inf. Proc. Syst. NIPS 6, Cowan, Tesauro, and Alspector (Eds.), 1051-1058, 1994.
Saha, D. C., and Rao, G. P., "A general algorithm for parameter identification in lumped continuous systems: the Poisson moment functional approach," IEEE Trans. on Automatic Control, 1, 223-225, 1982.
Sastry, P. S., Santharam, G., and Unnikrishnan, K. P., "Memory neuron networks for identification and control of dynamical systems," IEEE Trans. on Neural Networks, 5(2), 306, 1994.
Sauer, T., Yorke, J. A., and Casdagli, M., "Embedology," Journal of Statistical Physics, 65(3/4), 579-616, 1991.
Silva, T., de Oliveira, G., Principe, J. C., and de Vries, B., "Generalized feedforward filters with complex poles," IEEE Workshop on Neural Networks for Signal Processing, 1992.
Takens, F., "Detecting strange attractors in turbulence," Lecture Notes in Mathematics, 898, 365-381, 1981.
Tank, D. W., and Hopfield, J. J., "Neural computation by concentrating information in time," Proceedings of the National Academy of Sciences, 84, 1896-1900, 1987.
Theiler, J., "Estimating the fractal dimension of chaotic time series," Lincoln Lab J., 3(1), 63-85, 1990.
Tsoi, A., and Back, A., "Locally recurrent globally feedforward networks: a critical review of architectures," IEEE Trans. on Neural Networks, 5(2), 229-239, 1994.
Wahlberg, B., "System identification using Laguerre models," IEEE Trans. on Automatic Control, 36(5), 551-562, 1991.
Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., and Lang, K., "Phoneme recognition using time-delay neural networks," IEEE Trans. ASSP, 37(3), 328-339, 1989.
Wan, E., "Time series prediction by using a connectionist network with internal delay lines," in Time Series Prediction: Forecasting the Future and Understanding the Past, Weigend and Gershenfeld (Eds.), 195-217, Addison-Wesley, Reading, MA, 1994.
Werbos, P. J., "Backpropagation through time: what it does and how to do it," Proceedings of the IEEE, 78(10), 1550-1560, 1990.
Whitney, H., "Differentiable manifolds," Ann. Math., 37, 645, 1936.
Wiener, N., Extrapolation, Interpolation, and Smoothing of Stationary Time Series, with Engineering Applications, Wiley, New York, 1949.
Williams, R. J., and Zipser, D., "A learning algorithm for continually running fully recurrent neural networks," Neural Computation, 1, 270-280, 1989.